Assignment Web Crawler

This assignment is about building a web crawler and comprises two parts.
Web Crawler Part a: Write a class called WebCrawler, whose main method takes two arguments. The first argument is a URL, and the second argument is an integer depth. The crawler will read the web page at that URL, count the number of occurrences of each word on the page, and store that information in a database. If the depth is greater than 1, it will also search all pages linked from that page. It will continue crawling recursively until it has searched all pages up to and including the specified depth. You must do the search in a breadth-first manner.

Tasks:
1. Create a table structure for your database, according to the table definition below.
2. Write the WebCrawler class, whose main method populates your database.

Table definition: You must follow this table design in order to make it easy for the TA to test your application. There should be two tables:
Table 1: "url_entries"
| url (VARCHAR) | date (DATE) | crawl_id (INT) |

Table 2: "wordcount"
| url (VARCHAR) | word (VARCHAR) | count (INT) |
Table 1 contains a URL, the date on which that URL was indexed, and the web crawl during which it was indexed.

Details (to be catered for):
1. The web crawler should never search a page twice in the same crawl.
2. If it searches a page that was searched on a previous crawl, the new results should completely replace the old results, including occurrences of words that used to be on the page and are no longer there.
3. When processing links:
   1. Do follow absolute URLs and relative URLs.
   2. Do not try to follow https:// URLs.
4. You must parse the web pages:
   1. Don't count occurrences of words inside tags.
   2. Find all the links so that you can go to the next search depth.
   3. You can use regular expressions or a simple state machine to parse the pages.
   4. You must handle even the tricky cases:
      1. <a name="..." href="...">
      2. <a href="..."> href </a>
      3. <a href="href=...">
      4. frames (follow the src links)
   5. For any other tricky cases, ask if you are not sure whether you should handle them. Of course you cannot handle everything, since you are writing a simple WebCrawler, not Google.
   6. Assume HTML tags to be well-formed and to have closing tags, and that no links will be nested.
   Your parsers should handle web pages that are in XHTML style. See the following link regarding HTML and XHTML: Differences with HTML4
5. The following characters should be counted as word delimiters: ~,.!?;:\t\n\r\f'`\"()[]{}|\\/=-_& (Notice that some of these are quoted with a backslash, like tab, newline, and backslash itself.)
6. All error handling cases should print a short message explaining the problem, and then continue crawling the web if possible:
   1. missing or invalid arguments passed to the main method
   2. invalid link found on a web page
   3. page not found
   4. anything else that might reasonably occur, given the architecture of your program
7. At the end of the program, the crawler should print five short messages. (Notice that the sum of the last four should equal the first number.)
   1. How many URLs were examined?
   2. How many web pages were successfully crawled?
   3. How many links were inaccessible?
   4. How many links were malformed?
   5. How many links the crawler chose not to follow because it had already searched that page via a different link.
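The tag-skipping and word-delimiter rules above can be illustrated with a minimal sketch. It is only one possible approach, not a required design: the class and method names are hypothetical, the tag skipper is the "simple state machine" option and assumes no '>' appears inside an attribute value, and treating the space character as a delimiter is an added assumption (it is not in the list as printed).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {
    // Delimiters copied from the assignment text.
    // ASSUMPTION: the leading space character is added here; it is not in the printed list.
    static final String DELIMITERS = " ~,.!?;:\t\n\r\f'`\"()[]{}|\\/=-_&";

    // Two-state machine: characters between '<' and '>' are dropped.
    // Each tag is replaced by a space so that words on either side stay separate.
    static String stripTags(String html) {
        StringBuilder text = new StringBuilder();
        boolean insideTag = false;
        for (char c : html.toCharArray()) {
            if (c == '<') {
                insideTag = true;
                text.append(' ');
            } else if (c == '>') {
                insideTag = false;
            } else if (!insideTag) {
                text.append(c);
            }
        }
        return text.toString();
    }

    // Counts each word outside tags; words are kept case-sensitive as found.
    static Map<String, Integer> countWords(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(stripTags(html), DELIMITERS);
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Covers the tricky case where the text "href" appears both inside and outside a tag.
        String page = "<a href=\"href=x.html\">href</a> Islamabad, Pakistan; Islamabad!";
        System.out.println(countWords(page)); // prints {href=1, Islamabad=2, Pakistan=1}
    }
}
```

Only the link text "href" is counted; the two occurrences inside the tag are skipped because the state machine is in its inside-tag state when it reads them.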
Recommended Suggestion:
You may, if you have time and feel like it, write a GUI for the WebCrawler. This might show how many pages it has completed out of how many links it has found so far at each level of depth, and keep a running total of pages searched, inaccessible pages, and duplicate pages. If you feel comfortable writing this GUI, it will help you to debug the WebCrawler.
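The breadth-first crawl and the five end-of-run counters from Part a could be organised roughly as in the sketch below. It is a sketch under stated assumptions, not the expected solution: the class name, the in-memory `web` map (a stand-in for real HTTP fetching and parsing), the "must start with http://" test for malformed links, and the choice to drop https:// links before they are examined are all hypothetical simplifications.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class BfsCrawlSketch {
    int examined, crawled, inaccessible, malformed, duplicates;

    // 'web' is a hypothetical stand-in for the network: url -> links found on that page.
    // A url missing from the map plays the role of "page not found".
    void crawl(String startUrl, int maxDepth, Map<String, List<String>> web) {
        Set<String> visited = new HashSet<>();
        Queue<String> currentLevel = new ArrayDeque<>();
        currentLevel.add(startUrl);
        // Breadth-first: finish every url at one depth before moving to the next.
        for (int depth = 1; depth <= maxDepth && !currentLevel.isEmpty(); depth++) {
            Queue<String> nextLevel = new ArrayDeque<>();
            for (String url : currentLevel) {
                examined++;
                if (!url.startsWith("http://")) { malformed++; continue; } // simplified validity check
                if (!visited.add(url)) { duplicates++; continue; }          // already seen in this crawl
                List<String> links = web.get(url);
                if (links == null) { inaccessible++; continue; }
                crawled++; // word counting and database writes would happen here
                for (String link : links) {
                    // ASSUMPTION: https links are skipped silently and never counted as examined.
                    if (!link.startsWith("https://")) nextLevel.add(link);
                }
            }
            currentLevel = nextLevel;
        }
    }

    static Map<String, List<String>> sampleWeb() {
        return Map.of(
            "http://a", List.of("http://b", "http://c", "http://a", "bad link", "https://secure"),
            "http://b", List.of("http://c", "http://missing"),
            "http://c", List.of());
    }

    public static void main(String[] args) {
        BfsCrawlSketch c = new BfsCrawlSketch();
        c.crawl("http://a", 3, sampleWeb());
        System.out.println("URLs examined: " + c.examined);         // 7
        System.out.println("Pages crawled: " + c.crawled);           // 3
        System.out.println("Inaccessible links: " + c.inaccessible); // 1
        System.out.println("Malformed links: " + c.malformed);       // 1
        System.out.println("Duplicate links skipped: " + c.duplicates); // 2
    }
}
```

Because every examined URL lands in exactly one of the four remaining counters, the last four numbers always sum to the first, as the assignment requires.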
================
WebCrawler Part b (Search Engine): Write a web application to search the database that you populated in part a.

Tasks:
1. Write a Servlet (named WebCrawlerServlet) to be the controller layer for the web app.
2. Write some helper classes for the Servlet to be the model layer for the web app.
3. Write an index.jsp page to be the view layer for the web app. (More details below.)
4. Set up a Tomcat web server to run your web app, as described in the presentation uploaded to Slate. (You may even Google it!)

Proposed design: You must create a design that follows the MVC design pattern. If you choose not to follow the proposed design (see image below), you should justify your design convincingly. It is important not to put control layer features into the view layer.
Details (to be catered for):
1. The same JSP file (index.jsp) should both allow the user to search and show them the results after they have done a search.
2. The web app will interpret the text typed into the search field as space-separated words to search for. We will only type in words and spaces, no other funny characters.
3. The search will be an OR of the words typed in. So if I type in "Islamabad Pakistan", I expect the results to show all the pages that have instances of the word "Islamabad", instances of the word "Pakistan", or both.
4. The search results will be ordered by the sum total of instances of the words searched for. For example, if I searched for "Islamabad Pakistan", and page A has 12 copies of the word "Islamabad", page B has 5 copies of the word "Pakistan", and page C has 7 copies of the word "Islamabad" and also 9 copies of the word "Pakistan", the page results should be in the order C, A, B.
5. The search results must be returned to the JSP (from the control layer) in the form of an XML Document (org.jdom.Document).
   1. You must use the org.jdom packages (the ones provided in the attached jar file); the API documentation can be downloaded.
   2. Your XML Document must match the .dtd spec provided above.
   3. Let the Servlet save the Document as a session attribute called "XMLDocument".
6. Pagination: The search results should only show 10 results per page. (See more details below.)
7. Components of your JSP that are always there (place them in this order on the page, top to bottom):
   1. Search form (text prompt, text field, and "Do Search" button)
   2. Status line (one of the following three possibilities):
      1. "Please do a search." (If they have not yet searched in this session.)
      2. "No results found for your search. Please do another search."
      3. "Results found: [fill in the number of results found here]"
8. Components of your JSP that are there if non-empty results have been found (place them in this order on the page, top to bottom, below the components that are always there):
   1. Pages list: numbered links to each page of these results. For example, if you have 52 URLs found in the search, and you're currently on page 3 of the results, your page list should look like this: "1 2 3 4 5 6". Clicking on 1 would show you results 1-10, and so on; clicking on 6 would show you results 51 and 52.
   2. Current page status: "Page [fill in the current page number] shows results [fill in the number of the first result shown] to [fill in the number of the last result shown]."
   3. 10 results, separated by "<hr>"s. Each result should have 2 lines:
      1. first line example: "1 - http://www.tourism.gov.pk"
      2. second line example: "(Islamabad, 7) (Pakistan, 2)"
      This means that the first result from your search is the tourism URL shown above, and that the page at that URL has 7 repetitions of the word "Islamabad" and 2 repetitions of the word "Pakistan". The "1 -" at the beginning of the first line means that this is the first result (that is, it is the result at index 0). Clicking the URL should take you to the page.

Bonus Part: Handling the SQL-injection threat
Whenever you create a computer system there are several security issues to handle. This is especially true when dealing with web applications that send SQL queries to an RDBMS. The power of SQL becomes its weakness if an intruder succeeds in sending their own malicious SQL statement to our database. Although we will not test your solution with malicious input, we still encourage you to implement basic protection against SQL injection, and we consider it a bonus. Before inserting user input into an SQL statement you should at least:
1. Remove single apostrophes, or replace them with double apostrophes.
2. Remove semicolons (;) and double dashes (--).
There are other measures you should probably also consider, but the ones above are the basic protection that everyone should consider whenever creating a database-driven web app. If you do a search on "SQL injection" you will find plenty of reading on this issue, and that will lead you to the bonus.
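The OR search, the ordering by summed word counts, and the 10-per-page arithmetic described in the details above can be checked with a small model-layer sketch. The class name, the nested in-memory map (a hypothetical copy of the wordcount table), and the page labels "A", "B", "C" are assumptions made for illustration; a real solution would query the database instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SearchRankingSketch {
    static final int RESULTS_PER_PAGE = 10;

    // counts: url -> (word -> count), a hypothetical in-memory copy of the wordcount table.
    // Returns every url containing at least one query word, highest summed count first.
    static List<String> rank(Map<String, Map<String, Integer>> counts, String query) {
        // A word typed twice is only counted once.
        Set<String> words = new LinkedHashSet<>(Arrays.asList(query.trim().split("\\s+")));
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String word : words) {
            for (Map.Entry<String, Map<String, Integer>> page : counts.entrySet()) {
                Integer n = page.getValue().get(word);
                if (n != null) totals.merge(page.getKey(), n, Integer::sum);
            }
        }
        List<String> urls = new ArrayList<>(totals.keySet());
        urls.sort(Comparator.comparing((String u) -> totals.get(u)).reversed());
        return urls;
    }

    // Number of result pages, rounding up.
    static int pageCount(int totalResults) {
        return (totalResults + RESULTS_PER_PAGE - 1) / RESULTS_PER_PAGE;
    }

    // 1-based number of the first result shown on a 1-based page.
    static int firstResult(int page) {
        return (page - 1) * RESULTS_PER_PAGE + 1;
    }

    // 1-based number of the last result shown on a 1-based page.
    static int lastResult(int page, int totalResults) {
        return Math.min(page * RESULTS_PER_PAGE, totalResults);
    }

    static Map<String, Map<String, Integer>> sample() {
        return Map.of(
            "A", Map.of("Islamabad", 12),
            "B", Map.of("Pakistan", 5),
            "C", Map.of("Islamabad", 7, "Pakistan", 9));
    }

    public static void main(String[] args) {
        System.out.println(rank(sample(), " Islamabad Pakistan ")); // prints [C, A, B]
        System.out.println(pageCount(52));                          // prints 6
        System.out.println("Page 6 shows results " + firstResult(6)
                + " to " + lastResult(6, 52));                      // 51 to 52
    }
}
```

Page C ranks first because 7 + 9 = 16 exceeds 12 and 5, which matches the C, A, B ordering in the example, and 52 results yield the six page links "1 2 3 4 5 6".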
Assignment Given Date: 15 March 2011. Due Date: 28th March 2011 At 5:00 PM.
