
Introduction to Web Crawling and

Regular Expression

CSC4170 Web Intelligence and Social Computing


Tutorial 1

Tutor: Tom Chao Zhou


Email: czhou@cse.cuhk.edu.hk
Outline

 Course & Tutors Information


 Introduction to Web Crawling
 Utilities of a crawler
 Features of a crawler
 Architecture of a crawler
 Introduction to Regular Expression
 Appendix
Course and Tutors Information

 Course homepage:
 http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
 Tutors:
 Xin Xin
 Email: xxin@cse.cuhk.edu.hk
 Venue: Room 101
 Tom (me)
 Email: czhou@cse.cuhk.edu.hk
 Venue: Room 114A
Utilities of a crawler

 A Web crawler is also known as a spider.


 Definition:
 A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner. (Wikipedia)
 Utilities:
 Gather pages from the Web.
 Support a search engine, perform data mining and so on.
 Objects collected:
 Text, video, images, and so on.
 Link structure.
Features of a crawler

 Must provide:
 Robustness: resilience to spider traps, for example:
 Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
 Pages filled with a very large number of characters.
 Politeness: which pages can be crawled, and which cannot
 robots exclusion protocol: robots.txt (a minimal sketch follows below)
 http://blog.sohu.com/robots.txt
 User-agent: *
 Disallow: /manage/
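
A minimal sketch of how a crawler could honour robots.txt, using Python's standard urllib.robotparser module. The Sohu URL is the example from this slide; the two paths passed to can_fetch are made up purely for illustration.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# Paths under "Disallow: /manage/" are off limits to every user agent ("*")
print(rp.can_fetch("*", "http://blog.sohu.com/manage/settings"))    # False
print(rp.can_fetch("*", "http://blog.sohu.com/some/article.html"))  # True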
Features of a crawler (Cont’d)

 Should provide:
 Distributed
 Scalable
 Performance and efficiency
 Quality
 Freshness
 Extensible
Architecture of a crawler
[Figure: crawler architecture. The URL Frontier feeds the Fetch module (which resolves hostnames via DNS and contacts the www); fetched pages go through Parse, Content Seen? (backed by a Doc Fingerprint store), URL Filter (backed by Robots templates), and Dup URL Elim (backed by the URL set), and surviving URLs are added back to the URL Frontier.]
Architecture of a crawler (Cont’d)
[Figure: crawler architecture, repeated from the previous slide.]

URL Frontier: contains the URLs yet to be fetched in the current crawl. Initially it holds a seed set of URLs, and the crawler begins by taking a URL from this seed set.
DNS: Domain Name System resolution. Looks up the IP address for a domain name.
Fetch: the page at the URL is fetched, generally over HTTP.
Parse: the fetched page is parsed. Text (and other content such as images and videos) and links are extracted.
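
A minimal sketch of the frontier/fetch/parse loop described above, assuming the third-party requests library and a crude regex-based link extractor; the seed URL is a placeholder. Duplicate elimination and politeness are deliberately left out here, since they are covered on the following slides.

from collections import deque
from urllib.parse import urljoin
import re
import requests

frontier = deque(["http://example.com/"])       # URL Frontier, initialised with a seed set

while frontier:
    url = frontier.popleft()                    # take a URL from the frontier
    page = requests.get(url, timeout=10).text   # DNS lookup + HTTP fetch
    # Parse: extract links and resolve relative URLs against the current page
    for href in re.findall(r'href="([^"]+)"', page):
        frontier.append(urljoin(url, href))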
Architecture of a crawler (Cont’d)
[Figure: crawler architecture, repeated from the previous slide.]

Content Seen?: tests whether a page with the same content has already been seen at another URL. This requires computing a fingerprint of each web page.
URL Filter:
Decides whether an extracted URL should be excluded from the frontier (e.g., because of robots.txt).
URLs should be normalized (relative URLs resolved against the page they appear on), e.g. on en.wikipedia.org/wiki/Main_Page:
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
Dup URL Elim: each URL is checked against the URLs already seen, and duplicates are eliminated.
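
A sketch of the Content Seen? test and the URL handling just described. A production crawler would use near-duplicate fingerprints such as shingles or simhash; the exact MD5 hash below is a simplification to keep the example short, and the plain Python sets stand in for the Doc Fingerprint store and the URL set.

import hashlib
from urllib.parse import urljoin

doc_fingerprints = set()   # fingerprints of page contents already seen
url_set = set()            # URLs already sent to the frontier

def content_seen(html):
    """Content Seen?: exact-hash fingerprint (a simplification)."""
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in doc_fingerprints:
        return True
    doc_fingerprints.add(fp)
    return False

# URL normalization: resolve the relative link from the Wikipedia example
base = "http://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer

def dup_url_elim(url, frontier):
    """Dup URL Elim: only add URLs that have not been seen before."""
    if url not in url_set:
        url_set.add(url)
        frontier.append(url)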
Architecture of a crawler (Cont’d)

 Other issues:
 Housekeeping tasks:
 Log crawl progress statistics: URLs crawled, frontier size, etc.
(Every few seconds)
 Checkpointing: a snapshot of the crawler’s state (the URL frontier) is
committed to disk. (Every few hours)
 Priority of URLs in URL frontier:
 Change rate.
 Quality.
 Politeness:
 Avoid repeated fetch requests to a host within a short time span (a minimal sketch follows this list).
 Otherwise the crawler may be blocked by the host.
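
A minimal sketch of per-host politeness, assuming a fixed minimum delay between requests to the same host; the two-second value is an arbitrary choice for illustration.

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0    # assumed minimum gap (in seconds) between requests to one host
last_fetch = {}    # host -> time of the most recent request to that host

def wait_politely(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)   # wait out the remaining delay
    last_fetch[host] = time.time()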
Regular Expression

 Usage:
 Regular expressions provide a concise and flexible means for identifying
strings of text of interest, such as particular characters, words or patterns
of characters.

 Today’s target:
 Introduce the basic principle.

 A tool for testing regular expressions: Regex Tester


 http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx
Regular Expression

 Metacharacters
 Similar to wildcards in Windows, e.g. *.doc

 Target: Detect the email address


Regular Expression

 \b: matches a word boundary (the beginning or end of a word).

 E.g.: \bhi\b matches the word hi exactly (but not the hi inside history).
 \w: matches a letter, digit, or underscore.
 .: matches any character except a newline.
 *: the preceding element may be repeated any number of times (including zero).
 E.g.: \bhi\b.*\bLucy\b
 +: the preceding element must appear one or more times.
 []: matches any one of the characters inside the brackets.
 E.g.: \b[aeiou]+[a-zA-Z]*\b
 {n}: repeat exactly n times
 {n,}: repeat n or more times
 {n,m}: repeat n to m times
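
A few of the patterns above tried out with Python's re module; the sample strings are made up for illustration.

import re

# \bhi\b matches the word hi, but not the hi inside "history"
print(re.findall(r"\bhi\b", "hi, this is history"))                # ['hi']

# \bhi\b.*\bLucy\b: hi, then anything, then Lucy
print(bool(re.search(r"\bhi\b.*\bLucy\b", "hi there Lucy")))       # True

# \b[aeiou]+[a-zA-Z]*\b: words that start with one or more vowels
print(re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an apple on a tree"))  # ['an', 'apple', 'on', 'a']

# {n}: exact repetition counts, e.g. exactly four digits
print(re.findall(r"\b\d{4}\b", "room 114, year 2009"))             # ['2009']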
Regular Expression

 Target: Detect the email address


 Specifications:
 A@B
 A: a combination of English letters a to z, digits, or the characters . _ % + -
 B: cse.cuhk.edu.hk or cuhk.edu.hk (lowercase English letters and dots)
 Answer:
 \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
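
The answer checked against the two email addresses that appear on these slides, plus one negative case, using Python's re module.

import re

pattern = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

for address in ["czhou@cse.cuhk.edu.hk", "xxin@cse.cuhk.edu.hk", "not-an-email"]:
    print(address, "->", bool(pattern.search(address)))
# czhou@cse.cuhk.edu.hk -> True
# xxin@cse.cuhk.edu.hk -> True
# not-an-email -> False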
Appendix

 Mercator Crawler:
 http://mias.uiuc.edu/files/tutorials/mercator.pdf
 Regular Expression tutorial:
 http://www.regular-expressions.info/tutorial.html
 Questions?
