
Introduction to Web Crawling and

Regular Expression

CSC4170 Web Intelligence and Social Computing


Tutorial 1

Tutor: Tom Chao Zhou


Email: czhou@cse.cuhk.edu.hk
Outline

 Course & Tutors Information


 Introduction to Web Crawling
 Utilities of a crawler
 Features of a crawler
 Architecture of a crawler
 Introduction to Regular Expression
 Appendix
Course and Tutors Information

 Course homepage:
 http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
 Tutors:
 Xin Xin
 Email: xxin@cse.cuhk.edu.hk
 Venue: Room 101
 Tom (me)
 Email: czhou@cse.cuhk.edu.hk
 Venue: Room 114A
Utilities of a crawler

 A Web crawler is also known as a spider.


 Definition:
 A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner. (Wikipedia)
 Utilities:
 Gather pages from the Web.
 Support a search engine, perform data mining and so on.
 Objects collected:
 Text, video, images, and so on.
 Link structure.
Features of a crawler

 Must provide:
 Robustness: resilience to spider traps, for example:
 Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
 Pages filled with a very large number of characters.
 Politeness: which pages can be crawled, and which cannot
 robots exclusion protocol: robots.txt (a minimal sketch follows below)
 http://blog.sohu.com/robots.txt
 User-agent: *
 Disallow: /manage/
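
A minimal sketch of how a crawler could honour robots.txt, using Python's standard urllib.robotparser module. The Sohu URL is the example from this slide; the two paths passed to can_fetch are made up purely for illustration.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# Paths under "Disallow: /manage/" are off limits to every user agent ("*")
print(rp.can_fetch("*", "http://blog.sohu.com/manage/settings"))    # False
print(rp.can_fetch("*", "http://blog.sohu.com/some/article.html"))  # True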
Features of a crawler (Cont’d)

 Should provide:
 Distributed
 Scalable
 Performance and efficiency
 Quality
 Freshness
 Extensible
Architecture of a crawler
[Figure: crawler architecture. The URL Frontier feeds the Fetch module (which resolves hostnames via DNS and contacts the www); fetched pages go through Parse, Content Seen? (backed by a Doc Fingerprint store), URL Filter (backed by Robots templates), and Dup URL Elim (backed by the URL set), and surviving URLs are added back to the URL Frontier.]
Architecture of a crawler (Cont’d)
[Figure: crawler architecture, repeated from the previous slide.]

URL Frontier: contains the URLs yet to be fetched in the current crawl. Initially it holds a seed set of URLs, and the crawler begins by taking a URL from this seed set.
DNS: Domain Name System resolution. Looks up the IP address for a domain name.
Fetch: the page at the URL is fetched, generally over HTTP.
Parse: the fetched page is parsed. Text (and other content such as images and videos) and links are extracted.
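
A minimal sketch of the frontier/fetch/parse loop described above, assuming the third-party requests library and a crude regex-based link extractor; the seed URL is a placeholder. Duplicate elimination and politeness are deliberately left out here, since they are covered on the following slides.

from collections import deque
from urllib.parse import urljoin
import re
import requests

frontier = deque(["http://example.com/"])       # URL Frontier, initialised with a seed set

while frontier:
    url = frontier.popleft()                    # take a URL from the frontier
    page = requests.get(url, timeout=10).text   # DNS lookup + HTTP fetch
    # Parse: extract links and resolve relative URLs against the current page
    for href in re.findall(r'href="([^"]+)"', page):
        frontier.append(urljoin(url, href))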
Architecture of a crawler (Cont’d)
[Figure: crawler architecture, repeated from the previous slide.]

Content Seen?: tests whether a page with the same content has already been seen at another URL. This requires computing a fingerprint of each web page.
URL Filter:
Decides whether an extracted URL should be excluded from the frontier (e.g., because of robots.txt).
URLs should be normalized (relative URLs resolved against the page they appear on), e.g. on en.wikipedia.org/wiki/Main_Page:
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General disclaimer">Disclaimers</a>
Dup URL Elim: each URL is checked against the URLs already seen, and duplicates are eliminated.
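
A sketch of the Content Seen? test and the URL handling just described. A production crawler would use near-duplicate fingerprints such as shingles or simhash; the exact MD5 hash below is a simplification to keep the example short, and the plain Python sets stand in for the Doc Fingerprint store and the URL set.

import hashlib
from urllib.parse import urljoin

doc_fingerprints = set()   # fingerprints of page contents already seen
url_set = set()            # URLs already sent to the frontier

def content_seen(html):
    """Content Seen?: exact-hash fingerprint (a simplification)."""
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in doc_fingerprints:
        return True
    doc_fingerprints.add(fp)
    return False

# URL normalization: resolve the relative link from the Wikipedia example
base = "http://en.wikipedia.org/wiki/Main_Page"
print(urljoin(base, "/wiki/Wikipedia:General_disclaimer"))
# -> http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer

def dup_url_elim(url, frontier):
    """Dup URL Elim: only add URLs that have not been seen before."""
    if url not in url_set:
        url_set.add(url)
        frontier.append(url)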
Architecture of a crawler (Cont’d)

 Other issues:
 Housekeeping tasks:
 Log crawl progress statistics: URLs crawled, frontier size, etc.
(Every few seconds)
 Checkpointing: a snapshot of the crawler’s state (the URL frontier) is
committed to disk. (Every few hours)
 Priority of URLs in URL frontier:
 Change rate.
 Quality.
 Politeness:
 Avoid repeated fetch requests to a host within a short time span (a minimal sketch follows this list).
 Otherwise the crawler may be blocked by the host.
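
A minimal sketch of per-host politeness, assuming a fixed minimum delay between requests to the same host; the two-second value is an arbitrary choice for illustration.

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0    # assumed minimum gap (in seconds) between requests to one host
last_fetch = {}    # host -> time of the most recent request to that host

def wait_politely(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch.get(host, 0.0)
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)   # wait out the remaining delay
    last_fetch[host] = time.time()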
Regular Expression

 Usage:
 Regular expressions provide a concise and flexible means for identifying
strings of text of interest, such as particular characters, words or patterns
of characters.

 Today’s target:
 Introduce the basic principle.

 A tool for testing regular expressions: Regex Tester


 http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx
Regular Expression

 Metacharacters
 Similar to wildcards in Windows, e.g. *.doc

 Target: Detect the email address


Regular Expression

 \b: matches a word boundary (the beginning or end of a word).

 E.g.: \bhi\b matches the word hi exactly (but not the hi inside history).
 \w: matches a letter, digit, or underscore.
 .: matches any character except a newline.
 *: the preceding element may be repeated any number of times (including zero).
 E.g.: \bhi\b.*\bLucy\b
 +: the preceding element must appear one or more times.
 []: matches any one of the characters inside the brackets.
 E.g.: \b[aeiou]+[a-zA-Z]*\b
 {n}: repeat exactly n times
 {n,}: repeat n or more times
 {n,m}: repeat n to m times
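
A few of the patterns above tried out with Python's re module; the sample strings are made up for illustration.

import re

# \bhi\b matches the word hi, but not the hi inside "history"
print(re.findall(r"\bhi\b", "hi, this is history"))                # ['hi']

# \bhi\b.*\bLucy\b: hi, then anything, then Lucy
print(bool(re.search(r"\bhi\b.*\bLucy\b", "hi there Lucy")))       # True

# \b[aeiou]+[a-zA-Z]*\b: words that start with one or more vowels
print(re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an apple on a tree"))  # ['an', 'apple', 'on', 'a']

# {n}: exact repetition counts, e.g. exactly four digits
print(re.findall(r"\b\d{4}\b", "room 114, year 2009"))             # ['2009']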
Regular Expression

 Target: Detect the email address


 Specifications:
 A@B
 A: a combination of English letters a to z, digits, or the characters . _ % + -
 B: cse.cuhk.edu.hk or cuhk.edu.hk (lowercase English letters and dots)
 Answer:
 \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
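
The answer checked against the two email addresses that appear on these slides, plus one negative case, using Python's re module.

import re

pattern = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

for address in ["czhou@cse.cuhk.edu.hk", "xxin@cse.cuhk.edu.hk", "not-an-email"]:
    print(address, "->", bool(pattern.search(address)))
# czhou@cse.cuhk.edu.hk -> True
# xxin@cse.cuhk.edu.hk -> True
# not-an-email -> False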
Appendix

 Mercator Crawler:
 http://mias.uiuc.edu/files/tutorials/mercator.pdf
 Regular Expression tutorial:
 http://www.regular-expressions.info/tutorial.html
 Questions?
