Regular Expression
Course homepage:
http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
Tutors:
Xin Xin
Email: xxin@cse.cuhk.edu.hk
Venue: Room 101
Tom (me)
Email: czhou@cse.cuhk.edu.hk
Venue: Room 114A
Features of a crawler
Must provide:
Robustness: spider traps
Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
Pages filled with a very large number of characters.
Politeness: which pages can be crawled, and which cannot
robots exclusion protocol: robots.txt
http://blog.sohu.com/robots.txt
User-agent: *
Disallow: /manage/
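A crawler can check these rules before fetching a page. A minimal sketch using Python's standard urllib.robotparser (the specific page paths below are made up for illustration):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://blog.sohu.com/robots.txt")
rp.read()   # download and parse the robots.txt file

# can_fetch() applies the User-agent / Disallow rules shown above
print(rp.can_fetch("*", "http://blog.sohu.com/manage/settings"))   # False: /manage/ is disallowed
print(rp.can_fetch("*", "http://blog.sohu.com/index.html"))        # True: not disallowed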
Features of a crawler (Cont’d)
Should provide:
Distributed
Scalable
Performance and efficiency
Quality
Freshness
Extensible
Architecture of a crawler
[Figure: crawler architecture. URLs from the URL Frontier are fetched from the www (using DNS resolution) and parsed; extracted content and URLs then pass through the Content Seen?, URL Filter, and Dup URL Elim modules before new URLs re-enter the URL Frontier. Supporting stores: Doc Fingerprints, Robots templates, URL set.]
Architecture of a crawler (Cont’d)
URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a
seed set is stored in the URL Frontier, and the crawler begins by taking a URL from
the seed set.
DNS: domain name service resolution. Looks up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the page at the URL.
Parse: the fetched page is parsed. Its text (and images, videos, etc.) and links are
extracted.
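To make these steps concrete, here is a minimal fetch-and-parse sketch in Python using only the standard library (the seed URL is the Wikipedia example from the next slide; a real crawler would add error handling, politeness, and the later checks):

import socket
import urllib.request
from urllib.parse import urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collect the href values of <a> tags while the page is parsed
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

seed = "http://en.wikipedia.org/wiki/Main_Page"

# DNS: resolve the host name to an IP address
print(socket.gethostbyname(urlparse(seed).hostname))

# Fetch: retrieve the page over HTTP
with urllib.request.urlopen(seed) as response:
    html = response.read().decode("utf-8", errors="replace")

# Parse: extract the links (text extraction is omitted here)
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links[:5])   # relative URLs such as /wiki/Wikipedia:General_disclaimer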
Architecture of a crawler (Cont’d)
Content Seen?: tests whether a web page with the same content has already been seen
at another URL. This requires a way to compute a fingerprint of a web page.
URL Filter:
Decides whether the extracted URL should be excluded from the frontier (e.g., by robots.txt).
URLs should be normalized (relative URLs resolved against the base page). For example, on
en.wikipedia.org/wiki/Main_Page the link
<a href="/wiki/Wikipedia:General_disclaimer" title="Wikipedia:General
disclaimer">Disclaimers</a>
points to a relative URL.
Dup URL Elim: the URL is checked so that duplicates are eliminated rather than added to the frontier again.
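A rough sketch of these three modules, assuming an exact-duplicate fingerprint based on a SHA-1 hash (production crawlers typically use shingling or similar techniques to also catch near-duplicates):

import hashlib
from urllib.parse import urljoin

doc_fingerprints = set()   # fingerprints of page contents already seen
url_set = set()            # URLs already added to the frontier

def content_seen(page_text):
    # Content Seen?: hash the page text and look for an exact duplicate
    fingerprint = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if fingerprint in doc_fingerprints:
        return True
    doc_fingerprints.add(fingerprint)
    return False

def normalize(base_url, href):
    # URL Filter: resolve a relative link against the page it was found on
    return urljoin(base_url, href)

def dup_url_elim(url, frontier):
    # Dup URL Elim: only add a URL to the frontier once
    if url not in url_set:
        url_set.add(url)
        frontier.append(url)

# The Wikipedia link above resolves to an absolute URL:
print(normalize("http://en.wikipedia.org/wiki/Main_Page",
                "/wiki/Wikipedia:General_disclaimer"))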
Architecture of a crawler (Cont’d)
Other issues:
Housekeeping tasks:
Log crawl progress statistics: URLs crawled, frontier size, etc.
(Every few seconds)
Checkpointing: a snapshot of the crawler’s state (the URL frontier) is
committed to disk. (Every few hours)
Priority of URLs in URL frontier:
Change rate.
Quality.
Politeness:
Avoid repeated fetch requests to a host within a short time span.
Otherwise the crawler may be blocked by the host.
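A minimal sketch of per-host politeness, assuming a fixed delay between requests to the same host (many crawlers instead adapt the delay to the host's observed response time):

import time
from urllib.parse import urlparse

MIN_DELAY = 2.0        # assumed minimum gap in seconds between requests to one host
last_request = {}      # host -> time of the most recent request

def polite_fetch(url, fetch):
    # Wait, if necessary, so the same host is not hit twice within MIN_DELAY
    host = urlparse(url).hostname
    wait = MIN_DELAY - (time.time() - last_request.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    last_request[host] = time.time()
    return fetch(url)  # fetch is any function that downloads the page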
Regular Expression
Usage:
Regular expressions provide a concise and flexible means for identifying
strings of text of interest, such as particular characters, words or patterns
of characters.
Today’s target:
Introduce the basic principles.
Metacharacter
Similar to the wildcard in Windows, e.g.: *.doc
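A few basic examples using Python's re module (the sample text reuses the tutor e-mail addresses above; the patterns are only for illustration):

import re

text = "Contact xxin@cse.cuhk.edu.hk or czhou@cse.cuhk.edu.hk"

# \w matches a word character, + means "one or more", [] is a character class
print(re.findall(r"[\w.]+@[\w.]+", text))
# -> ['xxin@cse.cuhk.edu.hk', 'czhou@cse.cuhk.edu.hk']

# . matches any single character, * means "zero or more",
# ^ and $ anchor the pattern, so ^.*\.doc$ behaves like the wildcard *.doc
print(bool(re.match(r"^.*\.doc$", "report.doc")))    # True
print(bool(re.match(r"^.*\.doc$", "report.docx")))   # False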
Mercator Crawler:
http://mias.uiuc.edu/files/tutorials/mercator.pdf
Regular Expression tutorial:
http://www.regular-expressions.info/tutorial.html
Questions?