Mining the Web: Discovering Knowledge from Hypertext Data

About this ebook

Mining the Web: Discovering Knowledge from Hypertext Data is the first book devoted entirely to techniques for producing knowledge from the vast body of unstructured Web data. Building on an initial survey of infrastructural issues, including Web crawling and indexing, Chakrabarti examines low-level machine learning techniques as they relate specifically to the challenges of Web mining. He then devotes the final part of the book to applications that unite infrastructure and analysis to bring machine learning to bear on systematically acquired and stored data. Here the focus is on results: the strengths and weaknesses of these applications, along with their potential as foundations for further progress. From Chakrabarti's work (painstaking, critical, and forward-looking), readers will gain the theoretical and practical understanding they need to contribute to the Web mining effort.

* A comprehensive, critical exploration of statistics-based attempts to make sense of Web mining.
* Details the special challenges associated with analyzing unstructured and semi-structured data.
* Looks at how classical Information Retrieval techniques have been modified for use with Web data.
* Focuses on today's dominant learning methods: clustering and classification, hyperlink analysis, and supervised and semi-supervised learning.
* Analyzes current applications for resource discovery and social network analysis.
* An excellent way to introduce students to especially vital applications of data mining and machine learning technology.
Language: English
Release date: October 16, 2002
ISBN: 9780080511726
Author

Soumen Chakrabarti

Soumen Chakrabarti is Assistant Professor in Computer Science and Engineering at the Indian Institute of Technology, Bombay. Prior to joining IIT, he worked on hypertext databases and data mining at IBM Almaden Research Center. He has developed three systems and holds five patents in this area. Chakrabarti has served as a vice-chair and program committee member for many conferences, including WWW, SIGIR, ICDE, and KDD, and as a guest editor of the IEEE TKDE special issue on mining and searching the Web. His work on focused crawling received the Best Paper award at the 8th International World Wide Web Conference (1999). He holds a Ph.D. from the University of California, Berkeley.


    Book preview

    Mining the Web - Soumen Chakrabarti

    Instructions for online access

    Thank you for your purchase. Please note that your purchase of this Elsevier eBook also includes access to an online version. Please click here (or go to ebooks.elsevier.com) to request an activation code and registration instructions in order to gain access to the web version.

    Senior Editor Lothlórien Homet

    Publishing Services Manager Edward Wade

    Editorial Assistant Corina Derman

    Cover Design Ross Carron Design

    Text Design Frances Baca Design

    Cover Image Kimihiro Kuno/Photonica

    Composition and Technical Illustration Windfall Software, using ZzTEX

    Copyeditor Sharilyn Hovind

    Proofreader Jennifer McClain

    Indexer Steve Rath

    Printer The Maple-Vail Book Manufacturing Group

    Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

    Morgan Kaufmann Publishers

    An Imprint of Elsevier, 340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, www.mkp.com

    All rights reserved

    Printed in the United States of America


    No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher.

    Permissions may be sought directly from Elsevier's Science and Technology Rights Department in Oxford, UK. Phone: (44) 1865 843830, Fax: (44) 1865 853333, e-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier homepage (http://www.elsevier.com) by selecting Customer Support and then Obtaining Permissions.

    Library of Congress Control Number: 2002107241

    ISBN-13: 978-1-55860-754-5

    ISBN-10: 1-55860-754-4

    This book is printed on acid-free paper.

    Foreword

    Jiawei Han

    University of Illinois, Urbana-Champaign

    The World Wide Web overwhelms us with immense amounts of widely distributed, interconnected, rich, and dynamic hypertext information. It has profoundly influenced many aspects of our lives, changing the ways we communicate, conduct business, shop, entertain, and so on. However, the abundant information on the Web is not stored in any systematically structured way, a situation which poses great challenges to those seeking to effectively search for high-quality information and to uncover the knowledge buried in billions of Web pages. Web mining, or the automatic discovery of interesting and valuable information from the Web, has therefore become an important theme in data mining.

    As a prominent researcher on Web mining, Soumen Chakrabarti has presented tutorials and surveys on this exciting topic at many international conferences. Now, after years of dedication, he presents us with this excellent book. Mining the Web: Discovering Knowledge from Hypertext Data is the first book solely dedicated to the theme of Web mining, and it offers comprehensive coverage and a rigorous treatment. Chakrabarti starts with a thorough introduction to the infrastructure of the Web, including the mechanisms for Web crawling, Web page indexing, and keyword or similarity-based searching of Web contents. He then gives a systematic description of the foundations of Web mining, focusing on hypertext-based machine learning and data mining methods, such as clustering, collaborative filtering, supervised learning, and semisupervised learning. After that, he presents the application of these fundamental principles to Web mining itself, especially Web linkage analysis, introducing the popular PageRank and HITS algorithms that substantially enhance the quality of keyword-based Web searches.

    If you are a researcher, a Web technology developer, or just an interested reader curious about how to explore the endless potential of the Web, you will find this book provides both a solid technical background and state-of-the-art knowledge on this fascinating topic. It is a jewel in the collection of data mining and Web technology books. I hope you enjoy it.

    Preface

    This book is about finding significant statistical patterns relating hypertext documents, topics, hyperlinks, and queries and using these patterns to connect users to information they seek. The Web has become a vast storehouse of knowledge, built in a decentralized yet collaborative manner. It is a living, growing, populist, and participatory medium of expression with no central editorship. This has positive and negative implications. On the positive side, there is widespread participation in authoring content. Compared to print or broadcast media, the ratio of content creators to the audience is more equitable. On the negative side, the heterogeneity and lack of structure make it hard to frame queries and satisfy information needs. For many queries posed with the help of words and phrases, there are thousands of apparently relevant responses, but on closer inspection these turn out to be disappointing for all but the simplest queries. Queries involving nouns and noun phrases, where the information need is to find out about the named entity, are the simplest sort of information-hunting tasks. Only sophisticated users succeed with more complex queries—for instance, those that involve articles and prepositions to relate named objects, actions, and agents. If you are a regular seeker and user of Web information, this state of affairs needs no further description.

    Detecting and exploiting statistical dependencies between terms, Web pages, and hyperlinks will be the central theme in this book. Such dependencies are also called patterns, and the act of searching for such patterns is called machine learning, or data mining. Here are some examples of machine learning for Web applications. Given a crawl of a substantial portion of the Web, we may be interested in constructing a topic directory like Yahoo!, perhaps detecting the emergence and decline of prominent topics with passing time. Once a topic directory is available, we may wish to assign freshly crawled pages and sites to suitable positions in the directory.

    In this book, the data that we will mine will be very rich, comprising text, hypertext markup, hyperlinks, sites, and topic directories. This distinguishes the area of Web mining as a new and exciting field, although it also borrows liberally from traditional data analysis. As we shall see, useful information on the Web is accompanied by incredible levels of noise, but thankfully, the law of large numbers kicks in often enough that statistical analysis can make sense of the confusion. Our goal is to provide both the technical background and tools and tricks of the trade of Web content mining, which was developed roughly between 1995 and 2002, although it continues to advance. This book is addressed to those who are, or would like to become, researchers and innovative developers in this area.

    Prerequisites and Contents

    The contents of this book are targeted at fresh graduate students but are also quite suitable for senior undergraduates. The book is partly based on tutorials at SIGMOD 1999 and KDD 2000, a survey article in SIGKDD Explorations, invited lectures at ACL 1999 and ICDT 2001, and a graduate elective taught at IIT Bombay in the spring of 2001. The general style is a mix of scientific and statistical programming with system engineering and optimizations. A background in elementary undergraduate statistics, algorithms, and networking should suffice to follow the material. The exposition also assumes that the reader is a regular user of search engines, topic directories, and Web content in general, and has some appreciation for the limitations of basic Web access based on clicking on links and typing keyword queries.

    The chapters fall into three major parts. For concreteness, we start with some engineering issues: crawling, indexing, and keyword search. This part also gives us some basic know-how for efficiently representing, manipulating, and analyzing hypertext documents with computer programs. In the second part, which is the bulk of the book, we focus on machine learning for hypertext: the art of creating programs that seek out statistical relations between attributes extracted from Web documents. Such relations can be used to discover topic-based clusters from a collection of Web pages, assign a Web page to a predefined topic, or match a user’s interest to Web sites. The third part is a collection of applications that draw upon the techniques discussed in the first two parts.

    To make the presentation concrete, specific URLs are indicated throughout, but there is no saying how long they will remain accessible on the Web. Luckily, the Internet Archive will let you view old versions of pages at www.archive.org/, provided this URL does not get dated.

    Omissions

    The field of research underlying this book is in rapid flux. A book written at this juncture is guaranteed to miss out on important areas. At some point a snapshot must be taken to complete the project. A few omissions, however, are deliberate. Beyond bare necessities, I have not engaged in a study of protocols for representing and transferring content on the Internet and the Web. Readers are assumed to be reasonably familiar with HTML. For the purposes of this book, you do not need to understand the XML (Extensible Markup Language) standard much more deeply than HTML. There is also no treatment of Web application services, dynamic site management, or associated networking and data-processing technology.

    I make no attempt to cover natural language (NL) processing, natural language understanding, or knowledge representation. This is largely because I do not know enough about natural language processing. NL techniques can now parse relatively well-formed sentences in many languages, disambiguate polysemous words with high accuracy, tag words in running text with part-of-speech information, represent NL documents in a canonical machine-usable form, and perform NL translation. Web search engines have been slow to embrace NL processing except as an explicit translation service. In this book, I will make occasional references to what has been called ankle-deep semantics—techniques that leverage semantic databases (e.g., a dictionary or thesaurus) in shallow, efficient ways to improve keyword search.

    Another missing area is Web usage mining. Optimizing large, high-flux Web sites to be visitor-friendly is nontrivial. Monitoring and analyzing the behavior of visitors in the past may lead to valuable insights into their information needs, and help in continually adapting the design of the site. Several companies have built systems integrated with Web servers, especially the kind that hosts e-commerce sites, to monitor and analyze traffic and propose site organization strategies. The array of techniques brought to bear on usage mining has a large overlap with traditional data mining in the relational data-warehousing scenario, for which excellent texts already exist.

    Acknowledgments

    I am grateful to many people for making this work possible. I was fortunate to associate with Byron Dom, Inderjit Dhillon, Dharmendra Modha, David Gibson, Dimitrios Gunopulos, Jon Kleinberg, Kevin McCurley, Nimrod Megiddo, and Prabhakar Raghavan at IBM Almaden Research Center, where some of the inventions described in this book were made between 1996 and 1999. I also acknowledge the extremely stimulating discussions I have had with researchers at the then Digital System Research Center in Palo Alto, California: Krishna Bharat, Andrei Bröder, Monika Henzinger, Hannes Marais, and Mark Najork, some of whom have moved on to Google and AltaVista. Similar gratitude is also due to Gary Flake, C. Lee Giles, Steve Lawrence, and Dave Pennock at NEC Research, Princeton. Thanks also to Pedro Domingos, Susan Dumais, Ravindra Jaju, Ronny Lempel, David Lewis, Tom Mitchell, Mandar Mitra, Kunal Punera, Mehran Sahami, Eric Saund, and Amit Singhal for helpful discussions. Jiawei Han’s text on data mining and his encouragement helped me decide to write this book. Krishna Bharat, Lyle Ungar, Karen Watterson, Ian Witten, and other, anonymous, referees have greatly enhanced the quality of the manuscript.

    Closer to home, Sunita Sarawagi and S. Sudarshan gave valuable feedback. Together with Pushpak Bhattacharya and Krithi Ramamritham, they kept up my enthusiasm during this long project in the face of many adversities. I am grateful to Tata Consultancy Services for their generous support through the Lab for Intelligent Internet Research during the preparation of the manuscript. T. P. Chandran offered invaluable administrative help. I thank Diane Cerra, Lothlórien Homet, Edward Wade, Mona Buehler, Corina Derman, and all the other members of the Morgan Kaufmann team for their patience with many delays in the schedule and their superb production job. I regret forgetting to express my gratitude to anyone else who has contributed to this work. The gratitude does live on in my heart. Finally, I wish to thank my wife, Sunita Sarawagi, and my parents, Sunil and Arati Chakrabarti, for their constant support and encouragement.

    Table of Contents

    Cover

    Title

    Copyright

    Foreword

    Preface

    Chapter 1: Introduction

    1.1 Crawling and Indexing

    1.2 Topic Directories

    1.3 Clustering and Classification

    1.4 Hyperlink Analysis

    1.5 Resource Discovery and Vertical Portals

    1.6 Structured vs. Unstructured Data Mining

    1.7 Bibliographic Notes

    Part I: Infrastructure

    Chapter 2: Crawling the Web

    2.1 HTML and HTTP Basics

    2.2 Crawling Basics

    2.3 Engineering Large-Scale Crawlers

    2.4 Putting Together a Crawler

    2.5 Bibliographic Notes

    Chapter 3: Web Search and Information Retrieval

    3.1 Boolean Queries and the Inverted Index

    3.2 Relevance Ranking

    3.3 Similarity Search

    3.4 Bibliographic Notes

    Part II: Learning

    Chapter 4: Similarity and Clustering

    4.1 Formulations and Approaches

    4.2 Bottom-Up and Top-Down Partitioning Paradigms

    4.3 Clustering and Visualization via Embeddings

    4.4 Probabilistic Approaches to Clustering

    4.5 Collaborative Filtering

    4.6 Bibliographic Notes

    Chapter 5: Supervised Learning

    5.1 The Supervised Learning Scenario

    5.2 Overview of Classification Strategies

    5.3 Evaluating Text Classifiers

    5.4 Nearest Neighbor Learners

    5.5 Feature Selection

    5.6 Bayesian Learners

    5.7 Exploiting Hierarchy among Topics

    5.8 Maximum Entropy Learners

    5.9 Discriminative Classification

    5.10 Hypertext Classification

    5.11 Bibliographic Notes

    Chapter 6: Semisupervised Learning

    6.1 Expectation Maximization

    6.2 Labeling Hypertext Graphs

    6.3 Co-training

    6.4 Bibliographic Notes

    Part III: Applications

    Chapter 7: Social Network Analysis

    7.1 Social Sciences and Bibliometry

    7.2 PageRank and HITS

    7.3 Shortcomings of the Coarse-Grained Graph Model

    7.4 Enhanced Models and Techniques

    7.5 Evaluation of Topic Distillation

    7.6 Measuring and Modeling the Web

    7.7 Bibliographic Notes

    Chapter 8: Resource Discovery

    8.1 Collecting Important Pages Preferentially

    8.2 Similarity Search Using Link Topology

    8.3 Topical Locality and Focused Crawling

    8.4 Discovering Communities

    8.5 Bibliographic Notes

    Chapter 9: The Future of Web Mining

    9.1 Information Extraction

    9.2 Natural Language Processing

    9.3 Question Answering

    9.4 Profiles, Personalization, and Collaboration

    References

    Index

    About the Author

    INTRODUCTION

    The World Wide Web is the largest and most widely known repository of hypertext. Hypertext documents contain text and generally embed hyperlinks to other documents distributed across the Web. Today, the Web comprises billions of documents, authored by millions of diverse people, edited by no one in particular, and distributed over millions of computers that are connected by telephone lines, optical fibers, and radio modems. It is a wonder that the Web works at all. Yet it is rapidly assisting and supplementing newspapers, radio, television, and telephone, the postal system, schools and colleges, libraries, physical workplaces, and even the sites of commerce and governance.

    A brief history of hypertext and the Web. Citation, a form of hyperlinking, is as old as written language itself. The Talmud, with its heavy use of annotations and nested commentary, and the Ramayana and Mahabharata, with their branching, nonlinear discourse, are ancient examples of hypertext. Dictionaries and encyclopedias can be viewed as a self-contained network of textual nodes joined by referential links. Words and concepts are described by appealing to other words and concepts. In modern times (1945), Vannevar Bush is credited with the first design of a photo-electrical-mechanical storage and computing device called a Memex (for memory extension), which could create and help follow hyperlinks across documents. Doug Engelbart and Ted Nelson were other early pioneers; Ted Nelson coined the term hypertext in 1965 [160] and created the Xanadu hypertext system with robust two-way hyperlinks, version management, controversy management, annotation, and copyright management.

    In 1980 Tim Berners-Lee, a consultant with CERN (the European organization for nuclear research), wrote a program to create and browse along named, typed bidirectional links between documents in a collection. By 1990, a more general proposal to support hypertext repositories was circulating in CERN, and by late 1990, Berners-Lee had started work on a graphical user interface (GUI) to hypertext, and named the program the World Wide Web. By 1992, other GUI interfaces such as Erwise and Viola were available. The Midas GUI was added to the pool by Tony Johnson at the Stanford Linear Accelerator Center in early 1993.

    In February 1993, Marc Andreessen at NCSA (National Center for Supercomputing Applications, www.ncsa.uiuc.edu/) completed an initial version of Mosaic, a hypertext GUI that would run on the popular X Window System used on UNIX machines. Behind the scenes, CERN also developed and improved HTML, a markup language for rendering hypertext, and HTTP, the hypertext transport protocol for sending HTML and other data over the Internet, and implemented a server of hypertext documents called the CERN HTTPD. Although stand-alone hypertext browsing systems had existed for decades, this combination of simple content and transport standards, coupled with user-friendly graphical browsers, led to the widespread adoption of this new hypermedium. During 1993, HTTP traffic grew from 0.1% to over 1% of the Internet traffic on the National Science Foundation backbone. There were a few hundred HTTP servers by the end of 1993.

    Between 1991 and 1994, the load on the first HTTP server (info.cern.ch) increased by a factor of one thousand (see Figure 1.1(a)). The year 1994 was a landmark: the Mosaic Communications Corporation (later Netscape) was founded, the first World Wide Web conference was held, and MIT and CERN agreed to set up the World Wide Web Consortium (W3C). The following years (1995–2001) featured breakneck innovation, irrational exuberance, and a return to reason, as is well known. We will review some of the other major events later in this chapter.

    A populist, participatory medium. As the Web has grown in size and diversity, it has acquired immense value as an active and evolving repository of knowledge. For the first time, there is a medium where the number of writers—disseminators of facts, ideas, and opinions—starts to approach the same order of magnitude as the number of readers. To date, both numbers still fall woefully short of representing all of humanity, and its languages, cultures, and aspirations. Still, it is a definitive move toward recording many areas of human thought and endeavor in a manner far more populist and accessible than before. Although media moguls have quickly staked out large territories in cyberspace as well, the political system is still far more anarchic (or democratic, depending on the point of view) than conventional print or broadcast media. Ideas and opinions that would never see the light of day in conventional media can live vibrant lives in cyberspace. Despite growing concerns in legal and political circles, censorship is limited in practice to only the most flagrant violations of proper conduct.

    FIGURE 1.1 The early days of the Web: CERN HTTP traffic grows by a factor of 1000 between 1991 and 1994 (a) (image courtesy W3C); the number of servers grows from a few hundred to a million between 1991 and 1997 (b) (image courtesy Nielsen [165]).

    Just as biochemical processes define the evolution of genes, mass media defines the evolution of memes, a word coined by Richard Dawkins to describe ideas, theories, habits, skills, languages, and artistic expressions that spread from person to person by imitation. Memes are replicators—just like genes—that are copied imperfectly and selected on the basis of viability, giving rise to a new kind of evolutionary process. Memes have driven the invention of writing, printing, and broadcasting, and now they have constructed the Internet. Free speech online, chain letters, and email viruses are some of the commonly recognized memes on the Internet, but there are many, many others. For any broad topic, memetic processes shape authorship, hyperlinking behavior, and resulting popularity of Web pages. In the words of Susan Blackmore, a prominent researcher of memetics, "From the meme's-eye point of view the World Wide Web is a vast playground for their own selfish propagation."

    The crisis of abundance and diversity. The richness of Web content has also made it progressively more difficult to leverage the value of information. The new medium has no inherent requirements of editorship and approval from authority. Institutional editorship addresses policy more than form and content. In addition, storage and networking resources are cheap. These factors have led to a largely liberal and informal culture of content generation and dissemination. (Most universities and corporations will not prevent an employee from writing at reasonable length about their latest kayaking expedition on their personal homepage, hosted on a server maintained by the employer.) Because the Internet spans nations and cultures, there is little by way of a uniform civil code. Legal liability for disseminating unverified or even libelous information is practically nonexistent, compared to print or broadcast media.

    Whereas the unregulated atmosphere has contributed to the volume and diversity of Web content, it has led to a great deal of redundancy and nonstandard form and content. The lowest common denominator model for the Web is rather primitive: the Web is a set of documents, where each document is a multiset (bag) of terms. Basic search engines let us find pages that contain or do not contain specified keywords and phrases. This leaves much to be desired. For most broad queries (e.g., java or kayaking) there are millions of qualifying pages. By qualifying, all we mean is that the page contains those, or closely related, keywords. There is little support to disambiguate short queries like java unless embedded in a longer, more specific query. There is no authoritative information about the reliability and prestige of a Web document or a site.
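
    To make the lowest common denominator model concrete, here is a minimal Python sketch, with made-up documents and query terms, of documents as bags of terms and of Boolean keyword matching over them; it is an illustration only, not a technique developed in this book.

        from collections import Counter

        # Tiny made-up corpus standing in for crawled Web pages.
        docs = {
            "d1": "java programming language tutorial",
            "d2": "kayaking trips on the java sea",
            "d3": "coffee beans from java island",
        }

        # Each document is reduced to a multiset (bag) of terms.
        bags = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

        def qualifies(bag, required, forbidden=()):
            # A page "qualifies" if it contains every required term and none of the forbidden ones.
            return all(t in bag for t in required) and not any(t in bag for t in forbidden)

        # The broad query "java" matches everything ...
        print([d for d, b in bags.items() if qualifies(b, ["java"])])
        # ... and only extra keywords narrow it down.
        print([d for d, b in bags.items() if qualifies(b, ["java"], forbidden=["coffee", "kayaking"])])

    Even this toy example shows why broad one-word queries disappoint: every mention of java qualifies, whether the page is about the language, the island, or the coffee.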

    Uniform accessibility of documents intended for diverse readership also complicates matters. Outside cyberspace, bookstores and libraries are quite helpful in guiding people with different backgrounds to different floors, aisles, and shelves. The amateur gardener and a horticulture professional know well how to walk their different ways. This is harder on the Web, because most sites and documents are just as accessible as any other, and conventional search services to date have little support for adapting to the background of specific users.

    The Web is also adversarial in that commercial interests routinely influence the operation of Web search and analysis tools. Most Web users pay only for their network access, and little, if anything, for the published content that they use. Consequently, the upkeep of content depends on the sale of products, services, and online advertisements.¹ This results in the introduction of a large volume of ads. Sometimes these are explicit ad hyperlinks. At other times they are more subtle, biasing apparently noncommercial matter in insidious ways. There are businesses dedicated exclusively to raising the rating of their clients as judged by prominent search engines, officially called search engine optimization.

    The goal of this book. In this book I seek to study and develop programs that connect people to the information they seek from the Web. The techniques that are examined will be more generally applicable to any hypertext corpus. I will call hypertext data semistructured or unstructured, because they do not have a compact or precise description of data items. Such a description is called a schema, which is mandatory for relational databases. The second major difference is that unstructured and semistructured hypertext has a very large number of attributes, if each lexical unit (word or token) is considered as a potential attribute. (We will return to a comparison of data mining for structured and unstructured domains in Section 1.6.)

    The Web is used in many ways other than authoring and seeking information, but we will largely limit our attention to this aspect. We will not be directly concerned with other modes of usage, such as Web-enabled email, news, or chat services, although chances are some of the techniques we study may be applicable to those situations. Web-enabled transactions on relational databases are also outside the scope of this book.

    We shall proceed from standard ways to access the Web (using keyword search engines) to relatively sophisticated analyses of text and hyperlinks in later chapters. Machine learning is a large and deep body of knowledge we shall tap liberally, together with overlapping and related work in pattern recognition and data mining. Broadly, these areas are concerned with searching for, confirming, explaining, and exploiting nonobvious and statistically significant constraints and patterns between attributes of large volumes of potentially noisy data. Identifying the main topic(s) discussed in a document, modeling a user’s topics of interest, and recommending content to a user based on past behavior and that of other users all come under the broad umbrella of machine learning and data mining. These are well-established fields, but the novelty of highly noisy hypertext data does necessitate some notable innovations.

    The following sections briefly describe the sequence of material in the chapters of this book.

    1.1 Crawling and Indexing

    We shall visit a large variety of programs that process textual and hypertextual information in diverse ways, but the capability to quickly fetch a large number of Web pages into a local repository and to index them based on keywords is required in many applications. Large-scale programs that fetch tens of thousands of Web pages per second are called crawlers, spiders, Web robots, or bots. Crawling is usually performed to subsequently index the documents fetched. Together, a crawler and an index form key components of a Web search engine.
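
    As a rough sketch of the fetch-parse-enqueue loop at the heart of any crawler, consider the minimal single-threaded Python program below. The seed URL is a placeholder, and a real crawler (the subject of Chapter 2) adds politeness delays, robots.txt handling, DNS caching, and massive concurrency.

        import urllib.request
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin

        class LinkExtractor(HTMLParser):
            # Collects href attributes from anchor tags.
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(seed, max_pages=10):
            frontier, seen, repository = deque([seed]), {seed}, {}
            while frontier and len(repository) < max_pages:
                url = frontier.popleft()
                try:
                    with urllib.request.urlopen(url, timeout=5) as resp:
                        page = resp.read().decode("utf-8", errors="replace")
                except Exception:
                    continue  # skip unreachable or non-HTML responses
                repository[url] = page            # local page repository for later indexing
                extractor = LinkExtractor()
                extractor.feed(page)
                for href in extractor.links:
                    absolute = urljoin(url, href)  # resolve relative links
                    if absolute.startswith("http") and absolute not in seen:
                        seen.add(absolute)
                        frontier.append(absolute)
            return repository

        # pages = crawl("http://example.com/")  # placeholder seed URL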

    One of the earliest search engines to be built was Lycos, founded in January 1994, operational in June 1994, and a publicly traded company in April 1996. Lycos was born from a research project at Carnegie Mellon University by Dr. Michael Mauldin. Another search engine, WebCrawler, went online in spring 1994. It was also started as a research project, at the University of Washington, by Brian Pinkerton. During the spring of 1995, Louis Monier, Joella Paquette, and Paul Flaherty at Digital Equipment Corporation's research labs developed AltaVista, one of the best-known search engines, with claims to over 60 patents, the highest in this area so far. Launched in December 1995, AltaVista started fielding two million queries a day within three weeks.

    Many others followed the search engine pioneers and offered various innovations. HotBot and Inktomi feature a distributed architecture for crawling and storing pages. Excite could analyze similarity between keywords and match queries to documents having no syntactic match. Many search engines started offering a more like this link with each response. Search engines remain some of the most visited sites today.

    Chapter 2 looks at how to write crawlers of moderate scale and capability and addresses various performance issues. Then, Chapter 3 discusses how to process the data into an index suitable for answering queries. Indexing enables keyword and phrase queries and Boolean combinations of such queries. Unlike relational databases, the query engine cannot simply return all qualifying responses in arbitrary order. The user implicitly expects the responses to be ordered in a way that maximizes the chances of the first few responses satisfying the information need. These chapters can be skimmed if you are more interested in mining per se; more in-depth treatment of information retrieval (IR) can be found in many excellent classic texts cited later.
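
    A minimal sketch of the inverted index that Chapter 3 develops follows; the toy documents are invented, and a production index adds positional information, compression, and far more careful relevance ranking than the crude term-overlap ordering shown here.

        from collections import defaultdict

        docs = {
            1: "web mining discovers knowledge from hypertext",
            2: "crawlers fetch web pages for the index",
            3: "hypertext links connect web documents",
        }

        # Inverted index: term -> set of ids of documents containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.split():
                index[term].add(doc_id)

        def boolean_and(*terms):
            # Documents containing every query term.
            postings = [index[t] for t in terms]
            return set.intersection(*postings) if postings else set()

        def ranked(*terms):
            # Order qualifying documents by how many query terms they contain,
            # a crude stand-in for the relevance ranking discussed in Chapter 3.
            hits = defaultdict(int)
            for t in terms:
                for doc_id in index[t]:
                    hits[doc_id] += 1
            return sorted(hits, key=hits.get, reverse=True)

        print(sorted(boolean_and("web", "hypertext")))  # -> [1, 3]
        print(ranked("web", "hypertext"))               # documents with both terms come first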

    1.2 Topic Directories

    The first wave of innovations was related to basic infrastructure comprising crawlers and search engines. Topic directories were the next significant feature to gain visibility.

    In 1994, Jerry Yang and David Filo, Ph.D. students at Stanford University, created the Yahoo!² directory (www.yahoo.com/) to help their friends locate useful Web sites, growing by the thousands each day. Srinija Srinivasan, another Stanford alum, provided the expertise to create and maintain the treelike branching hierarchies that steer people to content. By April 1995, the project that had started out as a hobby was incorporated as a company.

    The dazzling success of Yahoo! should not make one forget that organizing knowledge into ontologies is an ancient art, descended from philosophy and epistemology. An ontology defines a vocabulary, the entities referred to by elements in the vocabulary, and relations between the entities. The entities may be fine-grained, as in WordNet, a lexical network for English, or they may be relatively coarse-grained topics, as in the Yahoo! topic directory.

    The paradigm of browsing a directory of topics arranged in a tree where children represent specializations of the parent topic is now pervasive. The average computer user is familiar with hierarchies of directories and files, and this familiarity carries over rather naturally to topic taxonomies. Following Yahoo!’s example, a large number of content portals have added support for hosting topic taxonomies. Some organizations (e.g., Yahoo!) employ a team of editors to maintain the taxonomy; others (e.g., About.com and the Open Directory Project (dmoz.org/)) are more decentralized and work through a loosely coupled network of volunteers. A large fraction of Web search engines now incorporate some form of taxonomy-based search as well.

    Topic directories offer value in two forms. The obvious contribution is the cataloging of Web content, which makes it easier to search (e.g., SOCKS as in the firewall protocol is easier to distinguish from the clothing item). Collecting links into homogeneous clusters also offers an implicit more like this mechanism: once the user has located a few sites of interest, others belonging to the same, sibling, or ancestor categories may also be of interest. The second contribution is in the form of quality control. Because the links in a directory usually go through editorial scrutiny, however cursory, they tend to reflect the more authoritative and popular sections of the Web. As we shall see, both of these forms of human input can be exploited well by Web mining programs.
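
    The implicit more-like-this mechanism amounts to simple tree navigation over the taxonomy. A tiny sketch, using an invented three-level directory rather than any real catalog, might look like this:

        # Invented miniature topic directory: child topic -> parent topic.
        parent = {
            "Java": "Programming",
            "Python": "Programming",
            "Programming": "Computers",
            "Kayaking": "Outdoors",
            "Computers": None,
            "Outdoors": None,
        }

        def ancestors(topic):
            # Walk up the tree from a topic toward the root.
            chain = []
            while parent.get(topic) is not None:
                topic = parent[topic]
                chain.append(topic)
            return chain

        def siblings(topic):
            # Other children of the same parent: the implicit "more like this" set.
            p = parent.get(topic)
            return [t for t, q in parent.items() if q == p and t != topic]

        print(ancestors("Java"))  # ['Programming', 'Computers']
        print(siblings("Java"))   # ['Python']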

    1.3 Clustering and Classification

    Topic directories built with human effort (e.g., Yahoo! or the Open Directory) immediately raise a question: Can they be constructed automatically out of an amorphous corpus of Web pages, such as one collected by a crawler? We study one aspect of this problem, called clustering, or unsupervised learning, in Chapter 4. Roughly speaking, a clustering algorithm discovers groups in the set of documents such that documents within a group are more similar than documents across groups.
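
    As a concrete, if simplistic, illustration of unsupervised grouping (using scikit-learn rather than anything developed in this book, and with invented snippets standing in for crawled pages), TF-IDF vectors and k-means already separate two broad topics:

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        # Invented snippets standing in for crawled Web pages.
        pages = [
            "java virtual machine bytecode compiler",
            "garbage collection in the java runtime",
            "kayaking gear and paddling technique",
            "whitewater kayaking trip reports",
        ]

        # Represent each page as a TF-IDF vector, then group the pages into two clusters.
        vectors = TfidfVectorizer().fit_transform(pages)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
        for label, page in zip(labels, pages):
            print(label, page)

    Whether such output matches what a human editor expects is exactly the difficulty raised next: the result depends on the implicit similarity measure, here Euclidean distance between TF-IDF vectors.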

    Clustering is a classic area of machine learning and pattern recognition [72]. However, a few complications arise in the hypertext domain. A basic problem is that different people do not agree about what they expect a clustering algorithm to output for a given data set. This is partly because they are implicitly using different similarity measures, and it is difficult to guess what their similarity measures are because the number of attributes is so large.

    Hypertext is also rich in features: textual tokens, markup tags, URLs, host names in
