Sunteți pe pagina 1din 17

DATA MINING

LECTURE 1
Introduction
What is data mining?
• After years of data mining there is still no
unique answer to this question.

• A tentative definition:
Data mining is the use of efficient techniques for the
analysis of very large collections of data and the
extraction of useful and possibly unexpected patterns in
data.
Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes


– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets

3
Evolution of Sciences
• Before 1600, empirical science
• 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate
experiments and generalize our understanding.
• 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational branch (e.g.
empirical, theoretical, and computational ecology, or physics, or linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability to find
closed-form solutions for complex mathematical models.
• 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks scale
almost linearly with data volumes. Data mining is a major new challenge!
• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm.
ACM, 45(11): 50-54, Nov. 2002
4
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

5
What Is Data Mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems

6
Data
• Multiple types of data: tables, time series, images,
graphs, etc

• Spatial and temporal aspects

• Interconnected data of different types:


– From the mobile phone we can collect, location of the
user, friendship information, check-ins to venues, opinions
through twitter, images though cameras, queries to search
engines
Example: transaction data
• Billions of real-life customers:
– WALMART: 20M transactions per day
– AT&T 300 M calls per day
– Credit card companies: billions of transactions per
day.

• The point cards allow companies to collect


information about specific users
Example: document data
• Web as a document repository: estimated 50
billions of web pages

• Wikipedia: 4 million articles (and counting)

• Online news portals: steady stream of 100’s of new


articles every day

• Twitter: ~300 million tweets every day


Example: network data
• Web: 50 billion pages linked via hyperlinks

• Facebook: 500 million users

• Twitter: 300 million users

• Instant messenger: ~1billion users

• Blogs: 250 million blogs worldwide, presidential


candidates run blogs
Example: environmental data
• Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php

• “a database of temperature, precipitation and pressure


records managed by the National Climatic Data Center,
Arizona State University and the Carbon Dioxide Information
Analysis Center”

• “6000 temperature stations, 7500 precipitation stations,


2000 pressure stations”
– Spatiotemporal data
Attributes
So, what is Data?
Tid Refund Marital Taxable
• Collection of data objects and Status Income Cheat

their attributes 1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
• An attribute is a property or
4 Yes Married 120K No
characteristic of an object
5 No Divorced 95K Yes
– Examples: eye color of a person, Objects
6 No Married 60K No
temperature, etc.
7 Yes Divorced 220K No
– Attribute is also known as
variable, field, characteristic, or 8 No Single 85K Yes

feature 9 No Married 75K No

• A collection of attributes describe 10


10 No Single 90K Yes

an object
– Object is also known as record, Size: Number of objects
point, case, sample, entity, or Dimensionality: Number of attributes
instance Sparsity: Number of populated
object-attribute pairs
Data
• Data is/are the facts of the World. For
example, take yourself. You may be 5ft tall,
have brown hair and blue eyes. All of this is
“data”. You have brown hair whether this is
written down somewhere or not.
Information
• Information allows us to expand our knowledge
beyond the range of our senses. We can capture
data in information, then move it about so that
other people can access it at different times.
• If I take a picture of you, the photograph is
information. But what you look like is data.
Knowledge
Benefits of Data Mining

•Data mining technique helps companies to get knowledge-based


information.
•Data mining helps organizations to make the profitable
adjustments in operation and production.
•The data mining is a cost-effective and efficient solution compared
to other statistical data applications.
•Data mining helps with the decision-making process.
•Facilitates automated prediction of trends and behaviors as well as
automated discovery of hidden patterns.
•It can be implemented in new systems as well as existing platforms
•It is the speedy process which makes it easy for the users to
analyze huge amount of data in less time.
Applications of Data Mining
• Health Care
• Finance
• Retail Industry
• Telecommunication
• Text Mining & Web Mining
• Higher Education

S-ar putea să vă placă și