
A Brief Introduction to

Data Mining (DM)


BS CS - VIII
BY
SANIA NAYAB
What is Data Mining?: Informal
“There are things that we know that we know…

there are things that we know that we don’t know…

there are things that we don’t know we don’t know.”


Donald Rumsfeld

US Secretary of Defence

The methodology is first to convert “unknown unknowns” into “known unknowns”, and then
finally into “known knowns”.
What is Data Mining?: Slightly Informal

Tell me something that I should know.


When you don’t know what you should be knowing, how do you write SQL?
You can’t!
What is Data Mining?: Formal
• Knowledge Discovery in Databases (KDD).
• Data mining digs out valuable, non-trivial information from large, multidimensional,
apparently unrelated databases (data sets).
• It is the integration of business knowledge, people, information, algorithms, statistics and
computing technology.
• Finding useful hidden patterns and relationships in data.
Why Data Mining?

Data mining is exploratory data analysis with little or no human interaction,
using computationally feasible techniques, i.e., an attempt to find interesting
structures or patterns that are unknown a priori.
Cont…
Data is collected much faster than it can be processed or managed. NASA’s Earth Observation
System (EOS) alone will collect 15 petabytes by 2007 (15,000,000,000,000,000 bytes).
• Much of which won't be used - ever!
• Much of which won't be seen - ever!
• Why not?
• There is so much volume that the usefulness of some of it will never be discovered.
SOLUTION: Reduce the volume and/or raise the information content by structuring, querying,
filtering, summarizing, aggregating, mining...
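Claude Shannon's information theory

More volume means less information

Read loosely in the light of Claude Shannon’s information theory, this says that as data volume
increases, the information content per unit of data tends to decrease, and vice versa.
(Formally, Shannon’s theory measures the information in a message by how improbable the
message is, not by its size.)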
Cont…
Data Mining is HOT!
• 10 Hottest Jobs of the year 2025
Time Magazine, 22 May 2000

• 10 emerging areas of technology
MIT’s Magazine of Technology Review, Jan/Feb 2001

The May 2000 issue of TIME Magazine listed the ten hottest jobs of the year 2025.
Data miners and knowledge engineers were in 5th and 6th position, respectively.
In the list of emerging technologies that will change the world, data mining is in 3rd
place.
In view of these facts, data miners can expect long careers in both national and international
markets, as major private and government organizations are quickly adopting the technology
and many already have.
How is Data Mining different?
• Knowledge Discovery
-- Overall process of discovering useful knowledge.
• Data Mining (knowledge-driven exploration)
-- Query formulation problem.
-- Visualizing and understanding a large data set.
-- Data growth rate too high to be handled manually.
• Data Warehouses (data-driven exploration)
-- Querying summaries of transactions, etc. Decision support.
• Traditional Database (transactions)
-- Querying data in well-defined processes. Reliable storage.
Data Mining Vs. Statistics
• Formal statistical inference is assumption driven, i.e. a hypothesis is formed and validated
against the data.
• Data mining is discovery driven, i.e. patterns and hypotheses are automatically extracted from
data.
• Said another way, data mining is knowledge driven, while statistics is human driven.
• The two resemble each other in exploratory data analysis, but statistics focuses on data sets
far smaller than those used by data mining researchers.
• Statistics is useful for verifying relationships among a few parameters when the relationships
are linear.
• Data mining builds much more complex, predictive, nonlinear models, which are used for
predicting behavior impacted by many factors.
What Can Data Mining Do?
There are a number of data mining techniques and the selection of a particular technique is
highly application dependent, although other factors affect the selection process too.

So let’s look at some of the DM application areas or techniques.


• Classification
• Estimation
• Prediction
• Market Basket Analysis
• Clustering
• Description
CLASSIFICATION

• Classification consists of examining the properties of a newly presented observation and
assigning it to a predefined class.
◦ Assigning customers to predefined customer segments (good vs. bad)
◦ Assigning keywords to articles
◦ Classifying credit applicants as low, medium, or high risk
◦ Classifying instructor rating as excellent, very good, good, fair, or poor
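
To make the idea concrete, here is a minimal sketch of classifying credit applicants into the predefined classes low, medium and high risk. The training data, the two features (income and debt ratio) and the nearest-neighbour rule are all assumptions chosen for illustration; the slides do not prescribe a particular algorithm.

```python
import math

# Hypothetical labelled applicants: (income in thousands, debt ratio) -> risk class.
TRAINING = [
    ((90.0, 0.10), "low"),
    ((85.0, 0.15), "low"),
    ((55.0, 0.35), "medium"),
    ((50.0, 0.40), "medium"),
    ((25.0, 0.70), "high"),
    ((20.0, 0.80), "high"),
]


def classify(applicant, training=TRAINING):
    """Assign a new applicant to the class of the closest labelled example."""

    def distance(a, b):
        # Scale the debt ratio by 100 so both features span a comparable range.
        return math.hypot(a[0] - b[0], (a[1] - b[1]) * 100)

    _, label = min(training, key=lambda item: distance(item[0], applicant))
    return label


if __name__ == "__main__":
    print(classify((88.0, 0.12)))
    print(classify((22.0, 0.75)))
```

The classes are fixed in advance by the labelled examples; the new observation is only assigned to one of them.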
ESTIMATION
As opposed to the discrete outcomes of classification (e.g. YES or NO), estimation deals with
continuous-valued outcomes.
Example:
Build a model that assigns a value from 0 to 1 to each member of the set.
Then classify the members into categories based on a threshold value.
As the threshold changes, the class assigned to a member may change.
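
The thresholding step in this example can be sketched as follows. The scores are made-up values standing in for the output of some estimation model, and the function name is hypothetical.

```python
def assign_classes(scores, threshold):
    """Turn continuous scores in [0, 1] into YES/NO classes using a threshold."""
    return {
        member: ("YES" if score >= threshold else "NO")
        for member, score in scores.items()
    }


if __name__ == "__main__":
    # Hypothetical model outputs for three members of the set.
    scores = {"A": 0.92, "B": 0.55, "C": 0.30}
    print(assign_classes(scores, threshold=0.5))
    print(assign_classes(scores, threshold=0.6))
```

Raising the threshold from 0.5 to 0.6 moves member B from YES to NO, although B's estimated score has not changed.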
PREDICTION

Same as classification or estimation, except that records are classified according to some
predicted future behavior or estimated future value.
◦ A model is built by applying classification or estimation to training examples for which the value to be
predicted is already known, together with historical data.
◦ Once the model explains the known values, it is used to predict the future.

Example:
Predicting how much customers will spend during next 6 months.
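
As a rough illustration of this example, the sketch below fits a straight line to historical data and uses it to predict future spending. The figures are invented, and ordinary least squares on a single feature is only one possible choice of model, assumed here for simplicity.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical history: spend in the previous 6 months -> spend in the following 6 months.
    previous = [100.0, 200.0, 300.0, 400.0]
    following = [120.0, 220.0, 320.0, 420.0]
    model = fit_line(previous, following)
    # Predicted spend over the next 6 months for a customer who spent 500 in the last 6.
    print(predict(model, 500.0))
```

The model is built on records whose outcome is already known, then applied to a customer whose future spending is not.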
MARKET BASKET ANALYSIS

Determining which things go together, e.g. items in a shopping cart at a supermarket.
◦ Used to identify cross-selling opportunities
◦ Used to design attractive packages or groupings of products and services, or to increase the price of some items, etc.
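
A minimal sketch of finding "which things go together": count how often each pair of items shares a basket, and how often one item is bought given another (the rule's confidence). The baskets are invented, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]


def pair_counts(baskets):
    """Count the number of baskets in which each pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


if __name__ == "__main__":
    print(pair_counts(BASKETS).most_common(2))
    print(confidence(BASKETS, "bread", "butter"))
```

Pairs with high counts and high confidence are candidates for cross-selling or bundling.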
CLUSTERING
The task of segmenting a heterogeneous population into a number of more homogeneous sub-groups
or clusters.
Unlike classification, it does NOT depend on predefined classes.
It is up to you to determine what meaning, if any, to attach to the resulting clusters.
It could be the first step in a market segmentation effort.
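
For illustration, here is a small sketch of clustering without predefined classes, using a one-dimensional k-means procedure. The choice of k-means, the spending figures and the starting centers are all assumptions; the slides do not name a specific clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group 1-D values around centers by repeated assign-and-average steps."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


if __name__ == "__main__":
    # Hypothetical monthly spend of six customers.
    spend = [10, 12, 11, 95, 100, 105]
    centers, groups = kmeans_1d(spend, centers=[0, 50])
    print(centers)
    print(groups)
```

The procedure only returns groups; deciding that one group means "occasional shoppers" and the other "heavy spenders" is left to the analyst.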

Examples of Clustering Applications (Read by Yourself)


DESCRIPTION

Describe what is going on in a complicated database so as to increase our understanding.

A good description of a behavior will suggest an explanation as well.


The metrics we use for comparison of DM techniques are:

Read
Where does Data Mining fit in?
Main Types of Data Mining (this will be the presentation topic)

Supervised Vs. Unsupervised Learning
