Introduction To Data Mining

WHAT IS DATA MINING?

Data mining is the process of discovering:
Meaningful new correlations
Patterns
Trends
by sifting through large amounts of data stored in

repositories, using pattern recognition
technologies as well as statistical and
mathematical techniques.

Data mining is the analysis of (often large)
observational data sets to find unsuspected

relationships and to summarize the data in novel
ways that are both understandable and useful to
the data owner

Data mining is interdisciplinary field bringing
together techniques from machine learning,

pattern recognition, statistics, databases and
visualization to address the issue of information
extraction from large data bases
Introduction
Data mining skills are in high demand as organizations
increasingly put data repositories online.

Effectively analyzing information from customers,
partners, and suppliers has become important to more
companies.
Many companies have implemented a data warehouse
strategy and are now starting to look at what they can do
with all that data
WHY DATA MINING?

The ongoing remarkable growth in the field of data mining
and knowledge discovery has been fueled by a fortunate

confluence of a variety of factors:
The storing of the data in data warehouses, so that the entire
enterprise has access to a reliable current database

The availability of increased access to data from Web navigation
and intranets
The competitive pressure to increase market share in a globalized
economy
The development of off-the-shelf commercial data mining software
suites
The tremendous growth in computing power and storage capacity
FALLACIES OF DATA MINING

Fallacy 1. There are data mining tools that we can turn
loose on our data repositories and use to find answers

to our problems.
Reality. There are no automatic data mining tools that will
solve your problems mechanically while you wait. Data

mining is a process.
Fallacy 2. The data mining process is autonomous,
requiring little or no human oversight.

Reality. As we saw above, the data mining process requires
significant human interactivity at each stage. Even after the

model is deployed, the introduction of new data often
requires an updating of the model. Continuous quality
monitoring and other evaluative measures must be assessed
by human analysts.

Fallacy 3. Data mining pays for itself quite quickly.
Reality. The return rates vary, depending on the startup
costs, analysis personnel costs, data warehousing
preparation costs, and so on.
Fallacy 4. Data mining software packages are
intuitive and easy to use.

Reality. Again, ease of use varies. However, data analysts
must combine subject matter knowledge with an

analytical mind and a familiarity with the overall
business or research model.

Fallacy 5. Data mining will identify the causes of our
business or research problems.
Reality. The knowledge discovery process will help you to
uncover patterns of behavior. Again, it is up to humans to

identify the causes.
Fallacy 6. Data mining will clean up a messy database
automatically.
Reality. Well, not automatically. As a preliminary phase in the
data mining process, data preparation often deals with data

that has not been examined or used in years. Therefore,
organizations beginning a newdata mining operation will
often be confronted with the problem of data that has been
lying around for years, is stale, and needs considerable
updating.
Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises
of a few steps leading from raw data collections to some

form of new knowledge. The iterative process consists of
the following steps:
Data cleaning: also known as data cleansing, it is a phase in
which noise data and irrelevant data are removed from the
collection.
Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
Data selection: at this step, the data relevant to the analysis
is decided on and retrieved from the data collection.
Data transformation: also known as data consolidation, it is
a phase in which the selected data is transformed into forms
appropriate for the mining procedure.

Data mining: it is the crucial step in which clever
techniques are applied to extract patterns potentially

useful.
Pattern evaluation: in this step, strictly interesting
patterns representing knowledge are identified based on
given measures.
Knowledge representation: is the final phase in which
the discovered knowledge is visually represented to the
user. This essential step uses visualization techniques to
help users understand and interpret the data mining
results.
WHAT TASKS CAN DATA MINING

ACCOMPLISH?
The following list shows the most common data
mining tasks.
Estimation
Prediction
Classification
Clustering
Association
Classification
There is a target categorical variable
Example:
income bracket: partitioned into 3 classes or categories:
high income, middle income, and low income
Classification
Classify the income brackets of persons not currently
in the database, based on other characteristics

associated with that person, such as age, gender, and
occupation
Classification Task
Classification - Steps
First : examine the data set containing both the predictor
variables and the (already classified) target variable i.e

income bracket.
The algorithm (software) learns about which
combinations of variables are associated with which
income brackets.
For example, older females may be associated with the high-
income bracket. This data set is called the training set.
Second: The algorithm would look at new records, for
which no information about income bracket is available.

Based on the classifications in the training set, the
algorithm would assign classifications to the new records.
For example, a 63-year-old female professor might be
classified in the high-income bracket.
Classification
For example, in the medical field, suppose that we are
interested in classifying
the type of drug a patient should be prescribed, based on
certain patient characteristics,
such as the age of the patient and the patients
sodium/potassium ratio. Figure 1.4 is
a scatter plot of patients sodium/potassium ratio against
patients ages for a sample
of 200 patients. The particular drug prescribed is
symbolized by the shade of the
points. Light gray points indicate drug Y; medium gray
points indicate drug A or X; dark gray points indicate drug
B or C.
Which drug should be prescribed for a young patient with a high
sodium/potassium ratio?
Which drug should be prescribed for older patients with low
sodium/potassium ratios?
Other Classification examples

Determining whether a particular credit card
transaction is fraudulent
Assessing whether a mortgage application is a
good or bad credit risk
Diagnosing whether a particular disease is present
Determining whether a will was written by the
actual deceased, or fraudulently by someone else
Identifying whether or not certain financial or
personal behavior indicates a possible terrorist
threat
Estimation
Estimation is similar to classification except that
the target variable is numerical rather than

categorical.
Example: estimating the systolic blood pressure
reading of a hospital patient, based on the
patients age, gender, body-mass index, and blood
sodium levels
The relationship between systolic blood pressure
and the predictor variables in the training set
would provide us with an estimation model.
We can then apply that model to new cases.
Examples of Estimation
Estimating the amount of money a randomly
chosen family of four will spend for back-to-school

shopping this new school season.
Estimating the grade-point average (GPA) of a
masters student, based on that students
undergraduate GPA.
Estimating the number of goals per game that
Ronaldinho will score during next UEFA, based on
his last goals.
Prediction
Prediction is similar to classification and estimation, except
that for prediction, the results lie in the future regardless

of numerical or categorical
Any of the methods and techniques used for classification
and estimation may also be used, for prediction.
Traditional statistical methods - point estimation and
confidence interval estimations, simple linear regression
and correlation, multiple regression
Data mining and knowledge discovery methods such as
neural network
Examples of Prediction
Predicting the price of a stock three months into
the future
Predicting the percentage increase in traffic deaths
next year if the speed limit is increased
Predicting the winner of this falls baseball World
Series, based on a comparison of team statistics
Predicting whether a particular molecule in drug
discovery will lead to a profitable new drug for a
pharmaceutical company
Examples of Prediction
Predicting the price of a stock three months into the future
Clustering
Clustering refers to the grouping of records, observations,
or cases into classes of similar objects.

A cluster is a collection of records that are similar to one
another, and dissimilar to records in other clusters.
Clustering differs from classification in that there is no
target variable for clustering.
The clustering task does not try to classify, estimate, or
predict the value of a target variable.
Instead, clustering algorithms seek to segment the entire
data set into relatively homogeneous subgroups or clusters.
demographic profile of each of the geographic areas in the country, as
defined by zip code. One of the clustering mechanisms they use is the
PRIZM segmentation system, which describes every U.S. zip code area
in terms of distinct lifestyle types
Clustering Example
clusters for zip code 90210, Beverly Hills, California.
Beverly Hills, CA 90210's most common PRIZM NE
Segments are:
Number
Name
16
Bohemian Mix
07
Money & Brains
03
Movers & Shakers
01
Upper Crust
04
Young Digerati
Examples of Clustering
Group of residents according to zip codes.
For accounting auditing purposes, to segment
financial behavior into benign and suspicious

categories
As a dimension-reduction tool when the data set has
hundreds of attributes
For gene expression clustering, where very large
quantities of genes may exhibit similar behavior
Association
The association task for data mining is the job of

finding which attributes go together.
Most prevalent in the business world, where it is
known as affinity analysis or market basket
analysis
The task of association seeks to uncover rules for
quantifying the relationship between two or more
attributes.
Association rules are of the form If antecedent,
then consequent, together with a measure of the
support and confidence associated with the rule.
Association
Another example, a particular supermarket may find
that of the 1000 customers shopping on a Thursday

night, 200 bought diapers, and of those 200 who
bought diapers, 50 bought Coke.
Thus, the association rule would be If buy diapers,
then buy coke with a support of 200/1000 = 20% and a
confidence of 50/200 = 25%.
Examples of Association
Investigating the proportion of subscribers to a companys
cell phone plan that respond positively to an offer of a

service upgrade
Examining the proportion of children whose parents read
to them who are themselves good readers
Predicting degradation in telecommunications networks
Finding out which items in a supermarket are purchased
together and which items are never purchased together
Determining the proportion of cases in which a new drug
will exhibit dangerous side effects
Bohemian Mix
Midscale, Middle Age Mix
A collection of mobile urbanites, Bohemian Mix
represents the nation's most liberal lifestyles. Its
residents are an ethnically diverse, progressive mix of
young singles, couples, and families ranging from
students to professionals. In their funky row houses and
apartments, Bohemian Mixers are the early adopters
who are quick to check out the latest movie, nightclub,
laptop, and microbrew.
Money & Brains

Upscale, Older Mix
The residents of Money & Brains seem to have it all: high
incomes, advanced degrees, and sophisticated tastes to
match their credentials. Many of these city dwellers are
married couples with few children who live in
fashionable homes on small, manicured lots.
Movers & Shakers

Wealthy, Middle Age w/o Kids
Movers & Shakers is home to America's up-and-coming business
class: a wealthy suburban world of dual-income couples who are

highly educated, typically between the ages of 35 and 54. Given its
high percentage of executives and white-collar professionals, there's
a decided business bent to this segment: members of Movers &
Shakers rank number one for owning a small business and having a
home office.
Upper Crust Wealthy

Older w/o Kids
The nation's most exclusive address, Upper Crust is the
wealthiest lifestyle in America--a haven for emptynesting couples between the ages of 45 and 64. No
segment has a higher concentration of residents earning
over $100,000 a year or possessing a postgraduate degree.
And none has a more opulent standard of living.
Young Digerati
Upscale, Younger Mix
Young Digerati are tech-savvy and live in fashionable
neighborhoods on the urban fringe. Affluent, highly
educated, and ethnically mixed, Young Digerati
communities are typically filled with trendy apartments
and condos, fitness clubs and clothing boutiques, casual
restaurants and all types of bars--from juice to coffee to
microbrew.

Introduction To Data Mining

Încărcat de

Informații document

Descriere originală:

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Introduction To Data Mining

Încărcat de

Drepturi de autor:

Formate disponibile

WHAT IS DATA MINING?

WHAT IS DATA MINING?

by sifting through large amounts of data stored in

WHAT IS DATA MINING?

observational data sets to find unsuspected

WHAT IS DATA MINING?

together techniques from machine learning,

increasingly put data repositories online.

WHY DATA MINING?

and knowledge discovery has been fueled by a fortunate

enterprise has access to a reliable current database

FALLACIES OF DATA MINING

loose on our data repositories and use to find answers

solve your problems mechanically while you wait. Data

Fallacy 2. The data mining process is autonomous,

requiring little or no human oversight.

significant human interactivity at each stage. Even after the

FALLACIES OF DATA MINING

intuitive and easy to use.

must combine subject matter knowledge with an

FALLACIES OF DATA MINING

business or research problems.

Reality. The knowledge discovery process will help you to

uncover patterns of behavior. Again, it is up to humans to

Fallacy 6. Data mining will clean up a messy database

Reality. Well, not automatically. As a preliminary phase in the

data mining process, data preparation often deals with data

Knowledge Discovery Process

Knowledge Discovery Process

of a few steps leading from raw data collections to some

Knowledge Discovery Process

techniques are applied to extract patterns potentially

WHAT TASKS CAN DATA MINING

Classify the income brackets of persons not currently

in the database, based on other characteristics

variables and the (already classified) target variable i.e

For example, older females may be associated with the high-

income bracket. This data set is called the training set.

Second: The algorithm would look at new records, for

which no information about income bracket is available.

classified in the high-income bracket.

Which drug should be prescribed for a young patient with a high

Other Classification examples

the target variable is numerical rather than

chosen family of four will spend for back-to-school

that for prediction, the results lie in the future regardless

Predicting the price of a stock three months into the future

or cases into classes of similar objects.

demographic profile of each of the geographic areas in the country, as

Beverly Hills, CA 90210's most common PRIZM NE

Money & Brains

Movers & Shakers

financial behavior into benign and suspicious

The association task for data mining is the job of

that of the 1000 customers shopping on a Thursday

cell phone plan that respond positively to an offer of a

Money & Brains

Movers & Shakers

class: a wealthy suburban world of dual-income couples who are

Upper Crust Wealthy

S-ar putea să vă placă și