Another Look at Data Mining

Why do we mine?
What do we mine?
How do we mine?
What is Data Mining?

➤ Data mining discovers meaningful new correlations,
hidden patterns and relationships in your data
➤ Conceptual descendant of statistics
➤ Combines machine learning, statistics, and
databases
➤ Knowledge discovery: process of building
and implementing a data mining solution
CS753 Dr. Mary Ann Robbert
Data Mining Overview
➤ Knowledge Discovery in Databases, KDD
➤ No one data mining approach
➤ Each tool is viewed logically as a client application
➤ Can reside on separate machine or in separate process and access
data warehouse
➤ RDBMS or proprietary OLAP embed data mining
capabilities deeply within engines to improve efficiency
and add extensions
➤ Requires a good foundation in terms of a data warehouse

Data Mining Overview (cont’d)
➤ Common algorithmic approaches
➤ association, affinity grouping
➤ predicting, sequence-based analysis
➤ clustering
➤ classification
➤ estimation

➤ Steps are: data selection, data
transformation, data mining, result
interpretation.
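
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the sample records, the function names, and the choice of simple pair counting (an affinity-grouping style step) as the "mining" stage are all hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer id, basket of purchased items).
RAW = [
    (1, ["Bread", "milk"]),
    (2, ["bread", "Milk", "eggs"]),
    (3, ["milk"]),
    (4, ["bread", "milk "]),
    (None, ["eggs"]),  # unusable record: no customer id
]

def select(records):
    """Data selection: keep only records that have a customer id."""
    return [r for r in records if r[0] is not None]

def transform(records):
    """Data transformation: normalise item names, drop duplicates, sort."""
    return [sorted({item.strip().lower() for item in basket})
            for _, basket in records]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(basket, 2))
    return pairs

def interpret(pairs, n_baskets, min_support=0.5):
    """Result interpretation: keep pairs found in at least min_support of baskets."""
    return {pair: count / n_baskets
            for pair, count in pairs.items()
            if count / n_baskets >= min_support}

baskets = transform(select(RAW))
rules = interpret(mine(baskets), len(baskets))
print(rules)
```

Each stage is a separate function so that the selection and transformation work, which usually dominates the effort, can be changed without touching the mining step.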
Strategic Benefit of Data
Mining
➤ Direct Marketing
➤ Trend Analysis
➤ Fraud detection
➤ Forecasting in Financial Markets

Why Data Mining Now?

➤ Economics
➤ Unprecedented affordability of MIPS and MB

➤ Parallel computing
➤ Enormous amounts of data can be processed

➤ Popularity of data warehouses, data marts


➤ Relatively clean data available

Data Mining compared to
Traditional Analysis
➤ Traditional Analysis
➤ Did sales of product X increase in Nov.?
➤ Do sales of product X decrease when there is a
promotion on product Y?
➤ Data mining is result oriented
➤ What are the factors that determine sales of
product X?

Data Mining compared to
Traditional Analysis (cont’d)
➤ Traditional analysis is incremental
➤ Does billing level affect turnover?
➤ Does location affect turnover?
➤ Analyst builds model step by step

➤ Data mining is result-oriented
➤ Identify the factors and predict turnover

Steps in Data Mining
➤ Data Manipulation - can be 70-80% of data
mining effort
➤ data cleaning
➤ missing values
➤ data derivation
➤ merging data

➤ Defining a study
➤ Supervised: articulating goal, choosing dependent
variable or output and specifying data fields
➤ Unsupervised: group similar types of data or identify
exceptions
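
The data manipulation tasks listed above (cleaning, missing values, derivation, merging) can be illustrated with a short sketch. The tables, field names, and the choice of mean imputation are assumptions made for the example, not something the slides prescribe.

```python
from statistics import mean

# Hypothetical source tables; all field names are illustrative.
customers = [
    {"id": 1, "name": " Ann ", "age": 34},
    {"id": 2, "name": "BOB", "age": None},  # missing value
    {"id": 3, "name": "cara", "age": 46},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 3, "amount": 50.0},
]

def clean(rows):
    """Data cleaning: trim whitespace and normalise capitalisation."""
    return [{**r, "name": r["name"].strip().title()} for r in rows]

def fill_missing(rows, field):
    """Missing values: replace None with the mean of the known values."""
    default = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: default if r[field] is None else r[field]} for r in rows]

def derive(rows):
    """Data derivation: add a field computed from an existing one."""
    return [{**r, "age_band": "40+" if r["age"] >= 40 else "under 40"} for r in rows]

def merge(rows, order_rows):
    """Merging data: attach each customer's total order amount."""
    totals = {}
    for o in order_rows:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return [{**r, "total_spent": totals.get(r["id"], 0.0)} for r in rows]

prepared = merge(derive(fill_missing(clean(customers), "age")), orders)
for row in prepared:
    print(row)
```

In a supervised study, a field such as `total_spent` could then serve as the dependent variable, with the remaining fields as inputs.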
Steps in Data Mining (cont’d)

➤ Reading the data and building the model


➤ model summarizes large amounts of data by
accumulating indicators
(frequencies, weight, conjunctions, differentiation)
➤ Understanding the model
➤ Know the particular model

➤ Prediction
➤ Choose the best outcome based on historical data
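
As a minimal sketch of these steps, the model below accumulates outcome frequencies while reading historical records and then predicts the most frequent outcome. The records, the field meanings, and the function names are hypothetical, and a frequency table is only one of many possible model types.

```python
from collections import Counter, defaultdict

# Hypothetical historical records: (plan type, what the customer did).
HISTORY = [
    ("basic", "left"), ("basic", "left"), ("basic", "stayed"),
    ("premium", "stayed"), ("premium", "stayed"), ("premium", "left"),
    ("premium", "stayed"),
]

def build_model(history):
    """Reading the data: accumulate outcome frequencies per input value."""
    model = defaultdict(Counter)
    for value, outcome in history:
        model[value][outcome] += 1
    return model

def predict(model, value):
    """Prediction: choose the outcome seen most often for this value."""
    if value not in model:
        return None  # no historical evidence for this value
    return model[value].most_common(1)[0][0]

model = build_model(HISTORY)
print(dict(model["basic"]))      # understanding the model: inspect the counts
print(predict(model, "basic"), predict(model, "premium"))
```

Because the model is just a table of counts, it is easy to inspect, which is the point of the "understanding the model" step.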

Models

➤ Genetic Algorithms
➤ Neural Nets
➤ Agents
➤ Statistics
➤ Visualization

Genetic Algorithms

➤ Artificial intelligence system that mimics the evolutionary,
survival-of-the-fittest processes to generate increasingly
better solutions to a problem.
➤ Genetic algorithms produce several generations of solutions,
choosing the best of the current set for each new generation.
➤ Examples
➤ Generating human faces based on a few known features.
➤ Generating solutions to routing problems.
➤ Generating stock portfolios.

EVOLUTION IN GENETIC
ALGORITHMS
➤ SELECTION - or survival of the fittest. The
key is to give preference to better outcomes.
➤ CROSSOVER - combining portions of good
outcomes in the hope of creating an even
better outcome.
➤ MUTATION - randomly trying combinations
and evaluating the success (or failure) of the
outcome.
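
The three operators can be seen together in a minimal sketch. The toy objective (maximise the number of 1-bits in a bit string), the tournament selection scheme, the parameter values, and the rule of carrying the best candidate forward unchanged are all assumptions chosen to keep the example short.

```python
import random

GENES = 20             # length of each candidate solution (a bit string)
POP_SIZE = 30
GENERATIONS = 40
MUTATION_RATE = 0.02

def fitness(individual):
    """Toy objective (an assumption): the number of 1-bits."""
    return sum(individual)

def select(population, rng):
    """SELECTION: tournament of two; the fitter candidate wins."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(mum, dad, rng):
    """CROSSOVER: join the head of one parent to the tail of the other."""
    cut = rng.randint(1, GENES - 1)
    return mum[:cut] + dad[cut:]

def mutate(individual, rng):
    """MUTATION: flip each bit with a small probability."""
    return [1 - g if rng.random() < MUTATION_RATE else g for g in individual]

def evolve(seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENES)]
                  for _ in range(POP_SIZE)]
    start_best = max(map(fitness, population))
    for _ in range(GENERATIONS):
        elite = max(population, key=fitness)  # best candidate survives unchanged
        children = [
            mutate(crossover(select(population, rng), select(population, rng), rng), rng)
            for _ in range(POP_SIZE - 1)
        ]
        population = [elite] + children
    return start_best, max(map(fitness, population))

start_best, final_best = evolve()
print("best fitness at start:", start_best, "after evolution:", final_best)
```

Keeping the best candidate in each generation guarantees that the best fitness never decreases; without it, crossover and mutation could lose a good solution.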

Neural Nets
➤ Mathematical Model of the Way a Brain
Functions
➤ Machine learning approach by which historical
data can be examined for pattern recognition
➤ A neural network simulates the human ability
to classify things based on the experience of
seeing many examples.

➤ Pros - Numerical Data

➤ Cons - Opaque, Art or Science

➤ Examples
➤ Distinguishing different chemical
compounds
➤ Detecting anomalies in human tissue
that may signify disease
➤ Reading handwriting
➤ Detecting fraud in credit card use
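
Learning to classify from examples can be sketched with a single artificial neuron (a perceptron), the simplest building block of a neural network. The fraud-style training data, the two input features, and the learning rate are hypothetical; a practical network would have many neurons and far more examples.

```python
# Hypothetical training examples:
# (transaction amount in $1000s, foreign transaction? 0/1) -> fraud? 0/1
EXAMPLES = [
    ((0.1, 0), 0), ((0.3, 0), 0), ((0.2, 1), 0),
    ((2.5, 1), 1), ((3.0, 1), 1), ((2.8, 0), 1),
]

def predict(weights, bias, features):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(examples, epochs=50, rate=0.1):
    """Nudge the weights toward the correct answer after every mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

weights, bias = train(EXAMPLES)
print("learned weights:", weights, "bias:", bias)
print("large foreign transaction:", predict(weights, bias, (2.9, 1)))
print("small domestic transaction:", predict(weights, bias, (0.15, 0)))
```

The learned weights are just numbers, which illustrates the "opaque" drawback: the model classifies correctly without offering a readable explanation.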

Intelligent Agents

➤ Software entities that carry out some set of
operations on behalf of a user or program with some
degree of autonomy, and employ some knowledge
or representation of the user's goals and desires.
➤ Some common characteristics
➤ ability to communicate, cooperate and coordinate with
other agents
➤ ability to act autonomously to achieve collective goal of
system

Intelligent Agents (cont’d)

➤ Tasks
➤ automate repetitive tasks
➤ finding and filtering information
➤ summarizing complex data

➤ Capability to learn and make
recommendations
➤ Black box approach hides complexity and
allows for design of scalable system
Comparison

AI System             Problem Type                                 Based On                    Starting Information
Expert Systems        Diagnostic or prescriptive                   Strategies of experts       Expert's know-how
Neural Networks       Identification, classification, prediction   The human brain             Acceptable patterns
Genetic Algorithms    Optimal solution                             Biological evolution        Set of possible solutions
Intelligent Agents    Specific and repetitive tasks                One or more AI techniques   Your preferences

Statistics

➤ SAS, SPSS
➤ Pros - Established technology
➤ Cons - Needs assumptions, nominal
variable handling, management
acceptance?

Visualization

➤ Data visualization refers to technologies
that support visualization of information
➤ Includes: digital images, GIS, multi-dimensions,
3-D presentations, animations
➤ http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html

Data Mining is Not a Silver
Bullet
➤ It does not:
➤ Find answers to questions you don’t ask
➤ Eliminate the need for domain experience
➤ Remove the need for data analysis skills

Data Mining Software

➤ http://www.kdnuggets.com/software/
➤ http://www.attar.com/ (download)
➤ http://www.cs.bham.ac.uk/~anp/software.html (software listing)

Six Rules of Data Quality
by Ken Orr

1. Data that is not used cannot be correct for very long.
2. Data quality in an information system is a function of
its use, not its collection.
3. Data quality will ultimately be no better than its most
stringent use.
4. Data quality problems tend to become worse with the
age of the system.
5. The less likely it is that some data element will change,
the more traumatic it will be when it finally does change.
6. Information overload affects data quality.
Data Quality Software

➤ http://www.rulequest.com/gritbot-info.html

General DW Data transformation

➤ Resolve inconsistent legacy formats


➤ Strip out unwanted fields
➤ Interpret codes into text
➤ Combine data from multiple sources under
a common key
➤ Find fields used for multiple purposes and
interpret each field's value based on context
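
Several of these transformations can be sketched in a few lines. The legacy layout, the code table, the two date formats, and the second billing source are all hypothetical examples chosen for illustration.

```python
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}   # hypothetical code table
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")       # formats assumed in the legacy data

legacy_customers = [
    {"cust": "0007", "gender": "M", "joined": "03/15/1998", "filler": "xx"},
    {"cust": "12", "gender": "F", "joined": "1999-01-02", "filler": "yy"},
]
billing = {7: 120.0, 12: 80.0}                # second source, keyed by customer number

def parse_date(text):
    """Resolve inconsistent legacy formats: try each one, return ISO text."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognised format

def transform(row):
    key = int(row["cust"])                    # common key shared by both sources
    return {
        "customer_key": key,
        "gender": GENDER_CODES.get(row["gender"], "Unknown"),  # code into text
        "joined": parse_date(row["joined"]),
        "billed": billing.get(key),           # combined from the second source
        # "filler" is deliberately not copied: an unwanted field
    }

warehouse_rows = [transform(r) for r in legacy_customers]
for row in warehouse_rows:
    print(row)
```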

Data transformation for
Data Mining
➤ Flag normal, abnormal, out of bounds or
impossible facts
➤ Recognize random or noise values from
context and mask out
➤ Apply uniform treatment to NULL values
➤ Flag fact records with changed status
➤ Classify individual record by one of its
aggregates
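
Two of these steps, flagging facts and treating NULL values uniformly, can be sketched as follows. The records, the set of NULL spellings, and the numeric bounds are assumptions made for the example.

```python
# Hypothetical fact records; field names and bounds are illustrative.
RECORDS = [
    {"id": 1, "temperature": 21.5, "status": "open"},
    {"id": 2, "temperature": 480.0, "status": ""},
    {"id": 3, "temperature": None, "status": "N/A"},
    {"id": 4, "temperature": -300.0, "status": "closed"},
]

NULL_TOKENS = {None, "", "N/A", "null"}   # assumed spellings of "no value"
NORMAL_RANGE = (-40.0, 60.0)              # assumed plausible readings
ABSOLUTE_ZERO = -273.15                   # anything colder is impossible

def normalise_null(value):
    """Apply uniform treatment to NULL values: map every variant to None."""
    return None if value in NULL_TOKENS else value

def flag(temperature):
    """Flag a fact as normal, out of bounds, impossible, or missing."""
    if temperature is None:
        return "missing"
    if temperature < ABSOLUTE_ZERO:
        return "impossible"
    low, high = NORMAL_RANGE
    if not low <= temperature <= high:
        return "out of bounds"
    return "normal"

prepared = [
    {
        "id": r["id"],
        "temperature": normalise_null(r["temperature"]),
        "status": normalise_null(r["status"]),
        "flag": flag(normalise_null(r["temperature"])),
    }
    for r in RECORDS
]
for row in prepared:
    print(row)
```

Flagging rather than deleting suspect records leaves the decision to the mining step, which may itself be looking for exceptions.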
Conclusion

➤ For successful data mining:


➤ data analysis and mining goals must be
identified and formulated
➤ appropriate data must be selected, cleaned and
prepared for queries and business analysis
➤ http://www.rulequest.com/cubist-examples.html#BOSTON
➤ http://www.almaden.ibm.com/cs/quest/