

Overview
Introduction
Explanation of Data Mining Techniques
Advantages

Applications
Privacy

DATA MINING: EXTRACTING THE KNOWLEDGE FROM THE DATABASE

KDD: KDD stands for Knowledge Discovery in Databases
KDD identifies hidden correlations in the data

Data collection (1960): using computers, tapes and disks
Data access (1980): using RDBMS, SQL
Data warehouse & decision support (1990): using OLAP, DWH
Data mining (today): using advanced algorithms, microprocessors, etc.

Data Warehousing
Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
Collect data
Store in a single repository
Allows for easier query development, as a single repository can be queried

DATA WAREHOUSING:

Data warehousing supports OLAP operations
OLAP stands for Online Analytical Processing
It stores historical data
It follows a read-only access pattern
It deals with long-term operations
OLAP operations are roll-up, drill-down, slicing and dicing
It has high flexibility and a small number of users
It is subject-oriented
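The OLAP operations named above can be illustrated with a minimal Python sketch. The sales records, the dimension names and the helper functions (`aggregate`, `slice_`, `dice`) are all hypothetical and are not taken from any OLAP product; they only show what each operation does to a small fact table.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, product, amount).
SALES = [
    (2023, "Q1", "North", "Laptop", 100),
    (2023, "Q1", "South", "Laptop", 80),
    (2023, "Q2", "North", "Phone", 60),
    (2023, "Q2", "South", "Phone", 40),
    (2024, "Q1", "North", "Laptop", 120),
    (2024, "Q1", "South", "Phone", 50),
]
# Position of each dimension inside a fact tuple (an assumed layout).
DIMS = {"year": 0, "quarter": 1, "region": 2, "product": 3}


def aggregate(facts, dims):
    """Sum the amount measure, grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[DIMS[d]] for d in dims)
        totals[key] += row[4]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in facts if r[DIMS[dim]] == value]


def dice(facts, conditions):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [r for r in facts
            if all(r[DIMS[d]] in allowed for d, allowed in conditions.items())]


by_quarter = aggregate(SALES, ["year", "quarter"])  # detailed (drill-down) view
by_year = aggregate(SALES, ["year"])                # roll-up: quarter -> year
north_only = slice_(SALES, "region", "North")
laptops_2023 = dice(SALES, {"year": {2023}, "product": {"Laptop"}})
```

Going from `by_year` to `by_quarter` is the drill-down direction, and the reverse is roll-up. All four operations only read the data, which matches the read-only pattern described above.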

OLAP (On-line Analytical Processing) provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening.

1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
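The four steps listed above can be sketched on a toy data set. Everything in this example is an assumption made for illustration: the two sources (`crm`, `billing`), the field names, the choice to drop incomplete records as the cleaning rule, and min-max normalisation as the transformation.

```python
# Hypothetical records from two sources, keyed by customer id.
crm = {
    1: {"age": 25, "city": "Pune"},
    2: {"age": None, "city": "Delhi"},  # missing age
    3: {"age": 45, "city": "Pune"},
}
billing = {1: {"spend": 200.0}, 2: {"spend": 150.0}, 3: {"spend": 400.0}}

# 1. Data cleaning: drop records that contain a missing value.
cleaned = {cid: rec for cid, rec in crm.items()
           if all(v is not None for v in rec.values())}

# 2. Data integration: combine both sources under one schema.
integrated = {cid: {**rec, **billing[cid]}
              for cid, rec in cleaned.items() if cid in billing}

# 3. Data transformation: min-max normalise spend to the range [0, 1].
spends = [rec["spend"] for rec in integrated.values()]
lo, hi = min(spends), max(spends)
for rec in integrated.values():
    rec["spend_norm"] = (rec["spend"] - lo) / (hi - lo) if hi > lo else 0.0

# 4. Data reduction: keep only the attributes needed for analysis.
reduced = {cid: {"age": rec["age"], "spend_norm": rec["spend_norm"]}
           for cid, rec in integrated.items()}
```

Real pipelines usually offer several alternatives at each step (for example, filling in missing values instead of dropping the record); the sketch picks one simple option per step.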

Discovery of Knowledge

Steps:
Business Understanding: what problem are we trying to solve? What is the business trying to achieve?
Data Understanding: do we have the data to be able to answer these questions? If not, what is the cost of acquiring that additional information?
Data Preparation: all data is dirty and needs to be cleaned and transformed. This is the heavy lifting stage.
Analysis & Modeling: the tools must be chosen based on what the business is trying to understand and the data available.
Evaluate Outcomes: how well does the model actually work from a statistical point of view (significance) and from a business point of view (actionability)?
Deployment: driving the insight into the business.

Data Mining Techniques


Classification
Clustering
Regression

Association Rules

Classification: Given a set of items that have several classes, and given the past instances (training instances) with their associated class, classification is the process of predicting the class of a new item, that is, identifying the class to which the new item belongs.
Example: A bank wants to classify its Home Loan Customers into groups according to their response to bank advertisements. The bank might use the classifications Responds Rarely, Responds Sometimes, Responds Frequently. The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict the needs of potential customers.

Technique for Classification


Decision-Tree Classifiers
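[Figure: decision tree. The root node splits on Job (Engineer, Doctor, Carpenter); each branch then splits on Income (threshold labels in the figure: <30K, >50K, <40K, >90K, <50K, >100K); each leaf is a credit rating of Bad or Good.]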

Predicting credit risk of a person with the jobs specified.
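A decision tree of this shape can be written as a two-level lookup: first split on job, then on income. The sketch below is only illustrative. The exact thresholds of the original figure could not be recovered unambiguously, so the values in `TREE` are hypothetical, as is the function name `credit_risk`.

```python
# One income threshold (in thousands) per job branch.
# All thresholds are hypothetical, chosen only to show the tree's shape.
TREE = {
    "Engineer": 40,
    "Doctor": 90,
    "Carpenter": 50,
}


def credit_risk(job, income_k):
    """Walk the two-level tree: split on job, then on income."""
    if job not in TREE:
        raise ValueError(f"no branch for job {job!r}")
    threshold = TREE[job]
    return "Good" if income_k >= threshold else "Bad"
```

For example, `credit_risk("Engineer", 55)` follows the Engineer branch and returns `"Good"` because 55 is above the assumed threshold of 40. In practice the splits and thresholds are learned from training instances rather than written by hand.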

Clustering

Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased. The categories are not specified in advance, which is why this is referred to as unsupervised learning.
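As a rough illustration of the insurance example, the sketch below groups clients by age alone using k-means. The slides do not name a specific clustering algorithm, so k-means is an assumption here, and the ages, starting centroids and function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on a list of numbers, starting from given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical client ages and starting centroids.
ages = [22, 25, 27, 41, 45, 48, 66, 70, 72]
centroids, clusters = kmeans_1d(ages, centroids=[20, 40, 60])
```

No class labels are supplied: the three age groups emerge only from the distances between the values, which is what makes the method unsupervised.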

Regression
Regression deals with the prediction of a value, rather than a class.
Example: Find out if there is a relationship between smoking patients and cancer-related illness.
It can be used to smooth noisy data
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
Y = a0 + a1X1 + a2X2 + ... + anXn
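For the simplest case of a single predictor, Y = a0 + a1X1, the coefficients can be estimated by ordinary least squares. The sketch below assumes that estimation method (the slides do not name one); the data points and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a0 (intercept) and a1 (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical noisy data lying close to Y = 1 + 2*X1.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a0, a1 = fit_line(xs, ys)
predicted = a0 + a1 * 6  # predict Y for a new value X1 = 6
```

The fitted line passes near every point without matching any of them exactly, which is how regression smooths out noise in the observations.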

Regression
[Figure: example graphs showing a line of best fit and curve fitting.]

Association Rules
An association algorithm creates rules that describe how often events have occurred together. (2)
Example: When a customer buys a hammer, then 90% of the time they will buy nails.
Ex: computer => antivirus software [support = 2%, confidence = 60%]
Support(A => B) = P(A ∪ B), the probability that a transaction contains both A and B
Confidence(A => B) = P(B | A)

Association Rules
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
Support is one measure of rule interestingness.
Example: A support of 2% means that 2% of the transactions under analysis show that a computer and antivirus software are purchased together.

Association Rules
Confidence: a measure of how often the consequent is true when the antecedent is true.
It is another measure of rule interestingness.
Example: A confidence of 60% means that 60% of the customers who purchased a computer also purchased the software.
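Support and confidence can be computed directly from a list of transactions. The sketch below follows the definitions above; the baskets and the function names `support` and `confidence` are hypothetical, and the resulting percentages are for this toy data only, not the 2% and 60% of the slide example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "antivirus"},
]
s = support(baskets, {"computer", "antivirus"})       # 3 of 5 baskets
c = confidence(baskets, {"computer"}, {"antivirus"})  # 3 of the 4 computer baskets
```

Here three of the five baskets contain both items, and three of the four baskets with a computer also contain antivirus software. Note that `confidence` assumes the antecedent occurs in at least one transaction.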

ADVANTAGES:
Provides new knowledge from existing data
Public databases
Government sources
Company databases
Old data can be used to develop new knowledge
Weather forecast
Insurance
Government
Health care
New knowledge can be used to improve services or products
Improvements lead to:
Bigger profits
More efficient service

Uses of Data Mining


Sales/ Marketing
Diversify target market
Identify clients' needs to increase response rates

Risk Assessment
Identify Customers that pose high credit risk

Fraud Detection
Identify people misusing the system, e.g. people who have two Social Security Numbers

Customer Care
Identify customers likely to change providers
Identify customer needs

Financial data analysis
Manufacturing industry
Telecommunication industry
Biological data analysis
Scientific applications
Retail
Intrusion detection

What data mining has done for...


Scheduled its workforce to provide faster, more accurate answers to questions.

Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.

Applications of Data Mining


[Figure: applications of data mining. Source: IDC, 1998. (4)]

Privacy Concerns

Effective data mining requires large sources of data
To achieve a wide spectrum of data, multiple data sources are linked
Linking sources can be problematic for privacy. For example, if the following histories of a customer were linked:
Shopping History
Credit History
Bank History
Employment History
then the user's life story could be painted from the collected data

References
1. Silberschatz, Korth, Sudarshan, Database System Concepts, 5th Edition, McGraw-Hill, 2005
2. Two Crows, Data Mining Glossary, http://www.twocrows.com/glossary.htm
3. Wikipedia, Data mining, http://en.wikipedia.org/wiki/Data_mining
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf
