4 Using Data Mining in Your IT Systems

Using Data Mining in Your IT Systems
Rafal to edit Master subtitle style Click Lukawiecki Strategic Consultant, Project Botticelli Ltd rafal@projectbotticelli.co.uk
Objectives

Solve common business and IT scenarios Understand how to use BIDS
See it in action (about 70% afternoon for demos)
Solve DM problems by choosing and parametrising correct DM algorithms
This seminar is partly based on Data Mining book by ZhaoHui Tang and Jamie MacLennan, and also on Jamies presentations. Thank you to Jamie and to Donald Farmer for helping me in preparing this session. Thank you to Roni Karassik for a slide. Thank you to Mike Tsalidis, Olga Londer, and Marin Bezic for all the support. Thank you to Maciej Pilecki for assistance with demos.
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation. 2007 Project Botticelli Ltd & Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.
Agenda
Overview of Techniques Scenarios:

Customer Segmentation and Classification Analysing Sales Profitability and Risk Customer Needs Analysis Forecasting Other Scenarios
The Techniques
Microsoft DM Algorithms
Designed for wide use

Auto tuning and parameterisation They just work with little effort on your side
Consistent and simple interface Why Microsoft xxx?

There are few truly standard algorithms yet Each DM vendor has their own significant variations Microsoft invented some of the techniques
E.g. Use of trees for regression or nested cases
You can add 3rd party and your own algorithms easily
Algorithm
Data Mining Algorithms Description

Finds the odds of an outcome based on values in a training set Identifies relationships between cases Classifies cases into distinctive groups based on any attribute sets
Decision Trees Association Rules Clustering
Nave Bayes Clearly shows the differences in a particular variable for various
data elements
Sequence Clustering
Groups or clusters data based on a sequence of previous events
Time Series Analyzes and forecasts time-based data combining the power of
ARTXP (developed by Microsoft Research) for short-term predictions with ARIMA (in SQL 2008) for long-term accuracy.
Neural Nets Seeks to uncover non-intuitive relationships in data Linear Regression Logistic Regression
Determines the relationship between columns in order to predict an outcome Determines the relationship between columns in order to evaluate the probability that a column will contain a specific state
6
Algorithm Matrix
Associatio n Rules Clustering Decision Trees Linear Regression Logistic Regression Nave Bayes Neural Nets Sequence Clustering Time Series
ti ica if ss Cla on
i io o tat iat ati en oc gm tim ss A Es Se n n on
ed nc ti s va ca n e Ad ta tio xt ysis a or F ra D Te al g plo n n A Ex
Scenario 1: Customer Classification & Segmentation
Who are our customers? Are there any relationships between their demographics and their interest in buying from us? Who should we concentrate on more?
Lets Get Familiar with BIDS

Business Intelligence Development Studio
Offline and online modes Everything you do stays on the server Offline requires server admin privileges to deploy Process: 1. Define Data Sources and Data Source Views 2. Define Mining Structure and Models 3. Train (process) the Structures 4. Verify accuracy 5. Explore and visualise 6. Perform predictions 7. Deploy for other users 8. Regularly update and re-validate the Model
Demo
Using BIDS to Prepare for Data Mining Click to edit Master subtitle style
1.
10
Data Mining Designer

1.
2. 3. 4. 5.
Build a Mining Structure and its first Mining Model Train (process) model Validate model in the Accuracy Chart tab Explore and visualise Perform predictions
11
Microsoft Decision Trees
Use for:
Classification: churn and risk analysis Regression: predict profit or income Association analysis based on multiple predictable variable
Builds one tree for each predictable attribute Fast

12
Decision Tree Parameters

COMPLEXITY_PENALTY FORCE_REGRESSOR MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MINIMUM_SUPPORT SCORE_METHOD SPLIT_METHOD
13
Demo
Building a Data Mining Model for Click to edit Master subtitle styleCustomer Classification
1. 2.
Using Microsoft Decision Trees Exploring the Decision Tree
14
Microsoft Nave Bayes
Use for:

Classification Association with multiple predictable attributes
Assumes all inputs are independent Simple classification technique based on conditional probability
15
Nave Bayes Parameters

MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MAXIMUM_STATES MINIMUM_DEPENDENCY_PROBABILITY
16
Clustering
Applied to
Segmentation: Customer grouping, Mailing campaign Also: classification and regression Anomaly detection
Discrete and continuous Note:
Predict Only attributes not used for clustering
17
Clustering
18
Clustering
Anomaly Detection
Ag e
Daughte r Paren t
So n
Mal e
Femal e
19
Clustering Parameters

CLUSTER_COUNT CLUSTER_SEED CLUSTERING_METHOD MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_STATES MINIMUM_SUPPORT MODELLING_CARDINALITY SAMPLE_SIZE STOPPING_TOLERANCE
20
Neural Network
Applied to

Classification Regression
Hidden Layers
Output Layer
Loyalt y
Great for finding complicated relationship among attributes
Difficult to interpret results
Input Layer Ag e Educatio n Se x Incom e
Gradient Descent method
21
Neural Network Parameters

HIDDEN_NODE_RATIO HOLDOUT_PERCENTAGE HOLDOUT_SEED MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MAXIMUM_STATES SAMPLE_SIZE
22
Demo
Expanding Customer Classification and Segmentation Click to Microsoft Clustering, Nave Bayes, and Neural with edit Master subtitle style Networks 2. Exploring and Visualising Patters Found with the Above
1.
23
Validating Results
Accuracy Viewer tabs run a full prediction against the holdout data Results compared to known holdout values and visualised:

Classification Matrix tedious but precise Lift Charts show how your model compares to a random uneducated guess

Compare multiple algorithm results Two types of chart: generic and specific to a predicted value (e.g. [Total of Car Purchases] = 2) Not a real profit prediction, just a name
Profit Chart is a simple variation of a Lift Chart
24
Demo
Verifying Master subtitle style Click to edit Results Using Classification Matrix
1. 2.
Validating Model Accuracy Using Two Types of Lift Charts
25
Improving Models
Approaches:

Change the algorithm Change model parameters Change inputs/outputs to avoid bad correlations Clean the data set Verify statistics (Data Explorer)
Perhaps there are no good patterns in data
26
Demo
Improving Clustering Results by Parameterisation Click to edit Master subtitle style
1. 2.
Re-validating Customer Classification Models
27
What makes some of our products more successful? Why is a model or a brand favoured by some groups of clients? Can we recommend additional products on our web site automatically without annoying them?
Scenario 2: Analysing Sales
28
First of All, Use:
Decision Trees
Especially with Nested Cases
This subtly changes DT so it finds associations
Clustering Naive Bayes Neural Networks And...
29
Association Rules
Use for:
Market basket analysis Cross selling and recommendation s Advanced data exploration
Finds frequent itemsets and rules Sensitive to parameters

30
Association Rules Parameters

MINIMUM_SUPPORT MINIMUM_PROBABILITY MINIMUM_IMPORTANCE MINIMUM_ITEMSET_SIZE MAXIMUM_ITEMSET_COUNT MAXIMUM_ITEMSET_SIZE MAXIMUM_SUPPORT
31
Demo
Analysing Customer Needs style Click to edit Master subtitle with Decision Trees and No
1. 2. 3.
Nesting... ... and Decision Trees with Nested Cases Using Association Rules to Find Shopping Preferences
32
Who are our most profitable customers? Can I predict profit of a future customer based on demographics? Should I give them a Platinum Elite card now?
Scenario 3: Profitability and Risk

33
Profitability and Risk
Finding what makes a customer profitable is another example of classification Typically solved with:

Decision Trees (Regression), Linear Regression, and Neural Networks or Logistic Regression Important to predict probability of the predicted, or expected profit Logistic Regression and Neural Networks
Often used for prediction
Risk scoring
34
Functions
DMX functions can be used to build richer prediction expressions Predict statistical measures:

PredictProbability PredictHistogram
Essential to use when predicting any values, in particular profit or risk
35
PredictProbability
PredictProbability(LoanStatu Probability of the most s) likely outcome PredictProbability(LoanStatu Probability that the loan s, will become very Defaulted) unhealthy
Similar for PredictAdjustedProbability, etc.
36
Demo
Analyzing and Predicting Lending Risk with a Named Query 2. Analyzing Profitability Using Multiple Algorithms Click to edit Master subtitle style 3. Performing Predictions in BIDS 4. Predicting in Excel Using Previously Deployed Models and Data Mining Tab
1.
37
Cross-Validating Results: Reliability

SQL Server 2008
X iterations of retraining and retesting the model Results from each test statistically collated Model deemed accurate (and perhaps reliable) when variance is low and results meet expectations
38
Demo
Cross-validation of Models style Click to edit Master subtitle Reliability
1.
39
How do they behave? What are they likely to do once they bought that really expensive car? Should I intervene?
Scenario 4: Customer Needs Analysis

40
What is a Sequence?
To discover the most likely beginning, paths, and ends of a customers journey through our domain consider using:

Association Rules Sequence Clustering
41
Sequence Clustering
Analysis of:

Customer behaviour Transaction patterns Click stream Customer segmentation Sequence prediction
Mix of clustering and sequence technologies
Groups individuals based on their profiles including sequence data
42
Sequence Data
Car Purchases Cust ID 1 Age 35 Marital Status Seq ID M 1 2 3 2 20 S 1 2 3 3 57 M 1 2 Brand Porch-A Bamborgini Kexus Wagen Voovo Voovo Voovo T-Yota
43
Sequence Clustering Parameters

CLUSTER_COUNT MAXIMUM_SEQUENCE_STATES MAXIMUM_STATES MINIMUM_SUPPORT
44
Demo
Analysing Customer Transaction Click to edit Master subtitle styleBehaviour Using
1. 2.
Sequence Clustering Analysing Events Leading to Customer Loss using Sequence Clustering
45
What are my sales going to be like in the next few months? Will I have credit problems? Will my server need an upgrade in the next 3 months?
Scenario 5: Forecasting
46
Estimating Future
Forecasting
But: data often is very seasonal
Seasonality detected by a Fast Fourier Transform SQL Server 2005 uses ARTXP (Auto Regressive Tree with Cross Prediction)
Time Series
For short term forecasting
SQL Server 2008 uses a hybrid of improved ARTXP and industry-standard ARIMA (Auto Regressive Integrated Moving Average)
Great for short and long-term forecasting
47
Time Series
Uses:

Forecast sales Inventory prediction Web hits prediction Stock value estimation
Regression tree technology to describe and predict series values
Tree allows multiple regressors

48
Auto-Regression
Month Milk Jan Feb Mar Apr May Jun July
49
Bread 80 90 85 110 120 123 150
Case Id 1 2 3 4 5
100 120 110 115 125 120 140
Milk (t-2) 100 120 110 115 125
Milk (t-1) 120 110 115 125 120
Milk (t0) 110 115 125 120 140
Bread Bread Bread (t-2) (t-1) (t0) 80 90 85 110 120 90 85 110 120 123 85 110 120 123 150
Regression Tree
All
Bread(t-2) <=110
Bread(t-2) >110
Milk(t1) <=120
Milk = 3.02 + 0.72*Bread(t-1) +0.31*Milk(t-1)
Milk(t1) >120
50
Input Data
Month Jan Feb Mar Apr May Jun July Format A Milk 100 120 110 115 125 120 140 Bread 80 90 85 110 120 123 150 Month Jan Jan Feb Feb Mar Mar Apr Format B Product Milk Bread Milk Bread Milk Bread Milk Sales 100 80 120 90 110 85 115
51
Time Series Parameters

AUTO_DETECT_PERIODICITY COMPLEXITY_PENALTY HISTORIC_MODEL_COUNT HISTORIC_MODEL_GAP MAXIMUM_SERIES_VALUE MINIMUM_SERIES_VALUE MINIMUM_SUPPORT MISSING_VALUE_SUBSITUTION PERIODICITY_HINT
52
Demo
Click to edit Master Using a Custom Model and BIDS 1. Forecasting Sales subtitle style
2.
Forecasting Sales Using Table Analysis Tools in Excel
53
Performance Monitoring
Issue:
What is causing my servers to suffer? Is there a pattern to a failure that keeps repeating?
Suggested Solution:
1.
2.
Time Series of a Performance Counter from a log, averaged and normalised Sequence Clustering for events that occur in the application log for each transaction
54
Demo
Click to edit Master Disk Utilisation Needs Using Time 1. Predicting Server subtitle style
Series
55
Other Scenarios
56
Data Improvement During ETL
Issue:
Inconsistent or missing data during ExtractTransform-Load (data warehousing).
Suggested Solution:
1.
2. 3.
Decision Tree (or Clustering, Naive Bayes) model for existing data Apply prediction in real-time as ETL is taking place Flag each row containing predicted values row with a probability measure (it is not a fact)
57
Security Threat Detection
Issue:
Finding suspicious transactions and intruder detection.
Suggested Solution:
1.
2.
. 1.
Clustering (or Neural Network) to find small groups of outliers Prediction of just one row transactional data to see if it belongs to the suspected cluster Or Sequence Clustering of clicks to detect a known attack pattern
58
Web Site and Email Feedback Analysis
Issue:
What are the main issues our customers highlight? How can I quickly spot problem submissions that need a response?
Suggested Solution:
1. 2.
3.
Text extraction and tokenization using SSIS Association Rules (or Sequence Clustering) of extracted tokens Possible prediction of a previously suggested resolution or just classification of the submission
59
Resources

Demos & newsletter: www.sqlserverdatamining.com AdventureWorksDW: www.codeplex.com Book by Jamie MacLennan and ZhaoHui Tang Data Mining with SQL Server 2005, Wiley 2005, ISBN 0471462616 Also: www.beyeblogs.com/donaldfarmer blogs.msdn.com/jamiemac www.microsoft.com/sql/technologies/dm forums.microsoft.com/MSDN/ShowForum.aspx?ForumID=81& SQL Server Books Online for all documentation Excellent seminars at www.microsoft.com/technetspotlight
60
Summary
Data Mining is a key technology for Predictive Analysis, a major trend Intuitive with great visual feedback for quality May promote you to a keeper of knowledge Discover and explore the hidden knowledge that can make you and your company more successful
61
Questions and Answers
Thank You!
62
2008 Microsoft Corporation & Project Botticelli Ltd. All rights reserved.
The information herein is for informational purposes only and represents the opinions and views of Project Botticelli and/or Rafal Lukawiecki. The material presented is not certain and may vary based on several factors. Microsoft makes no warranties, express, implied or statutory, as to the information in this presentation. 2007 Project Botticelli Ltd & Microsoft Corp. Some slides contain quotations from copyrighted materials by other authors, as individually attributed. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Project Botticelli Ltd as of the date of this presentation. Because Project Botticelli & Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft and Project Botticelli cannot guarantee the accuracy of any information provided after the date of this presentation. Project Botticelli makes no warranties, express, implied or statutory, as to the information in this presentation. E&OE.
63
How can I detect incorrect data entry without hard-coding the rules? Intelligent applications?
Bonus Scenario: Data Entry Validation

64
What is So Special?
Application behaviour evolves and follows the data mining model
Influenced by real-world events caused by your application!
We are creating a feedback loop from the application through its effects back to the application
The trick that connects the two is the discovery of new emerging patterns and old patterns disappearing the very job of DM
65
Intelligent Application
Training Data
DB data Client data Application log
Mining Model
Data To Predict
Just one row New Entry New Txion
DM Engi ne Mining Model Mining Model
DM Engi ne Predicted Data
66
Intelligent Application Steps

A Simplified View
1. 2.
3.
Prepare the database for mining Create and train the DM model on your data, consisting of both the inputs and actual outcomes Test the model. If OK... The model predicts outcomes Make application logic depend on predicted outcomes (if, case etc.) Update (and validate) the model periodically as data evolves
67
1. 2.
1.
The Intelligent Bit in Our Apps
Your if statement will test the value returned from a prediction typically, predicted probability or outcome Steps:
1.
Build a case (set of attributes) representing the transaction you are processing at the moment
E.g. Shopping basket of a customer plus their shipping info
2.
3.
Execute a SELECT ... PREDICTION JOIN on the preloaded mining model Read returned attributes, especially case probability for a some outcome
E.g. Probability > 50% that TransactionOutcome=ShippingDeliveryFailure
4. 5.
Your application has just made an intelligent decision! Remember to refresh and retest the model regularly daily?
68
Watch The Demo At...
www.microsoft.com/technetspotlight Find my session Build More Intelligent Applications Using Data Mining from Microsoft TechEd Developers 2007 in Barcelona
69

4 Using Data Mining in Your IT Systems

Încărcat de

Informații document

Descriere originală:

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

4 Using Data Mining in Your IT Systems

Încărcat de

Drepturi de autor:

Formate disponibile

Using Data Mining in Your IT Systems

Solve common business and IT scenarios Understand how to use BIDS

See it in action (about 70% afternoon for demos)

Solve DM problems by choosing and parametrising correct DM algorithms

Overview of Techniques Scenarios:

Designed for wide use

Consistent and simple interface Why Microsoft xxx?

E.g. Use of trees for regression or nested cases

Data Mining Algorithms Description

Decision Trees Association Rules Clustering

Groups or clusters data based on a sequence of previous events

i io o tat iat ati en oc gm tim ss A Es Se n n on

ed nc ti s va ca n e Ad ta tio xt ysis a or F ra D Te al g plo n n A Ex

Scenario 1: Customer Classification & Segmentation

Lets Get Familiar with BIDS

Data Mining Designer

Microsoft Decision Trees

Builds one tree for each predictable attribute Fast

Decision Tree Parameters

COMPLEXITY_PENALTY FORCE_REGRESSOR MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MINIMUM_SUPPORT SCORE_METHOD SPLIT_METHOD

Using Microsoft Decision Trees Exploring the Decision Tree

Microsoft Nave Bayes

Classification Association with multiple predictable attributes

Nave Bayes Parameters

MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MAXIMUM_STATES MINIMUM_DEPENDENCY_PROBABILITY

Discrete and continuous Note:

Predict Only attributes not used for clustering

CLUSTER_COUNT CLUSTER_SEED CLUSTERING_METHOD MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_STATES MINIMUM_SUPPORT MODELLING_CARDINALITY SAMPLE_SIZE STOPPING_TOLERANCE

Great for finding complicated relationship among attributes

Difficult to interpret results

Input Layer Ag e Educatio n Se x Incom e

Gradient Descent method

Neural Network Parameters

HIDDEN_NODE_RATIO HOLDOUT_PERCENTAGE HOLDOUT_SEED MAXIMUM_INPUT_ATTRIBUTES MAXIMUM_OUTPUT_ATTRIBUTES MAXIMUM_STATES SAMPLE_SIZE

Profit Chart is a simple variation of a Lift Chart

Validating Model Accuracy Using Two Types of Lift Charts

Perhaps there are no good patterns in data

Re-validating Customer Classification Models

Scenario 2: Analysing Sales

First of All, Use:

Especially with Nested Cases

This subtly changes DT so it finds associations

Clustering Naive Bayes Neural Networks And...

Finds frequent itemsets and rules Sensitive to parameters

Association Rules Parameters

MINIMUM_SUPPORT MINIMUM_PROBABILITY MINIMUM_IMPORTANCE MINIMUM_ITEMSET_SIZE MAXIMUM_ITEMSET_COUNT MAXIMUM_ITEMSET_SIZE MAXIMUM_SUPPORT

Scenario 3: Profitability and Risk

Profitability and Risk

Often used for prediction

Essential to use when predicting any values, in particular profit or risk

Similar for PredictAdjustedProbability, etc.

Cross-Validating Results: Reliability

Scenario 4: Customer Needs Analysis

Association Rules Sequence Clustering

Mix of clustering and sequence technologies

Groups individuals based on their profiles including sequence data

Sequence Clustering Parameters

CLUSTER_COUNT MAXIMUM_SEQUENCE_STATES MAXIMUM_STATES MINIMUM_SUPPORT

But: data often is very seasonal

For short term forecasting

Great for short and long-term forecasting

Regression tree technology to describe and predict series values

Tree allows multiple regressors