Chapter 1:
Know the planning tasks:
o Defining & measuring the target variable
o Collecting the data
o Comparing the distributions of key variables between the modeling data set & the target population
to verify that the sample adequately represents the target population
o Defining sampling weights if necessary
o Performing data-cleaning tasks
Types of Inputs:
o Data types:
Numeric
Character
o Measurement Scales:
Categorical
Nominal
Ordinal (= discrete = polychotomous)
Interval
Interval-scaled = continuous
Binary
o Textual Data
Needs to be converted into a numeric form first
Defining the Target:
o First step in any data mining project is to define and measure the target variable to be predicted by
the model.
Source of Modeling Data:
o Data is based on an experiment carried out by conducting a marketing campaign on a well-designed
sample of customers drawn from the target population
o OR
o Data is a sample drawn from the results of a past marketing campaign and not from the target
population
Less desirable, but is often necessary to make do with whatever data is available
Can make adjustments through observation weights to compensate for the lack of perfect
compatibility between the modeling sample and the target population.
Comparability between the sample and target
o Before modeling we must verify that the sample is a good representation of the target universe
Compare the distributions of key variables in the sample and target universe
o If the distributions of key characteristics in the sample and the target population differ,
observation weights are sometimes used to correct for the bias.
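The weight adjustment above can be sketched in plain Python (an illustrative sketch, not SAS Enterprise Miner code; the categories and population shares below are hypothetical): each observation is weighted by the ratio of its category's population share to its sample share, so the weighted sample matches the population.

```python
from collections import Counter

def observation_weights(sample_values, population_shares):
    """Weight each observation by (population share) / (sample share) of its
    category, so weighted sample proportions match the target population."""
    n = len(sample_values)
    sample_share = {k: v / n for k, v in Counter(sample_values).items()}
    return [population_shares[v] / sample_share[v] for v in sample_values]

# Hypothetical: 'urban' is over-represented at 75% of the sample
# but makes up only 50% of the target population.
sample = ["urban", "urban", "urban", "rural"]
weights = observation_weights(sample, {"urban": 0.5, "rural": 0.5})
# urban weight = 0.5 / 0.75; rural weight = 0.5 / 0.25 = 2.0
```

After weighting, the urban share of the sample (weighted) equals the population's 50%.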
Pre-Processing the Data:
o Eliminate irrelevant data elements that have no effect on the target variable
o Convert the data to an appropriate measurement scale
o Eliminate variables with highly skewed distributions
o Eliminate inputs which are really target variables disguised as inputs
o Impute missing values
Chapter 2:
Sample Nodes:
o Append
Combines data sets created by different paths of a process flow in a project
It stacks data (rows on top of rows) instead of merging variables side by side
For example, combining Region A and Region B data sets that have the same variables
o Data Partition
Partitions data into training, validation & test data
Training-Used to develop the model
Validation-Used to evaluate different models created w/ training data and select the
best one
Test- Used to independently assess the selected model
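The three-way split can be sketched as follows (a minimal Python illustration, not Enterprise Miner code; the 60/20/20 fractions and the seed are arbitrary choices, not the node's defaults):

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=42):
    """Randomly split rows into training, validation, and test sets.
    The remaining fraction (1 - train - valid) goes to the test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed for reproducibility
    n = len(rows)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (rows[:n_train],                     # develop the model
            rows[n_train:n_train + n_valid],    # compare candidate models
            rows[n_train + n_valid:])           # independent final assessment

train_set, valid_set, test_set = partition(range(100))
# 60 training, 20 validation, 20 test observations
```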
o File Import
Enables you to create a data source directly from an external file such as Excel
Can connect to any node
o Filter
Can be used to eliminate observations with extreme values (outliers) in the variables
Should not be used routinely; investigate the reason for the outliers first
This can follow or precede data partition node
o Input Data
First Node in any diagram
Specifies the data set you want to use in the diagram
o Merge
Used to combine different data sets within a project
Can combine output data sets from different nodes
Combines data as side-by-side merging
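The append-versus-merge distinction can be illustrated with plain Python rows (hypothetical data; a sketch of the two combination styles, not the nodes' actual implementation):

```python
# Append: stack data sets that have the same variables (rows on top of rows).
region_a = [{"id": 1, "sales": 10}, {"id": 2, "sales": 20}]
region_b = [{"id": 3, "sales": 15}]
appended = region_a + region_b          # 3 rows, same columns

# Merge: combine different variables side by side on a common key.
demographics = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 28}}
merged = [{**row, **demographics[row["id"]]} for row in appended]
# each row now carries id, sales, and age
```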
Time Series Nodes
o Time Series Decomposition Node
Used to find trend and cyclical factors and seasonal factors
o Time Series Data Prep Node
Used to convert time stamped transaction data into time series
o Time Series Exponential Smoothing Node
Converts transaction data and applies smoothing method to the time series and makes
forecasts for a specified time horizon beyond the sample period
o Time Series Reduction Node
Transforms the original time series into a reduced form
o Time Series Similarity Node
Compares time series for similarities
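The exponential smoothing idea above can be sketched for the simplest case (simple exponential smoothing only; Enterprise Miner's node also offers trend and seasonal methods, which are not shown here):

```python
def simple_exp_smoothing(series, alpha=0.3, horizon=3):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*level.
    The forecast for every step beyond the sample period is the final level."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

# Hypothetical series; with alpha=0.5 the final level is 12.0
forecast = simple_exp_smoothing([10, 12, 11, 13], alpha=0.5, horizon=2)
```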
Tools for Initial Data Exploration
o StatExplore:
Chi-square stat shows the strength of the relationship between the target & each categorical
input variable.
Can create chi-square stats for continuous variables, but you have to create categorical
variables from them first. You have to set interval variables to yes and specify number of bins.
Shows a chi-square plot and variable worth plot when run.
Variable worth-calculated from the p-values of the chi-square stats
Variables with the highest chi-square are the most important
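The chi-square statistic behind the variable worth ranking can be illustrated for one categorical input against a binary target (a Python sketch with hypothetical counts; StatExplore's own computation also involves p-values, not shown here):

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts
    (rows = input categories, columns = target levels)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total   # expected count
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical 2x2 table: a larger statistic means a stronger
# association between the input and the target.
stat = chi_square([[30, 10], [10, 30]])
```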
o StatExplore for continuous targets:
Set the Interval Variables property to No, and set Correlations, Pearson Correlations, and
Spearman Correlations to Yes.
o MultiPlot Node:
Shows plots of all inputs and input levels (if categorical) against the target (frequency plots)
o Graph Explore Node
Select variables you want to plot and their roles
Tools for Data Modification
o Drop Node
Used to drop variables from the data set or metadata
o Replacement Node
Can be used to filter out extreme values in a variable without losing any observations
Can also be used to change the distribution of any variable in the sample
It’s like filtering without losing the observation
o Impute Node
Used for imputing missing values of inputs
o Interactive Binning Node
Binning helps uncover complex non-linear relationships between the inputs and the target
Binning is a method for converting an interval-scaled variable into a categorical variable
Forms bins; once bins are formed, the Gini statistic is computed for each input.
If an input's Gini statistic has a value below the Minimum Cutoff property, that input is rejected.
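Enterprise Miner's exact Gini computation is not reproduced here; as an illustration of how a binning can be scored against a binary target, the sketch below uses the reduction in Gini impurity (a closely related measure), with hypothetical bin counts:

```python
def gini_impurity(pos, neg):
    """Gini impurity of one bin: 1 - p^2 - (1-p)^2, p = responder share."""
    p = pos / (pos + neg)
    return 1 - p * p - (1 - p) ** 2

def binned_gini_gain(bins):
    """Score a binning of an input: overall impurity minus the
    count-weighted average impurity of the bins. Larger = better
    separation of the binary target."""
    tot_pos = sum(p for p, n in bins)
    tot_neg = sum(n for p, n in bins)
    total = tot_pos + tot_neg
    overall = gini_impurity(tot_pos, tot_neg)
    weighted = sum((p + n) / total * gini_impurity(p, n) for p, n in bins)
    return overall - weighted

# Hypothetical bins of one interval input: (target=1 count, target=0 count)
gain = binned_gini_gain([(40, 10), (10, 40)])
```

An input whose best binning scores below a chosen cutoff would be rejected.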
o Principal Components Node
Principal components are new variables constructed from a set of variables. They are linear
combinations of the original variables.
In general, a small # of principal components can capture most of the information contained
in the original inputs.
Principal components are mutually uncorrelated, so using them eliminates collinearity
Target variables won’t be included in constructing the principal components
Principal components are calculated as weighted sums of the original variables.
Eigenvalues are equal to the statistical variance of the new components.
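The claim that the eigenvalues equal the variances of the new components can be verified for two inputs, where the eigenvalues of the 2x2 covariance matrix have a closed form (a pure-Python sketch with made-up data, not the node's implementation):

```python
import math

def pca_2d(xs, ys):
    """Eigenvalues of the 2x2 covariance matrix of (x, y); these equal
    the variances of the two principal components."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)                     # var(x)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)                     # var(y)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)   # cov(x,y)
    d = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return (a + c) / 2 + d, (a + c) / 2 - d

lam1, lam2 = pca_2d([1, 2, 3, 4], [1, 2, 3, 4])
# Perfectly correlated inputs: the first component captures all the
# variance; the second has variance zero.
```

The eigenvalues also sum to the total variance of the inputs, which is why a few components can capture most of the information.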
Utility Nodes
o Control Points
o End Groups
o Ext Demo
o Metadata
o Reporter
o SAS Code
Used to incorporate SAS procedures & external SAS code into the process flow of a project
Can be placed anywhere in a process flow
o Score Code Export
o Start Groups
Chapter 3:
Cluster Node:
o Used to create clusters of observations with similar characteristics
o Enables you to discover patterns in your data
o Creates the clusters from the input variables alone without reference to the target variable
Variable Selection Node:
o Used for variable selection
o Can look at the strength of the relationship of input variables with target variable by using chi-square
or r-squared values.
o Interval targets- R-square
o Binary targets-Both r-square and chi-square
o R-square:
First, node rejects any variables that have a value less than the minimum r-squared
R-square tells us the proportion of variation in the target variable explained by a single input
variable, ignoring the effect of other input variables.
To detect non-linear relationships, the VS node creates binned variables from each interval
variable. These are called AOV16 variables. These are treated as class variables.
VS node then performs a forward stepwise regression to evaluate the variables chosen in the
first step.
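The first step, screening individual inputs by R-square with the target, can be sketched as follows (a Python illustration with hypothetical inputs; the AOV16 binning and the forward stepwise step are not shown):

```python
def r_squared(x, y):
    """R-square of a simple linear regression of target y on one input x:
    the squared Pearson correlation, i.e. the proportion of variation in
    y explained by x alone."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Rank hypothetical inputs by R-square with the target; inputs below a
# minimum-R-square cutoff would be rejected.
target = [1, 2, 3, 4, 5]
inputs = {"x1": [2, 4, 6, 8, 10], "x2": [5, 1, 4, 2, 3]}
scores = {name: r_squared(vals, target) for name, vals in inputs.items()}
```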
o Chi-square:
VS node creates a tree based on chi-square maximization
VS node first bins interval variables and uses the binned variable rather than the original
inputs in building the tree.
Node rejects any split with chi-square below the specified threshold
o Use when you want to make an initial selection of inputs or eliminate irrelevant inputs
o Selects variables using only the training data set
o Six Cases:
Case 1:
Target=continuous & interval-scaled
Inputs= numeric & interval-scaled
o In this case, VS node calculates two measures of correlation between each
input & the target:
Original variable & target- r-squared
AOV16 variable & target- r-squared
o Always include the AOV16 variables because a regression node after this will
eliminate any that aren’t necessary
Case 2:
Target=continuous & interval-scaled
Inputs= categorical & nominal-scaled
o R-squared is calculated using one-way ANOVA
o Have option to use original or grouped variables
o Grouped variables- variables whose categories are collapsed or combined
Case 3:
Target=binary
Inputs= numeric & interval-scaled
o Inputs can be selected by using either the r-squared or chi-square criterion
o When the target is binary, chi-square is more appropriate than r-squared, especially if
your goal is to estimate a logistic regression model. However, continuous inputs
have to be binned in order to calculate chi-square stats.
o You can avoid binning if you use r-squared for interval inputs.
Case 4:
Target=binary
Inputs= categorical & nominal-scaled
o Can use either r-squared or chi-squared
Case 5:
Target=continuous & interval-scaled
Inputs= mixed
o Discussed in more detail later
Case 6:
Target=binary
Inputs= mixed
o Discussed in more detail later
Variable Clustering Node:
o Divides the inputs in a predictive modeling data set into disjoint clusters or groups
o Disjoint- If in one cluster, cannot appear in any other cluster
o Inputs within a cluster are strongly inter-correlated, while inputs in different clusters are NOT
strongly correlated with each other
o You can then estimate a predictive model by including only 1 variable from each cluster or a linear
combination of all variables in that cluster. This reduces the severity of collinearity and results in
having fewer variables to deal with in the model.
o Starts with all variables in one cluster and divides it into smaller and smaller clusters using an
algorithm
o Automatically excludes variables with the role set to target
o Selects inputs without reference to the target variable
o Use primarily to identify groups of input variables that are similar and then to select a representative
variable from each cluster or create a new variable that is a linear combination of the inputs in a
cluster.
o If going to use cluster components, set variable selection property to cluster component
o If you connect a VC node to a Regression node, two variable selection procedures are run,
which results in only the best of the best variables being included in the model.
o Lift Chart:
Measures the effectiveness of a predictive model
Model comparison node will output a cumulative lift chart
Formulas will be provided, but know how to calculate:
% Response
Cumulative % Response
% Captured Response
Cumulative Captured Response
Average Response rate
Lift
Cumulative Lift
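The lift-chart measures listed above can be computed from per-bin counts as follows (a Python sketch with hypothetical decile counts, not SAS output; bins are assumed sorted by predicted probability, best first):

```python
def lift_table(bins):
    """bins: (n_obs, n_responders) per bin, sorted best-first by predicted
    probability. Returns one dict of lift-chart measures per bin."""
    total_obs = sum(n for n, r in bins)
    total_resp = sum(r for n, r in bins)
    avg_rate = total_resp / total_obs            # average response rate
    rows, cum_obs, cum_resp = [], 0, 0
    for n, r in bins:
        cum_obs += n
        cum_resp += r
        rows.append({
            "pct_response": r / n,
            "cum_pct_response": cum_resp / cum_obs,
            "pct_captured": r / total_resp,
            "cum_pct_captured": cum_resp / total_resp,
            "lift": (r / n) / avg_rate,
            "cum_lift": (cum_resp / cum_obs) / avg_rate,
        })
    return rows

# Hypothetical bins of 100 observations each; 40 responders overall,
# so the average response rate is 10%.
rows = lift_table([(100, 20), (100, 10), (100, 5), (100, 5)])
# Top bin: 20% response vs 10% average, so lift = 2.0; the last
# cumulative lift is always 1.0 (the whole sample).
```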
Decision Tree node
o Can also be used for selecting important inputs
o Inputs selected by the DT node are the inputs that contribute to the segmentation of the data set into
homogeneous groups.
o Inputs that are most useful in creating groups of customers called segments or leaf nodes
o Target variable plays important role here
o After partitioning is complete, each observation in the data set belongs to one and only one segment
or leaf node.
o “_NODE_” is a variable created that indicates which leaf an observation belongs to
o Inputs that provide good splits are selected and passed to the next node
Transform Variables
o Provides a wide variety of transformations that can be applied to the inputs for improving the
precision of the predictive models
o A variety of transformations of interval-scaled variables are listed above in Chapter 2
o 2 types of transformations for categorical inputs:
Group Rare levels
Dummy Indicators
o Can place this before or after VS node
If you have a very large # of variables in the data set, you may want to eliminate those that
have extremely low linear correlation with the target first.
o If necessary, may need an impute node before applying transformations or selecting variables.
If the original variable is already closest to Normal under the Max Normal setting, no
transformation is applied.
o Can pass more than one type of transformation to the next node by:
Using multiple transform variables nodes and then use a merge node
Use one transform variables node and set the interval inputs property to multiple
o Best- node selects the transformation that yields the best chi-square value for the target
o Multiple- makes several transformations for each input & passes them on to the next node and then
regression node will eliminate any that aren’t necessary
o Simple Transformations: Log, Log10, Inverse (1/X), √X, X², eˣ, Range, Centering (x − μₓ), and
Standardize ((x − μₓ) / σₓ)
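A few of these simple transformations can be applied element-wise as follows (a Python sketch, not the node's implementation; standardize uses the population standard deviation here):

```python
import math
from statistics import mean, pstdev

def transform(x, kind):
    """Apply one of the simple transformations to a list of values."""
    if kind == "log":
        return [math.log(v) for v in x]
    if kind == "inverse":
        return [1 / v for v in x]
    if kind == "square":
        return [v * v for v in x]
    if kind == "center":                       # x - mean(x)
        m = mean(x)
        return [v - m for v in x]
    if kind == "standardize":                  # (x - mean(x)) / std(x)
        m, s = mean(x), pstdev(x)
        return [(v - m) / s for v in x]
    raise ValueError(kind)

z = transform([2.0, 4.0, 6.0], "standardize")
# standardized values have mean 0 and standard deviation 1
```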