
Chapter 1-3 Study Guide

Chapter 1:
 Know the planning tasks:
o Defining & measuring the target variable
o Collecting the data
o Comparing the distributions of key variables between the modeling data set & the target population
to verify that the sample adequately represents the target population
o Defining sampling weights if necessary
o Performing data-cleaning tasks
 Types of Inputs:
o Data types:
 Numeric
 Character
o Measurement Scales:
 Categorical
 Nominal
 Ordinal
o Ordinal = discrete = polychotomous
 Interval
 Interval-scaled=continuous
 Binary
o Textual Data
 Needs to be converted into a numeric form first
 Defining the Target:
o The first step in any data mining project is to define and measure the target variable to be predicted by the model.
 Source of Modeling Data:
o Data is based on an experiment carried out by conducting a marketing campaign on a well-designed sample of customers drawn from the target population
o OR
o Data is a sample drawn from the results of a past marketing campaign and not from the target
population
 Less desirable, but it is often necessary to make do with whatever data is available
 Can make adjustments through observation weights to compensate for the lack of perfect
compatibility between the modeling sample and the target population.
 Comparability between the sample and target
o Before modeling we must verify that the sample is a good representation of the target universe
 Compare the distributions of key variables in the sample and target universe
o If the distributions of key characteristics in the sample and the target population differ, observation weights are sometimes used to correct for the bias.
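The weight adjustment above can be sketched in a few lines. This is a minimal illustration with invented region shares, not figures from any real campaign:

```python
# Sketch: observation weights that re-align the sample's distribution of one
# key categorical variable with the target population's distribution.
# The region shares below are invented illustration values.

population_share = {"urban": 0.60, "rural": 0.40}   # target population
sample_share     = {"urban": 0.75, "rural": 0.25}   # modeling sample

# Weight = population share / sample share, per category.
weights = {k: population_share[k] / sample_share[k] for k in population_share}
# urban observations get weight ~0.8, rural observations ~1.6

# Check: applying the weights reproduces the population shares.
weighted_share = {k: sample_share[k] * weights[k] for k in sample_share}
```

Each urban observation is down-weighted and each rural one up-weighted, so the weighted sample matches the target universe on this variable.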
 Pre-Processing the Data:
o Eliminate irrelevant data elements that have no effect on the target variable
o Convert the data to an appropriate measurement scale
o Eliminate variables with highly skewed distributions
o Eliminate inputs which are really target variables disguised as inputs
o Impute missing values
Chapter 2:
 Sample Nodes:
o Append
 Combines data sets created by different paths of a process flow in a project
 It stacks data sets (adding rows) rather than merging variables side by side
 For example, combining Region A and Region B data that have the same variables
o Data Partition
 Partitions data into training, validation & test data
 Training-Used to develop the model
 Validation-Used to evaluate different models created w/ training data and select the
best one
 Test- Used to independently assess the selected model
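The three-way split can be sketched as follows; the 60/30/10 ratios and the plain random shuffle are illustrative choices, not node defaults:

```python
import random

def partition(rows, train=0.6, valid=0.3, seed=42):
    """Randomly split rows into training / validation / test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (shuffled[:n_train],                    # develop the model
            shuffled[n_train:n_train + n_valid],   # compare candidate models
            shuffled[n_train + n_valid:])          # independent final assessment

train_set, valid_set, test_set = partition(list(range(100)))
```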
o File Import
 Enables you to create a data source directly from an external file such as Excel
 Can connect to any node
o Filter
 Can be used to eliminate observations with extreme values (outliers) in the variables
 Not good to use this routinely; find out the reason for the outliers first
 This can follow or precede data partition node
o Input Data
 First Node in any diagram
 Specifies the data set you want to use in the diagram
o Merge
 Used to combine different data sets within a project
 Can combine output data sets from different nodes
 Combines data as side-by-side merging
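The Append-vs-Merge distinction can be sketched with plain Python structures (the field names here are made up):

```python
# Append: data sets with the same variables are stacked row-wise.
region_a = [{"id": 1, "sales": 100}, {"id": 2, "sales": 150}]
region_b = [{"id": 3, "sales": 90}]
appended = region_a + region_b                      # 3 rows, same columns

# Merge: different variables for the same observations, joined side by side
# on a key (here, "id").
demographics = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 28}}
merged = [dict(row, **demographics[row["id"]]) for row in appended]
```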
 Time Series Nodes
o Time Series Decomposition Node
 Used to find trend and cyclical factors and seasonal factors
o Time Series Data Prep Node
 Used to convert time stamped transaction data into time series
o Time Series Exponential Smoothing Node
 Converts transaction data to a time series, applies a smoothing method, and makes forecasts for a specified time horizon beyond the sample period
o Time Series Reduction Node
 Transforms the original time series into a reduced form
o Time Series Similarity Node
 Compares time series for similarities
 Tools for Initial Data Exploration
o StatExplore:
 Chi-square stat shows the strength of the relationship between the target & each categorical
input variable.
 Can create chi-square stats for continuous variables, but you have to create categorical variables from them first: set the Interval Variables property to Yes and specify the number of bins.
 Shows a chi-square plot and variable worth plot when run.
 Variable worth-calculated from the p-values of the chi-square stats
 Variables with the highest chi-square are the most important
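The chi-square statistic behind the variable worth ranking can be computed by hand for one categorical input against a binary target. The counts below are invented:

```python
# Observed counts: rows = input levels ("A", "B"), columns = target (0, 1).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total  # under independence
        chi_square += (o - expected) ** 2 / expected
# A larger statistic means a stronger input-target relationship.
```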
o StatExplore for continuous targets:
 Set the Interval Variables property to No, and set Correlations, Pearson Correlations, and Spearman Correlations to Yes.
o MultiPlot Node:
 Shows plots of all inputs and input levels (if categorical) against the target (frequency plots)
o Graph Explore Node
 Select variables you want to plot and their roles
 Tools for Data Modification
o Drop Node
 Used to drop variables from the data set or metadata
o Replacement Node
 Can be used to filter out extreme values in a variable without losing any observations
 Can also be used to change the distribution of any variable in the sample
 It’s like filtering without losing the observation
o Impute Node
 Used for imputing missing values of inputs
o Interactive Binning Node
 Binning helps uncover complex non-linear relationships between the inputs and the target
 Binning is a method for converting an interval-scaled variable into a categorical variable
 Forms bins; once bins are formed, the Gini statistic is computed for each input.
 If an input's Gini statistic falls below the Minimum Cutoff property, the input is rejected.
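A minimal sketch of the two ideas above: equal-width binning of an interval input, plus a Gini-style impurity per bin. Enterprise Miner's exact Gini statistic may be defined differently; the data is invented:

```python
def equal_width_bins(values, n_bins=3):
    """Convert an interval-scaled variable into bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value maps into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def gini_impurity(binary_targets):
    """Impurity of a binary target within one bin: 2p(1-p)."""
    p = sum(binary_targets) / len(binary_targets)
    return 2 * p * (1 - p)

ages = [22, 25, 31, 38, 44, 47, 52, 59, 63]
bins = equal_width_bins(ages, n_bins=3)     # ages grouped into 3 categories
```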
o Principal Components Node
 Principal components are new variables constructed from a set of variables. They are linear combinations of the original variables.
 In general, a small # of principal components can capture most of the information contained
in the original inputs.
 Using principal components eliminates collinearity, since the components are uncorrelated with one another
 Target variables won’t be included in constructing the principal components
 Principal components are calculated as weighted sums of the original variables.
 Eigenvalues are equal to the statistical variance of the new components.
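The properties above (components as weighted sums, eigenvalues equal to component variances, no collinearity) can be checked numerically. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)   # deliberately collinear inputs

Xc = X - X.mean(axis=0)                          # center the inputs
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigen-decomposition

scores = Xc @ eigvecs                            # components = weighted sums
component_var = scores.var(axis=0, ddof=1)       # equals the eigenvalues
```

The components are uncorrelated, and the largest eigenvalues capture most of the total variance, which is why a few components can stand in for many correlated inputs.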
 Utility Nodes
o Control Points
o End Groups
o Ext Demo
o Metadata
o Reporter
o SAS Code
 Used to incorporate SAS procedures & external SAS code into the process flow of a project
 Can be placed anywhere in a process flow
o Score Code Export
o Start Groups
Chapter 3:
 Cluster Node:
o Used to create clusters of observations with similar characteristics
o Enables you to discover patterns in your data
o Creates the clusters from the input variables alone without reference to the target variable
 Variable Selection Node:
o Used for variable selection
o Can look at the strength of the relationship of input variables with target variable by using chi-square
or r-squared values.
o Interval targets- R-square
o Binary targets-Both r-square and chi-square
o R-square:
 First, the node rejects any variables that have an r-squared value less than the minimum r-squared
 R-square tells us the proportion of variation in the target variable explained by a single input
variable, ignoring the effect of other input variables.
 To detect non-linear relationships, the VS node creates binned variables from each interval
variable. These are called AOV16 variables. These are treated as class variables.
 VS node then performs a forward stepwise regression to evaluate the variables chosen in the
first step.
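The single-input R-square the node screens on is just the squared correlation between one input and the target, ignoring all other inputs. A sketch with invented data:

```python
def r_squared(x, y):
    """Squared correlation between one input x and the target y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

income = [30, 40, 50, 60, 70]
spend  = [12, 15, 21, 24, 28]
r2 = r_squared(income, spend)
# An input whose r2 falls below the minimum r-squared cutoff is rejected.
```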
o Chi-square:
 VS node creates a tree based on chi-square maximization
 VS node first bins interval variables and uses the binned variable rather than the original
inputs in building the tree.
 Node rejects any split with chi-square below the specified threshold
o Use when you want to make an initial selection of inputs or eliminate irrelevant inputs
o Selects variables using only the training data set
o Six Cases:
 Case 1:
 Target=continuous & interval-scaled
 Inputs= numeric & interval-scaled
o In this case, VS node calculates two measures of correlation between each
input & the target:
 Original variable & target- r-squared
 AOV16 variable & target- r-squared
o Always include the AOV16 variables because a regression node after this will
eliminate any that aren’t necessary
 Case 2:
 Target=continuous & interval-scaled
 Inputs= categorical & nominal-scaled
o R-squared is calculated using one-way ANOVA
o Have the option to use original or grouped variables
o Grouped variables- variables whose categories are collapsed or combined
 Case 3:
 Target=binary
 Inputs= numeric & interval-scaled
o Inputs can be selected by using either the r-squared or chi-square criterion
o When the target is binary, chi-square is more appropriate than r-squared, especially if your goal is to estimate a logistic regression model. However, continuous inputs have to be binned in order to calculate chi-square stats.
o You can avoid binning if you use r-squared for interval inputs.
 Case 4:
 Target=binary
 Inputs= categorical & nominal-scaled
o Can use either r-squared or chi-squared
 Case 5:
 Target=continuous & interval-scaled
 Inputs= mixed
o Discussed in more detail later
 Case 6:
 Target=binary
 Inputs= mixed
o Discussed in more detail later
 Variable Clustering Node:
o Divides the inputs in a predictive modeling data set into disjoint clusters or groups
o Disjoint- If in one cluster, cannot appear in any other cluster
o Inputs within a cluster are strongly inter-correlated, while inputs in different clusters are NOT strongly correlated with each other
o You can then estimate a predictive model by including only 1 variable from each cluster or a linear
combination of all variables in that cluster. This reduces the severity of collinearity and results in
having fewer variables to deal with in the model.
o Starts with all variables in one cluster and divides it into smaller and smaller clusters using an
algorithm
o Automatically excludes variables with the role set to target
o Selects inputs without reference to the target variable
o Use primarily to identify groups of input variables that are similar and then to select a representative
variable from each cluster or create a new variable that is a linear combination of the inputs in a
cluster.
o If going to use cluster components, set variable selection property to cluster component
o If you connect a VC node to a regression node, two variable selection procedures end up being run, which results in only the best of the best variables being included in the model.
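Selecting one representative per cluster can be sketched as picking the input most correlated with its own cluster's average. The clusters and data below are invented; the real node forms the clusters itself:

```python
def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

data = {
    "income":  [30, 40, 50, 60, 70],
    "savings": [28, 41, 49, 62, 69],   # moves with income
    "age":     [25, 30, 41, 52, 60],
    "tenure":  [24, 31, 40, 53, 61],   # moves with age
}
clusters = {"wealth": ["income", "savings"], "lifecycle": ["age", "tenure"]}

representatives = {}
for name, members in clusters.items():
    # Cluster "centroid": observation-wise average of the member variables.
    centroid = [sum(vals) / len(members)
                for vals in zip(*(data[m] for m in members))]
    representatives[name] = max(members, key=lambda m: corr(data[m], centroid))
```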
o Lift Chart:
 Measures the effectiveness of a predictive model
 Model comparison node will output a cumulative lift chart
 Formulas will be provided, but know how to calculate:
 % Response
 Cumulative % Response
 % Captured Response
 Cumulative Captured Response
 Average Response rate
 Lift
 Cumulative Lift
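The lift quantities follow directly from sorting observations by model score and comparing each group's response rate with the overall rate. A sketch with invented scores:

```python
# (score, responded?) pairs, already sorted by model score, best first.
scored = [(0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0),
          (0.40, 0), (0.30, 1), (0.20, 0), (0.15, 0), (0.10, 0)]

responses = [r for _, r in scored]
overall_rate = sum(responses) / len(responses)   # average response rate

top = responses[:5]                              # top half by score
pct_response = sum(top) / len(top)               # % response in the group
lift = pct_response / overall_rate               # group rate vs. overall rate
pct_captured = sum(top) / sum(responses)         # share of all responders found
```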
 Decision Tree node
o Can also be used for selecting important inputs
o Inputs selected by the DT node are the inputs that contribute to the segmentation of the data set into
homogenous groups.
o Inputs that are most useful in creating groups of customers called segments or leaf nodes
o Target variable plays important role here
o After partitioning is complete, each observation in the data set belongs to one and only one segment
or leaf node.
o “_NODE_” is a variable created that indicates which leaf an observation belongs to
o Inputs that provide good splits are selected and passed to the next node
 Transform Variables
o Provides a wide variety of transformations that can be applied to the inputs for improving the
precision of the predictive models
o Variety of transformation of interval-scaled variables listed above in chapter 2
o 2 types of transformations for categorical inputs:
 Group Rare levels
 Dummy Indicators
o Can place this before or after VS node
 If you have a very large # of variables in the data set, you may want to eliminate those that
have extremely low linear correlation with the target first.
o If necessary, may need an impute node before applying transformations or selecting variables.
 If the original variable is closest to Normal under the Max Normal setting, no transformation will be applied.
o Can pass more than one type of transformation to the next node by:
 Using multiple transform variables nodes and then use a merge node
 Use one transform variables node and set the interval inputs property to multiple
o Best- node selects the transformation that yields the best chi-square value for the target
o Multiple- makes several transformations for each input & passes them on to the next node and then
regression node will eliminate any that aren’t necessary
o Simple Transformations: Log, Log10, Inverse (1/X), √X, X², eˣ, range, centering (x − μₓ), and standardize ((x − μₓ)/σₓ)
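A few of the simple transformations, applied to one made-up input:

```python
import math

x = [1.0, 2.0, 4.0, 8.0, 16.0]
mean = sum(x) / len(x)
std = (sum((v - mean) ** 2 for v in x) / len(x)) ** 0.5   # population sd

log_x      = [math.log(v) for v in x]        # Log
log10_x    = [math.log10(v) for v in x]      # Log10
inverse_x  = [1 / v for v in x]              # Inverse (1/X)
centered_x = [v - mean for v in x]           # x - mu
standard_x = [(v - mean) / std for v in x]   # (x - mu) / sigma
```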