
# Data Mining Methods and Models

Copyright © by Daniel T. Larose, Ph.D.

## Chapter Summaries and Keywords

Algorithm walk-throughs, hands-on analysis problems, chapter exercises, a white-box approach, data mining as a process, a graphical and exploratory approach, companion website, Clementine, Minitab, SPSS, WEKA.

## Chapter 1: Dimension Reduction Methods

Chapter One begins with an assessment of the need for dimension reduction in data mining. Principal components analysis is demonstrated in the context of a real-world example using the Houses data set. Various criteria are compared for determining how many components should be extracted. Emphasis is given to profiling the principal components for the end user, along with the importance of validating the principal components using the usual hold-out methods in data mining. Next, factor analysis is introduced and demonstrated using the real-world Adult data set. The need for factor rotation, which clarifies the definition of the factors, is discussed. Finally, user-defined composites are briefly discussed, using an example.

Key Words: Principal components, factor analysis, communality, variation, scree plot, eigenvalues, component weights, factor loadings, factor rotation, user-defined composite.

## Chapter 2: Regression Modeling

Chapter Two begins by using an example to introduce simple linear regression and the concept of least squares. The usefulness of the regression is then measured by the coefficient of determination r², and the typical prediction error is estimated using the standard error of the estimate s. The correlation coefficient r is discussed, along with the ANOVA table for succinct display of results. Outliers, high leverage points, and influential observations are discussed in detail. Moving from descriptive methods to inference, the regression model is introduced.
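The least-squares fit, the coefficient of determination r², and the standard error of the estimate s described above can be sketched in a few lines. The data below are invented toy numbers, not drawn from the book's data sets:

```python
# Least-squares fit of y = b0 + b1*x, with r-squared and the
# standard error of the estimate s, from their textbook formulas.

def simple_linear_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx                       # slope
    b0 = my - b1 * mx                    # intercept
    yhat = [b0 + b1 * xi for xi in x]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - sse / sst                   # coefficient of determination
    s = (sse / (n - 2)) ** 0.5           # standard error of the estimate
    return b0, b1, r2, s

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1, r2, s = simple_linear_regression(x, y)
```

An r² near 1 and a small s together indicate that the fitted line explains most of the variation in y, which is exactly the descriptive use of these measures in the chapter.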
The t-test for the relationship between x and y is shown, along with the confidence interval for the slope of the regression line, the confidence interval for the mean value of y given x, and the prediction interval for a randomly chosen value of y given x. Methods are shown for verifying the assumptions underlying the regression model. Detailed examples are provided using the Baseball and California data sets. Finally, methods of applying transformations to achieve linearity are provided.

Key Words: Simple linear regression, least squares, prediction error, outlier, high leverage point, influential observation, confidence interval, prediction interval, transformations.

## Chapter 3: Multiple Regression and Model Building

Multiple regression, where more than one predictor variable is used to estimate a response variable, is introduced by way of an example. To allow for inference, the
multiple regression model is defined, with both the model and the inferential methods representing extensions of the simple linear regression case. Next, regression with categorical predictors (indicator variables) is explained. The problems of multicollinearity, an unstable response surface due to overly correlated predictors, are examined. The variance inflation factor is defined as an aid in identifying multicollinear predictors. Variable selection methods are then provided, including forward selection, backward elimination, stepwise regression, and best-subsets regression. Mallows' C_p statistic is defined as an aid in variable selection. Finally, methods for using the principal components as predictors in multiple regression are discussed.

Key Words: Categorical predictors, indicator variables, multicollinearity, variance inflation factor, model selection methods, forward selection, backward elimination, stepwise regression, best-subsets.

## Chapter 4: Logistic Regression

Logistic regression is introduced by way of a simple example for predicting the presence of disease based on age. The maximum likelihood estimation methods for logistic regression are outlined. Emphasis is placed on interpreting logistic regression output. Inference within the framework of the logistic regression model is discussed, including determining whether the predictors are significant. Methods for interpreting the logistic regression model are examined, including for dichotomous, polychotomous, and continuous predictors. The assumption of linearity is discussed, as are methods for tackling the zero-cell problem. We then turn to multiple logistic regression, where more than one predictor is used to classify a response. Methods are discussed for introducing higher-order terms to handle non-linearity. As usual, the logistic regression model must be validated.
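A minimal sketch of maximum likelihood estimation for a one-predictor logistic model follows. It uses plain gradient ascent on the log-likelihood for simplicity (statistical packages use the faster Newton-type iteration, but the maximizer is the same), and the age/disease records are invented for illustration:

```python
import math

# Fit P(y=1 | x) = 1 / (1 + exp(-(b0 + b1*x))) by maximizing the
# log-likelihood with plain gradient ascent.

def fit_logistic(x, y, lr=0.1, iters=5000):
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p            # dLL/db0
            g1 += (yi - p) * xi     # dLL/db1
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
mean = sum(ages) / len(ages)
sd = (sum((a - mean) ** 2 for a in ages) / len(ages)) ** 0.5
z = [(a - mean) / sd for a in ages]     # standardize age for stable steps
b0, b1 = fit_logistic(z, disease)
odds_ratio = math.exp(b1)   # change in odds per standard deviation of age
```

The exponentiated coefficient is the odds ratio, which is the key interpretive quantity the chapter emphasizes: here it gives the multiplicative change in the odds of disease per standard deviation of age.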
Finally, the application of logistic regression using the freely available software WEKA is demonstrated, using a small example.

Key Words: Maximum likelihood estimation, categorical response, classification, the zero-cell problem, multiple logistic regression, WEKA.

## Chapter 5: Naïve Bayes and Bayesian Networks

Chapter Five begins by contrasting the Bayesian approach with the usual (frequentist) approach to probability. The maximum a posteriori (MAP) classification is defined, which is used to select the preferred response classification. Odds ratios are discussed, including the posterior odds ratio. The importance of balancing the data is discussed. Naïve Bayes classification is derived, using a simplifying assumption which greatly reduces the search space. Methods for handling numeric predictors for Naïve Bayes classification are demonstrated. An example of using WEKA for Naïve Bayes is provided. Then, Bayesian belief networks (Bayes nets) are introduced and defined.
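The Naïve Bayes MAP rule just described, choosing the class that maximizes P(class) times the product of the per-predictor conditionals, can be sketched for categorical predictors as below. The training records are invented, and add-one smoothing is included as one simple guard against the zero-cell problem:

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    n = len(labels)
    prior = {c: k / n for c, k in Counter(labels).items()}   # P(class)
    cond = defaultdict(Counter)      # (feature index, class) -> value counts
    vocab = defaultdict(set)         # feature index -> distinct values seen
    for rec, c in zip(records, labels):
        for j, v in enumerate(rec):
            cond[(j, c)][v] += 1
            vocab[j].add(v)
    return prior, cond, vocab

def classify(rec, prior, cond, vocab):
    best, best_p = None, -1.0
    for c in prior:
        p = prior[c]
        for j, v in enumerate(rec):
            counts = cond[(j, c)]
            # add-one smoothing guards against the zero-cell problem
            p *= (counts[v] + 1) / (sum(counts.values()) + len(vocab[j]))
        if p > best_p:
            best, best_p = c, p
    return best                      # the MAP class

# invented records: (outlook, windy) -> play / stop
records = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"),
           ("rain", "yes"), ("sunny", "no"), ("rain", "no")]
labels = ["play", "stop", "play", "stop", "play", "play"]
prior, cond, vocab = train_nb(records, labels)
prediction = classify(("sunny", "no"), prior, cond, vocab)
```

The conditional-independence assumption is what lets each predictor contribute a single one-dimensional factor P(x_j | class) instead of requiring the full joint distribution.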
Methods for using the Bayesian network to find probabilities are discussed. Finally, an example of using Bayes nets in WEKA is provided.

Key Words: Bayesian approach, maximum a posteriori classification, odds ratio, posterior odds ratio, balancing the data, Naïve Bayes classification, Bayesian belief networks, WEKA.

## Chapter 6: Genetic Algorithms

Chapter Six begins by introducing genetic algorithms by way of analogy with the biological processes at work in the evolution of organisms. The basic framework of a genetic algorithm is provided, including the three basic operators: selection, crossover, and mutation. A simple example of a genetic algorithm at work is examined, with each step explained and demonstrated. Next, modifications and enhancements from the literature are discussed, especially for the selection and crossover operators. Genetic algorithms for real-valued variables are discussed. The use of genetic algorithms as optimizers within a neural network is demonstrated, where the genetic algorithm replaces the usual backpropagation algorithm. Finally, an example of the use of WEKA for genetic algorithms is provided.

Key Words: Selection, crossover, mutation, optimization, global optimum, selection pressure, crowding, fitness, WEKA.

## Chapter 7: Case Study: Modeling Response to Direct Mail Marketing

The case study begins with an overview of the cross-industry standard process for data mining: CRISP-DM. For the business understanding phase, the direct mail marketing response problem is defined, with particular emphasis on the construction of an accurate cost/benefit table, which will be used to assess the usefulness of all later models. In the data understanding and data preparation phases, the Clothing Store data set is explored. Transformations to achieve normality or symmetry are applied, as are standardization and the construction of flag variables. Useful new variables are derived.
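Two of the preparation steps just mentioned, z-score standardization and the construction of a flag (indicator) variable, can be sketched on an invented purchase-amount column; the field names are hypothetical, not taken from the Clothing Store data set:

```python
# z-score standardization: subtract the mean, divide by the
# standard deviation, giving a column with mean 0 and sd 1.

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

amounts = [120.0, 80.0, 200.0, 100.0, 150.0]
z = standardize(amounts)

# flag variable: 1 if the customer holds a store credit card, else 0
card = ["yes", "no", "no", "yes", "yes"]
card_flag = [1 if c == "yes" else 0 for c in card]
```

Standardization puts predictors measured on very different scales onto a common footing, which matters for the clustering and principal components steps that follow in the modeling phase.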
The relationships between the predictors and the response are explored, and the correlation structure among the predictors is investigated. Next comes the modeling phase. Here, two principal components are derived using principal components analysis. Cluster analysis is performed using the BIRCH clustering algorithm. Emphasis is placed on the effects of balancing (and over-balancing) the training data set. The baseline model performance is established. Two sets of models are examined: Collection A, which uses the principal components, and Collection B, which does not. The technique of using over-balancing as a surrogate for misclassification costs is applied. The method of combining models via voting is demonstrated, as is the method of combining models using the mean response probabilities.
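The two model-combination schemes just described can be sketched for three hypothetical classifiers scoring the same prospect; each reported probability below is invented:

```python
# Each model reports P(respond) for one customer.
probs = [0.62, 0.48, 0.55]
threshold = 0.5

# Combination by voting: each model casts a 0/1 vote at the
# threshold, and the majority decides.
votes = [1 if p >= threshold else 0 for p in probs]
vote_decision = 1 if sum(votes) > len(votes) / 2 else 0

# Combination by mean response probability: average the reported
# probabilities, then apply the threshold once.
mean_prob = sum(probs) / len(probs)
mean_decision = 1 if mean_prob >= threshold else 0
```

Voting discards how confident each model is, while the mean response probability retains that information; the two schemes can disagree when one model is extreme and the others sit just on the far side of the threshold.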
Data Mining Methods and Models
Copyright by Daniel T. Larose, Ph.D. Chapter Summary and Keywords

Key Words: CRISP-DM standard process for data mining, BIRCH clustering algorithm, over-balancing, misclassification costs, cost/benefit analysis, model combination, voting, mean response probabilities.
