17 views · 5 pages · Aug 08, 2011

© Attribution Non-Commercial (BY-NC)

Data Mining Methods and Models

Copyright by Daniel T. Larose, Ph.D. Chapter Summary and Keywords

Preface

The preface begins by discussing why Data Mining Methods and Models is needed. Because of the powerful data mining software platforms currently available, a strong caveat is given against glib application of data mining methods and techniques. In other words, data mining is easy to do badly. The best way to avoid the costly errors that stem from a blind black-box approach to data mining is to apply a white-box methodology instead, one that emphasizes an understanding of the algorithmic and statistical model structures underlying the software.

Data Mining Methods and Models applies this white-box approach by (1) walking the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm; (2) providing examples of the application of the various algorithms on actual large data sets; (3) supplying chapter exercises, which allow readers to assess their depth of understanding of the material, as well as have a little fun playing with numbers and data; and (4) providing hands-on analysis problems, which give the reader an opportunity to apply newly acquired data mining expertise to solving real problems using large data sets.

Data mining is presented as a well-structured standard process, namely the Cross-Industry Standard Process for Data Mining (CRISP-DM). A graphical approach to data analysis is emphasized, stressing in particular exploratory data analysis.

Data Mining Methods and Models naturally fits the role of textbook for an introductory course in data mining. Instructors may appreciate (1) the presentation of data mining as a process, (2) the white-box approach, emphasizing an understanding of the underlying algorithmic structures, (3) the graphical approach, emphasizing exploratory data analysis, and (4) the logical presentation, flowing naturally from the CRISP-DM standard process and the set of data mining tasks. Particularly useful for the instructor is the companion website, which provides ancillary materials for teaching a course using Data Mining Methods and Models, including PowerPoint presentations, answer keys, and sample projects. The book is appropriate for advanced undergraduate or graduate-level courses. No computer programming or database expertise is required.

The software used in the book includes Clementine, Minitab, SPSS, and WEKA. Free trial versions of Minitab and SPSS are available for download from their company websites. WEKA is open-source data mining software freely available for download.

Key Words: Algorithm walk-throughs, hands-on analysis problems, chapter exercises, white-box approach, data mining as a process, graphical and exploratory approach, companion website, Clementine, Minitab, SPSS, WEKA.

Chapter 1: Dimension Reduction Methods

Chapter One begins with an assessment of the need for dimension reduction in data mining. Principal components analysis is demonstrated in the context of a real-world example using the Houses data set. Various criteria are compared for determining how many components should be extracted. Emphasis is given to profiling the principal components for the end user, along with the importance of validating the principal components using the usual hold-out methods in data mining. Next, factor analysis is introduced and demonstrated using the real-world Adult data set. The need for factor rotation, which clarifies the definition of the factors, is discussed. Finally, user-defined composites are briefly discussed, using an example.

Key Words: Principal components, factor analysis, communality, variation, scree plot, eigenvalues, component weights, factor loadings, factor rotation, user-defined composite.

Chapter 2: Regression Modeling

Chapter Two begins by using an example to introduce simple linear regression and the concept of least squares. The usefulness of the regression is then measured by the coefficient of determination r², and the typical prediction error is estimated using the standard error of the estimate s. The correlation coefficient r is discussed, along with the ANOVA table for succinct display of results. Outliers, high leverage points, and influential observations are discussed in detail. Moving from descriptive methods to inference, the regression model is introduced.
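
As an illustration of the descriptive quantities named so far in Chapter 2, the following minimal Python sketch fits a least-squares line and computes r² and s. The function name and the five data points are made up for illustration; they are not taken from the book or from its Baseball and California data sets.

```python
import math


def simple_linear_regression(x, y):
    """Fit y = b0 + b1 * x by least squares.

    Returns the intercept b0, the slope b1, the coefficient of
    determination r2, and the standard error of the estimate s.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx           # least-squares slope
    b0 = y_bar - b1 * x_bar  # least-squares intercept

    # SSE: variation left unexplained; SST: total variation in y.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)

    r2 = 1.0 - sse / sst          # share of variation explained
    s = math.sqrt(sse / (n - 2))  # typical prediction error
    return b0, b1, r2, s


# Hypothetical data, invented for this sketch only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r2, s = simple_linear_regression(x, y)
print(f"y-hat = {b0:.3f} + {b1:.3f} x, r^2 = {r2:.4f}, s = {s:.4f}")
```

The divisor n - 2 in s reflects the two estimated coefficients; r² close to 1 and a small s both indicate that the fitted line describes the data well.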
The t-test for the relationship between x and y is shown, along with the confidence interval for the slope of the regression line, the confidence interval for the mean value of y given x, and the prediction interval for a randomly chosen value of y given x. Methods are shown for verifying the assumptions underlying the regression model. Detailed examples are provided using the Baseball and California data sets. Finally, methods of applying transformations to achieve linearity are provided.

Key Words: Simple linear regression, least squares, prediction error, outlier, high leverage point, influential observation, confidence interval, prediction interval, transformations.

Chapter 3: Multiple Regression and Model Building

Multiple regression, where more than one predictor variable is used to estimate a response variable, is introduced by way of an example. To allow for inference, the multiple regression model is defined, with both model and inferential methods representing extensions of the simple linear regression case. Next, regression with categorical predictors (indicator variables) is explained. The problems of multicollinearity are examined; multicollinearity produces an unstable response surface when predictors are too highly correlated. The variance inflation factor is defined as an aid in identifying multicollinear predictors. Variable selection methods are then provided, including forward selection, backward elimination, stepwise, and best-subsets regression. Mallows' C_p statistic is defined as an aid in variable selection. Finally, methods for using the principal components as predictors in multiple regression are discussed.

Key Words: Categorical predictors, indicator variables, multicollinearity, variance inflation factor, model selection methods, forward selection, backward elimination, stepwise regression, best-subsets.

Chapter 4: Logistic Regression

Logistic regression is introduced by way of a simple example for predicting the presence of disease based on age. The maximum likelihood estimation methods for logistic regression are outlined. Emphasis is placed on interpreting logistic regression output. Inference within the framework of the logistic regression model is discussed, including determining whether the predictors are significant. Methods for interpreting the logistic regression model are examined, covering dichotomous, polychotomous, and continuous predictors. The assumption of linearity is discussed, as well as methods for tackling the zero-cell problem. We then turn to multiple logistic regression, where more than one predictor is used to classify a response. Methods are discussed for introducing higher-order terms to handle nonlinearity. As usual, the logistic regression model must be validated.
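
To make the maximum likelihood idea in Chapter 4 concrete, here is a minimal Python sketch that fits a one-predictor logistic regression by Newton-Raphson iteration. It is one common way to maximize the log-likelihood and is not claimed to be the procedure the book or its software uses. The names `fit_logistic`, `ages`, and `disease`, and all data values, are invented for illustration.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Maximum likelihood fit of P(y = 1 | x) = sigmoid(b0 + b1 * x).

    Uses Newton-Raphson steps on the log-likelihood, starting from zero.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system for the Newton step and update.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical ages and disease indicators (1 = disease present).
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
disease = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(ages, disease)
print(f"logit(p) = {b0:.3f} + {b1:.4f} * age")
print(f"odds ratio per year of age: {math.exp(b1):.3f}")
print(f"estimated P(disease | age 50): {sigmoid(b0 + b1 * 50):.3f}")
```

The exponentiated slope is the odds ratio for a one-unit increase in the predictor, which is the usual starting point for interpreting logistic regression output with a continuous predictor.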
Finally, the application of logistic regression using the freely available software WEKA is demonstrated, using a small example.

Key Words: Maximum likelihood estimation, categorical response, classification, the zero-cell problem, multiple logistic regression, WEKA.

Chapter 5: Naive Bayes and Bayesian Networks

Chapter Five begins by contrasting the Bayesian approach with the usual (frequentist) approach to probability. The maximum a posteriori (MAP) classification, which is used to select the preferred response classification, is defined. Odds ratios are discussed, including the posterior odds ratio. The importance of balancing the data is discussed. Naive Bayes classification is derived, using a simplifying assumption that greatly reduces the search space. Methods for handling numeric predictors for Naive Bayes classification are demonstrated. An example of using WEKA for Naive Bayes is provided. Then, Bayesian belief networks (Bayes nets) are introduced and defined.
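
The Naive Bayes MAP classification described in this chapter can be sketched in a few lines of Python. The sketch assumes categorical predictors and plain relative-frequency estimates, and it relies on the simplifying assumption that predictors are conditionally independent given the class. The function names, predictor values, and eight training records are made up for illustration and do not come from the book.

```python
from collections import Counter


def train_naive_bayes(records, labels):
    """Count class frequencies and (predictor, value, class) frequencies."""
    class_counts = Counter(labels)
    cond_counts = Counter()
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            cond_counts[(j, value, label)] += 1
    return class_counts, cond_counts


def map_classify(record, class_counts, cond_counts):
    """Return the MAP class and the unnormalized posterior of each class.

    Score for class c = P(c) * product over predictors of P(x_j | c),
    which is proportional to the posterior P(c | x).
    """
    n = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / n  # prior probability of class c
        for j, value in enumerate(record):
            score *= cond_counts[(j, value, c)] / n_c  # P(x_j = value | c)
        scores[c] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical training data: (age group, income level) -> response.
records = [
    ("young", "high"), ("young", "high"), ("old", "high"), ("young", "low"),
    ("young", "low"), ("old", "low"), ("old", "high"), ("old", "low"),
]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

class_counts, cond_counts = train_naive_bayes(records, labels)
best, scores = map_classify(("young", "high"), class_counts, cond_counts)
print(best, scores)  # yes {'yes': 0.28125, 'no': 0.03125}
print("posterior odds ratio (yes : no):", scores["yes"] / scores["no"])  # 9.0
```

Because both scores share the same normalizing constant, their ratio is the posterior odds ratio, so the MAP class can be chosen without ever computing that constant.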

Methods for using the Bayesian network to find probabilities are discussed. Finally, an example of using Bayes nets in WEKA is provided.

Key Words: Bayesian approach, maximum a posteriori classification, odds ratio, posterior odds ratio, balancing the data, Naive Bayes classification, Bayesian belief networks, WEKA.

Chapter 6: Genetic Algorithms

Chapter Six begins by introducing genetic algorithms by way of analogy with the biological processes at work in the evolution of organisms. The basic framework of a genetic algorithm is provided, including the three basic operators: selection, crossover, and mutation. A simple example of a genetic algorithm at work is examined, with each step explained and demonstrated. Next, modifications and enhancements from the literature are discussed, especially for the selection and crossover operators. Genetic algorithms for real-valued variables are discussed. The use of genetic algorithms as optimizers within a neural network is demonstrated, where the genetic algorithm replaces the usual backpropagation algorithm. Finally, an example of the use of WEKA for genetic algorithms is provided.

Key Words: Selection, crossover, mutation, optimization, global optimum, selection pressure, crowding, fitness, WEKA.

Chapter 7: Case Study: Modeling Response to Direct Mail Marketing

The case study begins with an overview of the cross-industry standard process for data mining, CRISP-DM. For the business understanding phase, the direct mail marketing response problem is defined, with particular emphasis on the construction of an accurate cost/benefit table, which will be used to assess the usefulness of all later models. In the data understanding and data preparation phases, the Clothing Store data set is explored. Transformations to achieve normality or symmetry are applied, as are standardization and the construction of flag variables. Useful new variables are derived. The relationships between the predictors and the response are explored, and the correlation structure among the predictors is investigated.

Next comes the modeling phase. Here, two principal components are derived, using principal components analysis. Clustering analysis is performed, using the BIRCH clustering algorithm. Emphasis is laid on the effects of balancing (and over-balancing) the training data set. The baseline model performance is established. Two sets of models are examined: Collection A, which uses the principal components, and Collection B, which does not. The technique of using over-balancing as a surrogate for misclassification costs is applied. The method of combining models via voting is demonstrated, as is the method of combining models using the mean response probabilities.

Key Words: CRISP-DM standard process for data mining, BIRCH clustering algorithm, over-balancing, misclassification costs, cost/benefit analysis, model combination, voting, mean response probabilities.
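
The two model-combination methods named in Chapter 7, voting and mean response probabilities, can be sketched as follows. This is a minimal illustration under stated assumptions: three hypothetical models, made-up response probabilities for four customers, and a 0.5 cutoff. None of these values comes from the Clothing Store case study, and the function names are invented.

```python
def combine_by_voting(predictions, min_votes):
    """Flag a record as a responder when at least min_votes models say 1.

    predictions holds one list of 0/1 labels per model, all in the same
    record order.
    """
    return [1 if sum(votes) >= min_votes else 0 for votes in zip(*predictions)]


def combine_by_mean_probability(probabilities, threshold=0.5):
    """Average the models' response probabilities, then apply a cutoff."""
    n_models = len(probabilities)
    means = [sum(ps) / n_models for ps in zip(*probabilities)]
    labels = [1 if m >= threshold else 0 for m in means]
    return labels, means


# Hypothetical response probabilities from three models for four customers.
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.3, 0.4, 0.7]
probabilities = [model_a, model_b, model_c]

# Each model's own 0/1 prediction at the assumed 0.5 cutoff.
predictions = [[1 if p >= 0.5 else 0 for p in model] for model in probabilities]

majority = combine_by_voting(predictions, min_votes=2)
unanimous = combine_by_voting(predictions, min_votes=3)
mean_labels, mean_probs = combine_by_mean_probability(probabilities)

print("majority vote: ", majority)     # [1, 0, 0, 1]
print("unanimous vote:", unanimous)    # [1, 0, 0, 0]
print("mean-probability labels:", mean_labels)  # [1, 0, 0, 1]
```

Raising `min_votes` makes the combined model more conservative about flagging responders, while averaging probabilities lets a confident model outweigh two lukewarm ones. Which trade-off is preferable depends on the cost/benefit table defined in the business understanding phase.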
