Post Graduate Program in Data Analytics
Data Science: Participant Manual
Contents
Design and Analysis of Experiment ................................................................................................................................. 13
Data and Data Collection .......................................................................................................................................................................13
Data Collection Techniques .................................................................................................................................................................13
What is Design of Experiment? ..........................................................................................................................................................13
DoE: Some Terminology ........................................................................................................................................................................14
DoE: Examples of Experiments from Daily Life ..........................................................................................................................14
Design of Experiment: Cake Baking .................................................................................................................................................14
Guidelines for Designing Experiments ...........................................................................................................................................14
Baking the Cake: Steps in Design of Experiments ......................................................................................................................14
DoE for Cake Baking: Factor Levels..................................................................................................................................................15
Strategy of Experimentation ...............................................................................................................................................................15
History of Design of Experiment .......................................................................................................................................................16
Four Eras of DoE ......................................................................................................................................................16
Some Major Players in DoE .................................................................................................................................16
Why Design of Experiment? ................................................................................................................................................................17
Building Blocks of DoE ...........................................................................................................................................................................17
One-factor-at-a-time Experiments (OFAT) ...................................................................................................................................17
Factorial Designs ......................................................................................................................................................................................18
Factorial Designs with Several Factors...........................................................................................................................................19
Factorial vs. OFAT ...................................................................................................................................................19
Example: Effect of Re and k/D on Friction Factor f .....................................................................................19
Central Composite Design ....................................................................................................................................................................22
Randomized Design and ANOVA .......................................................................................................................................................22
Some Terminology ...................................................................................................................................................................................22
ANOVA ...........................................................................................................................................................................................................24
Some Useful Quantities ..........................................................................................................................................................................25
How to Compare Sample Means? .........................................................................................................................26
Null Hypothesis .........................................................................................................................................................................................27
Results of ANOVA .....................................................................................................................................................................................29
Inferences from ANOVA ........................................................................................................................................................................29
Probability Theory................................................................................................................................................................. 30
Brief Overview: Population ..................................................................................................................................................................30
Sample ...........................................................................................................................................................................................................30
Sample Space ..............................................................................................................................................................................................30
Types of Data ..............................................................................................................................................................................................30
Random Variables ....................................................................................................................................................................................31

Events ............................................................................................................................................................................................................31
Algebra of Sets ...........................................................................................................................................................................................33
How Do We Assign a Probability to an Event? ..............................................................................................34
Probability ...................................................................................................................................................................................................34
Axioms of Probability (Kolmogorov Axioms) ..............................................................................................................................35
Conditional Probability: Motivation ................................................................................................................................................36
Multiplicative Rule and Independence ...........................................................................................................................................37
Implication of Conditional Independence .....................................................................................................................................37
Bayes Rule ...................................................................................................................................................................................................37
Bayes Rule Example: AIDS Test .........................................................................................................................................................38
Bayesian Learning ....................................................................................................................................................................................39
Application of Bayes Rule .....................................................................................................................................................................39
Random Variables ....................................................................................................................................................................................39
Probability Distributions: Discrete Random Variables ...........................................................................................................40
Cumulative Distribution Function (CDF) .......................................................................................................................................42
The Binomial Distribution ....................................................................................................................................................................42
The Hypergeometric Distribution .....................................................................................................................................................44
The Poisson Distribution ......................................................................................................................................................................45
Distribution of Continuous Random Variable .............................................................................................................................47
Cumulative Distribution Function (CDF) .......................................................................................................................................47
Triangular Distribution .........................................................................................................................................................................49
Normal Distribution ................................................................................................................................................................................50
Finding Probabilities using Normal Distribution .......................................................................................................................52
The Standard Normal Distribution (Z) ...........................................................................................................................................53
Exponential Distribution .......................................................................................................................................................................55
Introduction to Statistics ..................................................................................................................................................... 58
What is statistics? .....................................................................................................................................................................................58
Why statistics? ...........................................................................................................................................................................................58
Basic Vocabulary of Statistics .............................................................................................................................................................58
Types of Statistics .....................................................................................................................................................................................58
Defining Variables & Types ....................................................................................................................................59
Types of Variables ....................................................................................................................................................................................59
Types of Measurement Scales .............................................................................................................................................................60
Population and Sample ..........................................................................................................................................................................60
Population vs. Sample ............................................................................................................................................................................60
Distribution .................................................................................................................................................................................................61
Creating Simple Frequency Distributions .....................................................................................................................................61

The Normal Distribution .......................................................................................................................................................................62
Measure of Central Tendency .............................................................................................................................................................62
Central Tendency Measures ................................................................................................................................................................64
Measure of Dispersion ...........................................................................................................................................................................65
Mean Deviation..........................................................................................................................................................................................65
Standard Deviation ..................................................................................................................................................................................65
Comparing the Standard Deviations ................................................................................................................................................66
Choosing a Measure of Variability ....................................................................................................................................................67
Other Measures .........................................................................................................................................................................................67
Dispersion Measures ..............................................................................................................................................................................70
Outliers..........................................................................................................................................................................................................71
Types of Outliers .......................................................................................................................................................................................72
Impact of Outliers .....................................................................................................................................................................................72
Detecting Outliers ......................................................................................................................................................73
Inferential Statistics ................................................................................................................................................................................73
Confidence Interval .................................................................................................................................................................................73
Degrees of Freedom ................................................................................................................................................................................73
p-Value ..........................................................................................................................................................................73
Process of Hypothesis Testing ............................................................................................................................................................74
Chi Square Test ..........................................................................................................................................................................................75
Goodness of Fit ..........................................................................................................................................................................................75
Test of Independence .............................................................................................................................................................................75
Chi-Square Test Steps .............................................................................................................................................................................76
Chi-Square Test Example ......................................................................................................................................................................76
Selection of the Hypothesis Test ........................................................................................................................................................77
Normality Tests .........................................................................................................................................................................................77
Graphical Method: Q-Q Plot .................................................................................................................................................................78
Shapiro-Wilk’s W Test ............................................................................................................................................................................78
Anderson-Darling Test ...........................................................................................................................................78
Kolmogorov-Smirnov Test ...................................................................................................................................................................78
One-Tailed Tests .......................................................................................................................................................................................79
Two-Tailed Tests ......................................................................................................................................................................................79
Hypothesis Tests for Normal Data ....................................................................................................................................................79
t-Test vs. z-Test .........................................................................................................................................................79
z-Test .............................................................................................................................................................................................................80
One-Sample t-Test ...................................................................................................................................................80
Two-Sample t-Test ..................................................................................................................................................80

ANOVA ...........................................................................................................................................................................................................81
Homogeneity of Variance......................................................................................................................................................................81
Approach to Non-Normal Data ...........................................................................................................................................................82
Non-Normality ...........................................................................................................................................................................................82
Non-parametric Tests .............................................................................................................................................................................83
Mood’s Median Test ................................................................................................................................................................................83
Mann-Whitney Test .................................................................................................................................................................................84
Sign Test .......................................................................................................................................................................................................84
Hypothesis Tests Summary .................................................................................................................................................................84
Correlation and Regression .................................................................................................................................................................85
Uses of Correlation and Regression .................................................................................................................................................85
Correlation Coefficient ...........................................................................................................................................................................85
Regression ...................................................................................................................................................................................................86
Descriptive Statistics ..............................................................................................................................................................................87
The “Hotshot” Sales Executive ............................................................................................................................................................87
Central Tendency .....................................................................................................................................................................................88
Bull’s Eye ......................................................................................................................................................................................................89
Dispersion Measures ..............................................................................................................................................................................89
Dispersion Measures: Key Definitions ............................................................................................................................................89
Percentiles and Deciles ..........................................................................................................................................................................90
Shape of a Distribution ..........................................................................................................................................................................90
Hypothesis Testing ................................................................................................................................................................ 91
Important Basic Terms ..........................................................................................................................................................................91
Population and Sample ..........................................................................................................................................................................91
What is Hypothesis Testing? ...............................................................................................................................................................92
Hypothesis Testing Process .................................................................................................................................................................93
The Test Statistic and Critical Values ..............................................................................................................................................94
Errors in Hypothesis Testing ..............................................................................................................................................................94
Type I & II Error Relationship ............................................................................................................................................................95
6 Steps in Hypothesis Testing .............................................................................................................................................................96
Hypothesis Testing Example ...............................................................................................................................................................96
p-Value Approach to Testing ...............................................................................................................................97
The 5-Step p-Value Approach to Hypothesis Testing ...............................................................................98
p-Value Hypothesis Testing Example ..............................................................................................................98
p-Value ..........................................................................................................................................................................99
Two-Tail Tests and Confidence Intervals.......................................................................................................99
Errors in Inference................................................................................................................................................................................ 101

Controlling Type I Errors ................................................................................................................................................................... 101
Power of a Statistical Test ................................................................................................................................................................. 101
Strategy for Designing a Good Hypothesis Test ....................................................................................................................... 102
Three Ways to Determine .................................................................................................................................................................. 104
Case Study ................................................................................................................................................................................................. 106
Interpretation ......................................................................................................................................................................................... 107
Hypothesis Testing: σ Unknown ..................................................................................................................................................... 107
t Test of Hypothesis for the Mean (σ Unknown) ..................................................................................................................... 107
One-Tail Tests ......................................................................................................................................................................................... 109
Lower-Tail Tests .................................................................................................................................................................................... 109
Upper-Tail Tests .................................................................................................................................................................................... 109
Proportions .............................................................................................................................................................................................. 111
Z Test for Proportion: Number in Category of Interest ........................................................................................................ 112
p-Value Solution ..................................................................................................................................................................................... 113
Type II Error ............................................................................................................................................................................................ 114
Beta .............................................................................................................................................................................................................. 114
Calculating β and Power of the Test.............................................................................................................................................. 115
Potential Pitfalls and Ethical Considerations ............................................................................................................................ 115
Regression Analysis ............................................................................................................................................................ 116
Scatter Plot ............................................................................................................................................................................................... 116
Correlation Coefficient ........................................................................................................................................................................ 117
Features of ρ and r ................................................................................................................................................................................ 117
Significance Test for Correlation .................................................................................................................................................... 119
Introduction to Regression Analysis............................................................................................................................................. 120
Types of Regression Models ............................................................................................................................................................. 120
Population Linear Regression.......................................................................................................................................................... 121
Linear Regression Assumptions ..................................................................................................................................................... 121
Estimated Regression Model ............................................................................................................................................................ 121
Least Squares Criterion ...................................................................................................................................................................... 122
Minimization for Least Squares Criterion .................................................................................................................................. 122
The Least Squares Equation ............................................................................................................................................................. 123
Interpretation ......................................................................................................................................................................................... 123
Simple Linear Regression Example ............................................................................................................................................... 123
Sample Data for House Price Model .............................................................................................................................................. 124
Interpretation of b0 ................................................................................................................................... 125
Least Squares Regression Properties ........................................................................................................................................... 125
Explained and Unexplained Variation.......................................................................................................................................... 125

Coefficient of Determination, R² ..................................................................................................................... 126
Examples of Approximate R² Values ............................................................................................................. 126
Standard Error of Estimate ............................................................................................................................................................... 127
The Standard Deviation of the Regression Slope .................................................................................................................... 128
Comparing Standard Errors ............................................................................................................................................................. 128
Inference about the Slope: t Test .................................................................................................................................................... 129
Regression Analysis for Description ............................................................................................................................................. 130
Interval Estimates for Different Values of x .............................................................................................................................. 131
Estimation of Mean Values: Example ........................................................................................................................................... 131
Estimation of Individual Values: Example ................................................................................................................................. 132
Logistic Regression ............................................................................................................................................................. 133
Review of Linear Estimation ............................................................................................................................................................ 133
Multiple Linear Regression ............................................................................................................................................................... 133
Non-linear Estimation ......................................................................................................................................................................... 133
Logistic Regression ............................................................................................................................................................................... 134
Link Functions ........................................................................................................................................................................................ 134
Dichotomous Independent Variables ........................................................................................................................................... 134
Redefining the Dependent Variable .............................................................................................................................................. 135
Logit Function ......................................................................................................................................................................................... 137
Latent Variables ..................................................................................................................................................................................... 138
Logistic Regression ............................................................................................................................................................................... 138
Logistic Function ................................................................................................................................................................................... 141
Logit Transformation .......................................................................................................................................................................... 141
Dichotomous Predictor ....................................................................................................................................................................... 141
Example: Signs of CD and Age .......................................................................................................................................................... 142
Example: Age at 1st Pregnancy and Cervical Cancer ............................................................................................................. 143
Putting It All Together ......................................................................................................................................................................... 144
Time Series Modeling ........................................................................................................................................................ 145
The Importance of Forecasting ....................................................................................................................................................... 145
Common Approaches to Forecasting ............................................................................................................................................ 145
Time-Series Data.................................................................................................................................................................................... 145
Time Series Components ................................................................................................................................................................... 146
Trend Component ................................................................................................................................................................................. 146
Seasonal Component............................................................................................................................................................................ 146
Cyclical Component .............................................................................................................................................................................. 146
Irregular Component ........................................................................................................................................................................... 147
Smoothing Methods ............................................................................................................................................................................. 147

Three Methods for Trend-Based Forecasting ........................................................................................................................... 150
Linear Trend Forecasting .................................................................................................................................................................. 150
Nonlinear Trend Forecasting ........................................................................................................................................................... 152
Exponential Trend Model .................................................................................................................................................................. 152
Trend Model Selection Using Differences................................................................................................................................... 152
Autoregressive Modeling ................................................................................................................................................................... 152
Choosing a Forecasting Model ......................................................................................................................................................... 153
Residual Analysis ................................................................................................................................................................................... 154
Measuring Errors................................................................................................................................................................................... 154
Principle of Parsimony ........................................................................................................................................ 154
Forecasting With Seasonal Data ..................................................................................................................................................... 154
Exponential Model with Quarterly Data ..................................................................................................................................... 155
Estimating the Quarterly Model ..................................................................................................................................................... 155
Quarterly Model Example .................................................................................................................................................................. 155
Index Numbers ....................................................................................................................................................................................... 156
Index Numbers: Interpretation ....................................................................................................................................................... 157
Aggregate Price Indexes ..................................................................................................................................................................... 157
Unweighted Aggregate Price Index ............................................................................................................................................... 157
Weighted Aggregate Price Indexes ................................................................................................................................................ 158
Common Price Indexes ....................................................................................................................................................................... 158
Pitfalls in Time-Series Analysis ....................................................................................................................................................... 158
Introduction to Machine Learning ................................................................................................................................ 159
Human Body Temperature Distribution ..................................................................................................................................... 159
Threat Perception in Real-Time Security Systems.................................................................................. 159
Machine Learning .................................................................................................................................................................................. 160
What Can Machine Learning Do? ................................................................................................................................................... 160
Applications ............................................................................................................................................................................................. 160
Algorithms & Machine Learning Models ..................................................................................................................................... 161
Supervised Learning ............................................................................................................................................................................ 161
Regression: Examples .......................................................................................................................................................................... 161
Classification: Examples ..................................................................................................................................................................... 161
Classification: Applications ............................................................................................................................................................... 162
Unsupervised Learning ....................................................................................................................................................................... 163
Document Clustering and Text Mining ........................................................................................................................................ 163
Learning Associations ......................................................................................................................................................................... 163
Object Recognition ................................................................................................................................................................................ 164
Reinforcement Learning..................................................................................................................................................................... 164

Applications of Reinforcement Learning .................................................................................................................................... 164
Machine Learning and Traditional Statistics ............................................................................................................................ 164
Machine Learning Design Study ..................................................................................................................................................... 165
Distance-Based Linear Models ....................................................................................................................................... 166
Partitional Clustering .......................................................................................................................................................................... 166
What is Clustering? ............................................................................................................................................................................... 166
What is NOT Cluster Analysis? ........................................................................................................................................................ 167
Types of Clustering ............................................................................................................................................................................... 167
Notion of “Closeness” Based on Distance Metric ..................................................................................................................... 168
Metric: Mathematically Speaking ................................................................................................................................................... 168
Clustering Strategy using Distance Metric ................................................................................................................................. 169
K-means Overview ................................................................................................................................................................................ 169
Algorithm: k-means .............................................................................................................................................................................. 169
Illustration of K-means Algorithm ................................................................................................................. 170
Evaluating K-means Clusters ........................................................................................................................................................... 171
Colour Quantization Problem .......................................................................................................................................................... 171
True-Colour vs. Index-Colour Images ........................................................................................................... 171
Image Compression .............................................................................................................................................................................. 172
MATLAB Code ......................................................................................................................................................................................... 173
Choice of K ................................................................................................................................................................................................ 173
K-means: Limitations ........................................................................................................................................................................... 174
Overcoming K-means Limitations ................................................................................................................................................. 175
K-means Clustering: Epilogue .......................................................................................................................... 176
K-medoids Clustering .......................................................................................................................................................................... 176
K-medoids Algorithm .......................................................................................................................................................................... 176
Other Partitional Clustering Algorithms ..................................................................................................................................... 177
Fuzzy K-means Application: Classify Cancer Cells ................................................................................................................. 177
Possible Features................................................................................................................................................................................... 177
Classes are Nonseparable .................................................................................................................................................................. 178
Fuzzy Classifier (FCM) ........................................................................................................................................................................ 179
Other Distinctions Between Sets of Clusters ............................................................................................................................. 179
Types of Clusters ................................................................................................................................................................................... 179
Illustration of K-means Algorithm ................................................................................................................. 180
K-Nearest Neighbor .............................................................................................................................................................................. 180
Instance-Based Learning .................................................................................................................................................................... 180
Distance Between Neighbors ........................................................................................................................................................... 181
Non-Numeric Data ................................................................................................................................................................................ 184

k-NN Variations ...................................................................................................................................................................................... 184
Distance-Weighted Nearest Neighbor Algorithm ................................................................................................................... 185
How to Choose “K”? .............................................................................................................................................................................. 185
How to Find Optimal Value of “K” ? ............................................................................................................................................... 185
K-NN: Computational Complexity .................................................................................................................................................. 185
Remarks ..................................................................................................................................................................................................... 185
K-Nearest Neighbor: Synthetic Control ....................................................................................................................................... 186
Support Vector Machine ..................................................................................................................................................................... 186
Classification Tasks .............................................................................................................................................................................. 187
Problems in Classifying Data ............................................................................................................................................................ 187
Introduction: Linear Separators .................................................................................................................................................... 187
Selection of a Good Hyper-Plane .................................................................................................................................................... 188
Maximizing the Margin ....................................................................................................................................................................... 188
Classification Margin............................................................................................................................................................................ 189
Maximum Margin Classification ..................................................................................................................................................... 189
Linear SVM Mathematically .............................................................................................................................................................. 189
Solving the Optimization Problem ................................................................................................................................................. 190
The Optimization Problem Solution ............................................................................................................................................. 190
Soft Margin Classification .................................................................................................................................................................. 190
Theoretical Justification for Maximum Margins ...................................................................................................................... 191
Linear SVMs: Overview ...................................................................................................................................................................... 192
Non-linear SVMs .................................................................................................................................................................................... 192
The “Kernel Trick”................................................................................................................................................................................. 193
What Functions are Kernels? ........................................................................................................................................................... 193
Non-linear SVMs Mathematically ................................................................................................................................................... 194
Properties of SVM .................................................................................................................................................................................. 194
Weakness of SVM................................................................................................................................................................................... 194
SVM Applications ................................................................................................................................................................................... 195
SVM: Iris Dataset.................................................................................................................................................................................... 195
Decision Tree, Random Forest and Bagging .............................................................................................................. 196
What is a Decision Tree? .................................................................................................................................................................... 196
Decision Trees as Rules ...................................................................................................................................................................... 196
How to Create a Decision Tree? ...................................................................................................................................................... 196
Choosing Attributes .............................................................................................................................................................................. 197
Decision Tree Algorithms .................................................................................................................................................................. 197
Identifying the Best Attributes ........................................................................................................................................................ 198
ID3 Heuristic............................................................................................................................................................................................ 198

Entropy ...................................................................................................................................................................................................... 198
Some Intuitions ...................................................................................................................................................................................... 199
ID3 ................................................................................................................................................................................................................ 199
Information Gain.................................................................................................................................................................................... 199
Decision ..................................................................................................................................................................................................... 200
Stopping Rule .......................................................................................................................................................................................... 201
Evaluation ................................................................................................................................................................................................. 201
Continuous Attribute ........................................................................................................................................................................... 201
Pruning Trees .......................................................................................................................................................................................... 201
Subtree Replacement ........................................................................................................................................................................... 202
Strengths of ID3 algorithm ................................................................................................................................................................ 203
Problems with ID3 ................................................................................................................................................................................ 203
Problems with Decision Trees ......................................................................................................................................................... 203
Overfitting: Formal definition .......................................................................................................................................................... 203
How can we avoid overfitting? ........................................................................................................................................................ 204
Random Forests ..................................................................................................................................................................................... 204
Classification using Naïve Bayes .................................................................................................................................... 206
Background .............................................................................................................................................................................................. 206
Probability Basics .................................................................................................................................................................................. 206
Probabilistic Classification ................................................................................................................................................................ 206
Naïve Bayes .............................................................................................................................................................................................. 207
Naïve Bayes classification .................................................................................................................................................................. 208
Relevant Issues ....................................................................................................................................................................................... 210
Avoiding the zero-Probability Problem ...................................................................................................................................... 211
Naïve Bayes: Titanic Dataset ............................................................................................................................................................ 212
Artificial Neural Network ................................................................................................................................................. 213
The Biological Neuron ......................................................................................................................................................................... 213
Prehistory ................................................................................................................................................................................................. 213
Nervous Systems as Logical Circuits ............................................................................................................................................. 213
The Perceptron ....................................................................................................................................... 214
Linear Neurons ....................................................................................................................................................................................... 214
A Motivation Example ......................................................................................................................................................................... 214
Behavior of the Iterative Learning Procedure .......................................................................................................................... 215
Deriving the Delta Rule ....................................................................................................................................................................... 215
The Error Surface .................................................................................................................................................................................. 216
Online versus Batch Learning .......................................................................................................................................................... 216
Adding Biases .......................................................................................................................................................................................... 217

Transfer Functions ............................................................................................................................................................................... 217
Activation Functions ............................................................................................................................................................................ 217
Neuron Models ....................................................................................................................................................................................... 218
Gaussian Function ................................................................................................................................................................................. 219
The Key Elements of Neural Networks ........................................................................................................................................ 220
Preprocessing the Input Vectors .................................................................................................................................................... 220
Statistical and ANN Terminology ................................................................................................................................................... 220
Network Architectures ....................................................................................................................................................................... 220
Single Layer Feed-forward ................................................................................................................................................................ 220
Multi-layer Feed-forward NN (FFNN).......................................................................................................................................... 221
FFNN Neuron Model ............................................................................................................................................................................ 221
Training Algorithm: Backpropagation ......................................................................................................................................... 221
Total Mean Squared Error ................................................................................................................................................................. 222
Weight Update Rule.............................................................................................................................................................................. 222
Backprop Learning Algorithm (Incremental-mode) ............................................................................................................. 223
Stopping Criteria ................................................................................................................................... 223
Applications ............................................................................................................................................................................................. 223
Neural Network and Computers ..................................................................................................................................................... 224
NNs vs. Computers ................................................................................................................................................................................ 224
What Can You Do with an NN and What Not? .......................................................................................................................... 225


Design and Analysis of Experiment


Data and Data Collection

Data Collection Techniques


 Research Literature
 Observations
 Surveys
 Tests
 Document Analysis

What is Design of Experiment?


Design of experiments is a statistical technique that involves the introduction of purposeful
and carefully planned changes to a process, while controlling for other factors, with the goal
of measuring the impact of those changes on the process output.

In many statistical studies a variable of interest, called the response variable (or dependent
variable), is identified. Then data are collected that explain how one or more factors (or
independent variables) influence the variable of interest.

DoE: Some Terminology


If the factors cannot be controlled, the data obtained are observational. For example: In order to study
the relationship between the size of a home and its sale price, a real estate agent randomly
selects 50 recently sold homes and records the square footages and sales prices of these
homes. The real estate agent cannot control the sizes of the randomly selected homes. The
data is observational.

If the factors can be controlled, the data are experimental. The purpose of most experiments
is to compare and estimate the effects of the different treatments on the response variable.
Experiments can be designed in many different ways to collect this information. The values,
or levels, of the factor (or combination of factors) are called treatments.

DoE: Examples of Experiments from Daily Life

Example                   Factors                                        Response

Photography               Speed of film, lighting, shutter speed         Quality of slides made close up
                                                                         with flash attachment
Boiling water             Pan type, burner size, cover                   Time to boil water
Mailing                   Stamp, area code, time of day when             Number of days required for
                          letter mailed                                  letter to be delivered
Cooking                   Amount of cooking wine, oyster sauce,          Taste of stewed chicken
                          sesame oil
Basketball                Distance from basket, type of shot,            Number of shots made (out of
                          location on floor                              10) with basketball
Undergoing training       Education, study hours, assignments            Knowledge and grades obtained
for DBDA                  completed, instructor's expertise

Design of Experiment: Cake Baking


Guidelines for Designing Experiments
 Recognition of and statement of the problem: Develop all ideas about the
objectives of the experiment, get input from everybody, use team approach
 Choice of factors, levels, ranges, and response variables: Use judgment or prior
test results.
 Choice of experimental design: Sample size, replicates, run order, randomization,
design of data collection forms, software to use

Baking the Cake: Steps in Design of Experiments


1. Choosing the Factors
 Factors that have the greatest impact on the final product.

 Domain knowledge
 Usually no more than six or seven key factors
2. Setting the Levels
 What combinations of factors?
 Restrictions if any on levels for the factors
 Must account for the range of interest of the outcome
3. Assessing the Response
 The response is the outcome of the experiment.
 The response can be qualitative or quantitative

DoE for Cake Baking: Factor Levels

Strategy of Experimentation
1 Best guess approach (trial and error)
 Can continue indefinitely
 Cannot guarantee best solution has been found

2 One-factor-at-a-time (OFAT) approach


 Inefficient (requires many test runs)
 Fails to consider any possible interaction between factors

3 Factorial approach (invented in the 1920’s)


 Factors varied together
 Correct, modern, and most efficient approach
 Can determine how factors interact
 Used extensively in industrial R and D, and for process improvement.

There are 27 different combinations of 3 factors with 3 levels.


 For k factors with 2 levels: 2^k combinations
 For k factors with 3 levels: 3^k combinations

These are Full Factorial Designs. Factorial designs are good with a small number of variables.
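As an illustration of how quickly full factorial designs grow, the 27 runs can be enumerated
directly; a minimal sketch in Python, with hypothetical cake-baking factor names and levels:

```python
from itertools import product

# Hypothetical cake-baking factors, each at three levels (assumed values)
factors = {
    "temperature": [160, 175, 190],          # degrees C
    "baking_time": [25, 30, 35],             # minutes
    "sugar":       ["low", "medium", "high"],
}

# A full factorial design tests every combination of factor levels
runs = list(product(*factors.values()))
print(len(runs))                             # 3^3 = 27 runs
for run in runs[:3]:
    print(dict(zip(factors.keys(), run)))
```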

Now consider a case with 100 variables and 20 outputs, typical of “simplified” jet engine
models; a full factorial design quickly becomes infeasible there. Knowledge of statistics is
applicable throughout:
 Samples, means, variances
 Equivalence of means and variances of samples

All DOE are based on the same statistical principles and methods of analysis: ANOVA and
regression analysis.

History of Design of Experiment

Four Eras of DOE


1 The agricultural origins (1918 – 1940s)
 R. A. Fisher & his co-workers
 Profound impact on agricultural science
 Factorial designs, ANOVA
2 The first industrial era (1951 – late 1970s)
 Box & Wilson, response surfaces
 Applications in the chemical & process industries
3 The second industrial era (Late 1970s – 1990)
 Quality improvement initiatives in many companies
 Taguchi and robust parameter design, process robustness
4 The modern era (Beginning circa 1990)
 Wide range of uses of computer technology in DOE
 Expanded use of DOE in Six-Sigma and in business
 Use of DOE in computer experiments

Some Major Players in DOE


 Sir Ronald A. Fisher (pioneer)
Invented ANOVA and pioneered the use of statistics in experimental design while working
at the Rothamsted Agricultural Experiment Station in England.
 George E. P. Box (married Fisher’s daughter)
Developed response surface methodology (1951), among many other contributions to
statistics; passed away in 2013.

 Raymond Myers
 J. S. Hunter
 W. G. Hunter
 Yates
 Montgomery
 Finney

Why Design of Experiment?


In general, DOE helps to learn about the process we are investigating, screen important
variables, build a mathematical model, obtain prediction equations and optimize the
response (if required).

The focus is on three very useful and important classes of experimental designs:
 OFAT
 Factorial design
 Randomized design

Building Blocks of DoE


Replication:
 Allows an estimate of experimental error
 Precise estimate of the sample mean value

Randomization:
 “Average out” effects of extraneous factors
 Reduce bias and systematic errors

Blocking: “Factor out” variables not studied

One-factor-at-a-time Experiments (OFAT)


OFAT is a prevalent, but potentially disastrous type of experimentation commonly used by
many engineers and scientists in both industry and academia. Tests are conducted by
changing the levels of one factor while holding the levels of all other factors fixed. The
“optimal” level of the first factor is then selected. Each factor is varied and its “optimal” level
selected while the other factors are held fixed. OFAT experiments are regarded as easier to
implement, more easily understood, and more economical than factorial experiments, better
than trial and error, and as providing the optimum combination of the factor levels.

Each of these presumptions, however, can be shown to be false except under very special
circumstances. The
key reasons why OFAT should not be conducted except under very special circumstances
are:
 Do not provide adequate information on interactions
 Do not provide efficient estimates of the effects

Factorial Designs
In a factorial experiment, all possible
combinations of factor levels are tested.

The golf experiment:


 Type of driver (oversized or regular)
 Type of ball (balata or 3-piece)
 Walking vs. riding a cart
 Type of beverage (Beer vs water)
 Time of round (am or pm)
 Weather
 Type of golf spike
A two-factor factorial experiment
involving type of driver and type of ball.

Factorial Designs with Several Factors

Factorial v/s OFAT


In Factorial design, experimental trials or runs are performed at all possible combinations of
factor levels in contrast to OFAT experiments. Factorial and fractional factorial experiments
are among the most useful multi-factor experiments for engineering and scientific
investigations.

Factorial                                          OFAT
2 factors: 4 runs (3 effects)                      2 factors: 6 runs (2 effects)
3 factors: 8 runs (7 effects)                      3 factors: 16 runs (3 effects)
5 factors: 32 or 16 runs (31 or 15 effects)        5 factors: 96 runs (5 effects)
7 factors: 128 or 64 runs (127 or 63 effects)      7 factors: 512 runs (7 effects)

Example: Effect of Re and k/D on friction factor f


Consider a 2-level factorial design (2²).

Reynolds number = Factor A; k/D = Factor B

Levels for A: 10⁴ (low), 10⁶ (high)

Levels for B: 0.0001 (low), 0.001 (high)

Responses:

(1) = 0.0311,

a = 0.0135

b = 0.0327

ab = 0.0200

Effect (A) = -0.66, Effect (B) = 0.22, Effect (AB) = 0.17

% contribution: A = 84.85%, B = 9.48%, AB = 5.67%

The presence of interactions implies that one cannot satisfactorily describe the effects of
each factor using main effects.
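These numbers can be reproduced in a few lines; a minimal sketch in Python, assuming the
response f is analysed on a natural-log scale (ln f), which is what matches the quoted effects
and percentage contributions:

```python
import math

# Responses of the 2^2 pipe-friction design in standard (Yates) notation
y = {"(1)": 0.0311, "a": 0.0135, "b": 0.0327, "ab": 0.0200}
r = {k: math.log(v) for k, v in y.items()}   # ln-transformed response (assumption)

# effect = (sum of responses at high level - sum at low level) / 2
A  = (r["a"] + r["ab"] - r["(1)"] - r["b"]) / 2     # Re main effect
B  = (r["b"] + r["ab"] - r["(1)"] - r["a"]) / 2     # k/D main effect
AB = (r["(1)"] + r["ab"] - r["a"] - r["b"]) / 2     # interaction

# Percentage contribution based on squared effects
total = A**2 + B**2 + AB**2
for name, e in [("A", A), ("B", B), ("AB", AB)]:
    print(f"Effect({name}) = {e:+.2f}  ({100 * e * e / total:.2f}% contribution)")
# Effect(A) = -0.66 (84.85%), Effect(B) = +0.22 (9.48%), Effect(AB) = +0.17 (5.67%)
```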

Friction in Pipes: Experiments of Nikuradse


Johann Nikuradse, 1933: Artificial Sand Roughness

Design-Ease Plot

Design-Expert Plots

[Design-Expert plots, not reproduced: an interaction graph of Log10(f) versus A: RE at the two
levels of B: k/D; a contour plot of Log10(f) over A: RE and B: k/D; and a predicted vs. actual
plot of Log10(f).]

Central Composite Design


Augmenting the basic 2² design with a centre point and four axial points (five additional
runs), we get a central composite design, and a 2nd-order model can be fit. The nonlinear
nature of the relationship between Re, k/D and the friction factor f can be seen. If Nikuradse
(1933) had used a factorial design in his pipe friction experiments, he would have needed far
fewer experimental runs.

Randomized Design and ANOVA


Shelf Problem:
A commercial bakery supplies many supermarkets. In order to improve the effectiveness of
its supermarket shelf displays, the company wishes to compare the effects of shelf display
height (bottom, middle, or top) and width (regular or wide) on monthly demand. The bakery
employs two-way analysis of variance to find the display height and width combination that
produces the highest monthly demand.

Some Terminology
In order to collect data in an experiment, the different treatments are assigned to objects
(people, cars, animals, or the like) that are called experimental units. For example: the
supermarkets in the “Shelf Problem”. In general, when a treatment is applied to more than one
experimental unit, it is said to be replicated. In a completely randomized experimental
design, independent random samples of experimental units are assigned to the treatments.

Shelf Problem Data

Sample Means

Plotting Treatment Means

Some Generalization of the Problem Formulation
To study the effects of two factors on a response variable.
Assumption:
 The first factor (factor 1) has a levels (levels 1, 2, . . . , a).
 The second factor (factor 2) has b levels (levels 1,2, . . . , b).
A treatment is considered to be a combination of a level of factor 1 and a level of factor 2.
There are a total of ab treatments. Assigning m experimental units to each treatment in a
completely randomized experimental design results in observing m values of the response
variable for each of the ab treatments. A two-factor factorial experiment is performed.

Different Treatment Effects in Two-Way ANOVA

ANOVA
In addition to graphical analysis, Analysis of Variance (ANOVA) is a tool to analyse data. It is
necessary to study the influence of each individual parameter (principal variable) and the
effect of two parameters taken at a time (interaction).

Some notations

Sample Means

Some Useful Quantities


In order to numerically compare the between-treatment and within-treatment variability,
several sums of squares and mean squares can be defined.

Treatment Sum of Squares (SST):

n is the total number of experimental units employed in the ANOVA, and

x̄ is the overall mean of all observed values of the response variable.
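With p treatments, nᵢ experimental units receiving treatment i, and x̄ᵢ the sample mean
response for treatment i, the standard form of this quantity is:

\[ SST = \sum_{i=1}^{p} n_i \left( \bar{x}_i - \bar{x} \right)^2 \]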

Treatment sum of squares measures the amount of between-treatment variability.

To measure the within-treatment variability, the quantity defined is Error Sum of Squares.
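In the same notation, with x_{ij} the j-th observed response under treatment i, the standard
form is:

\[ SSE = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x}_i \right)^2 \]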

If there were no variability within each sample, the error sum of squares would be equal to 0.
The more the values within the samples vary, the larger will be SSE.

The variability in the observed values of the response must come from one of two sources:
1. The between-treatment variability
2. The within-treatment variability

The sum of squares that measures the total amount of variability (SSTO) in the observed
values of the response:
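\[ SSTO = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{x} \right)^2 \]

(the standard form, in the notation used above)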

It follows that the total sum of squares equals the sum of the treatment sum of squares and
the error sum of squares. The SST and SSE are said to partition the total sum of squares.

Treatment Mean Squares

\[ MST = \frac{SST}{p - 1} \]

Error Mean Squares

\[ MSE = \frac{SSE}{n - p} \]

where p is the number of treatments.

How to Compare Means of the Sample?


In order to decide whether there are any statistically significant differences between the
treatment means, it makes sense to compare the amount of between-treatment variability to
the amount of within-treatment variability. This comparison is carried out with an F test for
differences between treatment means.
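The test statistic is the ratio of the two mean squares:

\[ F = \frac{MST}{MSE} \]

Under the null hypothesis of equal treatment means, F follows an F distribution with (p − 1)
numerator and (n − p) denominator degrees of freedom, so H₀ is rejected when F exceeds the
F_α critical value.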


Null Hypothesis
H₀: no interaction exists between factors 1 and 2, versus the alternative hypothesis Hₐ: that
interaction does exist. Reject H₀ in favour of Hₐ at level of significance α if

\[ F_{int} = \frac{MS(int)}{MSE} \]

is greater than the F_α based on (a − 1)(b − 1) numerator and ab(m − 1) denominator degrees
of freedom.

Two Way ANOVA Tables

Procedure for 2-Way ANOVA


Step 1: Calculate SSTO, which measures the total amount of variability:

Step 2: Calculate SS(1), which measures the amount of variability due to the different
levels of factor 1:


Step 3: Calculate SS(2), which measures the amount of variability due to the different
levels of factor 2:

Step 4: Calculate SS(interaction), which measures the amount of variability due to the
interaction between factors 1 and 2:

Step 5: Calculate SSE, which measures the amount of variability due to the error:
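The equations for these quantities are standard; written out for the balanced layout described
earlier (a levels of factor 1, b levels of factor 2, m replicates per treatment, with x̄ᵢ. the
factor-1 level means, x̄.ⱼ the factor-2 level means, x̄ᵢⱼ the treatment means, and x̄ the
overall mean), they are:

\[
\begin{aligned}
SSTO &= \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{m} (x_{ijk} - \bar{x})^2 \\
SS(1) &= bm \sum_{i=1}^{a} (\bar{x}_{i.} - \bar{x})^2 \\
SS(2) &= am \sum_{j=1}^{b} (\bar{x}_{.j} - \bar{x})^2 \\
SS(\mathrm{int}) &= m \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{x}_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x})^2 \\
SSE &= SSTO - SS(1) - SS(2) - SS(\mathrm{int})
\end{aligned}
\]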

Minitab Output for Shelf Display

Results of ANOVA
In the shelf display case, F_{.05} based on (a − 1)(b − 1) = 2 numerator and ab(m − 1) = 12
denominator degrees of freedom is 3.89. Since F(int) = 0.82 is much smaller than
F_{.05} = 3.89, H₀ cannot be rejected at the .05 level of significance.

Conclusion: Little or no interaction exists between shelf display height and shelf display
width.

Inferences from ANOVA


The relationship between mean demand for the bakery product and shelf display height
depends little (or not at all) on the shelf display width. The conclusions are suggested by the
plot. Those inferences are substantiated by sound mathematical treatment.

Plotting Treatment Means

Little or no interaction exists between factors 1 and 2, so the main effects can be (separately)
tested for their significance.

For F(1): Reject H₀ at the .05 level of significance. There is strong evidence that at least
two of the bottom, middle, and top display heights have different effects on mean monthly
demand.

For F(2): Cannot reject H₀ at the .05 level of significance. There is no strong evidence that the
regular display width and the wide display have different effects on mean monthly demand.


Probability Theory
Brief Overview: Population
We often have questions concerning large populations. If we want to know the average
weight of all 20-year-olds in India, then the population is all individuals who are 20 years
old and living in India. If we want to know the proportion of middle-aged men who do
not have a heart attack after taking a certain drug, then the population is the set of all
middle-aged men.

A population is the entire set of possible observations in which we are interested. Gathering information
from the entire population is not always possible due to barriers such as time, accessibility,
or cost. Usually populations are so large that a researcher cannot examine the entire
population. Therefore, a subset of the population is selected to represent the population in a
research study.

Sample
A sample is a subset of the population from which inferences are drawn about the population. Sample data
provide only limited information about the population. As a result, sample statistics are
generally imperfect representatives of the corresponding population parameters.

The goal is to use the results obtained from the sample to help answer questions about the
population.

Sample Space
The Sample Space is the set of all possible outcomes of an experiment. An Event is a subset
of the sample space; the individual elements of the sample space are the basic outcomes.
Probability of event A (for equally likely outcomes):

\[ P(A) = \frac{n(A)}{n(S)} \]

where n(A) = the number of elements in the set of the event A, and
n(S) = the number of elements in the sample space S.

A sample space S is the set of all possible outcomes of a (conceptual or physical) random
experiment. (S can be finite or infinite.)

Examples:
1. S may be the set of all possible outcomes of a dice roll. S = {1, 2, 3, 4, 5, 6}
2. Number of hours people sleep. S = {h : h ≥ 0 hours}
3. Temperature recorded in Mumbai for the last 10 years. S = {T : T ∈ [5°C, 41°C]}
4. Do you brush your teeth every day? S = {yes, no}

Types of Data
 Discrete: Quantitative data are called Discrete if the sample space contains a
finite or countably infinite number of values.
 Continuous: Quantitative data are called Continuous if the sample space contains an
interval or continuous span of real numbers.
 Categorical: Qualitative data are called Categorical if the sample space contains
objects that are grouped or categorized based on some qualitative trait.

Random Variables
A random variable assigns a numerical value to each possible outcome of an experiment. A
random variable can be discrete or continuous. A discrete random variable can assume at
most a countable number of values. A continuous random variable is one which takes values
in a continuous range. Continuous random variables are usually measurements.

Types of Random Variables


1. Discrete Random Variables:
 Number of Sales
 Number of Calls
 Shares of Stock
 People in Line
 Mistakes Per Page

2. Continuous Random Variables:


 Length
 Time
 Depth
 Weight
 Volume

Questions about the Sample


Suppose that the Mumbai traffic police claims that 60% of Mumbai residents maintain a
car or two-wheeler. Suppose we take a random sample of 100 Mumbai citizens and
determine that the proportion of citizens in the sample who maintain a car or two-wheeler
is 69/100 = 69%.

Given this sample of 100 Mumbai citizens: if the actual population proportion is 60%, how
likely is it that we'd get a sample proportion of 0.69?

Events
Event A is a subset of the sample space S: A ⊆ S.

The Rule of Union expresses the probability of the union of two events in terms of the
probabilities of the two events and the probability of their intersection:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

The Rule of Computing Probability
Addition rule for mutually exclusive events: when A and B are mutually exclusive, A and B
cannot simultaneously occur, and the probability that ‘A’ occurs or ‘B’ occurs is

P(A ∪ B) = P(A) + P(B)

Addition rule for events that are not mutually exclusive: here “‘A’ occurs or ‘B’ occurs or
both ‘A’ and ‘B’ occur”. The probability of the union of A and B is determined by adding the
probabilities of the events A and B and then subtracting the probability of the intersection
of the events A and B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

The symbol ∩ means both A and B simultaneously occur.

Independent Events
The probability that one event occurs in no way affects the probability of the other event
occurring. If A and B are independent events, then the probability of both occurring is:

P(A ∩ B) = P(A) × P(B)

Example: Roll a die and flip a coin.

The probability of getting any number on the die in no way influences the probability of
getting a head or a tail on the coin.

Mutually Exclusive Events


Certain pairs of events have a unique relationship referred to as mutual exclusivity. Two events
are said to be mutually exclusive if they can't occur at the same time, that is, P(A ∩ B) = 0, so

P(A ∪ B) = P(A) + P(B)

If the two mutually exclusive events are also exhaustive (one of them must occur), then

P(A) + P(B) = 1

When you flip a fair coin, you either get a head or a tail but not both. These events are
mutually exclusive and exhaustive; adding their probabilities gives

P(head) + P(tail) = 0.5 + 0.5 = 1

Dependent Events
The outcome of one event affects the outcome of the other. If A and B are dependent events,
then the probability of both occurring is:

P(A ∩ B) = P(A) × P(B|A), where P(B|A) is the probability of B given A

Suppose we have 5 blue marbles and 5 red marbles in a bag. We pull out one marble, which
may be blue or red. What is the probability that the second marble will be red?

If the first marble was red, then the bag is left with 4 red marbles out of 9, so the probability
of drawing a red marble on the second draw is 4/9. But if the first marble we pull out of the
bag is blue, then there are still 5 red marbles in the bag and the probability of pulling a red
marble out of the bag is 5/9.

Non-Mutually Exclusive Events

Two events A and B are said to be mutually non-exclusive events if both the events A and B
have at least one common outcome between them.

Non-mutually exclusive events have their probability defined as follows:

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)

Flipping a coin twice (or flipping two coins)

 Event A: I get at least one head.


 Event B: I get at least one tail.

Both occur if I get HT or TH. They’re also not identical: if I get HH, A occurs but B doesn’t.
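A quick way to check the addition rule on this example is to enumerate the sample space
directly; a minimal sketch in Python (the event definitions are the ones above):

```python
from itertools import product

# Sample space for two fair coin flips
S = list(product("HT", repeat=2))   # ('H','H'), ('H','T'), ('T','H'), ('T','T')

A = [s for s in S if "H" in s]      # event A: at least one head
B = [s for s in S if "T" in s]      # event B: at least one tail
A_and_B = [s for s in S if s in A and s in B]
A_or_B  = [s for s in S if s in A or s in B]

p = lambda E: len(E) / len(S)       # equally likely outcomes
# P(A or B) = P(A) + P(B) - P(A and B): 1.0 = 0.75 + 0.75 - 0.5
print(p(A_or_B), p(A) + p(B) - p(A_and_B))
```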

Algebra of Sets
Since events and sample spaces are just sets, let's review the algebra of sets:

Ø is the "null set" (or "empty set")

A ∩ B = "intersection" = the elements in both A and B. If A ∩ B = Ø, then A and B are called
"mutually exclusive events" (or "disjoint events").

C ∪ D = "union" = the elements in C or D or both

If E ∪ F ∪ G ∪ ... = S, then E, F, G, and so on are called "exhaustive events."

D' = Dc = "complement" = the elements not in D

How Do We Assign a Probability to the Event?


Personal Approach
"I think there is an 80% chance of rain today."

"I think there is a 50% chance that the world's oil reserves will be depleted by the year
2100."

"I think there is a 1% chance that the men's basketball team will end up in the Final Four
sometime this decade."

Frequency
Probability that you will get Heads when you toss a coin?

Classical Approach
As long as the outcomes in the sample space are equally likely (!!!), the probability of
event A is:

\[ P(A) = \frac{N(A)}{N(S)} \]

Probability
It is a measurement of likelihood that a particular event will occur.

Tossing a Coin: When a coin is tossed, there are two possible outcomes: Head and Tail.

Throwing Dice: When a single die is thrown, there are 6 possible outcomes: 1, 2, 3, 4, 5, 6.
The probability of any one of them is 1/6.

Experiments and Outcome


 An experiment is an act or process that leads to one of several possible outcomes

 An outcome of an experiment is some observation or measurement

Probability is a (real-valued) set function P that assigns to each event A in the sample
space S a number P(A), called the probability of the event A.

Axioms of Probability (Kolmogorov Axioms)


0 ≤ P(A) ≤ 1

(The area of A cannot be greater than the area of S.)

P(S) = 1

Corollary: P(∅) = 0
When the area of A is zero, there is no element in A, i.e., A = ∅.

σ-additivity: for disjoint events (sets),

\[ P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i) \]

Now consider two events A and B which are not disjoint. Adding P(A) and P(B) counts the
overlap A ∩ B twice, which gives the corollaries:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

P(¬A) = 1 − P(A)

If A ⊆ B, then P(A) ≤ P(B).

Conditional Probability: Motivation

Diagnostic test on 137 patients for renal disease.

P(T+) is the probability that a person tests positive: P(T+) = 54/137.

Let P(D) be the probability that a person is truly diseased:

P(D) = 67/137

If a person has renal disease, what is the probability that he/she tests positive
for the disease?

\[ P(T+ \mid D) = \frac{N(T+ \cap D)}{N(D)} = \frac{44}{67}
   = \frac{N(T+ \cap D)/N(S)}{N(D)/N(S)} = \frac{P(T+ \cap D)}{P(D)} \]

The conditional probability of an event A given that an event B has occurred is written
P(A|B) and is calculated using P(A|B) = P(A ∩ B)/P(B), as long as P(B) > 0.

Properties of Conditional Probability: Because conditional probability is just a probability,


it satisfies the three axioms of probability. That is, as long as P(B) > 0:

P(A | B) ≥ 0

P(B | B) = 1

If A1, A2, ... Ak are mutually exclusive events, then

P(A1 ∪ A2 ∪ ... ∪ Ak| B) = P(A1 | B) + P(A2 | B) + ... + P( Ak| B) and likewise for infinite unions.

Multiplicative Rule and Independence


The probability that two events A and B both occur is given by the multiplication rule as:

P(A ∩ B) = P(A | B) × P(B) or by P(A ∩ B) = P(B | A) × P(A)

Independent Events: Events A and B are independent events if the occurrence of one of
them does not affect the probability of the occurrence of the other.

P(B|A) = P(B), (provided that P(A) > 0) or P(A|B) = P(A), (provided that P(B) > 0)

Implication of Independence


It is known that P(A ∩ B) = P(A | B) × P(B) or P(A ∩ B) = P(B | A) × P(A)

For independent events, P(A | B) = P(A) and P(B | A) = P(B)


Substituting it, the result is: For Independent events P(A ∩ B) = P(A) × P(B)

Bayes Rule
Bayes rule is important for reverse conditioning:

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]

Corollary from Bayes Rule


 P(¬A|B) = P(B|¬A)P(¬A) / P(B)
 P(A|B) + P(¬A|B) = 1
 P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)

Bayes Rule Example: AIDS Test
Test Data
 Approximately 0.1% are infected
 Test detects all infections
 Test reports positive for 1% healthy people

Probability of having AIDS if the test is positive

Let A be the event that the person has AIDS:

P(A) = 0.1% = 0.001,  P(Aᶜ) = (100 − 0.1)% = 99.9% = 0.999

Let T be the event that the test is positive. For healthy people, P(T|Aᶜ) = 1% = 0.01.

For infected people, P(T|A) = 100% = 1.0.

We want to find P(A|T):

\[
P(A \mid T) = \frac{P(T \mid A)\,P(A)}{P(T)}
  = \frac{P(T \mid A)\,P(A)}{P\big((T \cap A) \cup (T \cap A^c)\big)}
  = \frac{P(T \mid A)\,P(A)}{P(T \mid A)\,P(A) + P(T \mid A^c)\,P(A^c)}
  = \frac{1 \times 0.001}{1 \times 0.001 + 0.01 \times 0.999} \approx 0.091
\]

Improving the Diagnosis

Use a follow-up test!


 Test 2 reports positive for 90% infections
 Test 2 reports positive for 5% healthy people

Probability of having AIDS if test 1 and test 2 are both positive

Let A be the event that the person has AIDS:

P(A) = 0.1% = 0.001,  P(Aᶜ) = (100 − 0.1)% = 99.9% = 0.999

Let T₁ be the event that test 1 is positive:

 For healthy people, P(T₁|Aᶜ) = 1% = 0.01

 For infected people, P(T₁|A) = 100% = 1.0

Let T₂ be the event that test 2 is positive:

 For healthy people, P(T₂|Aᶜ) = 5% = 0.05

 For infected people, P(T₂|A) = 90% = 0.90

We want to find P(A|T₁ ∩ T₂).

Tests T₁ and T₂ are conditionally independent given the disease status, so
P(T₁ ∩ T₂|A) = P(T₁|A)P(T₂|A), and likewise given Aᶜ.

\[
P(A^c \mid T_1 \cap T_2)
 = \frac{P(T_1 \cap T_2 \mid A^c)\,P(A^c)}{P(T_1 \cap T_2)}
 = \frac{P(T_1 \cap T_2 \mid A^c)\,P(A^c)}
        {P(T_1 \cap T_2 \mid A)\,P(A) + P(T_1 \cap T_2 \mid A^c)\,P(A^c)}
 = \frac{0.01 \times 0.05 \times 0.999}{1 \times 0.9 \times 0.001 + 0.01 \times 0.05 \times 0.999}
 = 0.357
\]

P(A|T₁ ∩ T₂) = 1 − P(Aᶜ|T₁ ∩ T₂) = 0.643
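The same arithmetic can be reproduced in a few lines; a minimal sketch, using the
probabilities assumed above:

```python
def posterior(prior, like_pos, like_neg):
    """P(A | evidence) via Bayes rule, where like_pos = P(evidence | A)
    and like_neg = P(evidence | not A)."""
    num = like_pos * prior
    return num / (num + like_neg * (1 - prior))

p_a = 0.001                                    # P(A): prevalence of infection

# Single positive test: P(T|A) = 1.0, P(T|not A) = 0.01
print(posterior(p_a, 1.0, 0.01))               # ~0.091

# Two conditionally independent positive tests:
# P(T1,T2|A) = 1.0 * 0.9, P(T1,T2|not A) = 0.01 * 0.05
print(posterior(p_a, 1.0 * 0.9, 0.01 * 0.05))  # ~0.643
```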

Bayesian Learning

Application of Bayes Rule


 Document Classification
 Identifying Mining Regions based on Gamma Ray Lithographs
 Identifying Defective Components
 Predicting Life Expectancy of Jet Engines
 Sensor Selection
 Optical Sensor Placement
 Spelling Corrector

Random Variables
A random variable assigns a numerical value to each possible outcome of an experiment. A
random variable can be discrete or continuous. A discrete random variable can assume at
most a countable number of values. A continuous random variable is one which takes values
in a continuous range. Continuous random variables are usually measurements.

A real-valued random variable is a function of the outcome of the randomised experiment:

X : Ω → ℝ

\[ P(a < X < b) := P(\{\omega : a < X(\omega) < b\}) \]

\[ P(X = a) := P(\{\omega : X(\omega) = a\}) \]

A random variable x takes on a defined set of values with different probabilities.

Example: If you roll a die, the outcome is random (not fixed) and there are 6 possible
outcomes, each of which occurs with probability one-sixth.
The probability distribution describes how frequently we expect different outcomes to occur
if we repeat the experiment over and over.

Probability Distributions: Discrete Random Variables


The probability distribution of a discrete random variable can be given as a graph, a table or
a formula. It specifies the probability associated with each possible outcome the random
variable can assume, and it applies when the variable can only take a fixed (countable) set of
values.

0 ≤ p(x) ≤ 1 for all values of x

The total probability is always 1: Σₓ p(x) = 1

The probability distribution is thus a real-valued function that maps the possible values of x
to their respective probabilities of occurrence, p(x).

Say a random variable x follows the pattern p(x) = (0.3)(0.7)^(x−1) for x > 0. The table
below gives the probabilities (rounded to two digits) for x between 1 and 10.
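The table can be regenerated directly from the formula; a minimal sketch:

```python
# p(x) = 0.3 * 0.7**(x-1) is a geometric probability function
p = lambda x: 0.3 * 0.7 ** (x - 1)

for x in range(1, 11):
    print(x, round(p(x), 2))
# x:    1     2     3     4     5     6     7     8     9    10
# p(x): 0.30  0.21  0.15  0.10  0.07  0.05  0.04  0.02  0.02  0.01
```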

E.g.: If you roll a die, you can get 1, 2, 3, 4, 5, or 6; you cannot get 1.2 or 0.1. If it is a fair die,
the probability distribution will be 1/6, 1/6, 1/6, 1/6, 1/6, 1/6.

Expected Values of Discrete Random Variables
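For a discrete random variable x with probability function p(x), the expected value is the
probability-weighted average of its possible values:

\[ E(x) = \sum_{x} x \, p(x) \]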

Example: Use of Expected Values of a Discrete Variable


In a roulette wheel in a U.S. casino, a $1 bet on “even” wins $1 if the ball falls on an even
number. The chances of winning this bet are 47.37%. X = {win $1, lose $1}

Let p (win $1) be the probability to win $1, p (win $1) = 0.4737 => p (lose $1) = 0.5263
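Applying the definition of expected value to this bet:

\[ E(X) = (+1)(0.4737) + (-1)(0.5263) = -\$0.0526 \]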

On average, bettors lose about five cents for each dollar they put down on a bet like
this. (These are the best bets for patrons.)

Cumulative Distribution Function (CDF)
The Cumulative Distribution Function, F_X(x), is defined as F_X(x) = P(X ≤ x).

Depending upon the process of selecting a sample, the type of sample space and purpose of
sampling we get different discrete probability distributions:
 Binomial: Yes/no outcomes (dead/alive, treated/untreated, sick/well). Samples are
drawn with replacement.
 Hypergeometric: Sampling without replacement
 Poisson: Counts (e.g., how many cases of disease in a given area)

The Binomial Distribution


Binomial Distribution is the probability of success or failure in a given experiment. Binomial
distribution is the result of a binomial experiment, which has the following properties:

 A fixed number of trials (observations)


 Each trial follows Bernoulli distribution, has two possible outcomes- success or failure

When a Bernoulli experiment is conducted ‘n’ number of times, then the sum of those
distributions will be binomially distributed with parameters ‘n’ and ‘P’. The probability of
success is ‘P’. The probability of failure is 1-P. The trials are independent, which means that
the outcome of one trial does not affect the outcome of any other trials.

The probability of x successes in a binomial experiment with n trials and probability of
success = p is

\[ p(x) = \frac{n!}{x!\,(n-x)!} \, p^x (1-p)^{n-x}, \qquad x = 0, 1, 2, \ldots, n \]

Example:

Let us consider the probability of getting a 1 when rolling a die 20 times. Here n = 20
and p = 1/6: success would be rolling a one and failure would indicate getting any
other number. On the other hand, if we consider rolling an even number, then

p = 1/2 and n = 20.

Another example: flip a coin 3 times; the outcomes are Heads or Tails.

P(H) = .5; P(T) = 1 − .5 = .5

A head on flip i doesn't change P(H) of flip i + 1.

A binomial random variable requires:
 n identical trials
 Two outcomes per trial: Success or Failure, with P(S) = p and P(F) = q = 1 − p
 Independent trials
 x is the number of S's in n trials

The Binomial Probability Distribution


The binomial probability is the product of three pieces: the number of ways of getting the
desired result, the probability of getting the required number of successes, and the
probability of getting the required number of failures.

p = P(S) on a single trial

q=1–p

n = number of trials

x = number of successes
Note: \( \binom{n}{r} = \frac{n!}{r!\,(n-r)!} \)
Example: Say 40% of the class is female. What is the probability that 6 of the first 10
students walking in will be female?

\[ P(x) = \binom{n}{x} p^x q^{n-x} \]

\[ P(6) = \binom{10}{6} (0.4)^6 (0.6)^4 = 210 \times 0.004096 \times 0.1296 = 0.1115 \]
Mean: μ = np

Variance: σ² = npq

Standard Deviation: σ = √(npq)

For 1,000 coin flips:

μ = np = 1000 × 0.5 = 500
σ² = npq = 1000 × 0.5 × 0.5 = 250
σ = √(npq) = √250 ≈ 15.8

The actual probability of getting exactly 500 heads out of 1000 flips is just over 2.5%, but
the probability of getting between 484 and 516 heads (that is, within one standard
deviation of the mean) is about 68%.
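These figures can be checked exactly against the binomial formula; a minimal sketch using
only the standard library:

```python
from math import comb

n = 1000
denom = 2 ** n                       # all equally likely H/T sequences
pmf = lambda k: comb(n, k) / denom   # P(exactly k heads)

print(pmf(500))                                # ~0.0252, just over 2.5%
print(sum(pmf(k) for k in range(484, 517)))    # ~0.70, close to the one-sigma rule of thumb
```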

The Hypergeometric Distribution


Hypergeometric Distribution describes the probability of success in ‘n’ draws without
replacement. The result of each draw can be grouped into one of the two mutually
exclusive events- success or failure. The probability of success changes in each draw as the
size of the population changes for each draw.

Example: A deck of cards contains 20 cards: 6 red cards and 14 black cards. 5 cards are
drawn randomly without replacement. What is the probability that exactly 4 red cards are
drawn?

The probability of choosing exactly 4 red cards is:
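Using the hypergeometric formula given below,

\[ P(4\ \text{red}) = \frac{\binom{6}{4}\binom{14}{1}}{\binom{20}{5}}
   = \frac{15 \times 14}{15504} \approx 0.0135 \]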

In the binomial situation each trial is independent, e.g., drawing cards from a deck and
replacing the drawn card each time. If the card is not replaced, each trial depends on the
previous trial(s); the hypergeometric distribution can be used in this case. Randomly draw
n elements from a set of N elements, without replacement, where there are r successes and
N − r failures among the N elements.

The hypergeometric random variable is the number of successes, x, drawn from the r
available in the n selections.

P(x) = [(r x) (N−r n−x)] / (N n)

μ = nr / N

σ² = [r(N − r) n(N − n)] / [N²(N − 1)]

Where:
N = the total number of elements
r = number of successes in the N elements
n = number of elements drawn
x = the number of successes in the n elements drawn

Example: Suppose a customer at a pet store wants to buy two hamsters for his daughter,
but he wants two males or two females (i.e., he does not want to end up with more than
two hamsters in a few months). There are ten hamsters, five male and five female. What is
the probability of drawing two of the same sex? (With hamsters, it's virtually a random
selection.)

P(M = 2) = P(F = 2) = [(5 2)(5 0)] / (10 2) = 10/45 ≈ 0.22

P(M = 2 ∪ F = 2) = P(M = 2) + P(F = 2) = 2 × 0.22 = 0.44
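Both hypergeometric examples can be checked with SciPy (a minimal sketch; the card answer
is the value the formula above produces):

from scipy.stats import hypergeom

# Cards: N = 20 total, r = 6 red, n = 5 drawn; P(exactly 4 red)
print(hypergeom.pmf(4, 20, 6, 5))          # ~0.0135

# Hamsters: N = 10 total, r = 5 males, n = 2 drawn; P(2 males)
p_two_males = hypergeom.pmf(2, 10, 5, 2)   # ~0.22
print(2 * p_two_males)                     # P(two of the same sex) ~0.44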

The Poisson Distribution


Poisson distribution gives the probability of certain events occurring in a fixed interval of
time. Poisson distribution can be calculated using the formula:

P(X = x) = (λ^x e^(−λ)) / x!

Where, x = 0, 1, 2, 3 ….

λ = mean number of occurrences in the interval

e = Euler’s constant ≈ 2.718

Example: The number of errors in a new edition of a book is Poisson distributed with a
mean of 1.5 per 100 pages, and this varies from book to book. What is the probability that
there are no typographical errors in a randomly selected 100 pages of a new book?
P(x) = (e^(−λ) λ^x) / x!

Substituting x = 0 and λ = 1.5,


P(0) = e^(−1.5) (1.5)^0 / 0!

     = (2.71828)^(−1.5) (1) / 1

     = 0.2231

The Poisson distribution evaluates the probability of a (usually small) number of
occurrences out of many opportunities in a:

 Period of Time
 Area
 Volume
 Weight
 Distance
 Other Unit of Measurement

λ = mean number of occurrences in the given unit of time, area, volume, etc.
e = 2.71828…
µ = λ
σ² = λ

P(x) = (λ^x e^(−λ)) / x!
Say in a given stream there are an average of 3 striped trout per 100 yards. What is the
probability of seeing 5 striped trout in the next 100 yards, assuming a Poisson distribution?

P(x = 5) = (λ^x e^(−λ)) / x! = (3^5 e^(−3)) / 5! = .1008

How about in the next 50 yards, assuming a Poisson distribution? Since the distance is only
half as long, λ is only half as large.

P(x = 5) = (λ^x e^(−λ)) / x! = (1.5^5 e^(−1.5)) / 5! = .0141
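A quick SciPy check of both trout calculations (a minimal sketch):

from scipy.stats import poisson

print(poisson.pmf(5, 3))      # lambda = 3 per 100 yards  -> ~0.1008
print(poisson.pmf(5, 1.5))    # lambda = 1.5 per 50 yards -> ~0.0141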
Distribution of Continuous Random Variable
The probability density function ("p.d.f.") of a continuous random variable X with
support S is an integrable function f(x) satisfying the following:

(1) f(x) is positive everywhere in the support S, that is, f(x) > 0, for all x in S

(2) The area under the curve f(x) in the support S is 1, that is:

∫𝑆 𝑓(𝑥) 𝑑𝑥 = 1

(3) If f(x) is the p.d.f. of x, then the probability that x belongs to A, where A is some interval,
is given by the integral of f(x) over that interval, that is:

P(X ∈ A) = ∫_A f(x) dx

Cumulative Distribution Function (CDF)


The cumulative distribution function is defined for discrete random variables as:

F_X(x) = P(X ≤ x) = Σ_{t ≤ x} f(t)

The cumulative distribution function ("c.d.f.") of a continuous random variable X is defined as:

F(x) = ∫_{−∞}^{x} f(t) dt

For continuous random variables, F(x) is a non-decreasing continuous function


Note: For a continuous random variable, the pdf can be obtained from the cdf: f(x) = dF(x)/dx

Example: X has the following probability density function:

f(x) = 3x², 0 < x < 1

What is the CDF, F(x), for 0 < x < 1?

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{0}^{x} 3t² dt = [t³]_0^x = x³, 0 < x < 1

F(x) = { 0,   x ≤ 0
       { x³,  0 < x < 1
       { 1,   x ≥ 1
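The integration can be checked numerically (a minimal sketch using SciPy's quad):

from scipy.integrate import quad

f = lambda t: 3 * t**2
total_area, _ = quad(f, 0, 1)   # area under the pdf = 1
F_half, _ = quad(f, 0, 0.5)     # F(0.5) = 0.5**3 = 0.125
print(total_area, F_half)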

Descriptive Properties of Continuous Random Variable


The mean, or expected value, of a continuous random variable is:

μ = E(X) = ∫_{−∞}^{∞} x f(x) dx

The variance of a continuous random variable x is:


σ² = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² f(x) dx

The standard deviation of a continuous random variable x is:


σ = √(E[(X − μ)²]) = √( ∫_{−∞}^{∞} (x − μ)² f(x) dx )
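Applying these definitions to the pdf f(x) = 3x² from the previous example (a minimal
SciPy sketch):

from scipy.integrate import quad

f = lambda x: 3 * x**2                              # pdf on (0, 1)
mu, _ = quad(lambda x: x * f(x), 0, 1)              # E(X) = 0.75
var, _ = quad(lambda x: (x - mu)**2 * f(x), 0, 1)   # Var(X) = 0.0375
print(mu, var, var ** 0.5)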

Probability Distributions of continuous random variables:


 Uniform
 Triangular
 Normal
 Exponential

Uniform Distribution
Equally likely chances of occurrence of random values between a minimum 'a' and a
maximum 'b'.

f(x) = 1/(b − a), a ≤ x ≤ b

μ = (b + a)/2

σ² = (b − a)²/12

'a' is a location parameter, 'b − a' is a scale parameter, and there is no shape parameter.


Triangular Distribution
Parameters:
 minimum a, maximum b, most likely c
 Symmetric or skewed in either
direction
 a is the location parameter
 (b-a) scale parameter
 c is the shape parameter

μ = (a + b + c) / 3

σ² = (a² + b² + c² − ab − bc − ac) / 18
Used as rough approximation of other distributions.


Cumulative Distribution Function

F  x  0 if x  a

 x a 
2

F  x  if a  x  c
 b  a  c  a 
 b x 
2

F  x  1 if c  x  b
 b  a  b  c 
F  x  1 if x  b

Normal Distribution
Normal Distribution represents the distribution of many random variables as a symmetrical
bell-shaped graph.

Normal distribution has the following properties:


 Bell shaped
 Symmetrical: the curve is symmetric at the center, around the mean
 Mean, median and mode are all equal
 ‘Middle spread’ (the interquartile range) equals 1.33 σ
 The random variable has an infinite range
 Exactly half of the values are to the left of the center and exactly half are to the right

 The total area under the curve is 1

Empirical Rule:
If a data set has an approximately bell-shaped relative frequency histogram, then about
68% of the data will fall within one standard deviation of the mean and about 95% within
two standard deviations. Almost all (99.7%) of the data will fall within three standard
deviations of the mean.

Example: The daily demand for gasoline at a gas station is normally distributed with a
mean of 1000 gallons and a standard deviation of 100 gallons. There are exactly 1100
gallons of gasoline in storage on a particular day. What is the probability that there
will be enough gasoline to satisfy the customers' demand on that day?

Here, Demand for gasoline = X and we want to find the probability of X < 1100.

First we standardize the data,


P(X < 1100) = P( (X − μ)/σ < (1100 − 1000)/100 ) = P(Z < 1)

The variable X has been transformed to Z, but this does not cause any change in the area;
the value of Z specifies the location of the corresponding value of X. Using the Z table,
we find the probability P(Z < 1) = 0.8413.
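The same lookup in Python (a minimal sketch):

from scipy.stats import norm

print(norm.cdf(1100, loc=1000, scale=100))   # ~0.8413
print(norm.cdf(1))                           # identical answer via the standardized Z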

Many Normal Distributions
Varying the Parameters 𝜇 and 𝜎, we obtain different Normal Distributions. There are an
Infinite Number.

Finding Probabilities using Normal Distribution


Probability is the area under the curve! Does each distribution need its own table?
Infinitely many normal distributions would mean infinitely many tables to look up!

The Standard Normal Distribution (Z)
All normal distributions can be converted into the standard normal curve by subtracting
the mean and dividing by the standard deviation:
Z = (x − μ) / σ,   f(Z) = (1/√(2π)) e^(−Z²/2)

Somebody calculated all the integrals for the standard normal and put them in a table!
Manual integration is no longer required: statistical software now does it for us.

Example: Use of Standardization


Suppose SAT scores roughly follows a normal distribution in the U.S. population of college-
bound students (with range restricted to 200-800).
 The average math SAT: 500
 Standard deviation: 50

What's the probability of getting a math SAT score of 575 or less, with μ = 500 and σ = 50?
 68% of students will have scores between 450 and 550
 95% will be between 400 and 600
 99.7% will be between 350 and 650

For a score of 575:

Z = (575 − 500) / 50 = 1.5

i.e., A score of 575 is 1.5 standard deviations above the mean.


575 1 x 500 2 1.5 1
1  ( ) 1  Z2
 P( X  575)  200 (50) 2 2
 e 50
 
dx 
 2
 e 2 dz

Looking up Z = 1.5 in a standard normal chart (or entering it into SAS/R/Python) gives .9332.

Looking up Probabilities in the Standard Normal Table

What is the area to the left of Z = 1.51 in a standard normal curve? The area is 0.9345
(93.45%).

Finding Z Values for Known Probabilities

Recovering X Values for Known Probabilities

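Going in the reverse direction, the inverse CDF (ppf in SciPy) returns the Z or X value
for a known probability (a minimal sketch):

from scipy.stats import norm

print(norm.ppf(0.9345))                     # Z for a known left-tail probability: ~1.51
print(norm.ppf(0.9332, loc=500, scale=50))  # recover X on the SAT scale: ~575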

Exponential Distribution
Exponential distribution is used to model the time elapsed between two events.

Examples:
 The length of time between telephone calls
 The length of time between arrivals at a service station
 The life time of electronic components

The density function can be expressed as f(x) = λe^(−λx) for x ≥ 0, with:

Mean = 1/λ

Variance = 1/λ²

Example: The lifetime of a battery is exponentially distributed with λ = 0.05 per hour.
What is the probability that the battery will last for more than 20 hours?

P(X > 20) = e^(−0.05(20)) = e^(−1) = 0.3679

What is the probability that the battery will last between 10 and 15 hours?

P(10 < X < 15) = e^(−0.05(10)) − e^(−0.05(15))

= e^(−0.5) − e^(−0.75) = 0.6065 − 0.4724 = 0.1341
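In SciPy the exponential distribution is parameterized by the scale 1/λ (a minimal sketch
of the battery example):

from scipy.stats import expon

lam = 0.05
battery = expon(scale=1/lam)
print(battery.sf(20))                       # P(X > 20) = e^-1 ~0.3679
print(battery.cdf(15) - battery.cdf(10))    # P(10 < X < 15) ~0.1341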

Time between random events / time till the first random event?

If a Poisson process has a constant average rate λ, the mean number of events after a
time t is μ = λt. What is the probability distribution for the time to the first event?

Models:
 Time between customer arrivals to a service system
 The time to failure of machines

Memoryless: the current time has no effect on future outcomes.

There is no shape or location parameter; λ is the scale parameter.

Example: On average lightning kills three people each year in the UK, so the rate is
λ = 3/year. Assuming strikes occur randomly at any time during the year, so that λ is
constant, the time from today until the next fatality has pdf (using t in years)

f(t) = λe^(−λt) = 3 e^(−3t)

What is the probability the next death occurs in less than one year, i.e., t < 1?
P(t < 1) = 1 − e^(−3(1)) ≈ 0.95.

EXCEL: Sampling from Probability Distributions


 Analysis ToolPak
 Random Number Generation option
 Several functions: enter RAND( ) for probability
 NORMINV (probability, mean, std. dev.)
 NORMSINV (probability)
 LOGINV (probability, mean, std. dev.)
 BETAINV (probability, alpha, beta, A, B)
 GAMMAINV (probability, alpha, beta)

Sampling Distributions and Sampling Error


Standard error of the mean = σ/√n

What is the sampling distribution of the mean?

 For large n: normal, regardless of the population (the Central Limit Theorem); the
sketch below illustrates this.
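A small simulation makes the Central Limit Theorem concrete (a minimal NumPy sketch; the
exponential population and the sizes are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
# Population that is clearly not normal: exponential with sigma = 2
samples = rng.exponential(scale=2.0, size=(10_000, 50))
means = samples.mean(axis=1)        # 10,000 sample means, n = 50 each
print(means.std())                  # close to sigma/sqrt(n) = 2/sqrt(50) ~0.283
# A histogram of `means` looks approximately normal even though the population is skewed.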


Introduction to Statistics
What is statistics?
A branch of mathematics that transforms numbers into useful information for decision
makers: methods for processing and analyzing numbers, and for helping reduce the
uncertainty inherent in decision making.

Why statistics?
Knowledge of Statistics allows you to make better sense of the ubiquitous use of numbers.

Using statistics in business:


 Business Memos
 Business Research
 Technical Reports
 Technical Journals
 Newspaper Articles
 Magazine Articles

Decision makers use statistics for various purposes:


 Present and describe business data and information properly.
 Draw conclusions about large groups of individuals or items, using information
collected from subsets of the individuals or items.
 Make reliable forecasts about a business activity.
 Improve business processes.

Statistics is a way to get information from data.


 Data: Facts, especially numerical facts, collected together for reference or
information.
 Information: Knowledge communicated concerning some particular fact.

Basic Vocabulary of Statistics


 Variable: A variable is a characteristic of an item or individual.
 Operational definitions: Data values are meaningless unless their variables have
operational definitions, universally accepted meanings that are clear to all
associated with an analysis.
 Data: Data are the different values associated with a variable.

Types of Statistics
1. Descriptive Statistics: Collect, Organize, Characterize & Present Data. Descriptive
statistics are methods for organizing and summarizing data. For example, tables or
graphs are used to organize data, and descriptive values such as the average score are
used to summarize data. A descriptive value for a population is called a parameter
and a descriptive value for a sample is called a statistic.


2. Inferential Statistics: Make inferences, Hypothesis testing, Determine relationships


& Make predictions. Inferential statistics are methods for using sample data to make
general conclusions (inferences) about populations. Sample is typically only a part of
the whole population, sample data provide only limited information about the
population. As a result, sample statistics are generally imperfect representatives of
the corresponding population parameters.

Define Variable & Types


A Variable is a characteristic or condition that can change or take on different values. Most
research begins with a general question about the relationship between two variables for a
specific group of individuals.

Classification of Data/Variable – Level of Data Analysis

Types of Variables
1. Discrete Variable
A discrete variable can take only a fixed number of values (such as class size); it
consists of indivisible categories.

E.g.: If you roll a die, you can get 1, 2, 3, 4, 5, or 6 you cannot get 1.2 or 0.1. If it is a fair
die, the probability distribution will be 1/6, 1/6, 1/6, 1/6, 1/6, 1/6.

2. Continuous variable
A continuous distribution is appropriate when the variable can take on an infinite
number of values.
E.g.: Length, Mass, Height and Weight.

Types of Measurement Scales
1. Nominal Scale: A nominal scale is an unordered set of categories identified only by
name. Nominal measurements only permit you to determine whether two
individuals are the same or different.
2. Interval Scale: An interval scale is an ordered series of equal-sized categories.
Interval measurements identify the direction and magnitude of a difference. The zero
point is located arbitrarily on an interval scale.
3. Ratio Scale: A ratio scale is an interval scale where a value of zero indicates none of
the variable. Ratio measurements identify the direction and magnitude of differences
and allow ratio comparisons of measurements.
4. Ordinal Scale: An ordinal scale is an ordered set of categories. Ordinal measurements
tell you the direction of difference between two individuals.

Examples for Variables

Population and Sample


The entire group of individuals is called the population. Usually populations are so large that
a researcher cannot examine the entire group. Therefore, a sample is selected to represent
the population in a research study.

The goal is to use the results obtained from the sample to help answer questions about the
population.

Population vs. Sample


Population                                       Sample
A population consists of all the items or        A sample is the portion of a population
individuals about which you want to draw         selected for analysis.
a conclusion.
Population is the entire group being studied.    A sample is a subset of the population
                                                 that is being surveyed.
Measures used to describe the population         Measures computed from sample data are
are called parameters.                           called statistics.

Distribution
The general term for any organized set of data. We organize data so we can see the pattern
they form. We need to know how many total scores were sampled (N). We are also concerned
with how often each different score occurs in the data. How often a score occurs is
symbolized f for frequency.

Creating Simple Frequency Distributions


If the different scores are nominal or ordinal, then we use a bar graph.

If we have a small range of interval or ratio scores, we use a histogram.

If we have a large range of interval or ratio scores, we use a frequency polygon.
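In Python, the usual tools are a bar chart for nominal/ordinal scores and a histogram for
interval/ratio scores (a minimal sketch with made-up data):

import matplotlib.pyplot as plt
import pandas as pd

flavors = pd.Series(["Vanilla", "Chocolate", "Vanilla", "Strawberry", "Vanilla"])
flavors.value_counts().plot(kind="bar")     # nominal data -> bar graph

scores = [62, 66, 68, 70, 71, 72, 77, 79, 85]
plt.figure()
plt.hist(scores, bins=5)                    # interval/ratio data -> histogram
plt.show()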

The Normal Distribution


Specific mathematical properties describe this distribution. The height of the curve above
any score reflects the number of people at that score. Scores far away from the middle (in
the tails) are relatively infrequent. The further a score is from the center, the less frequently
that score appears.

Measure of Central Tendency


The goal of measures of central tendency is to come up with the one single number that
best describes a distribution of scores. There are three basic measures of central
tendency, and choosing one over another depends on two things: the scale of measurement
used, so that a summary makes sense given the nature of the scores, and the shape of the
frequency distribution, so that the measure accurately summarizes the distribution.

Measures of Central Tendency – Mode
The mode is the most common observation in a group of scores; distributions can be
unimodal, bimodal, or multimodal. If the data are categorical (measured on the nominal
scale), only the mode can be calculated. For example, if Vanilla is the most frequently
chosen flavor in a survey, the mode is Vanilla.

Measures of Central Tendency – Median


The median is the number that divides a distribution of scores exactly in half; it is the
same as the 50th percentile. It is better than the mode because only one score can be the
median, and the median will usually be around where most scores fall. The median is
computed when data are on an ordinal scale or when they are highly skewed. If the data
are perfectly normal, the mode equals the median.

There are three methods for computing the median, depending on the distribution of
scores.
 If you have an odd number of scores pick the middle score. 1 4 6 7 12 14 18. Median
is 7.
 If you have an even number of scores, take the average of the middle two. 1 4 6 7 8
12 14 16. Median is (7+8)/2 = 7.5
 If you have several scores with the same value in the middle of the distribution use
the formula for percentiles (not found in your book).
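NumPy's median handles the odd and even cases shown above (a minimal sketch):

import numpy as np

print(np.median([1, 4, 6, 7, 12, 14, 18]))       # odd count  -> 7.0
print(np.median([1, 4, 6, 7, 8, 12, 14, 16]))    # even count -> (7+8)/2 = 7.5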

Measures of Central Tendency – Mean


The arithmetic mean is the average value of the data, computed simply by adding together
all scores and dividing by the number of scores. It uses information from every single
score.

For a population: μ = ΣX / N

For a sample: X̄ = ΣX / n

Weighted Mean
Pretend that one semester's class of 23 students scored M1 = 18 points on a quiz. The same
quiz was then given the next semester to 34 students who got M2 = 22 points. What is the
overall (weighted) mean for these 57 students?

Solution:
ΣX1 can be computed by multiplying M1 by the sample size (ΣX1 = M1 × n1 = 18 × 23 = 414).

For the second class, ΣX2 = M2 × n2 = 22 × 34 = 748

ΣXtotal = ΣX1 + ΣX2 = 414 + 748 = 1206

ntotal = n1 + n2 = 23 + 34 = 57

Mtotal = ΣXtotal / ntotal = 1206/57 = 21.158
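The same weighted mean in one line of NumPy (a minimal sketch):

import numpy as np

print(np.average([18, 22], weights=[23, 34]))   # ~21.158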

Practical Example
Sales (in millions) of 10 randomly selected stores; eight of the store sales are 25 or
less, and two are greater than 45: 8, 12, 6, 16, 10, 20, 22, 25, 47, 55
 Median = 18.0 (in millions)
 Mean = 22.1 (in millions)

Which is more accurate as a generalization of the 'typical' store's sales: the mean, which
is pulled up by the two outlier stores, or the median, which is not?

Central Tendency Measures

Measure of Dispersion
Which of the distributions of scores has the larger dispersion?

The upper distribution has more dispersion because the scores are more spread out; that
is, they are less similar to each other.

Semi Interquartile-Range
A quartile is a division of a distribution of scores. The 1st quartile refers to the 25th
percentile, the 2nd quartile refers to the 50th percentile (or median) and the 3rd quartile
refers to the 75th percentile. Interquartile range refers to the distance between the 1st and
3rd quartile.

Mean Deviation
Mean deviation is also known as average deviation. Here, the deviation of each value is
taken from an average, usually the mean, median, or mode. While taking deviations, we
ignore the signs and treat all deviations as positive.

The formula for MD is given below:

MD = Σ|d| / N (deviations taken from the mean)

MD = Σ|m| / N (deviations taken from the median)

MD = Σ|z| / N (deviations taken from the mode)

Standard Deviation
This is the most useful and most commonly used of the measures of variability. The standard
deviation looks to find the average distance that scores are away from the mean.

( X  X ) 2
(n - 1)

=sum (sigma)
X=score for each point in data

Confidential and restricted. Do not distribute. (c) Imarticus Learning 65


Post Graduate Program in Data Analytics
Data Science: Participant Manual
X =mean of scores for the variable

n=sample size (number of observations or cases)

In the example data set, the standard deviation is 32.01; the scores of students 12 and 14
skew this calculation (indirectly, through the mean).
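A minimal NumPy sketch (with made-up scores; ddof=1 gives the n − 1 sample formula above):

import numpy as np

scores = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical scores
print(np.std(scores, ddof=1))            # sample standard deviation, ~2.14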

Comparing the Standard Deviations

Choosing a Measure of Variability
Since we’re discussing the measures of variability in terms of descriptive statistics, they
would only be reported in relation to a measure of central tendency. If the median was used
as a measure of central tendency, then the semi-interquartile range is the appropriate
measure of variability to use. If the mean was used as a measure of central tendency, then
the standard deviation is the appropriate measure of variability.

Other Measures
There are other measures that are frequently used to analyze a collection of data:
 Skewness
 Kurtosis
 Coefficient of Variation
 Box Plot
 Scatter Plot

Skewness
Skewness is the lack of symmetry of the data.
Kurtosis
Kurtosis provides information regarding the shape of the population distribution (the
peakedness of the distribution or the heaviness of its tails).

Coefficient of Variation
 Measure of Relative Variation
 Always a %
 Shows Variation Relative to Mean
 Used to Compare 2 or More Groups
Formula (for a sample): CV = (S / X̄) × 100%
Stock A:
Average price last year = $5
Standard deviation = $5
CV = (5/5) × 100% = 100%

Stock B:
Average price last year = $100
Standard deviation = $5
CV = (5/100) × 100% = 5%

Box Plot
A box-plot is a visual description of the distribution based on:
 Minimum
 Q1
 Median
 Q3
 Maximum

A box plot is a graphical representation of statistical data based on the minimum, first
quartile, median, third quartile, and maximum. Mainly used to identify outliers.

A boxplot is a one-dimensional graph of numerical data based on the five-number summary,


which includes the minimum value, the 25th percentile (known as Q1), the median, the 75th
percentile (Q3), and the maximum value. In essence, these five descriptive statistics divide
the data set into four equal parts.

A boxplot can show information about the distribution, variability, and center of a data set.

Consider the following 25 exam scores: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78,
79, 85, 87, 88, 89, 93, 95, 96, 98, 99, and 99. The five-number summary for these exam scores
is 43, 68, 77, 89, and 99, respectively.
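The five-number summary is easy to reproduce (a minimal NumPy sketch):

import numpy as np

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
print(np.min(scores), np.percentile(scores, 25), np.median(scores),
      np.percentile(scores, 75), np.max(scores))    # 43 68.0 77.0 89.0 99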


Some statistical software adds asterisk signs (*) to show numbers in the data set that are
considered to be outliers — numbers determined to be far enough away from the rest of the
data to be noteworthy.

It’s easy to misinterpret a boxplot by thinking the bigger the box, the more data. Remember
each of the four sections shown in the boxplot contains an equal percentage (25%) of the
data. A bigger part of the box means there is more variability (a wider range of values) in
that part of the box, not more data. You can’t even tell how many data values are included in
a boxplot — it is totally built around percentages.

Scatter Plot
Displays the relationship between two continuous variables. Useful in the early stage of
analysis, when exploring data and determining whether a linear regression analysis is
appropriate. May show outliers in your data.

A graph that contains plotted points that show the relationship between two variables. A
scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series
of dots. Each dot on the scatterplot represents one observation from a data set. The position
of the dot on the scatterplot represents its X and Y values.

Scatterplots are used to analyze patterns in bivariate data. These patterns are described in
terms of linearity, slope, and strength.

 Linearity refers to whether a data pattern is linear (straight) or nonlinear (curved).


 Slope refers to the direction of change in variable Y when variable X gets bigger. If
variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the slope
is negative.
 Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the
relationship between variables is weak. If the dots are concentrated around a line, the
relationship is strong.

Dispersion Measures

A distribution can have one or many peaks. Distribution with one clear peak is
called unimodal. Distribution with two clear peaks is called bimodal. A symmetric
distribution with a single peak at the centre is referred to as bell-shaped.

Outliers
An outlier is an observation which does not
appear to belong with the other data. Outliers can
arise because of a measurement or recording
error or because of equipment failure during an
experiment, etc. An outlier might be indicative of
a sub-population, e.g. an abnormally low or high
value in a medical test could indicate presence of
an illness in the patient.

Re-define the upper and lower limits of the boxplot (the whisker lines) as:
 Lower limit = Q1 − 1.5 × IQR, and
 Upper limit = Q3 + 1.5 × IQR

Note that the whiskers may not extend as far as these limits. If a data point is below
the lower limit or above the upper limit, the data point is considered to be an outlier.

Outlier Analysis and Treatment
When a data point deviates markedly from the others in a sample, it is called an outlier. Any
other expected observation is labeled as an inlier. Outliers are extreme values that can occur
on both sides (minimum or maximum). In a normal distribution, 0.4% are Outliers (>2.7 SD)
and 1 in a million is an extreme Outlier (>4.72 SD).

Examples: Unusual credit card purchase.

Outliers are different from the noise data. Noise is random error or variance in a measured
variable. Noise should be removed before outlier detection.

Types of Outliers
1. Global Outlier: A data point is considered a global outlier if its value is far outside the
entirety of the data set in which it is found.
2. Contextual Outliers: A data point is considered a contextual outlier if its value
significantly deviates from the rest of the data points in the same context.
3. Collective Outliers: A subset of data points within a data set is considered anomalous
if those values as a collection deviate significantly from the entire data set.

Causes of Outliers
 Data Entry Errors: Human errors such as errors caused during data collection,
recording or entry can cause outliers in data.
 Measurement Error: When the measurement instrument used turns out to be faulty.
 Intentional Outlier: Intentional Outlier is found in self-reported measures that
involves sensitive data.
 Data Processing Error: When we extract data from multiple sources, it is possible
that some manipulation or extraction errors may lead to outliers in the dataset.

Impact of Outliers
Outliers can drastically change the results of data analysis and statistical modeling.
They increase the error variance and reduce the power of statistical tests. If the
outliers are non-randomly distributed, they can decrease normality. They can bias or
influence estimates that may be of substantive interest, and they can violate the basic
assumptions of regression, ANOVA, and other statistical models.

Detect Outliers
The most commonly used method to detect outliers is visualization, using methods like the
box plot, histogram, and scatter plot. Common rules of thumb: any value beyond the range
Q1 − 1.5 × IQR to Q3 + 1.5 × IQR; any value outside the range of the 5th and 95th
percentiles; or any data point three or more standard deviations away from the mean.
Outlier detection is merely a special case of the examination of data for influential
data points.
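A minimal sketch of the 1.5 × IQR rule, applied to the store-sales data used earlier:

import numpy as np

data = np.array([8, 12, 6, 16, 10, 20, 22, 25, 47, 55])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # flags 47 and 55 as outliers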

Inferential Statistics
Hypothesis Testing
Hypothesis testing is a method of making an inference about a population parameter based
on sample data. It is a statistical analysis used to determine whether a difference
observed in samples is a true difference rather than a random occurrence.

Key Terms
Three key terms that you need to understand in Hypothesis Testing are:
 Confidence Interval: Measure for reliability of an estimate; sample is used for
estimating a population parameter so we need to know the reliability of that
estimate
 Degrees of Freedom: Number of values that are free to vary in a study
 P-value: Probability of obtaining a test statistic at least as extreme as the one that
was actually observed, assuming that the Null Hypothesis is true

Confidence Interval
Describes the reliability of an estimate: a range of values (lower and upper boundary)
within which the population parameter is included. The width of the interval indicates
the uncertainty associated with the estimate. The confidence level is the probability
associated with the confidence interval.

Degrees of Freedom
Degrees of Freedom is the measure of number of values in a study that are free to vary. For
example, if you have to take ten different courses to graduate, and only ten different courses
are offered, then you have nine degrees of freedom. In nine semesters, you will be able to
choose which class to take. In the tenth semester, there will only be one class left to take –
there is no choice.

P-Value
P-value is the probability of obtaining a test statistic at least as extreme as the one that was
actually observed, assuming that the Null Hypothesis is true. When P-value is less than a
certain significance level (often 0.05), you "reject the null hypothesis". This result indicates
that the observed result is not due to a random occurrence but a true difference.


Process of Hypothesis Testing


The process of hypothesis testing consists of four steps:
1. Formulate the Null Hypothesis and the alternative hypothesis. Please remember
Null Hypothesis is always status-quo, which means “no difference”. Data is gathered
as evidence to either reject or not reject the Null Hypothesis.
2. Identify a test statistic that can be used to assess the truth of the Null Hypothesis.
3. Compute the P-value; the smaller the P-value, the stronger the evidence against the
Null Hypothesis.
4. Compare the P-value to an acceptable significance value. If P-value is less than the
significance level, the Null Hypothesis is rejected.

Four possible scenarios:

                           Decision
                   Prison              Set free
True    Innocent   Type I error        Correct decision
State   Guilty     Correct decision    Type II error

 Type I Error (α): Reject the Null Hypothesis when it is true.


 Type II Error (β): Accept the Null Hypothesis when it is false.

Null Hypothesis = “Person is innocent”

Important points to note regarding Hypothesis Testing are:


 It is always “Reject” or “Do Not Reject” the Null Hypothesis.
 Rejecting the Null Hypothesis means there is evidence that there is a difference
(based on the sample data).
 Failure to reject the Null Hypothesis means data is insufficient to conclude that
there is a difference.

Selection of Test Based on Data Types

Chi Square Test


Tests a Null Hypothesis that the frequency distribution of certain events observed in a
sample is consistent with a particular distribution. Events considered must be mutually
exclusive and have a total probability of 1.
Used for:
 Goodness of Fit: Observed frequency distribution differs from a theoretical
distribution
 Test of Independence: Paired observations on two variables, expressed in a
contingency table, are independent of each other.

Goodness of Fit
When you toss a coin, you have an equal probability of getting a head or a tail. So, if you
toss the coin 100 times, the expected distribution is: head = 50 & tail = 50. Take a scenario
where you get this result: head = 57 & tail = 43.

How do you know if the coin is biased or if you toss another 100 times, you will get the
expected distribution?

Goodness of Fit test helps to establish if the observed distribution fits the expected
distribution.

Test of Independence
An ice cream vendor conducts a survey to capture the relation between ice cream flavor
preference and gender.

Based on the above data, how can the vendor establish the relation between gender and ice
cream flavor preference?

Test of Independence helps to establish association or relation between two categorical


variables.

Chi-Square Test Steps
Calculate the Chi-Square test statistic: the Chi-Square statistic is the normalized sum
of squared deviations between observed and theoretical frequencies.

Determine the Degrees of Freedom of that statistic: the number of frequencies reduced by
the number of parameters of the fitted distribution. Compare the Chi-Square statistic to
the critical value from the Chi-Square distribution.

Chi-Square Test Example


A university wants to analyze student’s decision to enroll in part-time courses. It is
assumed that students who have children have enrolled for part-time courses. Sarah
collected the data and generated the contingency table.

The Chi-Square from the test is greater than the critical Chi-Square and the P-value is
less than the significance level, so the Null Hypothesis is rejected. Sarah can conclude
that having children has an association with student enrollment into part-time courses.
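Since Sarah's actual contingency table is not reproduced here, the sketch below uses
hypothetical counts to show how such a test runs in SciPy:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: has children (no / yes); columns: enrollment (full-time / part-time) - hypothetical
table = np.array([[80, 20],
                  [30, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)    # reject H0 when p < significance level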

Selection of the Hypothesis Test

Normality Tests
A normality test establishes whether the data approximately follow a normal distribution.
Hypothesis tests differ for normal and non-normal data, hence the need to first check the
distribution type.

Common tests for normality are:


 Normal probability plot – graphical method
 Shapiro-Wilk’s W test
 Anderson Darling test
 Kolmogorov-Smirnov test

Graphical Method: Q-Q Plot
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data
plausibly came from some theoretical distribution. For example, if we run a statistical
analysis that assumes our variable is normally distributed, we can use a Normal Q-Q plot to
check that assumption. If data is normal, the points would fall approximately on a straight
line. Easy to interpret and outlier detection is easy.

Shapiro-Wilk’s W Test
Commonly used test for checking the normality of data.
Ho = A sample x1, ..., xn came from a normally distributed population
Reject Null Hypothesis if the “W” (test statistic) is below a predetermined threshold or reject
Null Hypothesis if the P-value is less than the alpha level.

Anderson Darling Test


Another test used for checking the normality of data.
Ho = the data follows normal distribution
Ha = the data do not follow normal distribution
Used to test whether a sample comes from a given probability distribution. Reject the
Null Hypothesis when P-value < alpha; failing the normality test allows you to state with
95% confidence that the data does not fit the normal distribution. The Anderson-Darling
test is severely affected by ties in the data (a tie is when identical values occur more
than once in the data set).

Kolmogorov-Smirnov Test
It is a non-parametric test. Used to compare two samples, or, a sample with a given
distribution. While testing for normality, samples are standardized and compared with a
standard normal distribution. Less powerful for testing normality than the Shapiro–Wilk test
or Anderson–Darling test.
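All of these tests are available in SciPy (a minimal sketch on simulated data):

import numpy as np
from scipy.stats import shapiro, anderson, kstest

rng = np.random.default_rng(1)
x = rng.normal(loc=0, scale=1, size=100)

print(shapiro(x))                          # Shapiro-Wilk: (W, p-value)
print(anderson(x, dist='norm').statistic)  # Anderson-Darling statistic
z = (x - x.mean()) / x.std(ddof=1)         # standardize before Kolmogorov-Smirnov
print(kstest(z, 'norm'))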

One-Tailed Tests
Test where the region of rejection is on only one side of the
sampling distribution.
Example:
 Null Hypothesis: Response time to customer query
<=10 minutes
 Alternative Hypothesis: Response time > 10
minutes. Region of rejection would be the numbers
greater than 10 (there is no bound on the lesser time
interval)

Two-Tailed Tests
Test where the region of rejection is on both sides of the
sampling distribution.
Example:
Speed limit in a freeway 60 – 80 mph (acceptable range
of values). Region of rejection would be numbers from
both sides of the distribution, that is, both <60 and >80
are defects.

Hypothesis Tests for Normal Data


Tests for comparing means of two samples
 z-test
 One Sample T-test
 Two Sample T-test
 Paired T-test
 Analysis of Variance (ANOVA)
Test for comparing variances
 Homogeneity of Variance (HOV)

t-Test Vs. z-Test


General Thumb Rule for using t-test:
 n < 30
 Unknown Population Standard Deviation

General Thumb Rule for using z-test:


 n > 30
 Known Population Standard Deviation


z-Test
Z-test is a statistical test where normal distribution is applied and is basically used for
dealing with problems relating to samples when n ≥ 30. The z measure is calculated as:

z = (X̄ − μ) / SE

Where X̄ is the sample mean to be standardized, μ (mu) is the population mean and SE is
the standard error of the mean.

SE = σ / SQRT(n)

Where σ is the population standard deviation and n is the sample size.
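A minimal sketch of a two-tailed z-test with made-up numbers:

import math
from scipy.stats import norm

xbar, mu, sigma, n = 43.5, 42.0, 8.0, 64   # hypothetical sample mean, H0 mean, sigma, n
se = sigma / math.sqrt(n)
z = (xbar - mu) / se
p = 2 * norm.sf(abs(z))                    # two-tailed p-value
print(z, p)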

One Sample T-Test


One sample T-test is used to compare mean of a sample to an expected or target mean.
Expected mean is the population mean and is not derived from the sample. T-test uses the
sample standard deviation and is used when the population standard deviation is not known.

Example: Null Hypothesis = All sales managers are meeting the quarterly target of 1M.

Two Sample T-Test


Two Sample T-test is used to compare means of two samples. The two samples can be:
 Independent samples: Used when two separate sets of independent and identically
distributed samples are obtained. Different versions of the test exist in following
scenarios:
1. Equal sample sizes and equal Variance
2. Unequal sample sizes and equal Variance
3. Unequal sample sizes and unequal Variance

 Paired samples: Used when the same group has been tested or used twice. For
example, study of a group of patients before and after a treatment.

Example:
 Null Hypothesis: Production times in the plants in Pune and Xian are same.
 Null Hypothesis: Student scores are same before and after introduction of video
based learning.
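Both variants are one call in SciPy (a minimal sketch with hypothetical production times):

from scipy.stats import ttest_ind, ttest_rel

pune = [12.1, 11.8, 12.5, 12.0, 11.9]
xian = [12.6, 12.4, 12.9, 12.2, 12.7]
print(ttest_ind(pune, xian))                   # independent samples, equal variance
print(ttest_ind(pune, xian, equal_var=False))  # Welch's version for unequal variance
print(ttest_rel(pune, xian))                   # paired samples (same group measured twice)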

ANOVA
ANOVA is used for comparing means of more than 2 samples.
 Null Hypothesis = Means of all the samples are equal
 Alternative hypothesis = Mean of at least one of the samples is different
Variance of all samples is assumed to be similar.

Example:
 Null Hypothesis: Query response time same across all five query categories.
 Null Hypothesis: No difference in student performance across the 6 modules of the
Analytics course.
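A minimal one-way ANOVA sketch with hypothetical response times for three categories:

from scipy.stats import f_oneway

g1, g2, g3 = [5, 6, 7], [6, 7, 8], [9, 10, 11]
print(f_oneway(g1, g2, g3))    # F statistic and p-value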

Homogeneity of Variance
The F-test is used for testing whether the variances of two samples are similar. The test
statistic F (the ratio of the two sample variances) = S²X / S²Y. The test statistic F
follows an F-distribution with n − 1 and m − 1 Degrees of Freedom if the Null Hypothesis
is true; otherwise it has a non-central F-distribution. The Null Hypothesis is rejected
if F is either too large or too small. The F-test does not work on non-normal
distributions. Other tests for the equality of two variances are Levene's test,
Bartlett's test, and the Brown-Forsythe test.
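Levene's and Bartlett's tests are also available in SciPy (a minimal sketch with made-up
samples):

from scipy.stats import levene, bartlett

a = [5.1, 4.9, 5.3, 5.0, 5.2]
b = [4.2, 5.8, 6.1, 3.9, 5.5]
print(levene(a, b))     # robust to departures from normality
print(bartlett(a, b))   # assumes normality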


Approach to Non-Normal Data


If data is not normal identify reasons for non-normality and, if possible, address those
reasons. Use techniques that do not require normality of data.

Non-Normality
Reasons for Non-Normality
 Extreme Values
 Overlap of Two or More Processes
 Insufficient Data Discrimination
 Sorted Data
 Values close to zero or a natural limit
 Data follows a Different Distribution

Reasons for Non-normality


Extreme Values: Result in a skewed distribution. Clean the data if the number of extreme
values is small: determine measurement errors, data-entry errors and outliers, and remove
them from the data or treat them.

Overlap of Two or More Processes


Data coming from more than one process. Stratify the data based on the process and check
for normality within the resulting data sets.


Insufficient Data Discrimination: an insufficient number of different values. Measurement
devices with poor resolution can make continuous and normally distributed data look
discrete and not normal. This can be overcome by using more accurate measurement systems
or by collecting more data.

Sorted Data
If the data available are a subset of the data produced by a process, e.g., data from
within specific limits of the process only, they might not follow a normal distribution.

Values Close to Zero or a Natural Limit: If many values are close to a natural limit, the
data may not be normal. Transform the variable and check, using a probability plot,
whether the transformed variable follows a normal distribution.

Data Follows a Different Distribution: Use non-parametric tests, which do not assume the
data to be normal.

Non-parametric Tests
Mood’s Median Test
Mood's Median Test is a non-parametric test that does not make assumptions about the
distribution of the data. It tests the equality of medians. The Y variable is continuous,
discrete-ordinal or discrete-count; the X variable is discrete with two or more
attributes.

Example: Comparing the Medians of the monthly satisfaction ratings (Y) of six customers (X)
over the last two years. Comparing the Medians of the number of calls per week (Y) at a
service hotline separated by four different call types (X = complaint, technical question,
positive feedback, or product information) over the last six months.

Mann-Whitney Test
If data is not normally distributed, the non-parametric Mann-Whitney test can be used for
comparing two samples. It is also called the 2-sample rank test. It compares the medians
of two populations.
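A minimal sketch with hypothetical, possibly non-normal samples:

from scipy.stats import mannwhitneyu

a = [3, 5, 7, 9, 11]
b = [4, 8, 10, 14, 18]
print(mannwhitneyu(a, b, alternative='two-sided'))   # U statistic and p-value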

Sign Test
 Non-parametric equivalent of the One Sample T-test
 Can also be used for paired data by calculating the difference between the two
samples
 Compares median of sample to median of population
 Y variable is continuous or discrete-ordinal or discrete-count
 Looks at the number of observations above and below the median.

Hypothesis Tests Summary

 Mood's Median: H0: Median1 = Median2 = … = Mediann
 HOV: H0: σ1² = σ2² = … = σn²
 ANOVA: H0: μ1 = μ2 = … = μn
 Paired T-test: H0: Difference = 0
 One Sample T-test: H0: μ = Target
 Two Sample T-test: H0: μ1 = μ2

Correlation and Regression


Correlation analysis is used for investigating the relationship between two quantitative
variables.
Goals of correlation analysis:
 Analyze if two measurement variables have a relation. This means change in one
influences change in the other measure
 Quantify the strength of the relationship between the variables

Regression analysis identifies the relationship between an outcome variable and one or
more explanatory or input variables in the form of an equation. Correlation test helps to
establish association or relation between two continuous variables and Regression analysis
provides the magnitude of the relation.

Uses of Correlation and Regression


Correlation is used to test if there is an association or relation between two variables:
variation in one variable is related to variation in the other variable. The relation
could be causal, i.e., variation in one variable causes variation in the other.

Regression is used in estimating the magnitude of the relation between variables. Relation is
defined as Variable1= function (Variable2). Based on this relation, value of one variable
corresponding to a particular value of the other variable can be estimated.

Correlation does not imply causation, but a relation. Correlation can also be coincidental.

Examples:
Analysis of Student Grades in Mathematics and English: Use Correlation to determine if the
students who are good at Mathematics tend to be equally good at English. Use Regression to
determine whether the marks in English can be predicted for given marks in Mathematics.

Analysis of Home Runs and Batting: Use Correlation to determine the relationship between
the number of home runs that a major league baseball team hits and its team batting average.
Use Regression to determine the number of home runs for a given batting average.

Correlation Coefficient
Correlation Coefficient (also called Pearson Correlation Coefficient) is a measure of strength
and direction of a linear relation between two variables. Correlation Coefficient r or R is
defined as covariance of variables divided by product of Standard Deviations of the variables.


The Correlation Coefficient ranges from −1 to +1. +1 indicates a perfect positive linear
relation, which means if one value increases, the other also increases in the same
proportion. −1 indicates a perfect negative linear relation, which means if one value
increases, the other decreases in the same proportion. Zero indicates no linear
relationship between the variables.

Examples: A positive correlation between height of a child and age: As the child grows his or
her height increases almost linearly. A negative correlation between temperature and time
babies take to crawl: Babies take longer to learn to crawl in cold months (when they are
bundled in clothes that restrict their movement), than in warmer months.
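A minimal sketch of the height-and-age example with made-up measurements:

from scipy.stats import pearsonr

age = [2, 4, 6, 8, 10]              # hypothetical child ages (years)
height = [86, 102, 115, 128, 138]   # hypothetical heights (cm)
r, p = pearsonr(age, height)
print(r, p)                         # r close to +1 for a near-linear positive relation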

Regression
Regression analysis finds the "line of best fit" for one response variable (continuous)
based on one or more explanatory variables, with statistical methods to assess the
goodness of fit of the model. Regression analysis for two variables X and Y estimates the
relation as Y = f(X). If it is a linear relation, it is defined as Y = a + bX (simple
linear regression).

Example: Age and cholesterol level are correlated. The regression equation based on a
sample was estimated as: Cholesterol level = 156.3 + 0.65 × Age.
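The cholesterol equation above came from fitting sample data; a minimal sketch of the
same kind of fit on hypothetical data:

import numpy as np

age = np.array([25, 35, 45, 55, 65])        # hypothetical ages
chol = np.array([172, 179, 185, 192, 199])  # hypothetical cholesterol levels
b, a = np.polyfit(age, chol, 1)             # slope b, intercept a
print(a, b)                                 # cholesterol ~ a + b * age
print(a + b * 50)                           # predicted level at age 50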

Descriptive Statistics
In statistical analysis, the three fundamental concepts associated with describing data are
Location or Central Tendency, Dispersion or Spread and Shape or Distribution.

Descriptive Statistics helps in identifying potential data problems such as errors, outliers,
and extreme values, identifying process issues and selection of appropriate statistical test
for understanding the underlying relationships/patterns.

The “Average” Story


Alan went for a trek. On the way, he had to cross a stream. As Alan did not know swimming,
he started exploring alternate routes to cross over. Suddenly he saw a sign-post, which said
“Average depth 3 feet”. Alan was 5’7” tall and thought he could safely cross the stream. Alan
never reached the other end and drowned in the stream. Why did this tragedy happen?

Why did Alan Drown?

The “Hotshot” Sales Executive


Kurt works as a sales manager at vsellhomes.com.
In the monthly sales review, Kurt reports that he
will achieve his quarterly target of $1M. Kurt
claims his average deal size is $100,000 and he has
10 deals in his pipeline. Kurt’s boss Ross is very
delighted with his numbers. At the end of quarter,
even after closing 8 deals Kurt fails to meet his
target number and falls short by more than
$500,000.

Why did Kurt fail to achieve his quarterly target?


With 10 deals in pipeline and with average deal
size of $100,000 and converting 7 of those deals,
how did he fail?

Average deal size in pipeline = $100,000

Deal #10 is of significantly higher value than all the other deals and impacts the average
calculation

Median = $55,000 more realistic measure

Median is less susceptible to the influence of Outliers.

Central Tendency
A measure of Central Tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. In other words, the Central Tendency
computes the “center” around which the data is distributed.

Measure of central tendency:


 Mean: calculated by totaling all the values in a data set and dividing by the number of
values. Average value of the data set = Sum of values of all observations/ Total
number of observations.
 Median: the central value in a dataset. The exact middle value of the data; sort the
data and value in the center is the Median.
 Mode: the most frequently occurring value in a given set. The value that occurs the
maximum number of times; count the frequencies of all values and the value with
maximum frequencies is the Mode.

Example: The timings for the Men's 500-meter Speed Skating event in the Winter Olympics
are tabulated, and the central tendency measures are computed from them.

Bull’s Eye
Sam and Paul are throwing darts at the local sports bar. A few of their friends start a
betting pool. Both Sam and Paul shoot 10 practice shots each so that their friends can
decide their bets.

Dispersion Measures
Measures of Dispersion describe the data spread or how far the measurements are from
the center.
 Range
 Variance/standard deviation
 Mean absolute deviation
 Interquartile range

Dispersion Measures: Key Definitions


Quartiles of a distribution are the 3 values that split the data into 4 equal parts. Median (or
Second quartile) divides the data into a lower and upper half. First quartile is the middle of
the lower half (25%). Third quartile is the middle of the upper half (75%).

The inter-quartile range (IQR) is a measure that indicates the extent to which the central
50% of values within the dataset are dispersed.
IQR = Q3 − Q1. The IQR ignores outliers; only two points are used in its estimation.

Percentiles and Deciles

Deciles of a distribution are the 9 values that split the data into 10 equal parts.
Percentiles of a distribution are the 99 values that split the data into a hundred equal
parts.

Shape of a Distribution
The shape of a distribution is described by the following characteristics. Skewness is a
measure of symmetry:

The distribution is symmetric at the centre if each half is a mirror image of the other.
A distribution with fewer observations on the left (toward lower values) is said to be
left skewed. A distribution with fewer observations on the right (toward higher values)
is said to be right skewed. Kurtosis measures whether the distribution is peaked or flat.


Hypothesis Testing
Important Basic Terms
 Population = All possible values
 Sample = A portion of the population
 Inferential Statistics = Generalizing from a sample to a population with a calculated
degree of certainty
 Parameter = A characteristic of the population, e.g., the population mean (µ)
 Statistic = A value calculated from the data in a sample, e.g., the sample mean

Population and Sample

What is Hypothesis Testing?
A hypothesis is a claim (assumption) about a population parameter.
Population Mean Example: The mean monthly cell phone bill in this city is μ = $42.
Population Proportion Example: The proportion of adults in this city with cell phones is π =
0.68.

A hypothesis is a tentative explanation for certain behaviors, phenomena or events that have
occurred or will occur. A statistical hypothesis is an assertion concerning one or more
populations: an educated guess, claim or statement about a property of a population.

The goal in hypothesis testing is to analyze a sample in an attempt to distinguish between
population characteristics that are likely to occur and population characteristics that are
unlikely to occur.

To prove that a hypothesis is true, or false, with absolute certainty, we would need to
examine the entire population, which is practically impossible. Instead, hypothesis testing
concerns how to use a random sample to judge whether the evidence (the data in the sample)
supports the hypothesis about a parameter.

A criminal trial is an example of hypothesis testing without the statistics. In a trial a jury must
decide between two hypotheses. The null hypothesis is H0: The defendant is innocent. The
alternative hypothesis or research hypothesis is H1: The defendant is guilty. The jury does
not know which hypothesis is true. They must make a decision on the basis of evidence
presented.

We make statement(s) regarding unknown population parameter values based on sample data.

• Null hypothesis - A statement regarding the value(s) of unknown population
parameter(s), represented by H0. In our applications it typically implies no
association between the explanatory and response variables, and it always contains
an equality ("=", "≤" or "≥"). It states the claim or assertion to be tested - for
example, the average number of TV sets in U.S. homes is equal to three (H0: μ = 3) -
and it is always about a population parameter, never about a sample statistic.

We begin with the assumption that the null hypothesis is true, similar to the notion
of "innocent until proven guilty". The null hypothesis refers to the status quo or a
historical value, and it may or may not be rejected.

• Alternative hypothesis - A statement contradictory to the null hypothesis (it will
always contain a strict inequality): a statement about the value of a population
parameter that must be true if the null hypothesis is false. Represented by H1 and
stated in one of three forms: >, < or ≠. It is the opposite of the null hypothesis,
e.g. the average number of TV sets in U.S. homes is not equal to 3 (H1: μ ≠ 3). It
challenges the status quo, never contains the "=", "≤" or "≥" sign, may or may not
be proven, and is generally the hypothesis that the researcher is trying to prove.
• Test statistic - A quantity based on the sample data and the null hypothesis, used
to test between the null and alternative hypotheses.
• Rejection region - The values of the test statistic for which we reject the null in
favor of the alternative hypothesis.

The hypothesis we want to test is whether HA is "likely" true. There are two possible
outcomes:
1. Reject H0 and accept HA, because there is sufficient evidence in the sample in
favour of HA.
2. Do not reject H0, because there is insufficient evidence to support HA.

Failure to reject H0 does not mean the null hypothesis is true. There is no formal outcome
that says "accept H0"; we say we "failed to reject H0".

Hypothesis Testing Process


Steps for Hypothesis Testing:
• Null and Alternative Hypothesis
• Test Statistic
• P-Value and Interpretation
• Significance Level

The population mean age is 50. H0: μ = 50, H1: μ ≠ 50.

Suppose the sample mean age was X̄ = 20. This is significantly lower than the claimed
population mean age of 50. If the null hypothesis were true, the probability of getting
such a different sample mean would be very small, so you reject the null hypothesis.
Getting a sample mean of 20 is so unlikely if the population mean is 50 that you conclude
the population mean must not be 50.

The Test Statistic and Critical Values
If the sample mean is close to the assumed population mean, the null hypothesis is not
rejected. If the sample mean is far from the assumed population mean, the null hypothesis is
rejected. The critical value of a test statistic creates a “line in the sand” for decision making -
it answers the question of how far is far enough.

(Figure: sampling distribution of the test statistic; values "too far away" from its mean
fall in the rejection region.)

The test statistic is a value computed from the sample data, and it is used in making the
decision about the rejection of the null hypothesis.
• z-test: test statistic for a mean (σ known)
• t-test: test statistic for a mean (σ unknown)
• Chi-square: test statistic for a standard deviation

Errors in Hypothesis Testing


Type I Error
• Reject a true null hypothesis
• Considered a serious type of error
• The probability of a Type I Error is α
• Called the level of significance of the test
• Set by the researcher in advance

Type II Error
• Failure to reject a false null hypothesis
• The probability of a Type II Error is β

The confidence coefficient (1-α) is the probability of not rejecting H0 when it is true. The
confidence level of a hypothesis test is (1-α)*100%. The power of a statistical test (1-β) is
the probability of rejecting H0 when it is false.

Type I & II Error Relationship

Type I and Type II errors cannot happen at the same time: a Type I error can only occur if
H0 is true, and a Type II error can only occur if H0 is false. The two error probabilities
move in opposite directions: if the Type I error probability (α) increases, the Type II
error probability (β) decreases.

β increases when the difference between hypothesized parameter and its true value
decreases.

Level of Significance and the Rejection Region

This is a two-tail test because there is a rejection region in both tails.

Z Test of Hypothesis for the Mean (σ Known)
Convert the sample statistic X̄ to a ZSTAT test statistic: ZSTAT = (X̄ − μ) / (σ/√n).

Critical Value Approach to Testing


For a two-tail test for the mean with σ known, convert the sample statistic (X̄) to the
test statistic (ZSTAT). Determine the critical Z values for a specified level of
significance α from a table or computer.

Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not
reject H0.

6 Steps in Hypothesis Testing


1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Determine the critical values that divide the rejection and non-rejection regions
5. Collect data and compute the value of the test statistic
6. Make the statistical decision and state the managerial conclusion.

If the test statistic falls into the non-rejection region, do not reject the null hypothesis H0. If
the test statistic falls into the rejection region, reject the null hypothesis. Express the
managerial conclusion in the context of the problem.

Hypothesis Testing Example


Test the claim that the true mean number of TV sets in US homes is equal to 3. (Assume σ =
0.8). State the appropriate null and alternative hypotheses: H0: μ = 3, H1: μ ≠ 3 (this is
a two-tail test). Specify the desired level of significance and the sample size: suppose
that α = 0.05 and n = 100 are chosen for this test. Determine the appropriate technique: σ
is assumed known, so this is a Z test. Determine the critical values: for α = 0.05 the
critical Z values are

±1.96. Collect the data and compute the test statistic. Suppose the sample results are
n = 100, X̄ = 2.84 (σ = 0.8 is assumed known).

So the test statistic is:

ZSTAT = (X̄ − μ) / (σ/√n) = (2.84 − 3) / (0.8/√100) = −0.16 / 0.08 = −2.0

Is the test statistic in the rejection region?

Reach a decision and interpret the result.

Since ZSTAT = -2.0 < -1.96, reject the null hypothesis. Conclude there is sufficient evidence
that the mean number of TVs in US homes is not equal to 3.
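A minimal sketch of this critical-value decision in Python (scipy is used only to look up
the 1.96 cutoff):

import math
from scipy.stats import norm

mu0, sigma, n, xbar, alpha = 3.0, 0.8, 100, 2.84, 0.05

z_stat = (xbar - mu0) / (sigma / math.sqrt(n))  # -2.0
z_crit = norm.ppf(1 - alpha / 2)                # 1.96 for a two-tail test

print(abs(z_stat) > z_crit)                     # True -> reject H0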

P-Value Approach to Testing

The p-value is the probability of obtaining a test statistic equal to or more extreme than
the observed sample value, given that H0 is true. The p-value is also called the observed
level of significance: it is the smallest value of α for which H0 can be rejected.

Compare the p-value with α:

• If p-value < α, reject H0
• If p-value ≥ α, do not reject H0

If the p-value is low then H0 must go.

The 5 Step P-value Approach to Hypothesis Testing


1. State the null hypothesis, H0 and the alternative hypothesis, H1
2. Choose the level of significance, α, and the sample size, n
3. Determine the appropriate test statistic and sampling distribution
4. Collect data and compute the value of the test statistic and the p-value
5. Make the statistical decision and state the managerial conclusion.

If the p-value is < α then reject H0, otherwise do not reject H0. State the managerial conclusion
in the context of the problem.

p-value Hypothesis Testing Example


Test the claim that the true mean number of TV sets in US homes is equal to 3. (Assume σ =
0.8). State the appropriate null and alternative hypotheses: H0: μ = 3, H1: μ ≠ 3 (this is
a two-tail test). Specify the desired level of significance and the sample size: suppose
that α = 0.05 and n = 100 are chosen for this test. Determine the appropriate technique: σ
is assumed known, so this is a Z test. Determine the critical values: for α = 0.05 the
critical Z values are ±1.96. Collect the data and compute the test statistic. Suppose the
sample results are n = 100, X̄ = 2.84 (σ = 0.8 is assumed known).

So the test statistic is:

ZSTAT = (X̄ − μ) / (σ/√n) = (2.84 − 3) / (0.8/√100) = −0.16 / 0.08 = −2.0

How likely is it to get a ZSTAT of -2 (or something further from the mean (0), in either
direction) if H0 is true?

Is the p-value < α?

Since p-value = 0.0456 < α = 0.05, reject H0. State the managerial conclusion in the
context of the situation: there is sufficient evidence to conclude that the average number
of TVs in US homes is not equal to 3.
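A sketch of the same decision via the p-value, computed from the standard normal
distribution:

from scipy.stats import norm

z_stat = -2.0
p_value = 2 * norm.cdf(-abs(z_stat))  # two-tail p-value ~ 0.0455
print(p_value < 0.05)                 # True -> reject H0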

P-Value
A measure of the strength of evidence the sample data provides against the null hypothesis:
• P(evidence this strong or stronger against H0 | H0 is true)
• The smallest value of alpha for which the test results are statistically significant,
or in other words, statistically different from the null hypothesis value
• The smallest value at which the null can still be rejected

Example: with a p-value between 0.01 and 0.05, we fail to reject at a 1% level of
significance, but can reject at 5%.

P-value: p = P(Z ≥ zobs)

Two Tail Tests and Confidence Intervals


For X̄ = 2.84, σ = 0.8 and n = 100, the 95% confidence interval is:

2.84 − 1.96(0.8/√100) ≤ μ ≤ 2.84 + 1.96(0.8/√100)

2.6832 ≤ μ ≤ 2.9968

Since this interval does not contain the hypothesized mean (3.0), we reject the null
hypothesis at α = 0.05.
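The interval can be reproduced with a few lines of Python:

import math
from scipy.stats import norm

xbar, sigma, n, alpha = 2.84, 0.8, 100, 0.05
half_width = norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
print(xbar - half_width, xbar + half_width)  # ~ (2.6832, 2.9968)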

Critical Region (Two Tailed)

Critical Region: Right Tailed Test

Critical Region: Left Tailed Test

Errors in Inference

Type I error
Erroneously rejecting the null hypothesis. Your result is significant (p < .05), so you reject
the null hypothesis, but the null hypothesis is actually true.

Type II error
Erroneously accepting the null hypothesis. Your result is not significant (p > .05), so you
don’t reject the null hypothesis, but it is actually false.

Controlling Type I Errors


The Type I error rate is controlled by the researcher. It is called the alpha rate, and
corresponds to the probability cut-off that one uses in a significance test. By convention,
researchers use an alpha rate of .05. The null hypothesis is rejected only when a statistic is
likely to occur 5% of the time or less when the null hypothesis is true. In principle, any
probability value could be chosen for making the accept/reject decision. 5% is used by
convention.
Power of a Statistical Test
Power = P(reject the null hypothesis when it is false) = 1 − β. (1 − α) is the probability
we retain the null when it is in fact true; (1 − β) is the probability we reject when the
null is in fact false - this is the power of the test. A larger power is preferable. The
power changes depending on what the actual population parameter is.

Example
H0: μ = 2.7, HA: μ > 2.7. Random sample of students: n = 36, s = 0.6; compute X̄.

Decision Rule
Set significance level α = 0.05. If p-value < 0.05, reject null hypothesis.
Let’s consider what our conclusion is based upon different observed sample means.


(For each of the observed sample means considered - the computations are not reproduced
here - the p-value is smaller than 0.05, so we reject the null; in the first case only
just barely.)

Strategy for Designing a Good Hypothesis Test


Use a pilot study to estimate the standard deviation. Specify α, typically 0.01 to 0.10.
Decide what a meaningful difference would be between the mean in the null and the actual
mean. Decide the power, typically 0.80 to 0.99. Use software to determine the sample size.

Consider mean demand for computers during assembly lead time. Rather than estimate the
mean demand, our operations manager wants to know whether the mean is different from
350 units.

In other words, someone is claiming that the mean demand is 350 units and we want to check
this claim out to see if it appears reasonable. We can rephrase this request into a test of
the hypothesis H0: μ = 350. The research hypothesis becomes H1: μ ≠ 350.

Recall: the standard deviation [σ] was assumed to be 75, the sample size [n] was 25, and
the sample mean was calculated to be 370.16.

Example:
When trying to decide whether the mean is not equal to 350, a large value of X̄ (say, 600)
would provide enough evidence. If X̄ is close to 350 (say, 355), we could not say that this
provides a great deal of evidence to infer that the population mean is different from 350.

The two possible decisions that can be made:

• Conclude that there is enough evidence to support the alternative hypothesis (also
stated as: reject the null hypothesis in favor of the alternative)
• Conclude that there is not enough evidence to support the alternative hypothesis
(also stated as: failing to reject the null hypothesis in favor of the alternative)

Do not say that the null hypothesis is accepted when a statistician is around.

The testing procedure begins with the assumption that the null hypothesis is true. Until
there is further statistical evidence, we assume H0: μ = 350 (assumed to be TRUE).

The next step will be to determine the sampling distribution of the sample mean X̄,
assuming the true mean is 350.

Three Ways to Determine
First Way
Unstandardized test statistic: is X̄ in the "guts" of the sampling distribution?

It depends on what you define as the "guts" of the sampling distribution. If we define the
guts as the center 95% of the distribution [this means α = 0.05], then the critical values
that define the guts will be 1.96 standard deviations of X̄ on either side of the mean of
the sampling distribution [350], or:

UCV = 350 + 1.96*15 = 350 + 29.4 = 379.4

LCV = 350 – 1.96*15 = 350 – 29.4 = 320.6

Second Way
Standardized test statistic: the "guts" of the sampling distribution is defined to be the
center 95% [α = 0.05]. If the Z-score for the sample mean X̄ is greater than 1.96, we know
that X̄ will be in the rejection region on the right side. If the Z-score for the sample
mean is less than −1.96, we know that X̄ will be in the rejection region on the left side.

Z = (370.16 – 350)/15 = 1.344

Is this Z-score in the guts of the sampling distribution?

Third Way
The p-value approach (which is generally used with a computer and statistical software):
increase the "rejection region" until it "captures" the sample mean. For this example,
since X̄ is to the right of the mean, calculate P(X̄ > 370.16) = P(Z > 1.344) = 0.0901.

Since this is a two-tailed test, this area is doubled for the p-value: p-value =
2*(0.0901) = 0.1802.

Since we defined the guts as the center 95% [α = 0.05], the rejection region is the other
5%. Since the sample mean X̄ is in the 18.02% region, it cannot be in the 5% rejection
region [α = 0.05].

Unstandardized test statistic: since LCV (320.6) < X̄ (370.16) < UCV (379.4), we fail to
reject the null hypothesis at a 5% level of significance.

Standardized test statistic: since −Zα/2 (−1.96) < Z (1.344) < Zα/2 (1.96), we fail to
reject the null hypothesis at a 5% level of significance.

P-value: since p-value (0.1802) > α (0.05), we fail to reject the null hypothesis at a 5%
level of significance.

Case Study
The mean (μ) body weight of 20-29 year old men is 170 pounds, with standard deviation σ =
40 pounds.

Null hypothesis H0: μ = 170 ("no difference"). The alternative hypothesis can be either
Ha: μ > 170 (one-sided test) or Ha: μ ≠ 170 (two-sided test); in the two-sided test the
rejection region is split equally between the two tails.

For the illustrative example, μ0 = 170 and we know σ = 40. Take an SRS of n = 64.
Therefore, the standard error of the mean = 40/√64 = 5.

If we found a sample mean of 173, then Zstat = (173 − 170)/5 = 0.60.

If we found a sample mean of 185, then Zstat = (185 − 170)/5 = 3.00.

α-Level (Significance Testing)

Let α ≡ probability of erroneously rejecting H0. Set α threshold (e.g., let α = .10, .05, or
whatever). Reject H0 when P ≤ α, Retain H0 when P > α

Example: Set α = .10. Find P = 0.27 → retain H0. Set α = .01. Find P = .001 → reject H0.

Interpretation
Conventions:
P > 0.10 → non-significant evidence against H0

0.05 < P ≤ 0.10 → marginally significant evidence

0.01 < P ≤ 0.05 → significant evidence against H0

P ≤ 0.01 → highly significant evidence against H0

Examples:
P = .27 → non-significant evidence against H0

P = .01 → highly significant evidence against H0

Hypothesis Testing: σ Unknown


If the population standard deviation is unknown, you instead use the sample standard
deviation S. Because of this change, you use the t distribution instead of the Z distribution to
test the null hypothesis about the mean. When using the t distribution you must assume the
population you are sampling from follows a normal distribution. All other steps, concepts,
and conclusions are the same.

t Test of Hypothesis for the Mean (σ Unknown)

Convert the sample statistic (X̄) to a tSTAT test statistic: tSTAT = (X̄ − μ) / (S/√n),
with n − 1 degrees of freedom.

Example: Two-Tail Test (σ Unknown)

The average cost of a hotel room in New York is said to be $168 per night. To determine if
this is true, a random sample of 25 hotels is taken, resulting in an X̄ of $172.50 and an S
of $15.40. Test the appropriate hypotheses at α = 0.05. (Assume the population distribution
is normal.) H0: μ = 168, H1: μ ≠ 168.

Solution:
H0: μ = 168, H1: μ ≠ 168
α = 0.05
n = 25, df = 25 − 1 = 24
σ is unknown, so use a t statistic
Critical values: ±t24,0.025 = ±2.0639
tSTAT = (172.50 − 168) / (15.40/√25) = 4.50/3.08 = 1.46

Since −2.0639 < 1.46 < 2.0639, do not reject H0: there is insufficient evidence that the
true mean cost is different from $168.

Example Two-Tail t Test Using A p-value from Excel


Since this is a t-test we cannot calculate the p-value without some calculation aid. The Excel
output below does this:

Example Two-Tail t Test Using A p-value from Minitab


One-Sample T
Test of mu = 168 vs not = 168
N Mean StDev SE Mean 95% CI T P
25 172.50 15.40 3.08 (166.14, 178.86) 1.46 0.157
p-value > α. So do not reject H0.
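A sketch reproducing the Minitab numbers from the summary statistics (scipy.stats.ttest_1samp
would need the raw data, so the t statistic is computed directly here):

from scipy.stats import t

xbar, mu0, s, n = 172.50, 168.0, 15.40, 25

t_stat = (xbar - mu0) / (s / n ** 0.5)     # 1.46
p_value = 2 * t.sf(abs(t_stat), df=n - 1)  # two-tail p-value ~ 0.157
print(t_stat, p_value)                     # p-value > 0.05 -> do not reject H0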

Two Tail Tests and Confidence Intervals
For X̄ = 172.5, S = 15.40 and n = 25, the 95% confidence interval for µ is:
172.5 − (2.0639)(15.4/√25) to 172.5 + (2.0639)(15.4/√25)
166.14 ≤ μ ≤ 178.86
Since this interval contains the hypothesized mean (168), we do not reject the null
hypothesis at α = 0.05.

One-Tail Tests
In many cases, the alternative hypothesis focuses on a particular direction.
H0: μ ≥ 3 H1: μ < 3. This is a lower-tail test since the alternative hypothesis is focused on
the lower tail below the mean of 3.
H0: μ ≤ 3 H1: μ > 3. This is an upper-tail test since the alternative hypothesis is focused on
the upper tail above the mean of 3.

Lower-Tail Tests
There is only one critical value, since the rejection area is in only one tail.

Upper-Tail Tests
There is only one critical value, since the rejection area is in only one tail.

Example: Upper-Tail t Test for Mean (σ Unknown)
A phone industry manager thinks that customer monthly cell phone bills have increased,
and now average over $52 per month. The company wishes to test this claim. Assume a
normal population.

Form hypothesis test:


H0: μ ≤ 52 the average is not over $52 per month

H1: μ > 52 the average is greater than $52 per month (i.e., sufficient evidence exists to
support the manager’s claim)

Example: Find the Rejection Region

Suppose that α = 0.10 is chosen for this test and n = 25. Find the rejection region:

Example: Decisions
Reach a decision and interpret the result.

Do not reject H0 since tSTAT = 0.55 ≤ 1.318, there is not sufficient evidence that the mean
bill is over $52.

Example: Utilizing the p-value for the Test
Calculate the p-value and compare it to α (the p-value below was calculated using an Excel
spreadsheet).

Do not reject H0, since p-value = .2937 > α = .10.


t Test for the Hypothesis of the Mean

Data
Null Hypothesis                µ = 52.00
Level of Significance          0.1
Sample Size                    25
Sample Mean                    53.10
Sample Standard Deviation      10.00

Intermediate Calculations
Standard Error of the Mean     2.00    =B8/SQRT(B6)
Degrees of Freedom             24      =B6-1
t test statistic               0.55    =(B7-B4)/B11

Upper Tail Test
Upper Critical Value           1.318   =TINV(2*B5,B12)
p-value                        0.2937  =TDIST(ABS(B13),B12,1)
Do Not Reject Null Hypothesis          =IF(B18<B5, "Reject null hypothesis",
                                       "Do not reject null hypothesis")

Proportions
Hypothesis Tests for Proportions
Involves categorical variables. Two possible outcomes:
1. Possesses characteristic of interest
2. Does not possess characteristic of interest

The fraction or proportion of the population in the category of interest is denoted by π.
The sample proportion in the category of interest is denoted by p:

p = X / n = (number in category of interest in sample) / (sample size)

When both nπ and n(1−π) are at least 5, p can be approximated by a normal distribution
with mean µp = π and standard deviation σp = √(π(1−π)/n).

The sampling distribution of p is approximately normal, so the test statistic is a ZSTAT
value:

ZSTAT = (p − π) / √(π(1−π)/n)

Z Test for Proportion: Number in Category of Interest

An equivalent form, in terms of the number in the category of interest, X:

ZSTAT = (X − nπ) / √(nπ(1−π))

Example
A marketing company claims that it receives 8% responses from its mailing. To test this
claim, a random sample of 500 people were surveyed, with 25 responses. Test at the α = 0.05
significance level.

Check:
nπ = (500)(.08) = 40
n(1−π) = (500)(.92) = 460
Both are at least 5, so the normal approximation applies.

Solution:
p = 25/500 = 0.05, so ZSTAT = (0.05 − 0.08) / √(0.08 × 0.92 / 500) = −2.47. Since
−2.47 < −1.96, reject H0 at α = 0.05.

Conclusion: there is sufficient evidence to reject the company's claim of an 8% response
rate.

p-Value Solution
Calculate the p-value and compare it to α. (For a two-tail test the p-value is always
two-tail.)
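A sketch of the whole test in Python:

import math
from scipy.stats import norm

pi0, n, x = 0.08, 500, 25
p_hat = x / n                                            # 0.05

z_stat = (p_hat - pi0) / math.sqrt(pi0 * (1 - pi0) / n)  # ~ -2.47
p_value = 2 * norm.cdf(-abs(z_stat))                     # ~ 0.0134
print(p_value < 0.05)                                    # True -> reject H0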

The Power of the Test

The power of the test is the probability of correctly rejecting a false H0. Suppose we
correctly reject H0: μ ≥ 52 when in fact the true mean is μ = 50.

Type II Error
Suppose we do not reject H0: μ ≥ 52 when in fact the true mean is μ = 50.

Beta
Calculating β
Suppose n = 64, σ = 6, and α = .05.
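A sketch of the β calculation for this setup, assuming the lower-tail test H0: μ ≥ 52 from
the preceding slides with true mean μ = 50:

import math
from scipy.stats import norm

mu0, true_mu, sigma, n, alpha = 52, 50, 6, 64, 0.05
se = sigma / math.sqrt(n)                      # 0.75

# Lower-tail test: reject when the sample mean falls below the cutoff
cutoff = mu0 - norm.ppf(1 - alpha) * se        # ~ 50.766

beta = 1 - norm.cdf((cutoff - true_mu) / se)   # P(do not reject | mu = 50) ~ 0.154
print(beta, 1 - beta)                          # beta and power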

Calculating β and Power of the Test

Conclusions:
• A one-tail test is more powerful than a two-tail test
• An increase in the level of significance (α) results in an increase in power
• An increase in the sample size results in an increase in power

Potential Pitfalls and Ethical Considerations

• Use randomly collected data to reduce selection biases
• Do not use human subjects without informed consent
• Choose the level of significance, α, and the type of test (one-tail or two-tail)
before data collection
• Do not employ "data snooping" to choose between one-tail and two-tail tests, or to
determine the level of significance
• Do not practice "data cleansing" to hide observations that do not support a stated
hypothesis
• Report all pertinent findings, including both statistical significance and practical
importance


Regression Analysis
Scatter Plot
A scatter plot can be used either when one continuous variable is under the control of the
experimenter and the other depends on it, or when both continuous variables are
independent. If a parameter exists that is systematically incremented and/or decremented
by the other, it is called the control parameter or independent variable and is customarily
plotted along the horizontal axis. The measured or dependent variable is customarily
plotted along the vertical axis. If no dependent variable exists, either variable can be
plotted on either axis, and the scatter plot will illustrate only the degree of correlation
(not causation) between the two variables.
A scatter plot can suggest various kinds of correlations between variables with a
certain confidence interval. For example, for weight and height, weight would be on the y
axis and height on the x axis. Correlations may be positive (rising), negative (falling),
or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it
indicates a positive correlation between the variables being studied. If the pattern of
dots slopes from upper left to lower right, it indicates a negative correlation. A line of
best fit (alternatively called a 'trendline') can be drawn in order to study the
relationship between the variables. An equation for the correlation between the variables
can be determined by established best-fit procedures. For a linear correlation, the
best-fit procedure is known as linear regression and is guaranteed to generate a correct
solution in a finite time. No universal best-fit procedure is guaranteed to generate a
correct solution for arbitrary relationships. A scatter plot is also very useful when we
wish to see how well two comparable data sets agree, or to show nonlinear relationships
between variables. The ability to do this can be enhanced by adding a smooth line such as
LOESS. Furthermore, if the data are represented by a mixture model of simple
relationships, these relationships will be visually evident as superimposed patterns.
Examples:

(Scatter-plot examples: linear correlation; no relationship.)

Correlation Coefficient
The population correlation coefficient ρ (rho) measures the strength of the association
between the variables. The sample correlation coefficient r is an estimate of ρ and is used to
measure the strength of the linear relationship in the sample observations.

Sample correlation coefficient:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

Where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable

Features of ρ and r
• Unit free
• Range between −1 and 1
• The closer to −1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship

Examples of Approximate r Values

Calculation Example:

Excel Correlation Output

Significance Test for Correlation


Hypotheses
H0: ρ = 0 (no correlation)

HA: ρ ≠ 0 (correlation exists)

Test statistic (with n − 2 degrees of freedom):

t = r / √[(1 − r²) / (n − 2)]

Example: Is there evidence of a linear relationship between tree height and trunk diameter
at the .05 level of significance?

H0: ρ = 0 (No correlation)


H1: ρ ≠ 0 (correlation exists)
α = .05, df = 8 − 2 = 6

t = r / √[(1 − r²)/(n − 2)] = .886 / √[(1 − .886²)/(8 − 2)] = 4.68

Test Solution:

Decision: Reject H0.
Conclusion: There is evidence of a linear relationship at the 5% level of significance.
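A sketch of this significance test from the summary values (with raw data,
scipy.stats.pearsonr would return r and the two-tailed p-value directly):

import math
from scipy.stats import t

r, n = 0.886, 8
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))  # ~ 4.68
p_value = 2 * t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value < 0.05)                   # True -> reject H0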

Introduction to Regression Analysis

Regression analysis is used to predict the value of a dependent variable based on the
value of at least one independent variable, and to explain the impact of changes in an
independent variable on the dependent variable.
• Dependent variable: the variable that is to be explained
• Independent variable: the variable used to explain the dependent variable

In simple linear regression there is only one independent variable, x; the relationship
between x and y is described by a linear function; and changes in y are assumed to be
caused by changes in x.

Types of Regression Models

Population Linear Regression

Linear Regression Assumptions

• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors is normal
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear

Estimated Regression Model


The sample regression line provides an estimate of the population regression line.

The individual random error terms ei have a mean of zero.

Least Squares Criterion
Let ei = (yi − ŷi) be the prediction error for observation i. The sum of squared errors is
SSE = Σ ei². For a good fit, SSE should be a minimum - hence "least squares".

Minimization for Least Squares Criterion

b0 and b1 are obtained by finding the values that minimize the sum of the squared
residuals. From calculus we know:

β0 = ȳ − β1x̄   ...(1)

β1 = (Σ xiyi − β0 Σ xi) / Σ xi²   ...(2)


The Least Squares Equation

Substituting for β1 from (2) in (1) and rearranging, we get the formulas for β0 and β1:

β1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   β0 = ȳ − β1x̄

Algebraic equivalent:

β1 = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n],   β0 = ȳ − β1x̄

Interpretation
• β0 is the estimated average value of y when the value of x is zero. Traditionally it
is the "bias" of the model.
• β1 is the estimated change in the average value of y as a result of a one-unit change
in x - a sensitivity measure, the "slope" or "rate" of the model.

Simple Linear Regression Example


A real estate agent wishes to examine the relationship between the selling price of a home
and its size (measured in square feet). A random sample of 10 houses is selected

Dependent variable (y) = house price in $1000s

Independent variable (x) = square feet

Sample Data for House Price Model

Excel Output

Graphical Presentation
House price model: Scatter plot and regression line.

Interpretation b0
house price = 98.24833 + 0.10977 (square feet)
b0 is the estimated average value of Y when the value of X is zero (if x = 0 is in the range of
observed x values). Here, no houses had 0 square feet, so: b0 = 98.24833 indicates that, for
houses within the range of sizes observed, $98,248.33 is the portion of the house price not
explained by square feet.

Interpretation b1
b1 measures the estimated change in the average value of Y as a result of a one-unit change
in X. Here, b1 = .10977 tells us that the average value of a house increases by: 0.10977($1000)
= $109.77, on average, for each additional one square foot of size.
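As a sketch, the least-squares formulas translate directly to Python. The ten (size, price)
pairs below are assumed sample data, chosen to be consistent with the fitted equation shown
above, since the manual's data table is not reproduced here:

import numpy as np

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])  # square feet
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])            # price ($1000s)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ~ 98.25 and ~ 0.1098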

Least Squares Regression Properties

• The sum of the residuals from the least squares regression line is 0: Σ(y − ŷ) = 0
• The sum of the squared residuals, Σ(y − ŷ)², is a minimum
• The simple regression line always passes through the mean of the y variable and the
mean of the x variable
• The least squares coefficients are unbiased estimates of β0 and β1

Explained and Unexplained Variation

Total variation is made up of two parts: SST = SSR + SSE.

Where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value

SST = total sum of squares: measures the variation of the yi values around their mean ȳ

SSE = error sum of squares: variation attributable to factors other than the relationship
between x and y

SSR = regression sum of squares: explained variation attributable to the relationship
between x and y


Coefficient of Determination, R2
The coefficient of determination is the portion of the total variation in the dependent
variable that is explained by variation in the independent variable. The coefficient of
determination is also called R-squared and is denoted as R2.

R² = SSR / SST = (sum of squares explained by regression) / (total sum of squares),
where 0 ≤ R² ≤ 1

Note:
In the single independent variable case, the coefficient of determination is: R2 = r2

Where:
R2 = Coefficient of determination
r = Simple correlation coefficient

Examples of Approximate R² Values

R² = 1: perfect linear relationship between x and y; 100% of the variation in y is
explained by variation in x.

0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation
in y is explained by variation in x.


R² = 0: no linear relationship between x and y; the value of y does not depend on x.
(None of the variation in y is explained by variation in x.)

Output

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is
estimated by:

sε = √[SSE / (n − k − 1)]

Where,
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the model

The Standard Deviation of the Regression Slope
The standard error of the regression slope coefficient (b1) is estimated by:

sb1 = sε / √[Σ(x − x̄)²] = sε / √[Σx² − (Σx)²/n]

Where:

sb1 = Estimate of the standard error of the least squares slope

sε = √[SSE / (n − 2)] = Sample standard error of the estimate

Excel Output

Comparing Standard Errors

Inference about the Slope: t Test
The t test for a population slope asks: is there a linear relationship between x and y?

Null and alternative hypotheses:

H0: β1 = 0 (no linear relationship)

H1: β1 ≠ 0 (a linear relationship does exist)

Test statistic (d.f. = n − 2):

t = (b1 − β1) / sb1

Where:
b1 = Sample regression slope coefficient

β1 = Hypothesized slope

sb1 = Estimator of the standard error of the slope

Estimated Regression Equation:


house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.

Does square footage of the house affect its sales price?

Inference about the Slope: t Test Example

Decision: Reject H0

Conclusion: There is sufficient evidence that square footage affects house price.

Regression Analysis for Description


Confidence interval estimate of the slope: b1 ± tα/2 · sb1, d.f. = n − 2

Excel Printout for House Prices:

At 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858). Since
the units of the house price variable is $1000s, we are 95% confident that the average
impact on sales price is between $33.70 and $185.80 per square foot of house size. This
95% confidence interval does not include 0.

Conclusion: There is a significant relationship between house price and square feet at the
.05 level of significance.

Confidence Interval for the Average y, Given x

Confidence interval estimate for the mean of y given a particular xp. The size of the
interval varies according to the distance away from the mean, x̄:

ŷ ± tα/2 · sε · √[1/n + (xp − x̄)² / Σ(x − x̄)²]

Confidence Interval for an Individual y, Given x

Confidence interval estimate for an individual value of y given a particular xp:

ŷ ± tα/2 · sε · √[1 + 1/n + (xp − x̄)² / Σ(x − x̄)²]

This extra term adds to the interval width to reflect the added uncertainty for an individual
case.

Interval Estimates for Different Values of x

Estimated regression equation: house price = 98.25 + 0.1098 (sq. ft.)

Example: House Prices

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850.

Estimation of Mean Values: Example

Confidence interval estimate for E(y)|xp. Find the 95% confidence interval for the average
price of 2,000 square-foot houses. Predicted price Ŷi = 317.85 ($1,000s).

ŷ ± tα/2 · sε · √[1/n + (xp − x̄)² / Σ(x − x̄)²] = 317.85 ± 37.12

The confidence interval endpoints are 280.66-354.90, or from $280,660-$354,900.

Estimation of Individual Values: Example
Prediction interval estimate for y|xp. Find the 95% prediction interval for an individual
house with 2,000 square feet. Predicted price Ŷi = 317.85 ($1,000s).

ŷ ± tα/2 · sε · √[1 + 1/n + (xp − x̄)² / Σ(x − x̄)²] = 317.85 ± 102.28

The prediction interval endpoints are 215.50-420.07, or from $215,500-$420,070.
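A sketch that reproduces both intervals, reusing the assumed house data from the earlier
fitting sketch:

import numpy as np
from scipy.stats import t

x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])
n, xp = len(x), 2000

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
s_e = np.sqrt(((y - (b0 + b1 * x)) ** 2).sum() / (n - 2))  # standard error of estimate

y_hat = b0 + b1 * xp                                       # ~ 317.8
lev = 1 / n + (xp - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()
t_crit = t.ppf(0.975, df=n - 2)

print(y_hat, t_crit * s_e * np.sqrt(lev))      # mean response: ~ +/- 37.1
print(y_hat, t_crit * s_e * np.sqrt(1 + lev))  # individual house: ~ +/- 102.3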


Logistic Regression
Review of Linear Estimation
We know how to handle linear estimation models of the type:

Y = β0 + β1X1 + … + βnXn + ε ≡ Xβ + ε

Sometimes the equation can be made linear by transforming or adding variables:

• Taking logs of Y and/or the X's
• Adding squared terms
• Adding interactions

Multiple Linear Regression

We model the mean of a numeric response as a linear combination of the predictors
themselves, or of some functions based on the predictors. In the general case the terms in
the model are k different functions of the n predictors; for the classic multiple
regression model the terms are the predictors themselves:

E(Y|X) = β0 + β1X1 + … + βnXn

The regression coefficients βi represent the estimated change in the mean of the response
Y associated with a unit change in Xi while the other predictors are held constant. They
measure the association between Y and Xi adjusted for the other predictors in the model.

Non-linear Estimation
In all these models Y, the dependent variable, was continuous. Independent variables could
be dichotomous (dummy variables), but not the dependent variable. Non-linear estimation
comes in with dichotomous Y variables.

These arise in many data problems that we intend to model:

• Customer purchases item online: Yes/No
• Credit scoring [good customer]: Yes/No
• Student will pass a mathematics course: Yes/No
• Involved in an armed conflict: Yes/No

Logistic Regression
Models the relationship between a set of variables Xi
• Dichotomous (yes/no, smoker/nonsmoker, …)
• Categorical (social class, race, …)
• Continuous (age, weight, gestational age, …)
and a dichotomous categorical response variable Y.

Examples:
• Success/Failure, Remission/No Remission
• Survived/Died, CHD/No CHD, Low Birth Weight/Normal Birth Weight

Link Functions
A link function is a function linking the actual Y to the estimated Y in an econometric
model.

Example: Logs
• Start with Y = Xβ + ε
• Then change to log(Y) ≡ Y′ = Xβ + ε
• Run this like a regular OLS equation
• Then you have to "back out" the results

If the coefficient on some particular X is β, then a 1-unit ΔX produces a β change in
Y′ = log(Y), i.e. Y is multiplied by e^β. Since for small values of β, e^β ≈ 1 + β, this
reads as roughly a (100·β)% increase in Y. This is why natural log transformations are
used rather than base-10 logs. In general, a link function is some F(⋅) such that
F(Y) = Xβ + ε; in the example, F(Y) = log(Y).

Dichotomous Dependent Variables

How does this apply to situations with dichotomous dependent variables?
Assume that Y ∈ {0,1}. What would happen if this were run as a linear regression?
As a specific example, take the election of minorities to the Uttar Pradesh legislature:
• Y = 1, minority elected
• Y = 0, minority not elected

Linear Fit: Dichotomous Dependent Variable
The line doesn't fit the data very well, and if we take values of Y between 0 and 1 to be
probabilities, it doesn't make sense, since the line is not confined to [0, 1].

Redefining the Dependent Variable

How do we solve this problem?
Transform the dichotomous Y into a continuous variable Y′ ∈ (−∞, +∞). A link function is
needed that takes the dichotomous Y to a Y′ that is real-valued and continuous.

Once we have Y′, we can solve F(Y) = Y′ = Xβ + ε.

What function F(Y) goes from the [0, 1] interval to the real line?
At least one function is known that goes the other way around: given any real value, it
produces a number (probability) between 0 and 1. This is the cumulative normal
distribution Φ; given any Z-score, Φ(Z) ∈ [0, 1]. So:

Y = Φ(Xβ + ε)
Φ⁻¹(Y) = Xβ + ε
Y′ = Xβ + ε

The link function F(Y) = Φ⁻¹(Y) is known as the Probit link. The term was coined in the
1930's by biologists studying the dosage-cure rate link; it is short for "probability
unit".


In a Probit model, the value of Xβ is taken to be the z-value of a normal distribution;
higher values of Xβ mean that the event is more likely to happen. One has to be careful
about the interpretation of estimation results here: a one-unit change in Xi leads to a βi
change in the z-score of Y. The estimated curve is an S-shaped cumulative normal
distribution.

This fits the data much better than the linear estimation and always lies between 0 and 1.
One can estimate, for instance, the BVAP at which Pr(Y = 1) = 50% - the "point of equal
opportunity".

Consider the problem of transforming Y from {0,1} to the real line. Here is an alternative
approach based on the odds ratio: if some event occurs with probability p, then the odds
of it happening are O(p) = p/(1−p).
• p = 0 → O(p) = 0
• p = ¼ → O(p) = 1/3 ("odds are 1-to-3 against")
• p = ½ → O(p) = 1 ("even odds")
• p = ¾ → O(p) = 3 ("odds are 3-to-1 in favor")
• p = 1 → O(p) = ∞

The odds ratio is always non-negative. As a final step, then, take the log of the odds
ratio.

Logit Function
logit(Y) = log[O(Y)] = log[y/(1−y)]

Why is it needed?
At first, this was computationally easier than working with normal distributions. It has
some properties that can be investigated with multinomial dependent variables. The density
function associated with it is very close to a standard normal distribution.

The logit function is similar, but the logistic distribution has slightly fatter tails
than the normal distribution.

This translates back to the original Y as:

log[Y / (1 − Y)] = Xβ

Y / (1 − Y) = e^(Xβ)

Y = e^(Xβ) / (1 + e^(Xβ))
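A minimal sketch of the two directions of this transformation:

import numpy as np

def logit(p):
    # Log odds: maps a probability in (0, 1) to the real line
    return np.log(p / (1 - p))

def logistic(xb):
    # Inverse logit: maps the real line back to a probability in (0, 1)
    return np.exp(xb) / (1 + np.exp(xb))

print(logit(0.75))            # log(3) ~ 1.099 ("odds 3-to-1 in favor")
print(logistic(logit(0.75)))  # 0.75 - the two functions are inverses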

Latent Variables
One way to state what's going on is to assume that there is a latent variable Y* such that
Y* = Xβ + ε, with ε ~ N(0, σ²).

Logistic Regression
Similar to linear regression, with two main differences: Y (the outcome or response) is
categorical - Yes/No, Approve/Reject or Responded/Did not respond - and the result is
expressed as a probability of being in either group.

Y′ = logit(Y) = Xβ

Note: Pr(Y = 1 | X) = e^(Xβ) / (1 + e^(Xβ))

where:
"exp" or "e" is the exponential function (e = 2.71828…)
p is the probability that the event y occurs given x, and can range between 0 and 1
p/(1−p) is the "odds ratio"
ln[p/(1−p)] is the log odds ratio, or "logit"

All other components of the regression model are the same.

Coronary Heart Disease (CD) and Age: sampled individuals were examined for signs of CD
(present = 1 / absent = 0), and the potential relationship between this outcome and their
age (in years) was considered.

This is a portion of the raw data for the 100 subjects who participated in the study.
How can we analyse these data?

The mean age of the individuals with some signs of coronary heart disease is 51.28 years
vs. 39.18 years for individuals without signs (t = 5.95, p < .0001).


E(CD | Age) = −.54 + .02 × Age

e.g. for an individual 50 years of age:

E(CD | Age = 50) = −.54 + .02 × 50 = .46 ??

The smooth regression estimate is "S-shaped", but what does the estimated mean value
represent?
Answer: P(CD | Age)

We can group individuals into age classes and look at the percentage/proportion showing
signs of coronary heart disease. Notice the "S-shape" to the estimated proportions vs. age.

Logistic Function

Logit Transformation
The logistic regression model is given by:

ln[P / (1 − P)] = β0 + β1X

This is called the Logit Transformation.

Dichotomous Predictor
Consider a dichotomous predictor (X) which represents the presence of risk (1 = present).

For 0-1 coding, the odds ratio associated with risk presence is OR = e^β1. Taking the
natural logarithm: the estimated regression coefficient associated with a 0-1 coded
dichotomous predictor is the natural log of the OR associated with risk presence,
ln(OR) = β1.

The logistic model can be written so that the odds for success are expressed as:

ln[ P(Y|X) / (1 − P(Y|X)) ] = ln[ P / (1 − P) ] = βo + β1X

P / (1 − P) = e^(βo + β1X)

Consider a dichotomous predictor (X) which represents the presence of risk, coded +1 / −1:

Disease (Y)     Risk Present (X = 1)     Risk Absent (X = −1)
Yes (Y = 1)     P(Y = 1 | X = 1)         P(Y = 1 | X = −1)
No (Y = 0)      1 − P(Y = 1 | X = 1)     1 − P(Y = 1 | X = −1)

Odds for disease with risk present: e^(βo + β1)
Odds for disease with risk absent: e^(βo − β1)

Therefore the odds ratio is:

OR = e^(βo + β1) / e^(βo − β1) = e^(2β1)

Taking the natural logarithm: twice the estimated regression coefficient associated with a
+1 / −1 coded dichotomous predictor is the natural log of the OR associated with risk
presence, ln(OR) = 2β1.

Example: Signs of CD and Age

Fit the model with Y = CD (CD if signs are present, No otherwise) and X = Age (years).

Consider the risk associated with a c-year increase in age:

OR = (Odds for Age = x + c) / (Odds for Age = x) = e^(βo + β1(x + c)) / e^(βo + β1x) = e^(cβ1)

For example, consider a 10-year increase in age and find the associated OR for showing
signs of CD: with c = 10, OR = e^(cb) = e^(10 × .111) = 3.03. We estimate that the odds of
exhibiting signs of CD increase threefold for each 10 years of age. Similar calculations
can be done for other increments: for a c = 1 year increase, OR = e^b = e^.111 ≈ 1.12, or
about a 12% increase in odds per year.
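A quick sketch of these odds-ratio calculations from the fitted slope:

import math

b1 = 0.111                      # estimated logistic slope for Age
for c in (1, 10):
    print(c, math.exp(c * b1))  # OR ~ 1.12 per year, ~ 3.03 per decade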

Is it reasonable to assume that the increase in risk associated with a c-unit increase is
constant throughout one's life? Is the increase going from 20 → 30 years of age the same
as going from 50 → 60 years? If that assumption is not reasonable, then one must be
careful when discussing risk associated with a continuous predictor.

Example: Age at 1st Pregnancy and Cervical Cancer

Use Fit Model: Y = Disease Status and X = Risk Factor Status. When the response Y is a
dichotomous categorical variable, the Personality box will automatically change to Nominal
Logistic, i.e. logistic regression will be used. Remember that for a dichotomous
categorical predictor JMP uses +1/−1 coding; if you want, you can code the levels as 0-1
and treat the predictor as numeric.

β̂o = −2.183
β̂1 = 0.607

The estimated odds ratio is:

ln(OR) = 2β̂1 = 2(.607) = 1.214

OR = e^1.214 = 3.37

Women whose first pregnancy is at or before age 25 have 3.37 times the odds of developing
cervical cancer compared to women whose first pregnancy occurs after age 25.

Putting It All Together


Knowing how to interpret each of the variable types in a logistic model, one can fit
multiple logistic regression models with all of the variable types included, and examine
the risk associated with certain factors adjusted for the other covariates in the model.


Time Series Modeling


The Importance of Forecasting
Governments forecast unemployment rates, interest rates, and expected revenues from
income taxes for policy purposes. Marketing executives forecast demand, sales, and
consumer preferences for strategic planning. College administrators forecast enrollments to
plan for facilities and for faculty recruitment. Retail stores forecast demand to control
inventory levels, hire employees and provide training.

Common Approaches to Forecasting

Time-Series Data
Numerical data obtained at regular time intervals. The time intervals can be annual,
quarterly, monthly, weekly, daily, hourly, etc.

Example:

A time-series plot is a two-dimensional plot of time series data. The vertical axis
measures the variable of interest; the horizontal axis corresponds to the time periods.

Time Series Components

Trend Component
Long-run increase or decrease over time (overall upward or downward movement), seen in
data taken over a long period of time. A trend can be upward or downward, and linear or
non-linear.

(Plots: downward linear trend; upward nonlinear trend - Sales vs. Time.)

Seasonal Component
Short-term regular wave-like patterns. Observed within 1 year. Often monthly or quarterly.

Cyclical Component
Long-term wave-like patterns. Regularly occur but may vary in length. Often measured
peak to peak or trough to trough.

Irregular Component
Unpredictable, random, “residual” fluctuations. Due to random variations of nature and
accidents or unusual events. “Noise” in the time series.

Smoothing Methods
A time series plot helps to figure out whether there is a trend component. Often it helps if
one can “smooth” the time series data. Two popular smoothing methods are:
1. Moving Averages: Calculate moving averages to get an overall impression of the
pattern of movement over time. Averages of consecutive time series values for a
chosen period of length L

2. Exponential Smoothing: A weighted moving average.

Moving Averages
• Used for smoothing
• A series of arithmetic means over time
• The result depends on the choice of L (the length of the period for computing means)
• The last moving average of length L can be extrapolated one period into the future
for a short-term forecast

Examples: for a 5-year moving average, L = 5; for a 7-year moving average, L = 7, etc.

Y1  Y2  Y3  Y4  Y5
First average: MA(5) 
5
Y2  Y3  Y4  Y5  Y6
Second average: MA(5) 
5
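A sketch of the same computation with pandas (the sales figures are placeholders):

import pandas as pd

sales = pd.Series([23, 40, 25, 27, 32, 48, 33, 37, 37, 50, 40])  # hypothetical annual sales
ma5 = sales.rolling(window=5).mean()  # mean of each consecutive block of 5 years
print(ma5.dropna())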

Annual Data

Calculating Moving Averages

Each moving average is for a consecutive block of 5 years.

Annual vs. Moving Average


The 5-year moving average smooths the data and makes it easier to see the underlying
trend.
(Chart: "Annual vs. 5-Year Moving Average" - Sales plotted against Year for the annual
series and its 5-year moving average.)

Exponential Smoothing
Used for smoothing and short-term forecasting (one period into the future). It is a weighted
moving average whose weights decline exponentially, so the most recent observation is
weighted most heavily.

The weight (smoothing coefficient) is W


 Subjectively chosen
 Ranges from 0 to 1
 Smaller W gives more smoothing, larger W gives less smoothing

Choose a weight close to 0 to smooth out unwanted cyclical and irregular components;
choose a weight close to 1 for forecasting.

Exponential smoothing model

E1 = Y1

Ei = W·Yi + (1 − W)·Ei−1   for i = 2, 3, 4, …

Where:
Ei = exponentially smoothed value for period i

Ei-1 = exponentially smoothed value already computed for period i - 1

Yi = observed value in period i

W = weight (smoothing coefficient), 0 < W < 1

Example
Suppose we use weight W = 0.2

Fluctuations have been smoothed


NOTE: The smoothed values in this case run a little low, since the trend is upward sloping
and the weighting factor is only 0.2.

[Figure: sales and the exponentially smoothed series plotted against time periods 1–10]

The smoothed value in the current period i is used as the forecast value for the next period
(i + 1): Ŷi+1 = Ei.

In Excel, use Data Analysis → Exponential Smoothing.
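A minimal Python sketch of this recursion (the sales series is hypothetical):

def exponential_smoothing(y, w=0.2):
    # E1 = Y1; Ei = W*Yi + (1 - W)*E(i-1) for i = 2, 3, ...
    smoothed = [float(y[0])]
    for value in y[1:]:
        smoothed.append(w * value + (1 - w) * smoothed[-1])
    return smoothed

sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]   # hypothetical data
E = exponential_smoothing(sales, w=0.2)
forecast_next = E[-1]   # the forecast for period i+1 is Ei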

Three Methods for Trend-Based Forecasting

Linear Trend Forecasting


Estimate a trend line using regression analysis, with time (X) as the independent variable:

Ŷ = b0 + b1·X


The linear trend forecasting equation is Ŷi = 21.905 + 9.5714·Xi.

Forecast for time period 6:

Ŷ = 21.905 + 9.5714·(6) = 79.33

Nonlinear Trend Forecasting
A nonlinear regression model can be used when the time series exhibits a nonlinear trend.
Quadratic form is one type of nonlinear model: Yi = β0 + β1·Xi + β2·Xi² + εi

Compare the adjusted r² and standard error to those of the linear model to see whether this
is an improvement. Other functional forms can be tried to get the best fit.

Exponential Trend Model


Another nonlinear trend model: Yi = β0 · β1^Xi · εi

Transform to linear form: log(Yi) = log(β0) + Xi·log(β1) + log(εi)

Exponential trend forecasting equation: log(Ŷi) = b0 + b1·Xi

where b0 = estimate of log(β0)

b1 = estimate of log(β1)

Interpretation: (β̂1 − 1) × 100% is the estimated annual compound growth rate (in %).

Trend Model Selection Using Differences


Use a linear trend model if the first differences are approximately constant:

(Y2 − Y1) = (Y3 − Y2) = ⋯ = (Yn − Yn−1)

Use a quadratic trend model if the second differences are approximately constant:

[(Y3 − Y2) − (Y2 − Y1)] = [(Y4 − Y3) − (Y3 − Y2)] = ⋯ = [(Yn − Yn−1) − (Yn−1 − Yn−2)]

Use an exponential trend model if the percentage differences are approximately constant:

(Y2 − Y1)/Y1 × 100% = (Y3 − Y2)/Y2 × 100% = ⋯ = (Yn − Yn−1)/Yn−1 × 100%
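These three checks are easy to automate; a sketch in Python (numpy assumed available):

import numpy as np

def difference_diagnostics(y):
    y = np.asarray(y, dtype=float)
    first = np.diff(y)            # roughly constant -> linear trend model
    second = np.diff(y, n=2)      # roughly constant -> quadratic trend model
    pct = 100 * first / y[:-1]    # roughly constant -> exponential trend model
    return first, second, pct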

Autoregressive Modeling
Used for forecasting. Takes advantage of autocorrelation
 1st order - correlation between consecutive values
 2nd order - correlation between values 2 periods apart

pth order Autoregressive model:

Yi = A0 + A1·Yi−1 + A2·Yi−2 + ⋯ + Ap·Yi−p + δi

Example: The Office Concept Corp. has acquired a number of office units (in thousands of
square feet) over the last eight years. Develop the second order Autoregressive model.

Year: 97, 98, 99, 00, 01, 02, 03, 04

Units: 4, 3, 2, 3, 2, 2, 4, 6

Develop the 2nd order table. Use Excel or Minitab to estimate a regression model.

Use the second-order equation to forecast the number of units for 2005:

Ŷi = 3.5 + 0.8125·Yi−1 − 0.9375·Yi−2

Ŷ2005 = 3.5 + 0.8125·(Y2004) − 0.9375·(Y2003)
      = 3.5 + 0.8125·(6) − 0.9375·(4)
      = 4.625

Autoregressive Modeling Steps


1. Choose p (note that df = n – 2p – 1)
2. Form a series of “lagged predictor” variables Yi-1 , Yi-2 , … ,Yi-p
3. Use Excel or Minitab to run regression model using all p variables
4. Test significance of Ap
 If null hypothesis rejected, this model is selected
 If null hypothesis not rejected, decrease p by 1 and repeat

Choosing a Forecasting Model


Perform a residual analysis and look for patterns or trends. Measure the magnitude of the
residual error using squared differences or absolute differences. Prefer the simplest
adequate model (principle of parsimony).

Residual Analysis

Measuring Errors
Choose the model that gives the smallest measurement error.

Sum of squared errors: SSE = Σi=1..n (Yi − Ŷi)²

SSE is sensitive to outliers.

Mean Absolute Deviation: MAD = Σi=1..n |Yi − Ŷi| / n

MAD is less sensitive to extreme observations.

Principle of Parsimony
Suppose two or more models provide a good fit for the data; select the simplest model.

Simplest model types:


 Least-squares linear
 Least-squares quadratic
 1st order autoregressive

More complex types:


 2nd and 3rd order autoregressive
 Least-squares exponential

Forecasting With Seasonal Data


Time series are often collected monthly or quarterly. These time series often contain a
trend component, a seasonal component, and the irregular component. Suppose the
seasonality is quarterly. Define three new dummy variables for quarters:

 Q1 = 1 if first quarter, 0 otherwise

 Q2 = 1 if second quarter, 0 otherwise
 Q3 = 1 if third quarter, 0 otherwise

(Quarter 4 is the default if Q1 = Q2 = Q3 = 0.)

Exponential Model with Quarterly Data


Yi = β0 · β1^Xi · β2^Q1 · β3^Q2 · β4^Q3 · εi

(β1–1) x 100% is the quarterly compound growth rate

βi provides the multiplier for the ith quarter relative to the 4th quarter (i = 2, 3, 4)

Transform to linear form:


log(Yi) = log(β0) + Xi·log(β1) + Q1·log(β2) + Q2·log(β3) + Q3·log(β4) + log(εi)

Estimating the Quarterly Model


Exponential forecasting equation

log(Ŷi) = b0 + b1·Xi + b2·Q1 + b3·Q2 + b4·Q3

where b0 = estimate of log(β0), so 10^b0 = β̂0
b1 = estimate of log(β1), so 10^b1 = β̂1, etc.
(β̂1 − 1) × 100% = estimated quarterly compound growth rate (in %)
β̂2 = estimated multiplier for the first quarter relative to the fourth quarter
β̂3 = estimated multiplier for the second quarter relative to the fourth quarter
β̂4 = estimated multiplier for the third quarter relative to the fourth quarter

Quarterly Model Example


Suppose the forecasting equation is log(Ŷi) = 3.43 + 0.017·Xi + 0.082·Q1 + 0.073·Q2 + 0.022·Q3


Index Numbers
Index numbers allow relative comparisons over time. Index numbers are reported relative
to a Base Period Index. Base period index = 100 by definition. Used for an individual item or
group of items.

Ii = (Pi / Pbase) × 100

where
Ii = index number for year i
Pi = price for year i
Pbase = price for the base year

Example:
Airplane ticket prices from 1995 to 2003.

Index Numbers: Interpretation
Prices in 1996 were 90% of base-year prices:

I1996 = (P1996 / P2000) × 100 = (288 / 320) × 100 = 90

Prices in 2000 were 100% of base-year prices (by definition, since 2000 is the base year):

I2000 = (P2000 / P2000) × 100 = (320 / 320) × 100 = 100

Prices in 2003 were 120% of base-year prices:

I2003 = (P2003 / P2000) × 100 = (384 / 320) × 100 = 120
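The same computations in a short Python sketch, using only the three prices quoted above:

prices = {1996: 288, 2000: 320, 2003: 384}   # from the example
base = prices[2000]                          # base period index = 100 by definition
index = {year: p / base * 100 for year, p in prices.items()}
print(index)   # {1996: 90.0, 2000: 100.0, 2003: 120.0}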

Aggregate Price Indexes


An aggregate index is used to measure the rate of change from a base period for a group of
items.

Unweighted Aggregate Price Index


Unweighted aggregate price index formula:

IU(t) = ( Σi=1..n Pi(t) / Σi=1..n Pi(0) ) × 100

where
i = item
t = time period
n = total number of items
IU(t) = unweighted price index at time t
Σi=1..n Pi(t) = sum of the prices for the group of items at time t
Σi=1..n Pi(0) = sum of the prices for the group of items in time period 0

Example

I2004 = (ΣP2004 / ΣP2001) × 100 = (410 / 345) × 100 = 118.8
Unweighted total expenses were 18.8% higher in 2004 than in 2001.

Weighted Aggregate Price Indexes

Common Price Indexes


 Consumer Price Index (CPI)
 Producer Price Index (PPI)
 Stock Market Indexes:
o Dow Jones Industrial Average
o S&P 500 Index
o NASDAQ Index

Pitfalls in Time-Series Analysis


Assuming that the mechanism that governed the time series in the past will still hold in
the future. Using mechanical extrapolation of the trend to forecast the future without
considering judgment, business experience, changing technologies, changing habits, etc.


Introduction to Machine Learning


Human Body Temperature Distribution According to Emotional State

Threat Perception in Real Time Security Systems


Can we recognize emotions from posture and thermographs of an individual in a group or
crowd? Can we link recognized emotions to predict possible threat from an individual in a
crowd monitored using thermographs and gait analysis from video cameras?

Desired features of an algorithm:


 Programmable/automatable
 Realistic computation time
 Accurate and precise
 Not overly dependent on the accuracy/precision of the data

Considerations for the data: Required accuracy and precision in thermograph and video
resolution, magnification.

Machine Learning
Machine learning comprises algorithms and techniques used for data analytics. It studies
how to automatically learn to make accurate predictions based on past observations.
Machine learning is programming computers to optimize a performance criterion by tuning
a set of parameters; the tuned programs then perform the same task on unseen data.

Diagrammatic Representation

What Can Machine Learning Do?


Machine Learning is used when…
1. Human Expertise does not exist. E.g. Navigating on Mars
2. Humans are unable to explain their expertise. E.g.: Speech Recognition/Mine
Detection
3. Solution changes or evolves in time. E.g.: Routing on a computer network
4. Solution needs to be adapted to particular cases. E.g.: user biometrics, virtual-agent-based
solutions

Applications
 Finance: Credit scoring, fraud detection
 Manufacturing: Optimization, troubleshooting
 Bioinformatics: Motifs, alignment

 Web mining: Search engines
 Retail: Market basket analysis, Customer relationship management (CRM)
 Medicine: Medical diagnosis, Prognosis
 Telecommunications: Quality of service optimization

Algorithms & Machine Learning Models


The success of a machine learning system also depends on the algorithms. The algorithms
control the search to find and build the knowledge structures. The learning algorithms
should extract useful information from the training examples.

Machine Learning Algorithms:


 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

Supervised Learning
We are given attributes X and targets y from a knowledgeable external supervisor. Typical
supervised methods:
 Regression
 Classification
 Decision trees
 Random forest

[Diagram: training data → feature extraction → ML algorithm → ML model → performance metric]

Regression: Examples
 Reading your mind: Happiness state is related to brain-region intensities.
 Predicting stock prices from recent stock prices, news events and related commodities.

Classification: Examples
Credit Scoring: Differentiating between low-risk and high-risk customers from their income
and savings


Another example: the outlook of the day and weather derivatives.

Classification: Applications
These applications are also known as Pattern Recognition.
 Face Recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style.
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency. Use of a dictionary or the syntax of the
language.
 Sensor fusion: Combine multiple modalities; Example: visual (lip image) and
acoustic for speech.
 Medical diagnosis: From symptoms to illnesses.
 Web Advertising: Predict if a user clicks on an ad on the Internet.

Unsupervised Learning
We are given only attributes X and no targets. Typical unsupervised tasks:
 Clustering
 Finding associations (in features)
 Image compression
 Probability distribution estimation
 Dimension reduction

[Diagram: training data → feature extraction → ML algorithm → ML model → performance metric; intelligence: segmentation, patterns, clusters]

Document Clustering and Text Mining

Lingo4G: Large-scale text clustering
 Topic Discovery
 Document Clustering
 Document Retrieval
 No External Taxonomies
 Scalable

Learning Associations
Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X
and Y are products/services. Example: P(chips | beer) = 0.7.

Market-Basket Transactions

Object Recognition
 Image Object Recognition: Recognize objects in the image
 Blind Source Separation: Recognize source/s in a mixed music signal

Reinforcement Learning
 Mimics an intelligent system
 Observes the interaction of the environment and the system's actions
 Optimizes goals/rewards, leading to continuous self-learning
 It is not a single method but a whole process for building knowledge
 Takes corrective action even if the system sees a new situation

Applications of Reinforcement Learning


1. Decision Making
2. Robot, Chess Machine
3. Stochastic Approximations
4. Optimal Control Theory

Machine Learning and Traditional Statistics

 Machine learning emphasizes predictions, usually with no super-population model
specified; traditional statistics emphasizes super-population inference.
 Machine learning evaluates results via prediction performance; traditional statistics
focuses on a-priori hypotheses.
 Machine learning is concerned with overfitting but not with model complexity per se;
traditional statistics prefers simpler models over complex ones (parsimony), even if the
more complex models perform slightly better.
 Machine learning emphasizes performance; traditional statistics emphasizes parameter
interpretability.
 In machine learning, generalizability is obtained through performance on novel datasets;
in traditional statistics, modelling or sampling assumptions connect the data to a
population of interest.
 Machine learning is concerned with performance and robustness; traditional statistics is
concerned with assumptions and robustness.

Machine Learning Design Study
Data Science Process

Data Scientist Role

Course Content: Machine learning Algorithms


Distance-Based Linear Models


Partitional Clustering
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one
another and different from (or unrelated to) the objects in other groups.

Notion of a Cluster can be Ambiguous

What is Clustering?
Clustering attaches a label to each observation (data point) in a set; it comes under
“unsupervised classification”. Clustering is alternatively called “grouping” when we want
to assign the same label to data points that are “close” to each other. Clustering algorithms
rely on a distance metric between data points; sometimes the distance metric is more
important than the clustering algorithm.

Why Clustering?
 Understanding: Group related documents for browsing, group genes and proteins
that have similar functionality, or group stocks with similar price fluctuations
 Summarization: Reduce the size of large data sets and the feature selection pool
 Data compression

What is NOT Cluster Analysis?
 Supervised Classification: Have class label information.
 Simple Segmentation: Dividing students into different registration groups
alphabetically, by last name.
 Results of a Query: Groupings are a result of an external specification.
 Graph Partitioning: Some mutual relevance and synergy, but areas are not
identical.

Types of Clustering
A clustering is a set of clusters

Partitional Clustering

Hierarchical Clustering

Notion of “Closeness” Based on Distance Metric


One can intuitively relate objects based on the notion of a “distance metric” and then
group/cluster these objects by comparing distances.
 Shops close to a person’s house: the metric is physical distance.
 Groups of people based on their mother tongue: the metric is the language they speak
(all dialects of a particular language are ignored).
 Groups of people such as the poor and billionaires: the metric is the wealth they have amassed.

Metric: Mathematically Speaking


A metric or distance function is a function that defines a distance between each pair of
elements of a set.

A metric on a set X is a function


d: X × X → [0, ∞),
and for all x, y, 𝑧 ∈ 𝑋 following conditions are satisfied
1. 𝑑(𝑥, 𝑦) ≥ 0 non-negativity or separation axiom
2. 𝑑(𝑥, 𝑦) = 0 ⟺ 𝑥 = 𝑦 identity of indiscernible
3. 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥) symmetry
4. 𝑑(𝑥, 𝑧) ≤ 𝑑(𝑥, 𝑦) + 𝑑(𝑦, 𝑧) triangle inequality

Where [0, ∞) is the set of non-negative real numbers.

Some Examples

Clustering Strategy using Distance Metric


To partition the data into non-overlapping clusters:
 We need to define boundaries of these clusters.
 Within the cluster (“Intra-Cluster”) all data points have “similar” measure with
respect to distance metric we used.
 “Inter-cluster” distance is at least equal to largest distance between a data point in
the cluster and its centre.

For example: Two clusters for the shop near my house and my friend’s house. This
philosophy is implemented in “K-Means Clustering Algorithm”.

K-means Overview
 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple

Algorithm: k-means
1. Decide on a value for k (the number of clusters).
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the M objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers.
5. If none of the M objects changed membership in the last iteration, exit; otherwise go to step 3.

Mathematically, K-means Algorithm


1. Decide on value of number of clusters ‘k’.
2. Choose K initial cluster centres z1(1), z2(1),…… zK(1). (This is done randomly by the
algorithm).

3. At the ith iterative step, distribute the samples X among the K clusters Ck(i) defined
as:

𝑥𝑝 ∈ 𝐶𝑘 (𝑖) 𝑖𝑓 𝑚 (𝑥𝑝 − 𝑧𝑘 (𝑖)) ≤ 𝑚 (𝑥𝑝 − 𝑧𝑙 (𝑖))

for all l =1, 2, …, k and p = 1, 2, …, M


Where, xp is an element of X.
m is the measure which represents the distance.

4. Compute the new cluster centres zk(i+1) as the mean of the samples assigned to each cluster:

zk = (1/Nk) Σ{x ∈ Ck(i)} x

5. If none of the M objects changed membership in the last iteration, exit. Otherwise go
to step 3.
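A compact Python sketch of these steps on hypothetical 2-D data (Euclidean distance, random initial centers):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # step 3: assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 4: recompute each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step 5: stop when the centers (and hence memberships) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])   # hypothetical data
labels, centers = kmeans(X, k=3)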

Flow Chart: K-means Algorithm

Illustration of the K-means Algorithm

 Select the number of clusters (e.g. k = 5).
 Select k cluster center locations (randomly).
 Distribute the samples amongst the Ck clusters (for each data point, find the center it is closest to).
 Update the center of each cluster and jump to there; repeat steps 3 to 4 until terminated.

K-means will converge for common similarity measures. Most of the convergence happens
in the first few iterations. Often the stopping condition is changed to ‘Until relatively few
points change clusters’.
The complexity is O(n · K · I · d), where n = number of points, K = number of clusters,
I = number of iterations, and d = number of attributes.

Evaluating K-means Clusters


The most common measure is the Sum of Squared Errors (SSE). For each point, the error is
the distance to the nearest cluster center; to get SSE, we square these errors and sum them:

SSE = Σk=1..K Σ{x ∈ Ck} d²(x, zk)

where x is a data point in cluster Ck and zk is the representative point for cluster Ck (one
can show that zk corresponds to the center, i.e. the mean, of the cluster). Given two
clusterings, we can choose the one with the smaller error. One easy way to reduce SSE is to
increase K, the number of clusters; yet a good clustering with a smaller K can have a lower
SSE than a poor clustering with a higher K.

Example: Real life Application of K-means Clustering


Colour quantization is the process of reducing the number of distinct colours in an image.
The intent is to preserve the colour appearance of the image as much as possible, while
reducing the number of colours, whether for memory limitations or compression. It is best
used when building Content-Based Image Retrieval (CBIR) systems. K-means clustering can
be used for image compression.

Colour Quantization Problem


What’s there?
An image that is stored with 24 bits/pixel and can have up to 16 million colours, and a colour
screen with 8 bits/pixel that can display only 256 colours.

What to find?
The best 256 colours among all 16 million colours, such that the image using only the 256
colours in the palette looks as close as possible to the original image. Colour quantization is
used to map from high to low colour resolution. One can always quantize uniformly, but this
wastes the colour map by assigning entries to colours that do not exist in the image, and it
fails to assign extra entries to colours used frequently in the image.

For example, if the image is a seascape, we expect to see many shades of blue and maybe no
red, so the distribution of the colour-map entries should reflect the original colour density.

True-Colour vs. Index-Colour Images

True-colour image: each pixel is represented by a vector of 3 components [R, G, B].
E.g. for a 7 × 5 pixel image, file size = 7·5·3 bytes = 105 bytes.

Index-colour image: each pixel is represented by an index into a colour map.
E.g. for a 7 × 5 pixel image with a 6-entry palette, file size = 6·3 bytes (palette) + 7·5·1 bytes (indices) = 53 bytes.

Image Compression
Date: 1998/04/05
Dimension: 480x640
Raw data size: 480*640*3 bytes = 900KB
File size: 49.1KB
Compression ratio = 900/49.1 = 18.33

Application: Image Compression


Goal: Convert a pixel image from true colours to indexed colours with minimum distortion.

Steps
 Collect data from a true-colour image
 Perform K-means clustering to obtain cluster centers as the indexed colours
 Compute the compression rate

before = m · n · 3 · 8 bits
after = m · n · ⌈log2 c⌉ + c · 3 · 8 bits

before / after = (m · n · 3 · 8) / (m · n · ⌈log2 c⌉ + c · 3 · 8) = 24 / (⌈log2 c⌉ + 24c / (m · n))

Some quantities of the k-means clustering

Number of pixels, n = 480 × 640 = 307200; this is the number of vectors to be clustered.


Each pixel is represented by a vector
Dimension, d = 3 (R, G, B)
Number of colours in the palette, m = 256 (no. of clusters)
Indexing of pixels for a 2*3*3 image.


MATLAB Code
X = imread('annie19980405.jpg');
image(X)
[m, n, p] = size(X);
index = reshape(1:m*n*p, m*n, 3)';            % one [R; G; B] column per pixel
data = double(X(index));
maxI = 6;
for i = 1:maxI
    centerNum = 2^i;                          % palette sizes 2, 4, ..., 64
    fprintf('i=%d/%d: no. of centers=%d\n', i, maxI, centerNum);
    center = kMeansClustering(data, centerNum);   % custom k-means routine (not a MATLAB built-in)
    distMat = distPairwise(center, data);         % custom pairwise-distance routine
    [minValue, minIndex] = min(distMat);      % nearest palette colour per pixel
    X2 = reshape(minIndex, m, n);
    map = center'/255;
    figure; image(X2); colormap(map); colorbar; axis image;
end

Choice of K
Can WK(C), i.e., the within-cluster distance as a function of K, serve as an indicator?

WK = Σk=1..K Nk Σ{xi ∈ Ck} d²(xi, zk)

Note that WK(C) decreases monotonically with increasing K; that is, the within-cluster
scatter decreases as centroids are added. Instead, look at the gap statistics (successive
differences WK − WK+1), which are large below the true number of clusters K* and small
above it:

{WK − WK+1 : K < K*} ≫ {WK − WK+1 : K ≥ K*}


K-means: Limitations
K is a user input; alternatively, BIC (Bayesian information criterion) or MDL (minimum
description length) can be used to estimate K. K-means converges, but it finds a local
minimum of the cost function; it is an approximation to an NP-hard combinatorial
optimization problem. It works only for numerical observations. Outliers can cause
considerable trouble for K-means. K-means also has problems when clusters are of differing
sizes, densities, and non-globular shapes.

Limitations of K-means: Differing Sizes

Limitations of K-means: Differing Density

Limitations of K-means: Non-globular Shapes

Overcoming K-means Limitations

One solution is to use many clusters: find parts of clusters, then put them together.


K-Means Clustering: Epilogue


Despite certain limitations, the k-means algorithm is a major workhorse in clustering
analysis:
 It works well on many realistic data sets,
 Is relatively fast, easy to implement, and easy to understand.

Many clustering algorithms that improve on or generalize k-means, such as k-medians,
k-medoids, k-means++, and the EM algorithm for Gaussian mixtures, all reflect the same
fundamental insight: points in a cluster ought to be close to the center of that cluster.

K-medoids Clustering
K-means is appropriate when Euclidean distances can be used, and it can work only with
numerical, quantitative variable types. Euclidean distances do not work well in at least two
situations:
1. Some variables are categorical
2. Outliers can be potential threats

A general version of K-means algorithm called K-medoids can work with any distance
measure. K-medoids clustering is computationally more intensive.

K-medoids Algorithm
Step 1: For a given cluster assignment C, find the observation in each cluster minimizing the
total distance to the other points in that cluster:

ik* = argmin{i : C(i) = k} Σ{j : C(j) = k} d(xi, xj)

Step 2: Assign the medoids: mk = x(ik*), k = 1, 2, …, K

Step 3: Given the current set of cluster centers {m1, …, mK}, minimize the total error by
assigning each observation to the closest (current) cluster center:

C(i) = argmin{1 ≤ k ≤ K} d(xi, mk), i = 1, …, N

Iterate steps 1 to 3 until the cluster assignments do not change.

 Generalized K-means
 Computationally much costlier than K-means
 Apply when dealing with categorical data
 Apply when data points are not available, but only pair-wise distances are available
 Converges to a local minimum
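A minimal K-medoids sketch that works directly from a matrix of pair-wise distances D, so any metric (including one suited to categorical data) can be plugged in:

import numpy as np

def k_medoids(D, k, max_iter=50, seed=0):
    # D is an (n x n) matrix of pairwise distances under any metric
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)          # assign each point to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                             # pick the best medoid within each cluster
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if set(new_medoids) == set(medoids):           # converged (local minimum)
            break
        medoids = new_medoids
    return labels, medoids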

Other Partitional Clustering Algorithms


Fuzzy K-means: Unlike in K-means, each point can belong to two or more clusters, with a
certain weight for each.

Fuzzy K-means Application: Classify Cancer Cells


Using a small brush, cotton stick, or wooden stick, a specimen is taken from the uterine cervix
and smeared onto a thin, rectangular glass plate, a slide. The purpose of the smear screening
is to diagnose pre-malignant cell changes before they progress to cancer. The smear is
stained using the Papanicolau method, hence the name Pap smear. Different characteristics
take different colours, which are easy to distinguish in a microscope. A cyto-technician
performs the screening in a microscope; it is time consuming and prone to error, as each
slide may contain up to 300,000 cells.

Dysplastic cells have undergone precancerous changes. They generally have longer and
darker nuclei, and they have a tendency to cling together in large clusters. Mildly dysplastic
cells have enlarged and bright nuclei. Moderately dysplastic cells have larger and darker
nuclei. Severely dysplastic cells have large, dark, and often oddly shaped nuclei. The
cytoplasm is dark, and it is relatively small.

Possible Features
 Nucleus and cytoplasm area
 Nucleus and cyto brightness
 Nucleus shortest and longest diameter
 Cyto shortest and longest diameter

 Nucleus and cyto perimeter
 Nucleus and cyto no of maxima
 (...)

Classes are Nonseparable

Hard Classifier (HCM): K-means

A cell is either one or the other class defined by a colour.

Fuzzy Classifier (FCM)
A cell can belong to several clusters to a degree, i.e., one column may have several colours.

Other Distinctions Between Sets of Clusters


 Exclusive versus non-exclusive: In non-exclusive clusterings, points may belong to
multiple clusters. Can represent multiple classes or ‘border’ points.
 Fuzzy versus non-fuzzy: In fuzzy clustering, a point belongs to every cluster with
some weight between 0 and 1. Weights must sum to 1. Probabilistic clustering has
similar characteristics.
 Partial versus complete: In some cases, we only want to cluster some of the data.
 Heterogeneous versus homogeneous: Cluster of widely different sizes, shapes,
and densities.

Types of Clusters
 Well-separated clusters
 Center-based clusters
 Contiguous clusters
 Density-based clusters
 Property or Conceptual
 Described by an Objective Function

Illustration of the K-means Algorithm

 Select the number of clusters (e.g. k = 5).
 Select k cluster center locations (randomly).
 Distribute the samples amongst the Ck clusters (for each data point, find the center it is closest to).
 Update the center of each cluster to the mean of its members,

zk = (1/Nk) Σ{x ∈ Ck(i)} x

and jump to there. Repeat steps 3 to 4 until terminated.

K-Nearest Neighbor
Different Learning Methods
 Eager Learning
Explicit description of target function on the whole training set
 Instance-based Learning
 Learning=storing all training instances
 Classification=assigning target function to a new instance
 Referred to as “Lazy” learning

Classification

Given: Dataset of instances with known categories

Goal: Using the “knowledge” in the dataset, classify a given instance

Predict the category of the given instance that is rationally consistent with the dataset.

Instance-Based Learning
K-Nearest Neighbor Algorithm
 Weighted Regression
 Case-based reasoning

For a given instance T, get the top k dataset instances that are “nearest” to T under a
reasonable distance measure. Inspect the categories of these k instances and choose the
category C that represents the most instances. Conclude that T belongs to category C.

Features
 All instances correspond to points in an n-dimensional Euclidean space.
 Classification is delayed till a new instance arrives.
 Classification done by comparing feature vectors of the different points.
 Target function may be discrete or real-valued.

K-Nearest Neighbor Classifier


Learning by Analogy
A new example is assigned to the most common class among the (K) examples that are most
similar to it

K-Nearest Neighbour Algorithm


To determine the class of a new example E, calculate the distance between E and all examples
in the training set, select the K examples nearest to E, and assign E to the most common class
among its K nearest neighbors.
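A minimal Python sketch of this procedure, using Euclidean distance and a majority vote (the training data are hypothetical and assumed already normalized):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance from x_new to every example
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # most common class wins

X_train = np.array([[0.35, 0.19, 0.6], [0.41, 0.43, 0.4], [0.30, 0.10, 0.8]])   # hypothetical
y_train = np.array(["low-risk", "high-risk", "low-risk"])
print(knn_classify(X_train, y_train, np.array([0.37, 0.20, 0.6]), k=3))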

Distance Between Neighbors


Each example is represented with a set of numerical attributes.

Jay: Age=35, Income=95K and No. of credit cards=3


Rina: Age=41, Income=215K and No. of credit cards=2

“Closeness” is defined in terms of the Euclidean distance between two examples. The
Euclidean distance between X=(x1, x2, x3,…xn) and Y =(y1,y2, y3,…yn) is defined as:
D(X, Y) = √( Σi=1..n (xi − yi)² )

Distance(Jay, Rina) = √[(35−41)² + (95,000−215,000)² + (3−2)²]

K-Nearest Neighbor: Instance Based Learning


No model is built: Store all training examples. Any processing is delayed until a new instance
must be classified.


Strengths & Weaknesses


Strengths:
 Simple to implement and use
 Comprehensible: easy to explain a prediction
 Robust to noisy data by averaging over the k nearest neighbors
 Some appealing applications (discussed next, in personalization)

Weaknesses:
 Needs a lot of space to store all the examples
 Takes more time to classify a new example than with a model (must calculate and
compare the distance from the new example to all other examples)


Jay: Age=35, Income=95K and No. of credit cards=3

Rina: Age=41, Income=215K and No. of credit cards=2

Distance(Jay, Rina) = √[(35−41)² + (95,000−215,000)² + (3−2)²]

The distance between neighbors can be dominated by attributes with relatively large
numbers (e.g., income in our example), so it is important to normalize such features
(e.g., map numbers to values between 0 and 1).
Example:
Income: highest income = 500K
Jay's income is normalized to 95/500, Rina's income is normalized to 215/500, etc.

Normalization of Variables
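A min-max scaling sketch in Python (one common calibration; others are possible):

import numpy as np

def min_max_scale(X):
    # map each column to [0, 1] so wide-range features (e.g. income) don't dominate
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard against constant columns
    return (X - mins) / span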

Distance works naturally with numerical attributes:

d(Rina, John) = √[(35−37)² + (35−50)² + (3−2)²] = 15.16

What if we have nominal attributes?

Example: Married

Non-Numeric Data
Feature values are not always numbers.
 Boolean values: Yes or no, presence or absence of an attribute
 Categories: Colors, educational attainment, gender

Boolean values => convert to 0 or 1. Applies to yes-no/presence-absence attributes.

Non-binary characterizations
 Use natural progression when applicable; e.g., educational attainment: GS, HS,
College, MS, PHD => 1,2,3,4,5
 Assign arbitrary numbers but be careful about distances; e.g., color: red, yellow, blue
=> 1,2,3

Preprocessing Your Dataset


Dataset may need to be preprocessed to ensure more reliable data mining results.
Conversion of non-numeric data to numeric data. Calibration of numeric data to reduce
effects of disparate ranges particularly when using the Euclidean distance metric.

k-NN Variations
 Value of k
o Larger k increases confidence in prediction
o Note that if k is too large, decision may be skewed
 Weighted evaluation of nearest neighbors
o Plain majority may unfairly skew decision
o Revise algorithm so that closer neighbors have greater “vote weight”
 Other distance measures
o City-block distance (Manhattan dist)
Add absolute value of differences
o Cosine similarity

Measure angle formed by the two samples (with the origin)
o Jaccard distance
Determine percentage of exact matches between the samples (not including
unavailable data)

Distance-Weighted Nearest Neighbor Algorithm


Assign weights to the neighbors based on their distance from the query point; the weight
may be, for example, the inverse square of the distance. If all training points are allowed to
influence a particular instance, this is Shepard's method.

How to Choose “K”?


For k = 1, …, 5 the point x is classified correctly (red class).

For larger k the classification of x becomes wrong (blue class).

How to Find Optimal Value of “K” ?


Use p-fold cross validation: divide the training data into p parts, select (p − 1) parts for
training and the remaining part for testing. There are p such training/test combinations.
For each combination, learn a K-NN model for different values of K and compute the
prediction error on the test set. Compute the average test error for each K, and select the K
with the minimum average test error.

K-NN: Computational Complexity


The basic k-NN algorithm stores all examples. Suppose we have n examples, each of
dimension d:
 O(d) to compute the distance to one example
 O(nd) to find one nearest neighbor
 O(knd) to find the k closest examples
 Overall complexity is O(knd)

This is prohibitively expensive for large number of samples. But we need large number of
samples for k-NN to work well.

Remarks
Advantages
 Can be applied to the data from any distribution
 Very simple and intuitive
 Good classification if the number of samples is large enough

Disadvantages

 Choosing best k may be difficult
 Computationally heavy, but improvements possible
 Need large number of samples for accuracy
 Can never fix this without assuming parametric distribution

K-Nearest Neighbor: Synthetic Control

Support Vector Machine


Decision Trees

IF (Outlook = Sunny) ^ (Humidity = High) THEN PlayTennis = NO

IF (Outlook = Sunny) ^ (Humidity = Normal) THEN PlayTennis = YES

Classification Tasks
Learning Task
Given: Expression profiles of leukemia patients and healthy persons.

Compute: A model distinguishing if a person has leukemia from expression data.

Classification Task
Given: Expression profile of a new patient + a learned model

Determine: If a patient has leukemia or not.

Problems in Classifying Data


 Often high dimension of data.
 Hard to come up with simple rules.
 Large amounts of data.
 Need automated ways to deal with the data.
 Use computers: data processing, statistical analysis, trying to learn patterns from the
data (machine learning).

Tennis Example

Introduction: Linear Separators


Binary classification can be viewed as the task of separating classes in feature space.

Which of the linear separators is optimal?

All hyperplanes in Rd are parameterized by a vector w and a constant b, and can be
expressed as w·x + b = 0 (the equation for a hyperplane from algebra). The aim is to find a
hyperplane f(x) = sign(w·x + b) that correctly classifies the data.

Selection of a Good Hyperplane

Objective: Select a 'good' hyperplane using only the data!

Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyperplane 'far' from the data

Maximizing the Margin


The distance from a point (x0, y0) to a line Ax + By + c = 0 is:

|A·x0 + B·y0 + c| / √(A² + B²)

The distance from an example xi to the separator is:

r = |wᵀxi + b| / ‖w‖

Classification Margin
Examples closest to the hyperplane are support vectors. The margin ρ of the separator is
the distance between the support vectors.

Maximum Margin Classification

Maximizing the margin is good according to intuition. It implies that only the support
vectors matter; the other training examples are ignorable.

Linear SVM Mathematically


Let the training set {(xi, yi)}, i = 1..n, with xi ∈ Rd and yi ∈ {−1, 1}, be separated by a
hyperplane with margin ρ. Then for each training example (xi, yi):

yi(wᵀxi + b) ≥ ρ/2

For every support vector xs the above inequality is an equality. After rescaling w and b by
ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is:

r = ys(wᵀxs + b) / ‖w‖ = 1 / ‖w‖

Then the margin can be expressed through the (rescaled) w and b as:

ρ = 2r = 2 / ‖w‖

We can formulate the quadratic optimization problem:

Find w and b such that ρ = 2 / ‖w‖ is maximized and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1

which can be reformulated as:

Find w and b such that Φ(w) = ‖w‖² = wᵀw is minimized and for all (xi, yi), i = 1..n:
yi(wᵀxi + b) ≥ 1

Solving the Optimization Problem


Need to optimize a quadratic function subject to linear constraints. Quadratic optimization
problems are a well-known class of mathematical programming problems for which several
(non-trivial) algorithms exist. The solution involves constructing a dual problem where a
Lagrange multiplier αi is associated with every inequality constraint in the primal (original)
problem:

Primal: Find w and b such that Φ(w) = wᵀw is minimized and for all (xi, yi), i = 1..n:
yi(wᵀxi + b) ≥ 1

Dual: Find α1…αn such that Q(α) = Σαi − ½ ΣiΣj αiαj yiyj xiᵀxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

The Optimization Problem Solution


Given a solution α1…αn to the dual problem, solution to the primal is:

w = Σαiyixi and b = yk − Σαiyixiᵀxk for any αk > 0

Each non-zero αi indicates that corresponding xi is a support vector. Then the classifying
function is (note that we don’t need w explicitly):

f(x) = ΣαiyixiTx + b

Notice that it relies on an inner product between the test point x and the support vectors xi;
we will return to this later. Also keep in mind that solving the optimization problem involved
computing the inner products xiᵀxj between all training points.

Soft Margin Classification


What if the training set is not linearly separable? Slack variables ξi can be added to allow
misclassification of difficult or noisy examples, resulting margin called soft.


The old formulation:


Find w and b such that
Φ(w) =wTw is minimized
and for all (xi ,yi), i=1..n : yi (wTxi + b) ≥ 1

Modified formulation incorporates slack variables:


Find w and b such that
Φ(w) = wᵀw + CΣξi is minimized
and for all (xi, yi), i = 1..n: yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0

Parameter C can be viewed as a way to control overfitting: it “trades off” the relative
importance of maximizing the margin and fitting the training data.

Solution:
Dual problem is identical to separable case (would not be identical if the 2-norm penalty
for slack variables CΣξi2 was used in primal objective, we would need additional Lagrange
multipliers for slack variables):
Find α1…αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi

Again, xi with non-zero αi will be support vectors. Solution to the dual problem is:
w =Σαiyixi
b= yk(1- ξk) - ΣαiyixiTxk for any k s.t. αk>0
No need to compute w explicitly for classification: f(x) = ΣαiyixiTx + b.
Theoretical Justification for Maximum Margins
What has Vapnik proved?
The class of optimal linear separators has VC dimension h bounded as:

h ≤ min( ⌈D²/ρ²⌉, m0 ) + 1

Where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the
training examples, and m0 is the dimensionality.

Intuitively, this implies that regardless of the dimensionality m0 we can minimize the VC
dimension by maximizing the margin ρ. Thus the complexity of the classifier is kept small
regardless of dimensionality.

Linear SVMs: Overview
The classifier is a separating hyperplane. Most “important” training points are support
vectors; they define the hyperplane. Quadratic optimization algorithms can identify which
training points xi are support vectors with non-zero Lagrangian multipliers αi. Both in the
dual formulation of the problem and in the solution training points appear only inside
inner products:

Find α1…αN such that


Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b

Non-linear SVMs
Datasets that are linearly separable with some noise work out great:

But what are we going to do if the dataset is just too hard?

How about… mapping data to a higher-dimensional space:

The original feature space can always be mapped to some higher-dimensional feature
space where the training set is separable.

The “Kernel Trick”
The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj. If every
datapoint is mapped into high-dimensional space via some transformation Φ: x → φ(x), the
inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)

A kernel function is a function that is equivalent to an inner product in some feature space.

Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².

Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

K(xi, xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2·xi1xj1·xi2xj2 + xi2²xj2² + 2·xi1xj1 + 2·xi2xj2

= [1, xi1², √2·xi1xi2, xi2², √2·xi1, √2·xi2]ᵀ [1, xj1², √2·xj1xj2, xj2², √2·xj1, √2·xj2]

= φ(xi)ᵀφ(xj), where φ(x) = [1, x1², √2·x1x2, x2², √2·x1, √2·x2]

A kernel function implicitly maps data to a high-dimensional space (without the need to
compute each φ(x) explicitly).
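A quick numeric check of this identity on hypothetical 2-D points:

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + xi @ xj) ** 2)   # kernel value computed in the original space: 4.0
print(phi(xi) @ phi(xj))    # inner product in the feature space: also 4.0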

What Functions are Kernels?


For some functions K(xi,xj) checking that K(xi,xj)= φ(xi) Tφ(xj) can be tedious.

Mercer’s theorem: Every semi-positive definite symmetric function is a kernel.

Semi-positive definite symmetric functions correspond to a semi-positive definite
symmetric Gram matrix K, with entries Kij = K(xi, xj).

Examples of Kernel Functions


Linear: K(xi, xj) = xiᵀxj. Mapping Φ: x → φ(x), where φ(x) is x itself.

Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p. Mapping Φ: x → φ(x), where φ(x) has
(d + p choose p) dimensions.

Gaussian (radial-basis function): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²)). Mapping Φ: x → φ(x),
where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a
combination of such functions for the support vectors is the separator.

Higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but
linear separators in it correspond to non-linear separators in original space.

Non-linear SVMs Mathematically


Dual problem formulation:

Find α1…αn such that


Q(α) =Σαi - ½ΣΣαiαjyiyjK(xi, xj) is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi

The solution is: f(x) = Σαi yi K(xi, x) + b

Optimization techniques for finding αi’s remain the same!

SVM locates a separating hyperplane in the feature space and classify points in that space. It
does not need to represent the space explicitly, simply by defining a kernel function. The
kernel function plays the role of the dot product in the feature space.

Properties of SVM
 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets: only support vectors are
used to specify the separating hyperplane
 Ability to handle large feature spaces
 Overfitting can be controlled by soft margin approach
 Nice math property: A simple convex optimization problem which is guaranteed to
converge to a single global solution
 Feature Selection

Weakness of SVM
It is sensitive to noise: A relatively small number of mislabeled examples can dramatically
decrease the performance. It only considers two classes:
How to do multi-class classification with SVM?
Answer:

With output arity m, learn m SVM’s


 SVM 1 learns “Output==1” vs “Output != 1”
 SVM 2 learns “Output==2” vs “Output != 2”
 SVM m learns “Output==m” vs “Output != m”

To predict the output for a new input, just predict with each SVM and find out which one
puts the prediction the furthest into the positive region.

SVM Applications
SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing
popularity in late 1990s.
 SVMs are currently among the best performers for a number of classification tasks
ranging from text to genomic data.
 SVMs can be applied to complex data types beyond feature vectors (e.g. graphs,
sequences, relational data) by designing kernel functions for such data.
 SVM techniques have been extended to a number of tasks such as regression
[Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
 Most popular optimization algorithms for SVMs use decomposition to hill-climb over
a subset of αi’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99]
 Tuning SVMs remains a black art: Selecting a specific kernel and parameters is
usually done in a try-and-see manner.

SVM: Iris Dataset
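A sketch of the one-vs-rest scheme on the Iris data, assuming scikit-learn is available (LinearSVC is used here as the binary SVM; this is an illustration, not necessarily the exact course demo):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
# one binary SVM per class: "class c" vs "not class c"
machines = {c: LinearSVC(C=1.0, max_iter=10000).fit(X, (y == c).astype(int))
            for c in classes}
# the SVM whose decision value lies furthest in the positive region wins
scores = np.column_stack([machines[c].decision_function(X) for c in classes])
y_pred = classes[scores.argmax(axis=1)]
print("training accuracy:", (y_pred == y).mean())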


Decision Tree, Random Forest and Bagging


What is a Decision Tree?
An inductive learning task use particular facts to make more generalized conclusions. A
predictive model based on a branching series of Boolean tests. These smaller Boolean tests
are less complex than a one-stage classifier.

If we leave at 10 AM and there are no cars stalled on the road, what will our commute time
be?

Decision Trees as Rules


We did not have to represent this tree graphically; we could have represented it as a set of
rules, although that may be much harder to read. Notice that not all attributes have to be
used in each path of the decision. As we will see, some attributes may not even appear in
the tree.

How to Create a Decision Tree?


 We first make a list of attributes that we can measure; these attributes (for now)
must be discrete.
 We then choose a target attribute that we want to predict.
 Then we create an experience table that lists what we have seen in the past.

Sample Experience Table

Choosing Attributes
The previous experience decision table showed 4 attributes: hour, weather, accident and
stall. But the decision tree only showed 3 attributes: hour, accident and stall. Why is that?
Methods for selecting attributes (described later) show that weather is not a
discriminating attribute.

We use the principle of Occam's Razor: given a number of competing hypotheses, the simplest
one is preferable. The basic structure of creating a decision tree is the same for most
decision tree algorithms; the difference lies in how we select the attributes for the tree. We
will focus on the ID3 algorithm (Iterative Dichotomiser 3) developed by Ross Quinlan in
1975. C4.5, an extension of ID3, additionally accounts for unavailable values, continuous
attribute value ranges, pruning of decision trees, rule derivation, and so on.

Decision Tree Algorithms


The basic idea behind any decision tree algorithm is as follows:
1. Choose the best attribute(s) to split the remaining instances and make that attribute
a decision node
2. Repeat this process recursively for each child
3. Stop when all the instances have the same target attribute value, when there are no
more attributes, or when there are no more instances

Identifying the Best Attributes
Refer back to our original decision tree

How did we know to split on “Leave At” and then on “Stall” and “Accident” and not
“Weather”?

ID3 Heuristic
To determine the best attribute, we look at the ID3 heuristic: ID3 splits attributes based on
their entropy, a measure of disorder (uncertainty) in the data.

Entropy
Calculation of entropy:

E(S) = Σi=1..n −(|Si| / |S|) · log2(|Si| / |S|)

S = set of examples
Si = subset of S with value vi under the target attribute
n = size of the range of the target attribute

Entropy is minimized when all values of the target attribute are the same. If we know that
commute time will always be short, then entropy = 0. Entropy is maximized when there is
an equal chance of all values for the target attribute (i.e. the result is random). If commute
time = short in 3 instances, medium in 3 instances and long in 3 instances, entropy is
maximized.

Example
Suppose S has 25 examples, 15 positive and 10 negative [15+, 10−]. Then the entropy of S
relative to this classification is:

E(S) = −(15/25) · log2(15/25) − (10/25) · log2(10/25) ≈ 0.971

Some Intuitions
The entropy is 0 if the outcome is "certain". The entropy is maximum if we have no
knowledge of the system (any outcome is equally possible). [Figure: entropy of a 2-class
problem as a function of the proportion of one of the two groups]

ID3
ID3 splits on attributes with the lowest entropy. We calculate the entropy for all values of
an attribute as the weighted sum of subset entropies:

Σi=1..n (|Si| / |S|) · E(Si)

where n is the range of the attribute we are testing. Equivalently, we can measure the
information gain, the expected reduction in entropy, when selecting a particular attribute
for a split.

Information Gain
Information gain measures the expected reduction in entropy, or uncertainty:

Gain(S, A) = E(S) − Σ{v ∈ Values(A)} (|Sv| / |S|) · E(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for
which attribute A has value v:

Sv = {p ∈ S | A(p) = v}

The first term in the equation for Gain is just the entropy of the original collection S, and the
second term is the expected value of the entropy after S is partitioned using attribute A. Given
our commute time sample set, we can calculate the entropy of each attribute at the root node.


Examples

Before partitioning, the entropy is E(10/20, 10/20) = -10/20 log2(10/20) - 10/20 log2(10/20) = 1

Using the "where" attribute, divide into 2 subsets:

 Entropy of the first set E(home) = -6/12 log2(6/12) - 6/12 log2(6/12) = 1
 Entropy of the second set E(away) = -4/8 log2(4/8) - 4/8 log2(4/8) = 1

Expected entropy after partitioning: 12/20 * E(home) + 8/20 * E(away) = 1

Using the "when" attribute, divide into 3 subsets:

 Entropy of the first set E(5pm) = -1/4 log2(1/4) - 3/4 log2(3/4) ≈ 0.811
 Entropy of the second set E(7pm) = -9/12 log2(9/12) - 3/12 log2(3/12) ≈ 0.811
 Entropy of the third set E(9pm) = -0/4 log2(0/4) - 4/4 log2(4/4) = 0

Expected entropy after partitioning: 4/20 * E(1/4, 3/4) + 12/20 * E(9/12, 3/12) + 4/20 *
E(0/4, 4/4) ≈ 0.65

Information gain: G = 1 - 0.65 = 0.35

Decision
Knowing the "when" attribute values provides larger information gain than "where".
Therefore the "when" attribute should be chosen for testing prior to the "where" attribute.
Similarly, we can compute the information gain for other attributes. At each node, choose the
attribute with the largest information gain.
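Tying the formula to the worked example, here is a short sketch in Python (an illustration; the subset class counts are read off the when/where computation above):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    # Gain(S, A) = E(S) - sum_v |S_v|/|S| * E(S_v)
    total = sum(parent_counts)
    expected = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - expected

# "when": subsets with class counts [1,3], [9,3], [0,4] -> gain ~0.35
print(information_gain([10, 10], [[1, 3], [9, 3], [0, 4]]))
# "where": subsets [6,6] and [4,4] -> gain 0.0
print(information_gain([10, 10], [[6, 6], [4, 4]]))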

Stopping Rule
Stop splitting when either every attribute has already been included along this path through
the tree, or the training examples associated with this leaf node all have the same target
attribute value (i.e., their entropy is zero).

Evaluation
Training accuracy
How many training instances can we correctly classify based on the available data? Training
accuracy is high when the tree is deep/large, or when there is little conflict among the
training instances. However, higher training accuracy does not imply good generalization.

Testing accuracy
Given a number of new instances, how many of them can we correctly classify? Testing
accuracy is typically estimated via cross-validation.

Continuous Attribute
Each non-leaf node is a test, and its edges partition the attribute's values into subsets (easy
for discrete attributes). For a continuous attribute:
 Partition the continuous values of attribute A into a discrete set of intervals
 Create a new boolean attribute A_c by looking for a threshold c:

A_c = True if A < c, False otherwise

How do we choose c? A common approach is to pick the threshold that maximizes
information gain, with candidate thresholds taken as midpoints between consecutive sorted
values (see the sketch below).
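A sketch of that approach (an assumption about a reasonable implementation, not the manual's code; it builds on the entropy/information_gain helpers from the sketch above):

from collections import Counter

def counts(labels):
    return list(Counter(labels).values())

def best_threshold(values, labels):
    """Choose c for the boolean test A < c by maximizing information gain."""
    pairs = sorted(zip(values, labels))
    best_c, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2                      # midpoint candidate
        left = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        gain = information_gain(counts(labels),
                                [counts(left), counts(right)])
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c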

Pruning Trees
There is another technique for reducing the number of attributes used in a tree – pruning.
Two types of pruning:
1. Pre-pruning (forward pruning)
2. Post-pruning (backward pruning)

Prepruning: In prepruning, we decide during the building process when to stop adding
attributes (possibly based on their information gain). However, this may be problematic –
Why? Sometimes attributes individually do not contribute much to a decision, but
combined, they may have a significant impact.

Postpruning: Postpruning waits until the full decision tree has been built and then prunes
the attributes. Two techniques:
1. Subtree Replacement
2. Subtree Raising

Subtree Replacement
The entire subtree is replaced by a single leaf node. In the illustrative example, a single leaf
node (node 6) replaces the subtree. This generalizes the tree a little more, and may improve
accuracy on unseen data.

Subtree Raising
The entire subtree is raised onto another node. This was not discussed in detail, as it is not
clear whether the technique is really worthwhile (it is very time consuming).

Strengths of ID3 algorithm
 Can generate understandable rules
 Can perform classification without much computation
 Can handle categorical variables, and continuous ones after discretization
 Provides a clear indication of which fields are most important for prediction or
classification

Problems with ID3


ID3 is not optimal: it uses expected entropy reduction, not actual reduction. It must use
discrete (or discretized) attributes; what if we left for work at 9:30 AM? We could break the
attributes down into smaller intervals…

Problems with Decision Trees


While decision trees classify quickly, the time needed to build a tree may be higher than for
other types of classifiers. Decision trees also suffer from errors propagating through the
tree, which becomes a serious problem as the number of classes increases. Deep (large)
trees can result in overfitting.

Overfitting in Decision Tree Learning

Overfitting: Formal definition


Consider the error of hypothesis h over:
 Training data: error_train(h)
 Entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such
that

error_train(h) < error_train(h')  and  error_D(h) > error_D(h')

How can we avoid overfitting?
 Stop growing when data split not statistically significant
 Grow full tree then post-prune
 Use ensemble of decision trees

Random Forests
An ensemble of decision trees
1. Use bootstrap sampling to split the learning data into samples, generating a large
number of bootstrap samples.
2. Generate a decision tree for each bootstrap sample.
3. All trees vote to produce a final answer. The majority vote is considered.

This process is called “Bagging” (Bootstrap Aggregation).

Why do this?
It was found that optimal cut points can depend strongly on the training set used. [High
variance]. This led to the idea of using multiple trees to vote for a result. Averaging the
outputs of trees reduces overfitting to noise, and pruning is not needed. For the use of
multiple trees to be most effective, the trees should be as independent as possible. Splitting
using a random subset of features hopefully achieves this.

Typically 5 – 100 trees are used; often only a few trees are needed. Results seem fairly
insensitive to the number of random attributes that are tested for each split. A common
default is to use the square root of the number of attributes. Trees are fast to generate
because fewer attributes have to be tested for each split and no pruning is needed. However,
the memory needed to store the trees can be large.
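For illustration (not from the manual's original material), scikit-learn's RandomForestClassifier exposes exactly these knobs — the number of trees and the number of random attributes tried per split:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 100 bootstrap-sampled trees; max_features="sqrt" tests the square root
# of the number of attributes at each split, the common default above.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())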

In general, test error decreases as the ensemble size increases.

Training time, however, increases with ensemble size.


Classification using Naïve Bayes


Background
There are three methods to establish a classifier
1. Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
2. Model the probability of class memberships given input data
Example: logistic regression, perceptron with the cross-entropy cost
3. Make a probabilistic model of data within each class
Examples: Naïve Bayes, model based classifiers

1 and 2 are examples of discriminative or conditional classification; 2 and 3 are both
examples of probabilistic classification, and 3 is an example of generative classification.
Probability Basics
Prior, conditional and joint probability for random variables:
 Prior probability
 Conditional probability
 Joint probability
 Relationship
 Independence
 Bayesian Rule

P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}, \quad \text{i.e.} \quad \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

Probabilistic Classification
Establishing a probabilistic model for classification.

Discriminative model

Confidential and restricted. Do not distribute. (c) Imarticus Learning 206


Post Graduate Program in Data Analytics
Data Science: Participant Manual
Generative model

MAP classification rule

Maximum A Posteriori (MAP): assign x to c* if

P(C = c^* \mid X = x) > P(C = c \mid X = x), \quad \forall c \neq c^*, \ c \in \{c_1, \ldots, c_L\}

Generative classification with the MAP rule: Apply Bayesian rule to convert them into
posterior probabilities:
P(C = c_i \mid X = x) = \frac{P(X = x \mid C = c_i)\,P(C = c_i)}{P(X = x)}
                      \propto P(X = x \mid C = c_i)\,P(C = c_i), \quad \text{for } i = 1, 2, \ldots, L

Then apply the MAP rule.

Naïve Bayes
Bayes classification

P(C \mid X) \propto P(X \mid C)\,P(C) = P(X_1, \ldots, X_n \mid C)\,P(C)

Difficulty: learning the joint probability P(X_1, \ldots, X_n \mid C).

Naïve Bayes classification assumes that all input features are conditionally independent
given the class!

P(X_1, X_2, \ldots, X_n \mid C) = P(X_1 \mid X_2, \ldots, X_n, C)\,P(X_2, \ldots, X_n \mid C)
                                = P(X_1 \mid C)\,P(X_2, \ldots, X_n \mid C)
                                = P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_n \mid C)

This yields the MAP classification rule: assign x to c* if

[P(x_1 \mid c^*) \cdots P(x_n \mid c^*)]\,P(c^*) > [P(x_1 \mid c) \cdots P(x_n \mid c)]\,P(c), \quad c \neq c^*, \ c \in \{c_1, \ldots, c_L\}

Together, the independence factorization and the MAP rule above constitute naïve Bayes
classification.

Let’s Understand Through an Example: Play Badminton Data

For the day <sunny, cool, high, strong>, what’s the play prediction?

For each of the four external factors, we compute a conditional probability table.


Learning Phase

Example
Test Phase
Given a new instance, predict its label: x' = (Outlook=Sunny, Temperature=Cool,
Humidity=High, Wind=Strong). Look up the tables obtained in the learning phase.

Decision making with the MAP rule


P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes)
= 0.0053
P(No|x’) ≈ [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No)
= 0.0206
Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
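For reference, the arithmetic behind these two numbers (the conditional probabilities are those of the standard play/weather dataset this example follows; treat the individual fractions as assumptions, since the learning-phase tables themselves appear only in the omitted figure):

# P(x'|Yes) * P(Yes): Sunny|Yes, Cool|Yes, High|Yes, Strong|Yes, Play=Yes
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
# P(x'|No) * P(No):   Sunny|No,  Cool|No,  High|No,  Strong|No,  Play=No
p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(p_yes, 4), round(p_no, 4))   # 0.0053 0.0206 -> label "No"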

Algorithm: Continuous-valued Features
A continuous feature can take on infinitely many values. Its conditional probability is often
modeled with the normal distribution:

\hat{P}(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)

\mu_{ji}: mean (average) of the feature values X_j of examples for which C = c_i
\sigma_{ji}: standard deviation of the feature values X_j of examples for which C = c_i
for X = (X_1, \ldots, X_n), C \in \{c_1, \ldots, c_L\}

Learning Phase: estimate the priors P(C = c_i), i = 1, \ldots, L, and the per-feature Gaussian
parameters. Output: n × L normal distributions and the L prior probabilities.

Test Phase: given an unknown instance X' = (a_1, \ldots, a_n), instead of looking up tables,
calculate the conditional probabilities using the normal distributions obtained in the
learning phase, and apply the MAP rule to make a decision.

Example: Continuous-valued Features


Temperature is naturally a continuous-valued feature.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
Estimate mean and variance for each class:

Learning Phase: output two Gaussian models for P(temp | C):

\hat{P}(x \mid Yes) = \frac{1}{2.35\sqrt{2\pi}} \exp\left(-\frac{(x - 21.64)^2}{2 \cdot 2.35^2}\right) = \frac{1}{2.35\sqrt{2\pi}} \exp\left(-\frac{(x - 21.64)^2}{11.09}\right)

\hat{P}(x \mid No) = \frac{1}{7.09\sqrt{2\pi}} \exp\left(-\frac{(x - 23.88)^2}{2 \cdot 7.09^2}\right) = \frac{1}{7.09\sqrt{2\pi}} \exp\left(-\frac{(x - 23.88)^2}{100.54}\right)
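These parameters are easy to verify; a small illustrative script (not from the manual):

import math
import statistics as st

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no = [27.3, 30.1, 17.4, 29.5, 15.1]

# Sample mean and standard deviation for each class, as in the text
print(st.mean(yes), st.stdev(yes))   # ~21.64, ~2.35
print(st.mean(no), st.stdev(no))     # ~23.88, ~7.09

def gaussian(x, mu, sigma):
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (math.sqrt(2 * math.pi) * sigma))

print(gaussian(22.0, st.mean(yes), st.stdev(yes)))   # P(temp = 22 | Yes)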

Relevant Issues
Violation of the independence assumption: for many real-world tasks,

P(X_1, \ldots, X_n \mid C) \neq P(X_1 \mid C) \cdots P(X_n \mid C)

Nevertheless, naïve Bayes works surprisingly well anyway!

The zero conditional probability problem:

If no training example contains the feature value X_j = a_{jk} for class c_i, then

\hat{P}(X_j = a_{jk} \mid C = c_i) = 0

and consequently \hat{P}(x_1 \mid c_i) \cdots \hat{P}(a_{jk} \mid c_i) \cdots \hat{P}(x_n \mid c_i) = 0 during testing.

As a remedy, conditional probabilities are re-estimated with the m-estimate:

\hat{P}(X_j = a_{jk} \mid C = c_i) = \frac{n_c + mp}{n + m}

n_c: number of training examples for which X_j = a_{jk} and C = c_i
n: number of training examples for which C = c_i
p: prior estimate (usually, p = 1/t for t possible values of X_j)
m: weight given to the prior (number of "virtual" examples, m ≥ 1)

Avoiding the Zero-Probability Problem


Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise
the predicted probability will be zero:

P(x \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)

Example: Suppose a dataset with 1000 tuples has income = Low in 0 tuples, income = Medium
in 990 tuples, and income = High in 10 tuples.

Use the Laplacian correction (or Laplacian estimator), adding 1 to each case:

P(income = Low) = (0 + 1) / (1000 + 3) = 1/1003
P(income = Medium) = (990 + 1) / (1000 + 3) = 991/1003
P(income = High) = (10 + 1) / (1000 + 3) = 11/1003

The “corrected” probability estimates are close to their “uncorrected” counterparts.

Naïve Bayes is often a good choice if you don’t have much training data!
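The correction itself is one line; a minimal sketch:

def laplace(count, total, n_values):
    # Laplacian-corrected estimate: (count + 1) / (total + n_values)
    return (count + 1) / (total + n_values)

for c in (0, 990, 10):              # income = Low, Medium, High
    print(laplace(c, 1000, 3))      # 1/1003, 991/1003, 11/1003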

Naïve Bayes: Titanic Dataset
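The original walkthrough for this exercise is not reproduced here; as a stand-in, a minimal sketch using scikit-learn follows. The column names (Pclass, Sex, Age, Fare, Survived) and the file name titanic.csv are assumptions about the usual Titanic CSV, not the manual's exact code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumes a local titanic.csv with the usual Kaggle columns (hypothetical)
df = pd.read_csv("titanic.csv")
df["Sex"] = (df["Sex"] == "male").astype(int)    # encode as 0/1
df["Age"] = df["Age"].fillna(df["Age"].median())

X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))   # test accuracy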


Artificial Neural Network


The Biological Neuron
A neuron's dendritic tree is connected to a thousand neighbouring neurons. When one of
those neurons fires, a positive or negative charge is received by one of the dendrites. The
strengths of all the received charges are added together through the processes of spatial and
temporal summation. Signals "move" via electrochemical processes. The synapses release
chemical transmitters, the sum of which can cause a threshold to be reached, causing the
neuron to "fire".

Prehistory
W.S. McCulloch & W. Pitts (1943). “A logical calculus of the ideas immanent in nervous
activity”, Bulletin of Mathematical Biophysics, 5, 115-137. This seminal paper pointed out
that simple artificial “neurons” could be made to perform basic logical operations such as
AND, OR and NOT.

Nervous Systems as Logical Circuits


Groups of these “neuronal” logic gates could carry out any computation, even though each
neuron was very limited.

Could computers built from these simple units reproduce the computational power of
biological brains? Were biological neurons performing logical operations?

Confidential and restricted. Do not distribute. (c) Imarticus Learning 213


Post Graduate Program in Data Analytics
Data Science: Participant Manual
The Perceptron
Frank Rosenblatt (1962). Principles of Neurodynamics, Spartan, New York, NY. Subsequent
progress was inspired by the invention of learning rules inspired by ideas from
neuroscience. Rosenblatt’s Perceptron could automatically learn to categorise or classify
input vectors into types.

Linear Neurons
The neuron has a real-valued output which is a weighted sum of its inputs. The aim of
learning is to minimize the discrepancy between the desired output and the actual output:
 How do we measure the discrepancies?
 Do we update the weights after every training case?
 Why don’t we solve it analytically?

A Motivation Example
Each day you get lunch at the cafeteria. Your diet consists of fish, chips, and a drink. You get
several portions of each. The cashier only tells you the total price of the meal. After several
days, you should be able to figure out the price of each portion.

Each meal price gives a linear constraint on the prices of the portions:

price = x_{fish} w_{fish} + x_{chips} w_{chips} + x_{drink} w_{drink}

The obvious approach is just to solve a set of simultaneous linear equations, one per meal.
But we want a method that could be implemented in a neural network. The prices of the

portions are like the weights of a linear neuron. We will start with guesses for the
weights and then adjust the guesses to give a better fit to the prices given by the cashier.

Cashier’s Brain

Cashier’s Brain with Arbitrary Initial Weights: Model

Residual error = 350.

The learning rule is:

\Delta w_i = \varepsilon\, x_i (y - \hat{y})

With a learning rate ε of 1/35, the weight changes are +20, +50, +30. This gives new weights
of 70, 100, 80. Notice that the weight for chips got worse!
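A runnable version of one update step. The quantities x = (2, 5, 3) and initial weights of 50 each are inferred from the numbers in the text (residual 350, updates +20/+50/+30), and the true total of 850 follows from them; these are reconstructions, not stated values:

x = [2, 5, 3]                 # portions of fish, chips, drink (inferred)
w = [50.0, 50.0, 50.0]        # arbitrary initial weights (inferred)
y_true = 850                  # cashier's total, so the residual is 350
eps = 1 / 35                  # learning rate

y_hat = sum(wi * xi for wi, xi in zip(w, x))              # prediction: 500
w = [wi + eps * xi * (y_true - y_hat) for wi, xi in zip(w, x)]
print(w)                      # [70.0, 100.0, 80.0]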

Behavior of the Iterative Learning Procedure


 Do the updates to the weights always make them get closer to their correct values?
No!
 Does the online version of the learning procedure eventually get the right answer?
Yes, if the learning rate gradually decreases in the appropriate way.
 How quickly do the weights converge to their correct values? It can be very slow if
two input dimensions are highly correlated (e.g. ketchup and chips).
 Can the iterative procedure be generalized to much more complicated, multi-layer,
non-linear nets? YES!

Deriving the Delta Rule


Define the error as the squared residuals summed over all training cases:


E = \frac{1}{2} \sum_n (y_n - \hat{y}_n)^2

Now differentiate to get the error derivatives for the weights:

\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial \hat{y}_n}{\partial w_i} \frac{dE_n}{d\hat{y}_n} = -\sum_n x_{i,n} (y_n - \hat{y}_n)

The batch delta rule changes the weights in proportion to their error derivatives summed
over all training cases:

\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i}
The Error Surface
The error surface lies in a space with a horizontal axis for each weight and one vertical axis
for the error.
 For a linear neuron, it is a quadratic bowl.
 Vertical cross-sections are parabolas.
 Horizontal cross-sections are ellipses.

Online versus Batch Learning


Batch learning does steepest descent on the error surface.

Online learning zig-zags around the direction of steepest descent: in weight space (axes w1
and w2), each training case defines a constraint line, and each online update moves the
weights toward the constraint of the current case.

Adding Biases
A linear neuron is a more flexible model if we include a bias b. We can avoid having to figure
out a separate learning rule for the bias by using a trick: a bias is exactly equivalent to a
weight on an extra input line that always has an activity of 1.

\hat{y} = b + \sum_i x_i w_i
Transfer Functions
A transfer function determines the output from a summation of the weighted inputs of a
neuron. It maps any real number into a domain normally bounded by 0 to 1 or -1 to 1 (i.e.
squashing functions). The most common functions are sigmoid functions:

O_j = f_j\left(\sum_i w_{ij} x_i\right)

Logistic: f(x) = \frac{1}{1 + e^{-x}}

Hyperbolic tangent: f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
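Both squashing functions take two lines of Python each (illustrative only):

import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(logistic(0.0), tanh(0.0))   # 0.5 and 0.0: outputs squashed
                                  # into (0, 1) and (-1, 1) respectively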

Activation Functions
The activation function is generally non-linear. Linear functions are limited because the
output is simply proportional to the input.


Neuron Models
The choice of activation function φ determines the neuron model.

Examples:

Step function: \varphi(v) = a if v ≥ c, b otherwise

Ramp function: \varphi(v) = a if v ≤ c; b if v ≥ d; a + \frac{(v - c)(b - a)}{d - c} otherwise

Sigmoid function (with parameters z, x, y): \varphi(v) = z + \frac{1}{1 + \exp(-xv + y)}

Gaussian function: \varphi(v) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2}\left(\frac{v - \mu}{\sigma}\right)^2\right)

(Figures: plots of the step, ramp, and sigmoid functions.)

Gaussian Function
The Gaussian function is the probability function of the normal distribution. Sometimes
also called the frequency curve.

The Key Elements of Neural Networks

At each neuron, every input has an associated weight which modifies the strength of that
input. The neuron simply adds together all the weighted inputs and calculates an output to
be passed on.

Preprocessing the Input Vectors


Instead of trying to predict the answer directly from the raw inputs we could start by
extracting a layer of “features”. Sensible if we already know that certain combinations of
input values would be useful. The features are equivalent to a layer of hand-coded non-linear
neurons. So far as the learning algorithm is concerned, the hand-coded features are the input.

Statistical and ANN Terminology


A perceptron model with a linear transfer function is equivalent to a possibly multiple or
multivariate linear regression model [Weisberg 1985; Myers 1986]. A perceptron model
with a logistic transfer function is a logistic regression model [Hosmer and Lemeshow 1989].
A perceptron model with a threshold transfer function is a linear discriminant function
[Hand 1981; McLachlan 1992; Weiss and Kulikowski 1991].

Network Architectures
Three different classes of network architectures:
 Single-layer feed-forward
 Multi-layer feed-forward
 Recurrent

The architecture of a neural network is linked with the learning algorithm used to train it.

Single Layer Feed-forward

Multi-layer Feed-forward NN (FFNN)
FFNN is a more general network architecture, where there are hidden layers between input
and output layers. Hidden nodes do not directly receive inputs nor send outputs to the
external environment. FFNNs overcome the limitation of single-layer NN. They can handle
non-linearly separable learning tasks.

FFNN Neuron Model


The classical learning algorithm for FFNNs is based on the gradient descent method. For this
reason the activation functions used in FFNNs are continuous functions of the weights,
differentiable everywhere. The activation function for node i may be defined as a simple
form of the sigmoid function:

\Phi(V_i) = \frac{1}{1 + e^{-A \cdot V_i}}

where A > 0 is a slope parameter, V_i = \sum_j W_{ij} \cdot Y_j, W_{ij} is the weight of the link
between node i and node j, and Y_j is the output of node j.

Training Algorithm: Backpropagation


The backpropagation algorithm learns in the same way as a single perceptron: it searches
for weight values that minimize the total error of the network over the set of training
examples (the training set).

Back propagation consists of the repeated application of the following two passes:
1. Forward pass: In this step, the network is activated on one example and the error
of (each neuron of) the output layer is computed.
2. Backward pass: In this step the network error is used for updating the weights. The
error is propagated backwards from the output layer through the network layer by
layer. This is done by recursively computing the local gradient of each neuron.

Back propagation adjusts the weights of the NN in order to minimize the network total
mean squared error.


Consider a network of three layers. Let us use i to represent nodes in input layer, j to
represent nodes in hidden layer and k represent nodes in output layer. wij refers to weight
of connection between a node in input layer and node in hidden layer. The following
equation is used to derive the output value Yj of node j.

Y_j = \frac{1}{1 + e^{-V_j}}

where V_j = \sum_i x_i w_{ij} - \theta_j, 1 ≤ i ≤ n, n is the number of inputs to node j, and
\theta_j is the threshold for node j.

Total Mean Squared Error


The error of output neuron k after the activation of the network on the n-th training
example (x(n), d(n)) is: ek(n) = dk(n) – yk(n)

The network error is the sum of the squared errors of the output neurons:

E(n) = \sum_k e_k^2(n)

The total mean squared error is the average of the network errors over the N training
examples:

E_{AV} = \frac{1}{N} \sum_{n=1}^{N} E(n)

Weight Update Rule


The backprop weight update rule is based on the gradient descent method. It takes a step in
the direction yielding the maximum decrease of the network error E. This direction is the
opposite of the gradient of E.

Iteration of the backprop algorithm is usually terminated when the sum of squares of errors
of the output values for all training data in an epoch is less than some threshold, such as
0.01.

w_{ij} = w_{ij} + \Delta w_{ij}, \quad \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}

Backprop Learning Algorithm (Incremental-mode)
n = 1; initialize weights randomly
while (stopping criterion not satisfied and n < max_iterations):
    for each example (x, d):
        run the network with input x and compute the output y
        update the weights in backward order, starting from those of the output
        layer, with Δw_ji computed using the (generalized) delta rule
    end-for
    n = n + 1
end-while
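A compact NumPy sketch of the two passes (a batch variant for brevity, trained on XOR; this is an illustration under the update rules above, not the manual's reference implementation, and convergence speed depends on the random initialization):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)     # hidden layer of 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.5

for epoch in range(20000):
    # Forward pass: activate the network and compute the outputs
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error layer by layer (local gradients)
    delta2 = (y - d) * y * (1 - y)                # output-layer gradient
    delta1 = (delta2 @ W2.T) * h * (1 - h)        # hidden-layer gradient
    W2 -= eta * h.T @ delta2
    b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1
    b1 -= eta * delta1.sum(axis=0)
    if ((y - d) ** 2).sum() < 0.01:               # stopping criterion
        break

print(y.round(2))   # should approach [[0], [1], [1], [0]]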

Stopping Criterions
 Total mean squared error change: Backprop is considered to have converged
when the absolute rate of change in the average squared error per epoch is
sufficiently small (in the range [0.01, 0.1]).
 Generalization-based criterion: After each epoch, the NN is tested for
generalization. If the generalization performance is adequate, then stop. If this
stopping criterion is used, then the part of the training set used for testing the
network's generalization is not used for updating the weights.

Applications
Healthcare Applications of ANNs
 Predicting/confirming myocardial infarction, heart attack, from EKG output waves.
Physicians had a diagnostic sensitivity and specificity of 73.3% and 81.1% while
ANNs performed 96.0% and 96.0%
 Identifying dementia from EEG patterns, performed better than both Z statistics and
discriminant analysis; better than LDA for (91.1% vs. 71.9%) in classifying with
Alzheimer disease.
 Papnet: a Pap smear screening system by Neuromedical Systems, approved for use by the US FDA
 Predict mortality risk of preterm infants, screening tool in urology, etc.

Classification Applications of ANNs


 Credit Card Fraud Detection: AMEX, Mellon Bank, Eurocard Nederland
 Optical Character Recognition (OCR): Fax Software
 Cursive Handwriting Recognition: Lexicus
 Petroleum Exploration: Arco & Texaco
 Loan Assessment: Chase Manhattan for vetting commercial loans
 Bomb detection by SAIC

Time Series Applications of ANNs


 Trading systems: Citibank London (FX).
 Portfolio selection and Management: LBS Capital Management (>US$1b), Deere &
Co. pension fund (US$100m).
 Forecasting weather patterns & earthquakes.
 Speech technology: verification and generation.
 Medical: Predicting heart attacks from EKGs and mental illness from EEGs.

Advantages of Using ANNs
 Works well with large sets of noisy data, in domains where experts are unavailable
or there are no known rules.
 Simplicity of using it as a tool
 Universal approximator.
 Does not impose a structure on the data.
 Possible to extract rules.
 Ability to learn and adapt.
 Does not require an expert or a knowledge engineer.
 Well suited to non-linear type of problems.
 Fault tolerant

Neural Network and Computers


Computers:
Computers have to be explicitly programmed:
 Analyze the problem to be solved.
 Write the code in a programming language.

Neural networks:
 Neural networks learn from examples
 No requirement of an explicit description of the problem.
 No need for a programmer.
 The neural computer adapts itself during a training period, based on examples of
similar problems even without a desired solution to each problem. After sufficient
training the neural computer is able to relate the problem data to the solutions,
inputs to outputs, and it is then able to offer a viable solution to a brand new
problem.
 Able to generalize or to handle incomplete data.
NNs vs. Computers
Digital Computers                            | Neural Networks
---------------------------------------------|---------------------------------------------
Deductive reasoning: we apply known rules    | Inductive reasoning: given input and output
to input data to produce output.             | data (training examples), we construct the rules.
Computation is centralized, synchronous,     | Computation is collective, asynchronous,
and serial.                                  | and parallel.
Memory is packetted, literally stored,       | Memory is distributed, internalized, short
and location addressable.                    | term and content addressable.
Not fault tolerant: one transistor goes      | Fault tolerant, with redundancy and sharing
and it no longer works.                      | of responsibilities.
Exact.                                       | Inexact.
Static connectivity.                         | Dynamic connectivity.
Applicable if well-defined rules with        | Applicable if rules are unknown or complicated,
precise input data.                          | or if data are noisy or partial.

What Can You Do with an NN and What Not?
In principle, NNs can compute any computable function, i.e., they can do everything a normal
digital computer can do. Almost any mapping between vector spaces can be approximated
to arbitrary precision by feed-forward NNs. In practice, NNs are especially useful for
classification and function approximation problems, usually when rules such as those that
might be used in an expert system cannot easily be applied. Work is in progress to apply NNs
successfully to problems that concern manipulation of symbols and memory. There are no
methods for training NNs that can magically create information that is not contained in the
training data.
