Business Stat Module 2 PDF

Module 2: Correlation and Linear Regression
Correlation: Measure of linear association between two variables
Example: Consider the scatter plot of Experience and Salary and compare it to that of Score and Salary. What
do you notice?
Figure: Scatter Plot of Experience and Salary (in 000s)
The Pearson Correlation Coefficient, denoted by rxy (or ρxy for the population), captures this positive
relation between Experience and Salary.
Excel command: =correl(array1, array2) or =pearson(array1,array2)
The Pearson correlation coefficient between Experience and Salary is 0.91, suggesting a strong positive
correlation between the two variables.
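For readers working outside Excel, the following is a minimal Python sketch of the same calculation. The function name pearson_r is an assumption made for illustration, and the data are only the eight survey rows listed later in this module, so the printed value need not equal the 0.91 reported for the full 50-row survey.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Only the eight rows shown in this module (assumption: a partial sample).
experience = [4, 3, 6, 10, 3, 1, 1, 14]
salary = [83.62, 81.17, 80.52, 79.59, 74.18, 66.79, 71.71, 97.54]

print(round(pearson_r(experience, salary), 2))
```

Points that lie exactly on an upward-sloping line give a result of 1, and points on a downward-sloping line give -1.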
Property: It can be shown that rxy (or ρxy) ranges from -1 to 1.
Figure: Positive, Negative and Zero linear relations
In general: The closer the absolute value of the correlation is to 1, the stronger the linear association.
Extreme cases:
rxy = 0: no linear association between the two variables
rxy = 1: perfect positive linear association between the two variables, the scatter plot will be an
upward-sloping straight line
rxy = -1: perfect negative linear association between the two variables, the scatter plot will be a
downward-sloping straight line
Linear Regression as a Descriptive Tool
Linear Regression Analysis falls under the class of Probability Models. We will deal with probability in
modules 3-5. After that we will take another look at regression analysis. For now, we look at linear regression
as a purely descriptive tool.
Objective: Quantify the impact of a set of variables, called covariates, on the variable of interest, called the
response variable.
Trend Line in Excel: Insert > Scatter > Right Click on a Data Point > Add Trend Line > select Linear > check
Display Equation on chart and Display R-squared value on chart.
Figure: Scatter Plot of Experience and Salary (in 000s) with linear trend line
y = 2.8217x + 62.634
R² = 0.83073
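Excel's linear trend line is the least-squares line through the points. The sketch below shows how its slope and intercept can be computed by hand. The function name fit_line is hypothetical, and the data are only the eight survey rows listed later in this module, so the fitted numbers will differ from the trend line equation above, which comes from the full survey.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                     # change in y per unit change in x
    intercept = mean_y - slope * mean_x   # fitted y when x = 0
    return slope, intercept


# Only the eight rows shown in this module (assumption: a partial sample).
experience = [4, 3, 6, 10, 3, 1, 1, 14]
salary = [83.62, 81.17, 80.52, 79.59, 74.18, 66.79, 71.71, 97.54]

slope, intercept = fit_line(experience, salary)
print(f"salary = {slope:.4f} * experience + {intercept:.3f}")
```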
Q. What is the interpretation of the value 2.8217 (marginal impact)?
Q. What is the interpretation of the value 62.634?

Q. What is the estimated salary for a programmer with 5 years of experience (in-sample prediction)?
Q. What is the estimated salary for a programmer with 17 years of experience (out of sample prediction)?
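Both predictions come from substituting a value for x in the trend line equation. A minimal sketch follows, with predict_salary as a hypothetical helper name. The second value should be read with caution, since 17 years lies beyond the range of experience shown on the scatter plot.

```python
SLOPE = 2.8217       # trend line slope from the chart
INTERCEPT = 62.634   # trend line intercept from the chart


def predict_salary(experience_years):
    """Estimated salary (in 000s) from the simple trend line."""
    return INTERCEPT + SLOPE * experience_years


in_sample = predict_salary(5)        # within the observed range
out_of_sample = predict_salary(17)   # extrapolation beyond the data

print(f"{in_sample:.2f}")       # 76.74
print(f"{out_of_sample:.2f}")   # 110.60
```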
How close is the estimate in question 3 above? Can we obtain better estimates? As we argued earlier, salary is
determined by many factors. Here we included only one of those factors, experience. Because of the
variables missing from our analysis, we have a biased estimate of the impact of experience on salary. To correct
this problem, we need to include those additional variables.
Multiple Linear Regression Models

The programmer salary survey example includes 4 covariates, also referred to as independent variables, and a
response variable, referred to as the dependent variable.
      independent (explanatory) variables         dependent (explained) variable
id    gender    degree    experience    score     salary
1     female    Grad      4             98        83.62
2     male      Grad      3             99        81.17
3     female    Grad      6             64        80.52
4     male      UG        10            66        79.59
5     male      Grad      3             82        74.18
6     female    Grad      1             64        66.79
7     female    Grad      1             87        71.71
8     male      UG        14            91        97.54
Note: Gender and Degree are categorical (qualitative) variables, whereas Experience and Score are
quantitative variables. We will deal with the former shortly. For now let us add score to our regression
analysis.
In Excel: Data > Data Analysis > Regression > Select Input Y Range > Select Input X Range > Check
box Labels > OK
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94
R Square             0.89
Adjusted R Square    0.89
Standard Error       4.89
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    2     9247.23     4623.62    193.33    0.00
Residual      47    1124.05     23.92
Total         49    10371.28

              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     36.60           5.23              7.00      0.00       26.09        47.12
experience    3.03            0.15              19.66     0.00       2.72         3.33
score         0.31            0.06              5.14      0.00       0.19         0.43

The Estimated Regression Equation:
estimated salary (in 000s) = 36.6 + 3.03*experience + 0.31*score
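As a quick check of how the equation is used, the sketch below evaluates it for the first programmer in the survey table (experience 4, score 98, observed salary 83.62). The function name estimated_salary is an assumption for illustration, and the coefficients are the rounded values from the output above.

```python
def estimated_salary(experience, score):
    """Estimated salary (in 000s) from the two-covariate regression equation."""
    return 36.6 + 3.03 * experience + 0.31 * score


# Programmer with id 1: experience = 4, score = 98, observed salary = 83.62.
fitted = estimated_salary(4, 98)   # 36.6 + 12.12 + 30.38
residual = 83.62 - fitted          # observed minus fitted

print(f"{fitted:.2f}")     # 79.10
print(f"{residual:.2f}")   # 4.52

# Same score, one more year of experience: the gap is the experience coefficient.
gap = estimated_salary(5, 98) - estimated_salary(4, 98)
print(f"{gap:.2f}")        # 3.03
```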
Interpreting the Coefficients

Estimated coefficient of experience = 3.03: Score held fixed, when experience increases by one year, salary is
expected to increase by $3,030.
Interpretation: Consider two programmers who had the same score on the test but differ in experience by one
year. The one with the additional year of experience on average makes $3,030 more.
Estimated coefficient of score = 0.31:
Coefficient of Determination (R²): How well does our model explain the variation in salary?
R² measures the percentage of the variation in the dependent variable that is explained by the
estimated multiple regression model.
Multiple R           0.94
R Square             0.89
Adjusted R Square    0.89
Standard Error       4.89
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    2     9247.23     4623.62    193.33    0.00
Residual      47    1124.05     23.92
Total         49    10371.28

SSR + SSE = SST

R² = SSR/SST
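Plugging the sums of squares from the ANOVA table into these two identities reproduces the R Square entry. A small sketch, with variable names chosen here for illustration:

```python
import math

ssr = 9247.23    # regression (explained) sum of squares
sse = 1124.05    # residual (unexplained) sum of squares
sst = 10371.28   # total sum of squares

# SSR + SSE = SST, up to rounding in the printed table.
decomposition_holds = abs((ssr + sse) - sst) < 0.01

r_squared = ssr / sst
multiple_r = math.sqrt(r_squared)   # "Multiple R" is the square root of R Square

print(decomposition_holds)     # True
print(round(r_squared, 2))     # 0.89
print(round(multiple_r, 2))    # 0.94
```

About 89% of the variation in salary is explained by experience and score together.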
Qualitative (Categorical) Independent Variables
In many situations we need to work with qualitative variables such as gender (male, female) or method of
payment (cash, check, credit card). These variables do not have numerical values, but in statistical analysis we
assign values to the possible outcomes (usually 0 and 1). These are called dummy variables.
id    gender    degree    experience    score    graduate degree    male    salary
1     female    Grad      4             98       1                  0       83.62
2     male      Grad      3             99       1                  1       81.17
3     female    Grad      6             64       1                  0       80.52
4     male      UG        10            66       0                  1       79.59
5     male      Grad      3             82       1                  1       74.18
6     female    Grad      1             64       1                  0       66.79
7     female    Grad      1             87       1                  0       71.71
8     male      UG        14            91       0                  1       97.54
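The two dummy columns in the table follow mechanically from the gender and degree columns. A minimal sketch of that coding for the eight rows shown, with list names chosen here for illustration:

```python
gender = ["female", "male", "female", "male", "male", "female", "female", "male"]
degree = ["Grad", "Grad", "Grad", "UG", "Grad", "Grad", "Grad", "UG"]

# 1 if the programmer is male, 0 otherwise (female is the baseline).
male = [1 if g == "male" else 0 for g in gender]

# 1 if the programmer holds a graduate degree, 0 otherwise (UG is the baseline).
graduate_degree = [1 if d == "Grad" else 0 for d in degree]

print(male)              # [0, 1, 0, 1, 1, 0, 0, 1]
print(graduate_degree)   # [1, 1, 1, 0, 1, 1, 1, 0]
```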
SUMMARY OUTPUT

Multiple R           0.99
R Square             0.97
Adjusted R Square    0.97
Standard Error       2.52
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    4     10085.60    2521.40    397.17    0.00
Residual      45    285.68      6.35
Total         49    10371.28

                   Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept          34.85           2.70              12.91     0.00       29.41        40.28
experience         3.01            0.08              37.59     0.00       2.85         3.17
score              0.28            0.03              8.89      0.00       0.22         0.34
graduate degree    8.28            0.73              11.42     0.00       6.82         9.74
male               -0.92           0.72              -1.27     0.21       -2.37        0.53

Q. Estimated regression equation?
Q. Interpretation of the coefficient of the dummy variable:
In general: If a qualitative variable has k possible outcomes, then (k - 1) dummy variables are required, with
each dummy variable being coded as 0 or 1.
For example, if we had 3 categories for the level of education - Bachelors, Masters and Ph.D., we would
need (3 - 1) = 2 dummy variables, say Masters and Ph.D., which are coded as follows:
Highest degree    Masters    Ph.D.
Bachelors         0          0
Masters           1          0
Ph.D.             0          1
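The same rule can be sketched in code for any number of categories: pick one category as the baseline and create one 0/1 variable for each of the others. The function name dummy_code and its arguments are hypothetical, chosen only for this illustration.

```python
def dummy_code(value, categories, baseline):
    """Return the (k - 1) dummy variables for one observation.

    The baseline category gets all zeros; every other category gets a 1
    in its own dummy variable and 0 elsewhere.
    """
    if value not in categories or baseline not in categories:
        raise ValueError("value and baseline must both be listed categories")
    return {c: int(value == c) for c in categories if c != baseline}


levels = ["Bachelors", "Masters", "Ph.D."]

for level in levels:
    print(level, dummy_code(level, levels, baseline="Bachelors"))
# Bachelors {'Masters': 0, 'Ph.D.': 0}
# Masters {'Masters': 1, 'Ph.D.': 0}
# Ph.D. {'Masters': 0, 'Ph.D.': 1}
```

Using Bachelors as the baseline matches the coding table above. With a different baseline the fitted model is equivalent, but each dummy coefficient is then read relative to that other category.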
