Example: Consider the scatter plot of Experience and Salary and compare it to that of Score and Salary. What
do you notice?
[Figure: scatter plot "experience and salary"; experience on the horizontal axis (0 to 16), salary on the vertical axis (0 to 120)]
The Pearson Correlation Coefficient, denoted by rxy (or ρxy for the population), captures this positive
relation between Experience and Salary.
Excel command: =correl(array1, array2) or =pearson(array1,array2)
The Pearson correlation coefficient between Experience and Salary is 0.91, suggesting a strong positive
correlation between the two variables.
Property: It can be shown that rxy (or ρxy) ranges from -1 to 1.
In general: The closer the absolute value of the correlation is to 1, the stronger the linear association.
Extreme cases:
rxy = 0: no linear association between the two variables
rxy = 1: perfect positive linear association between the two variables, the scatter plot will be an
upward-sloping straight line
rxy = -1: perfect negative linear association between the two variables, the scatter plot will be a
downward-sloping straight line
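
The definition behind =correl can be sketched in a few lines of Python. This is only an illustration: the function name pearson and the small data lists are made up for the example and are not part of the salary data set.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up data illustrating the two extreme cases.
x = [1, 2, 3, 4, 5]
up = [2 * v + 1 for v in x]      # points on an upward-sloping straight line
down = [10 - 3 * v for v in x]   # points on a downward-sloping straight line
r_up = pearson(x, up)
r_down = pearson(x, down)
```

Here r_up evaluates to 1 and r_down to -1, matching the two perfect-association cases. Data scattered loosely around a line would give a value strictly between these two.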
Linear Regression Analysis falls under the class of Probability Models. We will deal with probability in
modules 3-5. After that we will take another look at regression analysis. For now, we look at linear regression
as a purely descriptive tool.
Objective: Quantify the impact of a set of variables, called covariates, on the variable of interest, called the
response variable.
Trend Line in Excel: Insert > Scatter > Right Click on a Data Point > Add Trend Line > select Linear > check
Display Equation on chart and Display R-squared value on chart.
[Figure: scatter plot "experience and salary" with fitted linear trend line y = 2.8217x + 62.634, R² = 0.83073]
Q. What is the interpretation of the value 2.8217 (marginal impact)?
Q. What is the estimated salary for a programmer with 17 years of experience (out of sample prediction)?
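
One way to work through both questions is to plug values into the fitted trend line. The sketch below uses the slope and intercept shown on the chart. The function name predict_salary is hypothetical, and salary is in whatever units the data set uses.

```python
# Coefficients read off the trend line y = 2.8217x + 62.634.
SLOPE = 2.8217
INTERCEPT = 62.634

def predict_salary(experience_years):
    """Salary estimated by the fitted line for a given number of years of experience."""
    return INTERCEPT + SLOPE * experience_years

# Out-of-sample prediction: 17 years lies beyond the 0-16 range shown on the chart.
estimate_17 = predict_salary(17)

# Marginal impact: the change in estimated salary for one additional year.
marginal = predict_salary(11) - predict_salary(10)
```

The estimate for 17 years comes out at about 110.60, and one extra year of experience changes the estimate by the slope, 2.8217, whatever the starting point. Because 17 is outside the observed range, the prediction assumes that the linear pattern continues to hold there.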
How close is the estimate in question 3 above? Can we obtain better estimates? As we argued earlier, salary is
determined by many factors. Here we included only one of those factors: experience. Because of the
variables missing from our analysis, our estimate of the impact of experience on salary may be biased. To correct
this problem, we need to include those additional variables.
id   gender   degree   experience   score   salary
1    female   Grad     4            98      83.62
2    male     Grad     3            99      81.17
3    female   Grad     6            64      80.52
4    male     UG       10           66      79.59
5    male     Grad     3            82      74.18
6    female   Grad     1            64      66.79
7    female   Grad     1            87      71.71
8    male     UG       14           91      97.54
Note: Gender and Degree are categorical (qualitative) variables, whereas Experience and Score are
quantitative variables. We will deal with the former shortly. For now let us add score to our regression
analysis.
In Excel: Data > Data Analysis > Regression > Select Input Y Range > Select Input X Range > Check
box Labels > OK
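
For readers who want to see what the Regression tool computes, here is a minimal least-squares sketch in plain Python. It uses only the eight rows listed in the table above, while the Excel output is based on 50 observations, so the coefficients it produces will not match the Excel output. The function fit_ols is a hypothetical helper that solves the normal equations by Gauss-Jordan elimination. It is not a description of how Excel computes the result internally.

```python
def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Augmented system (X'X | X'y).
    A = []
    for i in range(p):
        row = [sum(r[i] * r[j] for r in X) for j in range(p)]
        row.append(sum(r[i] * t for r, t in zip(X, y)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("covariates are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for k in range(p):
            if k != col:
                factor = A[k][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]

# The eight rows shown in the notes (an excerpt, not the full sample).
experience = [4, 3, 6, 10, 3, 1, 1, 14]
score = [98, 99, 64, 66, 82, 64, 87, 91]
salary = [83.62, 81.17, 80.52, 79.59, 74.18, 66.79, 71.71, 97.54]

coef = fit_ols(list(zip(experience, score)), salary)
fitted = [coef[0] + coef[1] * e + coef[2] * s for e, s in zip(experience, score)]
residuals = [obs - fit for obs, fit in zip(salary, fitted)]
```

The list coef holds the intercept followed by the coefficients on experience and score. With an intercept in the model, the residuals sum to zero up to rounding error, which is a quick sanity check on any least-squares fit.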
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94
R Square             0.89
Adjusted R Square    0.89
Standard Error       4.89
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    2     9247.23     4623.62    193.33    0.00
Residual      47    1124.05     23.92
Total         49    10371.28

              Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept     36.60          5.23             7.00     0.00      26.09       47.12
experience    3.03           0.15             19.66    0.00      2.72        3.33
score         0.31           0.06             5.14     0.00      0.19        0.43
Coefficient of Determination (R²): How well does our model explain the variation in salary?
R² measures the percentage of the variation in the dependent variable that is explained by the
estimated multiple regression model.
Regression Statistics
Multiple R           0.94
R Square             0.89
Adjusted R Square    0.89
Standard Error       4.89
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    2     9247.23     4623.62    193.33    0.00
Residual      47    1124.05     23.92
Total         49    10371.28
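
The regression statistics can be recovered from the ANOVA sums of squares, which makes the meaning of R² concrete. The sketch below uses the figures from the output above. The variable names are illustrative, and the adjusted R² and standard error lines use the standard textbook formulas, which the notes themselves do not spell out.

```python
# Figures taken from the ANOVA table (experience and score as covariates).
ss_regression = 9247.23
ss_residual = 1124.05
ss_total = 10371.28
n_obs = 50
n_covariates = 2

# R Square: share of the total variation explained by the model.
r_squared = ss_regression / ss_total
# Equivalent form: one minus the unexplained share.
r_squared_alt = 1 - ss_residual / ss_total

# Adjusted R Square penalizes the number of covariates.
adjusted_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_covariates - 1)

# Multiple R is the square root of R Square.
multiple_r = r_squared ** 0.5

# Standard Error is the square root of the residual mean square.
standard_error = (ss_residual / (n_obs - n_covariates - 1)) ** 0.5
```

Rounded to two decimals, these reproduce the reported values: 0.89 for R Square and Adjusted R Square, 0.94 for Multiple R, and 4.89 for the Standard Error. In words, experience and score together account for about 89% of the variation in salary in this sample.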
id   gender   degree   experience   score   graduate degree   male   salary
1    female   Grad     4            98      1                 0      83.62
2    male     Grad     3            99      1                 1      81.17
3    female   Grad     6            64      1                 0      80.52
4    male     UG       10           66      0                 1      79.59
5    male     Grad     3            82      1                 1      74.18
6    female   Grad     1            64      1                 0      66.79
7    female   Grad     1            87      1                 0      71.71
8    male     UG       14           91      0                 1      97.54
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.99
R Square             0.97
Adjusted R Square    0.97
Standard Error       2.52
Observations         50

ANOVA
              df    SS          MS         F         Significance F
Regression    4     10085.60    2521.40    397.17    0.00
Residual      45    285.68      6.35
Total         49    10371.28

                   Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          34.85          2.70             12.91    0.00      29.41       40.28
experience         3.01           0.08             37.59    0.00      2.85        3.17
score              0.28           0.03             8.89     0.00      0.22        0.34
graduate degree    8.28           0.73             11.42    0.00      6.82        9.74
male               -0.92          0.72             -1.27    0.21      -2.37       0.53
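
To see how the dummy variables enter a prediction, the fitted coefficients from the output above can be applied to one row of the data. This is a sketch only: the function predict and the dictionary COEF are hypothetical names, and the coefficients are the rounded values from the table.

```python
# Rounded coefficients from the regression output with dummy variables.
COEF = {
    "intercept": 34.85,
    "experience": 3.01,
    "score": 0.28,
    "graduate degree": 8.28,
    "male": -0.92,
}

def predict(experience, score, degree, gender):
    """Estimated salary; degree is 'Grad' or 'UG', gender is 'male' or 'female'."""
    grad = 1 if degree == "Grad" else 0   # dummy: 1 for a graduate degree
    male = 1 if gender == "male" else 0   # dummy: 1 for male
    return (COEF["intercept"]
            + COEF["experience"] * experience
            + COEF["score"] * score
            + COEF["graduate degree"] * grad
            + COEF["male"] * male)

# Row with id 1: female, Grad, 4 years of experience, score 98.
pred_id1 = predict(4, 98, "Grad", "female")

# Effect of the degree dummy, holding everything else fixed.
degree_gap = predict(5, 80, "Grad", "male") - predict(5, 80, "UG", "male")
```

For id 1 the estimate is about 82.61, against an observed salary of 83.62. The degree gap equals the graduate degree coefficient, 8.28, regardless of the other values. The coefficient on male has a P-value of 0.21 and a 95% interval from -2.37 to 0.53 that contains zero, so this output gives no clear evidence of a gender effect once the other variables are included.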
In general: If a qualitative variable has k possible outcomes, then (k-1) dummy variables are required, with
each dummy variable being coded as 0 or 1.
For example, if we had 3 categories for the level of education - Bachelors, Masters and Ph.D. - we would
need (3-1) = 2 dummy variables, say Masters and Ph.D., which are coded as follows:
Highest degree    Masters    Ph.D.
Bachelors         0          0
Masters           1          0
Ph.D.             0          1
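
The (k-1) rule can be sketched as a small helper that drops one baseline category and codes the rest as 0 or 1. The function name dummy_code and its arguments are hypothetical. The choice of Bachelors as the baseline follows the table above.

```python
def dummy_code(values, categories, baseline):
    """Code a qualitative variable with k categories as k-1 dummy variables (0 or 1)."""
    if baseline not in categories:
        raise ValueError("baseline must be one of the categories")
    # One dummy per category except the baseline.
    dummy_names = [c for c in categories if c != baseline]
    coded = []
    for v in values:
        if v not in categories:
            raise ValueError(f"unknown category: {v!r}")
        coded.append({name: int(v == name) for name in dummy_names})
    return dummy_names, coded

levels = ["Bachelors", "Masters", "Ph.D."]
names, coded = dummy_code(levels, levels, baseline="Bachelors")
```

The baseline category is the one for which every dummy equals 0, so each dummy coefficient in a regression is read as a difference relative to that baseline.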