
Module 2: Correlation and Linear Regression


Correlation: Measure of linear association between two variables

Example: Consider the scatter plot of Experience and Salary and compare it to that of Score and Salary. What
do you notice?

Figure: Scatter Plot of Experience and Salary (in 000s)

The Pearson correlation coefficient, denoted by rxy (or ρxy for the population), captures this positive
relation between Experience and Salary.
Excel command: =CORREL(array1, array2) or =PEARSON(array1, array2)
The Pearson correlation coefficient between Experience and Salary is 0.91, suggesting a strong positive
correlation between the two variables.
Property: It can be shown that rxy (or ρxy) ranges from -1 to 1.

Figure: Positive, Negative and Zero linear relations

In general: The closer the absolute value of the correlation is to 1, the stronger the linear association.
Extreme cases:
rxy = 0: no linear association between the two variables
rxy = 1: perfect positive linear association between the two variables, the scatter plot will be an
upward-sloping straight line
rxy = -1: perfect negative linear association between the two variables, the scatter plot will be a
downward-sloping straight line
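
As a rough cross-check outside Excel, the coefficient can also be computed directly from its definition. The sketch below uses only the eight survey rows listed later in this module, so its result need not match the 0.91 reported above; the function name pearson is an illustrative choice, not part of the course material.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# The eight rows shown in the handout's data table (assumed to be a subset of the survey)
experience = [4, 3, 6, 10, 3, 1, 1, 14]
salary = [83.62, 81.17, 80.52, 79.59, 74.18, 66.79, 71.71, 97.54]

r = pearson(experience, salary)
print(f"r for the eight listed rows: {r:.2f}")
```

Correlating a variable with itself gives 1, and with its negative gives -1, which matches the extreme cases listed above.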

Linear Regression as a Descriptive Tool

Linear Regression Analysis falls under the class of Probability Models. We will deal with probability in
modules 3-5. After that we will take another look at regression analysis. For now, we look at linear regression
as a purely descriptive tool.
Objective: Quantify the impact of a set of variables, called covariates, on the variable of interest, called the
response variable.

Trend Line in Excel: Insert > Scatter > Right Click on a Data Point > Add Trend Line > select Linear > check
Display Equation on chart and Display R-squared value on chart.

Figure: Scatter Plot of Experience and Salary (in 000s) with linear trend line
y = 2.8217x + 62.634
R² = 0.83073

Q. What is the interpretation of the value 2.8217 (marginal impact)?

Q. What is the interpretation of the value 62.634?


Q. What is the estimated salary for a programmer with 5 years of experience (in-sample prediction)?

Q. What is the estimated salary for a programmer with 17 years of experience (out of sample prediction)?
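
One way to work through the two prediction questions is to plug the experience values into the trend line equation y = 2.8217x + 62.634. The sketch below does exactly that; the function name predict_salary is hypothetical, and the 17-year case lies beyond the range shown on the chart axis, which is why it counts as out of sample.

```python
# Coefficients read off the Excel trend line: y = 2.8217x + 62.634
SLOPE = 2.8217       # estimated change in salary (000s) per extra year of experience
INTERCEPT = 62.634   # estimated salary (000s) at zero years of experience

def predict_salary(experience_years):
    """Estimated salary in thousands of dollars from the fitted trend line."""
    return INTERCEPT + SLOPE * experience_years

in_sample = predict_salary(5)       # within the plotted range of experience
out_of_sample = predict_salary(17)  # beyond the plotted range: an extrapolation

print(f"5 years:  {in_sample:.2f} (000s)")
print(f"17 years: {out_of_sample:.2f} (000s)")
```

The out-of-sample figure rests on the assumption that the same straight line continues to hold beyond the observed data, so it should be treated with more caution than the in-sample one.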

How close is the estimate in question 3 above? Can we obtain better estimates? As we argued earlier, salary is
determined by many factors. Here we included only one of those factors, experience. Because of the
variables missing from our analysis, our estimate of the impact of experience on salary is biased. To correct
this problem, we need to include those additional variables.

Multiple Linear Regression Models


The programmer salary survey example includes 4 covariates, also referred to as independent variables, and a
response variable, referred to as the dependent variable.
Independent (explanatory) variables: gender, degree, experience, score
Dependent (explained) variable: salary

id   gender   degree   experience   score   salary
1    female   Grad     4            98      83.62
2    male     Grad     3            99      81.17
3    female   Grad     6            64      80.52
4    male     UG       10           66      79.59
5    male     Grad     3            82      74.18
6    female   Grad     1            64      66.79
7    female   Grad     1            87      71.71
8    male     UG       14           91      97.54

Note: Gender and Degree are categorical (qualitative) variables, whereas Experience and Score are
quantitative variables. We will deal with the former shortly. For now let us add score to our regression
analysis.
In Excel: Data > Data Analysis > Regression > Select Input Y Range > Select Input X Range > Check
box Labels > OK

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.94
R Square            0.89
Adjusted R Square   0.89
Standard Error      4.89
Observations        50

ANOVA
             df   SS         MS        F        Significance F
Regression   2    9247.23    4623.62   193.33   0.00
Residual     47   1124.05    23.92
Total        49   10371.28

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    36.60          5.23             7.00     0.00      26.09       47.12
experience   3.03           0.15             19.66    0.00      2.72        3.33
score        0.31           0.06             5.14     0.00      0.19        0.43

The Estimated Regression Equation:

estimated salary (in 000s) = 36.6 + 3.03*experience + 0.31*score
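
The estimated equation can be evaluated for any combination of experience and score. The sketch below is a minimal illustration using the rounded coefficients from the summary output; the function name estimated_salary is hypothetical. It also shows that raising experience by one year while holding score fixed changes the estimate by exactly the experience coefficient.

```python
# Rounded coefficients taken from the regression summary output
INTERCEPT = 36.60
B_EXPERIENCE = 3.03
B_SCORE = 0.31

def estimated_salary(experience, score):
    """Estimated salary in thousands of dollars."""
    return INTERCEPT + B_EXPERIENCE * experience + B_SCORE * score

# Programmer with 4 years of experience and a test score of 98 (row 1 of the data table)
base = estimated_salary(4, 98)
# Same score, one additional year of experience
plus_one_year = estimated_salary(5, 98)

print(f"estimate: {base:.2f} (000s)")
print(f"effect of one extra year, score held fixed: {plus_one_year - base:.2f} (000s)")
```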

Interpreting the Coefficients


Estimated coefficient of experience = 3.03: With score held fixed, when experience increases by one year, salary is
expected to increase by $3,030.
Interpretation: Consider two programmers who had the same score on the test but differ in experience by one
year. The one with the additional year of experience makes, on average, $3,030 more.
Estimated coefficient of score = 0.31:

Coefficient of Determination (R²): How well does our model explain the variation in salary?
R² measures the percentage of the variation in the dependent variable that is explained by the
estimated multiple regression model.

Regression Statistics
Multiple R          0.94
R Square            0.89
Adjusted R Square   0.89
Standard Error      4.89
Observations        50

ANOVA
             df   SS         MS        F        Significance F
Regression   2    9247.23    4623.62   193.33   0.00
Residual     47   1124.05    23.92
Total        49   10371.28

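SSR + SSE = SST, where SSR is the regression sum of squares, SSE is the residual (error) sum of squares, and
SST is the total sum of squares.

R² = SSR/SST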
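
The two identities can be checked against the SS column of the ANOVA table for the experience-and-score model. The sketch below is only an arithmetic check; the variable names are illustrative.

```python
from math import sqrt

# Sums of squares from the ANOVA table (experience + score model)
ssr = 9247.23    # Regression
sse = 1124.05    # Residual
sst = 10371.28   # Total

# SSR + SSE should reproduce SST, up to rounding in the printed table
print(f"SSR + SSE = {ssr + sse:.2f}")

r_squared = ssr / sst
multiple_r = sqrt(r_squared)  # "Multiple R" in the Excel output is the square root of R Square

print(f"R Square   = {r_squared:.2f}")
print(f"Multiple R = {multiple_r:.2f}")
```

Rounded to two decimals, these agree with the R Square (0.89) and Multiple R (0.94) values in the Regression Statistics block.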

Qualitative (Categorical) Independent Variables
In many situations we need to work with qualitative variables such as gender (male, female) or method of
payment (cash, check, credit card). These variables do not have numerical values, but in statistical analysis we
assign numerical values to the possible outcomes (usually 0 and 1). Variables coded this way are called dummy variables.

id   gender   degree   experience   score   graduate degree   male   salary
1    female   Grad     4            98      1                 0      83.62
2    male     Grad     3            99      1                 1      81.17
3    female   Grad     6            64      1                 0      80.52
4    male     UG       10           66      0                 1      79.59
5    male     Grad     3            82      1                 1      74.18
6    female   Grad     1            64      1                 0      66.79
7    female   Grad     1            87      1                 0      71.71
8    male     UG       14           91      0                 1      97.54
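
The two dummy columns in the table follow mechanically from the gender and degree columns. A minimal sketch of that coding for the eight listed rows, with illustrative variable names:

```python
# Categorical columns for the eight rows shown in the table
gender = ["female", "male", "female", "male", "male", "female", "female", "male"]
degree = ["Grad", "Grad", "Grad", "UG", "Grad", "Grad", "Grad", "UG"]

# Dummy coding: 1 if the row has the named attribute, 0 otherwise
male = [1 if g == "male" else 0 for g in gender]
graduate_degree = [1 if d == "Grad" else 0 for d in degree]

for row_id, (m, gd) in enumerate(zip(male, graduate_degree), start=1):
    print(row_id, gd, m)
```

The category coded 0 (female, UG) acts as the reference group against which the dummy's coefficient is read.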

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.99
R Square            0.97
Adjusted R Square   0.97
Standard Error      2.52
Observations        50

ANOVA
             df   SS         MS        F        Significance F
Regression   4    10085.60   2521.40   397.17   0.00
Residual     45   285.68     6.35
Total        49   10371.28

                  Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         34.85          2.70             12.91    0.00      29.41       40.28
experience        3.01           0.08             37.59    0.00      2.85        3.17
score             0.28           0.03             8.89     0.00      0.22        0.34
graduate degree   8.28           0.73             11.42    0.00      6.82        9.74
male              -0.92          0.72             -1.27    0.21      -2.37       0.53

Q. Estimated regression equation?

Q. Interpretation of the coefficient of the dummy variable:
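
One possible way to approach these questions is to read the Coefficients column as an equation and evaluate it for two programmers who differ only in a dummy variable. The sketch below uses the rounded coefficients from the output; the function name estimated_salary and its argument names are hypothetical.

```python
# Rounded coefficients from the four-covariate summary output
INTERCEPT = 34.85
B_EXPERIENCE = 3.01
B_SCORE = 0.28
B_GRAD = 8.28    # dummy: 1 = graduate degree, 0 = undergraduate
B_MALE = -0.92   # dummy: 1 = male, 0 = female

def estimated_salary(experience, score, graduate_degree, male):
    """Estimated salary in thousands of dollars."""
    return (INTERCEPT + B_EXPERIENCE * experience + B_SCORE * score
            + B_GRAD * graduate_degree + B_MALE * male)

# Row 1 of the data table: female, Grad, 4 years of experience, score 98
row1 = estimated_salary(4, 98, graduate_degree=1, male=0)

# Same experience and score, switching one dummy at a time
grad_gap = row1 - estimated_salary(4, 98, graduate_degree=0, male=0)
male_gap = estimated_salary(4, 98, graduate_degree=1, male=1) - row1

print(f"row 1 estimate: {row1:.2f} (000s)")
print(f"graduate vs undergraduate, all else equal: {grad_gap:.2f} (000s)")
print(f"male vs female, all else equal: {male_gap:.2f} (000s)")
```

Each gap equals the corresponding dummy coefficient, because everything else in the equation is held fixed.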

In general: If a qualitative variable has k possible outcomes, then (k - 1) dummy variables are required, with
each dummy variable being coded as 0 or 1.
For example, if we had 3 categories for the level of education (Bachelors, Masters and Ph.D.), we would
need (3 - 1) = 2 dummy variables, say Masters and Ph.D., which are coded as follows:

Highest degree   Masters   Ph.D.
Bachelors        0         0
Masters          1         0
Ph.D.            0         1
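
The (k - 1) rule can be sketched as a small helper that leaves out one baseline category. Here Bachelors is taken as the baseline, as in the coding table; the function name dummy_code is hypothetical.

```python
def dummy_code(value, categories, baseline):
    """Return the (k - 1) dummy variables for one observation as a dict."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One 0/1 indicator per category, except the baseline
    return {c: int(value == c) for c in categories if c != baseline}

levels = ["Bachelors", "Masters", "Ph.D."]
coded = {level: dummy_code(level, levels, baseline="Bachelors") for level in levels}

for level, dummies in coded.items():
    print(level, dummies)
```

The baseline category is the one for which every dummy equals 0, so a third dummy for Bachelors would add no new information.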
