
Covariance and correlation

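Coefficient of Correlation Values

[Figure: scale of r running from -1.0 through -.5, 0, and +.5 to +1.0. r = -1.0 is perfect negative correlation, r = 0 is no linear correlation, and r = +1.0 is perfect positive correlation. The degree of negative correlation increases toward -1.0; the degree of positive correlation increases toward +1.0.]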

Covariance and correlation


Covariance and correlation are used to quantify the relationship between pairs of variables.

Caution
A statistical relationship sometimes exists between variables even when a causal relationship is difficult to justify. Be cautious about drawing causal inferences based solely on statistical relationships.

Sample covariance
A measure of how two variables move together. For a sample of n pairs of data $(x_i, y_i)$, the sample covariance is defined as

$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
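The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides; the function name `sample_covariance` and the five data pairs are assumptions chosen for the example.

```python
def sample_covariance(xs, ys):
    """Sample covariance of paired data: sum of cross-deviations over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (n - 1)


# Hypothetical data: y tends to rise with x, so the covariance is positive.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
print(sample_covariance(xs, ys))  # 1.75
```

The sign shows the direction of the relationship, but the magnitude depends on the units of x and y, which is why the correlation coefficient is usually reported alongside it.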

Sample correlation coefficient


$r_{xy} = \frac{s_{xy}}{s_x s_y}$

$s_{xy}$ = sample covariance
$s_x$ and $s_y$ = sample standard deviations of x and y
r varies between -1 and +1.
$R^2$ is the coefficient of determination (CD); it gives the proportion of the variation explained by the relationship.
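As a rough sketch of this formula, the snippet below divides the sample covariance by the product of the two sample standard deviations. The function name `sample_correlation` and the data are hypothetical; `statistics.stdev` from the Python standard library is used because it divides by n - 1, matching the sample covariance.

```python
import statistics


def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), using n - 1 in every denominator."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    return s_xy / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data lying exactly on a line: r should be +1 or -1
# (up to floating-point rounding).
xs = [1, 2, 3, 4]
r_up = sample_correlation(xs, [2, 4, 6, 8])
r_down = sample_correlation(xs, [8, 6, 4, 2])
print(r_up, r_down)
```

The two calls illustrate the ends of the range: points on a rising line give r near +1, and points on a falling line give r near -1.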

Coefficient of Correlation
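$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$

where

$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$

$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$

$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$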
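The shortcut formulas need only five running totals, which makes them easy to compute by hand or in code. The sketch below is one possible implementation; the name `correlation_from_sums` and the sample data are assumptions made for illustration.

```python
import math


def correlation_from_sums(xs, ys):
    """Coefficient of correlation r from the SS_xx, SS_yy, SS_xy shortcuts."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical data, chosen only to exercise the function.
r = correlation_from_sums([2, 4, 6, 8], [1, 3, 2, 5])
print(round(r, 3))
```

Because $SS_{xy} = (n-1)\,s_{xy}$, $SS_{xx} = (n-1)\,s_x^2$, and $SS_{yy} = (n-1)\,s_y^2$, the factors of n - 1 cancel and this r equals $s_{xy} / (s_x s_y)$.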

Coefficient of Correlation Example


You're a marketing analyst for Hasbro Toys.

Ad $    Sales (Units)
1       1
2       1
3       2
4       2
5       4

Calculate the coefficient of correlation.

Solution Table
        x_i    y_i    x_i^2    y_i^2    x_i y_i
          1      1        1        1          1
          2      1        4        1          2
          3      2        9        4          6
          4      2       16        4          8
          5      4       25       16         20
Sum      15     10       55       26         37

Coefficient of Correlation Solution


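$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 55 - \frac{(15)^2}{5} = 10$

$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 26 - \frac{(10)^2}{5} = 6$

$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 37 - \frac{(15)(10)}{5} = 7$

$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{7}{\sqrt{10 \cdot 6}} \approx .904$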
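The arithmetic in this solution can be checked with a short script. It is only a sketch that reuses the five data pairs from the example; the variable names are assumptions.

```python
import math

ad = [1, 2, 3, 4, 5]        # x: advertising dollars
sales = [1, 1, 2, 2, 4]     # y: sales in units
n = len(ad)

# Column totals of the solution table.
sum_x, sum_y = sum(ad), sum(sales)
sum_x2 = sum(x * x for x in ad)
sum_y2 = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(ad, sales))
print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 15 10 55 26 37

# Shortcut sums of squares and the coefficient of correlation.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(ss_xx, ss_yy, ss_xy, round(r, 3))  # 10.0 6.0 7.0 0.904
```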

Coefficient of Correlation Thinking Challenge


You're an economist for the county cooperative. You gather the following data:

Fertilizer (lb.)    Yield (lb.)
 4                  3.0
 6                  5.5
10                  6.5
12                  9.0

Find the coefficient of correlation.


Solution Table*
        x_i     y_i    x_i^2     y_i^2    x_i y_i
          4     3.0       16      9.00         12
          6     5.5       36     30.25         33
         10     6.5      100     42.25         65
         12     9.0      144     81.00        108
Sum      32    24.0      296    162.50        218

Coefficient of Correlation Solution*


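$SS_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 296 - \frac{(32)^2}{4} = 40$

$SS_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 162.5 - \frac{(24)^2}{4} = 18.5$

$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 218 - \frac{(32)(24)}{4} = 26$

$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{26}{\sqrt{40 \cdot 18.5}} \approx .956$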
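As a cross-check, the sketch below computes r for the fertilizer data in two ways: with the sums-of-squares shortcut and with the covariance-over-standard-deviations definition. The variable names are assumptions; the data are the four pairs from the challenge.

```python
import math
import statistics

fertilizer = [4, 6, 10, 12]        # x, in lb.
crop_yield = [3.0, 5.5, 6.5, 9.0]  # y, in lb.
n = len(fertilizer)

# Route 1: sums-of-squares shortcut.
ss_xx = sum(x * x for x in fertilizer) - sum(fertilizer) ** 2 / n
ss_yy = sum(y * y for y in crop_yield) - sum(crop_yield) ** 2 / n
ss_xy = (sum(x * y for x, y in zip(fertilizer, crop_yield))
         - sum(fertilizer) * sum(crop_yield) / n)
r_shortcut = ss_xy / math.sqrt(ss_xx * ss_yy)

# Route 2: sample covariance over the product of sample standard deviations.
s_xy = ss_xy / (n - 1)
r_definition = s_xy / (statistics.stdev(fertilizer) * statistics.stdev(crop_yield))

print(round(r_shortcut, 3), round(r_definition, 3))  # 0.956 0.956
```

Both routes agree because the n - 1 factors cancel between the covariance and the two standard deviations.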


Covariance matrix
               Advertising    Sales
Advertising    1291600
Sales          6063430        30146353

Correlation matrix
               Advertising    Sales
Advertising    1
Sales          0.971711       1

Sales and Advertising are strongly correlated.

But we don't know what sales response to expect for a given level of advertising.
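The correlation shown here follows directly from the three covariance figures, reading 1291600 and 30146353 as the variances of Advertising and Sales and 6063430 as their covariance. A minimal sketch, with variable names that are assumptions:

```python
import math

var_advertising = 1291600   # variance of Advertising
var_sales = 30146353        # variance of Sales
cov_adv_sales = 6063430     # covariance of Advertising and Sales

# r = covariance / (std dev of Advertising * std dev of Sales)
r = cov_adv_sales / math.sqrt(var_advertising * var_sales)
print(round(r, 4))  # 0.9717
```

The result matches the reported correlation of 0.971711 to the precision shown.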

Correlation Models
Answers: How strong is the linear relationship between two variables?

Coefficient of correlation
Sample correlation coefficient is denoted r
Values range from -1 to +1
Measures the degree of association
Does not indicate a cause-effect relationship

Regression Analysis
Sales = 4.694 (advertising) + 49492

This model suggests that sales increase linearly with advertising: each additional unit of advertising is associated with an estimated 4.694 additional units of sales. Regression analysis (RA) lets us develop models that describe the relationship between a dependent variable and one or more independent variables, use them for estimation or prediction, and test the significance of the relationship statistically.
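To show how such a model is used for estimation, the sketch below wraps the fitted equation in a function. The name `predict_sales` and the advertising levels are hypothetical, and the slides do not state the units of either variable.

```python
def predict_sales(advertising):
    """Estimated sales from the fitted line Sales = 4.694 * advertising + 49492."""
    return 4.694 * advertising + 49492


# Hypothetical advertising levels, chosen only for illustration.
low, high = 10_000, 20_000
print(predict_sales(low), predict_sales(high))

# The difference between the two estimates is the slope times the change
# in advertising; the intercept cancels out.
print(predict_sales(high) - predict_sales(low))
```

Because of the intercept, doubling advertising raises the estimate by a fixed amount per unit but does not double estimated sales.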

Simple Linear Regression


$\hat{y} = b_0 + b_1 x$

$b_0$ = the estimated y-intercept
$b_1$ = the estimated slope of the regression line
$\hat{y}$ = the estimated value of the dependent variable
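The slides do not show how $b_0$ and $b_1$ are obtained. Assuming the usual least-squares estimates, $b_1 = SS_{xy} / SS_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$, a minimal sketch looks like this; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx          # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated y-intercept
    return b0, b1


# Hypothetical data pairs used only to exercise the function.
b0, b1 = fit_line([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(b0, 3), round(b1, 3))  # -0.1 0.7
```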


Standard Regression Statistics

Multiple R is the correlation coefficient.
R Square, the coefficient of determination, measures how well the regression line fits the data.
Adjusted R Square adjusts R Square for the sample size and the number of independent variables; it is useful when comparing this model with others that include additional variables.
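The slides do not give the adjustment formula. Assuming the standard definition, adjusted $R^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$ for n observations and k independent variables, a small sketch follows; the function name and the input values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical values: the same R^2 is penalized more as k grows.
print(adjusted_r_squared(0.80, n=30, k=1))
print(adjusted_r_squared(0.80, n=30, k=5))
```

This is why the adjusted figure is the fairer one for comparing models with different numbers of independent variables: adding a variable raises it only if the fit improves enough to offset the penalty.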

$SSE = \sum (y_i - \hat{y}_i)^2$

$SST = \sum (y_i - \bar{y})^2$

$SSR = \sum (\hat{y}_i - \bar{y})^2$

$R^2 = SSR / SST$

Strength of regression relationship


When two variables are not correlated (that is, the slope of the regression line is zero), the best estimate of the dependent variable for any value of the independent variable is simply the mean of the dependent variable. Any variation from the mean is due to random error. If the slope is not zero, a portion of the deviation from the mean is explained by the regression, and the remainder is due to error.

SST = total sum of squares = $\sum (y_i - \bar{y})^2$

SSE = sum of squared errors (residual) = $\sum (y_i - \hat{y}_i)^2$

SSR = sum of squares due to regression = $\sum (\hat{y}_i - \bar{y})^2$

SST = SSE + SSR and $R^2 = SSR / SST$. $R^2$ indicates the fraction of the total variation in the dependent variable about its mean that is explained by the regression line.
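The decomposition can be verified numerically. The sketch below fits a least-squares line to five hypothetical data pairs (the fitting formulas are the standard ones and are an assumption, since the slides do not state them) and then computes the three sums of squares.

```python
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least-squares line (standard formulas, assumed here).
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))    # unexplained (error)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by regression
r_squared = ssr / sst

print(round(sst, 3), round(sse, 3), round(ssr, 3), round(r_squared, 3))
# 6.0 1.1 4.9 0.817
```

Here about 82% of the variation in y about its mean is explained by the line, and SSE + SSR adds back up to SST.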

$0 \le R^2 \le 1$
$R^2 = 0$: none of the variation is explained by the regression line.
$R^2 = 1$: all of the variation is explained by the regression line; that is, all the data lie on the regression line.