Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear regression. Correlation and regression provide us with different, but complementary, information on the relationship between two quantitative variables.


Slide 1

CreditCardData.sav has five variables for 8 cases. The data for the 8 cases are shown in the Data View to the left. The names and labels for each of the variables are shown below in the Variable View.

The goal of this analysis is to study the relationship between family size and number of credit cards. Finding the relationship will help us predict the number of credit cards a family typically has relative to the number of family members. If a family had fewer than expected, they would be a good candidate for us to extend another credit card offer.

Slide 2

Creating a histogram of the dependent variable, ncards, shows a distribution that is about as normal as we could expect for only 8 cases. I have superimposed the red normal curve and blue mean line on the histogram.

For any quantitative variable, our best single estimate of the values for cases in the distribution is the mean, because it minimizes the squared errors, i.e. the squared differences between the estimated value and the actual score represented by each of the bars in the histogram.

Slide 3

To demonstrate that the mean is the best value to estimate, I created a worksheet in Excel that compares the error associated with three different estimates of values for each case: the mean of 7, an estimate lower than the mean: 6, and an estimate higher than the mean: 8.

Error is calculated as the sum of the squared deviations from the value used as the estimate. Columns C, F, and I contain the deviations from each of the estimates (7, 6, and 8). Columns D, G, and J contain the squared deviations, with the summed total at the base of each column. Using the mean of 7 as the estimate, there are 22 units of error. Using either 6 or 8 results in 30 units of error. This measure of error, computed around the mean, is called the Total Sum of Squares.

Slide 4
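The worksheet comparison of the three estimates can be sketched in a few lines of Python. The eight ncards values are not listed in these slides, so the list below is an assumption: illustrative values chosen to be consistent with the figures quoted here (a mean of 7, and 22 versus 30 units of error). The function name is also made up for the illustration.

```python
# Assumed data: the slides do not list the eight values of ncards.
# These illustrative values are consistent with the quoted mean of 7.
ncards = [4, 6, 6, 7, 8, 7, 8, 10]


def sum_of_squared_errors(values, estimate):
    """Sum of squared deviations of each value from one fixed estimate."""
    return sum((v - estimate) ** 2 for v in values)


mean_cards = sum(ncards) / len(ncards)

# Compare the mean with one lower and one higher estimate.
for estimate in (mean_cards, 6, 8):
    print(estimate, sum_of_squared_errors(ncards, estimate))
```

With these assumed values, the mean gives 22 units of error and both alternative estimates give 30, matching the worksheet totals described above.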

The graph for the relationship between two quantitative variables is the scatterplot, with the independent variable Family Size on the horizontal x-axis, and the dependent variable Number of Credit Cards on the vertical y-axis.

Each dot represents the combination of scores for one case. For example, this dot represents a family of 5 that had 8 credit cards.

I have superimposed the blue dotted mean line for Number of Credit Cards on the scatterplot. We see that the scores for two cases actually fall on the mean line, while the other six are at varying distances from the mean line.

Slide 5

The purple lines are the deviations, i.e. the differences between individual scores and the mean of the dependent variable.

If we square the deviations and sum the squares, we have the Total Sum of Squares.

The differences are often phrased as distances, i.e. the vertical distance between the mean line and the score for this case on the dependent variable is 3.

Slide 6

I have added the green vertical dotted line at the mean family size, 4.25.

The regression line will pass through the intersection of the means of both variables, and will minimize the total sum of the squared differences between the individual scores and the regression line.

Slide 7

One way to think about linear regression is that we are rotating a line through the intersection of the means of the two variables.

Each time we rotate the line, we would compute the sum of the squared deviations of the scores from the line. We stop when we have found the line that has the smallest sum of squares.

There is a direct method for finding the regression line that does not require this trial-and-error strategy.

Slide 8
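The contrast between rotating the line and the direct method can be sketched as follows. The data values are an assumption (the slides do not list them; these are illustrative values consistent with the quoted summary figures), and the direct method shown is the standard least-squares formula, slope = sum of cross-products / sum of squares of x, which may or may not be how SPSS computes it internally.

```python
# Assumed data: illustrative values, not taken from CreditCardData.sav.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

n = len(famsize)
mean_x = sum(famsize) / n
mean_y = sum(ncards) / n


def sse_for_slope(slope):
    """Squared error of the line with this slope through the point of means."""
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(famsize, ncards))


# Trial and error: "rotate" the line by trying slopes 0.00, 0.01, ..., 2.00.
candidates = [i / 100 for i in range(201)]
best_trial_slope = min(candidates, key=sse_for_slope)

# Direct method: standard least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(famsize, ncards))
sxx = sum((x - mean_x) ** 2 for x in famsize)
direct_slope = sxy / sxx
direct_intercept = mean_y - direct_slope * mean_x

print(best_trial_slope, round(direct_slope, 3), round(direct_intercept, 3))
```

A slope of 0 corresponds to the flat mean line, so `sse_for_slope(0)` is simply the total sum of squares; the grid search only approximates what the direct formula gives exactly.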

If there is no relationship, the blue regression line will be on top of (or very close to) the dotted blue mean line for the dependent variable.

No relationship means that we cannot reduce the error or total sum of squares of the dependent variable by using the relationship to the independent variable.

Slide 9

The points along the regression line represent the estimated values for all possible values of the independent variable.

For example, if we wanted to estimate the number of cards for a family of 4, we would draw a vertical line from the 4 on the horizontal axis up to the regression line, and from the regression line left to the vertical axis. The location on the vertical axis is the estimated number of cards that a family of 4 would have, i.e. about 6.8 cards.

Slide 10

The differences between the estimated value and the actual value for the cases are deviations that are called residuals (the light blue lines). They represent errors in predicting the values of the dependent variable based on the value of the independent variable. We had two cases with a family size of 4. Our estimated value was overstated for one of the cases, and understated for the other case.

Slide 11

The formula for the regression line can be extracted from the SPSS output. For this example, the regression equation is:

ncards = 2.871 + 0.971 × famsize
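As a small sketch, the regression equation can be wrapped in a function to reproduce the estimate that was read off the chart for a family of 4. The coefficients are the rounded values quoted from the SPSS output; the function name is hypothetical.

```python
# Rounded coefficients quoted from the SPSS output.
INTERCEPT = 2.871
SLOPE = 0.971


def predict_ncards(famsize):
    """Estimated number of credit cards for a given family size."""
    return INTERCEPT + SLOPE * famsize


print(round(predict_ncards(4), 3))     # 6.755, i.e. about 6.8 cards
print(round(predict_ncards(4.25), 3))  # 6.998, close to the mean of 7
```

The second call illustrates that the line passes (up to rounding of the coefficients) through the intersection of the two means.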

Slide 12

We can plug the regression equation into Excel and estimate the number of cards for each case.

To compute the residuals, we subtract the actual value for ncards from the estimated value for the case.

If we square the residuals, and sum the squares, we have the amount of error associated with using the regression line to estimate each case, 5.485758.
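The spreadsheet steps (estimate, residual, squared residual, sum) might look like this in Python. The data values are an assumption, since the eight cases are not listed in these slides; they are illustrative values consistent with the figures quoted here. The coefficients are the rounded ones from the regression equation.

```python
# Assumed data: illustrative values, not taken from CreditCardData.sav.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

# Rounded coefficients quoted from the SPSS output.
INTERCEPT = 2.871
SLOPE = 0.971

estimated = [INTERCEPT + SLOPE * x for x in famsize]

# Residual as described in the text: estimated value minus actual value.
# (Many textbooks use actual minus estimated; squaring removes the sign.)
residuals = [est - actual for est, actual in zip(estimated, ncards)]

sum_squared_residuals = sum(r ** 2 for r in residuals)
print(round(sum_squared_residuals, 6))  # 5.485758
```

For the two assumed cases with a family size of 4, one residual is positive and one is negative, which is the overstated/understated pattern described earlier.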

Slide 13

If we plug the total sum of squares and the sum of squared residuals into an Excel spreadsheet, we can compute the reduction in the total sum of squared errors associated with using the information in the independent variable, as represented by the regression equation.

If we compute the percentage of total error reduced by the regression equation, we end up with the value of R², the percentage of variance explained by the regression relationship. Our calculation for R² agrees with the value of R Square in the SPSS output.
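The reduction-in-error calculation uses only the two sums of squares already quoted, so it can be checked directly. This is a sketch of the arithmetic, not of the actual spreadsheet layout; the variable names are illustrative.

```python
total_ss = 22.0          # total sum of squares around the mean
residual_ss = 5.485758   # sum of squared residuals around the regression line

# Error removed by using the regression line instead of the mean.
explained_ss = total_ss - residual_ss

# Proportion of the total error that the regression line removes.
r_squared = explained_ss / total_ss
print(round(r_squared, 6))  # 0.750647
```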

Slide 14

R² is often interpreted as the percentage of variance explained. We can convert our Sum of Squares column to Variance by dividing by the number of cases in the sample minus one (8 - 1).

If we compute the percentages using variances instead of sums of squares, we end up with exactly the same value for R², 0.750647.
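A brief sketch of why the two calculations agree: every sum of squares is divided by the same constant (8 - 1), so the constant cancels in the ratio. The variable names are illustrative.

```python
n_cases = 8
total_ss = 22.0
residual_ss = 5.485758

# Convert each sum of squares to a variance by dividing by n - 1.
total_variance = total_ss / (n_cases - 1)
residual_variance = residual_ss / (n_cases - 1)

r_squared_from_variances = (total_variance - residual_variance) / total_variance
r_squared_from_ss = (total_ss - residual_ss) / total_ss

print(round(r_squared_from_variances, 6), round(r_squared_from_ss, 6))
```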

R² is also interpreted as the proportional reduction in error (a PRE statistic), which we can also phrase as an increase in accuracy. We should remember that no matter whether we interpret R² as explaining variance or reducing error, the statistic applies to the total error in the distribution, not to the error in individual cases.

Slide 15

We can also think of regression and correlation as based on the pattern of deviations for the two variables across the cases in the distribution. To present this, we will first compute the standard scores for each variable. As standard scores, the value for each case is the deviation from the mean of 0, which is the mean of the distribution of standard scores.
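Computing standard scores can be sketched with Python's `statistics` module, using the sample standard deviation (n - 1 in the denominator). The data values are an assumption: the eight cases are not listed in these slides, and these illustrative values were chosen to be consistent with the summary figures quoted. The helper name `z_scores` is hypothetical.

```python
import statistics

# Assumed data: illustrative values, not taken from CreditCardData.sav.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]


def z_scores(values):
    """Deviation of each value from the mean, in standard deviation units."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


z_fam = z_scores(famsize)
z_cards = z_scores(ncards)

for zx, zy in zip(z_fam, z_cards):
    print(round(zx, 3), round(zy, 3))
```

By construction, each column of z-scores has a mean of 0 and a standard deviation of 1.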

Slide 16

Plotting the z-scores for both variables produces the same pattern in the scatterplot that we found with the raw data.

As we would expect for standard scores, the green dotted line for the mean z-score for family size is at zero, as is the dotted blue line for the standard scores for number of credit cards.

Slide 17

We add lines for the deviation from the means for both variables.

The green deviation lines represent differences from the mean z-score for family size.

The blue deviation lines represent differences from the mean z-score for number of credit cards.

Slide 18

The strength of the relationship will depend on the agreement of the deviations for each case, i.e. the extent to which the green line deviation for a case agrees with the blue line deviation.

For some points, the length of the green deviation line is similar to the length of the blue deviation line.

For other points, the length of the green deviation line is shorter than the length of the blue deviation line.

Slide 19

Overall, the pattern of the deviations is similar. Green deviations above the mean are paired with blue deviations above the mean. Green deviations below the mean are paired with blue deviations below the mean. Though the length of the deviations for individual cases varies, the overall pattern suggests a strong relationship.

Slide 20

To compute the correlation coefficient, we multiply the z-scores for each case, and sum the products across all the cases.

To compute Pearson's r, we divide the sum of the z-score products by the number of cases minus one.

The value for Pearson's r that we computed agrees with the value supplied by SPSS.

Finally, if we square the value of Pearson's r, we have the same value as R Square in the SPSS regression output.

Slide 21
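The z-score route to Pearson's r can be sketched as below. As before, the data values are an assumption (illustrative values consistent with the figures quoted in these slides, not the contents of the data file), and `z_scores` is a hypothetical helper.

```python
import statistics

# Assumed data: illustrative values, not taken from CreditCardData.sav.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]


def z_scores(values):
    """Standard scores based on the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z_fam = z_scores(famsize)
z_cards = z_scores(ncards)

# Multiply the paired z-scores and sum across all cases.
sum_of_products = sum(zx * zy for zx, zy in zip(z_fam, z_cards))

# Divide by the number of cases minus one to get Pearson's r.
pearson_r = sum_of_products / (len(famsize) - 1)

print(round(pearson_r, 3))       # 0.866
print(round(pearson_r ** 2, 3))  # 0.751
```

With unrounded inputs the squared correlation differs from 0.750647 only in the sixth decimal place, because that figure was computed from regression coefficients rounded to three decimals.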

If we return to the regression results for the raw data instead of the standard scores, we can show the link between Pearson's r and the slope in the regression equation.

Think of the standard deviation as a measure of average difference from the mean for all of the cases for each of the variables. The standard deviation for number of cards is 1.773 and the standard deviation for family size is 1.581.

Recall that the slope of the regression line represents change in the dependent variable associated with a one unit change in the independent variable. Thus, when a family had one more member, we would predict that they had 0.971 more credit cards.

Slide 22

If the relationship between the two variables were perfect (one predicted the other without error), we could compute the slope of the line using the average amount of difference in each of the distributions, i.e. the standard deviations.

On average, the number of cards would go up 1.773 cards for a difference of 1.581 members in a family. We can simplify this by dividing the standard deviation for number of cards by the standard deviation for family size:

1.773 / 1.581 = 1.121

Thus, if the relationship were perfect, we would increase our estimate of the number of cards in a family by 1.121 for every additional member of a family.

Slide 23

If the relationship between the two variables were perfect, Pearson's r would be 1.0 (or -1.0 if the relationship were inverse). However, we know that Pearson's r is less than that; it is actually 0.866. If the slope of the regression line would be 1.121 when the relationship is perfect, then we might expect the slope to be 0.866 × 1.121 when the relationship is less than perfect. And in fact, that turns out to be true, since:

0.866 × 1.121 = 0.971

The slope of the regression line is the ratio of the standard deviations multiplied by the correlation coefficient.
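The arithmetic linking the slope to Pearson's r can be checked using only the rounded figures quoted in the text. The variable names are illustrative.

```python
sd_ncards = 1.773    # standard deviation of number of cards
sd_famsize = 1.581   # standard deviation of family size
pearson_r = 0.866    # correlation between the two variables

# Slope we would see if the relationship were perfect (r = 1.0).
perfect_slope = sd_ncards / sd_famsize

# Actual slope: the perfect slope scaled down by the correlation.
slope = pearson_r * perfect_slope

print(round(perfect_slope, 3))  # 1.121
print(round(slope, 3))          # 0.971
```

The result matches the slope coefficient in the regression equation, 0.971.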

Slide 24