Linear regression provides additional statistical information
about the relationship between two quantitative variables. The coefficient of determination, R², indicates the percentage of variance in the dependent variable that is accounted for by variability in the independent variable. The regression equation is the formula for the trend or fit line, which enables us to predict the dependent variable for any given value of the independent variable. The regression equation has two parts: the intercept and the slope. The intercept is the point on the vertical axis where the regression line crosses. It generally does not provide useful information.
10/8/2014 Slide 2
The slope is the change in the dependent variable for a one unit change in the independent variable. The slope tells us the direction and magnitude of change.
The regression line represents the predicted value of the dependent variable for each value of the independent variable.
The difference between the predicted values and the actual values of the dependent variable are called residuals. Residuals are the errors that we cannot predict. Residuals provide us with an important diagnostic tool for determining that linear regression is an appropriate statistical technique for analyzing the relationship between two quantitative variables.
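The fit-and-residual idea can be sketched numerically. This is an illustrative pure-Python example with made-up data (not the GSS variables used later in the slides); the variable names are hypothetical.

```python
# A sketch of ordinary least squares: fit a line, then compute the
# residuals (actual minus predicted values). Toy data, not the GSS.
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope

educ = [8, 10, 12, 14, 16, 18]        # hypothetical years of schooling
prestige = [30, 35, 38, 46, 50, 55]   # hypothetical prestige scores

a, b = fit_line(educ, prestige)
predicted = [a + b * xi for xi in educ]
residuals = [yi - pi for yi, pi in zip(prestige, predicted)]

# For a least-squares fit the residuals sum to (approximately) zero.
print(abs(sum(residuals)) < 1e-9)
```

Plotting these residuals against the predicted values gives the residual plot described on the next slides.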
Slide 3
Linear regression requires us to satisfy three assumptions about the distributions of the two quantitative variables:
- No outliers
- A linear relationship between the variables
- Equal variance of the residuals across predicted values
The evaluation of the conformity of the analysis to these assumptions is generally based upon visual analysis of the scatterplot of the dependent variable by the independent variable and the residual plot, a scatterplot of the residuals on the vertical axis by the predicted values on the horizontal axis. Numeric results are also available to evaluate each of these assumptions.
Slide 4
If we do not satisfy the assumptions, we can:
- Report the results, noting the limitations produced by violation of the assumptions
- Report the results, ignoring the violations of assumptions, using the argument of robustness to violations
- Re-express one or both variables
- Omit the outliers
- Dichotomize the independent variable, splitting the values at the mean, median, or some other logical value
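The last remedy, dichotomizing at the median, can be sketched as follows. The data values are hypothetical, chosen only to illustrate the split.

```python
import statistics

# Dichotomize a variable by splitting its values at the median:
# cases above the median are coded 1, the rest 0. Toy data.
educ = [8, 10, 12, 12, 14, 16, 20]   # illustrative years of schooling
cut = statistics.median(educ)        # 12
educ_hi = [1 if v > cut else 0 for v in educ]
print(cut, educ_hi)                  # 12 [0, 0, 0, 0, 1, 1, 1]
```

The same pattern applies when splitting at the mean or some other logical value; only the cut point changes.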
Simple linear regression refers to analysis with one independent variable. Multiple regression refers to analysis with more than one independent variable.
Visualizing Linear Regression
SW388R6 Data Analysis and Computers I
Slide 6 Visualizing Regression Analysis - 1 While we will base our problem solving on numeric statistical results computed by SPSS, we can use a scatterplot to demonstrate regression graphically.
We will use the variable "highest year of school completed" [educ] as the independent variable and "occupational prestige score" [prestg80] as the dependent variable from the GSS2000R data set to demonstrate the relationship graphically.
Slide 7 Visualizing Regression Analysis - 2 The dots in the body of the chart represent the cases in the distribution. The independent variable is plotted on the x-axis, or the horizontal axis. The dependent variable is plotted on the y-axis, or the vertical axis. A scatterplot of prestg80 by educ produced by SPSS.
Slide 8 Visualizing Regression Analysis - 3 I have drawn a green horizontal line through the mean of prestg80 (44.17). NOTE: the plots were created in SPSS by adding features to the default plot. The differences between the mean line and the dots (shown as pink lines) are the deviations.
The sum of the squared deviations is the measure of total error when the mean is used as the estimated score for each case.
Slide 9 Visualizing Regression Analysis - 4 A regression line and the regression equation are added in red to the scatterplot. The pink deviations from the mean have been replaced with the orange deviations from the regression line. Deviations between cases and the regression line are called residuals.
Slide 10 Visualizing Regression Analysis - 5 The existence of a relationship between the variables is supported when the sum of the squared orange residuals is significantly less than the sum of the squared pink deviations. Recall that both deviations and residuals can be referred to as errors. If there is a relationship, we can characterize it as a reduction in error.
Slide 11 Visualizing Regression Analysis 6 While it is difficult for us to square and sum deviations and residuals by hand, SPSS regression output provides us with the answer. The sum of the squared pink deviations from the mean is the Total Sum of Squares in the ANOVA table (49104.91). The sum of the squared orange residuals from the regression line is the Residual Sum of Squares in the ANOVA table (37086.80).
Slide 12 Visualizing Regression Analysis 7 The difference between the Total Sum of Squares and the Residual Sum of Squares is the Regression Sum of Squares.
The Regression Sum of Squares is the amount of error that can be eliminated by using the regression equation to estimate values of prestg80 instead of the mean of prestg80. The Regression Sum of Squares in the ANOVA table is 12018.11.
Slide 13 Visualizing Regression Analysis 8 We can compute the proportion of error that was reduced by the regression by dividing the Regression Sum of Squares by the Total Sum of Squares:
12018.11 / 49104.91 = 0.245
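The arithmetic with the ANOVA values quoted above can be checked directly; this is a sketch of the calculation, not SPSS output.

```python
# Reproducing the sums-of-squares arithmetic from the slides' ANOVA table.
total_ss = 49104.91      # Total Sum of Squares (deviations from the mean)
residual_ss = 37086.80   # Residual Sum of Squares (deviations from the line)

regression_ss = total_ss - residual_ss   # error eliminated by the regression
r_square = regression_ss / total_ss      # proportion of error reduced

print(round(regression_ss, 2), round(r_square, 3))  # 12018.11 0.245
```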
Slide 14 Visualizing Regression Analysis 9 The reduction in error that we computed (0.245) is equal to the R Square that SPSS provides in the Model Summary table.
R² is the coefficient of determination, which is usually characterized as:
the proportion of variance in the dependent variable explained by the independent variable, or
the reduction in error (or increase in accuracy). In multiple regression, the symbol for the coefficient of determination is R². In simple linear regression, the symbol is r².
Slide 15 Visualizing Regression Analysis 10 The correlation coefficient, Multiple R, is defined as the positive square root of R Square. SPSS uses the same terminology for Multiple Regression and Simple Linear Regression.
This can be misleading in Simple Linear Regression, where the correlation for the relationship between the two variables, r, can have a negative sign for an inverse relationship. While Multiple R will have the same numeric value as r in Simple Linear Regression, we should look at Beta in the table of coefficients to make certain that we are interpreting the direction of the relationship correctly.
Slide 16 Visualizing Regression Analysis 11 The regression equation is based on the Unstandardized Coefficients (B) in the table of Coefficients.
The B coefficient labeled (Constant) is the intercept. The B coefficient for the variable educ is the slope of the regression line.
The regression equation for the relationship between prestg80 and educ is:
prestg80 = 12.928 + 2.359 x educ
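The regression equation can be written as a small function for computing estimates. This is a sketch using the coefficients quoted above; the function name is ours, not SPSS terminology.

```python
# The slides' regression equation as a function: estimated occupational
# prestige (prestg80) from years of schooling (educ).
def predict_prestige(educ):
    return 12.928 + 2.359 * educ

# Estimates for 5 and 15 years of schooling; the later slides report
# 24.73 and 48.33 because they round the coefficients to 12.93 and 2.36.
print(round(predict_prestige(5), 3))   # 24.723
print(round(predict_prestige(15), 3))  # 48.313
```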
Slide 17 Visualizing Regression Analysis 12 The Standardized Coefficients (Beta) in the table of Coefficients are the regression coefficients for the relationship between the standardized dependent variable (z-scores) and the standardized independent variable (z-scores).
Since standardizing variables removes the unit of measurement from the coefficients, we can compare the Beta coefficients to interpret the relative importance of each independent variable in Multiple Regression.
In Simple Linear Regression, Beta will be equal to r, the correlation coefficient. Multiple R, r, and Beta all have the same absolute numeric value, though Multiple R will be positive even when r and Beta are negative.
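The equality of Beta and r in simple linear regression can be verified on toy data (illustrative values, not the GSS): standardizing both variables rescales the slope B by the ratio of standard deviations, which reduces it to the correlation coefficient.

```python
# Verify that the standardized coefficient Beta equals Pearson's r
# in simple linear regression. Toy data.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

mx = sum(x) / len(x)
my = sum(y) / len(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5     # Pearson correlation coefficient
b = sxy / sxx                    # unstandardized slope B
beta = b * (sxx / syy) ** 0.5    # B rescaled by the sd ratio = Beta

print(abs(beta - r) < 1e-12)     # True
```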
Slide 18 Visualizing Regression Analysis 13 The sign of the Beta coefficient, as well as the sign of the B coefficient, tells us the direction of the relationship.
If the coefficients are positive, the relationship is characterized as direct or positive, meaning that higher values of the dependent variable are associated with higher values of the independent variable.
If the coefficients are negative, the relationship is characterized as inverse or negative, meaning that lower values of the dependent variable are associated with higher values of the independent variable.
Slide 19 Visualizing Regression Analysis - 14 The regression line represents the estimated value of prestg80 for every value of educ.
To obtain the estimate, we draw a vertical line from the value on the x-axis to the point where it intersects the regression line. We then draw a horizontal line from the intersection point to the y-axis. The point where this line crosses the y-axis is the estimated value for the dependent variable.
Slide 20 Visualizing Regression Analysis - 15 If we draw a vertical line from the educ value of 5 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 25.
We can compute the exact value by substituting in the regression equation:
prestg80 = 12.93 + 2.36 x 5 = 24.73
Slide 21 Visualizing Regression Analysis - 16 If we draw a vertical line from the educ value of 15 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 50.
We can compute the exact value by substituting in the regression equation:
prestg80 = 12.93 + 2.36 x 15 = 48.33
Examples of Residual Plots
Slide 23 Example of a null plot that conforms to the assumptions
Slide 24 Example of a violation of the linearity assumption
Slide 25 Example of a violation of the equal variance assumption
Slide 26 Example of the presence of outliers