
Slide 1

Linear regression provides additional statistical information about the relationship between two quantitative variables.

The coefficient of determination, R², indicates the percentage of variance in the dependent variable that is accounted for by variability in the independent variable.

The regression equation is the formula for the trend or fit line, which enables us to predict the dependent variable for any given value of the independent variable.

The regression equation has two parts: the intercept and the slope.

The intercept is the point on the vertical axis where the regression line crosses. It generally does not provide useful information.

Slide 2

The slope is the change in the dependent variable for a one-unit change in the independent variable. The slope tells us the direction and magnitude of the change.

The regression line represents the predicted value of the
dependent variable for each value of the independent
variable.

The differences between the predicted values and the actual values of the dependent variable are called residuals. Residuals are the errors that we cannot predict.

Residuals provide us with an important diagnostic tool for determining whether linear regression is an appropriate statistical technique for analyzing the relationship between two quantitative variables.
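As an illustration, here is a minimal Python sketch (with made-up numbers, not the GSS2000R data) showing that residuals are simply the actual values minus the values predicted by the fitted line:

```python
import numpy as np

# Hypothetical paired observations (not values from the GSS2000R data set).
x = np.array([8, 10, 12, 14, 16], dtype=float)   # independent variable
y = np.array([30, 35, 41, 44, 52], dtype=float)  # dependent variable

slope, intercept = np.polyfit(x, y, 1)  # least-squares trend line
predicted = intercept + slope * x       # value on the regression line for each case
residuals = y - predicted               # actual minus predicted: the error we cannot predict

print(residuals)                  # one residual per case
print(round(residuals.sum(), 10)) # sums to approximately zero for a least-squares line
```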

Slide 3

Linear regression requires us to satisfy three assumptions
about the distributions of the two quantitative variables:
No outliers
A linear relationship between the variables
Equal variance of the residuals across predicted values

The evaluation of conformity to these assumptions is generally based on visual analysis of the scatterplot of the dependent variable by the independent variable and of the residual plot, a scatterplot of the residuals on the vertical axis by the predicted values on the horizontal axis.

Numeric results are also available to evaluate each of these assumptions.
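Outside SPSS, a residual plot like the one described above can be sketched as follows (assuming matplotlib is available; the data are hypothetical, continuing the made-up numbers from the earlier sketch rather than the GSS2000R values):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data (not the GSS2000R values).
x = np.array([8, 10, 12, 14, 16], dtype=float)
y = np.array([30, 35, 41, 44, 52], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted

# Residuals on the vertical axis by predicted values on the horizontal axis.
plt.scatter(predicted, residuals)
plt.axhline(0, color="gray", linestyle="--")  # zero-error reference line
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```

A roughly rectangular, patternless cloud around the zero line is consistent with the linearity and equal variance assumptions.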

Slide 4

If we do not satisfy the assumptions, we can:
Report the results, noting the limitations produced by
violation of the assumptions
Report the results, ignoring the violations of assumptions,
using the argument of robustness to violations
Re-express one or both variables
Omit the outliers
Dichotomize the independent variable, splitting the values
at the mean, median, or some other logical value

Simple linear regression refers to analysis with one independent variable.
Multiple regression refers to analysis with more than one independent variable.

Visualizing Linear Regression
SW388R6
Data Analysis and Computers I

Slide 6
Visualizing Regression Analysis - 1
While we will base our problem solving on numeric statistical
results computed by SPSS, we can use a scatterplot to
demonstrate regression graphically.

We will use the variable "highest year of school completed"
[educ] as the independent variable and "occupational prestige
score" [prestg80] as the dependent variable from the
GSS2000R data set to demonstrate the relationship graphically.

Slide 7
Visualizing Regression Analysis - 2
A scatterplot of prestg80 by educ produced by SPSS: the dots in the body of the chart represent the cases in the distribution. The independent variable is plotted on the x-axis, or horizontal axis; the dependent variable is plotted on the y-axis, or vertical axis.

Slide 8
Visualizing Regression Analysis - 3
I have drawn a green horizontal line through the mean of prestg80 (44.17). NOTE: the plots were created in SPSS by adding features to the default plot.

The differences between the mean line and the dots (shown as pink lines) are the deviations.

The sum of the squared deviations is the measure of total error when the mean is used as the estimated score for each case.

Slide 9
Visualizing Regression Analysis - 4
A regression line and the
regression equation are
added in red to the
scatterplot.
The pink deviations from the
mean have been replaced with
the orange deviations from the
regression line. Deviations
between cases and the regression
line are called residuals.

Slide 10
Visualizing Regression Analysis - 5
The existence of a relationship between the variables is supported when the sum of the squared orange residuals is significantly less than the sum of the squared pink deviations.
Recall that both deviations and residuals can
be referred to as errors. If there is a
relationship, we can characterize it as a
reduction in error.

Slide 11
Visualizing Regression Analysis - 6
While squaring and summing the deviations and residuals ourselves would be tedious, the SPSS regression output provides the answer.

The sum of the squared pink deviations from the mean is the Total Sum of Squares in the ANOVA table (49104.91).

The sum of the squared orange residuals from the regression line is the Residual Sum of Squares in the ANOVA table (37086.80).
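The same two quantities could be computed directly outside SPSS; here is a minimal Python sketch using hypothetical educ/prestg80 values (so the results will not match 49104.91 and 37086.80):

```python
import numpy as np

# Hypothetical values standing in for educ and prestg80 (not the GSS2000R data).
educ = np.array([8, 10, 12, 14, 16, 18], dtype=float)
prestg80 = np.array([30, 35, 41, 44, 52, 55], dtype=float)

slope, intercept = np.polyfit(educ, prestg80, 1)
predicted = intercept + slope * educ

total_ss = ((prestg80 - prestg80.mean()) ** 2).sum()  # squared deviations from the mean
residual_ss = ((prestg80 - predicted) ** 2).sum()     # squared residuals from the regression line

print(total_ss, residual_ss)
```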

Slide 12
Visualizing Regression Analysis - 7
The difference between the Total Sum of Squares and the Residual Sum of Squares is the Regression Sum of Squares.

The Regression Sum of Squares is the amount of error that can be eliminated by using the regression equation, instead of the mean of prestg80, to estimate values of prestg80.

The Regression Sum of Squares in the ANOVA table is 12018.11 (49104.91 - 37086.80 = 12018.11).

Slide 13
Visualizing Regression Analysis - 8
We can compute the proportion of error that was reduced by the regression by dividing the Regression Sum of Squares by the Total Sum of Squares:

12018.11 ÷ 49104.91 = 0.245


Slide 14
Visualizing Regression Analysis - 9
The reduction in error that we computed (0.245) is equal to the R Square that SPSS provides in the Model Summary table.

R Square is the coefficient of determination, which is usually characterized as:

the proportion of variance in the dependent variable explained by the independent variable, or

the reduction in error (or increase in accuracy).

In multiple regression, the symbol for the coefficient of determination is R². In simple linear regression, the symbol is r².
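A small check of this equivalence, using the same hypothetical numbers as the earlier sketch (scipy's linregress reports the correlation r, whose square is R Square):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical values standing in for educ and prestg80.
educ = np.array([8, 10, 12, 14, 16, 18], dtype=float)
prestg80 = np.array([30, 35, 41, 44, 52, 55], dtype=float)

fit = linregress(educ, prestg80)
predicted = fit.intercept + fit.slope * educ

total_ss = ((prestg80 - prestg80.mean()) ** 2).sum()
regression_ss = total_ss - ((prestg80 - predicted) ** 2).sum()

# The proportional reduction in error equals the squared correlation (R Square).
print(regression_ss / total_ss, fit.rvalue ** 2)  # the two printed values agree
```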

Slide 15
Visualizing Regression Analysis - 10
The correlation coefficient, Multiple R, is defined as the positive square root of R Square. SPSS uses the same terminology for Multiple Regression and Simple Linear Regression.

This can be misleading in Simple Linear Regression, where the correlation between the two variables, r, can have a negative sign for an inverse relationship. While Multiple R will have the same numeric value as r in Simple Linear Regression, we should look at Beta in the table of coefficients to make certain that we are interpreting the direction of the relationship correctly.
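To illustrate the sign issue, here is a minimal sketch with hypothetical numbers showing an inverse relationship:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical inverse relationship: y falls as x rises.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10, 8, 7, 5, 3], dtype=float)

fit = linregress(x, y)
r = fit.rvalue       # the correlation keeps its negative sign
multiple_r = abs(r)  # what SPSS labels Multiple R is always positive

print(r, multiple_r, fit.slope)  # the slope (like B and Beta) shows the direction
```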

Slide 16
Visualizing Regression Analysis - 11
The regression equation is based on the Unstandardized
Coefficients (B) in the table of Coefficients.

The B coefficient labeled (Constant) is the intercept. The B
coefficient for the variable educ is the slope of the regression
line.

The regression equation for the relationship between
prestg80 and educ is:

prestg80 = 12.928 + 2.359 x educ
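For readers working outside SPSS, a minimal sketch of obtaining the two unstandardized coefficients with scipy (hypothetical educ/prestg80 values, so the fitted numbers will differ from 12.928 and 2.359):

```python
from scipy.stats import linregress

# Hypothetical values standing in for the GSS2000R variables.
educ = [8, 10, 12, 14, 16, 18]
prestg80 = [30, 35, 41, 44, 52, 55]

fit = linregress(educ, prestg80)
print(fit.intercept)  # corresponds to the B coefficient labeled (Constant)
print(fit.slope)      # corresponds to the B coefficient for educ
```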

Slide 17
Visualizing Regression Analysis - 12
The Standardized Coefficients (Beta) in the table of Coefficients are the
regression coefficients for the relationship between the standardized
dependent variable (z-scores) and the standardized independent variable (z-
scores).

Since standardizing variables removes the unit of measurement from the
coefficients, we can compare the Beta coefficients to interpret the relative
importance of each independent variable in Multiple Regression.

In Simple Linear Regression, Beta will be equal to r, the correlation
coefficient. Multiple R, r, and Beta all have the same numeric value, though
Multiple R will be positive even when r and Beta are negative.
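A minimal sketch of that equality, again with hypothetical numbers: converting both variables to z-scores and refitting yields a slope (Beta) equal to the correlation r.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical values standing in for educ and prestg80.
educ = np.array([8, 10, 12, 14, 16, 18], dtype=float)
prestg80 = np.array([30, 35, 41, 44, 52, 55], dtype=float)

# Standardize both variables (mean 0, standard deviation 1).
z_educ = (educ - educ.mean()) / educ.std()
z_prestg80 = (prestg80 - prestg80.mean()) / prestg80.std()

beta = linregress(z_educ, z_prestg80).slope  # standardized coefficient (Beta)
r = linregress(educ, prestg80).rvalue        # correlation coefficient r
print(beta, r)                                # the two values agree
```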

Slide 18
Visualizing Regression Analysis - 13
The sign of the Beta coefficient, as well as the sign of the B
coefficient, tells us the direction of the relationship.

If the coefficients are positive, the relationship is
characterized as direct or positive, meaning that higher
values of the dependent variable are associated with
higher values of the independent variables.

If the coefficients are negative, the relationship is
characterized as inverse or negative, meaning that lower
values of the dependent variable are associated with
higher values of the independent variables.

Slide 19
Visualizing Regression Analysis - 14
The regression line represents the estimated value of
prestg80 for every value of educ.

To obtain the estimate, we draw a vertical line from the value on the x-axis up to the point where it intersects the regression line. We then draw a horizontal line from that intersection point to the y-axis. The point where this line meets the y-axis is the estimated value of the dependent variable.

Slide 20
Visualizing Regression Analysis - 15
If we draw a vertical line from the educ value of 5 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 25.

We can compute the exact value by substituting in the regression equation:

prestg80 = 12.93 + 2.36 x 5 = 24.73


Slide 21
Visualizing Regression Analysis - 16
If we draw a vertical line from the educ value of 15 to the regression line and then a horizontal line to the vertical axis, we see that the estimated value for prestg80 is about 50.

We can compute the exact value by substituting in the regression equation:

prestg80 = 12.93 + 2.36 x 15 = 48.33
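The same substitution can be written as a tiny helper function (a sketch using the rounded coefficients quoted on these slides):

```python
def predict_prestg80(educ, intercept=12.93, slope=2.36):
    """Predicted occupational prestige for a given number of years of education."""
    return intercept + slope * educ

print(round(predict_prestg80(5), 2))   # 24.73
print(round(predict_prestg80(15), 2))  # 48.33
```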

Examples of Residual Plots

Slide 23
Example of a null plot that conforms to the assumptions

Slide 24
Example of a violation of the linearity assumption

Slide 25
Example of a violation of the equal variance assumption

Slide 26
Example of the presence of outliers
