
Lesson 15

Linear Regression
Lesson 15 Outline
Review correlation analysis
Dependent and Independent variables
Least Squares Regression line
Calculating the slope
Calculating the Intercept
Residuals and Residual Plots
Identifying significant relationship: t-test of the slope
R²: coefficient of determination
Using the regression line for Prediction of Y from X
Relationship between correlation coefficient and linear regression

PubH 6414 Lesson 15 2


Linear Regression and Correlation
Both Linear Regression and Correlation Analysis
can be used to explore the linear relationship
between two continuous (quantitative) random
variables.
Correlation analysis is used when the interest
is in identifying if a relationship exists and
quantifying the strength of the relationship
Regression Analysis is used to identify a
relationship AND to predict the value of one
variable given a value of the other variable(s).



Review: Correlation Analysis

1. Plot the data using a scatter plot to get a
visual idea of the relationship
2. Calculate the correlation coefficient
   1. Use Pearson's correlation coefficient if both
   variables are continuous
   2. Use Spearman's rank correlation coefficient if
   both variables are ordinal or one is ordinal
   and the other continuous.


Review: Scatter Plots and Association
Plot the 2 variables in a scatter plot (EXCEL)
The pattern of the dots in the plot indicates the
statistical relationship between the variables (the
strength and the direction)
Positive relationship pattern goes from lower left to
upper right.
Negative relationship pattern goes from upper left
to lower right.
The more the dots cluster around a straight line with
a positive or negative direction the stronger the linear
relationship.



Review: Correlation Coefficient

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$

The statistic r is called the Correlation Coefficient
r estimates the population correlation coefficient ρ
(the Greek letter rho)
The correlation coefficient provides a measure of the
linear association between two variables
r is always between -1 and 1


Review: Correlation Coefficient in Excel
Use the CORREL function to find the correlation
coefficient
If data for one variable are in cells A1:A12 and data for
the other variable are in cells B1:B12,
=CORREL(A1:A12,B1:B12) will return the Pearson
correlation coefficient.
Correlation coefficients closer to 1 or -1 indicate a
stronger linear relationship.
Correlation coefficients close to 0 indicate a weak linear
relationship.
However there could be a nonlinear relationship when
the correlation coefficient is close to 0.
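The last point can be checked with a small sketch. The data below are hypothetical (they are not from the lesson), and pearson_r is an illustrative helper that simply follows the formula for r from the previous slide. Y is completely determined by X, but the relationship is not linear, so r comes out as 0.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, following the formula for r."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: y = x squared, a perfect but nonlinear relationship
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r_curved = pearson_r(x, y)

# For comparison, a perfectly linear relationship on the same x values
r_linear = pearson_r(x, [2 * xi + 1 for xi in x])

print("r for y = x^2:   ", r_curved)
print("r for y = 2x + 1:", r_linear)
```

A scatter plot would reveal the curved pattern immediately, which is why the plot comes before the coefficient in the review steps.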



Simple Linear Regression
Like correlation analysis, linear regression analysis is a
technique that is used to explore the relationship
between two continuous random variables that have a
linear relationship.
Regression analysis allows us to investigate the change
in one variable that corresponds to a given change in the
other variable.
If only ONE variable is used to predict the value of the
other variable, the analysis is called simple linear
regression.
When two or more variables are used to predict the
value of the other variable, the analysis is called
multiple linear regression (not covered in this course).


Linear Regression: Background
Regression is from a Latin root meaning "going back"
Linear regression as a statistical method was first described by Sir
Francis Galton in his paper "Regression Towards Mediocrity in
Hereditary Stature", published in The Journal of the Anthropological
Institute, 1886
Galton described the relationship between mid-parent height
(mid-parent height = the average of the 2 parents' heights) and the height
of their offspring
Parents with taller mid-parent height had children with heights closer to the
average height
Parents with shorter mid-parent height had children with heights closer to the
average height
Galton called this phenomenon "regression towards mediocrity"


Sir Francis Galton: Regression

When mid-parents are taller than
mediocrity, their children tend to be
shorter than they
and
when mid-parents are shorter
than mediocrity, their children tend
to be taller than they


Variables in Simple Linear Regression Analysis
Dependent or response variable - a variable to be predicted
from or explained by the other variable
The response variable is typically labeled Y
Y is a continuous variable in simple linear regression
Independent or explanatory variable - the variable used to
predict the dependent variable.
This variable is typically labeled X
X can also be called the predictive variable or the
regressor variable
For simple linear regression X is a continuous variable
For multiple linear regression X can be continuous or categorical


Identifying independent and dependent variables
In regression analysis, it's important to correctly identify
the dependent (Y) and independent (X) variables.
The study description should provide you with
information about which is the dependent variable and
which is the independent variable.
If the study description states that the goal is to predict variable
1 from variable 2, then variable 1 is the dependent variable (Y)
and variable 2 is the independent variable (X).
Typically, if the variables are separated in time, the variable
collected first is the independent variable (X) and the variable
collected later is the dependent variable (Y).
In Galton's regression analysis, the mid-parent height was the
independent variable and the offspring height was the
dependent variable

Linear Regression Overview
Look at a scatter plot of the data
Plot Y on the y-axis and X on the x-axis
Does the relationship appear to be linear?
Estimate the regression line equation
Find the slope and intercept of the regression line
Check residuals
Is the relationship statistically significant?
Use a t-test of the slope to determine significance
How well does the estimated regression line equation fit the
data?
Calculate R² - the coefficient of determination
Use the estimated regression line equation to predict values
of the dependent variable (Y) for specified values of the
independent variable (X).

Simple Linear Regression: An Example
Is there a linear relationship between body weight and plasma
volume that can be used to predict plasma volume from weight?
Plasma volume is the dependent variable Y since we are
interested in predicting this from body weight, the independent
variable X.

Subject   Body Weight (kg)   Plasma Volume (l)
1         58.0               2.75
2         70.0               2.86
3         74.0               3.37
4         63.5               2.76
5         62.0               2.62
6         70.5               3.49
7         71.0               3.05
8         66.0               3.12

Scatter plot of the Data
There is a positive relationship between plasma volume and body
weight.
With this small number of data points it is difficult to see the linear
relationship but there is a general linear trend to the data
We want to identify a line that has a good fit to the data. This isn't
a deterministic relationship so the points won't fall perfectly on the
line.
[Figure: scatter plot of Plasma Volume (liters) against Body Weight (kg)]


Estimate the Regression Line Equation
A few of the many possible lines through the data points are
illustrated in the plot. How do we decide which line best fits the data?
[Figure: scatter plot of Plasma Volume (liters) against Body Weight (kg) with several candidate lines drawn through the points]


Least Squares Regression Line
The linear regression line is the line that gets
closest to all of the points. This is called the
least squares regression line.
The least squares regression line minimizes the
sum of the squares of the vertical distance
between each observed data point (yi) and the
line:

$$\text{minimize} \sum_{i=1}^{n} \left( y_i - \text{point on line}_i \right)^2$$
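The criterion can be illustrated with a short sketch on the plasma volume data. It uses the closed-form slope and intercept formulas that the lesson develops a few slides later; the two comparison lines are hypothetical candidates chosen only for illustration, and sum_of_squares is an illustrative helper name.

```python
# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

def sum_of_squares(intercept, slope):
    """Sum of squared vertical distances between each y_i and the line."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(weight, plasma))

# Least squares slope and intercept (formulas covered later in the lesson)
n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
     / sum((x - x_bar) ** 2 for x in weight))
a = y_bar - b * x_bar

sse_least_squares = sum_of_squares(a, b)
# Hypothetical candidate lines: a flat line at the mean of Y, and a steeper line
sse_flat = sum_of_squares(y_bar, 0.0)
sse_steep = sum_of_squares(-1.0, 0.06)

print("least squares line:", round(sse_least_squares, 4))
print("flat line at mean: ", round(sse_flat, 4))
print("steeper line:      ", round(sse_steep, 4))
```

Any other line that is tried in place of the two candidates should also give a sum of squares at least as large as the least squares line.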


Vertical distances between each observed Y (yi) and the line
are in red. The sum of these distances squared is minimized
by the least squares regression line
[Figure: scatter plot of Plasma Volume (L) against Body Weight (kg) with the regression line and the vertical distance from each point to the line]


Least Squares Regression Line Equation
The equation for a line requires a slope and an intercept
In regression analysis, we estimate the population
regression line with the least squares regression line
calculated from sample data: the sample regression line
The notation for the slope and intercept in the population
regression line are Greek letters
β₀ for the intercept
β₁ for the slope
The notation for the slope and intercept in the sample
regression line are Roman letters
a for the intercept
b for the slope

The Population Regression Line

β₀ is the y-intercept of the line
β₁ is the slope of the regression line
ε is the error term - the difference between
the observed Y and the regression line

$$Y = \beta_0 + \beta_1 X + \varepsilon$$


Sample Regression Line
β₀ and β₁ are population parameters

Sample estimates for the regression parameters are:

a is the estimate for β₀
b is the estimate for β₁

Ŷ = a + bX is the regression line calculated
from sample data
Ŷ is the predicted value of Y


Least Squares Regression Line
a and b are estimates of the regression coefficients β₀ and β₁
The regression coefficients are estimated from the sample data
by the least squares method
The intercept a is the estimated expected value of Y when X = 0
The slope b is the estimated expected change in Y
corresponding to a 1 unit increase in X
Ŷ is the expected (or predicted) value of y, the point on the
line. It is called the fitted value of y
The following slide illustrates the least squares regression line

The Equation of a Regression Line

$$\hat{Y} = a + bX$$

[Figure: regression line on x and y axes; the intercept a is the value of the line at x = 0 and the slope b is the change in Y for a one-unit change in X]

Interpretation of predicted values of Y
The predicted value of y is the expected y-value
Since not all observed data points are exactly on the
regression line, there is a range of possible y-values (a
distribution) for each x-value. In regression analysis the
distribution of y-values for each x-value is assumed to be a
normal distribution.
The predicted values of y represent the mean values of the
distributions of y for each specified value of x.
The following slide illustrates this for 3 values of X: notice
that the mean of each distribution is on the regression line
(the predicted value of y) and that the distributions
of y-values are normal distributions.


Simple Linear Regression Model Illustrated
[Figure: normal distributions of y-values at 3 values of X, each centered on the regression line]


Assumptions for Regression Analysis
There are several assumptions that should be met for
regression analysis:
For each value of X, the Y variable is assumed to have
a normal distribution; the mean of the normal
distribution is the predicted value, Ŷ
The normal distributions are assumed to have equal
variance across the entire range of X values. This
assumption is called homogeneity of variance or homoscedasticity.
The predicted values of Y fall on the regression line
representing the linear relationship between X and Y
The Y observations are assumed to be independent
The observations are from a random sample


Interpretation of the Slope of the Regression line
The slope b is the expected change in Y corresponding to a 1 unit
increase in X
b = 0: There is no linear association between Y and X
b > 0: There is a positive linear association between Y and X
(as X increases the expected value of Y increases)
b < 0: There is a negative linear association between Y and X
(as X increases the expected value of Y decreases)

The following slide illustrates a positive, negative and 0 slope.


Illustration of Negative, Positive slopes and slope = 0
[Figure: three lines on x and y axes, one with b > 0, one with b = 0 and one with b < 0]

Calculating the Slope of the Regression Line
The formula to calculate the slope of the least
squares regression line is given below

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Notice that the numerator is the same as the
numerator in the formula for the correlation coefficient.


b for plasma (Y) and body weight (X) example

        X        Y        (X - X̄)   (Y - Ȳ)   (X - X̄)(Y - Ȳ)   (X - X̄)²
        58.0     2.75     -8.9      -0.3       2.24             78.8
        70.0     2.86      3.1      -0.1      -0.45              9.8
        74.0     3.37      7.1       0.4       2.62             50.8
        63.5     2.76     -3.4      -0.2       0.82             11.4
        62.0     2.62     -4.9      -0.4       1.86             23.8
        70.5     3.49      3.6       0.5       1.77             13.1
        71.0     3.05      4.1       0.0       0.20             17.0
        66.0     3.12     -0.9       0.1      -0.10              0.8
Mean    66.875   3.0025
SUM                                            8.9575           205.375


Slope of regression line
From the previous slide the sum of (X - X̄)(Y - Ȳ) = 8.9575.
The sum of (X - X̄)² = 205.375

b = 8.9575 / 205.375 = 0.043615

Interpretation of the slope: For every one unit increase
in X, the expected increase in Y is 0.0436 units (rounded
to 4 decimal places)
Plasma volume increases 0.0436 liters for every one
kg increase in body weight.

The slope is positive indicating that as body weight (X)
increases, plasma volume (Y) also increases
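The hand calculation above can be reproduced in a few lines. This is a minimal sketch outside of Excel; the variable names (weight, plasma, sxy, sxx) are illustrative choices, not names used in the lesson.

```python
# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n   # mean of X
y_bar = sum(plasma) / n   # mean of Y

# Numerator: sum of (X - Xbar)(Y - Ybar)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
# Denominator: sum of (X - Xbar) squared
sxx = sum((x - x_bar) ** 2 for x in weight)

b = sxy / sxx  # least squares slope

print("means:", x_bar, y_bar)
print("sum of cross products:", round(sxy, 4))
print("sum of squares of X:  ", round(sxx, 4))
print("slope b:", round(b, 6))
```

The intermediate sums should match the SUM row of the table on the previous slide.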


Calculating the Intercept of the regression line
The intercept a of the regression line is the estimated
value of Y when X = 0
a is calculated from the average value of Y, the
average value of X and the estimated slope b by the
following formula:

$$a = \bar{Y} - b\bar{X}$$


Intercept for Plasma Volume Example

$$\bar{X} = 66.875, \quad \bar{Y} = 3.0025, \quad b = 0.043615$$
$$a = 3.0025 - 0.043615 \times 66.875 = 0.0857$$

The intercept is the estimated expected value of Y when
X = 0. Intercepts do not always have realistic interpretations.
In this example, plasma volume is predicted to be 0.0857 liters
when body weight = 0 kg, which is not a possibility.

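As a quick check of the arithmetic, the intercept can be computed from the two means and the slope. This sketch recomputes the slope from the data rather than using the rounded value, so the result agrees with the slide after rounding; the variable names are illustrative.

```python
# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n

# Slope from the least squares formula
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
     / sum((x - x_bar) ** 2 for x in weight))

# Intercept: a = Ybar - b * Xbar
a = y_bar - b * x_bar

print("slope b:    ", round(b, 6))
print("intercept a:", round(a, 4))
```

Because the line is built this way, it always passes through the point (X̄, Ȳ): substituting X = X̄ into a + bX gives Ȳ.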
Regression Line Equation
Once the slope and the intercept have been calculated
the regression equation can be constructed:

$$\hat{Y} = a + bX$$
$$\hat{Y} = 0.0857 + 0.0436X$$

This is the equation that will be used to predict plasma
volume (l) from body weight (kg).
The regression equation calculated from sample data is
an estimate of the true population regression equation.


Regression Line Equation and interpretation of the slope
A 1 unit increase in X for this data = 1 kg so the
interpretation of the slope in this regression line
equation is:
For each 1 kg increase in body weight, the expected
increase in plasma volume is 0.0436 liters.
What is the expected plasma volume increase for a 10
kg increase in body weight?
For a 10 kilogram increase in body weight, the
expected increase in plasma volume = 10*0.0436 =
0.436 liters.


What if the slope of the regression line is negative?
If the slope of the regression line is negative we
would expect a decrease in Y with each unit
increase in X.
The slope is a measure of the expected change
in Y for each 1-unit increase in X
If the slope is positive, the expected change
in Y is an increase
If the slope is negative, the expected change
in Y is a decrease.


Regression Coefficients in Excel
Excel has functions to calculate the slope and
the intercept of the least squares regression
line:
The SLOPE function returns b - the slope
=SLOPE(y-range, x-range)
The INTERCEPT function returns a - the
intercept
=INTERCEPT(y-range, x-range)
For both of these functions enter the y-range
of data first and then the x-range of the data.

Plasma Volume Example in Excel
The Lesson 15 Excel Module works through the
Plasma Volume / body weight regression
example:
Create a scatterplot of the data
Work through the calculations of the Slope and
Intercept of the regression line
Use the Excel Slope and Intercept functions
After you've worked through the calculations once,
use the Excel functions to find the slope and
intercept for future regression problems

Residuals
The residual is the difference between the
observed (Y) and the expected (Ŷ) value of Y
Residual = Y - Ŷ
Y is the observed Y for any X
Ŷ is the Y-value on the regression line for
that value of X
The residual is the component of Y that is not
predicted by X
The least squares regression line is the line that
minimizes the sum of the squared residuals


Residuals for Plasma Volume Example
Calculate Ŷ, the expected value of Y, using the regression line equation.
The residual is the difference between Y and Ŷ.

X      Y      Ŷ      Residual
58.0   2.75   2.62    0.13
70.0   2.86   3.14   -0.28
74.0   3.37   3.31    0.06
63.5   2.76   2.86   -0.10
62.0   2.62   2.79   -0.17
70.5   3.49   3.16    0.33
71.0   3.05   3.18   -0.13
66.0   3.12   2.96    0.16

Which point is closest to the regression line? (74, 3.37) has the smallest residual
Which point is furthest from the regression line? (70.5, 3.49) has the largest residual

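The residual table can be reproduced with a short sketch. It refits the line from the data and then picks out the points with the smallest and largest absolute residuals; the names fitted, residuals, closest and furthest are illustrative.

```python
# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
     / sum((x - x_bar) ** 2 for x in weight))
a = y_bar - b * x_bar

# Fitted value (point on the line) and residual (observed minus fitted)
fitted = [a + b * x for x in weight]
residuals = [y - y_hat for y, y_hat in zip(plasma, fitted)]

for x, y, y_hat, e in zip(weight, plasma, fitted, residuals):
    print(f"{x:5.1f}  {y:4.2f}  {y_hat:4.2f}  {e:6.2f}")

# Distance from the line is judged by the absolute value of the residual
points = list(zip(weight, plasma, residuals))
closest = min(points, key=lambda p: abs(p[2]))
furthest = max(points, key=lambda p: abs(p[2]))
print("closest to the line:   ", closest[:2])
print("furthest from the line:", furthest[:2])
```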
Regression Line and Residuals
[Figure: scatter plot of Plasma Volume (L) against Body Weight (kg) with the regression line; the points with the largest and smallest residuals are labeled]


Analysis of Residuals
A Residual plot is a plot of the residual values on the Y-axis
and the x-values on the X-axis
The correlation between X and the least squares residuals equals 0.
If the relationship between X and Y is linear, the residual plot will
be a random scatter of points with no evident pattern.
A nonlinear relationship between X and Y will be more
evident in the residual plot of the (X, residual) data than
in the scatterplot of the original (X, Y) data
The Excel Regression analysis tool has an option for
selecting the Residual plot. The Residual plot for the
plasma volume example is on the following slide.
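Two properties of least squares residuals explain why a residual plot is centered on 0 and shows no linear trend: the residuals sum to 0, and they are uncorrelated with X. The sketch below checks both numerically on the plasma volume data, up to floating-point rounding; variable names are illustrative.

```python
# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
     / sum((x - x_bar) ** 2 for x in weight))
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(weight, plasma)]

# Property 1: the residuals sum to zero
residual_sum = sum(residuals)

# Property 2: the cross products of (X - Xbar) with the residuals sum to
# zero, which is the numerator of the correlation between X and the residuals
cross_sum = sum((x - x_bar) * e for x, e in zip(weight, residuals))

print("sum of residuals:         ", residual_sum)
print("sum of (X - Xbar) * resid:", cross_sum)
```

Both printed values should be 0 apart from tiny rounding error. Curvature, not slope, is what a residual plot can reveal.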


Residual Plot for Plasma Volume / Body weight data
[Figure: body weight (kg) Residual Plot; residuals (axis from -0.4 to 0.4) plotted against body weight (kg)]
No evidence of nonlinearity. The points are equally distributed
around the value 0 with no evident positive or negative slope

(X, Y) Scatterplot for a nonlinear (or curvilinear) relationship
[Figure: scatter plot of a curvilinear relationship between X and Y]
When there is a curvilinear relationship between X and Y, the
least squares regression line does not represent the relationship


Residual Plot for Curvilinear Relationship
[Figure: X Residual Plot; residuals plotted against X]
This is the residual plot for the relationship on the previous slide.
It illustrates that the relationship is not linear. The residual plot points
aren't evenly distributed around the value 0.

Regression analysis for curvilinear relationships
Simple linear regression analysis should not be used
when X and Y have a curvilinear relationship
There are several strategies for dealing with a curvilinear
relationship between X and Y
One option is to try a logarithmic transformation of
the data to see if this improves the linear relationship
Another option is to use piecewise regression - fit
one regression line to the increasing portion of the
curve and a second regression line to the decreasing
portion of the curve
A third option is to include X² or X³ in the regression
equation (covered in PubH 6415 with multiple
regression models).
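The first strategy can be sketched with made-up data. The values below are hypothetical (an exact exponential curve, not data from the lesson), chosen so that the logarithm of Y is exactly linear in X; pearson_r is an illustrative helper following the correlation formula from the review slides.

```python
from math import exp, log, sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient, following the formula for r."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical curvilinear data: Y grows exponentially with X
x = list(range(1, 11))
y = [2.0 * exp(0.3 * xi) for xi in x]

# Logarithmic transformation of Y
log_y = [log(yi) for yi in y]

r_raw = pearson_r(x, y)      # linear association before the transformation
r_log = pearson_r(x, log_y)  # linear association after the transformation

print("r using Y:     ", round(r_raw, 4))
print("r using log(Y):", round(r_log, 4))
```

With real data the improvement is rarely this clean, so the scatter plot and residual plot should be checked again after transforming.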


Linear Regression Procedure
Look at a scatter plot of the data
Plot Y on the y-axis and X on the x-axis
Add the trend line to the plot
Estimate the regression line equation
Find the slope and intercept of the regression line
Check Residuals
Is the relationship between X and Y statistically significant?
Use a t-test of the slope to determine significance
How well does the estimated regression line equation fit the
data?
Calculate R² - the coefficient of determination
Use the estimated regression line equation to predict values
of the dependent variable (Y) for specified values of the
independent variable (X).

Is the relationship between X and Y significant?
If the slope of the regression line = 0, this indicates there
is no linear relationship between the variables. If there is
no linear relationship the variables are considered to be
independent
A t-test of the slope estimate can be done to test for
independence between the X and Y variables
Null hypothesis: slope = 0
The null hypothesis states that the variables are independent
Alternative hypothesis: slope ≠ 0
The alternative hypothesis is that there is a significant
relationship between the variables
If the t-test of the slope result is significant (p-value < α),
reject the null hypothesis and conclude that there is a
statistically significant relationship between the two
variables.

Notation for Population slope and Intercept
As in any hypothesis test, the null and alternative
hypotheses are stated about the population parameters,
not about the estimates.
The population parameters for the slope and intercept of
the regression line for the population are the Greek
letters β₁ and β₀
β₁ is the population parameter for the slope
β₀ is the population parameter for the intercept
The statistic for the t-test of the slope will use the
estimated value of the slope (b) that is calculated from
the data.


t-test of the Slope
1. State the Hypotheses
Null hypothesis: β₁ = 0
Alternative hypothesis: β₁ ≠ 0
2. A t-test will be used to test the hypothesis
3. Significance level α = 0.05
4. The degrees of freedom for a t-test of the slope are n-2
where n = sample size
The critical values of the t-test are found using
TINV(0.05, df). For the plasma volume example, n = 8 so
the critical values = TINV(0.05, 6) = 2.447 and -2.447

t-test of the slope
5. Calculate the test statistic - the slope estimate
divided by the standard error of the slope

$$t = \frac{b}{SE(b)}$$

The formula for the SE of the slope is complicated so
we will use the Excel Data Analysis Tool to do this t-test.
The Data Analysis Tool provides the t-statistic
and the p-value of the t-test of the slope
6. State the conclusion. If the test statistic is more
extreme than the critical values reject the null
hypothesis and conclude that there is a significant
relationship between the variables.

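For readers curious about what the Data Analysis Tool computes, here is a minimal sketch. It assumes the standard textbook formula for the standard error of the slope, SE(b) = sqrt(s² / Σ(x − x̄)²) with s² = (sum of squared residuals) / (n − 2); the lesson itself does not give this formula. The critical value 2.447 is the TINV(0.05, 6) value quoted in step 4.

```python
from math import sqrt

# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
sxx = sum((x - x_bar) ** 2 for x in weight)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma)) / sxx
a = y_bar - b * x_bar

# Residual variance: sum of squared residuals divided by n - 2
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weight, plasma))
s2 = sse / (n - 2)

# Standard error of the slope (standard formula, assumed here) and t statistic
se_b = sqrt(s2 / sxx)
t_stat = b / se_b

critical_value = 2.447  # TINV(0.05, 6), as quoted in the lesson
reject_null = abs(t_stat) > critical_value

print("SE(b):", round(se_b, 6))
print("t:    ", round(t_stat, 4))
print("reject the null hypothesis:", reject_null)
```

The p-value itself requires the t distribution, which is why the lesson relies on Excel (or any statistics package) for that step.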
T-test of the Slope in Excel
Data Analysis Tool output for the weight / plasma volume example:
The t-statistic and p-value for the t-test of the slope are in the Body weight row (t Stat and P-value columns)
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.759126577
R Square            0.576273159
Adjusted R Square   0.505652019
Standard Error      0.218809511
Observations        8

ANOVA
             df   SS            MS            F          Significance F
Regression   1    0.390684388   0.390684388   8.160066   0.028930913
Residual     6    0.287265612   0.047877602
Total        7    0.67795

              Coefficients   Standard Error   t Stat        P-value    Lower 95%      Upper 95%
Intercept     0.085724285    1.023998015      0.083715284   0.936006   -2.419910427   2.591358996
Body weight   0.043615338    0.015268361      2.856582911   0.028931    0.006254978   0.080975697

P-value for t-test = 0.029 so reject the null hypothesis and conclude that
there is a significant relationship between weight and plasma volume

Regression Analysis in Excel
In Excel Module 15 use the Data Analysis Tool to obtain
the Regression Analysis results
Select Regression under the Data Analysis Tool.
Enter the plasma volume data for Y-range and the
weight data for X-range
Check labels if you highlight the column headers
Also check Residuals and Residual Plot
Identify the t-statistic and the p-value for the t-test of
the slope.
Also identify the slope and the intercept on the output
table
These are under the Coefficients column
95% confidence intervals for the coefficients are also
provided if the Confidence Level box is checked

T-test of the Intercept
The Data Analysis Tool also provides results of a t-test of
the Intercept.
The Null hypothesis of this test is that the intercept = 0:
β₀ = 0
The Alternative hypothesis of this test is that the
intercept ≠ 0: β₀ ≠ 0
Usually there is not much interest in the t-test of the
intercept because testing whether the intercept = 0 does
not provide information about the relationship between
the two variables.
From the Regression Table, you can see that the null
hypothesis for the intercept = 0 is not rejected because
the p-value = 0.936. This result does not affect the
significant result of the t-test of the slope.

Linear Regression Procedure
Look at a scatter plot of the data
Plot Y on the y-axis and X on the x-axis
Add the trend line to the plot
Estimate the regression line equation
Find the slope and intercept of the regression line
Is the relationship statistically significant?
Use a t-test of the slope to determine significance
How well does the estimated regression line equation fit the
data?
Calculate R² - the coefficient of determination
Use the estimated regression line equation to predict values
of the dependent variable (Y) for specified values of the
independent variable (X).


How well does the regression line equation fit the data?
r² is the notation for the coefficient of
determination
r² is equal to the correlation coefficient (r)
squared. It can range from 0 to 1.
Interpretation of r²
r² is the proportion of variation in the dependent
variable (Y) that is explained by the estimated
least squares regression equation.
Larger values of r² indicate a better fit of the
regression line to the data which indicates a more
useful predictive model.

Calculating r²
In Excel, you can use the CORREL function to find
the correlation coefficient and square this value to
find the coefficient of determination
For the plasma / weight data, r = 0.759 so r² =
0.759² = 0.576
Or you can find r² on the Data Analysis Tool Output:
Regression Statistics
Multiple R          0.759126577   (Multiple R = the correlation coefficient)
R Square            0.576273159   (R square = coefficient of determination, r²)
Adjusted R Square   0.505652019
Standard Error      0.218809511
Observations        8
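The two routes to r² can be compared in a short sketch: squaring the correlation coefficient, and computing the proportion of variation explained as 1 − (residual sum of squares) / (total sum of squares). The sums of squares correspond to the Residual and Total rows of the ANOVA table in the Excel output; variable names are illustrative.

```python
from math import sqrt

# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in plasma)  # total sum of squares

# Route 1: square the correlation coefficient
r = sxy / sqrt(sxx * syy)
r2_from_r = r ** 2

# Route 2: proportion of the variation in Y explained by the line
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weight, plasma))
r2_from_ss = 1 - sse / syy

print("r:                ", round(r, 4))
print("r squared:        ", round(r2_from_r, 4))
print("1 - SSresid/SStot:", round(r2_from_ss, 4))
```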


Interpretation of r²
For the plasma volume example r² = 0.576.
Interpretation: 57.6% of the variation in plasma volume
is explained by the regression line equation with weight
as the explanatory variable.
Since only 57.6% of the variation in plasma volume is
explained by body weight, there are likely other
variables that explain some of the variation in plasma
volume.
Multiple regression analysis uses more than one
explanatory variable to predict the dependent variable
This is covered in PubH 6415
If there are other explanatory variables significantly
related to plasma volume in a multiple regression
model, r² will increase


Linear Regression Procedure
Look at a scatter plot of the data - we have done this
Plot Y on the y-axis and X on the x-axis
Does the relationship appear to be linear?
Estimate the regression line equation - we have done this
Find the slope and intercept of the regression line
Is the relationship statistically significant?
Use a t-test of the slope to determine significance
How well does the estimated regression line equation fit the
data? - we have done this
Calculate R² - the coefficient of determination
Use the estimated regression line equation to predict values
of the dependent variable (Y) for specified values of the
independent variable (X).


Using the Regression Line equation for Prediction
The regression line equation for the weight and
plasma volume data is: Ŷ = 0.0857 + 0.0436X
For a given value of weight (X), the plasma
volume (Y) can be predicted.
What is the expected plasma volume for an
individual who weighs 60 kg?
Insert 60 in the equation in place of X and
solve for Ŷ: Ŷ = 0.0857 + 0.0436 * 60 = 2.7 liters

Predicting plasma volume for weight = 60 kg
[Figure: regression line on the scatter plot of Plasma Volume (liters) against Body Weight (kg), with the predicted value 2.7 marked at x = 60]
The predicted plasma volume for weight = 60 kg is the point on the regression
line corresponding to x = 60. This point is 2.7 liters.

Appropriate Applications of the Regression Line Equation
Predictions using regression line equations are only valid
within the range of x-values in the collected data.
For the example data, the range of weight is from 58
to 74 kg.
It would not be appropriate to use this regression line
equation to predict plasma volume for an individual
weighing 100 kg or an individual weighing 25 kg.
There may be a different relationship between weight and
plasma volume beyond the values of the collected data so
the relationship identified by the regression line equation
should not be extrapolated much beyond the range of the X
values.
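One way to keep this caution in view is to build the range check into the prediction step. The sketch below uses the rounded coefficients from the lesson; the function name predict_plasma_volume and the choice to raise an error outside 58 to 74 kg are illustrative design decisions, stricter than the "not much beyond the range" wording above.

```python
# Rounded regression coefficients from the lesson
INTERCEPT = 0.0857  # a, liters
SLOPE = 0.0436      # b, liters per kg

# Range of body weights in the collected data
MIN_WEIGHT_KG = 58.0
MAX_WEIGHT_KG = 74.0

def predict_plasma_volume(weight_kg):
    """Predicted plasma volume (liters) for a body weight within the data range."""
    if not MIN_WEIGHT_KG <= weight_kg <= MAX_WEIGHT_KG:
        # Hypothetical policy: refuse to extrapolate outside the observed range
        raise ValueError(
            f"{weight_kg} kg is outside the observed range "
            f"{MIN_WEIGHT_KG} to {MAX_WEIGHT_KG} kg"
        )
    return INTERCEPT + SLOPE * weight_kg

predicted_60 = predict_plasma_volume(60)
print("predicted plasma volume at 60 kg:", round(predicted_60, 1), "liters")

try:
    predict_plasma_volume(100)
except ValueError as err:
    print("no prediction:", err)
```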


More cautions about application of Regression line predictions
Predictions using Regression line equations are only valid
for the population represented by the sample data.
For example, if data for a regression analysis are
collected for girls age 10 - 18, predictions using the
equation are not necessarily valid for boys, adults or girls
younger than 10.
You can't assume that the relationship between two
variables in one population is the same in other
populations.
Read the study description carefully to identify the
population that was sampled. Regression analysis
inferences are valid for this population but not
necessarily other populations.

What if there isn't a significant relationship between the variables?
If regression analysis reveals that there is NOT a
significant relationship between the two variables (that is,
if the p-value for the t-test of the slope > α) the
regression equation is not useful for predicting values of
the dependent variable from the independent variable.
If the t-test of the slope is NOT significant, end the
regression analysis procedure and do not use the
regression line equation for prediction.
Prediction using the regression line equation is only
useful if the null hypothesis of independence between
the variables is rejected.


Relationship between Correlation and Regression
The correlation coefficient and the slope of the
regression line are related. For a given set of
data:
They will both have the same sign indicating the
direction of the relationship (positive or negative).
There is a mathematical relationship between the
slope and the correlation coefficient: the slope of the
regression line is equal to the correlation coefficient
times the standard deviation of y divided by the
standard deviation of x:

$$b = \frac{r \, s_y}{s_x}$$

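This identity can be verified numerically on the plasma volume data. The sketch assumes the usual sample standard deviations (divisor n − 1); the variable names are illustrative.

```python
from math import sqrt

# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in plasma)

# Slope from the least squares formula
b = sxy / sxx

# Correlation coefficient and sample standard deviations (divisor n - 1)
r = sxy / sqrt(sxx * syy)
s_x = sqrt(sxx / (n - 1))
s_y = sqrt(syy / (n - 1))

b_from_r = r * s_y / s_x

print("slope from least squares:", round(b, 6))
print("r * s_y / s_x:           ", round(b_from_r, 6))
```

Since both standard deviations are positive, the identity also shows why b and r always share the same sign.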
Hypothesis Test of population correlation coefficient
We can set up a hypothesis test of independence for the
population correlation:
Null Hypothesis: ρ = 0
no significant linear association between the variables
Alternative Hypothesis: ρ ≠ 0
significant linear association between the variables
The test statistic is a t-statistic with n-2 df

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

After finding the t-statistic, you can use EXCEL to find the
p-value = TDIST(t, n-2, 2)

T-test of the correlation coefficient
For a given sample data, the t-test for ρ and the t-test for
the slope, β₁, will have the same t-statistic and p-value.
For the plasma volume data, the t-statistic for the test of
the population correlation coefficient = 2.85658 which is
the same as the t-statistic for the slope of the regression
line
You can work through the equation in EXCEL to
confirm this
P-value = TDIST(2.85658, 6, 2) = 0.02893
The same conclusion is reached from either hypothesis
test: there is a significant relationship between the two
variables
The p-value < 0.05 so the null hypothesis of
independence is rejected at significance level 0.05

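The equality of the two t-statistics can be confirmed outside of Excel with a short sketch. The correlation version uses the formula from the previous slide; the slope version assumes the standard formula for the standard error of the slope, which the lesson leaves to Excel. Variable names are illustrative.

```python
from math import sqrt

# Plasma volume example data from the lesson
weight = [58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0]  # X, kg
plasma = [2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12]  # Y, liters

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(plasma) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, plasma))
sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in plasma)

# t-statistic for the correlation coefficient: r * sqrt(n - 2) / sqrt(1 - r^2)
r = sxy / sqrt(sxx * syy)
t_correlation = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# t-statistic for the slope: b / SE(b), with SE(b) = sqrt(s^2 / Sxx) (assumed)
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weight, plasma))
t_slope = b / sqrt((sse / (n - 2)) / sxx)

print("t for the correlation coefficient:", round(t_correlation, 5))
print("t for the slope:                  ", round(t_slope, 5))
```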
Linear Regression and Correlation: which to use?
Both Linear Regression and Correlation Analysis can be
used to explore the linear relationship between two
continuous (quantitative) random variables
Use Correlation analysis when the interest is primarily
in identifying whether a relationship exists.
Use the t-test of the correlation coefficient to determine if
the relationship is significant.
Use Regression Analysis to identify a relationship AND
to predict the value of one variable given a value of
the other variable.
Use the t-test of the slope to determine if the relationship is
significant
Regression analysis is most useful when there is an identified
interest in predicting one variable from the other(s). If
prediction doesn't make sense, use correlation analysis.

Readings and Assignments
Reading
Chapter 8 pgs. 192-194, 202-212
Complete the Lesson 15 Practice Exercises
Lesson 15 Excel Modules
Excel Module 15: Plasma Volume works
through the example in this Lesson
Excel Module 15: BMI works through the
example in the text (pages 205-206, 208-209)
Complete OPTIONAL Homework 11: Use the
Data Analysis Tool for the Linear Regression
problems