
Simple Linear Regression

Simple regression: 1 predictor in a model

You can add extra predictors later. But the emphasis now is on how regression extends what we can get
from the correlation, the assumptions, etc.

We have 3 bits:
Descriptive
Correlation
Regression

Last week's research question: carers of people on dialysis, how stressed they felt, and the relation
to age.
i.e. can we predict their stress from their age?
Correlation - does stress have anything to do with their age (co-variance)

Now we will answer the last part: can we predict the carer's stress from their age?

From last week, age and distress co-vary (vary together).

Conduct correlation/regression analysis


Regression and Check assumptions

We have an F-test/t-test. If you square the t you will get your F value.

When testing regression, SPSS gives you an F.

If our overall model is significant, we can:


Write and interpret the 'regression equation'
We can comment on and interpret R2% (R2 is just the proportion of variance in the DV that is
accounted for by the predictors in the model.)
o Every variable has variance associated with it
o i.e. The scores vary
o How much of that varying can be accounted for by the predictors in your model is your
R2 . We will see what that means and where that comes from
Check the 4 assumptions underlying our model
o Some assumptions cannot be commented on until after looking at the numbers and graphs
that come out of the analysis

We need to do this whenever you do a regression!


1. Run your analysis and see if your overall F for that model is significant
2. If yes, you can interpret regression equation
3. Talk about R2
4. Check assumptions

If assumptions are violated and you have run the wrong test, there are ways to deal with that.

Important sentence! It is a template for regression


"For a one unit [ideally you need to be more specific; for example: for a one year increase.]
increase in Age the predicted carer distress will change by [?] units"

This is the last part of the research question


We hope to use Age to improve our best estimate of carer distress

We know from the scatterplot and value of our correlation that this will be a negative change
i.e. as age increases, the carer's distress decreases
As carers get older they have less to be stressed about
But 'by how much'? We need to do simple linear regression to give us that value. Correlation can't
give us this value.

Linear because we assume a constant change. Every one unit increase in age will be associated with
the same change in distress
Constant change: as you move from category 1 of age to category 2, distress changes by
however much (x units). Then from category 2 to 3 it also changes by the same x units
(constant change). Then from category 3 to 4 will also change by x units.
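To see why the change is constant, compare the predictions at two adjacent ages (a sketch using the general equation, not specific numbers):
Predicted distress at age x: a + b(x)
Predicted distress at age x + 1: a + b(x + 1)
Difference: [a + b(x + 1)] - [a + b(x)] = b
The difference is always b, whatever value of x you start from, which is exactly what 'constant change' means.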

How to do regression SPSS

Analyse -> Regression -> Linear

Dependent variable + independent variable in respective boxes

Syntax

REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT shtot
/METHOD=ENTER age.

What does SPSS do with this? SPSS is doing something called 'method of least squares' regression.
What does that mean?
When we had a scatterplot, we had a line of best fit
That line is actually our regression line
What it does is try to fit the line as close to all the dots as possible.
It will take a particular dot on the graph, look at the distance from that dot to the line of best
fit and then square that value
It does this for every single dot on the scatterplot
It will then add them all up (sums of squares) in a regression just like in ANOVA
The method of least squares -> trying to make that number as small as possible
o Trying to make all the errors (when someone's dot doesn't fall on the line) as small as
possible
o That is our method of 'least squares'
It puts the line wherever it makes the errors, overall, as small as possible
You don't have to do this by hand
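For reference, the line that minimises the sum of squared errors has standard closed-form formulas (you never compute these by hand in this unit, they are just what SPSS is doing internally):
SSresidual = sum of (observed y - predicted y)^2, minimised over a and b
b = sum of (x - mean of x)(y - mean of y) / sum of (x - mean of x)^2
a = mean of y - b(mean of x)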

Once we do that, we will get an output which will tell us what the equation for the regression is.
That equation is just a straight line on the graph, placed as close as possible to all the dots (the
observations).
Equation of the Straight line
This is the output we get and we use it to write out the equation.

These coefficients come from our sample data. The bottom row tests whether the effect of Age is significant. The top
row is your y-intercept (is that point significantly different from 0?).

The regression equation is in the form:


Predicted distress = y-intercept + slope(Age)
DV = where it hits the y axis + slope(age)
= a + b(Age)
= 38.25 - 2.62(Age)
As age goes up, stress goes down

That is why it is better to have age as a continuous variable rather than a category.

'A one unit increase in age is associated with a drop of 2.62 units in predicted distress'

What the equation is telling us is:


Age on x-axis
Distress on y-axis

Regression is a straight line. The only thing is that we will eventually get more predictors.

Your predicted values can be negative. Having said that, if we extrapolate our line, predicted distress will
eventually be negative and that is not meaningful/possible in our current situation. It depends on
what you are measuring.
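For example (purely to illustrate the extrapolation point, in whatever units age is coded here): setting 38.25 - 2.62(Age) = 0 gives Age = 38.25 / 2.62 ≈ 14.6, so beyond about 14.6 age units the line predicts negative distress. This is why we shouldn't extrapolate beyond the range of the data.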
Is the regression significant?
We want to know:
1. Is the regression overall significant
2. Is each predictor significant

In the case of a single predictor the above 2 questions will have the same answer. But when you
have a bunch of predictors later on, you can have an overall significant result where, say, 3 of your
predictors are significant and 5 of them are not.

This is testing the null hypothesis that the population slope is really zero.
Just like in a t-test, where the null hypothesis is that the difference in means is zero
That is tested by the F in the following table

F(1,59) = 23.384; p<0.0005
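Two things follow from this F (simple arithmetic on the reported values): with a single predictor, t squared = F, so the t for Age is √23.384 ≈ 4.84; and the residual df is n - 2, so 59 = n - 2 means n = 61 carers were in the analysis.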

Because the regression equation is significant it makes sense to use age to predict distress. So we
write and interpret the regression equation. If it wasn't significant, there would be no evidence of an
association between age and distress (the population slope could be zero), which means there is no point
in looking at this research question any further.

We now have a better estimate of carer distress than the overall sample mean. Instead of reporting
to the RNS administrator that the estimate of a carer's distress is the sample mean we would now
ask for the carer's age.

Comment and interpret R2%


How much variation has been explained? - that is our R2

R = the multiple correlation between the DV and the set of predictors in your model

R2 = .284
That is how much variance is explained
It gives us an idea of how well the regression line fits our data points
How?

It gives us the explained variation (how much of the variation in our variable we can explain) divided
by total variation that is there
With every variable there is some variation
We want to know of that variation, how much of it can we actually explain
R2 tells us precisely that

R2 = SSregression / SStotal
= .284
= 28.4%
SSregression is how much we can explain from our model
SSresidual is how much of the variation we CAN'T explain -> in ANOVA it is SSerror
o What we can't predict

Therefore 28.4% of the variation in carer distress is explained by knowledge about age. Note also
that 71.6% is left unexplained.
Don't ask at what point the amount of variation explained counts as a 'good' amount, because that is
completely your call

So although we have improved our best guess we've only explained a modest amount of variation.

Later in the course we will learn how to add more IVs to improve our explanation

How appropriate was it?


How appropriate was it to do a regression analysis? We determine this by looking at the
assumptions of regression and if they have been met.

Some assumptions can be considered ahead of time; others can only be checked after getting the data.

There are 4 assumptions - analysis of residuals


Think back to assumptions for ANOVA: normal distribution
When talking about regression, it's not the actual DV that needs to be normally distributed. It is the
residuals (errors) that need to meet normality.
Unless your R2 is 100%, you cannot predict perfectly, so there is some variation that
you cannot explain.

The assumptions are:


1. Linearity - was it appropriate to fit a straight line to the data points? This assumption is not an
assumption of ANOVA.
2. Normality of residuals - can we interpret inference (p-value)?
3. Constant variance of residuals - was least squares a good method to use for this data set?
4. Independence of observations - basic requirement checked by design. This needs to be
established based on the design and is not something that comes out of your analysis.

1-3 can only be determined after you get your data output. Number 4 is determined beforehand.

How do we look at these assumptions

We get SPSS to create plots for us.

Point and click:


Analyse - regression - linear - plots

The predicted score is what you get when substituting a number into your line equation.
ZPRED is just all your predicted scores standardised.
There is a difference between the predicted score and the actual score, which is the residual. Every
person has a residual associated with them. It is the difference between what we predict and what they
actually got. ZRESID is everyone's residual (error) score turned into a Z-score
We are going to plot ZPRED and ZRESID on the scatterplot
x-axis: ZPRED
y-axis: ZRESID
Normal probability plot

Syntax:
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT shtot
/METHOD=ENTER age
/SCATTERPLOT=(*ZRESID, *ZPRED)
/RESIDUALS NORM(ZRESID) -> that gives us the P-P plot

The assumptions are all about the residuals, and we have now got SPSS to create plots of the residuals.

Checking assumption of normality

We look at the P-P plot

Each dot is supposed to fall exactly on this line if everything is normal.

Deviations off the line mean that part is not normally distributed.
You can do a Shapiro-Wilk test to get a forensic investigation of normality. The null hypothesis is that
the data are normally distributed. If it comes out significant then we are rejecting the null hypothesis and
concluding that they are not normal.
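A sketch of how you could get Shapiro-Wilk on the residuals in SPSS (assuming the variable names from the earlier syntax; ZRE_1 is the name SPSS usually gives the saved standardised residuals):
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT shtot
/METHOD=ENTER age
/SAVE ZRESID.
* ZRE_1 below is the saved standardised residual created by /SAVE ZRESID.
EXAMINE VARIABLES=ZRE_1
/PLOT NPPLOT.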

The P-P plot looks at the same thing as Shapiro-Wilk but is not a forensic investigation of normality,
simply a graphical representation.

The assumption is satisfied if there is a straight line relationship between the observed and expected
values. Here, we would conclude that the assumption of normality is satisfied.

One that wouldn't satisfy the assumption would show a systematic pattern of deviations from the
normal line; normality would then be violated and it would not have been appropriate to run the regression.

Testing the assumption of linearity and constant variance

We assess these 2 assumptions with the same scatterplot

Testing linearity
What we are looking for is an equal number of residuals above and below 0 [at each point] as
you move across the graph

If you just want an approximate idea, you don't have to physically count them


One that doesn't satisfy the assumption would look like this:

Testing for constant variance


Want to know that there is no fanning out of residuals but have an even band

Make sure the variance is fairly constant as you go along, between the 2 bands
This would not be ok:
o You don't want people with low scores to be systematically different from people with
higher scores
o In this one we are better at predicting people with low scores than people with higher
scores, as there seem to be larger errors which cannot be explained
o We want errors to be the same all along the predicted values

So our assumptions are satisfied. The assumption of independence is satisfied because we are sure
that no carer has been measured twice and we are confident the distress score for one carer doesn't
influence the distress score for another carer.

So we now have an answer for the final part of our analysis/research question.

Concluding statement
Does knowledge about age help predict level of carer distress (regression)? Yes it does.

More details to answer this part of the research question:


For every one unit increase in age, predicted carer distress goes down by 2.62 units
Slope is significantly different from zero. H0: beta=0 [F(1,59) = 23.38; p<0.0005]
28.4% of variation in carer distress is predicted by age, which is a moderate amount
o Remember there are no specific cut-offs for small, medium, large etc.
We are happy to say this because the 4 assumptions have been met.

This is a good outcome for research.

Another example
These are the steps you need to take when doing statistical analysis. Very important for the research
project assignment

First
Understand what you are trying to test

Second
Understand your sampling population - are they representative etc.

Third
Understand your data - e.g. with age, are we measuring it continuously or categorically, and what
implications would there be etc.

Fourth
Describe your sample data using appropriate descriptive statistics
o Univariate: location of where your data lies (e.g. mean), dispersion (SD), shape
o Bivariate: correlation (relationship between 2 variables)

Fifth
Describe your sample data using appropriate graphs
o Univariate: histograms, boxplots
o Bivariate: scatterplots

Sixth -> part involved with running regression (hard part)


Fit appropriate models
o Estimate effect sizes
o Test hypotheses

Seventh
Test assumptions from the model you have just run in step 6

Eighth
Interpret with respect to step 1
New Example

Is there a relationship between the final grade in a unit and the number of PAL sessions attended

Final grade: DV

PAL sessions attended: predictor

These are both quantitative variables

RQ: Does knowledge about the number of PAL sessions attended help predict STATS SNG? In
particular, do more PAL sessions predict a higher STATS SNG?

Number of PAL sessions attended is not the only thing that predicts final score. Therefore there will be some
errors in prediction which will be reflected in the DV.

Investigate (fit) the bivariate relationship

Using sample data, one can estimate the linear model relating Y and X. The data provide estimates
for the unknown parameters alpha (y-intercept/constant) and beta (slope).

Simple linear regression inherently assumes a linear relationship between the IV and DV.

The first step is to look at the data. A scatterplot of the data reveals whether a model with a straight
line trend makes sense.
If there is no linear relationship, there is no point in trying to fit a line of best fit to it or linear
regression to begin with
Interpretation

Specify a linear model

Y = constant + beta(X)

The formula provides a simple approximation for the true relationship between X and Y. The better
the model, the closer the observed Y is to Y-hat, and the smaller the error.

73.266 is alpha (constant)

For regressions with 1 predictor, the standardised coefficient is also your correlation. As soon as you
have more than 1 predictor, that rule does not hold.

Usually you want the unstandardized coefficients.

Linear Regression model in action


Yi is the individual's score. a and b are sample estimates of the population values alpha and beta.
The error occurs because the linear relationship is not exactly correct
i.e. a is not exactly alpha and b is not exactly beta

Before running the regression, the mean is the best we can do.

Fit the prediction equation (line of best fit)

This is called the prediction equation.


This is derived when we substitute a particular value of x into the formula a + b(X)

Interpretation of model

Based on the regression table, SNG on average increased by 0.545 points for every extra PAL session
attended

Intercept:
If no PAL sessions attended the expected SNG in STATS would be 73
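Purely for illustration, a student who attended, say, 10 PAL sessions would have a predicted SNG of 73.266 + 0.545(10) = 78.72.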

What happens if, instead of the number of PAL sessions attended, our predictor is second year STATS
performance?

Positive correlation - do well in second year and well in third year


For every extra mark you got in second year, that predicts about half a mark in the third year
Intercept: 36.76
o So a person who scores 0 in the second year would be predicted to score about 36.76 in the third year
o That is a problem because if you got 0 in second year you can't even do third year
o So the intercept in this model does not make sense
o We can play with the data to make it make sense -> centring

Centering IV

What you are doing is rescaling your predictor so that 0 is now a meaningful score. If you centre it, 0 is
going to be the mean.

You subtract the mean score from each person's individual score.

The centred scored equals the individual score minus the mean

i.e. their new score is their old score minus the mean

As it is, the average SNG of that year was 68.5


Apply this formula:

On the new score the mean is now 0:

Note the SD should not change. If it does that means you haven't done the second line correctly.
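A sketch of the centering step in SPSS syntax (the variable name sng2 for the second-year score is an assumption for illustration; the DESCRIPTIVES command lets you confirm the new mean is 0 and the SD is unchanged):
* sng2 = second-year score (assumed variable name); 68.5 is the sample mean quoted above.
COMPUTE sng2_c = sng2 - 68.5.
EXECUTE.
DESCRIPTIVES VARIABLES=sng2 sng2_c
/STATISTICS=MEAN STDDEV.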

From there, we can do the regression again:

Instead of the y-intercept being 36.76, it is now 74.77 (with a centred predictor the intercept is the predicted score at the mean of the predictor)


The b value is still the same which is what we want

Centering can be a very powerful tool. You are not changing the relationship between the variables
at all. You are making the regression equation a bit more meaningful.

We wouldn't have to make zero the mean. We could instead subtract whatever the lowest score was from
the individual scores.

Again, why does the SD not change after centering? Because subtracting a constant shifts every score by the same amount, so the spread around the mean is unchanged.


Conditional Variance
Remember the assumptions and constant variance (don't want fanning out data). That is related to
the conditional variance.
We know we are not predicting exactly.
As long as there is variance, each predicted point has a spread of values associated with it
There is some error involved - variability. Once there is variability, there will be a distribution
(spread) of that variability, and that is called the conditional variance.

Regression Theory

Somebody scored 70 in second year. There is a predicted score associated with that from our
formula. That is also the expected average score of someone who scored 70 in second year.
If the predicted score is a predicted average, that implies a distribution of values around it.
If that predicted point is the average score, that means it will have an SD (variability)
associated with it.
We know this because once there is an average, there will also be a spread of variability around it

If the predicted score is the average, what is the SD?


We will now answer how to get that standard deviation in SPSS

Analyse - regression - linear


Sigma here stands for predicted SD.

What this means is that 95% of the observations for that predicted value are going to fall between 48.23 and 98.11.

Link this back to assumption of constant variance (we don't want that fanning out of data), we are
expecting those errors in prediction will be constant along the prediction line. We are assuming that
the SD associated with the expected values in y will be 12.47 all the way along the line.

Assumption of constant variance, we are assuming the variability will be the same all along the line.
We have calculated that SD to be 12.47.
Main thing to remember:
Our assumption of constant variance means that the variability around predicted line is the
same all around the values of x
We can figure out what that variability is - take the square root of the mean square error
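As a worked check on the numbers above (the MSE value is inferred from the reported SD): if MSE ≈ 155.5 then √155.5 ≈ 12.47, and for a predicted score of about 73.2 the 95% range is roughly 73.17 ± 2(12.47) ≈ 48.2 to 98.1, matching the interval quoted earlier.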

Another example

Each curve here has a predicted SD of 13.

At x = 12, E(y) = 30; that is the centre of your conditional distribution, and it has an SD associated with it

Each conditional distribution is assumed normal.


Conditional vs Marginal SD
Marginal SD surrounds the mean. If our model is DV score = mean, we get a flat line:

Variation around the mean is usually greater than the variation around the expected value that you
get from regression because the whole point of doing regression is to get better predictions. So
there should be more errors when just looking at the mean.
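One way to link the two (an approximation, not from the slides): because SSresidual = (1 - r2) x SStotal, the conditional SD is roughly the marginal SD x √(1 - r2), so the stronger the correlation, the smaller the spread around the regression line.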

Sy is the sample SD of the marginal distribution (mean distribution) of y


Sigmay is the population SD of the marginal distribution of y

S-hat is the sample SD of the conditional distribution of y, for a fixed x

Sigma is the population SD of the conditional distribution of y, for a fixed x

Regression assumptions
Independent observations

Normal distribution of residuals

Linear relationship between IV and DV

Constant variance

Multiple regression in non-experimental research


There are research questions which cannot be investigated on an experimental basis and therefore require
multiple regression.

If we want to get a better prediction of third year we could add in another predictor.
If we want to see if PAL has an effect on third year SNG above and beyond their second year score,
we enter them both together into the regression. The results we get in the second table tell us the
effect of PAL above and beyond second year scores, and the converse.
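A sketch of what that syntax might look like (the variable names sng3, sng2 and pal are assumptions for illustration, standing for third-year grade, second-year grade and PAL sessions attended):
* sng3, sng2, pal are assumed variable names.
REGRESSION
/STATISTICS COEFF OUTS R ANOVA
/DEPENDENT sng3
/METHOD=ENTER pal sng2.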

The first table tells us whether the model is overall significant.

Multiple regression basically looks at multiple factors that together predict our DV
