
SPSS for linear regression

R-square is the coefficient of determination; it tells us the
proportion of the variation in the dependent variable (Y) that
is explained (or determined) by variation in the x-variables
in the model.

Standard error of estimate (se) is the estimate of the
standard deviation of the error terms (ε) in the model.

Sum of squares for Regression is the variation in the
dependent variable that is explained by the variation in the
x-variables in the model.

Sum of squares for Residual is the variation in Y that is not
explained by the variation in the x-variables.

Sum of squares for Total is the total variation in Y.

df stands for degrees of freedom. The total number of degrees
of freedom is equal to the number of observations (n). We use
one degree of freedom to estimate the regression coefficient
(β) for each x-variable and one degree of freedom to estimate
the intercept (α) in the model. The degrees of freedom for the
residuals are the df that are left when we have estimated all
the regression coefficients. If the number of x-variables in the
model is k, the number of df for the residuals is n − k − 1
(for example, with n = 50 and k = 2, the residual df is
50 − 2 − 1 = 47).

Mean square is the sum of squares divided by its df. The
mean sum of squares of the residuals (s²) is the estimate of the
variance of the error terms (σ²), and the mean sum of squares
of the total (sy²) is the estimate of the variance of Y (σy²).
F is the ratio between the mean sum of squares for regression
and the mean sum of squares for residuals. This is a test
function used to test the null hypothesis H0: all β's are equal to
zero, against the alternative hypothesis H1: at least one β is
not zero. This is called the overall test.

sig is the p-value for the test function for the test above. The
p-value is the probability of obtaining, under random sampling,
a value of the test function at least as high as the observed
one, if the null hypothesis is true.
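
The quantities in the ANOVA table can also be reproduced outside
SPSS. Here is a minimal sketch in Python with statsmodels; the data
are simulated and the true coefficients are made up for illustration:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=1)
    n, k = 50, 2                       # 50 observations, 2 x-variables
    X = rng.normal(size=(n, k))
    y = 1.0 + 2.0*X[:, 0] - 0.5*X[:, 1] + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.rsquared)                # R-square
    print(fit.ess, fit.ssr)            # SS Regression, SS Residual
    print(fit.df_model, fit.df_resid)  # k = 2 and n - k - 1 = 47
    print(fit.fvalue, fit.f_pvalue)    # overall F test and its p-value (sig)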

Coefficients
Unstandardized (B) coefficients are the estimates of the
regression coefficients, and Standardized coefficients are
the estimates the regression coefficients would have if the x-
and y-variables were standardized. A standardized variable is
the variable minus its mean value, divided by its standard
deviation; for instance, the standardized y-variable is
[y − mean(y)]/std(y). In a simple (only one x-variable) linear
regression the standardized coefficient for the variable is the
observed correlation between x and y.
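
This equality is easy to verify numerically. A small sketch in
Python (simulated data; the seed and coefficient are arbitrary):

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import zscore

    rng = np.random.default_rng(seed=2)
    x = rng.normal(size=100)
    y = 3.0*x + rng.normal(size=100)

    # Standardize both variables, then fit the simple regression
    beta_std = sm.OLS(zscore(y), sm.add_constant(zscore(x))).fit().params[1]
    print(beta_std, np.corrcoef(x, y)[0, 1])   # the two values coincide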

Standard Error of the estimates is the estimated uncertainty in
the estimates of the regression coefficients.

t is the ratio between the unstandardized coefficient and the
standard error and is used to test the hypothesis H0: β = 0,
against H1: β ≠ 0.
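
The ratio can be checked directly; a minimal sketch on simulated
data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=3)
    X = sm.add_constant(rng.normal(size=(40, 1)))
    y = X @ np.array([1.0, 0.8]) + rng.normal(size=40)

    fit = sm.OLS(y, X).fit()
    print(np.allclose(fit.tvalues, fit.params / fit.bse))  # True: t = B / SE(B)
    print(fit.pvalues)                                     # p-values for H0: beta = 0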
MODEL CONTROL
Residual analysis for checking the model assumptions (a plotting
sketch follows the list):
- Histogram of the standardized residuals, to detect a
deviation from a symmetric distribution.
- Scatter plot of the standardized residuals against the
predicted values and/or the x-variables, to detect
assumption violations such as the variance increasing
with the x-values or the expected value of the residual
depending on the x-value.
- Scatter plot of the standardized residuals against the case
number, to find which case a large standardized residual
belongs to. This plot is also important if the variables
in the model are time series, to detect correlations between
residuals.
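
A minimal sketch of these three plots in Python with matplotlib,
assuming a fitted OLS model as above (simulated data):

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=4)
    x = rng.normal(size=80)
    y = 1.0 + 0.5*x + rng.normal(size=80)
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    z = (fit.resid - fit.resid.mean()) / fit.resid.std()  # standardized residuals

    fig, ax = plt.subplots(1, 3, figsize=(12, 3))
    ax[0].hist(z, bins=15)                      # symmetry check
    ax[1].scatter(fit.fittedvalues, z)          # variance / expected-value patterns
    ax[2].scatter(np.arange(len(z)), z)         # case number: outliers, serial correlation
    plt.show()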

Which model is the best model?

- R-square as high as possible
- R-square adjusted as high as possible
- Standard error of estimate as low as possible
- Hypothesis testing: are the regression coefficients zero or not?
- Which variable is the most important according to theory, and
which one has the highest standardized coefficient?
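
As a sketch of such a comparison, one can fit two candidate models
and compare the criteria; the data below are simulated and x2 is
deliberately irrelevant:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(seed=5)
    n = 60
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 1.0 + 2.0*x1 + rng.normal(size=n)       # x2 does not enter the true model

    m1 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    for m in (m1, m2):
        # adjusted R-square and standard error of estimate
        print(m.rsquared_adj, np.sqrt(m.mse_resid))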

We have used the method Enter to estimate the regression
model, but you may use the method Backward or Forward for an
automatic choice of the best model.

Note! This choice is only based on F-values and not on
residual analysis. An F-distribution is based on the
assumption of normally distributed error terms. If this
assumption is violated you cannot trust the result.
RELATION BETWEEN TWO CATEGORICAL VARIABLES

To analyse the relation between two categorical variables we
cannot use the Pearson correlation or the linear regression model.
The possible outcomes of a categorical variable are often very
few, and there is no ranking order of the variable values.

Cross tables and χ²-tests are used to describe the relation
and to test whether two categorical variables are related (they
are also used for ordinal data).

Example: For a sample of 500 persons, sex and attitudes
towards mobbing were registered. The questions were
formulated in such a way that the answer could be yes or
no.
Table 1. Attitudes toward mobbing among males and females

  Sex       Yes     No   Count
  Male       50    150     200
  Female     50    250     300
  Count     100    400     500
Is there a significant difference between males and females in
attitudes toward mobbing?

H0: the two variables are independent (no relation)
H1: the two variables are dependent (related)
Significance level: 0.05

Test function: χ² = Σ (O − E)² / E

The test function is chi-square distributed with one df if the
null hypothesis is true.

Critical region: χ²_obs > χ²_0.05(1) = 3.841

The observed value of the test function is calculated by first
calculating the expected value (Ei) for each cell. E is the
number of observations that we could expect if the variables
were independent. The proportion of males in the sample was
200/500 = 40%, which means that we could expect 40% of the
yes answers to come from the males, i.e. E1 = 0.40 × 100 = 40.
The expected counts are then

Table 2. Expected counts

  Sex       Yes     No   Count
  Male       40    160     200
  Female     60    240     300
  Count     100    400     500

The observed value of the test function is

χ²_obs = (50 − 40)²/40 + (50 − 60)²/60 + (150 − 160)²/160 + (250 − 240)²/240
       = 2.50 + 1.67 + 0.63 + 0.42 ≈ 5.21

The observed value is higher than the critical value. Reject the
null hypothesis, i.e. there is a relation between sex and
attitudes.

The computer calculates the observed value of the test
function and the p-value. In this case the p-value is lower than
0.05; it lies between 0.01 and 0.025.
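
The same computation in Python with scipy reproduces both the
observed χ² value and the expected counts of Table 2; passing
correction=False turns off the Yates continuity correction so the
result matches the hand calculation above:

    from scipy.stats import chi2_contingency

    observed = [[50, 150],    # males:   yes, no
                [50, 250]]    # females: yes, no

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(chi2, p, dof)   # about 5.21, p about 0.022, df = 1
    print(expected)       # the expected counts of Table 2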
Here the dependent variable is attitude (Y) and the
independent variable is sex (X). If there is more than one
independent variable you need a model to test the relations
between the Y-variable and the x-variables.

Logit model
A common model for describing the relation between a
dichotomous categorical Y-variable and X-variables
(categorical or other variables) is the logit model (logistic
regression).

Instead of a model for the Y-values, this is a model for the
probability that Y is equal to 1.

If the probability that Y is equal to 1 is the same in the two
groups, then β1 is equal to zero. If, in the reference group
(X1 = 0), the probability that Y is equal to 1 is the same as
the probability that Y is equal to 0, then α is equal to zero.

Model: P(Y = 1 | X1) = e^(α + β1X1) / (1 + e^(α + β1X1))

Or, in a simpler way:

Logit P = α + β1X1

Y = 1 if yes and Y = 0 if no
X1 = 1 if male and X1 = 0 if female
For this model it is easy to include a new variable, for
instance if we want to test the hypothesis that age (X2)
affects the attitudes.

Logit P = α + β1X1 + β2X2
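
A sketch of fitting the one-x-variable logit model to the data of
Table 1 in Python with statsmodels, rebuilding the 500 individual
records from the table counts:

    import numpy as np
    import statsmodels.api as sm

    # Rebuild Table 1: 50/150 yes/no for males, 50/250 for females
    male = np.r_[np.ones(200), np.zeros(300)]                             # X1
    yes  = np.r_[np.ones(50), np.zeros(150), np.ones(50), np.zeros(250)]  # Y

    fit = sm.Logit(yes, sm.add_constant(male)).fit()
    print(fit.params)   # alpha = log(50/250) ~ -1.61, beta1 = log((50/150)/(50/250)) ~ 0.51
    print(fit.pvalues)  # Wald p-value for beta1, close to the chi-square p-value above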
