
Lecture 1 09-08 Quantitative Methods I: Regression

The fundamental thing we need to know: the deviation score is the distance of an observation from the mean. Standard scores allow us to compare the relative position of observations within different distributions. A standard score expresses any observation in terms of its distance from the mean of its distribution, in standard deviation units. Standard scores are called z scores: z = (x - m) / SD.
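A minimal sketch in R of computing z scores by hand and with the built-in scale() function (the vector x here is made-up data):

x <- c(85, 100, 115, 130, 70)        # made-up sample data
z <- (x - mean(x)) / sd(x)           # deviation from the mean, in SD units
z_builtin <- as.numeric(scale(x))    # scale() does the same centring and scaling
all.equal(z, z_builtin)              # TRUE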

Normal distribution
A specific normal distribution may be described in terms of its mean and standard deviation. Some naturally occurring phenomena have distributions that conform well to a normal distribution. In psychology we often assume the constructs of interest are normally distributed, and even when they're not, we tend to assume that they are.

Sampling distribution
The sampling distribution for a statistic gives the probability of each possible value that the sample statistic can take. There is a different sampling distribution for each sample size. The standard deviation of the sampling distribution depends on the standard deviation of the population and on the sample size. The idea: draw many samples and take each sample's mean; in the long run, the mean of that distribution of sample means equals the population mean.

Significance testing
Significance testing tells us whether to reject a hypothesis or not. We use our sample statistic to test a hypothesis about the population parameter; the specific hypothesis being tested is called the null hypothesis. The null hypothesis usually states there is no effect, no difference, no relationship, and so on. If the null hypothesis is rejected, the alternative hypothesis is supported. The alternative hypothesis is strictly NOT(H0). When testing, we initially assume the null hypothesis is true. Null hypothesis testing doesn't say anything about what reality is: rejecting the null doesn't mean the alternative is true, only that the null is not right. Under a standard normal distribution, 5% of scores in total have an absolute z larger than 1.96, so we use a p value of .05 as the criterion for rejecting the null hypothesis. The t distribution is used when we don't know the population standard deviation. For comparing two means: H0: mu1 = mu2; Ha: mu1 ≠ mu2. The test statistic is the estimate divided by its standard error.
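A quick check of the 1.96 cutoff in R: under the standard normal, the two-tailed probability of |z| > 1.96 is about .05.

2 * pnorm(-1.96)   # 0.04999579, i.e. roughly .05
qnorm(0.975)       # 1.959964, the critical value for a two-tailed alpha of .05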

Doublethink: in your mind, you're holding two contradictory views at the same time.

Lecture 2 13-08 Regression

Equation of a straight line: Y = b1*x + b0, where b1 is the slope and b0 is the intercept. The slope tells us how much Y increases every time x increases by 1. Linear regression model: Y = b1*x + b0 + e, where e is the residual. The residual is the error of prediction: residual = observed y - predicted y.

The best-fitting line minimises the sum of squared residuals (SS): it is the line for which the SS is at a minimum.
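A small sketch with made-up data showing that the line lm() finds has a smaller residual SS than another candidate line:

set.seed(1)
x <- rnorm(50)
y <- 2 * x + 1 + rnorm(50)

fit <- lm(y ~ x)                          # least-squares line
ss_fit <- sum(resid(fit)^2)               # its sum of squared residuals
ss_other <- sum((y - (2.5 * x + 1))^2)    # SS for some other line you might draw
ss_fit < ss_other                         # TRUE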

Lecture 3 16-08 (notes also written down on pad)

reg1 = linear regression: reg1 <- lm(iq.time2 ~ iq.time1); summary(reg1). The summary reports "residual standard error: 3.11 on 98 degrees of freedom". The 3.11 is a measure of how tightly the points cluster around the regression line; it gives a measure of overall variability. The higher R^2 is, the better, since the model explains more variance: here R^2 = 0.96, i.e. 96% of the variance of iq.time2 can be explained/predicted by iq.time1.

Next example: a sample of 900 kids with a measure of school achievement in year 8, and we also know each kid's sex. If you're interested in comparing the achievement of boys and girls, you'd do a t test: t.test(yr8.ach ~ sex.f, df1, var.equal = TRUE). If the p value is < 0.05, there's a significant difference between the two groups; here t = -7.7015 and p = 3.547e-14. Compare that with the t test of the slope parameter in the corresponding regression: t = 7.702, Pr = 3.55e-14, the same result apart from the sign. And the mean group difference between females and males, 4.9 (51.7 - 46.8), is also roughly the value in the coefficients table. A sketch of this equivalence follows.
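A sketch of the equivalence between the two-sample t test and regression on a binary predictor (yr8.ach, sex.f and df1 are the names used in the lecture; the data itself isn't reproduced here):

t.test(yr8.ach ~ sex.f, data = df1, var.equal = TRUE)   # two-sample t test
reg_sex <- lm(yr8.ach ~ sex.f, data = df1)              # same comparison as a regression
summary(reg_sex)   # the slope's t and p match the t test (the sign may flip)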

When dealing with categorical data such as sex, the predictor is generally coded 0 and 1: either boy or girl. Remember, the intercept is the value of Y when x = 0, so the intercept is the mean of the group coded 0 and the slope is the difference between the group means. If you want to find out whether the variables parents' education, homework at school and homework at home have any effect on yr10.grades, you can still do a regression. It looks like this:
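A minimal sketch of that multiple regression (par.ed, hw.school, hw.home and df2 are the variable and data-frame names used later in these notes):

reg1 <- lm(yr10.grades ~ par.ed + hw.school + hw.home, data = df2)
summary(reg1)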

Multiple R-squared: 0.1584, meaning about 15% of the variance can be explained. Is that, statistically speaking, good or not? We can formalise a null hypothesis H0: b1 = b2 = b3 = 0, meaning the slopes are all zero and the model is no good at predicting kids' scores. We test this by comparing the amount of variance the model predicts with the variance the model doesn't predict; that is, compare SS_regression with SS_residual. We first need to normalise the SS by their degrees of freedom: the df for SS_regression is the number of predictors, and the df for SS_residual is n minus the number of predictors minus 1. Then F = (SS_regression / df_regression) / (SS_residual / df_residual). If the F statistic is 1 or less, the model is no good; if the model is good, F will be > 1.
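A sketch of computing that F ratio by hand from the fitted model, to check it against the F statistic that summary(reg1) reports:

a <- anova(reg1)                   # SS and df for each predictor plus residuals
ss_reg <- sum(a[["Sum Sq"]][1:3])  # regression SS pooled over the 3 predictors
ss_res <- a[["Sum Sq"]][4]         # residual SS
df_res <- a[["Df"]][4]
(ss_reg / 3) / (ss_res / df_res)   # F = MS_regression / MS_residual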

So as you can see, the F statistic is way bigger than one. The "on 3" means there were three predictor variables (par.ed, hw.school, hw.home), and the 875 residual DF is n minus the number of estimated coefficients, i.e. n - 3 - 1. The p value < 2.2e-16 means the model is significant: statistically speaking, it is better than having no model at all. We can then look individually at whether each of the three slope parameters is good or not.

If you divide the estimate by its standard error, 4.244 / 0.13741, you get t = 30.887. Comparing the estimates (and their p values), you can see that each hour of homework done at home buys more improvement than each hour of homework done at school. Adding and removing variables can alter the model's significance, because the predictor variables are all correlated with each other; you can only interpret the coefficients within the model you currently have. Regression parameters can be hard to interpret with psychological and educational variables, but you can make this easier by standardising your regression coefficients. The lsr package can work out standardised regression coefficients for us; the standardised slopes are the betas.
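A sketch using the lsr package's standardCoefs() to get standardised slopes (assumes lsr is installed and reg1 is the model above):

library(lsr)
standardCoefs(reg1)   # one column of raw coefficients (b), one of standardised (beta)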

Lecture 4 22-08 (some notes written on pad)

When we have several predictor variables, how do we interpret the regression coefficients? Assumptions of regression:
- linearity of the relationships between Y and the x's
- residuals are normally distributed with a mean of zero
- homogeneity of variance: the variance of the residuals is the same for all predicted values of the outcome variable Y
- independence of residuals

library(car); residualPlots(reg1). These statistical tests add another term to the regression: the square of the original predictor variable, which introduces non-linearity. This tests whether the squared term improves the fit of the model. shapiro.test: if the residuals are normal, W will be close to 1.

We expect predictor variables to be related to each other, but we don't want the correlations to be too high, because the variables should be competing with each other to explain the variance; if the predictors are too correlated, you won't be able to tell which one is doing the predicting. Correlations of .7, .8, .9 between predictors are not good, but multicollinearity can be more subtle than that. To detect multicollinearity, look at variance inflation factors (VIF): treat one predictor variable as an outcome and let all the other predictors try to predict it. If that R^2 is a big number, the variable you're interested in overlaps heavily with the others, which is bad. vif(reg1) gives the VIF for each variable, so it's easy to check for multicollinearity.

A Cook's distance > 1 is a rule of thumb for an influential observation. rlm() in package MASS does robust regression: blah <- rlm(yr10.grades ~ par.ed + hw.school + hw.home + yr8.grades, df2); summary(blah). Confidence intervals: cbind(round(coef(reg1), 3), round(confint(reg1), 3)). A wide confidence interval is not good.
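A sketch pulling those diagnostic checks together (assumes the car and MASS packages are installed; reg1 and df2 are as above):

library(car)    # residualPlots(), vif()
library(MASS)   # rlm()

residualPlots(reg1)               # curvature tests via added squared terms
shapiro.test(resid(reg1))         # W near 1 suggests normal residuals
vif(reg1)                         # variance inflation factor per predictor
which(cooks.distance(reg1) > 1)   # flag influential observations

blah <- rlm(yr10.grades ~ par.ed + hw.school + hw.home + yr8.grades, data = df2)
summary(blah)                     # robust fit as a cross-check on OLS

cbind(round(coef(reg1), 3), round(confint(reg1), 3))   # coefficients with 95% CIs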

Lecture 5 23-08

Comparing two regression models, where the second has an extra predictor variable. The model with the fewer predictor variables is said to be nested within the larger model; that is, the first model is a subset of the second. What we want to know is: does adding the extra variable give us extra explanation of year 10 grades? Do we account for more variance in year 10 grades by adding the extra regressor? That is, does R^2 increase in a meaningful way?

That gives us a difference of 0.124: the variance explained has increased by 12.4 percentage points, but is it statistically significant? We can find out with anova(reg1, reg2), where reg1 is the original model and reg2 is the model with the extra variable. The AIC is the other tool for comparing two models' performance: when your residuals are larger and you have lots of predictor variables, your AIC will be higher, which is bad. As a rule of thumb, when models perform similarly we choose the simpler one.
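A sketch of both comparisons; that reg2 adds yr8.grades is an assumption here, matching the robust model fitted in lecture 4:

reg2 <- lm(yr10.grades ~ par.ed + hw.school + hw.home + yr8.grades, data = df2)

summary(reg2)$r.squared - summary(reg1)$r.squared   # change in R^2
anova(reg1, reg2)                                   # F test for the nested comparison
AIC(reg1, reg2)                                     # lower AIC is better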

Model 2 has more variables BUT a smaller AIC, meaning it is the better model. You can also drop a variable and compare the models the same way: if the change is really small, dropping that variable has no significant impact on the model. Next example: reaction time versus group (with two categories). You can make the formula, as sketched below:
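A minimal sketch of that comparison, assuming a data frame df3 with columns rt and group (these names are assumptions; the lecture's data isn't reproduced):

t.test(rt ~ group, data = df3, var.equal = TRUE)   # compare mean RT across the two groups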

There's no difference in means between the two groups: the group coded 1 and the group coded 2 have statistically the same mean, though group one has more variability. There is also a continuous predictor; calling it x, you can simply regress reaction time on x, as sketched below, though that analysis neglects group membership.
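A sketch of the single-predictor regression (df3, rt and x are assumed names, as above):

reg_x <- lm(rt ~ x, data = df3)   # one line through both groups, ignoring group
summary(reg_x)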

But that single regression line doesn't describe the two groups very well, seeing as they're different. You can regress with all three variables, as sketched below.
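Adding group as a second predictor fits two parallel lines, one per group (same assumed names):

reg_xg <- lm(rt ~ x + group, data = df3)
summary(reg_xg)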

There's an interaction between group membership and x in predicting reaction time: the relationship between RT and x depends on what group you're in. You can incorporate this into the regression model, as sketched below.
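The interaction term lets the slope of x differ between the groups; rt ~ x * group expands to x + group + x:group:

reg_int <- lm(rt ~ x * group, data = df3)
summary(reg_int)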

Now they're all statistically significant. To understand the interaction, we concentrate on plotting separate fitted lines for the two groups.
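One way to plot separate per-group lines in base R, assuming group is a two-level factor in df3:

plot(rt ~ x, data = df3, col = as.numeric(df3$group))   # colour points by group
for (g in levels(df3$group)) {
  sub <- subset(df3, group == g)
  abline(lm(rt ~ x, data = sub),                        # per-group regression line
         col = which(levels(df3$group) == g))
}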
