It is often of interest to assess the relationship between two continuous variables, e.g. between weight change and age. A simple and very informative way of doing this is to produce a scatter plot using the scatter command.
[Figure: scatter plot of weight change against age in years]
Scatter Plots
Scatter plots can be enhanced to
- indicate sub-groups
- insert a straight (regression) line
- insert a smooth line

The twoway command (short for graph twoway) is useful in this respect. The command
- overlays several graphs (e.g. scatter and line graphs) in the same plotting area
- indicates sub-groups by using different symbols or line styles
- fits lines separately for sub-groups
- has many options for labelling axes, titles and legends
Command twoway
Produce a scatter graph of weight against age which includes a smooth line that describes the relationship:

twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Lowess fit"))
[Figure: scatter plot of weight change against age with lowess fit]
Produce a scatter graph of weight against age for those who are not at suicide risk and include a regression line:

twoway (scatter weight age if life==1) /*
*/ (lfit weight age if life==1), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed - no suicidal thoughts" /*
*/ 2 "Regression fit - no suicidal thoughts"))

[Figure: scatter plot of weight change against age with regression fit, subjects without suicidal thoughts]
A scatter plot matrix can be used to look at the bivariate relationships between several continuous variables. Use the graph matrix command for this.
[Figure: scatter plot matrix; panels include age and iq]
Concept of Correlation
Instead of assessing the relationship between ordered and/or continuous variables visually, indices can be constructed that measure the degree of directional association (= correlation). Correlation coefficients range from -1 to 1, where -1 indicates a perfect negative association, 0 no association and 1 a perfect positive association.
A number of different concepts have been employed to define correlation; common ones are
- Pearson correlation
- Spearman correlation
- others, e.g. Kendall's tau correlation coefficient (not covered here)
Pearson Correlation
The Pearson correlation coefficient r measures the degree of linear relationship between two variables. It is frequently employed due to its links with regression analysis. The Stata command corr supplies a matrix of observed Pearson correlations, e.g.
corr age iq weight
(obs=100)

             |      age       iq   weight
-------------+---------------------------
         age |   1.0000
          iq |  -0.4363   1.0000
      weight |   0.2856  -0.2597   1.0000
Note that only cases with complete observations on all variables are used. Here 100 subjects had complete records.
The pwcorr command calculates pairwise correlation coefficients using all the available information. On request, the command also supplies a test of zero correlation (based on normality).
ci2 and cii2 (immediate version) are user-contributed commands that extend Stata's ci and cii commands to include correlations
- written by Paul T. Seed
- install from the Stata website (sg159)
- background information provided in STB-59
The commands produce a confidence interval for a single correlation based on Fisher's r-to-z transformation. The CI construction assumes bivariate normality.
The immediate version requires only the sample size n and the observed Pearson correlation coefficient r, e.g. r=-0.26 based on n=100 observations.
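The interval rests on Fisher's r-to-z transformation, which can be reproduced by hand. The following is a minimal sketch using only built-in Stata functions (atanh, tanh, invnormal); the scalar names are illustrative, and this is not the cii2 implementation itself.

```stata
* Sketch: 95% CI for a Pearson correlation via Fisher's r-to-z transformation.
* Inputs are the values quoted in the text; scalar names are illustrative.
scalar r_obs = -0.26
scalar n_obs = 100

scalar z_obs = atanh(r_obs)          // Fisher transform of r
scalar se_z  = 1/sqrt(n_obs - 3)     // approximate standard error of z
scalar crit  = invnormal(0.975)      // two-sided 95% normal quantile

scalar r_lo = tanh(z_obs - crit*se_z)   // back-transform the lower limit
scalar r_hi = tanh(z_obs + crit*se_z)   // back-transform the upper limit

display "95% CI for r: " %6.3f r_lo " to " %6.3f r_hi
```

With these inputs the limits come out at roughly -0.43 and -0.07, so the interval excludes zero. Like the commands described above, the calculation assumes bivariate normality.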
Partial Correlation
We often want to measure the strength of the relationship between two variables after controlling for a third (or more). The quantity of interest is then the partial correlation: the Pearson correlation between two variables expected if the level of the third were held constant. For example,
IQ is negatively related to weight change (r=-0.26), while weight change is positively related to age and age is negatively related to IQ. What would be the correlation between IQ and weight change if age were held constant?
Use the pcorr command to calculate the partial correlation between IQ and weight change after controlling for age:
pcorr iq weight age
(obs=100)

Partial correlation of iq with

    Variable |    Corr.     Sig.
-------------+------------------
      weight |  -0.1567    0.121
         age |  -0.3913    0.000
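With a single control variable, the partial correlation can also be obtained directly from the three pairwise Pearson correlations in the corr output shown earlier. The sketch below applies the standard first-order partial correlation formula; the scalar names are illustrative.

```stata
* Sketch: first-order partial correlation of iq and weight given age,
* computed from the pairwise Pearson correlations reported by corr.
scalar r_iw = -0.2597    // iq and weight
scalar r_ia = -0.4363    // iq and age
scalar r_wa =  0.2856    // weight and age

scalar r_iw_a = (r_iw - r_ia*r_wa) / sqrt((1 - r_ia^2)*(1 - r_wa^2))

display "Partial correlation of iq and weight given age: " %7.4f r_iw_a
```

This reproduces the value of -0.1567 reported by pcorr: once age is held constant, the association between IQ and weight change is weaker than the unadjusted r=-0.26.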
Spearman Correlation
The Spearman correlation coefficient is defined as the Pearson correlation based on the ranks of the two variables. Since it only needs ranks, it can be used for ordinal outcomes.
It measures the degree of any monotonic relationship between two variables. A significance test for the Spearman correlation coefficient can be derived without making distributional assumptions
- in that sense the coefficient is nonparametric
- the test assumes that there are no ties
The spearman command provides the observed coefficient and a significance test:

Test of Ho: iq and weight are independent
    Prob > |t| =       0.0217
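The definition of the Spearman coefficient as a Pearson correlation of ranks can be checked directly. The sketch below uses simulated data (the variables x and y are hypothetical and not part of the course data) and compares corr on the ranks with the spearman command.

```stata
* Sketch: Spearman's coefficient equals the Pearson correlation of the ranks.
* The data are simulated purely for illustration (hypothetical x and y).
clear
set seed 2024
set obs 50
generate double x = rnormal()
generate double y = exp(x) + rnormal()   // monotone trend plus noise

egen double rank_x = rank(x)             // ranks of x
egen double rank_y = rank(y)             // ranks of y

quietly corr rank_x rank_y
scalar rho_from_ranks = r(rho)           // Pearson correlation of the ranks

spearman x y
display "Pearson on ranks: " %7.4f rho_from_ranks "   spearman: " %7.4f r(rho)
```

Because the simulated values are continuous there are no ties, so the two numbers agree.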
[Figure: scatter plot matrix; panels include weight and height]
We might be interested in how weight changes when children get older. A Pearson correlation could be used to measure the strength of the linear relationship between weight and age. Simple linear regression estimates the nature of a linear relationship between a dependent variable y (response, outcome) and an independent variable x (explanatory variable, predictor). To do this it employs the model of a linear relationship between y and x: the observed data y are assumed to arise as the sum of a linear regression line (= linear predictor) and an error term:
$$y = \beta_0 + \beta_1 x + \varepsilon$$

where
- $\beta_0$ is the intercept or constant: the value of y when x is 0
- $\beta_1$ is the gradient or slope: the increase in y when x increases by one unit
- $\varepsilon$ is the error: $\beta_0 + \beta_1 x$ is the equation for the predicted values of y, which lie on a straight line, and $\varepsilon$ is the difference between the observed and predicted values of y
twoway (scatter weight age) (lfit weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Regression fit"))
[Figure: scatter plot of weight in pounds against age with regression fit]
The structure of the regress command is

regress depvar indepvar

i.e. the first variable given after the command is assumed to be the dependent variable.
Regression theory also provides confidence intervals for the regression coefficients and tests of zero coefficients. Typically only the significance of the slope coefficient is of interest, since this amounts to testing the existence of a relationship between the response and explanatory variables.
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.642857   .9551151     3.81   0.003     1.514728    5.770986
       _cons |   30.57143   8.613705     3.55   0.005      11.3789    49.76396
------------------------------------------------------------------------------
Weight increased significantly with age (t(10)=3.81, p=0.003). The increase was estimated to be 3.64 pounds per extra year of age (95% CI from 1.51 to 5.77 pounds).
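The test statistic, p-value and confidence limits quoted for the slope all follow from the coefficient and its standard error. As a minimal sketch, they can be recomputed with Stata's built-in t-distribution functions; the scalar names are illustrative and the degrees of freedom are taken from the t(10) quoted above.

```stata
* Sketch: reproduce t, the p-value and the 95% CI for the age slope
* from the coefficient and standard error in the regress output.
scalar b  = 3.642857
scalar se = .9551151
scalar df = 10                        // residual degrees of freedom, as in t(10)

scalar t    = b/se
scalar p    = 2*ttail(df, abs(t))     // two-sided p-value
scalar crit = invttail(df, 0.025)     // upper 2.5% point of the t distribution

scalar ci_lo = b - crit*se
scalar ci_hi = b + crit*se

display "t = " %5.2f t "   p = " %6.4f p
display "95% CI: " %8.6f ci_lo " to " %8.6f ci_hi
```

The limits agree with the interval from 1.51 to 5.77 pounds reported in the output.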
R-squared is a measure of how well the regression model fits the observed data: it is the proportion of the variance in y (say weight) that is explained by predicting y from x (say weight from age). Here 59.26% of the variance in weight is accounted for by age. An adjusted version of this coefficient, adj. R-squared, provides a better estimate of the population value: we estimate that 55.19% of the variance in weight can be explained by age. The model showed significant fit (F(1,10)=14.55, p=0.0034).
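The adjusted R-squared and the overall F test can both be derived from R-squared, the sample size and the number of predictors. The following sketch assumes n=12, which is implied by the F(1,10) reported above; the scalar names are illustrative.

```stata
* Sketch: adjusted R-squared and the overall F test derived from R-squared.
* n = 12 is implied by the reported F(1,10); scalar names are illustrative.
scalar r2  = 0.5926
scalar n   = 12
scalar k   = 1                        // number of predictors
scalar df2 = n - k - 1                // residual degrees of freedom

scalar r2_adj = 1 - (1 - r2)*(n - 1)/df2
scalar Fstat  = (r2/k) / ((1 - r2)/df2)
scalar pval   = Ftail(k, df2, Fstat)

display "Adj R-squared = " %6.4f r2_adj
display "F = " %6.2f Fstat "   p = " %6.4f pval
```

With a single predictor the F statistic is the square of the slope's t statistic, which is why the p-value matches the one for the age coefficient.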
Generating Predictions
Once a regression model has been fitted in Stata, so-called post-estimation commands can be used to elicit further information. Often one might want to use the fitted regression line to predict the expected value of y for a given x, e.g. the expected weight at age=10 years. The predict command provides predicted values and standard errors. The command always refers to the last regression model fitted. The format is
predict newname, what
Command predict
E.g. to get predicted expected weights for the ages observed in our sample, type

predict pred

This generates a new variable called pred in the data file that contains the predicted weights. To get standard errors of these predictions, type

predict predse, stdp

This generates another variable called predse that holds the respective standard errors.
An approximate 95% CI for the expected value is then given by the estimated value +/- twice its standard error. We can use the generate command to construct lower and upper limits of 95% CIs.
The lincom command provides tests of significance and estimates of linear combinations of the regression coefficients. The regression coefficient of a variable is referred to by the name of the variable.
E.g. lincom X1*0.5+X2*0.5 tests and estimates the arithmetic average of the regression coefficients of variables X1 and X2.
We can use this to predict the value of the dependent variable (say weight) for given values (say 10 years) of the independent variable(s), e.g. lincom _cons+age*10
 ( 1)  10 age + _cons = 0

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |         67   2.063284    32.47   0.000     62.40272    71.59728
------------------------------------------------------------------------------
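The two routes to a prediction, predict with generate and lincom, can be compared side by side. The sketch below uses 12 simulated observations, which are NOT the data behind the output above; the variable names lower and upper are illustrative.

```stata
* Sketch: predicted weight at age 10 via predict and via lincom.
* The 12 observations are simulated for illustration only; they are not
* the data behind the output shown in the text.
clear
set seed 1
set obs 12
generate double age    = 6 + mod(_n, 7)            // ages between 6 and 12
generate double weight = 30 + 3.5*age + rnormal(0, 5)

regress weight age

predict double pred                  // fitted values (the default, xb)
predict double predse, stdp          // standard errors of the fitted values
generate double lower = pred - 2*predse   // approximate 95% limits
generate double upper = pred + 2*predse
list age pred predse lower upper if age == 10

lincom _cons + age*10                // same estimate and standard error
```

The estimate and standard error from lincom match pred and predse for the observations with age 10. The lincom interval is slightly wider than the approximate one because it uses the exact t quantile rather than 2.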
Regression Assumptions
Regression inferences assume that the observations are independent and that the errors have
1) a normal distribution
2) constant variance
3) zero mean (linear relationship)
These assumptions should be assessed. There are a number of diagnostics that use the residuals (differences between observed and predicted values):
1) Histogram, QQ-plot, box plot of residuals
2) Residual plot
3) Partial residual plot (see later)
Residuals can be generated with the predict command:
- option resid for unstandardised residuals
- option rstandard for residuals standardised to have SD=1
Residual plot
A residual plot is a scatter plot of the residuals against the fitted values (= predicted values for the sample, here predicted weights). It is commonly used to assess the homogeneity (constant variance) assumption.
If the variance of the errors is constant, the spread of the residuals around the zero reference line is expected not to change with the size of the fitted values.
However, the rvfplot command uses the unstandardised residuals; it is perhaps more appropriate to plot the standardised residuals against the fitted values.
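A plot of standardised residuals against fitted values can be built from predict and scatter. The following is a sketch on simulated data (the variables x and y are hypothetical), showing both kinds of residual and the two versions of the residual plot.

```stata
* Sketch: residual diagnostics after regress, on simulated data
* (hypothetical variables x and y, for illustration only).
clear
set seed 7
set obs 100
generate double x = runiform()*10
generate double y = 2 + 0.5*x + rnormal()

regress y x
predict double fit, xb               // fitted values
predict double res, residuals        // unstandardised residuals
predict double rstd, rstandard       // standardised residuals

qnorm res                            // QQ-plot against the normal distribution
rvfplot, yline(0)                    // unstandardised residuals vs fitted values
scatter rstd fit, yline(0) ytitle(Standardised residuals)
```

Under constant variance the points in the last plot should scatter evenly around the zero line, mostly within about 2 units either side.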
Residualplot ctd.
rvfplot
[Figure: residuals plotted against fitted values, shown in two panels]
In a simple linear regression model the relationship between the dependent variable y and the single independent variable x can be assessed by means of a scatter plot. This could include a smooth line that follows the data, e.g.
twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Lowess smooth"))
[Figure: scatter plot of weight in pounds against age with lowess smooth]
We might be interested in weight changes with age that are not simply due to growing at the same time. Multiple linear regression estimates the linear relationships between a dependent variable y and several continuous independent variables $x_1, x_2, \dots, x_p$. The model is simply extended to include several linear effects:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$$

E.g. $\beta_1$ is the increase in y when $x_1$ increases by one unit while $x_2, \dots, x_p$ remain constant. Because of this the regression coefficients are sometimes referred to as partial regression coefficients. They measure the relationship between the response and an explanatory variable after adjusting for the remaining explanatory variables.
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         age |   2.050126   .9372256     2.19   0.056                 .4332373
      height |    .722038   .2608051     2.77   0.022                 .5483191
       _cons |   6.553048   10.94483     0.60   0.564                        .
------------------------------------------------------------------------------
The effect of age on weight was not statistically significant after adjusting for height. Within a group of children of the same height, weight was estimated to increase by 2.05 pounds per year (95% CI from -0.07 to 4.17 pounds).
E.g. test that height and age together have no effect on weight:

testparm age height

 ( 1)  age = 0
 ( 2)  height = 0

       F(  2,     9) =   15.95
            Prob > F =    0.0011
Partial Residualplots
Regression assumptions 1) and 2) can be checked as described before. However, linearity assumption 3) can no longer be assessed by simply plotting the response against the respective explanatory variable. The model assumes that the part of the explanatory variable that does not vary with the other explanatory variables has a linear effect on the response.
A partial residual plot plots the residuals from regressing y on all the explanatory variables except the one in question against the residuals from a regression of the explanatory variable of interest on the other explanatory variables.
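The construction just described can be carried out by hand with two auxiliary regressions. The sketch below uses simulated data (the values of age, height and weight are generated for illustration and are not the course data) and shows that the slope through the resulting plot equals the partial regression coefficient from the full model.

```stata
* Sketch: building a partial residual (added-variable) plot by hand.
* Simulated data for illustration only; not the data used in the text.
clear
set seed 12345
set obs 100
generate double age    = 6 + 6*runiform()
generate double height = 30 + 2.5*age + rnormal(0, 3)
generate double weight = 5 + 2*age + 0.7*height + rnormal(0, 5)

regress weight age height
scalar b_height = _b[height]          // partial regression coefficient

quietly regress weight age            // y on all predictors except height
predict double e_weight, residuals
quietly regress height age            // height on the other predictors
predict double e_height, residuals

scatter e_weight e_height             // the plot described above
regress e_weight e_height             // its slope equals b_height
scalar b_resid = _b[e_height]
display "Full model: " %8.5f b_height "   residual regression: " %8.5f b_resid
```

A clearly curved pattern in this scatter plot would suggest that the linearity assumption does not hold for height.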
avplot height
[Figure: partial residual plots of e( weight | X ) against e( age | X ) and against e( height | X )]
When there is a large set of potential predictor variables, investigators often would like to empirically select a subset of important variables.
Example: data on air pollution in 41 US cities (USair.dta)
so2 - sulphur dioxide content of air in micrograms per cubic metre
temperat - average annual temperature in °F
manuf - number of manufacturing enterprises employing 20 or more workers
pop - population size (1970 census) in thousands
wind - average wind speed in miles per hour
precip - average annual precipitation in inches
days - average number of days with precipitation per year
Which climate and human ecology variables predict air pollution?
Approach used: a variable is regarded as important if it has a significant effect on the outcome at some test level.
The problem is that the significance of any independent variable depends on the other variables included in the model equation. As a result several selection procedures have been suggested, which vary in the set of variables for which the test is adjusted. Therefore different selection procedures can lead to different variable subsets being chosen!
Forward selection
Starting with just the constant, the model is updated as follows: include the variable which has the smallest p-value when added to the previous model. Stop if the next variable to be added is not significant at the chosen significance level.

Backward selection
Starting with all variables in the model, keep updating the model as follows: exclude the variable which has the largest p-value for a significance test. Stop if the next variable to be excluded is significant.

Stepwise selection
Like forward selection, but exclude variables previously selected if these become non-significant.
Stata's sw command can be placed before the regress command to run through a series of regressions. The option pe(alpha) specifies that the inclusion level in forward regression is alpha.
      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------
The option pr(alpha) specifies that the exclusion level in backward regression is alpha
------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    temperat |  -1.121288   .4158633    -2.70   0.011    -1.965535   -.2770402
       manuf |   .0648871   .0155449     4.17   0.000     .0333293     .096445
         pop |  -.0393347   .0149366    -2.63   0.012    -.0696575   -.0090119
        wind |  -3.082399   1.765623    -1.75   0.090    -6.666805    .5020065
      precip |   .4194681   .2162447     1.94   0.060    -.0195319    .8584681
       _cons |   100.1524   30.27521     3.31   0.002     38.69051    161.6144
------------------------------------------------------------------------------
Specifying both pr(alpha) and pe(alpha) requests a stepwise regression; the default is backwards. The inclusion level has to be smaller than the exclusion level.
sw regress so2 temperat manuf pop wind precip days, pe(0.1) /*
*/ pr(0.11) forward

begin with empty model
p = 0.0000 <  0.1000  adding  manuf
p = 0.0003 <  0.1000  adding  pop
p = 0.0913 <  0.1000  adding  days
      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------
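The three selection modes differ only in which of pe() and pr() are supplied. The sketch below runs them on simulated data in which x1 to x4 and y are hypothetical and only x1 and x2 truly affect y. It uses the prefix form stepwise, options: command, which is how recent Stata releases document these options; that is an assumption about the reader's Stata version, whereas the examples in the text use the older sw form.

```stata
* Sketch: forward, backward and forward-stepwise selection on simulated data.
* x1 to x4 and y are hypothetical; only x1 and x2 truly affect y.
* Assumes a Stata release that supports the stepwise prefix syntax.
clear
set seed 99
set obs 200
forvalues j = 1/4 {
    generate double x`j' = rnormal()
}
generate double y = 1 + 2*x1 - 1.5*x2 + rnormal()

* pe() alone: forward selection
stepwise, pe(0.1): regress y x1 x2 x3 x4

* pr() alone: backward selection
stepwise, pr(0.11): regress y x1 x2 x3 x4

* pe() and pr() with forward: forward stepwise selection
stepwise, pe(0.1) pr(0.11) forward: regress y x1 x2 x3 x4
```

The strong predictors x1 and x2 should be retained by all three procedures, while the noise variables x3 and x4 may or may not enter depending on the random draw, which illustrates why the selected subset can differ between procedures.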
Drawbacks
- Analysis is exploratory, not confirmatory, i.e. not driven by existing theory or prior hypotheses. Therefore results are more likely to be spurious. (After all, there might be further explanatory variables out there which have not been measured.)
- Important variables may not be selected because inclusion of less important variables has made them redundant.
- Multiple testing: a whole list of variables is tested at each stage of the procedure, possibly giving many false positives.
Drawbacks ctd.
- Any cases with missing values on any of the candidate variables entered for selection will not be used during the selection process; run the model again on the selected variables to maximise data use.
- Backward selection is only possible when the number of possible explanatory variables is not too large compared to the sample size.
For further information and references see http://www.stata.com/support/faqs/stat/stepwise.html
Blocking
Underlying theory might suggest that a set of variables should be kept together at all times. Blocks of explanatory variables are indicated by brackets:

sw regress so2 (temperat wind precip days) (manuf pop), /*
*/ pe(0.1) pr(0.11) forward

begin with empty model
p = 0.0000 <  0.1000  adding  manuf pop
p = 0.0972 <  0.1000  adding  temperat wind precip days
      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  6,    34) =   11.48
       Model |  14754.6358     6  2459.10596           Prob > F      =  0.0000
    Residual |  7283.26667    34  214.213726           R-squared     =  0.6695
-------------+------------------------------           Adj R-squared =  0.6112
       Total |  22037.9024    40  550.947561           Root MSE      =  14.636

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |   .0649182   .0157483     4.12   0.000     .0329139    .0969225
         pop |  -.0392767   .0151327    -2.60   0.014    -.0700302   -.0085233
    temperat |  -1.267941   .6211795    -2.04   0.049     -2.53033   -.0055524
        wind |  -3.181366   1.815019    -1.75   0.089    -6.869928    .5071974
      precip |   .5123589   .3627551     1.41   0.167    -.2248481    1.249566
        days |  -.0520502   .1620139    -0.32   0.750     -.381302    .2772016
       _cons |   111.7285    47.3181     2.36   0.024     15.56652    207.8904
------------------------------------------------------------------------------
The underlying theory might suggest that a set of variables should always be included in the model, e.g. demographics. The option lockterm1 forces the first term given after the response into the model:

sw regress so2 (manuf pop) temperat wind precip days, /*
*/ pe(0.1) pr(0.11) forward lockterm1

begin with term 1 model
p = 0.0913 <  0.1000  adding  days
      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------