
STATISTICAL ANALYSIS USING STATA

Week 3: Correlation and Regression

Ulrike Nauman Dept. of Biostatistics

Relationships Between Continuous Variables

It is often of interest to assess the relationship between two continuous variables, e.g. between weight change and age.
A simple and very informative way of doing this is to produce a scatter plot using the scatter command.

Example data: female psychiatric patients (R:\applications\courses\stata courses\2011 course\female.dta)

scatter weight age

[Figure: scatter plot of weight change against age in years]

Scatter Plots

Often want to do more than a simple scatter plot

- Indicate sub-groups
- Insert a straight (regression) line
- Insert a smooth line

The twoway command (short for graph twoway) is useful in this respect. The command

- Overlays several graphs (e.g. scatter and line graphs) in the same plotting area
- Indicates sub-groups by using different symbols or line styles
- Fits lines separately for sub-groups
- Has many options for labelling axes, titles, legends

Command twoway

Produce a scatter graph of weight against age which includes a smooth line that describes the relationship

twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Lowess fit"))

[Figure: scatter plot of weight change against age in years, with observed values and the lowess fit]

Command twoway ctd.


Produce a scatter graph of weight against age for those that are not at suicide risk and include a regression line

twoway (scatter weight age if life==1) /*
*/ (lfit weight age if life==1), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed - no suicidal thoughts" /*
*/ 2 "Regression fit - no suicidal thoughts"))

[Figure: scatter plot of weight change against age in years for patients with no suicidal thoughts, with observed values and the regression fit]

Scatter Plot Matrix


A scatter plot matrix can be used to look at the bivariate relationships between several continuous variables.
Use the graph matrix command, e.g.

graph matrix age iq weight

[Figure: scatter plot matrix of age, iq and weight change over last 6m (lb)]

Concept of Correlation

Instead of assessing the relationship between ordered and/or continuous variables visually, indices can be constructed that measure the degree of directional association (=correlation).
Correlation coefficients range from -1 to 1, where

- -1 = complete negative relationship
- 0 = no correlation
- 1 = complete positive relationship

A number of different concepts have been employed to define correlation. Common ones are

- Pearson correlation
- Spearman correlation
- Others, e.g. Kendall's tau correlation coefficient (not covered here)

Pearson Correlation

The Pearson correlation coefficient r measures the degree of linear relationship between two variables.
It is frequently employed due to its links with regression analysis.
The Stata command corr supplies a matrix of observed Pearson correlations, e.g.

corr age iq weight
(obs=100)

             |      age       iq   weight
-------------+---------------------------
         age |   1.0000
          iq |  -0.4363   1.0000
      weight |   0.2856  -0.2597   1.0000

Note: only cases with complete observations on all variables are used. Here 100 subjects had complete records.

Pearson Correlation ctd.

The pwcorr command calculates pairwise correlation coefficients using all the available information.
The command also supplies a test of zero correlation (based on normality) if requested, e.g.

pwcorr age iq weight, obs sig

             |      age       iq   weight
-------------+---------------------------
         age |   1.0000
             |
             |      118
             |
          iq |  -0.4345   1.0000
             |   0.0000
             |      110      110
             |
      weight |   0.3010  -0.2597   1.0000
             |   0.0016   0.0091
             |      107      100      107

Each cell shows the correlation, the p-value of the test (sig) and the number of observations used (obs).

Pearson Correlation ctd.

ci2 and cii2 (immediate version) are user-contributed commands that extend Stata's ci and cii commands to include correlations

- Written by Paul T. Seed
- Install from the Stata website (sg159)
- Background information provided in STB-59

The commands produce a confidence interval for a single correlation based on Fisher's r-to-z transformation.
CI construction assumes bivariate normality.

Pearson Correlation ctd.

Example: construct a CI for the correlation between IQ and weight change

ci2 iq weight, corr


Confidence interval for Pearson's product-moment correlation of iq and weight, based on Fisher's transformation. Correlation = -0.260 on 100 observations (95% CI: -0.434 to -0.067)

The immediate version requires only the sample size n and the observed Pearson correlation coefficient r, e.g. for r=-0.26 based on n=100 observations

cii2 100 -0.26, corr


Confidence interval for correlation, based on Fisher's transformation. Correlation = -0.260 on 100 observations (95% CI: -0.434 to -0.067)
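The interval reported by cii2 can be reproduced by hand. The following is a minimal sketch of the Fisher r-to-z calculation in Python, not the ci2 source code; the function name fisher_ci and the normal quantile 1.959964 are assumptions made for illustration.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate CI for a Pearson correlation via Fisher's r-to-z transformation.

    z_crit is the 0.975 standard normal quantile (assumed here for a 95% CI).
    """
    z = math.atanh(r)                    # r-to-z transformation
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    lower = math.tanh(z - z_crit * se)   # back-transform the limits to the r scale
    upper = math.tanh(z + z_crit * se)
    return lower, upper

# Values from the slide: r = -0.26 on n = 100 observations
lower, upper = fisher_ci(-0.26, 100)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With r=-0.26 and n=100 this agrees with the limits shown above to three decimals. The interval is not symmetric around r because the limits are constructed on the z scale.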


Partial Correlation

Often want to measure the strength of relationship between two variables after controlling for a third (or more).
Interested in the partial correlation: the Pearson correlation between two variables expected if the level of the third was held constant.
For example,

- IQ is negatively related to weight change (r=-0.26), while weight change is positively related to age and age is negatively related to IQ.
- What would be the correlation between IQ and weight change if age was held constant?

Partial Correlation ctd.

Use the pcorr command to calculate the partial correlation between IQ and weight change after controlling for age

pcorr iq weight age
(obs=100)

Partial correlation of iq with

    Variable |    Corr.     Sig.
-------------+------------------
      weight |  -0.1567    0.121
         age |  -0.3913    0.000
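With a single control variable, the partial correlation can be obtained from the three pairwise Pearson correlations. The sketch below applies the standard first-order formula to the rounded values from the corr output for the 100 complete cases; the function name partial_corr is hypothetical, and this is an illustration rather than how pcorr computes its results internally.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z (first-order formula)."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Pairwise correlations from the corr output (obs=100)
r_age_iq = -0.4363
r_age_weight = 0.2856
r_iq_weight = -0.2597

# IQ and weight change, holding age constant
iq_weight_given_age = partial_corr(r_iq_weight, r_age_iq, r_age_weight)
# IQ and age, holding weight change constant
iq_age_given_weight = partial_corr(r_age_iq, r_iq_weight, r_age_weight)

print(round(iq_weight_given_age, 4), round(iq_age_given_weight, 4))
```

Both values agree with the pcorr output to four decimals. The correlation between IQ and weight change shrinks from -0.26 to about -0.16 once age is held constant.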


Spearman Correlation

The Spearman correlation coefficient is defined as the Pearson correlation based on the ranks of the two variables.
Since it only needs ranks, it can be used for ordinal outcomes.

- Ties are usually given average ranks

It measures the degree of any monotonic relationship between two variables.
A significance test for the Spearman correlation coefficient can be derived without making distributional assumptions.

- In that sense the coefficient is nonparametric
- The test assumes that there are no ties
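The definition above (Pearson correlation of the ranks, with ties given average ranks) can be sketched in a few lines of Python. The helper names average_ranks, pearson and spearman and the toy data are made up for illustration; this is not Stata's implementation.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (hypothetical): a monotonic but non-linear relationship
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(pearson(x, y), spearman(x, y))
```

For the toy data the Spearman coefficient is 1 because the relationship is perfectly monotonic, while the Pearson coefficient is below 1 because the relationship is not linear.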


Spearman Correlation ctd.

The spearman command provides the observed coefficient and a significance test

spearman iq weight

 Number of obs =     100
Spearman's rho =      -0.2294

Test of Ho: iq and weight are independent
    Prob > |t| =       0.0217

Relationships Between Continuous Variables

The (very small) data set growth.dta contains children's

- ages in years
- weights in pounds
- heights in inches

graph matrix age weight height

[Figure: scatter plot matrix of age, weight and height]

Simple Linear Regression

We might be interested in how weight changes when children get older.
A Pearson correlation could be used to measure the strength of the linear relationship between weight and age.
Simple linear regression estimates the nature of a linear relationship between a dependent variable y (response, outcome) and an independent variable x (explanatory variable, predictor).
To do this it employs the model of a linear relationship between y and x.
The observed data y are assumed to arise as the sum of a linear regression line (=linear predictor) and an error term.

Simple Linear Regression Model

In simple linear regression, the dependent variable y is modelled as

$y = \beta_0 + \beta_1 x + \varepsilon$

where

- $\beta_0$ is the intercept or constant: the value of y when x is 0
- $\beta_1$ is the gradient or slope: the increase in y when x increases by one unit
- $\varepsilon$ is the error: $\beta_0 + \beta_1 x$ is the equation for the predicted values of y, which lie on a straight line, and $\varepsilon$ is the difference between the observed and predicted values of y

The parameters $\beta_0$ and $\beta_1$ are referred to as regression coefficients

Displaying a Fitted Regression Line

Display the regression line from fitting a regression of weight on age


twoway (scatter weight age) (lfit weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) legend(order(1 "Observed" /*
*/ 2 "Regression fit"))

[Figure: scatter plot of weight in pounds against age in years, with observed values and the regression fit]

Simple Linear Regression ctd.

The regress command provides estimates of the regression coefficients


The structure is

regress depvar indvar

I.e. the first variable given after the command is assumed to be the dependent variable

Regression theory also provides confidence intervals for the regression coefficients and tests of zero coefficients.
Typically only the significance of the slope coefficient is of interest, since this amounts to testing the existence of a relationship between the response and explanatory variables.

Simple Linear Regression ctd.


regress weight age

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  1,    10) =   14.55
       Model |  526.392857     1  526.392857           Prob > F      =  0.0034
    Residual |  361.857143    10  36.1857143           R-squared     =  0.5926
-------------+------------------------------           Adj R-squared =  0.5519
       Total |      888.25    11       80.75           Root MSE      =  6.0155

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.642857   .9551151     3.81   0.003     1.514728    5.770986
       _cons |   30.57143   8.613705     3.55   0.005      11.3789    49.76396
------------------------------------------------------------------------------

Annotations to the output:

- Coef.: estimated regression coefficients
- t and P>|t|: significance tests (t-tests)
- [95% Conf. Interval]: limits of confidence intervals
- Residual df (10): degrees of freedom used in the t-tests

Weight increased significantly with age (t(10)=3.81, p=0.003). The increase was estimated to be 3.64 pounds per extra year of age (95% CI from 1.51 to 5.77 pounds).


Simple Linear Regression ctd.


Also provided in the regression output are

- A measure of how well the regression model fits the observed data.
  - R-squared measures the proportion of the variance in y (say weight) that is explained by predicting y from x (say weight from age). Here 59.26% of the variance in weight is accounted for by age.
  - An adjusted version of this coefficient (adj. R-squared) provides a better estimate of the population value. We estimate that 55.19% of the variance in weight can be explained by age.
- An F-test of zero fit (null hypothesis: R-squared=0)
  - The model showed significant fit (F(1,10)=14.55, p=0.0034).
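The fit statistics in the header of the regress output follow directly from the sums of squares in the ANOVA table. The sketch below recomputes them in Python from the values shown for the regression of weight on age; the variable names are chosen for illustration only.

```python
import math

# Sums of squares and degrees of freedom from the regress output
ss_model, df_model = 526.392857, 1
ss_resid, df_resid = 361.857143, 10
ss_total = ss_model + ss_resid      # total sum of squares
df_total = df_model + df_resid      # n - 1 = 11

r2 = ss_model / ss_total                               # proportion of variance explained
adj_r2 = 1 - (1 - r2) * df_total / df_resid            # adjusted for the number of predictors
f_stat = (ss_model / df_model) / (ss_resid / df_resid) # model mean square / residual mean square
root_mse = math.sqrt(ss_resid / df_resid)              # residual standard deviation

print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2), round(root_mse, 4))
```

These reproduce R-squared, Adj R-squared, F(1, 10) and Root MSE as reported in the output.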


Generating Predictions

Once a regression model has been fitted in Stata, so-called post-estimation commands can be used to elicit further information.
Often one might want to use the fitted regression line to predict the expected value of y for a given x, e.g. the expected weight at age=10 years.
The predict command provides predicted values and standard errors.
The command always refers to the last regression model fitted.
The format is
predict newname, what

Command predict

E.g. to get predicted expected weights for the ages observed in our sample, type
predict pred

This generates a new variable called pred in the data file that contains the predicted weights.
To get standard errors of these predictions, type
predict predse, stdp

This generates another variable called predse that holds the respective standard errors.

Command predict ctd.

An approximate 95% CI for the expected value is then given by the estimated value +/- twice its standard error.
We can use the generate command to construct lower and upper limits of 95% CIs:

gen lower=pred-2*predse
gen upper=pred+2*predse

Prediction Outside the Observed Range

The lincom command provides tests of significance and estimates of linear combinations of the regression coefficients.
The regression coefficient of a variable is referred to by the name of the variable.

E.g. lincom X1*0.5+X2*0.5 tests and estimates the arithmetic average of the regression coefficients of variables X1 and X2

We can use this to predict the value of the dependent variable (say weight) for given values (say 10 years) of the independent variable(s), e.g.

lincom _cons+age*10

 ( 1)  10 age + _cons = 0

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |         67   2.063284    32.47   0.000     62.40272    71.59728
------------------------------------------------------------------------------
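The lincom result can be checked by hand from the regress output. The sketch below recomputes the predicted weight at age 10 and compares the t-based interval reported by lincom with the approximate +/- 2 standard error rule described for predict. The critical value 2.228139 (0.975 quantile of the t distribution with 10 degrees of freedom) is a standard table value hard-coded as an assumption, and the standard error is copied from the lincom output rather than derived.

```python
# Coefficients from "regress weight age"
b_cons, b_age = 30.57143, 3.642857
age = 10

pred = b_cons + b_age * age   # linear combination _cons + age*10
se = 2.063284                 # standard error taken from the lincom output
t_crit = 2.228139             # assumed: 0.975 quantile of t with 10 df (table value)

# Exact t-based 95% CI, as reported by lincom
lower, upper = pred - t_crit * se, pred + t_crit * se
# Approximate 95% CI using the "+/- twice the standard error" rule
approx_lower, approx_upper = pred - 2 * se, pred + 2 * se

print(round(pred, 2), round(lower, 2), round(upper, 2))
print(round(approx_lower, 2), round(approx_upper, 2))
```

With only 10 residual degrees of freedom the approximate interval is somewhat narrower than the t-based one, so the +/- 2 standard error rule is best read as a rough guide in small samples.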


Regression Assumptions

Regression inferences assume that the observations are independent and that the errors have

1) a normal distribution
2) constant variance
3) zero mean (linear relationship)

These assumptions should be assessed.
There are a number of diagnostics that use the residuals (differences between observed and predicted values)

1) Histogram, QQ-plot, box plot of residuals
2) Residual plot
3) Partial residual plot (see later)

Distributional Shape of Residuals

The predict command also provides residuals

- option resid for unstandardised residuals
- option rstandard for residuals standardised to have SD=1

predict stresid, rstandard
graph box stresid

[Figure: box plot of the standardized residuals]

Residual plot

A residual plot is a scatter plot of the residuals against the fitted values (=predicted values for the sample, here predicted weights).
It is commonly used to assess the constant variance (homogeneity) assumption

- If the variance of the errors is constant, the spread of the residuals around the zero reference line is expected not to change with the size of the fitted values

The command rvfplot provides it for the latest regression

- However, the command uses the unstandardised residuals; it is perhaps more appropriate to plot the standardised residuals against the fitted values

Residual plot ctd.

rvfplot

[Figure: residuals against fitted values]

twoway (scatter stresid pred)

[Figure: standardized residuals against fitted values]

Assessing the Linear Relationship

In a simple linear regression model the relationship between the dependent variable y and the single independent variable x can be assessed by means of a scatter plot.
Could include a smooth line that follows the data, e.g.

twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) legend(order(1 "Observed" /*
*/ 2 "Lowess smooth"))

[Figure: scatter plot of weight in pounds against age in years, with observed values and the lowess smooth]

Multiple Linear Regression

We might be interested in weight changes with age that are not simply due to growing taller at the same time.
Multiple linear regression estimates the linear relationships between a dependent variable y and several continuous independent variables x1, x2, ..., xp.
The model is simply extended to include several linear effects

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$

This affects the interpretation of the regression coefficients

- E.g. $\beta_1$ is the increase in y when x1 increases by one unit while x2, ..., xp remain constant
- Because of this the regression coefficients are sometimes referred to as partial regression coefficients
- They measure the relationship between the response and an explanatory variable after adjusting for the remaining explanatory variables

Multiple Linear Regression ctd.

The regress command allows for several continuous explanatory variables.

regress weight age height


      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   15.95
       Model |  692.822607     2  346.411303           Prob > F      =  0.0011
    Residual |  195.427393     9  21.7141548           R-squared     =  0.7800
-------------+------------------------------           Adj R-squared =  0.7311
       Total |      888.25    11       80.75           Root MSE      =  4.6598

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         age |   2.050126   .9372256     2.19   0.056                 .4332373
      height |    .722038   .2608051     2.77   0.022                 .5483191
       _cons |   6.553048   10.94483     0.60   0.564                        .
------------------------------------------------------------------------------

(The Beta column of standardised coefficients appears when regress is run with the beta option; it replaces the confidence interval columns.)

The effect of age on weight was not statistically significant after adjusting for height. Within a group of children of the same height, weight was estimated to increase by 2.05 pounds per year (95% CI from -0.07 to 4.17 pounds).
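The confidence interval quoted for the age effect is not displayed in the output shown, since the Beta column takes the place of the interval columns. It can be recovered from the coefficient and its standard error, as in the following sketch. The critical value 2.262157 (0.975 quantile of the t distribution with 9 residual degrees of freedom) is a standard table value hard-coded as an assumption.

```python
# Age coefficient and standard error from "regress weight age height"
coef, se = 2.050126, 0.9372256
t_crit = 2.262157   # assumed: 0.975 quantile of t with 9 df (table value)

t_stat = coef / se              # t statistic shown in the output
lower = coef - t_crit * se      # lower 95% confidence limit
upper = coef + t_crit * se      # upper 95% confidence limit

print(round(t_stat, 2), round(lower, 2), round(upper, 2))
```

The interval includes zero, which is consistent with the non-significant t-test for age (p=0.056).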


Testing the Effect of a Set of Variables


The null hypothesis that a set of explanatory variables has no effect on the response can be tested using the post-estimation command testparm

E.g. test that height and age together have no effect on weight

testparm age height

 ( 1)  age = 0
 ( 2)  height = 0

       F(  2,     9) =   15.95
            Prob > F =    0.0011

Partial Residual Plots

Regression assumptions 1) and 2) can be checked as described before.
However, linearity assumption 3) can no longer be assessed by simply plotting the response against the respective explanatory variable.
The model assumes that the part of the explanatory variable that does not vary with other explanatory variables has a linear effect on the response.
Partial residual plot

- plots the residuals from regressing y on all the explanatory variables except the one in question against the residuals from a regression of the explanatory variable of interest on the other explanatory variables

Partial Residual Plot ctd.

avplot age

[Figure: e( weight | X ) against e( age | X ); coef = 2.0501264, se = .93722561, t = 2.19]

avplot height

[Figure: e( weight | X ) against e( height | X ); coef = .72203796, se = .26080506, t = 2.77]

The linearity assumptions seemed reasonable for both explanatory variables.

Automatic Selection Procedures

When there is a large set of potential predictor variables, investigators often would like to empirically select a subset of important variables.
Example: data on air pollution in 41 US cities (USair.dta)

- so2: sulphur dioxide content of air in micrograms per cubic metre
- temperat: average annual temperature in °F
- manuf: number of manufacturing enterprises employing 20 or more workers
- pop: population size (1970 census) in thousands
- wind: average wind speed in miles per hour
- precip: average annual precipitation in inches
- days: average number of days with precipitation per year

Which climate and human ecology variables predict air pollution?

Automatic Selection Procedures ctd.

What is an important variable?

Approach used: one that has a significant effect on the outcome at some test level

The problem is that the significance of any independent variable depends on the other variables included in the model equation.
As a result, several selection procedures have been suggested which vary in the set of variables for which the test is adjusted.
Therefore different selection procedures can lead to different variable subsets being chosen!

Automatic Selection Procedures ctd.

Forward selection

- Starting with just the constant, the model is updated as follows: include the variable which has the smallest p-value when added to the previous model. Stop if the next variable to be added is not significant according to the chosen significance level.

Backward selection

- Starting with all variables in the model, keep updating the model as follows: exclude the variable which has the largest p-value for a significance test. Stop if the next variable to be excluded is significant.

Stepwise selection

- Like forward, but exclude variables previously selected if these become non-significant
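The forward selection rule described above can be sketched as a short loop. This is a schematic illustration in Python, not Stata's sw implementation: the function forward_select, the callback p_value and the table TOY_P of made-up p-values are all hypothetical. In practice the p-value for each candidate would come from refitting the regression with that candidate added.

```python
def forward_select(candidates, p_value, alpha=0.10):
    """Forward selection skeleton.

    p_value(selected, candidate) is a caller-supplied (hypothetical) function
    returning the p-value for adding `candidate` to a model that already
    contains the variables in `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # p-value of each remaining variable when added to the current model
        pvals = {v: p_value(selected, v) for v in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # next variable to be added is not significant: stop
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up p-values, keyed by the variables already in the model.
# Variable "b" is significant on its own but not once "a" is included.
TOY_P = {
    (): {"a": 0.001, "b": 0.03, "c": 0.40},
    ("a",): {"b": 0.20, "c": 0.04},
    ("a", "c"): {"b": 0.35},
}

def toy_p_value(selected, candidate):
    return TOY_P[tuple(selected)][candidate]

chosen = forward_select(["a", "b", "c"], toy_p_value, alpha=0.10)
print(chosen)
```

In the toy table the significance of each variable depends on what is already in the model, which is why different selection procedures can end up with different subsets.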


Forward Variable Selection


Stata's sw command can be placed before the regress command to run through a series of regressions.
The option pe(alpha) specifies that the inclusion level in forward regression is alpha

sw regress so2 temperat manuf pop wind precip days, pe(0.10)

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf
p = 0.0003 <  0.1000  adding  pop
p = 0.0913 <  0.1000  adding  days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------

Backward Variable Selection

The option pr(alpha) specifies that the exclusion level in backward regression is alpha

sw regress so2 temperat manuf pop wind precip days, pr(0.10)

                      begin with full model
p = 0.7500 >= 0.1000  removing days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  5,    35) =   14.12
       Model |  14732.5258     5  2946.50517           Prob > F      =  0.0000
    Residual |  7305.37661    35  208.725046           R-squared     =  0.6685
-------------+------------------------------           Adj R-squared =  0.6212
       Total |  22037.9024    40  550.947561           Root MSE      =  14.447

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    temperat |  -1.121288   .4158633    -2.70   0.011    -1.965535   -.2770402
       manuf |   .0648871   .0155449     4.17   0.000     .0333293     .096445
         pop |  -.0393347   .0149366    -2.63   0.012    -.0696575   -.0090119
        wind |  -3.082399   1.765623    -1.75   0.090    -6.666805    .5020065
      precip |   .4194681   .2162447     1.94   0.060    -.0195319    .8584681
       _cons |   100.1524   30.27521     3.31   0.002     38.69051    161.6144
------------------------------------------------------------------------------

Stepwise Forward Variable Selection

Specifying both pr(alpha) and pe(alpha) requests a stepwise regression; the default is backward stepwise, and the forward option requests forward stepwise.
The inclusion level has to be smaller than the exclusion level

sw regress so2 temperat manuf pop wind precip days, pe(0.1) /*
*/ pr(0.11) forward

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf
p = 0.0003 <  0.1000  adding  pop
p = 0.0913 <  0.1000  adding  days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------

Drawbacks of Selection Procedures

- Analysis is exploratory, not confirmatory, i.e. not driven by existing theory or prior hypotheses. Therefore results are more likely to be spurious. (After all, there might be further explanatory variables out there which have not been measured.)
- Important variables may not be selected because inclusion of less important variables has made them redundant.
- Multiple testing: a whole list of variables is tested at each stage of the procedure, possibly giving many false positives.

Drawbacks ctd.

Any cases with missing values on any of the candidate variables entered for selection will not be used during the selection process

run the model again on the selected variables to maximize data use

Backward selection is only possible when the number of possible explanatory variables is not too large compared to the sample size.
For further information and references see http://www.stata.com/support/faqs/stat/stepwise.html

Blocking

Underlying theory might suggest that a set of variables should be kept together at all times.
Blocks of explanatory variables are indicated by brackets

sw regress so2 (temperat wind precip days) (manuf pop), /*
*/ pe(0.1) pr(0.11) forward

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf pop
p = 0.0972 <  0.1000  adding  temperat wind precip days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  6,    34) =   11.48
       Model |  14754.6358     6  2459.10596           Prob > F      =  0.0000
    Residual |  7283.26667    34  214.213726           R-squared     =  0.6695
-------------+------------------------------           Adj R-squared =  0.6112
       Total |  22037.9024    40  550.947561           Root MSE      =  14.636

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |   .0649182   .0157483     4.12   0.000     .0329139    .0969225
         pop |  -.0392767   .0151327    -2.60   0.014    -.0700302   -.0085233
    temperat |  -1.267941   .6211795    -2.04   0.049     -2.53033   -.0055524
        wind |  -3.181366   1.815019    -1.75   0.089    -6.869928    .5071974
      precip |   .5123589   .3627551     1.41   0.167    -.2248481    1.249566
        days |  -.0520502   .1620139    -0.32   0.750     -.381302    .2772016
       _cons |   111.7285    47.3181     2.36   0.024     15.56652    207.8904
------------------------------------------------------------------------------

Forcing Terms into the Model


The underlying theory might suggest that a set of variables should always be included in the model, e.g. demographics.
The option lockterm1 forces the first term given after the response into the model

sw regress so2 (manuf pop) temperat wind precip days, /*
*/ pe(0.1) pr(0.11) forward lockterm1

                      begin with term 1 model
p = 0.0913 <  0.1000  adding  days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------