Correlation

STATISTICAL ANALYSIS USING STATA
Week 3: Correlation and Regression
Ulrike Nauman Dept. of Biostatistics
Relationships Between Continuous Variables
It is often of interest to assess the relationship between two continuous variables, e.g. between weight change and age. A simple and very informative way of doing this is to produce a scatter plot using the scatter command.
Example data: female psychiatric patients (R:\applications\courses\stata courses\2011 course\female.dta)
scatter weight age
[Figure: scatter plot of weight change against age in years]
Scatter Plots
We often want to do more than a simple scatter plot:

- Indicate sub-groups
- Insert a straight (regression) line
- Insert a smooth line

The twoway command (short for graph twoway) is useful in this respect. The command

- Overlays several graphs (e.g. scatter and line graphs) in the same plotting area
- Indicates sub-groups by using different symbols or line styles
- Fits lines separately for sub-groups
- Has many options for labelling axes, titles and legends
Command twoway
Produce a scatter graph of weight against age that includes a smooth line describing the relationship
twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Lowess fit"))
[Figure: weight change against age (years); observed values with lowess fit]
Command twoway ctd.

Produce a scatter graph of weight against age for those who are not at suicide risk, and include a regression line
twoway (scatter weight age if life==1) /*
*/ (lfit weight age if life==1), /*
*/ ytitle(Weight change) xtitle(Age (years)) /*
*/ legend(order(1 "Observed - no suicidal thoughts" /*
*/ 2 "Regression fit - no suicidal thoughts"))
[Figure: weight change against age (years) for patients without suicidal thoughts; observed values with regression fit]
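The sub-group features of twoway (different symbols, separate fitted lines) can be sketched on data that ship with Stata. This is only an illustration: the auto data set and its variables mpg, weight and foreign stand in for the course data, which are assumed not to be available to the reader, and the /// line continuation only works inside a do-file.

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear

* One scatter layer and one fitted line per sub-group
twoway (scatter mpg weight if foreign==0, msymbol(O))        ///
       (scatter mpg weight if foreign==1, msymbol(T))        ///
       (lfit mpg weight if foreign==0, lpattern(solid))      ///
       (lfit mpg weight if foreign==1, lpattern(dash)),      ///
       ytitle("Mileage (mpg)") xtitle("Weight (lbs)")        ///
       legend(order(1 "Domestic" 2 "Foreign"                 ///
                    3 "Fit - domestic" 4 "Fit - foreign"))
```

The numbers in legend(order()) refer to the layers in the order in which they appear in the command.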
Scatter Plot Matrix

A scatter plot matrix can be used to look at the bivariate relationships between several continuous variables. Use the graph matrix command, e.g.
graph matrix age iq weight
[Figure: scatter plot matrix of age, iq and weight change over last 6m (lb)]
Concept of Correlation
Instead of assessing the relationship between ordered and/or continuous variables visually, indices can be constructed that measure the degree of directional association (=correlation). Correlation coefficients range from -1 to 1, where

  -1 = complete negative relationship
   0 = no correlation
   1 = complete positive relationship

A number of different concepts have been employed to define correlation. Common ones are

- Pearson correlation
- Spearman correlation
- Others, e.g. Kendall's tau correlation coefficient (not covered here)
Pearson Correlation
The Pearson correlation coefficient r measures the degree of linear relationship between two variables. It is frequently employed due to its links with regression analysis. The Stata command corr supplies a matrix of observed Pearson correlations, e.g.
corr age iq weight
(obs=100)

             |      age       iq   weight
-------------+---------------------------
         age |   1.0000
          iq |  -0.4363   1.0000
      weight |   0.2856  -0.2597   1.0000

Note that only cases with complete observations on all variables are used. Here 100 subjects had complete records.
Pearson Correlation ctd.
The pwcorr command calculates pairwise correlation coefficients using all the available information. The command also supplies the number of observations (option obs) and a test of zero correlation based on normality (option sig) if requested, e.g.
pwcorr age iq weight, obs sig

             |      age       iq   weight
-------------+---------------------------
         age |   1.0000
             |
             |      118
             |
          iq |  -0.4345   1.0000
             |   0.0000
             |      110      110
             |
      weight |   0.3010  -0.2597   1.0000
             |   0.0016   0.0091
             |      107      100      107

Pearson Correlation ctd.
ci2 and cii2 (immediate version) are user-contributed commands that extend Stata's ci and cii commands to include correlations

- Written by Paul T. Seed
- Install from the Stata website (sg159)
- Background information provided in STB-59

The commands produce a confidence interval for a single correlation based on Fisher's r-to-z transformation. The CI construction assumes bivariate normality.
Pearson Correlation ctd.
Example: construct a CI for the correlation between IQ and weight change
ci2 iq weight, corr

Confidence interval for Pearson's product-moment correlation
of iq and weight, based on Fisher's transformation.
Correlation = -0.260 on 100 observations (95% CI: -0.434 to -0.067)

The immediate version requires only the sample size n and the observed Pearson correlation coefficient r, e.g. for r=-0.26 based on n=100 observations
cii2 100 -0.26, corr

Confidence interval for correlation, based on Fisher's transformation.
Correlation = -0.260 on 100 observations (95% CI: -0.434 to -0.067)
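The Fisher r-to-z interval can also be worked through by hand with Stata's built-in functions, which may help to see what the commands do. The sketch below assumes the usual textbook construction (z = atanh(r) with standard error 1/sqrt(n-3)); the scalar names are made up for the illustration, and it is not claimed that ci2 computes the interval in exactly this way.

```stata
* Fisher r-to-z confidence interval by hand, using r and n from the example
scalar r_obs  = -0.26
scalar n_obs  = 100

* Transform r to the z scale, where the estimate is approximately normal
scalar z_obs  = atanh(r_obs)
scalar se_z   = 1/sqrt(n_obs - 3)
scalar z_crit = invnormal(0.975)

* Back-transform the limits to the correlation scale
display "95% CI: " %6.3f tanh(z_obs - z_crit*se_z) ///
        " to "     %6.3f tanh(z_obs + z_crit*se_z)
```

With these inputs the limits should agree with the interval of -0.434 to -0.067 reported by ci2 and cii2.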
Partial Correlation
We often want to measure the strength of the relationship between two variables after controlling for a third variable (or more). The quantity of interest is the partial correlation: the Pearson correlation between two variables that would be expected if the level of the third variable were held constant. For example,

IQ is negatively related to weight change (r=-0.26), while weight change is positively related to age and age is negatively related to IQ. What would be the correlation between IQ and weight change if age were held constant?
Partial Correlation ctd.
Use the pcorr command to calculate the partial correlation between IQ and weight change after controlling for age
pcorr iq weight age
(obs=100)

Partial correlation of iq with

    Variable |    Corr.     Sig.
-------------+------------------
      weight |  -0.1567    0.121
         age |  -0.3913    0.000
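With a single control variable, the partial correlation can be obtained from the three pairwise Pearson correlations. The sketch below applies the standard first-order partial correlation formula to the values from the corr output for the 100 complete cases; the scalar names are hypothetical, and the t-test with n-3 degrees of freedom is the usual textbook test rather than something stated on the slides.

```stata
* Pairwise Pearson correlations from the corr output (obs=100)
scalar r_iw = -0.2597     // iq and weight
scalar r_ia = -0.4363     // iq and age
scalar r_wa =  0.2856     // weight and age

* Partial correlation of iq and weight, holding age constant
scalar pr_iw = (r_iw - r_ia*r_wa) / sqrt((1 - r_ia^2)*(1 - r_wa^2))

* t-test of zero partial correlation on n - 3 = 97 degrees of freedom
scalar t_iw = pr_iw*sqrt(97/(1 - pr_iw^2))

display "partial r = " %7.4f pr_iw
display "p-value   = " %6.3f 2*ttail(97, abs(t_iw))
```

Up to rounding this should reproduce the pcorr line for weight (-0.1567, p = 0.121): roughly half of the raw correlation of -0.26 is accounted for by age.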
Spearman Correlation
The Spearman correlation coefficient is defined as the Pearson correlation based on the ranks of the two variables. Since it only needs ranks, it can be used for ordinal outcomes.
- Ties are usually given average ranks
It measures the degree of any monotonic relationship between two variables. A significance test for the Spearman correlation coefficient can be derived without making distributional assumptions.
- In that sense the coefficient is nonparametric
- The test assumes that there are no ties
Spearman Correlation ctd.
The spearman command provides the observed coefficient and a significance test
spearman iq weight

 Number of obs =     100
Spearman's rho =  -0.2294

Test of Ho: iq and weight are independent
    Prob > |t| =       0.0217
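The definition of the Spearman coefficient as a Pearson correlation of ranks can be checked directly. The sketch below uses Stata's shipped auto data and its variables price and mpg as stand-ins (an assumption, since the course data are not available here); egen's rank() function assigns average ranks to ties by default.

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear

* Rank both variables; ties receive average ranks
egen rank_price = rank(price)
egen rank_mpg   = rank(mpg)

* Pearson correlation of the ranks ...
corr rank_price rank_mpg

* ... which should equal the coefficient reported by spearman
spearman price mpg
```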
Relationships Between Continuous Variables
The (very small) data set growth.dta contains children's

- ages in years
- weights in pounds
- heights in inches

graph matrix age weight height
[Figure: scatter plot matrix of age, weight and height]
Simple Linear Regression
We might be interested in how weight changes when children get older.

- A Pearson correlation could be used to measure the strength of the linear relationship between weight and age
- Simple linear regression estimates the nature of a linear relationship between a dependent variable y (response, outcome) and an independent variable x (explanatory variable, predictor)
- To do this it employs the model of a linear relationship between y and x
- The observed data y are assumed to arise as the sum of a linear regression line (=linear predictor) and an error term
Simple Linear Regression Model
In simple linear regression, the dependent variable y is modelled as
$y = \beta_0 + \beta_1 x + \varepsilon$
where

- $\beta_0$ is the intercept or constant: the value of y when x is 0
- $\beta_1$ is the gradient or slope: the increase in y when x increases by one unit
- $\varepsilon$ is the error: $\beta_0 + \beta_1 x$ is the equation for the predicted values of y, which lie on a straight line, and $\varepsilon$ is the difference between the observed and predicted values of y

The parameters $\beta_0$ and $\beta_1$ are referred to as regression coefficients
Displaying a Fitted Regression Line
Display the regression line from fitting a regression of weight on age
twoway (scatter weight age) (lfit weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Regression fit"))
[Figure: weight in pounds against age (years); observed values with regression fit]
Simple Linear Regression ctd.
The regress command provides estimates of the regression coefficients

The structure is
regress depvar indvar
i.e. the first variable given after the command is taken to be the dependent variable.
Regression theory also provides confidence intervals for the regression coefficients and tests of zero coefficients. Typically only the significance of the slope coefficient is of interest, since this amounts to testing for the existence of a relationship between the response and the explanatory variable.

Simple Linear Regression ctd.
regress weight age

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  1,    10) =   14.55
       Model |  526.392857     1  526.392857           Prob > F      =  0.0034
    Residual |  361.857143    10  36.1857143           R-squared     =  0.5926
-------------+------------------------------           Adj R-squared =  0.5519
       Total |      888.25    11       80.75           Root MSE      =  6.0155

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   3.642857   .9551151     3.81   0.003     1.514728    5.770986
       _cons |   30.57143   8.613705     3.55   0.005      11.3789    49.76396
------------------------------------------------------------------------------

Annotations on the slide:
- Residual df (10): degrees of freedom used in the t-tests below
- Coef.: estimated regression coefficients
- t and P>|t|: significance tests (t-tests)
- [95% Conf. Interval]: limits of confidence intervals

Weight increased significantly with age (t(10)=3.81, p=0.003). The increase was estimated to be 3.64 pounds per extra year of age (95% CI from 1.51 to 5.77 pounds).
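The t-test and confidence interval for the slope follow from the coefficient, its standard error and the residual degrees of freedom. A minimal sketch, assuming the standard t-based construction and using hypothetical scalar names, with the numbers taken from the regress output for age:

```stata
* Slope, standard error and residual df from the regress output
scalar b_age  = 3.642857
scalar se_age = .9551151
scalar df_res = 10

* Test statistic, two-sided p-value and 95% CI based on the t distribution
scalar t_age  = b_age/se_age
scalar t_crit = invttail(df_res, 0.025)

display "t = " %5.2f t_age "   p = " %6.4f 2*ttail(df_res, abs(t_age))
display "95% CI: " %7.4f b_age - t_crit*se_age ///
        " to "     %7.4f b_age + t_crit*se_age
```

Up to rounding, the results should match the t, P>|t| and confidence interval columns in the output.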

Simple Linear Regression ctd.
Also provided in the regression output are

- A measure of how well the regression model fits the observed data. R-squared measures the proportion of the variance in y (say weight) that is explained by predicting y from x (say weight from age). Here 59.26% of the variance in weight is accounted for by age.
- An adjusted version of this coefficient, adj. R-squared, which provides a better estimate of the population value. We estimate that 55.19% of the variance in weight can be explained by age.
- An F-test of zero fit (null hypothesis: R-squared=0). The model showed significant fit (F(1,10)=14.55, p=0.0034).
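The F statistic and the adjusted R-squared are both functions of R-squared, the sample size and the number of predictors. The sketch below uses the standard formulas with the values from the output for weight on age; the scalar names are made up for the illustration.

```stata
* Values from the regress output: R-squared, sample size, number of predictors
scalar rsq    = 0.5926
scalar n_obs  = 12
scalar k_pred = 1

* F-test of zero fit and adjusted R-squared
scalar f_stat = (rsq/k_pred) / ((1 - rsq)/(n_obs - k_pred - 1))
scalar rsq_a  = 1 - (1 - rsq)*(n_obs - 1)/(n_obs - k_pred - 1)

display "F = " %6.2f f_stat ///
        "   p = " %6.4f Ftail(k_pred, n_obs - k_pred - 1, f_stat)
display "Adj R-squared = " %6.4f rsq_a
```

With one predictor the F statistic is the square of the slope's t statistic (3.81 squared is about 14.5), which is why the two p-values agree.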
Generating Predictions
Once a regression model has been fitted in Stata, so-called post-estimation commands can be used to elicit further information.

- Often one might want to use the fitted regression line to predict the expected value of y for a given x, e.g. the expected weight at age=10 years
- The predict command provides predicted values and standard errors
- The command always refers to the last regression model fitted

The format is
predict newname, what
where newname is the name of the variable to be created and what is an option that specifies the quantity to be predicted
Command predict
E.g. to get predicted expected weights for the ages observed in our sample, type
predict pred
This generates a new variable called pred in the data file that contains the predicted weights. To get standard errors of these predictions, type
predict predse, stdp
This generates another variable called predse that holds the respective standard errors
Command predict ctd.
An approximate 95% CI for the expected value is then given by the estimated value +/- twice its standard error. We can use the generate command to construct the lower and upper limits of the 95% CIs
gen lower=pred-2*predse
gen upper=pred+2*predse
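In small samples the multiplier 2 is only a rough stand-in for the t quantile (with 10 residual degrees of freedom the quantile is about 2.23). A sketch of the t-based version, run on Stata's shipped auto data with weight regressed on length as a stand-in for the growth data (an assumption made so that the example can be run without the course files):

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear
regress weight length

* Predicted means and their standard errors
predict pred
predict predse, stdp

* 95% CI limits using the t quantile on the residual degrees of freedom
gen lower = pred - invttail(e(df_r), 0.025)*predse
gen upper = pred + invttail(e(df_r), 0.025)*predse

list length pred lower upper in 1/5
```

e(df_r) is the residual degrees of freedom stored by regress, so the same lines can be used after any linear regression.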
Prediction Outside the Observed Range
The lincom command provides tests of significance and estimates of linear combinations of the regression coefficients. The regression coefficient of a variable is referred to by the name of the variable
- E.g. lincom X1*0.5+X2*0.5 tests and estimates the arithmetic average of the regression coefficients of variables X1 and X2
We can use this to predict the value of the dependent variable (say weight) for given values (say 10 years) of the independent variable(s), e.g.
lincom _cons+age*10

 ( 1)  10 age + _cons = 0

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |         67   2.063284    32.47   0.000     62.40272    71.59728
------------------------------------------------------------------------------
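The point estimate from lincom is simply the fitted line evaluated at the chosen value; for the growth data, 30.57143 + 10 x 3.642857 = 67 pounds. The sketch below shows the same correspondence on Stata's shipped auto data; the model and the value 180 inches are arbitrary choices for the illustration.

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear
regress weight length

* Expected weight of a car 180 inches long, with SE, t-test and 95% CI
lincom _cons + 180*length

* The same point estimate computed directly from the stored coefficients
display _b[_cons] + 180*_b[length]

* Growth data example from the slide, using the printed coefficients
display 30.57143 + 10*3.642857
```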
Regression Assumptions
Regression inferences assume that the observations are independent and that the errors have
1) a normal distribution
2) constant variance
3) zero mean (linear relationship)
These assumptions should be assessed. There are a number of diagnostics that use the residuals (differences between observed and predicted values)
1) Histogram, QQ-plot, box plot of residuals
2) Residual plot
3) Partial residual plot (see later)
Distributional Shape of Residuals
The predict command also provides residuals

- option resid for unstandardised residuals
- option rstandard for residuals standardised to have SD=1

predict stresid, rstandard
graph box stresid
[Figure: box plot of standardized residuals]
Residual plot
A residual plot is a scatter plot of the residuals against the fitted values (=predicted values for the sample, here predicted weights). It is commonly used to assess the homogeneity (constant variance) assumption
- If the variance of the errors is constant, the spread of the residuals around the zero reference line is expected not to change with the size of the fitted values
The command rvfplot provides it for the latest regression
- However, the command uses the unstandardised residuals; it is perhaps more appropriate to plot the standardised residuals against the fitted values
Residual plot ctd.
rvfplot
[Figure: residuals against fitted values]
twoway (scatter stresid pred)
[Figure: standardized residuals against fitted values]
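The residual diagnostics can be collected in one short do-file. The sketch below runs on Stata's shipped auto data with weight regressed on length (an assumption, so that it runs without the course files); the variable and graph names are made up, and the zero reference lines are an optional addition.

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear
regress weight length

* Fitted values and standardised residuals from the last regression
predict fitted
predict stres, rstandard

* Distributional shape: box plot and normal quantile plot
graph box stres, name(box, replace)
qnorm stres, name(qq, replace)

* Constant variance: residuals against fitted values with a zero line
rvfplot, yline(0) name(rvf, replace)
twoway (scatter stres fitted), yline(0) name(strvf, replace) ///
    ytitle(Standardized residuals) xtitle(Fitted values)
```

The name() option keeps each graph in its own window instead of overwriting the previous one.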
Assessing the Linear Relationship
In a simple linear regression model the relationship between the dependent variable y and the single independent variable x can be assessed by means of a scatter plot. This could include a smooth line that follows the data, e.g.
twoway (scatter weight age) (lowess weight age), /*
*/ ytitle(Weight in pounds) xtitle(Age (years)) /*
*/ legend(order(1 "Observed" 2 "Lowess smooth"))
[Figure: weight in pounds against age (years); observed values with lowess smooth]
Multiple Linear Regression
We might be interested in weight changes with age that are not simply due to the children growing taller at the same time. Multiple linear regression estimates the linear relationships between a dependent variable y and several continuous independent variables $x_1, x_2, \ldots, x_p$. The model is simply extended to include several linear effects
$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \varepsilon$
This affects the interpretation of the regression coefficients

- E.g. $\beta_1$ is the increase in y when $x_1$ increases by one unit while $x_2, \ldots, x_p$ remain constant
- Because of this the regression coefficients are sometimes referred to as partial regression coefficients
- They measure the relationship between the response and an explanatory variable after adjusting for the remaining explanatory variables
Multiple Linear Regression ctd.
The regress command allows for several continuous explanatory variables.
regress weight age height, beta

      Source |       SS       df       MS              Number of obs =      12
-------------+------------------------------           F(  2,     9) =   15.95
       Model |  692.822607     2  346.411303           Prob > F      =  0.0011
    Residual |  195.427393     9  21.7141548           R-squared     =  0.7800
-------------+------------------------------           Adj R-squared =  0.7311
       Total |      888.25    11       80.75           Root MSE      =  4.6598

------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
         age |   2.050126   .9372256     2.19   0.056                 .4332373
      height |    .722038   .2608051     2.77   0.022                 .5483191
       _cons |   6.553048   10.94483     0.60   0.564                        .
------------------------------------------------------------------------------

The Beta column (standardised coefficients) is shown because of the beta option; without it, regress displays the 95% confidence intervals quoted below.
The effect of age on weight was not statistically significant after adjusting for height. Within a group of children of the same height, weight was estimated to increase by 2.05 pounds per year (95% CI from -0.07 to 4.17 pounds).
Testing the Effect of a Set of Variables

The null hypothesis that a set of explanatory variables has no effect on the response can be tested using the post-estimation command testparm
- E.g. test that height and age together have no effect on weight
testparm age height

 ( 1)  age = 0
 ( 2)  height = 0

       F(  2,     9) =   15.95
            Prob > F =    0.0011
Partial Residual Plots
Regression assumptions 1) and 2) can be checked as described before. However, linearity assumption 3) can no longer be assessed by simply plotting the response against the respective explanatory variable. The model assumes that the part of the explanatory variable that does not vary with the other explanatory variables has a linear effect on the response.
The partial residual plot (Stata calls it an added-variable plot, hence the command avplot) plots the residuals from regressing y on all the explanatory variables except the one in question against the residuals from a regression of the explanatory variable of interest on the other explanatory variables
Partial Residual Plot ctd.
avplot age
avplot height
[Figure: e( weight | X ) against e( age | X ); coef = 2.0501264, se = .93722561, t = 2.19]
[Figure: e( weight | X ) against e( height | X ); coef = .72203796, se = .26080506, t = 2.77]
The linearity assumptions seemed reasonable for both explanatory variables.
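The construction behind these plots can be reproduced step by step, which also shows why the slope in the plot equals the coefficient from the full model. The sketch below uses Stata's shipped auto data with price as the response and weight and length as explanatory variables (stand-ins chosen for the illustration, not the course data).

```stata
* Illustration on Stata's shipped auto data (not the course data)
sysuse auto, clear

* Full model: keep the partial regression coefficient of weight
regress price weight length
scalar b_full = _b[weight]

* Residuals of the response given the other explanatory variable
regress price length
predict e_price, residuals

* Residuals of the variable of interest given the other explanatory variable
regress weight length
predict e_weight, residuals

* Regressing one set of residuals on the other recovers the same slope
regress e_price e_weight
display "full model: " b_full "   residual regression: " _b[e_weight]

* Hand-made version of avplot weight
twoway (scatter e_price e_weight) (lfit e_price e_weight)
```

The standard error from the residual regression differs slightly from the full model's, because its residual degrees of freedom do not account for the variables already partialled out.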
Automatic Selection Procedures
When there is a large set of potential predictor variables, investigators would often like to select a subset of important variables empirically.
Example: data on air pollution in 41 US cities (USair.dta)

- so2: sulphur dioxide content of air in micrograms per cubic metre
- temperat: average annual temperature in °F
- manuf: number of manufacturing enterprises employing 20 or more workers
- pop: population size (1970 census) in thousands
- wind: average wind speed in miles per hour
- precip: average annual precipitation in inches
- days: average number of days with precipitation per year

Which climate and human ecology variables predict air pollution?
Automatic Selection Procedures ctd.
What is an important variable?
- Approach used: one that has a significant effect on the outcome at some test level
The problem is that the significance of any independent variable depends on the other variables included in the model equation. As a result, several selection procedures have been suggested, which vary in the set of variables for which the test is adjusted. Different selection procedures can therefore lead to different variable subsets being chosen!
Automatic Selection Procedures ctd.
Forward selection
Starting with just the constant, the model is updated as follows: include the variable which has the smallest p-value when added to the previous model. Stop if the next variable to be added is not significant according to the chosen significance level.
Backward selection
Starting with all variables in the model, keep updating the model as follows: exclude the variable which has the largest p-value for a significance test. Stop if the next variable to be excluded is significant.
Stepwise selection
Like forward, but exclude variables previously selected if these become non-significant.
Forward Variable Selection

Stata's sw command can be placed before the regress command to run through a series of regressions. The option pe(alpha) specifies that the inclusion level in forward regression is alpha
sw regress so2 temperat manuf pop wind precip days, pe(0.10)

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf
p = 0.0003 <  0.1000  adding  pop
p = 0.0913 <  0.1000  adding  days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  3,    37) =   19.90
       Model |  13606.2352     3  4535.41173           Prob > F      =  0.0000
    Residual |  8431.66725    37  227.882899           R-squared     =  0.6174
-------------+------------------------------           Adj R-squared =  0.5864
       Total |  22037.9024    40  550.947561           Root MSE      =  15.096

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |    .074334   .0150661     4.93   0.000     .0438071    .1048609
         pop |  -.0493944   .0145442    -3.40   0.002    -.0788637    -.019925
        days |   .1643594   .0948015     1.73   0.091    -.0277267    .3564455
       _cons |   6.965849   11.77691     0.59   0.558    -16.89643    30.82813
------------------------------------------------------------------------------
Backward Variable Selection
The option pr(alpha) specifies that the exclusion level in backward regression is alpha
sw regress so2 temperat manuf pop wind precip days, pr(0.10)

                      begin with full model
p = 0.7500 >= 0.1000  removing days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  5,    35) =   14.12
       Model |  14732.5258     5  2946.50517           Prob > F      =  0.0000
    Residual |  7305.37661    35  208.725046           R-squared     =  0.6685
-------------+------------------------------           Adj R-squared =  0.6212
       Total |  22037.9024    40  550.947561           Root MSE      =  14.447

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    temperat |  -1.121288   .4158633    -2.70   0.011    -1.965535   -.2770402
       manuf |   .0648871   .0155449     4.17   0.000     .0333293     .096445
         pop |  -.0393347   .0149366    -2.63   0.012    -.0696575   -.0090119
        wind |  -3.082399   1.765623    -1.75   0.090    -6.666805    .5020065
      precip |   .4194681   .2162447     1.94   0.060    -.0195319    .8584681
       _cons |   100.1524   30.27521     3.31   0.002     38.69051    161.6144
------------------------------------------------------------------------------
Stepwise Forward Variable Selection
Specifying both pr(alpha) and pe(alpha) requests a stepwise regression. The default is backward stepwise; adding the option forward requests forward stepwise. The inclusion level has to be smaller than the exclusion level
sw regress so2 temperat manuf pop wind precip days, pe(0.1) /*
*/ pr(0.11) forward

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf
p = 0.0003 <  0.1000  adding  pop
p = 0.0913 <  0.1000  adding  days
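Current Stata documentation describes the selection command in prefix form (stepwise, options: command); the sw regress form used on these slides is the older syntax. The three procedures can be sketched in the prefix form as follows, using Stata's shipped auto data and an arbitrary set of candidate predictors as stand-ins for the air pollution data.

```stata
* Illustration on Stata's shipped auto data (not the air pollution data)
sysuse auto, clear

* Forward selection: add terms while p < 0.10
stepwise, pe(0.10): regress price mpg weight length headroom trunk

* Backward selection: remove terms while p >= 0.10
stepwise, pr(0.10): regress price mpg weight length headroom trunk

* Forward stepwise: add at p < 0.10, drop again at p >= 0.11
stepwise, pe(0.10) pr(0.11) forward: ///
    regress price mpg weight length headroom trunk
```

Comparing the three final models on the same data is a quick way to see that the procedures need not agree.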
Drawbacks of Selection Procedures
- Analysis is exploratory, not confirmatory, i.e. not driven by existing theory or prior hypotheses. Therefore results are more likely to be spurious. (After all, there might be further explanatory variables out there which have not been measured.)
- Important variables may not be selected because inclusion of less important variables has made them redundant.
- Multiple testing: a whole list of variables is tested at each stage of the procedure, possibly giving many false positives.
Drawbacks ctd.
Any cases with missing values on any of the candidate variables entered for selection will not be used during the selection process
- Run the model again on the selected variables to maximize data use
Backward selection is only possible when the number of possible explanatory variables is not too large compared to the sample size.
For further information and references see http://www.stata.com/support/faqs/stat/stepwise.html
Blocking

Underlying theory might suggest that a set of variables should be kept together at all times. Blocks of explanatory variables are indicated by brackets
sw regress so2 (temperat wind precip days) (manuf pop), /*
*/ pe(0.1) pr(0.11) forward

                      begin with empty model
p = 0.0000 <  0.1000  adding  manuf pop
p = 0.0972 <  0.1000  adding  temperat wind precip days

      Source |       SS       df       MS              Number of obs =      41
-------------+------------------------------           F(  6,    34) =   11.48
       Model |  14754.6358     6  2459.10596           Prob > F      =  0.0000
    Residual |  7283.26667    34  214.213726           R-squared     =  0.6695
-------------+------------------------------           Adj R-squared =  0.6112
       Total |  22037.9024    40  550.947561           Root MSE      =  14.636

------------------------------------------------------------------------------
         so2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       manuf |   .0649182   .0157483     4.12   0.000     .0329139    .0969225
         pop |  -.0392767   .0151327    -2.60   0.014    -.0700302   -.0085233
    temperat |  -1.267941   .6211795    -2.04   0.049     -2.53033   -.0055524
        wind |  -3.181366   1.815019    -1.75   0.089    -6.869928    .5071974
      precip |   .5123589   .3627551     1.41   0.167    -.2248481    1.249566
        days |  -.0520502   .1620139    -0.32   0.750     -.381302    .2772016
       _cons |   111.7285    47.3181     2.36   0.024     15.56652    207.8904
------------------------------------------------------------------------------
Forcing Terms into the Model

The underlying theory might suggest that a set of variables should always be included in the model, e.g. demographics. The option lockterm1 forces the first term given after the response into the model
sw regress so2 (manuf pop) temperat wind precip days, /*
*/ pe(0.1) pr(0.11) forward lockterm1

                      begin with term 1 model
p = 0.0913 <  0.1000  adding  days
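Blocking and forcing can be sketched in the prefix form of the command as well. As before, Stata's shipped auto data and the chosen variables are stand-ins for the air pollution data, and the grouping of weight and length into one term is an arbitrary choice for the illustration.

```stata
* Illustration on Stata's shipped auto data (not the air pollution data)
sysuse auto, clear

* Blocks: variables in brackets enter or leave the model together
stepwise, pe(0.10) pr(0.11) forward: ///
    regress price (weight length) (headroom trunk) mpg

* Forcing: lockterm1 keeps the first term in every model considered
stepwise, pe(0.10) pr(0.11) forward lockterm1: ///
    regress price (weight length) headroom trunk mpg
```

A bracketed term is tested as a whole, so individual variables inside a retained block can still have large p-values, as with days in the blocking output above.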
