
STATA Training Session 2

Statistical Analysis in STATA

Sun Li
Centre for Academic Computing
lsun@smu.edu.sg
Outline
Resources And Books
Data Description And Simple Inference
Group Comparison And Correlation
General Linear Regression
Logistic Model
Binary Logistic Model
Ordinal Logistic Model
Multinomial Logistic Model
Resources And Books
CAC Computing Resources for STATA users
Windows:
STATA/SE version 10.0
10-user network perpetual license
Installation guide
(http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
Linux CAC Beowulf Cluster:
STATA/SE version 10.0
Unlimited users
About CAC Beowulf Cluster:
(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)
Website resources:
The STATA website: http://www.stata.com
The STATA Journal (reviewed papers, regular columns, user-written
software): http://www.stata-journal.com/
STATA FAQ : http://www.stata.com/support/faqs
STATA User Support : http://www.stata.com/support
Books: http://www.stata.com/bookstore/
CAC STATA support:
Website:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
Contact:
For statistical consultation: Sun Li: lsun@smu.edu.sg
For software installation: TAN SuhWen: swtan@smu.edu.sg
Additional recommended readings:

Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition, J. Scott Long and Jeremy Freese
Logistic Regression with Stata, Xiao Chen, Phil Ender, Michael Mitchell & Christine Wells, UCLA
Statistics with Stata (Updated for Version 9), Lawrence C. Hamilton
Data Analysis Using Stata, Ulrich Kohler and Frauke Kreuter

Download training slides, data and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Data Description & Simple Inference
Description of Data
Name: ibmff.dta
Variables:
Variable name   Variable information
permno          CRSP Permanent Number
date            Numeric date
ret             Holding Period Return
retx            Return without dividends
mktrf           Excess return on market
smb             Small-minus-big return
hml             High-minus-low return
rf              Risk-free return rate
umd             Momentum factor
Convert to STATA date format
gen year=int(date/10000)
gen month=int((date-year*10000)/100)
gen day=date-year*10000-month*100
gen newdate=mdy(month, day, year)
format newdate %td
list date newdate year month day in 1
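The arithmetic above can be traced on a single value. The sketch below assumes that date is stored as a number of the form yyyymmdd (which is what the integer division implies) and uses one made-up observation; the variable altdate and the string-based conversion are illustrative additions, not part of the original slides.

```stata
* Sketch only: one made-up observation, date stored as yyyymmdd
clear
input long date
19980131
end
gen year  = int(date/10000)                    // 1998
gen month = int((date - year*10000)/100)       // 1
gen day   = date - year*10000 - month*100      // 31
gen newdate = mdy(month, day, year)
format newdate %td
* Alternative (assumption: a Stata version whose date() accepts the "YMD" mask)
gen altdate = date(string(date, "%8.0f"), "YMD")
format altdate %td
list date newdate altdate year month day
```

Both routes should give the same Stata date; the string route avoids the intermediate year, month and day variables when they are not needed.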
Distribution of Variables
pnorm ret
swilk ret mktrf smb hml rf

pnorm: standardized normal probability plot.
swilk: Shapiro-Wilk normality test; the null hypothesis is that the data are normally distributed.

[Figure: standardized normal probability plot of ret, Normal F[(ret-m)/s] against Empirical P[i] = i/(N+1)]
None of the variables listed appears to be normally distributed. As ret is the variable
of interest, we adjust its skewness with the zero-skewness log transform lnskew0, then run
swilk to test it again.

lnskew0 lnret=ret
swilk lnret ret

For lnret, the p-value is greater than 0.05 (the default significance level is 5%, that is,
95% confidence), so we do not reject the hypothesis that the data are normally distributed.
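What lnskew0 does can be checked on a small example. The sketch below uses Stata's shipped auto dataset and its variable mpg as stand-ins (assumptions; they are not part of ibmff.dta). lnskew0 creates ln(mpg - k), with k chosen so that the skewness of the new variable is approximately zero.

```stata
* Sketch only: auto.dta and mpg are stand-ins for ibmff.dta and ret
sysuse auto, clear
summarize mpg, detail            // skewness of the raw variable
lnskew0 lnmpg = mpg              // creates ln(mpg - k), k chosen by lnskew0
summarize lnmpg, detail          // skewness should now be close to zero
display "skewness after transform: " r(skewness)
swilk mpg lnmpg                  // normality test before and after
```

A skewness near zero does not by itself guarantee normality, which is why swilk is run again after the transform.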
Group Comparison & Correlation
Question: Does the average holding period return for 1998 differ significantly
from the average return over all the other years?

Generate dummy variables (dumyear1 flags the first year in the data, here 1998)

tab year, gen(dumyear)


tabstat lnret, stat(n mean sd p25 p50 p75) by(dumyear1)
graph box lnret, by(dumyear1) box(1, bfcolor(blue))

[Figure: box plots of ln(ret+.7527907) for dumyear1 = 0 and dumyear1 = 1; graphs by year==1998]
sdtest lnret, by(dumyear1)
ttest lnret, by(dumyear1)

sdtest: tests the equality of variances.
ttest: performs one-sample and independent-samples t-tests.
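One way to combine the two commands is to let the variance test decide whether the t-test assumes equal variances. This is a sketch of that workflow, not a rule stated in the slides; the auto dataset, mpg and foreign are stand-ins for ibmff.dta, lnret and dumyear1, and the 0.05 threshold is an assumption.

```stata
* Sketch only: auto.dta, mpg and foreign are stand-ins for the slides' data
sysuse auto, clear
sdtest mpg, by(foreign)                  // H0: equal variances in both groups
local pvar = r(p)
if `pvar' < 0.05 {
    ttest mpg, by(foreign) unequal       // variances differ: unequal-variance t-test
}
else {
    ttest mpg, by(foreign)               // otherwise the pooled-variance t-test
}
display "two-sided p-value of the t-test: " r(p)
```

Some analysts prefer to use the unequal option throughout rather than pre-test the variances; either choice should be reported with the result.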
Question: Do the average holding period returns differ significantly across years? If so,
which years differ from each other?

oneway lnret year
oneway lnret year, tabulate bonferroni

tabulate: lists the average returns for all the years.
bonferroni: performs multiple comparisons between groups with adjusted p-values.
To detect correlations between returns and the other factors:

graph matrix ret mktrf smb hml rf, half
spearman ret mktrf smb hml rf, stats(rho p) print(.05) bonferroni

[Figure: scatterplot matrix (lower half) of ret, excess return on the market, small-minus-big return, high-minus-low return, and risk-free return rate (one month treasury bill rate)]
Exercise 1
1. Tabulate the average risk-free return rate by year.

2. Use help to look up the command ranksum (Mann-Whitney U test).

3. Test whether the average risk-free return rate in 2005 differs significantly from
that in 2006 using the Mann-Whitney U test.
(hint: generate a dummy variable first)

4. Use help to look up the command correlate (Pearson's correlation).

5. Identify correlations between the factors of interest in 2006 using listwise and
pairwise Pearson's correlation respectively.
General Linear Regression
General Form of Model

$Y = X\beta + \varepsilon$
$Y$ is the $n \times 1$ vector of responses.
$X$ is the $n \times (p+1)$ matrix of explanatory variables.
$\hat{\beta} = (X'X)^{-1} X'Y$ gives the least squares estimates of the regression coefficients.
Data: ibmff.dta
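The least squares formula can be checked against regress with a few lines of Mata. This is a sketch under assumptions: it uses the shipped auto dataset and the variables mpg, weight and length as stand-ins for ibmff.dta, and the matrix name b_hand is hypothetical.

```stata
* Sketch only: auto.dta and its variables are stand-ins for ibmff.dta
sysuse auto, clear
regress mpg weight length
mata:
y = st_data(., "mpg")
X = st_data(., ("weight", "length")), J(st_nobs(), 1, 1)   // append a column of ones
b = invsym(X'X) * X'y                                       // (X'X)^-1 X'Y
st_matrix("b_hand", b')                                     // hand the result back to Stata
end
matrix list e(b)
matrix list b_hand
```

The two listed vectors should agree up to rounding; regress itself uses a numerically more careful route than forming the inverse explicitly.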
Step 1: Examine data
graph matrix ret mktrf smb hml rf umd
[Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, risk-free return rate (one month treasury bill rate), and momentum factor]
Step 2: Perform Linear Regression
regress ret mktrf smb hml rf umd year

regress: to perform linear regression


sw regress ret mktrf smb hml rf umd year, pe(0.05)

sw: to perform stepwise regression

pe(0.05): specifies the significance level of the F-test for adding a term to the model; terms
with a p-value less than 0.05 are included.
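The same request can also be written with the stepwise prefix. A minimal sketch, using the shipped auto dataset and its variables as stand-ins (assumptions, not ibmff.dta):

```stata
* Sketch only: auto.dta and its variables are stand-ins for ibmff.dta
sysuse auto, clear
* forward selection: a term enters when its p-value is below 0.05
stepwise, pe(0.05): regress mpg weight length displacement gear_ratio
```

Stepwise selection is data driven, so the reported p-values of the final model tend to be optimistic; the selected model is best treated as a starting point.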
Step 3: Post-estimation Statistics
vif                           // variance inflation factors
rvfplot                       // plot residuals against fitted values
predict fit                   // store fitted values
predict sdres, rstandard      // store standardized residuals
pnorm sdres                   // normal probability plot of standardized residuals
twoway scatter sdres fit      // plot standardized residuals against fitted values
predict cook, cooksd          // store Cook's distance statistics
// list the observations whose Cook's distance is above the suggested cut-off point 4/n (n = 108 here)
list year ret cook if cook>4/108 & !missing(cook)
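The cut-off 4/n need not be typed by hand, because regress leaves the sample size in e(N). A minimal sketch on the shipped auto dataset (a stand-in for ibmff.dta; the local macro name cutoff is hypothetical):

```stata
* Sketch only: auto.dta and its variables are stand-ins for ibmff.dta
sysuse auto, clear
regress mpg weight length
predict cook, cooksd                       // Cook's distance for each observation
local cutoff = 4/e(N)                      // suggested cut-off 4/n
list make mpg cook if cook > `cutoff' & !missing(cook)
count if cook > `cutoff' & !missing(cook)
```

The !missing() condition matters because Stata treats missing values as larger than any number, so they would otherwise pass the "greater than" test.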

[Figure: normal probability plot of sdres, Normal F[(sdres-m)/s] against Empirical P[i] = i/(N+1)]
[Figure: residuals against fitted values]
Exercise 2
1. Repeat the analysis described in this section after removing the possible
outliers identified by Cook's distance.

2. After finishing Q1, repeat the analysis but treat the variable year as
categorical.
hint: use command
xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)
Logistic Model
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event

Ordinal logistic model: ordinal response variable with more than two
ordered categories
e.g.: a 5-point Likert scale

Multinomial logistic model: nominal response variable with more
than two categories
e.g.: different types of programs in school
Binary Logistic Regression
General Form of Model
$\pi_i = E(y_i \mid x_i)$
$\mathrm{logit}(\pi_i) = \log\left(\pi_i / (1 - \pi_i)\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi}$

$\pi_i = \exp(\beta' x_i) / (1 + \exp(\beta' x_i))$

$\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other
covariates remain the same.

Binary responses Y are typically coded as 1 for the event of interest, and 0 for the
opposite event.
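The link between coefficients, odds ratios and predicted probabilities can be verified by hand. The sketch below uses the shipped auto dataset with foreign as the binary response (stand-ins, not the bank loan data); the variable names p_stata, xb and p_hand are hypothetical.

```stata
* Sketch only: auto.dta, foreign, mpg and weight are stand-ins for the bank loan data
sysuse auto, clear
logit foreign mpg weight
display "odds ratio for mpg: " exp(_b[mpg])      // exp(beta_k)
predict double p_stata                           // Pr(foreign = 1) from predict
gen double xb = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
gen double p_hand = exp(xb)/(1 + exp(xb))        // inverse logit of the linear predictor
summarize p_stata p_hand
```

The two probability variables should coincide, and the displayed odds ratio should match what logistic reports for the same model.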
Description of Data
Goal: identify persons with a high chance of defaulting on a bank loan. We have
700 records from the bank database (bankloan.csv).
Variable name   Variable information
age             Age in years
ed              Level of education
                1 = didn't complete high school  2 = high school degree
                3 = college degree  4 = undergraduate  5 = postgraduate
employ          Years with current employer
address         Years at current address
income          Household income in thousands
debtinc         Debt to income ratio (*100)
creddebt        Credit card debt in thousands
othdebt         Other debts in thousands
default         Previously defaulted (1=Yes; 0=No)
Step 1: Import and examine data
insheet using bankloan.csv
d
browse
codebook default
tabstat age employ address income debtinc creddebt othdebt, by(default)
table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)
Step 2: Construct logistic model
logistic default age ed employ income address
estimates store model1
logistic default age ed employ income address debtinc creddebt othdebt
lrtest model1 .
sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)

logistic: produces odds ratios.
logit: produces parameter coefficients.
estimates store: saves the current estimation results, including the log likelihood, under a name.
lrtest: produces the p-value of the likelihood-ratio test.
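The statistic that lrtest reports is twice the difference between the log likelihoods of the two nested models, referred to a chi-squared distribution. A minimal sketch on the shipped auto dataset (a stand-in for the bank loan data; the names reduced, ll0, ll1 and lr are hypothetical):

```stata
* Sketch only: auto.dta, foreign, mpg and weight are stand-ins for the bank loan data
sysuse auto, clear
logit foreign mpg
estimates store reduced
scalar ll0 = e(ll)                         // log likelihood of the smaller model
logit foreign mpg weight
scalar ll1 = e(ll)                         // log likelihood of the larger model
scalar lr = 2*(ll1 - ll0)                  // likelihood-ratio statistic, 1 added term
display "LR chi2(1) = " lr "   p = " chi2tail(1, lr)
lrtest reduced .                           // should report the same statistic
```

The degrees of freedom equal the number of terms added, which is one here; in the slides' comparison three terms are added to model1.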
Step 3: Post-estimation statistics

predict prob               // predicted probability of default
predict resi, rstandard    // standardized Pearson residuals
hist resi
estat gof

estat gof: goodness-of-fit test
[Figure: histogram (density) of the standardized Pearson residuals]
estat classification
This is calculated using 50% as the cut-off point for positive
predictions. The output summarizes the correct predictions, the
incorrect predictions, and the overall success rate.
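The classification summary can be rebuilt by hand from the predicted probabilities. The sketch below assumes that a probability of at least 0.5 counts as a positive prediction and uses the shipped auto dataset as a stand-in for the bank loan data; the names phat and predicted are hypothetical.

```stata
* Sketch only: auto.dta, foreign, mpg and weight are stand-ins for the bank loan data
sysuse auto, clear
logit foreign mpg weight
predict double phat
gen byte predicted = phat >= 0.5 if !missing(phat)   // positive prediction at the 50% cut-off
tabulate predicted foreign                            // correct predictions lie on the diagonal
count if predicted == foreign
display "overall success rate: " 100*r(N)/e(N) "%"
estat classification                                  // compare with Stata's own table
```

A different cut-off changes the balance between the two kinds of error, which is what lsens in Exercise 3 displays.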


gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]*creddebt+_b[address]*address
line prob z, sort
[Figure: Pr(default) plotted against z]
gen empcat=employ>5
logit default address empcat debtinc creddebt

postgr3 debtinc, by(empcat) //you need to install postgr3 package

postgr3: graphs the predicted values, holding all other variables
constant at specified values (default is the mean).

Marginal impact is higher for people with short service than
for those with long service in their current company.

[Figure: predicted values (yhat_) against debtinc for empcat == 0 and empcat == 1]
Exercise 3
1. Explore the use of the commands lroc and lsens to diagnose the model, and interpret
the results.
lroc: graphs the ROC curve and calculates the area under the curve.
lsens: graphs sensitivity and specificity versus probability cutoff.

2. Predict the probability of default on a bank loan for a person with a
debt/income ratio of 22.7, 2 years with the current employer, 16 years living at the
current address, and 1.21 thousand in credit card debt.
Ordinal Logistic Model
General Form of Model
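$\mathrm{Logit}(p_1) = \log\left(\frac{p_1}{1 - p_1}\right) = \beta_{10} + \beta' x$
$\mathrm{Logit}(p_1 + p_2) = \log\left(\frac{p_1 + p_2}{1 - (p_1 + p_2)}\right) = \beta_{20} + \beta' x$
...
$\mathrm{Logit}(p_1 + p_2 + \dots + p_k) = \log\left(\frac{p_1 + p_2 + \dots + p_k}{1 - (p_1 + p_2 + \dots + p_k)}\right) = \beta_{k0} + \beta' x$
and $p_1 + p_2 + \dots + p_k + p_{k+1} = 1$

$\exp(\beta_k)$ represents the odds ratio that $y \le s$, for any $s$, when $x_k$ increases by one unit and all
other covariates remain the same.
Ordered responses with k categories can be formulated as a threshold model.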
Construct model
recode income (min/20=1 "<20") (20/30=2 "20-29") (30/40=3 "30-39") (40/50=4
"40-49") (50/max=5 "above 50"), generate(inccat)
codebook inccat
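The ranges in the recode call share their end points (20, 30, 40, 50), and recode applies overlapping rules in the order written, so it is worth checking in which category a boundary value such as 20 lands. If the intent is the labelling shown ("<20", "20-29", and so on), a version with explicit inequalities makes the boundaries unambiguous. A minimal sketch on made-up values; the names inccat2 and inclbl are hypothetical.

```stata
* Sketch only: five made-up income values (in thousands), not the bank loan data
clear
input income
19
20
29
30
50
end
gen byte inccat2 = 1 if income < 20
replace inccat2 = 2 if income >= 20 & income < 30
replace inccat2 = 3 if income >= 30 & income < 40
replace inccat2 = 4 if income >= 40 & income < 50
replace inccat2 = 5 if income >= 50 & !missing(income)
label define inclbl 1 "<20" 2 "20-29" 3 "30-39" 4 "40-49" 5 "above 50"
label values inccat2 inclbl
list
```

Comparing inccat2 with the recode result through tabulate on the real data shows whether any observations sit exactly on a boundary.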
xi: ologit inccat age i.ed employ debtinc, or
listcoef, help

ologit: to perform ordered logistic regression.
listcoef: to obtain ORs and the change in odds for a one standard deviation change in the variable
(listcoef is part of the user-written SPost package).
xi: omodel logit inccat age i.ed employ debtinc
brant, detail

Test parallel regression assumption (proportional odds assumption):
omodel: to perform likelihood ratio test.
brant: to do Brant test.
prtab employ //predicted probabilities for each of the values of the variable specified
prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables
prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)
Multinomial Logistic Model
xi: mlogit inccat age i.ed employ debtinc
listcoef
fitstat
prtab _Ied_2

predict p1 p2 p3 p4 p5
summarize p1 p2 p3 p4 p5
sort employ
twoway connect p1 p5 employ, msym(i i)
[Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ]
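After mlogit, predict returns one probability per outcome category, and for each observation these probabilities sum to one. A minimal sketch on the shipped auto dataset, where a three-level price group stands in for inccat (the names pricecat, q1 to q3 and total are hypothetical):

```stata
* Sketch only: auto.dta and a three-level price group stand in for the bank loan data
sysuse auto, clear
xtile pricecat = price, nq(3)            // hypothetical 3-category outcome
mlogit pricecat mpg weight
predict double q1 q2 q3, pr              // one probability per outcome category
gen double total = q1 + q2 + q3
summarize q1 q2 q3 total                 // total should equal 1 for every observation
```

Plotting only two of the probabilities, as the slides do with p1 and p5, therefore leaves the remaining categories implicit.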
Logistic Model
Exercise 4
1. Try to construct probit models.
End
