STATA Training Session 2
Statistical Analysis in STATA
Sun Li
Centre for Academic Computing
lsun@smu.edu.sg
Outline
Resources And Books
Data Description And Simple Inference
Group Comparison And Correlation
General Linear Regression
Logistic Model
Binary Logistic Model
Ordinal Logistic Model
Multinomial Logistic Model
Resources And Books
CAC Computing Resources for STATA users
Windows:
STATA/SE version 10.0
10-user network perpetual license
Installation guide
(http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
Linux CAC Beowulf Cluster:
STATA/SE version 10.0
Unlimited users
About CAC Beowulf Cluster:
(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)
Website resources:
The STATA website: http://www.stata.com
The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
STATA FAQ : http://www.stata.com/support/faqs
STATA User Support : http://www.stata.com/support
Books: http://www.stata.com/bookstore/
CAC STATA support:
Website:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
Contact:
For statistical consultation: Sun Li: lsun@smu.edu.sg
For software installation: TAN SuhWen: swtan@smu.edu.sg
Additional recommended readings:
Regression Models for Categorical Dependent Variables Using Stata, 2nd Edition, J. Scott Long and Jeremy Freese
Logistic Regression with Stata, Xiao Chen, Phil Ender, Michael Mitchell & Christine Wells, UCLA
Statistics with Stata (Updated for Version 9), Lawrence C. Hamilton
Data Analysis Using Stata, Ulrich Kohler and Frauke Kreuter

Download training slides, data and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Data Description & Simple Inference
Description of Data
Name: ibmff.dta
Variables:
Variable name   Variable information
permno          CRSP Permanent Number
date            Numeric date
ret             Holding Period Return
retx            Return without dividends
mktrf           Excess return on market
smb             Small-minus-big return
hml             High-minus-low return
rf              Risk-free return rate
umd             Momentum factor
Convert to STATA date format
gen year=int(date/10000)
gen month=int((date-year*10000)/100)
gen day=date-year*10000-month*100
gen newdate=mdy(month, day, year)
format newdate %td
list date newdate year month day in 1
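As a quick check of the arithmetic above, the sketch below applies the same steps to a single made-up value, 19980630, and compares the result with Stata's string-based date() function. It assumes that date is stored as a long or double; a float cannot hold every eight-digit number exactly, so the day component could be lost. The value and the variable name newdate2 are illustrative only and are not part of the course data.

```stata
clear
set obs 1
gen long date = 19980630                    // hypothetical value in YYYYMMDD form
gen year  = int(date/10000)                 // 1998
gen month = int((date - year*10000)/100)    // 6
gen day   = date - year*10000 - month*100   // 30
gen newdate  = mdy(month, day, year)
// alternative route: turn the number into a string and let date() parse it
gen newdate2 = date(string(date, "%8.0f"), "YMD")
format newdate newdate2 %td
assert newdate == newdate2                  // both routes give the same Stata date
list date newdate newdate2 year month day
```

Either route produces a Stata daily date; the arithmetic version makes the year, month and day available as separate variables as well.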
Distribution of Variables
pnorm ret
swilk ret mktrf smb hml rf
pnorm: standardized normal probability plot.
swilk: Shapiro-Wilk normality test, with the null hypothesis that the data are normally distributed.
[Figure: standardized normal probability plot of ret. x-axis: Empirical P[i] = i/(N+1); y-axis: Normal F[(ret-m)/s]]
None of the variables listed appears to be normally distributed. Because ret is the variable of interest, we adjust its skewness with the zero-skewness log transform lnskew0, then run swilk again.
lnskew0 lnret=ret
swilk lnret ret
For lnret, the p-value is greater than 0.05 (the 5% significance level), so we do not reject the hypothesis that the data are normally distributed.
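To see what lnskew0 does, here is a minimal sketch on the auto dataset shipped with Stata (not the course data). lnskew0 creates ln(±x − k), choosing k and the sign so that the skewness of the new variable is zero; the variable name lnprice is illustrative.

```stata
sysuse auto, clear
summarize price, detail
display "skewness of price   = " r(skewness)
lnskew0 lnprice = price             // ln(+/-price - k), with k chosen to give zero skewness
summarize lnprice, detail
display "skewness of lnprice = " r(skewness)
swilk price lnprice                 // compare the normality tests before and after
```

Zero skewness removes asymmetry only, so the swilk result on the transformed variable still needs to be checked, as is done for lnret above.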
Group Comparison & Correlation
Question: Is the average holding period return for 1998 significantly different from the average return for all the other years?
Generate dummy variables
tab year, gen(dumyear)

tabstat lnret, stat(n mean sd p25 p50 p75) by(dumyear1)
graph box lnret, by(dumyear1) box(1, bfcolor(blue))
[Figure: box plots of lnret, ln(ret+.7527907), by dumyear1 (0 = other years, 1 = 1998)]

sdtest lnret, by(dumyear1)
ttest lnret, by(dumyear1)
sdtest: tests the equality of variances.
ttest: performs one-sample and independent-samples t-tests.
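By default, the two-sample ttest pools the variances of the two groups, which is the assumption that sdtest checks. If sdtest rejects equal variances, one option is the unequal-variances form of the test. A minimal sketch on the auto dataset shipped with Stata, with foreign standing in for dumyear1 and mpg for lnret:

```stata
sysuse auto, clear
sdtest mpg, by(foreign)             // H0: the two groups have equal variances
ttest  mpg, by(foreign)             // pooled-variance two-sample t-test
ttest  mpg, by(foreign) unequal     // Satterthwaite's approximation, no equal-variance assumption
```

The two ttest calls report the same difference in means; only the standard error and the degrees of freedom differ.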
Question: Do the average holding period returns differ significantly across years? If yes, find out which years differ from each other.
oneway lnret year

oneway lnret year, tabulate bonferroni
tabulate: lists the average return for each year.
bonferroni: performs multiple comparisons between groups with Bonferroni-adjusted p-values.
To detect correlations between returns and the other factors:
graph matrix ret mktrf smb hml rf, half

spearman ret mktrf smb hml rf, stats(rho p) print(.05) bonferroni
[Figure: scatterplot matrix (lower half) of ret, mktrf (excess return on the market), smb (small-minus-big return), hml (high-minus-low return), and rf (risk-free return rate, one-month Treasury bill rate)]
Exercise 1
1. Tabulate the average risk-free return rate by year.
2. Use help to look up the command ranksum (Mann-Whitney U test).
3. Test whether the average risk-free return rate in 2005 is significantly different from that in 2006 using the Mann-Whitney U test.
(hint: generate a dummy variable first)
4. Use help to look up the command correlate (Pearson's correlation).
5. Identify correlations between the factors of interest in 2006 using listwise and pairwise Pearson's correlation respectively.
General Linear Regression
General Form of Model
$Y = X\beta + \varepsilon$
$Y$ is the $n \times 1$ vector of responses.
$X$ is the $n \times (p+1)$ matrix of explanatory variables.
$\hat{\beta} = (X'X)^{-1} X'Y$ are the least squares estimates of the regression coefficients.
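The least squares formula can be checked numerically. The following is a minimal sketch, not part of the original session: it uses the auto dataset shipped with Stata, with price, mpg and weight standing in for ret and the factors, and computes the matrix expression directly in Mata to compare with the regress output.

```stata
sysuse auto, clear
regress price mpg weight            // reference fit
mata:
y = st_data(., "price")
X = st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1)   // append a column of ones for the constant
b = invsym(cross(X, X)) * cross(X, y)                    // (X'X)^(-1) X'y
b'                                                       // order: mpg, weight, constant
end
```

The three numbers printed by Mata should agree with the coefficient column of the regress table. regress itself uses a numerically more careful algorithm, so tiny differences in the last digits are possible.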
Data: ibmff.dta
Step 1: Examine data
graph matrix ret mktrf smb hml rf umd
[Figure: scatterplot matrix of ret, mktrf (excess return on the market), smb (small-minus-big return), hml (high-minus-low return), rf (risk-free return rate, one-month Treasury bill rate), and umd (momentum factor)]
Step 2: Perform Linear Regression
regress ret mktrf smb hml rf umd year
regress: to perform linear regression.
sw regress ret mktrf smb hml rf umd year, pe(0.05)
sw: to perform stepwise regression.
pe(0.05): to specify the significance level of the F-test for addition to the model; terms with a p-value less than 0.05 are eligible for inclusion.
Step 3: Post-estimation Statistics
vif                          //variance inflation factors
rvfplot                      //plot residuals against fitted values
predict fit                  //store fitted values
predict sdres, rstandard     //store standardized residuals
pnorm sdres                  //normal probability plot of the standardized residuals
twoway scatter sdres fit     //plot standardized residuals against fitted values
predict cook, cooksd         //store Cook's distance statistics
//list the observations whose Cook's distance is above the suggested cut-off point (4/n, here n = 108)
list year ret cook if cook>4/108
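The cut-off 4/n does not have to be typed in by hand. A minimal sketch, assuming a model fitted with regress: e(N) holds the number of observations used in the fit. The example runs on the auto dataset shipped with Stata, and the macro name cutoff is illustrative. Stata treats missing values as larger than any number, so the condition also excludes observations with a missing Cook's distance.

```stata
sysuse auto, clear
regress price mpg weight
predict cook, cooksd                        // Cook's distance for each observation
local cutoff = 4/e(N)                       // 4/n, with n taken from the fitted model
list make price cook if cook > `cutoff' & !missing(cook)
```

Run the lines together from a do-file, because a local macro is only visible within the same do-file run.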
[Figure: standardized normal probability plot of sdres. x-axis: Empirical P[i] = i/(N+1); y-axis: Normal F[(sdres-m)/s]]
[Figure: residuals plotted against fitted values]
Exercise 2
1. Repeat the analysis described in this section after removing the possible outliers identified by Cook's distance.
2. After finishing Q1, repeat the analysis but treat the variable year as categorical.
hint: use the command
xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)
Logistic Model
Binary logistic model: dichotomous response outcomes
e.g. presence or absence of an event
Ordinal logistic model: ordinal response variable with more than two ordered categories
e.g. a 5-point Likert scale
Multinomial logistic model: nominal response variable with more than two categories
e.g. different types of programs in school
Binary Logistic Regression
$\pi_i = E(y_i \mid x_i)$
$\operatorname{logit}(\pi_i) = \log(\pi_i/(1-\pi_i)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi}$
$\pi_i = \exp(\beta' x_i)/(1+\exp(\beta' x_i))$
$\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other covariates remain the same.
Binary responses $Y$ are typically coded as 1 for the event of interest and 0 for the opposite event.
Description of Data
Goal: identify persons with a high chance of defaulting on a bank loan. We have 700 records from a bank database (bankloan.csv).
Variable name   Variable information
age             Age in years
ed              Level of education
                1 = did not complete high school, 2 = high school degree,
                3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
address         Years at current address
income          Household income in thousands
debtinc         Debt to income ratio (*100)
creddebt        Credit card debt in thousands
othdebt         Other debts in thousands
default         Previously defaulted (1 = Yes; 0 = No)
Step 1: Import and examine data
insheet using bankloan.csv
d
browse
codebook default
tabstat age employ address income debtinc creddebt othdebt, by(default)
table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)
Step 2: Construct logistic model
logistic default age ed employ income address
estimates store model1
logistic default age ed employ income address debtinc creddebt othdebt
lrtest model1 .
sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)
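logistic: produces odds ratios.
logit: produces parameter coefficients.
estimates store: saves the current estimation results, including the log likelihood, under the given name.
lrtest: produces the p-value of the likelihood-ratio test; lrtest model1 . compares the stored model with the most recent one.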
Step 3: Post-estimation statistics
predict prob
predict resi, rstandard
hist resi
estat gof
estat gof: goodness-of-fit test.
[Figure: histogram of resi. x-axis: standardized Pearson residual; y-axis: Density]
estat classification
estat classification: classification table, calculated with 50% as the cut-off point for positive predictions. It summarizes the correct predictions, the incorrect predictions, and the overall success rate.

gen z=_b[debtinc]*debtinc+_b[employ]*employ+_b[creddebt]* creddebt+_b[address]*address
line prob z, sort
[Figure: Pr(default) plotted against z]
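The index z above is built by hand from the coefficients and leaves out the constant, which only shifts the horizontal axis of the plot. As an alternative sketch, predict with the xb option returns the full linear predictor including the constant, and the predicted probability is its inverse logit. The example uses the auto dataset shipped with Stata, with foreign standing in for default; the variable names p, xb and p2 are illustrative.

```stata
sysuse auto, clear
logit foreign mpg weight
predict p                           // predicted probability, Pr(foreign = 1)
predict xb, xb                      // linear predictor, including the constant
gen p2 = invlogit(xb)               // exp(xb)/(1 + exp(xb))
assert reldif(p, p2) < 1e-5         // the two agree up to float rounding
line p xb, sort                     // the familiar S-shaped logistic curve
```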
gen empcat=employ>5
logit default address empcat debtinc creddebt
postgr3 debtinc, by(empcat) //you need to install postgr3 package
postgr3: graphs the predicted values, holding all other variables constant at specified values (the default is the mean).
The marginal impact of debtinc is higher for people with short service than for those with long service in their current company.
[Figure: predicted probability of default against debtinc, by empcat (yhat_, empcat == 0; yhat_, empcat == 1)]

Exercise 3
1. Explore the use of the commands lroc and lsens to diagnose the model, and interpret the results.
lroc: graphs the ROC curve and calculates the area under the curve.
lsens: graphs sensitivity and specificity versus probability cutoff.
2. Predict the probability of default on a bank loan for a person with a debt/income ratio of 22.7, 2 years with the current employer, 16 years living at the current address, and 1.21 thousand in credit card debt.
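Ordinal Logistic Model
$\operatorname{Logit}(p_1) = \log\frac{p_1}{1-p_1} = \beta_{10} + \beta' x$
$\operatorname{Logit}(p_1+p_2) = \log\frac{p_1+p_2}{1-(p_1+p_2)} = \beta_{20} + \beta' x$
$\vdots$
$\operatorname{Logit}(p_1+p_2+\dots+p_k) = \log\frac{p_1+p_2+\dots+p_k}{1-(p_1+p_2+\dots+p_k)} = \beta_{k0} + \beta' x$
and $p_1 + p_2 + \dots + p_k + p_{k+1} = 1$
$\exp(\beta_k)$ represents the odds ratio for $y \le s$, for any $s$, when $x_k$ increases by one unit and all other covariates remain the same.
Ordered responses with k categories can be formulated as a threshold model.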
Construct model
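//bands are listed from the top down so that the boundary values 20, 30, 40 and 50 fall in the higher band (the first matching rule applies)
recode income (50/max=5 "50 and above") (40/50=4 "40-49") (30/40=3 "30-39") (20/30=2 "20-29") (min/20=1 "<20"), generate(inccat)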
codebook inccat
xi: ologit inccat age i.ed employ debtinc, or
listcoef, help
ologit: to perform ordered logistic regression.
listcoef: to obtain odds ratios and the change in odds for a standard deviation change in the variable.
Test the parallel regression assumption (proportional odds assumption):
xi: omodel logit inccat age i.ed employ debtinc
brant, detail
omodel: to perform a likelihood-ratio test.
brant: to do the Brant test.
prtab employ //predicted probabilities for each of the values of the variable specified
prvalue, x(_Ied_2=1) //predicted probabilities for selected values of variables
prvalue, x(_Ied_2=1 age=28 employ=3 debtinc=10)
Multinomial Logistic Model
xi: mlogit inccat age i.ed employ debtinc
listcoef
fitstat
prtab _Ied_2
predict p1 p2 p3 p4 p5
summarize p1 p2 p3 p4 p5
sort employ
twoway connect p1 p5 employ, msym(i i)
[Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ]
Logistic Model
Exercise 4
1. Try to construct probit models.
End
