Sun Li
Centre for Academic Computing
lsun@smu.edu.sg
Outline
Resources And Books
Data Description And Simple Inference
Group Comparison And Correlation
General Linear Regression
Logistic Model
Binary Logistic Model
Ordinal Logistic Model
Multinomial Logistic Model
Resources And Books
CAC Computing Resources for STATA users
Windows:
STATA/SE version 10.0
10-user network perpetual license
Installation guide
(http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
Linux CAC Beowulf Cluster:
STATA/SE version 10.0
Unlimited users
About CAC Beowulf Cluster:
(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)
Website resources:
The STATA website: http://www.stata.com
The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
STATA FAQ : http://www.stata.com/support/faqs
STATA User Support : http://www.stata.com/support
Books: http://www.stata.com/bookstore/
CAC STATA support:
Website:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
Contact:
For statistical consultation: Sun Li: lsun@smu.edu.sg
For software installation: TAN SuhWen: swtan@smu.edu.sg
Additional recommended readings:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Data Description & Simple Inference
Description of Data
Name: ibmff.dta
Variables:
Variable name    Variable information
permno           CRSP Permanent Number
date             Numeric date
ret              Holding Period Return
retx             Return without dividends
mktrf            Excess return on the market
smb              Small-minus-big return
hml              High-minus-low return
rf               Risk-free return rate
umd              Momentum factor
Convert to STATA date format
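* date is stored as a number in yyyymmdd form; split it into its parts
gen year=int(date/10000)
gen month=int((date-year*10000)/100)
gen day=date-year*10000-month*100
* mdy() returns a STATA daily date (days since 01jan1960)
gen newdate=mdy(month, day, year)
format newdate %td
* Check the first observation
list date newdate year month day in 1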
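The arithmetic above can also be done in one step with the date() function. This is a minimal sketch, assuming date holds numeric yyyymmdd values; the variable name newdate2 is introduced here for illustration only.

```stata
* Assumes date is a numeric yyyymmdd value (for example 19980131)
* string() turns the number into "19980131"; date() parses it with the "YMD" mask
gen newdate2 = date(string(date, "%8.0f"), "YMD")
format newdate2 %td

* Both conversions should agree; assert stops with an error if they do not
assert newdate2 == newdate
```

The assert line needs newdate from the step-by-step conversion, so run it only after those commands.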
Distribution of Variables
pnorm ret
swilk ret mktrf smb hml rf
pnorm: Standardized normal probability plot
swilk: Shapiro-Wilk normality test with the null hypothesis that the data are normally distributed.
[Figure: standardized normal probability plot of ret; x-axis: Empirical P[i] = i/(N+1); y-axis: Normal F[(ret-m)/s]]
None of the listed variables appears to be normally distributed. Because ret is the variable
of interest, we adjust its skewness with the zero-skewness log transformation lnskew0, then run
swilk to test it again.
lnskew0 lnret=ret
swilk lnret ret
For lnret, the p-value > 0.05 (the default confidence level is 95%, i.e. a 5% significance
level), so we do not reject the hypothesis that the data are normally distributed.
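As a complementary check, the skewness of the transformed variable can be inspected directly. This is a short sketch that assumes lnret has already been created by the lnskew0 command above.

```stata
* Skewness of the transformed variable should be close to zero
summarize lnret, detail
display "skewness of lnret = " r(skewness)

* Visual check against the normal distribution
pnorm lnret
```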
Group Comparison & Correlation
Question: Does the average holding period return in 1998 differ significantly from the
average return in all other years?
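The commands used for this comparison did not survive on the slide. One possible approach is a two-sample t test on the transformed return. This sketch assumes year and lnret exist from the earlier steps; the indicator name y1998 and the choice of a t test are assumptions.

```stata
* Indicator: 1 for observations in 1998, 0 for all other years
gen byte y1998 = (year == 1998) if !missing(year)

* Two-sample t test on the transformed return (equal variances assumed)
ttest lnret, by(y1998)

* Same comparison without assuming equal variances
ttest lnret, by(y1998) unequal
```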
[Figure: ln(ret+.7527907) plotted for groups 0 and 1]
[Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, and risk-free return rate (one month treasury bill rate)]
Exercise 1
1. Tabulate the average risk-free return rate by year.
2. Test whether the average risk-free return rate in 2005 is significantly different from
that in 2006 using the Mann-Whitney U-test.
(hint: generate a dummy variable first)
3. Identify the correlations between the factors of interest in 2006 using listwise and
pairwise Pearson correlations respectively.
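One possible starting point for the exercise is sketched below. It assumes year was created in the date conversion step and that the "factors of interest" are ret, mktrf, smb, hml, rf and umd; the dummy name y2005 is hypothetical.

```stata
* 1. Mean risk-free rate by year
tabstat rf, by(year) statistics(mean)

* 2. Mann-Whitney (Wilcoxon rank-sum) test, 2005 versus 2006
*    The dummy is missing for all other years, so they are excluded
gen byte y2005 = (year == 2005) if inlist(year, 2005, 2006)
ranksum rf, by(y2005)

* 3. Listwise (correlate) versus pairwise (pwcorr) Pearson correlations for 2006
correlate ret mktrf smb hml rf umd if year == 2006
pwcorr ret mktrf smb hml rf umd if year == 2006, obs sig
```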
General Linear Regression
General Form of Model
$Y = X\beta + \varepsilon$
$Y$ is the $n \times 1$ vector of responses.
$X$ is the $n \times (p+1)$ matrix of explanatory variables.
$\hat{\beta} = (X'X)^{-1} X'Y$: least squares estimates of the regression coefficients
Data: ibmff.dta
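The least squares estimates quoted above follow from minimizing the residual sum of squares. The derivation below is a standard sketch, assuming $X'X$ is invertible (the columns of $X$ are linearly independent); the symbol $S(\beta)$ is introduced here for illustration.

```latex
\begin{aligned}
S(\beta) &= (Y - X\beta)'(Y - X\beta) \\
\frac{\partial S}{\partial \beta} &= -2X'(Y - X\beta) = 0 \\
X'X\hat{\beta} &= X'Y \\
\hat{\beta} &= (X'X)^{-1}X'Y
\end{aligned}
```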
Step 1: Examine data
graph matrix ret mktrf smb hml rf umd
[Figure: scatterplot matrix of ret, excess return on the market, small-minus-big return, high-minus-low return, risk-free return rate (one month treasury bill rate), and momentum factor]
Step 2: Perform Linear Regression
regress ret mktrf smb hml rf umd year
[Figure: standardized normal probability plot of the standardized residuals; y-axis: Normal F[(sdres-m)/s]]
[Figure: residuals plotted against fitted values]
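The commands behind these two diagnostic plots did not survive on the slide. A minimal sketch that would produce similar output is given below. The variable name sdres is taken from the axis label; cooksd and the 4/n cut-off are assumptions (4/n is a common rule of thumb, not a threshold from the slides).

```stata
* Run after: regress ret mktrf smb hml rf umd year
* Standardized residuals and their normal probability plot
predict sdres, rstandard
pnorm sdres

* Residuals versus fitted values
rvfplot, yline(0)

* Cook's distance; list observations above the 4/n rule of thumb
predict cooksd, cooksd
list date ret cooksd if cooksd > 4/e(N) & !missing(cooksd)
```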
Exercise 2
1. Repeat the analysis described in this section after removing the possible
outliers identified by Cook's distance.
2. After finishing Q1, repeat the analysis but treat the variable year as
categorical.
hint: use the command
xi: sw regress ret mktrf smb hml rf umd i.year, pe(0.05)
Logistic Model
Binary logistic model: dichotomous response outcomes
e.g.: presence or absence of an event
Ordinal logistic model: ordinal response variable with more than two
ordered categories
e.g.: a 5-point Likert scale
$\exp(\beta_k)$ is the odds ratio for $y = 1$ when $x_k$ increases by one unit and all other
covariates remain the same.
Binary responses Y are typically coded as 1 for the event of interest, and 0 for the
opposite event.
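The model equation itself did not survive on this slide. The standard binary logit form, given here as a sketch with $p$ covariates, shows why $\exp(\beta_k)$ is an odds ratio: raising $x_k$ by one unit while holding the other covariates fixed multiplies the odds of $y = 1$ by $\exp(\beta_k)$.

```latex
\begin{aligned}
\ln\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} &= \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \\
\frac{\operatorname{odds}(x_k + 1)}{\operatorname{odds}(x_k)}
  &= \frac{\exp(\beta_0 + \dots + \beta_k (x_k + 1) + \dots + \beta_p x_p)}
          {\exp(\beta_0 + \dots + \beta_k x_k + \dots + \beta_p x_p)}
   = \exp(\beta_k)
\end{aligned}
```

For a hypothetical coefficient $\beta_k = 0.5$, the odds ratio is $\exp(0.5) \approx 1.65$, i.e. the odds rise by about 65% per unit of $x_k$.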
Binary Logistic Regression
Description of Data
Goal: identify customers with a high chance of defaulting on a bank loan. We have
700 records from the bank database (bankloan.csv).
Variable name    Variable information
age              Age in years
ed               Level of education:
                 1 = did not complete high school; 2 = high school degree;
                 3 = college degree; 4 = undergraduate; 5 = postgraduate
employ           Years with current employer
address          Years at current address
income           Household income in thousands
debtinc          Debt to income ratio (*100)
creddebt         Credit card debt in thousands
othdebt          Other debts in thousands
default          Previously defaulted (1=Yes; 0=No)
Step 1: Import and examine data
insheet using bankloan.csv
d
browse
codebook default
tabstat age employ address income debtinc creddebt othdebt, by(default)
table ed, c(mean income mean age mean debtinc mean creddebt mean othdebt) by(default)
Step 2: Construct logistic model
logistic default age ed employ income address
estimates store model1
logistic default age ed employ income address debtinc creddebt othdebt
lrtest model1 .
sw logit default age address employ income debtinc creddebt othdebt, pe(0.05)
predict prob
predict resi, rstandard
hist resi
estat gof
estat gof: goodness-of-fit test
[Figure: histogram (density) of the standardized Pearson residuals]
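With continuous covariates such as debtinc, the number of covariate patterns is close to the number of observations, and the Pearson goodness-of-fit test is then less reliable. A common complement is the Hosmer-Lemeshow version of the test, sketched below; the choice of 10 groups is a convention, not a value from the slides.

```stata
* Hosmer-Lemeshow goodness-of-fit test using 10 groups of predicted probabilities
* Run after the logistic or logit command whose fit is being checked
estat gof, group(10)
```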
estat classification
The classification table is calculated using 50% as the cut-off point for positive
predictions.
[Output callouts: summary of correct predictions; summary of incorrect predictions]
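The 50% cut-off is only the default. The sketch below shows how the table could be recomputed at another cut-off and how sensitivity and specificity vary across cut-offs; the value 0.3 is an arbitrary illustration, not a recommendation from the slides.

```stata
* Classification table at a 30% cut-off instead of the default 50%
estat classification, cutoff(0.3)

* Sensitivity and specificity plotted against the probability cut-off
lsens

* ROC curve and the area under it
lroc
```

When defaults are much rarer than non-defaults, a cut-off below 50% typically trades some specificity for higher sensitivity.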
[Figure: curve plotted against z, with z ranging from -10 to 10]
* empcat = 1 if more than 5 years with current employer, 0 otherwise (missing stays missing)
gen empcat=employ>5 if !missing(employ)
logit default address empcat debtinc creddebt
is the mean).
[Figure: plot against debtinc, x-axis from 0 to 40]
$\exp(\beta_k)$ represents the odds ratio, the same for any cut-point $s$ of $y$, when $x_k$ increases by one unit and all
other covariates remain the same.
Ordered responses with k categories can be formulated as a threshold model.
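The threshold formulation can be written out as follows. This is a sketch using the parametrization documented for STATA's ologit, since the slide's own equations did not survive; $y^*$ (latent response), $u$ (logistic error), $\kappa_s$ (cut-points) and $K$ (the number of categories, written k in the text) are notation introduced here.

```latex
\begin{aligned}
y^* &= x\beta + u, \qquad u \sim \text{logistic} \\
y &= s \quad \text{if } \kappa_{s-1} < y^* \le \kappa_s, \qquad \kappa_0 = -\infty,\ \kappa_K = +\infty \\
P(y \le s \mid x) &= \frac{1}{1 + \exp\{-(\kappa_s - x\beta)\}} \\
\frac{P(y > s \mid x)}{P(y \le s \mid x)} &= \exp(x\beta - \kappa_s)
\end{aligned}
```

Under this parametrization a one-unit increase in $x_k$ multiplies the odds of $y > s$ by $\exp(\beta_k)$ for every cut-point $s$, which is the proportional odds assumption.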
Ordinal Logistic Model
Construct model
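* recode applies the first matching rule, so the top bracket is listed first:
* a boundary value such as 20 then falls in "20-29" rather than "<20"
recode income (50/max=5 "50 and above") (40/50=4 "40-49") (30/40=3 "30-39") ///
    (20/30=2 "20-29") (min/20=1 "<20"), generate(inccat)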
codebook inccat
xi: ologit inccat age i.ed employ debtinc, or
listcoef, help
ologit: performs ordered logistic regression.
listcoef (user-written): reports odds ratios and the change in odds for a one-standard-deviation change in each variable.
xi: omodel logit inccat age i.ed employ debtinc
brant, detail
predict p1 p2 p3 p4 p5
summarize p1 p2 p3 p4 p5
sort employ
twoway connect p1 p5 employ, msym(i i)
[Figure: Pr(inccat==1) and Pr(inccat==5) plotted against employ]
Exercise 4
1. Try to construct probit models.
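A possible starting point for the exercise is to reuse the covariates from the logit and ologit models above and swap in the probit commands. The sketch assumes empcat and inccat have been created as in the earlier steps.

```stata
* Binary probit counterpart of the logit model for default
probit default address empcat debtinc creddebt

* Ordered probit counterpart of the ordinal logit model for income category
xi: oprobit inccat age i.ed employ debtinc
```

Probit coefficients are on the scale of a standard normal latent variable, so they are not directly comparable to logit coefficients and have no odds ratio interpretation.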
End