Introduction to Stata
Christopher F Baum
Faculty Micro Resource Center, Boston College
August 2011
Strengths of Stata
What is Stata?
Portability
[Screenshot: the Stata 11 main window at startup, showing the Review, Variables, Results and Command windows.]
The Toolbar contains icons that allow you to Open and Save files, Print
results, control Logs, and manipulate windows. Some very important
tools allow you to open the Do-File Editor, the Data Editor and the
Data Browser.
The Data Editor and Data Browser present you with a spreadsheet-like
view of the data, no matter how large your dataset may be. The
Do-File Editor, as we will discuss, allows you to construct a file of Stata
commands, or do-file, and execute it in whole or in part from the
editor.
There are four windows in the default interface: the Review, Results,
Command and Variables windows. You may alter the appearance of
any window in the GUI using the Preferences > General dialog, and
make those changes on a temporary or permanent basis.
As you might expect, you may type commands in the Command
window. You may only enter one command at a time in that window, so you
should not try pasting a list of several commands. When a command is
executed, with or without error, it appears in the Review window, and
the results of the command (or an error message) appear in the
Results window. You may click on any command in the Review
window and it will reappear in the Command window, where it may be
edited and resubmitted.
Once you have loaded data into the program, the Variables window
will be populated with information on each variable. That information
includes the variable name, its label (if any), its type and its format.
This is a subset of the information available from the describe
command.
Let's look at the interface after I have loaded one of the datasets
provided with Stata, uslifeexp, with the sysuse command and
given the describe and summarize commands:
[Screenshot: the Stata interface after running sysuse uslifeexp, describe and summarize. The Review window lists the three commands; the Variables window lists year, le, le_male, le_female, le_w, le_wmale, le_wfemale, le_b, le_bmale and le_bfemale; the Results window shows the describe and summarize output.]
Stata's user interface
Notice that the three commands are listed in the Review window. If any
had failed, the _rc column would contain a nonzero number, in red,
indicating the error code. The Variables window contains the list of
variables and their labels. The Results window shows the effects of
summarize: for each variable, the number of observations, their
mean, standard deviation, minimum and maximum. If there were any
string variables in the dataset, they would be listed as having zero
observations.
Try it out: type the commands
sysuse uslifeexp
describe
summarize
Take note of an important design feature of Stata. If you do not say
what to describe or summarize, Stata assumes you want to perform
those commands for every variable in memory, as shown here. As we
shall see, this design principle holds throughout the program.
We may also write a do-file in the do-file editor and execute it. The
Do-File Editor icon on the Toolbar brings up a window in which we
may type those same three commands, as well as a few more:
sysuse uslifeexp
describe
summarize
notes
summarize le if year < 1950
summarize le if year >= 1950
After typing those commands into the window, the rightmost icon, with
tooltip Do, may be used to execute them.
[Screenshots: the Stata interface after running the do-file S1.1.do. The Review window lists sysuse uslifeexp, describe, summarize and the do command; the Results window shows the summarize output for all variables, followed by the summaries of le for 1900-1949 and for 1950-1999.]
Try it out: use the Do-File Editor to open the do-file S1.1.do, and run
the file.
Try selecting only those last four lines and run those commands.
The rightmost menu on the menu bar is labeled Help. From that menu,
you can search for help on any command or feature. The Help
Browser, which opens in a Viewer window, provides hyperlinks, in
blue, to additional help pages. At the foot of each help screen, there
are hyperlinks to the full manuals, which are accessible in PDF
format. The links will take you directly to the appropriate page of the
manual.
You may also search for help at the command line with help
followed by the command name. But what if you don't know the exact command name?
Then you may use search or its expanded version, findit, each of
which may be followed by one or several words.
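The three help commands described above can be tried as in the following minimal sketch. The keyword used here is only an illustrative assumption, and findit needs an internet connection to search user-written material.

```stata
* Help on a command whose name you already know
help summarize

* Keyword search when the command name is unknown
* (the keyword is only an example)
search heteroskedasticity

* Broader search that also covers user-written material on the web
findit heteroskedasticity
```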
Data Manipulation
Statistics
Graphics
Stata graphics are excellent tools for exploratory data analysis, and
can produce publication-quality 2-D graphics in several
dozen different forms. Every aspect of graphics may be programmed
and customized, and new graph types and graph schemes are being
continuously developed. The programmability of graphics implies that
a number of similar graphs may be generated without any pointing
and clicking to alter aspects of the graphs. Stata 12 provides support
for contour plots and heatmaps.
Extensibility
The importance of this program design goes far beyond the limits of
official Stata. Since the adopath includes both Stata directories and
other directories on your hard disk (or on a server's filesystem), you
may acquire new Stata commands from a number of web sites. The
Stata Journal (SJ), a quarterly refereed journal, is the primary method
for distributing user contributions. Between 1991 and 2001, the Stata
Technical Bulletin (STB) played this role, and a complete set of issues of the
STB is available online at the Stata website.
The SJ is a subscription publication (articles more than three years old
are freely downloadable), but the ado- and sthlp-files may be freely
downloaded from Stata's web site. The Stata help command
accesses help on all installed commands; the Stata command findit
will locate commands that have been documented in the STB and the
SJ, and with one click you may install them in your version of Stata.
Help for these commands will then be available in your own copy.
But this is only the beginning. Stata users worldwide participate in the
StataList listserv, and when a user has written and documented a new
general-purpose command to extend Stata functionality, they
announce it on the StataList listserv (to which you may freely
subscribe: see Stata's web site).
Since September 1997, all items posted to StataList (over 1,300)
have been placed in the Boston College Statistical Software
Components (SSC) Archive in RePEc (Research Papers in
Economics), available from IDEAS (http://ideas.repec.org) and
EconPapers (http://econpapers.repec.org).
The command ssc new lists, in the Stata Viewer, all SSC packages
that have been added or modified in the last month. You may click
on their names for full details. The command ssc hot reports on
the most popular packages on the SSC Archive.
The Stata command adoupdate checks to see whether all packages
you have downloaded and installed from the SSC archive, the Stata
Journal, or other user-maintained net from ... sites are up to date.
adoupdate alone will provide a list of packages that have been
updated. You may then use adoupdate, update to refresh your
copies of those packages, or specify which packages are to be
updated.
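As a minimal sketch of the workflow just described, the four commands can be issued in sequence. All of them need an internet connection, and their output depends on what is currently in the SSC Archive and on which packages are installed on your machine.

```stata
* Packages added or modified on SSC recently
ssc new

* Most popular packages on SSC
ssc hot

* List installed user-written packages that have newer versions
adoupdate

* Actually refresh the out-of-date packages
adoupdate, update
```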
If you would like your own copy of Stata, it is quite inexpensive. The
vendor's GradPlan program makes the full version of Stata version 12
software available to BC faculty and students for $98.00 (one-year
license) or $179.00 (perpetual license). This includes the full set of
manuals in PDF format, hyperlinked to Stata's help system.
The Small Stata version is available to students for $49.00 for a
one-year license. It contains all of Stata's commands, but can only
handle a limited number of observations and variables (thus not
recommended for Ph.D. students or Senior Honors Thesis students).
GradPlan orders are made direct to Stata, with delivery from
on-campus inventory.
Advantage: Reproducibility
Reproducibility
In a computer program where all actions are point and click, such as a
spreadsheet, who can say how you arrived at a certain set of results?
Unless every step of your transformations of the data can be retraced,
how can you find exactly how the sample you are employing differs
from the raw data? A command-driven program is capable of this level
of reproducibility, and we should all instill this level of rigor in our research
practices.
Reproducibility also makes it very easy to perform an alternate
analysis of a particular model. What would happen if we added this
interaction, or introduced this additional variable, or decided to handle
zero values as missing? Even if many steps have been taken since the
basic model was specified, it is easy to go back and produce a
variation on the analysis if all the work is represented by a series of
programs.
Advantage: Transportability
Transportability
Stata binary files may be easily transformed into SPSS or SAS files
with the third-party application Stat/Transfer. Stat/Transfer is
available for Windows and Mac OS X systems as well as on various
Unix systems on campus. Personal copies of Stat/Transfer version
11 (which handles Stata versions 6, 7, 8, 9, 10, 11 and 12 datafiles)
are available at a discounted academic rate of $69.00 through the
Stata GradPlan.
Stat/Transfer can also transfer SAS, SPSS and many other file
formats into Stata format, without loss of variable labels, value labels,
and the like. It can also be used to create a manageable subset of a
very large Stata file (such as those produced from survey data) by
selecting only the variables you need. It is a very useful tool.
Programmability of tasks
Going one step further, if you use the do-file editor to create a
sequence of commands, you may save that do-file and reuse it
tomorrow, or use it as the starting point for a similar set of data
management or statistical operations. Working in this way promotes
reproducibility, which makes it very easy to perform an alternate
analysis of a particular model. Even if many steps have been taken
since the basic model was specified, it is easy to go back and produce
a variation on the analysis if all the work is represented by a series of
programs.
One of the implications of the concern for reproducible work: avoid
altering data in a non-auditable environment such as a spreadsheet.
Rather, you should transfer external data into the Stata environment as
early as possible in the process of analysis, and only make permanent
changes to the data with do-files that can give you an audit trail of
every change made to the data.
The local macro is an invaluable tool for do-file authors. A local macro
is created with the local statement, which serves to name the macro
and provide its content. When you next refer to the macro, you extract
its value by dereferencing it, using the backtick (`) and apostrophe (')
on its left and right:
local george 2
local paul = `george' + 2
In this case, I use an equals sign in the second local statement as I
want to evaluate the right-hand side, as an arithmetic expression, and
store it in the macro paul. If I did not use the equals sign in this
context, the macro paul would contain the string 2 + 2.
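The difference between the two forms of the local statement can be checked directly. This is a small sketch meant to be run from a do-file; the macro name ringo is a hypothetical addition for the comparison.

```stata
local george 2

* With the equals sign, the right-hand side is evaluated as an expression
local paul = `george' + 2
display "`paul'"       // displays 4

* Without the equals sign, the text is stored as-is after substitution
local ringo `george' + 2
display "`ringo'"      // displays 2 + 2
```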
[...] cty = _n
[...] `alleps' cty, yline(0) scheme(s2mono) legend(rows(1)) ///
    ti("Residuals from model of life expectancy vs per GDP") ///
    t2("Fit separately for each region")
[Figure: residuals from the model of life expectancy, fit separately for each region, plotted against cty; the legend entries include N.A. and S.A.]
Global macros
Command template
The varlist
varlist is a list of one or more variables on which the command is to
operate: the subject(s) of the verb. Stata works on the concept of a
single set of variables currently defined and contained in memory,
each of which has a name. As describe will show you, each variable has
a data type (various sorts of integers and reals, and string variables of
a specified maximum length). The varlist specifies which of the defined
variables are to be used in the command.
The order of variables in the dataset matters, since you can use
hyphenated lists to include all variables between first and last. (The
order and move commands can alter the order of variables.) You can
also use wildcards to refer to all variables with a certain prefix. If you
have variables pop60, pop70, pop80, pop90, you can refer to them in a
varlist as pop* or pop?0.
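A short sketch of hyphenated and wildcard varlists follows. It uses the auto dataset shipped with Stata rather than the pop60 to pop90 variables mentioned above, which is an assumption made so that the example is self-contained.

```stata
sysuse auto, clear

* Hyphenated list: every variable from price through weight, in dataset order
describe price-weight

* Wildcard: every variable whose name starts with t (trunk and turn)
summarize t*

* Change the variable order; later hyphenated lists follow the new order
order make foreign
describe
```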
Even more commonly, you may employ the if exp clause. This restricts
the set of observations to those for which the exp, a Boolean
expression, evaluates to true.
list make price if foreign == 1
lists only foreign cars, and
list make price if price > 10000 & price < .
lists only expensive cars (in 1978 prices!) Note the double equal in the
exp. A single equal sign, as in the C language, is used for assignment;
double equal for comparison. Stata's missing value codes are greater
than the largest positive number, so that the last command would
avoid listing cars for which the price is missing.
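The two list commands can be run against the auto dataset shipped with Stata, as in this minimal sketch. The count commands are an addition for illustration: they report how many observations satisfy each condition without listing them.

```stata
sysuse auto, clear

* Comparison uses the double equal sign
list make price if foreign == 1
count if foreign == 1

* The upper bound "price < ." guards against missing prices
list make price if price > 10000 & price < .
count if price > 10000 & price < .
```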
Reading and writing binary (.dta) files is much faster than dealing with
text (ASCII) files (with the insheet or infile commands), and
permits variable labels, value labels, and other characteristics of the
file to be saved along with the file. To write a Stata binary file, the
command
save file [, replace]
is employed. The compress command can be used to economize on
the disk space (and memory) required to store variables.
Stata's version 10 and 11 datasets cannot be read by version 8 or 9; to
create a compatible dataset, use saveold.
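A minimal sketch of the save workflow, assuming you are in a writable working directory. The filenames myauto and myauto_old are hypothetical.

```stata
sysuse auto, clear

* Store each variable in the smallest type that holds its values
compress

* Write a binary dataset, overwriting any existing copy
save myauto, replace

* Write a copy in an older file format for users of earlier versions
saveold myauto_old, replace

* Read the binary file back in
use myauto, clear
```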
or
use http://fmwww.bc.edu/ec-p/data/Wooldridge/crime1.dta
will read these datasets into Stata's memory over the web.
The type command can display any text file, whether on your hard
disk or over the Web; thus
type http://fmwww.bc.edu/ec-p/data/Wooldridge/crime1.des
When you have used a dataset over the Web, you have loaded it into
memory in your desktop Stata. You cannot save it to the Web, but can
save the data to your own hard disk. The advantages of this feature for
instructional and collaborative research should be clear. Students may
be given a URL from which their assigned data are to be accessed; it
matters not whether they are using Stata for Windows, Macintosh,
Linux, or Unix.
Prefix commands
A number of Stata commands can be used as prefix commands,
preceding a Stata command and modifying its behavior. The most
commonly employed is the by prefix, which repeats a command over a
set of categories. The statsby: prefix repeats the command, but
collects statistics from each category. The rolling: prefix runs the
command on moving subsets of the data (usually time series).
Several other command prefixes are available: simulate:, which simulates a
statistical model; bootstrap:, allowing the computation of bootstrap
statistics from resampled data; and jackknife:, which runs a command
over jackknife subsets of the data. The svy: prefix can be used with
many statistical commands to allow for survey sample design. See my
separate slideshow on Monte Carlo Simulation in Stata.
The by prefix
You can often save time and effort by using the by prefix. When a
command is prefixed with a bylist, it is performed repeatedly for each
element of the variable or variables in that list, each of which must be
categorical. For instance,
by foreign: summ price
will provide descriptive statistics for both foreign and domestic cars. If
the data are not already sorted by the bylist variables, the prefix
bysort should be used. The option , total will add the overall
summary.
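A brief sketch of the by and bysort prefixes, using the auto dataset shipped with Stata. The sysuse line is an assumption added so that the example is self-contained.

```stata
sysuse auto, clear

* bysort sorts on foreign first, then runs the command for each group
bysort foreign: summarize price

* Once the data are sorted, the plain by prefix is sufficient
sort foreign
by foreign: summarize price mpg
```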
For instance, if you have individual data with a family identifier, these
commands might be useful:
sort famid age
by famid: gen famsize = _N
by famid: gen birthorder = _N - _n + 1
Here the famsize variable is set to _N, the total number of records for
that family, while the birthorder variable is generated by sorting the
family members' ages within each family.
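To see how _N and _n behave under by, here is a small worked sketch on made-up data. The five observations are hypothetical and are typed in with input only so that the example is self-contained.

```stata
clear
input famid age
1 12
1 9
1 15
2 7
2 10
end

* Within each family, observations are ordered from youngest to oldest
sort famid age
by famid: generate famsize = _N
by famid: generate birthorder = _N - _n + 1

* The oldest member of each family gets birthorder 1
list, sepby(famid)
```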
Missing values
Missing value codes in Stata appear as the dot (.) in printed output
(there is a string missing value code as well: "", the null string). The
numeric missing value takes on
the largest possible positive value, so in the presence of missing data
you do not want to say
generate hiprice = (price > 10000)
but rather
generate hiprice = (price > 10000 & price < .)
which then generates a dummy variable for high-priced cars (for
which price data are complete, with prices less than missing).
As of version 8, Stata allows for multiple missing value codes
(.a, .b, .c, ..., .z).
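The effect of the extra condition can be seen on a three-observation sketch. The data are hypothetical, and the variable name naive is introduced only for the comparison.

```stata
clear
input price
5000
12000
.
end

* Missing compares as larger than any number, so the naive test flags it
generate naive = (price > 10000)

* The guarded version excludes the missing observation
generate hiprice = (price > 10000 & price < .)

list
```

Note that hiprice is 0, not missing, for the observation with a missing price; adding if price < . to the generate command would leave it missing instead.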
Display formats
Each variable may have its own default display format. This does not
alter the contents of the variable, but affects how it is displayed. For
instance, %9.2f would display a two-decimal-place real number. The
command
format varname %9.2f
will save that format as the default format of the variable, and
format date %tm
will format a Stata date variable into a monthly format (e.g.,
1998m10).
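Both formats can be tried in a few lines. This sketch assumes the auto dataset shipped with Stata; the variable month is hypothetical and is filled with a single monthly date via the tm() function.

```stata
sysuse auto, clear

* Two decimal places; the stored values are unchanged
format price %9.2f
list make price in 1/3

* A monthly date is stored as months since 1960m1
generate month = tm(1998m10)
format month %tm
list month in 1
```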
Variable labels
Each variable may have its own variable label. The variable label is a
character string (maximum 80 characters) which describes the
variable, associated with the variable via
label variable varname "text"
Variable labels, where defined, will be used to identify the variable in
printed output, space permitting.
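A minimal sketch of attaching and inspecting a variable label, using the auto dataset shipped with Stata. The label text is only an example.

```stata
sysuse auto, clear

* Attach (or replace) the descriptive label of a variable
label variable mpg "Mileage (miles per gallon)"

* The label appears in describe and in the headings of many commands
describe mpg
```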
Value labels
Estimation commands
Common syntax
Post-estimation commands
the command
estimates table model1 model2 model3
will produce a nicely-formatted table of results. Options on
estimates table allow you to control precision, whether standard
errors or t-values are given, significance stars, summary statistics, etc.
For example:
estimates table model1 model2 model3, b(%10.3f) ///
    se(%7.2f) stats(r2 rmse N) title(Some models of auto price)
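For estimates table to find model1, model2 and model3, each set of results must first be saved with estimates store. The sketch below shows one way to get there; the three regression specifications are assumptions chosen for illustration, not the models used in the original slides.

```stata
sysuse auto, clear

* Fit three hypothetical specifications and store each set of results
quietly regress price mpg
estimates store model1
quietly regress price mpg weight
estimates store model2
quietly regress price mpg weight foreign
estimates store model3

* Tabulate the stored results side by side
estimates table model1 model2 model3, b(%10.3f) se(%7.2f) ///
    stats(r2 rmse N) title(Some models of auto price)
```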
Publication-quality tables
[Table 1: Models of auto price. Three models of price, columns (1) to (3), with regressors drawn from mpg, length, turn, weight, displacement, gear ratio and foreign, plus a constant; a statistic in parentheses accompanies each coefficient. R2 = 0.251, 0.348 and 0.552; BIC = 1387.2, 1377.0 and 1353.5; N = 74.]
File handling
File extensions usually employed (but not required) include:
.ado    automatic do-file (defines a Stata command)
.dct    data dictionary, optionally used with infile
.do     do-file (user program)
.dta    Stata binary dataset
.gph    graphics output file (binary)
.log    text log file
.smcl   SMCL (markup) log file, for use with Viewer
.raw    ASCII data file
.sthlp  Stata help file
If some of the data are string variables without embedded spaces, they
must be specified in the command:
infile str3 country price mpg displacement using auto2
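To try the infile command without the original data, the sketch below first writes a two-line text file and then reads it. The file contents are made up for illustration, and auto2.raw is created in the current working directory.

```stata
* Write a small space-delimited text file (hypothetical values)
tempname fh
file open `fh' using auto2.raw, write text replace
file write `fh' "USA 4099 22 121" _n
file write `fh' "JPN 5799 25 89" _n
file close `fh'

* str3 declares country as a string of up to three characters;
* infile assumes the .raw extension when none is given
infile str3 country price mpg displacement using auto2, clear
list
```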
If your data are already in the internal format of SAS, SPSS, Excel,
GAUSS, MATLAB, or a number of other packages, the best way to get
them into Stata is by using the third-party product Stat/Transfer.
Stat/Transfer will preserve variable labels, value labels, and other
aspects of the data, and can be used to convert a Stata binary file into
other packages' formats. It can also produce subsets of the data
(selecting variables, cases or both) so as to generate an extract file
that is more manageable. This is particularly important when the
2,047-variable limit on standard Stata data sets is encountered.
Stat/Transfer is well documented, with on-line help available in the
Windows, Mac OS X and Unix versions, and an extensive manual.
append
merge
We now describe the merge command, which is Stata's basic tool for
working with more than one dataset. Its syntax changed considerably
in Stata version 11.
The merge command takes a first argument indicating whether you are
performing a one-to-one, many-to-one, one-to-many or many-to-many
merge using specified key variables. It can also perform a one-to-one
merge by observation.
dataset1:
id     var1   var2
112    ..     ..
216    ..     ..
449    ..     ..

dataset3:
id     var22  var44  var46
112    ..     ..     ..
216    ..     ..     ..
449    ..     ..     ..

combined:
id     var1   var2   var22  var44  var46
112    ..     ..     ..     ..     ..
216    ..     ..     ..     ..     ..
449    ..     ..     ..     ..     ..
The rule for merge, then, is that if datasets are to be combined on one
or more merge keys, they each must have one or more variables with a
common name and datatype (string vs. numeric). In the example
above, each dataset must have a variable named id. That variable
can be numeric or string, but that characteristic of the merge key
variables must match across the datasets to be merged. Of course, we
need not have exactly the same observations in each dataset: if
dataset3 contained observations with additional id values, those
observations would be merged with missing values for var1 and
var2.
This is the simplest kind of merge: the one-to-one merge. Stata
supports several other types of merges. But the key concept should be
clear: the merge command combines datasets horizontally, adding
variables' values to existing observations.
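The one-to-one merge can be sketched with two tiny made-up datasets that follow the layout of the example: the id values 112, 216 and 449 come from the example, while the data values and the extra id 500 are hypothetical. The syntax is that of Stata version 11 and later.

```stata
* Build the using dataset, with one id that dataset1 lacks
clear
input id var22 var44 var46
112 1 2 3
216 4 5 6
449 7 8 9
500 10 11 12
end
tempfile dataset3
save `dataset3'

* Build the master dataset
clear
input id var1 var2
112 10 20
216 30 40
449 50 60
end

* One-to-one merge on the key variable id
merge 1:1 id using `dataset3'

* _merge records where each observation came from
tabulate _merge
list
```

The observation with id 500 appears only in the using dataset, so it is kept with missing values for var1 and var2.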
Match merge
Reconfiguring data
Data are often provided in a different orientation than that required for
statistical analysis. The most common example of this occurs with
panel, or longitudinal, data, in which each observation conceptually
has both cross-section (i) and time-series (t) subscripts. Often one will
want to work with a pure cross-section or pure time-series. If the
microdata themselves are the objects of analysis, this can be handled
with sorting and a loop structure. If you have data for N firms for T
periods per firm, and want to fit the same model to each firm, you
could use the statsby command, or if more complex processing of
each model's results were required, a foreach block could be used. If
analysis of a cross-section were desired, a bysort would do the job.
collapse
But what if you want to use average values for each time period,
averaged over firms? The resulting dataset of T observations can be
easily created by the collapse command, which permits you to
generate a new data set composed of summary statistics of specified
variables. More than one summary statistic can be generated per input
variable, so that both the number of firms per period and the average
return on assets could be generated. collapse can produce counts,
means, medians, percentiles, extrema, and standard deviations.
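A small sketch of collapse on a made-up firm-year panel. The variable names firm, year and roa and all of the values are hypothetical.

```stata
clear
input firm year roa
1 2001 0.10
1 2002 0.20
2 2001 0.30
2 2002 0.40
3 2002 0.60
end

* One observation per year: average return on assets and number of firms
collapse (mean) mean_roa = roa (count) nfirms = roa, by(year)
list
```

Note that collapse replaces the data in memory, so save the firm-level data first if you will need them again.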
reshape
reshape long d, i(ccode vcode) j(year)
reshape wide d, i(ccode year) j(vcode) string
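The two reshape commands can be traced on a made-up dataset in which each observation is a country code (ccode) and variable code (vcode) pair, with one column of d per year. The data values and the codes gdp and pop are hypothetical.

```stata
clear
input ccode str3 vcode d1990 d2000
1 "gdp" 100 150
1 "pop" 10 12
2 "gdp" 200 260
2 "pop" 20 23
end

* Wide to long: one observation per country, variable code and year
reshape long d, i(ccode vcode) j(year)
list

* Long to wide on vcode: one observation per country and year, with a
* separate column for each variable code (dgdp and dpop)
reshape wide d, i(ccode year) j(vcode) string
list
```

The string option is needed in the second command because vcode holds text rather than numbers.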
Returned results
Stata commands are either r-class commands like summarize, that
return results, or e-class commands, that return estimates. You may
examine the set of results from an r-class command with the command
return list. For an e-class command, use ereturn list. An
e-class command will return e() scalars, macros and matrices: for
instance, after regress, the scalar e(N) will contain the number
of observations, e(r2) the R2 value, the macro e(depvar) will contain the
name of the dependent variable, and so on.
Commands may also return matrices. For instance, regress (like all
estimation commands) will return the matrix e(b), a row vector of
point estimates, and the matrix e(V), the estimated variance-covariance
matrix of the estimated parameters.
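A short sketch of inspecting returned results, using the auto dataset shipped with Stata. The regression specification is an assumption chosen for illustration.

```stata
sysuse auto, clear

* r-class: results of the most recent general command
summarize price
return list
display r(mean)

* e-class: results of the most recent estimation command
regress price mpg weight
ereturn list
display e(N) "  " e(r2) "  `e(depvar)'"

* Point estimates and their variance-covariance matrix
matrix list e(b)
matrix list e(V)
```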
Useful commands
Data manipulation
Statistics
Statistical commands
summarize : descriptive statistics
correlate : correlation matrices
ttest : perform 1-, 2-sample and paired t-tests
anova : 1-, 2-, n-way analysis of variance
regress : least squares regression
predict : generate fitted values, residuals, etc.
test : test linear hypotheses on parameters
lincom : linear combinations of parameters
cnsreg : regression with linear constraints
testnl : test nonlinear hypotheses on parameters
margins : marginal effects (elasticities, etc.)
ivregress : instrumental variables regression
prais : regression with AR(1) errors
sureg : seemingly unrelated regressions
reg3 : three-stage least squares
qreg : quantile regression
sem : structural equation modeling
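Several of the commands listed above are typically used together after an estimation. The following is a minimal sketch, assuming the auto dataset shipped with Stata; the model and the hypotheses tested are arbitrary illustrations.

```stata
sysuse auto, clear
summarize price mpg weight
correlate price mpg weight

* Least squares regression, then fitted values and residuals
regress price mpg weight foreign
predict double phat, xb
predict double ehat, residuals

* Joint test that both slope coefficients are zero
test mpg weight

* Point estimate and confidence interval for a linear combination
lincom mpg + weight
```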
Nonlinear estimation
Graphics
Graphics commands:
will generate a scatterplot, overlaid with the linear regression fit, and
twoway (lfitci price mpg) (scatter price mpg)
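Both kinds of overlay can be tried with the auto dataset shipped with Stata, which contains price and mpg. The first command below is an assumption about how a scatterplot with a fitted line could be written; the second is the lfitci command from the text.

```stata
sysuse auto, clear

* Scatterplot with the linear regression fit overlaid
twoway (scatter price mpg) (lfit price mpg)

* Fit with its 95% confidence band drawn first, so the points sit on top of it
twoway (lfitci price mpg) (scatter price mpg)
```

In a twoway command the plots are drawn in the order given, which is why the confidence band is listed before the scatter.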
[Figure: Price against Mileage (mpg), with fitted values]
[Figure: Price against Mileage (mpg), with fitted values and 95% CI]
[Figure: Price against Mileage (mpg)]
[Figure: plot of y]
[Figure: Price against Mileage (mpg) with fitted values, and Price against Gallons per mile]
Cross-section example
* StataIntro: cross-section example
log using intro1, replace
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76
describe
summarize
label define ur 0 rural 1 urban
label values smsa ur
tab smsa
tab mrt smsa, chi2
ttest med, by(smsa)
anova lw mrt smsa
anova, regress
anova lw mrt smsa mrt*smsa
regress lw tenure kww smsa
predict lweps, resid
scatter lweps kww
bysort year: regress lw tenure kww smsa
graph matrix iq kww age s expr lw, msize(tiny)
gen medurban = med*(smsa==1)
gen medrural = med*(smsa==0)
regress lw tenure kww medurban medrural
test medurban=medrural
log close
[Figure: scatterplot matrix of iq score, score on knowledge in world of work test, age, completed years of schooling, experience (years) and log wage]
* StataIntro: time-series example
log using intro2, replace
use http://fmwww.bc.edu/ec-p/data/micro/ddjia.dta
desc
summ
tsset
tsline ret
foreach v of varlist djia ldjia ret {
    regress `v' L(1/3).`v', robust
    predict eps_`v', resid
    wntestq eps_`v'
    dfgls `v', maxlag(12)
    dfgls D.`v', maxlag(12)
}
log close
[Figure: time-series plot against day]
Writing a do-file
The use of local macros and the appropriate loop constructs makes it
possible to write a Stata program that is fairly general and requires
little modification to be reused on different series or with different
parameters. This makes your work with Stata very productive, since
much of the code is reusable and adaptable to similar tasks. Let us
consider how this approach might be pursued in the context of the
volatility forecast example.
For more information, see my separate slideshow Why should you
become a Stata programmer? and my 2009 book An Introduction to
Stata Programming, available in O'Neill Library.
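As a small illustration of that idea, the sketch below keeps everything that might change in local macros, so the loops themselves need no editing when the series or parameters change. It assumes the auto dataset shipped with Stata; the macro names depvar and regressors are hypothetical.

```stata
sysuse auto, clear

* Everything that might change is held in local macros
local depvar price
local regressors mpg weight

* Loop over variables: one bivariate regression per regressor
foreach v of varlist `regressors' {
    quietly regress `depvar' `v'
    display "`depvar' on `v': b = " %9.3f _b[`v'] ", R-squared = " %5.3f e(r2)
}

* Loop over a range of values: mean of the dependent variable by repair record
forvalues p = 1/3 {
    quietly summarize `depvar' if rep78 == `p', meanonly
    display "rep78 = `p': mean `depvar' = " %9.2f r(mean)
}
```

Reusing the code on another series only requires changing the two local definitions at the top.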
Writing an ado-file
[Win(integer 12) AR(integer 4)]
quietly summ `varlist', meanonly
local last = r(N)
dis _n "`vol': volatility forecast for `varlist' with window=`win', AR(`ar')"
forv i = `start'/`last' {
    local first = `i' - `win' + 1
    quietly regress `varlist' L(1/`ar').`varlist' ///
        in `first'/`i'
    quietly replace `vol' = e(rmse) in `i'/`i'
}
exit
end
This program defines the volfc command, which will appear like any
other Stata command on your machine. It may be executed as
use http://fmwww.bc.edu/ec-p/data/macro/bdh, clear
volfc pcrude, vol(vv)
volfc pcrude, vol(vv24) win(24)
volfc pcrude, vol(vv126) ar(6)
volfc pcrude, vol(vv248) win(24) ar(8)
tin(1983q1,), ti(Volatility forecasts for
[Figure: volatility forecasts vv, vv24, vv126 and vv248 plotted against date, 1983q3 to 1999q3]
return local vol `vol'
return local win `win'
return local ar `ar'
return local first `start'
return local last `last'
With these statements added to the end of the routine, the returned
macros are defined and their values stored in r().
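How a caller picks up such returned values can be sketched with a much smaller r-class program. The program name meanrange and everything it returns are hypothetical, and the auto dataset shipped with Stata is assumed.

```stata
capture program drop meanrange
program define meanrange, rclass
    version 10.0
    syntax varname(numeric)
    quietly summarize `varlist'
    * Values placed in return() become visible to the caller as r()
    return scalar mean = r(mean)
    return scalar range = r(max) - r(min)
    return local varname `varlist'
end

sysuse auto, clear
meanrange price
return list
display "range of `r(varname)' = " r(range)
```

Inside the program the results of summarize stay available in r() until the program ends; only then are the values set with return made visible to the caller.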
Local macros are exactly that: objects with local scope, defined within
the program in which they are used, disappearing when that program
terminates. This is generally the desired outcome, preventing a clutter
of objects from being retained when a program calls numerous others
in the course of execution. At times, though, it is necessary to have
objects that can be passed from one subprogram to another. The
return logic above would not really serve, since although it passes
local macros from a program to its caller, they would then have to be
passed as arguments to a second program.
To deal with the need for persistent objects, Stata contains global
macros. These objects, once defined, live for the duration of your Stata
session, and may be read or written within any Stata program. They
are defined with the global command, rather than local, and
referred to as $macroname. Global macros should only be used
where they are required.
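A minimal sketch of one program writing a global macro and another reading it. The program names setlevel and showlevel and the macro name mylevel are hypothetical.

```stata
capture program drop setlevel
program define setlevel
    * Written here, but visible everywhere for the rest of the session
    global mylevel 90
end

capture program drop showlevel
program define showlevel
    * Read in a different program, without being passed as an argument
    display "confidence level in use: $mylevel"
end

setlevel
showlevel

* Remove the global once it is no longer needed
macro drop mylevel
```

Dropping the global at the end keeps it from lingering in the session, which is the clutter that local macros avoid automatically.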
This program, pangrodev, performs this task for each unit of a panel,
automatically identifying the observations belonging to each unit,
taking the logarithm of the specified variable, running the appropriate
regression and prediction commands, and assembling the results in
the specified new variable.
The program makes use of Stata's tempname and tempvar
commands to create non-scalar objects (in this case the matrix VV and
the variables lvar and pvar) which, like local macros, will exist only
for the duration of the ado-file. These temporary facilities, like the
associated tempfile, which allows temporary files to be specified,
help reduce clutter and guarantee that objects' names will not conflict
with other items in the user's namespace.
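The temporary facilities can be seen in isolation in the sketch below. The program logdev and its option are hypothetical, the temporary name is used for a scalar rather than a matrix, and the auto dataset shipped with Stata is assumed.

```stata
capture program drop logdev
program define logdev
    version 10.0
    syntax varname(numeric), Gen(string)
    * Temporary variable and temporary name: dropped when the program ends
    tempvar lvar
    tempname m
    quietly generate double `lvar' = log(`varlist')
    quietly summarize `lvar', meanonly
    scalar `m' = r(mean)
    * Only the variable the user asked for survives
    generate double `gen' = `lvar' - `m'
end

sysuse auto, clear
logdev price, gen(lpdev)
summarize lpdev

* tempfile gives a file name that is erased when the do-file or program ends
tempfile snapshot
quietly save `snapshot'
```

After logdev runs, describe shows lpdev but no trace of the working variable, whatever names the user's dataset already contains.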
*! pangrodev 1.1.0 CFBaum 21Jan2006
* generate deviations from constant growth in panel
* 1.1.0: promote to v9, use levelsof
program define pangrodev, rclass
    version 10.0
    syntax varname, Gen(string) [xb]
    local togens "deviations from constant growth"
    if "`xb'" != "" {
        local togens "predicted growth"
    }
    qui tsset
    local ivar = r(panelvar)
    local timevar = r(timevar)
    tempname VV
    tempvar lvar pvar
    qui gen double `lvar' = log(`varlist')
    * get list of units
    qui levelsof `ivar', local(vals)
    local nvals: word count `vals'
    qui gen double `gen' = .
    local xc 0
    local tbar 0
    local rsqr 0
    foreach v of local vals {
        summ `lvar' if `ivar'==`v', meanonly
        if r(N) > 2 {
            qui regress `lvar' `timevar' if `ivar'==`v'
            capt drop `pvar'
            qui predict double `pvar' if e(sample), xb
            qui replace `gen' = exp(`pvar') if e(sample)
            if "`xb'" == "" {
                qui replace `gen' = `varlist' - `gen' if e(sample)
            }
            local xc = `xc' + 1
            local tbar = `tbar' + e(N)
            local rsqr = `rsqr' + e(r2)
        }
    }
    local tbar = int(100*`tbar'/`xc')/100.0
    local rsqr = int(1000*`rsqr'/`xc')/1000.0
    di in gr _n "`gen': `togens' for `xc' of `nvals' units"
    di in gr "tbar = `tbar' rsq-bar = `rsqr'"
    exit
end
This program defines the pangrodev command, which will appear like
any other Stata command on your machine. It may be executed as
. use http://fmwww.bc.edu/ec-p/data/macro/cap797wa
(World Bank Database for Sectoral Investment, 1948-1992)
. pangrodev TotSECap, g(totcapdev)
totcapdev: deviations from constant growth for 57 of 63 units
tbar = 25.94 rsq-bar = .673
. pangrodev TotSECap, g(totcaphat) xb
(output omitted)
) accum"
label var ccode "South American country"
tsline totcapdev if year>1969, by(ccode) ///
[Figure: time-series plots against year, 1970 to 1990, one panel per country (ARG, CHL, COL, PER, URY, VEN). Graphs by South American country]
Concluding remarks
Whether or not you use Stata's programming facilities to write your
own ado-files, a reading knowledge of the programming language is
very useful in case you want to adapt an existing Stata command
(official or user-contributed) in a do-file you are writing.
Since the code for all Stata commands that are implemented as
ado-files (as the command which ... will show) is available on your
hard disk, Stata itself is a fertile source of programming techniques
that may be adapted to solve any programming problem.
For a thorough treatment of the subject, see my book An Introduction
to Stata Programming (2009) in O'Neill Library.
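Locating and reading the source of an official command might look like the sketch below. It assumes that collapse is implemented as an ado-file in your installation, which is the case in the releases I am aware of.

```stata
* Report where the command's ado-file lives and which version it is
which collapse

* Open the source in the Viewer for reading
viewsource collapse.ado
```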