
Before we start…

•  Download the course files using Internet Explorer:
   Go to www.ciser.cornell.edu
   Click on the Workshop Downloads link
   Enter the following information:
   UserID: ciser
   Password: download
•  Click on Download next to “Stata: Programming, Parts 1&2”
•  Click Run twice and Next twice
•  This will create a folder called c:\stataworkshop containing all of the course files

An Introduction to Stata:
Part II

Chayanee Chawanote
CISER Consultant
Fall 2012

2. Introduction to Stata:
Overview
•  Review of Part I
•  Do Files
•  Summary Statistics
•  Sorting Data
•  Appending Datasets
•  Merging Datasets
•  “By Group” Processing
•  Collapsing Datasets

3. Introduction to Stata:
Overview (cont.)
•  Panel Data Capabilities
•  Correlation
•  Regression Analysis
•  Graphs

4. Introduction to Stata:
Review of Part I
•  Interactive and batch modes
•  Increasing the memory: set memory
•  Clearing the memory: clear
•  Using Stata dataset: use
•  Entering fixed format data (i.e. fixed columns): infile or infix
•  Entering free format data: infile or insheet
•  Creating .dta files directly from other file formats: Stat/Transfer

5. Introduction to Stata:
Review of Part I (cont.)
•  Do and Log files
•  Formats: numeric, string, dates
•  Checking data: list, describe, codebook
•  Transforming data: rename, generate, replace, drop, keep

6. Do Files:
Creating Do Files
•  Do-files allow commands to be saved and executed in “batch” form.
•  We will use the Stata do-file editor to write do-files.
•  Can also use WordPad or Notepad:
–  Save as “Text Document” with extension “.do” (instead of “.txt”). Allows larger files than do-file editor.
•  To open do-file editor click Window → Do-File Editor or click the Do-File Editor toolbar icon.

7. Do Files:
Executing Do Files
•  To run do-file in Stata, type:
do dofilename
•  Can “comment out” lines by preceding with * or by enclosing text within /* and */.
•  Can save the contents of the Review window as a do-file by right-clicking on window and selecting “Save Review Contents”.
•  Note: a blank line must be included at the end of a WordPad do-file (otherwise last line will not run).
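
A minimal sketch of a do-file that puts the points above together. It assumes the auto example dataset that ships with Stata (loaded with sysuse); the file name example.do and the log name example_log are hypothetical placeholders, not course files.

```stata
* example.do: a hypothetical do-file (names are placeholders)
capture log close                     // close any log left open by an earlier run
log using example_log, text replace   // record the output in a plain-text log

sysuse auto, clear                    // example dataset shipped with Stata

* A line that starts with an asterisk is a comment
summarize price mpg                   /* text between these delimiters is also a comment */

log close
```

Saved as example.do in the working directory, it would be run with: do example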

8. Data Exploration:
Summary Statistics
•  tabstat produces a table of summary statistics:
tabstat varlist[, statistics(statlist)]
•  Example:
tabstat weight length, s(mean n sd semean)
•  summarize displays a variety of univariate summary statistics (number of non-missing observations, mean, standard deviation, minimum, maximum):
summarize weight length, detail

9. Data Exploration:
Summary Statistics (cont.)
•  tabulate generates one or two-way tables of frequencies (also useful for checking data):
tabulate rowvar [colvar]
•  Example:
tabulate foreign
tabulate foreign rep78
tabulate foreign rep78, row col
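
The summary commands above can be tried on the auto example dataset that ships with Stata. The slides' examples use variables from that dataset, but loading it with sysuse is an assumption here, as is the choice of statistics.

```stata
sysuse auto, clear

* Summary statistics for two variables, separately for each value of foreign
tabstat price mpg, statistics(mean sd min max) by(foreign)

* detail adds percentiles, skewness and kurtosis; results are left in r()
summarize mpg, detail
display "Median mpg: " r(p50)

* One-way frequency table that keeps missing values as their own category
tabulate rep78, missing
```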

10. Data Exploration:
Summary Statistics (cont.)
•  table displays table of statistics:
table rowvar [colvar] [, contents(clist varname)]
•  clist can be freq, mean, sum etc.
•  Example:
table foreign rep78, c(mean mpg sd mpg)
•  Missing values are excluded from tables by default. To include them as a group, use the missing option with tabulate

EXERCISE 1
11. Generating Simple Statistics
•  Open the file US Economic Data.dta (Part I)
•  Use summarize to determine which of gdpdeflator and cpi has the higher mean and which has the higher standard deviation (Slide 8)
•  Create a string variable yearstr containing the values of year using the string function (Part I)
•  Create a one-digit variable decade by selecting the third digit of yearstr:
gen decade=substr(yearstr,3,1)

EXERCISE 1 (cont.)
12. Generating Simple Statistics
•  Use tabulate to calculate the number of years of high unemployment, highun, by decade, i.e. 1940s, 1950s (Slide 9)
•  Use table command to produce a table that shows the mean of realgdp and the sum of persbankrupt by decade, rows, and highun, columns (Slide 10)
•  Save the do-file using any name you wish (Slide 6)

13. Data Management:
Sorting Data
•  sort puts the observations in dataset in a specific order:
sort varlist
•  Some procedures require file to be sorted before they can be executed, e.g. merge.
•  You can sort a file based on more than one variable.
•  Example:
sort mpg weight

14. Data Management:
Sorting Data (cont.)
•  sort arranges observations in ascending order only.
•  Example:
sort price
•  gsort arranges observations in ascending or descending order; prefix a variable with - for descending order.
•  Example:
gsort price
gsort -price
gsort -make

15. Data Management:
Sorting Data (cont.)
•  Note that Stata randomizes the order of observations that have equal values of the sort variables.
•  Hence, it is useful to create a “counter” variable beforehand, so that the original order can be recovered if necessary:
gen id=_n
(Use sort id to “undo” a mistaken sort.)
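
A short sketch of the sorting commands, again assuming the auto example dataset loaded with sysuse. The stable option of sort is shown as one way to keep tied observations in their existing order; the variable name id follows the slide.

```stata
sysuse auto, clear

gen id = _n                 // counter that records the original order

sort foreign, stable        // stable keeps ties in their current order
gsort foreign -price        // foreign ascending, price descending within foreign
list make foreign price in 1/5

sort id                     // restore the original order
```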

16. Combining Datasets:
Appending
•  To add another Stata dataset below the end of the dataset in memory, type:
append using filename
•  Dataset in memory is called “master dataset”.
•  Dataset filename is called “using dataset”.
•  Variables (i.e. with same name) in both datasets will be combined.
•  Variables in only one dataset will have missing values for observations from the other dataset.

17. Combining Datasets:
Appending (Example)
data1
id   Name   Age  Source
001  Alice  21   wave1
002  Mary   20   wave1
003  David  23   wave1
data2
id   Name   Age  Source
004  Jim    25   wave2
005  Linda  24   wave2
data3
id   Name   Age  Source
001  Alice  21   wave1
002  Mary   20   wave1
003  David  23   wave1
004  Jim    25   wave2
005  Linda  24   wave2
Example:
use data1,clear
append using data2
save data3
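
The appending example can be reproduced without any files on disk. The sketch below types the two small datasets in with input and holds the second one in a temporary file; the macro name wave2 and the lower-case variable names are illustrative choices, and the code is meant to be run as a do-file so that the temporary file name stays defined.

```stata
* Build the "using" dataset and keep it in a temporary file
clear
input id str5 name age
4 "Jim"   25
5 "Linda" 24
end
gen source = "wave2"
tempfile wave2
save `wave2'

* Build the "master" dataset and append the using dataset below it
clear
input id str5 name age
1 "Alice" 21
2 "Mary"  20
3 "David" 23
end
gen source = "wave1"
append using `wave2'
list
```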

18. Combining Datasets:
Appending (Example)
data1
id   Name   Age  Source
001  Alice  21   wave1
002  Mary   20   wave1
003  David  23   wave1
data2
id   Name   Source
004  Jim    wave2
005  Linda  wave2
data3
id   Name   Age  Source
001  Alice  21   wave1
002  Mary   20   wave1
003  David  23   wave1
004  Jim    .    wave2
005  Linda  .    wave2
Example:
use data1,clear
append using data2
save data3

19. Combining Datasets:
Merging
•  To join corresponding observations from a Stata dataset with those in the dataset in memory, type:
merge [varlist] using filename [, options]
•  A match merge joins observations with common values of varlist, which must be present in both datasets.
•  The update option updates missing values of same-named variables in master with values from the using dataset
•  The update option with replace replaces all values of same-named variables in master with nonmissing values from the using dataset

20. Combining Datasets:
Merging (Example)
data1
id   Name   Age
001  Alice  21
002  Mary   20
003  David  23
data2
id   inc96  inc97
003  22000  25000
001  32000  37000
002  21000  24000
data3
id   Name   Age  inc96  inc97
001  Alice  21   32000  37000
002  Mary   20   21000  24000
003  David  23   22000  25000
Example:
use data2, clear
sort id
save data2, replace
use data1, clear
sort id
merge id using data2
save data3,replace

21. Combining Datasets:
Merging (Example)
data1
id   Name   Age
001  Alice  21
002  Mary   20
data2
id   inc96  inc97
003  22000  25000
001  32000  37000
data3
id   Name   Age  inc96  inc97  _merge
001  Alice  21   32000  37000  3
002  Mary   20   .      .      1
003         .    22000  25000  2
Example:
use data2, clear
sort id
save data2, replace
use data1, clear
sort id
merge 1:1 id using data2, update
tab _merge
save data3,replace

22. Combining Datasets:
Merging (cont.)
•  One-to-one merge on specified key variables
merge 1:1 varlist using filename [, options]
•  Many-to-one merge on specified key variables
merge m:1 varlist using filename [, options]
•  One-to-many merge on specified key variables
merge 1:m varlist using filename [, options]
•  One-to-one merge by observation
merge 1:1 _n using filename [, options]
•  Many-to-many merge on specified key variables
merge m:m … BE CAUTIOUS!

23. Combining Datasets:
Merging (cont.)
•  For a match merge, data in both master and using datasets must first be sorted by varlist.
•  The variable _merge is automatically added to the dataset, containing:
_merge==1  Observation in master data only
_merge==2  Observation in using data only
_merge==3  Observation in master and using data
_merge==4  Observation in both, missing values updated
_merge==5  Observation in both, conflicting nonmissing values
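
A self-contained sketch of a one-to-one merge and the resulting _merge codes, using the small tables from the merging example. The macro name income and the lower-case variable names are illustrative, and the code is meant to be run as a do-file. As far as I know, the merge 1:1 syntax (Stata 11 and later) sorts the data itself when needed, so no explicit sort appears here; the older merge varlist using syntax does require sorting first.

```stata
* Using dataset: incomes for ids 3 and 1
clear
input id inc96 inc97
3 22000 25000
1 32000 37000
end
tempfile income
save `income'

* Master dataset: names and ages for ids 1 and 2
clear
input id str5 name age
1 "Alice" 21
2 "Mary"  20
end

merge 1:1 id using `income'
tabulate _merge            // 1 = master only, 2 = using only, 3 = matched
list, sepby(_merge)
```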

24. Combining Datasets:
Merging (cont.)
•  To form all pairwise combinations between two datasets, type:
joinby [varlist] using filename
•  Unlike merge, joinby can handle “many-to-many” merges.

EXERCISE 2
25. Combining Datasets: Merging
•  Open "State admission data.dta" - the using dataset - and sort by state, then save (Part I and Slide 13)
•  Open "Stata course data 3.dta" - the master dataset - and sort by state (Part I and Slide 13)
•  Merge with "State admission data.dta" using state as the match variable (Slide 19)
•  Type tabulate _merge to check the results of the merge. Think about what the values of this variable indicate. (Slide 23)

EXERCISE 2 (cont.)
26. Combining Datasets: Merging
•  Correct for the missing values of region by typing (Part I):
replace region="South" if stateabb=="DC"
•  Save the modified dataset as “Combined state data.dta” (Part I)

27. Data Management:
“By Group” Processing
•  To execute a Stata command separately for each group of observations for which the values of the variables in varlist are the same, type:
by varlist: command
•  Example:
by foreign: summarize rep78
•  Most commands allow the by prefix.
•  Requires that data be sorted by varlist (precede command with sort varlist or use bysort).
•  Example:
bysort foreign: summarize rep78
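
The by prefix also works with generate and egen, where _n and _N count within each group. A brief sketch, assuming the auto example dataset loaded with sysuse; the new variable names are hypothetical.

```stata
sysuse auto, clear

* Group mean attached to every observation in the group
bysort foreign: egen mean_price = mean(price)

* Sort by price within foreign, then number the observations within each group
bysort foreign (price): gen rank_in_group = _n

* Number of observations in each group
bysort foreign: gen group_size = _N

list make foreign price mean_price rank_in_group group_size in 1/5
```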

28. Data Management:
Collapsing Datasets
•  To create a dataset of means, sums etc., type:
collapse (stat) varlist1 (stat) … [, by(varlist2)]
•  stat can be mean, sd, sum, median or other statistics.
•  by(varlist2) specifies the groups over which the means etc. are to be calculated.

29. Data Management:
Collapsing Datasets (cont.)
•  Be sure to save data before attempting collapse as there is no “undo” facility.
•  Alternatively, wrap the collapse in preserve and restore, which put the original data back afterwards.
•  Example:
preserve
collapse weight length (median) price, by(foreign)
list
restore
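
A sketch of collapse with explicitly named result variables, wrapped in preserve and restore so that the original data come back afterwards. It assumes the auto example dataset; the names mean_price, med_mpg and n are illustrative.

```stata
sysuse auto, clear

preserve
* newvar = oldvar names each result; the statistic in parentheses applies
* to the variables that follow it
collapse (mean) mean_price = price (median) med_mpg = mpg (count) n = price, by(foreign)
list
restore

describe, short      // the full dataset is back in memory
```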

EXERCISE 3
30. Collapsing
•  Collapse the dataset "Combined state data.dta" by region to produce a dataset containing the means of inc and unemplrate (Slide 28)
•  Which region has historically had the lowest unemployment rate?
•  Which region has historically had the highest income level?
•  Do not save the new dataset. Instead, use outsheet to export the new dataset into an Excel file named "collapse.xls"

31. Panel Data Analysis:
What is Panel Data?
•  Panel data generally refer to the repeated observation of a set of fixed entities at fixed intervals of time (also known as longitudinal data).
•  Stata is particularly good at arranging and analyzing panel data.
•  Stata refers to two panel display formats:
–  Wide form: useful for display purposes and often the form in which data are obtained.
–  Long form: needed for regressions etc.

32. Panel Data Analysis:
Wide Form
Example of wide form:
i             xij
id  sex  inc1999  inc2000  inc2001
1   0    5000     5500     6000
2   1    2000     2200     3300
3   0    3000     2000     1000
•  Note the naming convention for inc.

33. Panel Data Analysis:
Long Form
Example of long form:
i   j          xij
id  year  sex  inc
1   1999  0    5000
1   2000  0    5500
1   2001  0    6000
2   1999  1    2000
2   2000  1    2200
2   2001  1    3300
3   1999  0    3000
3   2000  0    2000
3   2001  0    1000

34. Panel Data Analysis:
Reshape Command
•  To change from wide to long form, type:
reshape long varnames, i(varlist) [j(varname)]
•  Example:
reshape long inc, i(id) j(year)
•  To change from long to wide form, type:
reshape wide varnames, i(varlist) [j(varname)]
•  Example:
reshape wide inc, i(id) j(year)

35. Panel Data Analysis:
Dummy Variables
•  Dummy variables take the values 0 and 1 only. These are very useful in panel data (and elsewhere).
•  Large sets of dummy variables can be created with:
tab varname, gen(dummyname)
•  When using large numbers of dummies in regressions, it is useful to name them with a pattern, e.g. ind1, ind2… Then ind* can be used to refer to all variables beginning with ind.
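
The reshape examples can be tried on the wide-form table from the Wide Form slide. The sketch below types that table in with input and converts it to long form and back.

```stata
clear
input id sex inc1999 inc2000 inc2001
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* Wide to long: the inc stub becomes one variable, its suffix goes into year
reshape long inc, i(id) j(year)
list, sepby(id)

* Long back to wide
reshape wide inc, i(id) j(year)
list
```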

36. Panel Data Analysis:
Dummy Variables (cont.)
group
1
3
2
1
2
tab group, gen(g)
group  g1  g2  g3
1      1   0   0
3      0   0   1
2      0   1   0
1      1   0   0
2      0   1   0

37. Panel Data Analysis:
Lag Variables
•  Assuming the data are in chronological order, lags can be created with:
gen lagname = varname[_n-1]
•  Similarly _n+1 gives lead.
•  Care must be taken with panel data (in long form) so that first observation in each state etc. has a missing value. Use, for example:
use sp500, clear
gen lag_vol=volume[_n-1]
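
For panel data in long form, one way to make the first observation of each unit missing is to build the lag within groups. The sketch below uses a small made-up panel typed in with input (the values are illustrative only) and shows two approaches: the by prefix, and the L. time-series operator after xtset.

```stata
clear
input id year inc
1 1999 5000
1 2000 5500
1 2001 6000
2 1999 2000
2 2000 2200
2 2001 3300
end

* Lag within each id, after sorting by year inside id;
* the first year of every id gets a missing value
bysort id (year): gen lag_inc = inc[_n-1]

* Alternative: declare the panel structure and use the lag operator
xtset id year
gen lag_inc_ts = L.inc

list, sepby(id)
```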

EXERCISE 4
38. Manipulating a Panel
•  Open the dataset “Combined state data.dta” (Part I)
•  Drop all variables other than state, year and unemplrate using the keep command - quicker than using drop (Part I)
•  Use the reshape wide option to rearrange the data so that the first column represents the state and the other columns contain unemplrate for a particular year (Slide 34)

EXERCISE 4 (cont.)
39. Manipulating a Panel
•  The data are now in a form that is easily interpreted and could be copied into Excel etc. for presentation.
•  Return the data to long form using reshape long (Slide 34)

40. Basic Data Analysis:
Correlation
•  To obtain the correlation between a set of variables, type:
correlate [varlist] [, covariance _coef]
•  covariance option displays the covariances rather than the correlation coefficients.
•  _coef option displays the correlations (or covariances) between the coefficients of the most recently estimated model (varlist not specified in this case).

41. Basic Data Analysis:
Correlation (cont.)
•  pwcorr displays all the pairwise correlation coefficients between the variables in varlist:
pwcorr [varlist][, sig]
•  sig option adds a line to each row of matrix reporting the significance level of each correlation coefficient.
•  Difference between correlate and pwcorr is that the former performs listwise deletion of missing observations while the latter performs pairwise deletion.
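
The difference between listwise and pairwise deletion shows up when a variable has missing values. A sketch assuming the auto example dataset, in which rep78 is missing for some cars; the obs option of pwcorr is added here to display the number of observations behind each coefficient.

```stata
sysuse auto, clear

* Listwise deletion: only cars with all three variables non-missing are used
correlate price mpg rep78

* Pairwise deletion: each coefficient uses every car available for that pair
pwcorr price mpg rep78, obs sig
```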

42. Regression Analysis:
Regression Basics
•  To perform a linear regression of depvar on varlist, type:
regress depvar [varlist] [if exp] [in range] [, noconstant]
•  depvar is the dependent variable.
•  varlist is the set of independent variables (regressors).
•  By default Stata includes a constant. The noconstant option excludes it.
•  Example:
regress mpg weight length

43. Regression Analysis:
Regression Basics (cont.)
•  Some more sophisticated estimators:
–  Logit
logit depvar varlist
–  Probit
probit depvar varlist
–  Panel regression
xtreg depvar varlist
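
A few variations on the regression syntax, assuming the auto example dataset loaded with sysuse. The choice of foreign as the outcome of the logit is only for illustration.

```stata
sysuse auto, clear

* Restrict the estimation sample with an if qualifier
regress mpg weight length if foreign == 0

* Same regressors, without the constant term
regress mpg weight length, noconstant

* Logit with a 0/1 dependent variable
logit foreign mpg weight
```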

44. Regression Analysis:
Prediction
•  After all estimation commands (e.g. regress, logit) several predicted values can be computed using predict.
•  predict refers to the most recent model estimated.
•  predict yhat, xb creates a new variable yhat equal to the predicted values of the dependent variable.
•  predict res, residual creates a new variable res equal to the residuals.

45. Regression Analysis:
Testing
•  Linear hypotheses can be tested (e.g. t-test or F-test) after estimating a model by using test.
•  test varlist tests that the coefficients corresponding to every element in varlist jointly equal zero.
•  test eqlist tests the restrictions in eqlist.
•  The option accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.
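
A sketch that runs predict and test after a single regression, assuming the auto example dataset. The variable names yhat and res follow the slide; the equality restriction in the last line is only there to illustrate the eqlist form.

```stata
sysuse auto, clear
regress mpg weight length

predict yhat, xb           // fitted values from the regression just estimated
predict res, residuals     // residuals from the same model
summarize res              // OLS residuals average to zero when a constant is included

test weight length         // joint test that both coefficients are zero
test weight = length       // test that the two coefficients are equal
```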

46. Regression Analysis:
Testing (Example)
•  Example:
regress mpg weight length
test weight
test length, accum

EXERCISE 5
47. Simple Regression Analysis
•  Open the dataset “US Economic Data.dta” (Part I)
•  Create realgdpcap (real GDP per capita) by dividing realgdp by population and multiplying by 1 billion, since all GDP figures have been in billions of dollars (Part I)
•  Compute the pairwise correlation between realgdpcap and unemplrate. Is it significant at the 10% level? (Slide 41)

EXERCISE 5 (cont.)
48. Simple Regression Analysis
•  Run a linear regression explaining realgdpcap in terms of unemplrate, persbankrupt and treasurybillrate. Which are significant regressors (at the 5% level)? (Slide 42)
•  Save the dataset, clear the memory and open the dataset “Combined state data.dta”. (Part I)
•  Create a set of 51 dummy variables for the states, using: tab stateabb, gen(st). (Slide 35)

EXERCISE 5 (cont.)
49. Simple Regression Analysis
•  Regress inc on unemplrate and the set of dummy variables without a constant (Slide 42), using:
regress inc unemplrate st1-st50, nocons
•  Perform a test of the null hypothesis that the coefficient on unemplrate is zero. (Slide 45)

50. Graphical Data Exploration:
Types of Graphs
•  To obtain a basic histogram of varname, type:
histogram varname, discrete freq normal
•  To draw a boxplot, type:
graph box varname
•  To draw a bar chart, type:
graph bar [(stat)]varname
•  To display a scatterplot of two (or more) variables, type:
scatter varlist
•  To draw line plots, type:
line varlist

51. Graphical Data Exploration:
Histogram
histogram rep78, discrete freq
[Figure: histogram of rep78; x-axis: Repair Record 1978; y-axis: Frequency]

52. Graphical Data Exploration:
Boxplot
graph box mpg, by(foreign)
[Figure: box plots of Mileage (mpg) in two panels, Domestic and Foreign; note: Graphs by Car type]

53. Graphical Data Exploration:
Bar Chart
graph bar (mean) price, over(foreign)
[Figure: bar chart of mean of price for Domestic and Foreign]

54. Graphical Data Exploration:
Scatter Plot
scatter mpg weight
[Figure: scatterplot of Mileage (mpg) against Weight (lbs.)]

55. Graphical Data Exploration:
Linear Prediction Plot
twoway (scatter mpg weight) (lfit mpg weight)
[Figure: Mileage (mpg) and Fitted values against Weight (lbs.)]

56. Graphical Data Exploration:
Line Plot
sysuse sp500,clear
line high date
[Figure: line plot of High price against Date, 01jan2001 to 01jan2002]

57. Graphical Data Exploration:
Graph Options
•  There are options for (among other things):
–  Adding a title: title
–  Altering the scale of the axes: xscale, yscale
–  Specifying what axis labels to use: xlabel, ylabel
–  Changing the markers used: msymbol
–  Changing the connecting lines: connect

58. Graphical Data Exploration:
Graph Options (cont.)
•  Example:
use sp500,clear
scatter volume date, title("S&P 500") yscale(log) xlabel(15035 15249) ylabel(10000 20000) msymbol(oh) connect(l)

59. Graphical Data Exploration:
Graph Options (cont.)
•  Particularly useful is mlabel(varname) which uses the values of varname as markers in the scatterplot.
•  Example:
scatter mpg weight, mlabel(make)

60. Graphical Data Exploration:
Saving Graphs
•  Graphs are not saved by log files (separate windows).
•  Select File → Save Graph.
•  To insert in a Word document etc., select Edit → Copy and then paste into Word document. This can be resized but is not interactive (unlike Excel charts etc.).

EXERCISE 6 (cont.)
61. Graphs
•  Open the dataset “Combined state data.dta”. (Part I)
•  Create a dataset of time-averaged data using the collapse command. Specifically, create a dataset, by stateabb, containing the means of inc and unemplrate (Slide 28)
•  Create a scatterplot of inc against unemplrate using stateabb as the marker. (Slides 54 & 57)
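
Graphs can also be saved from a do-file rather than through the menus. A sketch assuming the auto example dataset; the titles and the file names mpg_weight.gph and mpg_weight.png are hypothetical, and the available export formats may depend on the platform.

```stata
sysuse auto, clear

* Scatterplot with a fitted line, a title and axis titles
twoway (scatter mpg weight, msymbol(oh)) (lfit mpg weight), ///
    title("Mileage against weight")                          ///
    xtitle("Weight (lbs.)") ytitle("Mileage (mpg)")

graph save "mpg_weight.gph", replace      // Stata's own format, can be reopened and edited
graph export "mpg_weight.png", replace    // image file for Word or other documents
```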

62. Good-to-Know Stata Commands
clear     use        save      describe  log
label     drop       keep      rename    replace
generate  summarize  list      tabulate  table
append    merge      collapse  sort      correlate
regress   test       predict   scatter   line

63. The End
Thank you!

64. Solutions
do "c:\stataworkshop\Part 1 commands.do"

Exercise 1
cd c:\stataworkshop
use "US economic data", clear
gen yearstr = string(year)
gen decade = substr(yearstr,3,1)
* gen decade = substr(string(year),3,1)
tab decade highun
table decade highun, contents(mean realgdp sum persbankrupt)

Exercise 2
use "State admission data", clear
sort state
save "State admission data", replace
use "Stata course data 3", clear
sort state
merge state using "State admission data"
tab _merge
replace region="South" if stateabb=="DC"
save "Combined state data"

65. Solutions (cont.)
Exercise 3
use "Combined state data",clear
collapse (mean) inc unemplrate, by(region)
outsheet using "collapse.xls"

Exercise 4
use "Combined state data",clear
keep state year unemplrate
reshape wide unemplrate, i(state) j(year)
reshape long unemplrate, i(state) j(year)

66. Solutions (cont.)
Exercise 5
use "US economic data",clear
gen realgdpcap=1000000000*realgdp/population
pwcorr realgdpcap unemplrate, sig

regress realgdpcap unemplrate persbankrupt treasurybillrate


save "US economic data", replace
use "Combined state data", clear
tab stateabb, gen(st)
regress inc unemplrate st1-st50, nocons
test unemplrate

Exercise 6
use "Combined state data", clear
collapse (mean) inc unemplrate, by(stateabb)
scatter inc unemplrate, mlabel(stateabb)

