R. Gutierrez (StataCorp)
1. About Stata
2. Using svyset
3. Data analysis
4. Bootstrapping with replicate weights
5. Concluding remarks
Founded in 1982 in Santa Monica, CA, under the name CRC
Bill Gould and Finis Welch, UCLA
Sold time on a mainframe
Stata 1.0 released January 1985
Gave up mainframe business in 1986
Relocated to College Station, TX in 1993
Changed name to Stata Corporation at that time
Later created Stata Press, a division of StataCorp
Pronounce it any way you like. We say it should rhyme with "data"
Stata is a name, however, and not an acronym (STATA)
Available on virtually all platforms
  Windows
  Macintosh
  Multiple flavors of Unix: Linux, IBM AIX, Sun Solaris, SGI Irix, DEC Alpha
Current version is Stata 10.0
Major versions (e.g., Stata 8, Stata 9, Stata 10) are sold
Minor versions (e.g., Stata 9.1) are free updates
Other additions/fixes are also free updates
Updates are done over the web
As of Stata 10, over 8,000 pages
Organized in thirteen volumes
  [GS]   Getting Started
  [U]    User's Guide
  [D]    Data Management
  [P]    Programming
  [R]    Base Reference (3 volumes)
  [ST]   Survival Analysis/Epidemiological Tables
  [TS]   Time Series
  [XT]   Panel/Longitudinal Data
  [SVY]  Survey Data
  [MV]   Multivariate Statistics
  [G]    Graphics
Regression with Survey Data About Stata Graphical User Interface (GUI)
You can point and click all the way
Main menus are Data, Graphics, and Statistics
Filling out a dialog box generates the needed command
As such, it is a great way to learn Stata
You can choose to work interactively by typing commands and/or using the menus
You can also work through editable scripts of commands, known as do-files (reproducibility)
Stata 10.0 is fully survey-capable
In Stata, there is a clear separation between setting the design and performing the actual analysis
You declare the design characteristics using svyset
This declaration is a one-time event. You save the survey settings along with the data
You perform the analysis just as you would with i.i.d. data; you just have to add the svy: prefix
As such, survey in Stata is as easy as learning to use svyset
R. Gutierrez (StataCorp), October 16, 2007
Example
Consider data on American high school seniors, collected following a multistage design
Sex, race, height, and weight were recorded
In the first stage of sampling, counties were independently selected from each state
In the second stage, schools were selected within each chosen county
Within each school, every attending senior took the survey
The data are at http://www.stata-press.com, easily accessible from within Stata
. use http://www.stata-press.com/data/r10/multistage
. describe
Contains data from http://www.stata-press.com/data/r10/multistage.dta
  obs:         4,071
 vars:            11                          29 Mar 2007 00:53
 size:       122,130 (98.8% of memory free)

               storage  display    value
variable name    type   format     label     variable label
sex              byte   %9.0g      sex       1=male, 2=female
race             byte   %9.0g      race      1=white, 2=black, 3=other
height           float  %9.0g                height (in.)
weight           float  %9.0g                weight (lbs.)
sampwgt          double %9.0g                sampling weight
state            byte   %9.0g                State ID (strata)
county           byte   %9.0g                County ID (PSU)
school           byte   %9.0g                School ID (SSU)
id               int    %9.0g                Person ID
ncounties        byte   %9.0g                Stage 1 FPC
nschools         int    %9.0g                Stage 2 FPC

Sorted by:  state  county  school
. svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
      pweight: sampwgt
          VCE: linearized
  Single unit: missing
     Strata 1: state
         SU 1: county
        FPC 1: ncounties
     Strata 2: <one>
         SU 2: school
        FPC 2: nschools
. save highschool
file highschool.dta saved
Since we save the data with the survey settings as highschool.dta, we don't ever have to specify the design again; it is part of the dataset.
Other features of svyset include:
You can have more than two stages, each separated by ||
The default variance estimation is set to Taylor linearization, but you could also choose the jackknife, or balanced and repeated replication (BRR)
You can also set replicate weight variables to mimic jackknife and BRR estimation, if you are privacy conscious
You can also tell Stata how you would like to treat strata with singleton PSUs
You can treat them either as an error condition (missing), or as certainty units that can be centered and/or scaled
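As a rough illustration of these options, the do-file sketch below redeclares the multistage design with jackknife variance estimation and with singleton-PSU strata treated as certainty units. It is only one possible combination, not a recommendation. The replicate-weight line is left commented out because the names finalwgt and brr_* are hypothetical placeholders, not variables in this dataset.

```stata
* Sketch only: an alternative svyset declaration for the multistage data.
use http://www.stata-press.com/data/r10/multistage, clear

* Same two-stage design as before, but request the jackknife instead of
* linearization, and treat strata with a single PSU as certainty units.
svyset county [pw=sampwgt], strata(state) fpc(ncounties) ///
    vce(jackknife) singleunit(certainty)                 ///
    || school, fpc(nschools)

* Typing svyset alone replays the stored settings.
svyset

* Hypothetical replicate-weight declaration (variable names are assumptions):
* svyset [pw=finalwgt], brrweight(brr_*) vce(brr)
```

With replicate weights, the PSU and strata identifiers need not be released with the data, which is the privacy point made above.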
. svydescribe weight
Survey: Describing stage 1 sampling units
      pweight: sampwgt
          VCE: linearized
  Single unit: missing
     Strata 1: state
(output omitted )

                               #Obs with  #Obs with    #Obs per included Unit
           #Units    #Units    complete   missing
Stratum   included   omitted     data       data       min      mean      max
      1          2         0        92         0        34      46.0       58
      2          2         0       112         0        51      56.0       61
      3          2         0        43         0        18      21.5       25
      4          2         0        37         0        14      18.5       23
(output omitted )
     47          2         0        67         0        28      33.5       39
     48          2         0        56         0        23      28.0       33
     49          2         0        78         0        39      39.0       39
     50          2         0        64         0        31      32.0       33

     50        100         0      4071         0        14      40.7       81
                                  4071
To get some means and confidence intervals, treating the data as a simple random sample, you would type
. mean height weight, over(sex)
Mean estimation                     Number of obs   =    4071
         male: sex = male
       female: sex = female

        Over         Mean   Std. Err.     [95% Conf. Interval]
height
        male     69.22091   .0737168      69.07639    69.36544
      female     65.48295   .0615088      65.36236    65.60354
weight
        male     163.0539   .7094428       161.663    164.4448
      female     138.0472   .7112746      136.6527    139.4416
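(output omitted )
                                    Number of obs   =    4071
                                    Population size = 8.0e+06
                                    Design df       =      50

        Over         Mean   Std. Err.     [95% Conf. Interval]
height
        male     69.64261   .1187832      69.40403    69.88119
      female     65.79278   .0709494      65.65027    65.93529
weight
        male     165.4809   1.116802      163.2377    167.7241
      female      136.204   .9004157      134.3955    138.0125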
(output omitted )
     [95% Conf. Interval]
     -28.5869    -9.729724
     .0977517     .2388083
     11.61581     18.15656
     351.7408     982.0467
This also works for nonlinear models, such as logistic regression
Let's use the NHANES2 data
. use http://www.stata-press.com/data/r10/nhanes2d, clear
. svyset
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>
Typing svyset without arguments will replay the survey settings for you
We can use these data to fit a logit model for high blood pressure, and get survey-adjusted odds ratios and standard errors
. svy: logistic highbp height weight age female
(running logistic on estimation sample)
Survey: Logistic regression
Number of strata   =      31
Number of PSUs     =      62
F(   4,     28)
(output omitted )

                Linearized
      highbp    Std. Err.      [95% Conf. Interval]
      height    .0056821       .9573369    .9805151
      weight    .0032829       1.045814    1.059205
         age    .0024816       1.045424    1.055547
      female    .0641185       .6053533    .8683151
This is not the same as throwing away the data on males, and Stata knows this
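A minimal sketch of what a subpopulation analysis might look like, assuming the goal is to fit the blood-pressure model for women only. The choice of covariates here is an assumption for illustration; the command shown on the original slide is not reproduced in this text.

```stata
* Sketch only: subpopulation estimation on the NHANES2 data.
use http://www.stata-press.com/data/r10/nhanes2d, clear

* subpop() restricts estimation to observations with female != 0,
* while the full sample is still used to compute the standard errors.
svy, subpop(female): logistic highbp height weight age
```

Dropping the male observations before estimation would instead discard part of the design information, so the standard errors could differ.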
When performing simultaneous tests, denominator degrees of freedom need to be adjusted for strata and PSUs
. test height weight
Adjusted Wald test
 ( 1)  height = 0
 ( 2)  weight = 0
       F(  2,    30) =   58.21
            Prob > F =    0.0000
. test height weight, nosvyadjust
Unadjusted Wald test
 ( 1)  height = 0
 ( 2)  weight = 0
       F(  2,    31) =   60.15
            Prob > F =    0.0000
Other postestimation routines, such as linear combinations of estimates, and nonlinear tests and combinations can also be applied after survey estimation
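For instance, the standard postestimation commands lincom, nlcom, and testnl could be run after the survey logistic fit along the following lines. The particular combinations are arbitrary and chosen only to show the syntax.

```stata
* Sketch only: postestimation after survey estimation.
use http://www.stata-press.com/data/r10/nhanes2d, clear
svy: logistic highbp height weight age female

* Linear combination of the height and weight coefficients
lincom height + weight

* Nonlinear combination: ratio of the two coefficients
nlcom _b[height]/_b[weight]

* Nonlinear test: is that ratio equal to 1?
testnl _b[height]/_b[weight] = 1
```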
After fitting the model, you can obtain design effects due to the survey design by using estat
. estat effects
     Jackknife
     Std. Err.
     .0094699
     .0042651
     .0033482
     1.561851
(output omitted )
. estat effects, meff meft
     Jackknife
     Std. Err.
     .0094699
     .0042651
     .0033482
     1.561851
(output omitted )
Data analysis: Regression for survival data
Semiparametric Cox and fully parametric (e.g., Weibull) regression models can be fit with survey data
Declaring survival data to Stata works similarly to declaring survey data
In the case of survival data, you declare time variable(s), censoring indicators, sampling weights, etc.
These declarations layer over the survey declarations, and Stata makes sure there are no conflicts
Of course, survival settings can also be saved with the data
. use http://www.stata-press.com/data/r10/nhefs, clear
. svyset psu2 [pw=swgt2], strata(strata2)
      pweight: swgt2
          VCE: linearized
  Single unit: missing
     Strata 1: strata2
         SU 1: psu2
        FPC 1: <zero>
. stset age_lung_cancer [pw=swgt2], fail(lung_cancer)
     failure event:  lung_cancer != 0 & lung_cancer < .
obs. time interval:  (0, age_lung_cancer]
 exit on or before:  failure
            weight:  [pweight=swgt2]

    14407  total obs.
     5126  event time missing (age_lung_cancer>=.)       PROBABLE ERROR

     9281  obs. remaining, representing
       83  failures in single record/single failure data
   599691  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        97
. svy: stcox former_smoker smoker male urban1 rural
(running stcox on estimation sample)
Survey: Cox regression
Number of strata   =      35
Number of PSUs     =     105
F(   5,     66)
(output omitted )

                   Linearized
           _t      Std. Err.      [95% Conf. Interval]
former_smoker      .6205102       1.788705    4.345923
       smoker      2.593249       4.061457    15.17051
         male      .3445315       .6658757    2.118142
       urban1      .3285144       .3555123    1.816039
        rural      .5281859       .8125799    3.078702
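The fully parametric counterpart mentioned earlier could be fit in much the same way. The sketch below assumes a Weibull model with the same covariates as the Cox fit; that specification is an illustrative choice and is not taken from the slides.

```stata
* Sketch only: Weibull survival regression with survey data.
use http://www.stata-press.com/data/r10/nhefs, clear

* Declare the survey design, then layer the survival declaration on top.
svyset psu2 [pw=swgt2], strata(strata2)
stset age_lung_cancer [pw=swgt2], fail(lung_cancer)

* Same covariates as the Cox model, with a Weibull baseline hazard.
svy: streg former_smoker smoker male urban1 rural, distribution(weibull)
```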
Bootstrapping with replicate weights: User-written command
Another variance estimate is based on bootstrapping with replicate weights
This is not part of official Stata, but easily installed from the web as a user-written program
The author is Jeff Pitblado (jpitblado@stata.com) of StataCorp, so in a way it is official
It will eventually be part of official Stata.
Bootstrapping with replicate weights: Installing bs4rw
But the above assumes you know where to go. An alternative is to type
. findit survey bootstrap
and follow the links to install. findit is like Google for Stata.
Bootstrapping with replicate weights: Running bs4rw
bs4rw is a prefix command, analogous to svy:. It works with all the commands that work with svy:
. use http://www.stata-press.com/data/r10/autorw, clear
(1978 Automobile Data)
. bs4rw, rweights(boot*): regress mpg for weight
(running regress on estimation sample)
BS4Rweights replications (300)
(output omitted )
Linear regression
(output omitted )

               Observed    Bootstrap         Normal-based
         mpg      Coef.    Std. Err.     [95% Conf. Interval]
     foreign  -1.650029    1.065621     -3.738608    .4385502
      weight  -.0065879    .0005102     -.0075879   -.0055879
       _cons    41.6797    1.666637      38.41315    44.94625
StataCorp is your friendly, neighborhood statistical software company
Current version is Stata 10.0, and it is fully functional for survey data
The key is to master svyset, and we are happy to help out here
Multistage designs work just fine, as do Cox regression and parametric survival models
Bootstrapping based on replicate weights is available as a user-written add-on