Sunteți pe pagina 1din 78

A Gentle Introduction to Stata

A Gentle Introduction to Stata

Alan Acock
Oregon State University

A Stata Press Publication


STATA CORPORATION
College Station, Texas

Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

c 2005 by StataCorp LP
Copyright
All rights reserved
Typeset in LATEX 2
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN !!
This book is protected by copyright. All rights are reserved. No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any meanselectronic,
mechanical, photocopying, recording, or otherwisewithout the prior written permission of
StataCorp LP.
Stata is a registered trademark of StataCorp LP. LATEX 2 is a trademark of the American
Mathematical Society.

Acknowledgements
I would like to acknowledge the support of the Stata staff who have worked with
me on this project. Special thanks goes to Lisa Gilmore, the Production Manager,
xxxxx, the Copy Editor, and xxx for verifying all the commands used in this volume. I
also want to thank my students who have tested my ideas for the book. They are too
numerous to mention, but special thanks goes to Patricia Meierdiercks and Shannon
Wanless.
Bennet Fauber, during the time he was affiliated with Stata Corporation, provided
hours and hours of support on all aspects of this project. He taught me the LATEX 2
document preparation system used by Stata Press and his patience with many of my
problems and mistakes has inspired me to have more patience with my own students.
Bennet also had a major input on the topical coverage and organization of this volume.
He provided the initial draft of chapter 4 and his superior expertise on Stata commands,
data management, and do-files was critical. Bennet also provided extensive editorial
suggestions and substantive editing for the first three chapters of the book. Whatever
quality this book has owes an enormous debt to Bennets conceptual and technical
contributions. The books completion owes a lot to his encouragement.
Finally, I would like to thank my wife, Toni Acock, for her support and for her
tolerance for my endless excuses for why I could not do things. She had to pick up
many tasks I should have done and she usually smiled when told it was because I had
to finish this book.

Contents
Preface

xiii

Notation and Typography


1 Getting Started

xv
1

1.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2

The Stata Screen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3

Using an existing dataset . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4

An example of a short Stata session . . . . . . . . . . . . . . . . . . . .

1.5

Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.6

Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

1.7

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

17

2 Entering Data

19

2.1

Creating a dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

2.2

An example questionnaire . . . . . . . . . . . . . . . . . . . . . . . . .

21

2.3

Develop a coding system . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.4

Entering data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.4.1

Labeling values . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.5

Saving your dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

31

2.6

Checking the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

32

2.7

Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

2.8

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

3 Preparing Data for Analysis


3.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37
37

viii

Contents
3.2

Plan your work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37

3.3

Create value labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

41

3.4

Reverse code variables . . . . . . . . . . . . . . . . . . . . . . . . . . .

43

3.5

Create and modify variables . . . . . . . . . . . . . . . . . . . . . . . .

46

3.6

Create scales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

3.7

Save some of your data . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

3.8

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

56

3.9

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

Author index

59

Subject index

61

List of Tables
2.1

Example questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

2.2

Example codebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

2.3

Example coding sheet . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

3.1

Sample project task list . . . . . . . . . . . . . . . . . . . . . . . . . .

39

3.2

NLSY97 sample codebook entries . . . . . . . . . . . . . . . . . . . . .

40

3.3

Reverse coding plan . . . . . . . . . . . . . . . . . . . . . . . . . . . .

43

3.4

Arithmetic symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

List of Figures
1.1

Statas opening screen . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2

The . Prefs . Save Windowing Preferences menu . . . . . . . . . . . . .

1.3

The . Prefs . Stata compact setting appearance menu . . . . . . . . . .

1.4

The Stata screen layout used in this book . . . . . . . . . . . . . . . .

1.5

Statas tool bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.6

Stata command to open cancer dataset . . . . . . . . . . . . . . . . . .

1.7

Histogram of age . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.8

Histogram dialog box . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

1.9

The tabs on the histogram dialog box . . . . . . . . . . . . . . . . . .

12

1.10 The Title tab of the histogram dialog box . . . . . . . . . . . . . . . .

13

1.11 The Options tab of the histogram dialog box . . . . . . . . . . . . . .

13

1.12 First attempt at an improved histogram . . . . . . . . . . . . . . . . .

14

1.13 Final histogram of age . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.1

Data editor window . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.2

Variable name and variable label . . . . . . . . . . . . . . . . . . . . .

27

2.3

Define schemes dialog box . . . . . . . . . . . . . . . . . . . . . . . . .

29

2.4

Define schemes dialog box . . . . . . . . . . . . . . . . . . . . . . . . .

29

2.5

Describe dataset dialog box . . . . . . . . . . . . . . . . . . . . . . . .

34

3.1

Create new variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3.2

Recode: specifying recode rules on the main tab . . . . . . . . . . . . .

44

3.3

Recode: specifying new variable name on the options tab

. . . . . . .

44

3.4

Create new variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

3.5

Two-way tabulation dialog . . . . . . . . . . . . . . . . . . . . . . . . .

50

xii

List of Figures
3.6

The extended generate dialog . . . . . . . . . . . . . . . . . . . . . . .

52

3.7

The extended generate dialog . . . . . . . . . . . . . . . . . . . . . . .

53

3.8

Selecting variables to drop . . . . . . . . . . . . . . . . . . . . . . . . .

55

3.9

Selecting observations with an expression . . . . . . . . . . . . . . . .

55

Preface
This book was written with a particular reader in mind. This reader needs to learn
Stata, but has no prior experience with other statistical software packages and is learning social statistics. When I learned Stata myself I found no books that I felt were
written explicitly for this reader. There are certainly excellent books on Stata, but they
assumed prior experience with other packages such as SAS or SPSS, they assumed a
fairly advanced working knowledge of statistics, or both of these. These books are able
to move more quickly to more advanced topics, but they left my intended reader in the
dust. Readers who have more background in statistical software and statistics than I
am assuming here will be able to read chapters quickly and even skip sections. The goal
is to move the true beginner to a level of competence using Stata.
With our target reader in mind, I make far more use of the Stata menu system than
any other books about Stata. Advanced users may not see the value to using the menus
and the more people learn about Stata the less they will rely on the menus. Also, even
when using the menu system it is still important to save a record of the sequence of
commands you ran. Even though I rely on the commands much more than the menus
in my own work, I still find value in the menus. They include many options that I might
not have known or might have forgotten. This is most evident with graphs where the
visual quality of graphs can be greatly enhanced using the menu system.
To illustrate the menu system as well as graphics, I have included over 80 figures,
many of which show menus. There are numerous tables and extensive Stata results
that are presented as they appear on the screen and are given a substantive interpretation. This is done in the belief that beginning Stata users need to learn more than just
how to produce the results. It is also necessary to go through the results and interpret
them.
I have tried to use real data. There are a few examples where it is just too much
easier to illustrate a point with hypothetical data, but for the most part, I use data
that is in the public domain. The General Social Survey for 2002 is used in many
chapters as is the National Survey of Youth, 1997. Ive simplified the files by dropping
many of the variables in the original datasets, but Ive kept all of the observations. I
have tried to pick examples from a variety of social science fields and I have included
additional variables so that instructors as well as readers can make additional examples
and exercises that are tailored to their discipline. People who are use to working with
statistics books that have contrived data with just a few observations, presumably so
work can be done by hand, may be surprised to see over 1,000 observations in our
datasets. Working with these files provides better experience for other real world data

xiv

Preface

analysis.
The exercises use the same datasets that are used in the rest of the book. A number
of the exercises require some data management prior to estimating a model. This is
done in the belief that learning data management requires a lot of practice and cannot
be isolated in a single chapter or single set of exercises.
This book takes the student through much of what is done in introductory and
intermediate statistics courses. We cover descriptive statistics, charts, graphs, tests of
significance for simple tables, tests for one and two variables, correlation and regression,
analysis of variance, multiple regression, and logistic regression. By combining this with
an introduction to creating and managing a dataset, students are well prepared to go
even further. More advanced statistical analysis using Stata is often even simpler from a
programming point of view than what we cover. If an intermediate course goes beyond
what we do with logistic regression to multinomial logistic regression, for example,
the programming is simple enough. The command logit is simply replaced with the
command mlogit. The added complexity of these advanced statistics is the statistics
themselves and not the Stata commands that implement them. Therefore, although
more advanced statistics are not included in this book, the reader who learns these
statistics will be more than able to learn the corresponding Stata commands from the
Stata documentation and help system.
I assume the reader is running Stata on Windows based PC. Stata works as well
or better on Macs and Unix systems. Readers who are running Stata on one of those
systems will have to make minor adjustments.
Alan C. Acock Corvallis, Oregon

Notation and Typography


We designed this book for you to learn by doing, so we expect you to read this book
while sitting at a computer so you can try using the sequences of commands contained
in the book to replicate our results. In this way, you will be able to generalize these
sequences to suit your own needs.
Generally, we use the typewriter font command to refer to Stata commands, syntax,
and variables. We show commands as they will appear in Stata output. That includes
the . prompt, which is not part of the command and should not be typed.
Except for some very small expository datasets, all the data we use in this book are
freely available for you to download. For example,
. use http://www.stata-press.com/data/!!!/gss 2002.dta, clear

Try it.
This text complements the material in the Stata manuals but does not replace it.
For example, how to generate tables and graphs are shown in Chapters 5 and 6, but
these are only a few of the possibilities described in the Stata Reference Manual. Our
hope is to give you sufficient background that you can use the manuals effectively.

Getting Started
1.1
1.2
1.3
1.4
1.5
1.6
1.7

1.1

Introduction
The Stata Screen
Using an existing dataset
An example of a short Stata session
Conventions
Chapter Summary
Exercises

Introduction

This book was written with the belief that the best way to learn data analysis is to
actually do it with real data. These days, doing statistics means doing statistics with a
computer.
Work along with the book
Although it isnt absolutely necessary, youll probably find it very helpful to have
Stata running while you read this book so you can follow along and experiment
for yourself when you have a question about something. Having your hands on a
keyboard and replicating the instructions in this book will make the lessons that
much more effective, but more importantly, youll get in the habit of just trying
something new when you think of it and seeing what happens. In the end, that is
how you will really learn how Stata works. The other great advantage to following
along is that you can save the examples we do for future use.

Stata is a powerful tool for analyzing data. Stata can make statistics and data
analysis fun because it does so much of the tedious work for you. A new user of Stata
should start by using the menus. As you learn more about Stata you will be able to
do more sophisticated analyses with Stata commands. Commands can be saved in files,
that Stata calls do-files so a series of commands can be run all at once, which can save a
great deal of time. Learning Stata well is an investment that will pay off in saved time
1

Chapter 1. Getting Started

later. Stata is constantly being extended with new capabilities, which can be installed
by Stata itself from the internet. Stata is a program that grows with you.
Stata is a command driven program. It has a remarkably simple command structure
that you use to tell it want you want it to do. You can use a menu to generate the
commands (this is a great way to learn the commands or prompt yourself if you dont
remember one exactly), or you can enter them directly. If we enter the summarize
command, well get a summary of all the variables in our dataset (mean, standard
deviation, number of observations, minimum value, and maximum value). Enter the
command tabulate gender and Stata will make a frequency distribution of the variable
called gender, showing you the number and percentage of men and women in your
dataset.
After you have used Stata for awhile, you may want to skip the menu and enter these
commands directly. When you are just beginning, however, it is easy to be overwhelmed
by all of the commands available in Stata. If you were learning a foreign language you
would have no choice but to memorize hundreds of common words right away. This is
not necessary when you are learning Stata because the menus are so easy to use. You
will learn how to use Statas menus to do much of your work. Even experienced users
can forget commands and they find the menus useful, especially for the more complex
procedures, such as complex graphs.
Stata has done a lot to make the menus as friendly as possible so you feel confident
using them. The menus often show many options, which control the results that are
shown and how they are displayed. Even experienced Stata programmers will discover
possibilities in the menus that they did not know existed. The menus have default
values that are often appropriate, so you may be able to do a great deal of work without
specifying any options.
As we progress, you will be doing more and more complex analyses. It is possible
to do this with the menus, however Stata lets you create files that contain a series of
commands that can be run all at once. These files, called do-files, are essential once you
have a large number of commands to run. You can reopen the do-file a week or even
months later and repeat exactly what you did. Keeping a record of what you did this
way is an obligation of scholars. The ability to replicate results of elaborate analyses
is impossible if you fail in this obligation. Fortunately, Stata makes this easy. You will
learn more about this in Chapter 4. The do-files that reproduce most of the tables,
graphs, and statistics for each chapter are available on the web page for this book.
We will cover many of the graphical and statistical ways of analyzing data that are
included in standard statistics textbooks. Because Stata is so powerful and easy to use,
we may include some analyses that are not covered in your textbook. If you come to
a procedure that you have not already learned in your statistics text, give it a try. If
it seems too daunting, you can skip that section and move on. On the other hand, if
your statistics textbook covers a procedure that we omit, you might search the menus
yourself. Chances are you will find it there.
Depending on your needs you might want to skip around in the book. We tend to

1.2 The Stata Screen

learn best when we need to know something, so skipping around to the things you know
will be useful immediately may be the best use of the book and your time. Some things,
though, require prior knowledge of other things, so if you are new to Stata, you may
find it best to work through the first four chapters carefully in and in order. After that,
you will be able to skip around more freely as your needs or interests take you.

1.2

The Stata Screen

When you open Stata you will see a screen that looks something like Figure 1.1.

Figure 1.1: Statas opening screen

You can rearrange the windows to look the way you want. You rearrange the windows
just as you would in other programs. A lot of experienced Stata users have particular
ways of arranging these screens and you should feel free to experiment with the layout.
There is a web page at UCLA that has movies showing you how to manipulate the
screen to your suiting (http://www.ats.ucla.edu/stat/stata/seminars/stata9/gui.htm) .
If you have Stata installed on your own computer, use the . Prefs menu to work
with your windowing preferences, as shown in Figure 1.2.
You can click on the . Prefs menu, . Manage Preferences, . Load Preferences, . Compact Windows settings option. This produces a screen like the one in Figure 1.3. Click

Chapter 1. Getting Started

Figure 1.2: The . Prefs . Save Windowing Preferences menu

on the tab that says Review (right side of Figure 1.3). You will see a small stick pin
with the point going to the left. Click on this pin one time and the pin will turn upright.
Next, click on the tab that says Variables and click on the stick pend for that window.
Now you should have two panels on the right side, with Review on the top and Variables
below it. The final thing we want to do is to get the Command window to go all the way
across the bottom. If you put your cursor in the blue bar at the top of the command
window, click and drag it you will see a diamond appear near the bottom of the screen
(ignore the diamonds on the other edges and in the middle of the screen). Continue to
drag your cursor until your cursor is over the diamond on the bottom of the screen and
let go. If you do this right, the screen will look like the one in Figure 1.4. This is the
screen layout we use in this book.
If you want to save this screen configuration click on . Prefs menu, . Manage Preferences, . Save Preferences, . New Preference Set. This opens a dialog window in which
you can enter a name. This will save the configuration under this name. In the future,
anytime you want this configuration you can load it. This flexibility is an extended
feature in Stata 9. You eventually will probably enjoy this flexibility, but it is a bit
confusing the first time you use this feature. You can check out the UCLA movie for an
extensive explanation of how to do this. If you cant get it to work exactly as we have
here, it is always easy to select the compact setting option.
If you are in a computer lab, the computers may be set up to erase all your changes
when the computer is restarted. If this is the case, there is nothing you can do to
preserve your preferences. Luckily, it is easy to rearrange the screen the way you like it
when you return to Stata.
Statas main window contains four other windows, each of which has a name that
we will use in the rest of the book. The window name appears in the title bar of each,
and they are Review, Variables, Results, and Command.
When you open a file that contains Stata data, which well call a Stata dataset, a
list of the variables will appear in the Variables window.
When Stata executes a command, it prints any output into the Results window.
The command itself will also be printed to the Results window, preceded by a . (dot)
prompt, and the command will be followed immediately by the output the command
generates. The commands you run are also kept in the Review window. If you click
on one of the commands listed in the Review window, it will appear in the Command

1.2 The Stata Screen

Figure 1.3: The . Prefs . Stata compact setting appearance menu

Chapter 1. Getting Started

Figure 1.4: The Stata screen layout used in this book

1.3 Using an existing dataset

window.
When not using the menu system, the Command window is where commands you
want to execute get entered. You can recall commands from the Review window or type
them in yourself. You can also edit commands that appear in the Command window.
You can also use the PgUp and PgDn keys to recall commands from the Review window.
Well illustrate all of these methods in the coming chapters.
The grey bar at the bottom is called the status bar, and it will display the current
working directory (folder). This will most likely be either C:\Stata8 or C:\Data. This
is the active folder where Stata will save data. It is possible to change this and to store
data wherever you wish.
Stata has the usual Windows title bar, on the right of which appear the three
buttons (in order from left to right) to minimize, to expand to full-screen, and to close
the program.
Immediately below the title bar is the menu bar, where the names of the menus
appear, as shown in Figure 1.2 with the . Prefs menu pulled down. Some of the menu
items will look familiar because they are used in other programs such as . File, . Window,
and . Edit. The . Data, . Graphics, and . Statistics are more likely to be new, but their
names provide a good idea of what you will find under them.
Below the title bar is the toolbar, shown in Figure 1.5, that contains icons (or tool

Figure 1.5: Statas tool bar

buttons) that provide alternate ways to do some of the things you would normally do
with the menus. The most useful of these are the four leftmost icons. From left to right,
they are icons to open a dataset, to save a dataset, to print the contents of the Results
window, and to start or stop a log of your session. A complete list of the toolbar icons
and their functions appears in the Getting Started with Stata for Windows manual.

1.3

Using an existing dataset

In Chapter 2 we will learn how to create our own dataset, save it, and use it again.
We will also learn how to use datasets that are on the internet. For now we will use a
simple dataset that came with Stata. Although we could use the menu to do this, for
now we will enter a simple command. Click once in the Command window to put the
cursor there, then type the command sysuse cancer, clear so the window looks like
the one in Figure 1.6.
Note that we used a typewriter font to indicate the text to type into the Command
window. Stata commands do not have any special characters at the end, so if you see
a command that seems to end with a punctuation mark, the punctuation is not part of

Chapter 1. Getting Started

Figure 1.6: Stata command to open cancer dataset

the command. Otherwise, anything that should be entered as a command will appear
in a typewriter font. There may be times when we put a command onto a line by itself.
When we do this we will have the dot preceding it to be consistent with Stata manuals,
as in
. sysuse cancer, clear

All of Statas menus generate commands, which will be printed in the Review window
and also in the Results window after the dot prompt, which we mentioned earlier. If
you make a point of looking at the command Stata prints each time you use the menus,
you will quickly learn the commands. We may also include the equivalent command in
our text after explaining how to navigate to it through the menus.
Stata commands should always be typed into the Command window using the Return
key. The command may wrap onto more than one line, but if you use the Return key
in the middle of a command, Stata will interpret that as the end of the command, and
an error will likely be generated. The rule is that you should just keep typing when
entering a command, no matter how long the command is. You only press the Return
when you want to execute the command.
In addition to using the typewriter font for commands, we will also use it for variable
names, for the names of datasets, and to show Statas output. In general, we will use
the typewriter font whenever the text is something that we can type into Stata or when
the text is something that Stata might print as output. This may seem cumbersome
now, but youll catch on quickly. That said, lets move on.
The sysuse command we just used will find the sample datasets on your computer
by name alone, without the extension, in this case the dataset name is cancer and the
file that is actually found is called cancer.dta. The cancer dataset was installed with
Stata. This particular dataset has 48 observations, and 4 variables related to a cancer
treatment.
What if you forget the command sysuse? You could open a file that comes with
Stata using the menus: . File, . Example datasets. This opens up a new window in
which you can click on Example datasets installed with Stata. This takes you to
a window that lists all the datasets that come with Stata. You can click on . use to
open the dataset.
Now that we have some data read into Stata, enter the describe command in the
Command window. Thats it, just describe followed by the Enter key, which will yield
a brief description of the contents of the dataset.

1.4 An example of a short Stata session

. describe
Contains data from C:\Stata8\ado\base/c/cancer.dta
obs:
48
Patient Survival in Drug Trial
vars:
4
3 Sep 2002 12:25
size:
576 (99.9% of memory free)
variable name
studytime
died
drug
age

storage
type
int
int
int
int

display
format
%8.0g
%8.0g
%8.0g
%8.0g

value
label

variable label
Months to death or end of exp.
1 if patient died
Drug type (1=placebo)
Patients age at start of exp.

Sorted by:

The description includes a lot of information: The full name of the file, cancer.dta
(including the path entered to read the file), the number of observations (48), the number
of variables (4), the size of the file (576 bytes)and how much of Statas memory is still
available, a brief description of the dataset (Patient Survival in Drug Trial), and
the date the file was last saved are displayed at the top of the table displayed. The body
of the table displayed shows the names of the variables on the far left and the labels
attached to them on the far right. Well discuss the middle columns later.
Now that you have opened the cancer dataset, note that the Variables window lists
the four variables studytime, died, drug, and age.
Box 1.1 Internet Access to Our Data
Stata can use data that is stored on the internet just as easily as data stored on your
own computer. If you did not have the cancer dataset installed on your machine,
you could read it by using the command webuse cancer, clear which will read
the cancer dataset from Statas website. This is a particularly easy way to use the
sample datasets. You are not limited to data stored at the Stata site, though. The
command
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
will open a dataset stored at UCLA. Stata can use any URL (web address) that
points to a Stata dataset.

1.4

An example of a short Stata session

Since you have loaded the cancer dataset, we will execute a basic Stata analysis command. Type summarize into the Command window, then press Enter.
Rather than typing in the command directly, you could use the . Data . Describe
data . Summary Statistics menus to open the corresponding dialog box. Simply clicking
the OK button will produce the summarize command we just entered.

10

Chapter 1. Getting Started

You might want to select specific variables to summarize instead of summarizing


them all. On the dialog box, click on the pulldown menu button on the Variables box,
located on the right of the box, to display a list of variables. Clicking on a variable
name will add it to the list in the box. Note that the menus will allow you to enter
a variable more than once, in which case it will appear in the output more than once.
You can also type variable names into the Variables box. If you enter the summarize
command into the Command window, simply follow it with the names of the variables
for which you want summary statistics. For example, the summarize studytime age
will only display statistics for the two variables named. Otherwise, summarize will give
you a summary of all the variables.
The summarize command will display the number of observations (also called cases
or N ), the mean, the standard deviation, the minimum value, and the maximum value
for each variable in the Results window.
. summarize
Variable

Obs

Mean

studytime
died
drug
age

48
48
48
48

15.5
.6458333
1.875
55.875

Std. Dev.

Min

Max

10.25629
.4833211
.8410986
5.659205

1
0
1
47

39
1
3
67

The first line of output is the dot prompt followed by the command. After that the
output appears as a table. As you can see there were 48 observations in this dataset,
which is abbreviated as Obs in the output. Observations is a generic term. These could
be called participants, patients, or subjects, depending on your field of study. The
average, or mean, age is 55.875 years with a standard deviation of 5.659, and subjects
are all between 47 (the minimum) and 67 (the maximum) years old. We may round
numbers in the text to fewer digits than shown in the output unless it would make
finding the corresponding number in the output difficult.
If you have computed means and standard deviations by hand you know how long this
can take. Statas virtually instant statistical analysis is what makes Stata so valuable.
It takes time and skill to set up a dataset so Stata can be used to analyze it, but once
you learn how to set up a dataset in Chapter 2, you will be able to compute a wide
variety of statistics in very little time.
Well do one more thing in this Stata session: Well make the histogram for the
variable age shown in Figure 1.7. In case you havent gotten to histograms yet, a
histogram is just a graph that shows the distribution of a variable, like age, that takes
on many numeric values.
Simple graphs are simple to create. Just enter the command histogram age into the
Command window, and Stata will produce a histogram using reasonable assumptions.
Well see how to use the menus for more complicated graphics shortly.
At first glance you may be happy with this graph. Stata used a formula to determine

11

.02

Density
.04

.06

.08

1.4 An example of a short Stata session

45

50

55
60
Patients age at start of exp.

65

70

Figure 1.7: Histogram of age

that seven bars should be displayed, and this is reasonable. However, Stata starts the
lowest bar (called a bin) at 47 years old and each bin is 3.33 years wide (this information
is displayed in the Results window), whereas we are not accustomed to measuring years
in thirds of a year. Also, notice that the vertical axis measures density, whereas we
might prefer that it measure the frequency, that is, the number of people represented
by each bar, instead.
This is a case where using the menus can help you enormously. Lets go to the
. Graphics . Easy graphs . Histogram menu and customize our histogram
The histogram dialog box is shown in Figure 1.8. Its worth noting a few things
about this dialog box, and dialog boxes in general.
Lets go over the parts of the dialog box quickly. There is a text box labeled Variable
with a pull-down button. As we saw on the summarize dialog, you can pull down the
list of variables, and click on a variable name to enter it in the box, or you can type the
variables name yourself. Only one variable can be used for a histogram, and here we
want it to be age. If we stop here and click OK, we will have recreated the histogram
shown in Figure 1.7.
There are two radio buttons visible on this dialog box, one labeled Continuous data
(which is shown selected in Figure 1.8), and one labeled Discrete data. Radio buttons
indicate mutually exclusive itemsyou can choose only one of them. In this case, we
are treating age as if it were continuous, so make sure thats the radio button that is
selected.
Note that the dialog box shows a sequence of tabs just under its title bar, as shown
in Figure 1.9. Different categories of options will be grouped together, and you make
a different set of options visible by clicking on its tab. The options youve set on the

12

Chapter 1. Getting Started

Figure 1.8: Histogram dialog box

current tab will not be canceled by clicking on another tab.

Figure 1.9: The tabs on the histogram dialog box

Graphs are usually clearer when there is a title of some sort, so lets click the Titles
tab and add one. Here well enter Age Distribution of Participants in Cancer
Study in the Title box. Lets add the text Data: Sample cancer dataset to the
Note box so we know which dataset we used for this graph.
Now click the Options tab. Well only use a few of the options available on this
tab. Lets select s1 monochrome from pull-down list on the Scheme box. Schemes are
basically templates that determine what the standard attributes of a graph will be,
things like colors, fonts, size, which elements will be shown, and more.
In the Legend box, lets select Yes. Whether a legend will be displayed is determined
by the scheme that is being used, and if we were to leave Default in this box, our
histogram might have a legend or it might not, depending on the scheme. If we choose
Yes or No, we override the scheme, and our selection will always be honored.
Under the section labeled Y-Axis (bottom left of the Options tab), check the Frequency button, since we want the bars on our histogram to represent the number of
people.

1.4 An example of a short Stata session

Figure 1.10: The Title tab of the histogram dialog box

Figure 1.11: The Options tab of the histogram dialog box

13

14

Chapter 1. Getting Started

In the section labeled Bins, check the box labeled Width of bins, and enter 2.5 into
the text box that becomes active (since the variable is age, that indicates 2.5 years).
Also check the box labeled Lower limit of first bin, and enter 45, which will be the
smallest age represented by the bar on the left.
We made these selections to try to improve the appearance of the graph. The
default graph shown above started at age 47 and each bin or category was 3.33 years
wide. Starting at 45 and having a width of 2.5 years seems more natural. Now that
weve made these changes, click Submit instead of OK to generate the histogram shown
in Figure 1.12. Note that the dialog box doesnt close. To close the dialog box, click on
the close button on its upper-right corner.

Frequency
4
6

10

Age Distribution of Participants in Cancer Study

45

50

55
60
Patients age at start of exp.

65

Frequency
Data: Stata sample cancer dataset

Figure 1.12: First attempt at an improved histogram

If you look at the complex command that the menu generated, youll see why even
experienced Stata programmers will often rely on the menu system to create graph
commands.
. histogram age, frequency width(2.5) start(45) title(Age Distribution of Parti
> cipants in Cancer Study) note(Data: Stata sample cancer dataset) scheme(s1mo
> no) legend(on)
(bin=9, start=45, width=2.5)

It is much more convenient to use the menu system to generate that command than to
try to remember all its parts and the rules of their use. If you do want to enter a long
command into the Command window, remember that Stata commands must be typed
as one line. Whenever you press Enter, Stata assumes you have finished the command
and are ready to submit it for processing.

1.4 An example of a short Stata session

15

Box 1.2 When to use Submit and when to use OK


Statas dialogs give you two ways to run a command: by clicking OK or by clicking
Submit. If you click OK, Stata creates the command from your menu selections,
runs the command, and closes the dialog box. This is just what you want for most
tasks. There will be times, though, when you know you will want to make minor
adjustments to get things just right, so Stata provides the Submit button, which
runs the command but leaves the dialog open. This way, you can go back to the
menu and make changes without having to reopening the menu.

The resulting histogram in Figure 1.12 is an improvement, but we might want fewer
bins. Here we are making small changes to a Stata command, then looking at the results,
then trying again. The submit button useful very useful for this kind of interactive,
iterative work. If the dialog box is hidden, the Ctrl-Tab key combination can be used to
move through Statas windows until the one you want is on top again.
Instead of a width of 2.5 years, lets use 5 years, which is a common way to group
ages. If you clicked OK instead of Submit, note that when you return to a dialog that
youve already used in the current Stata session, the menu reappears with the last
values still there. So, all you need to do is change the 2.5 to 5 in the Bin width box
on the Options tab. The result is shown in Figure 1.13. Notice that these graphs are
quite different and a researcher needs to use professional judgment to pick the best
combination. It is easy to misrepresent a distribution using a graph and we should
guard against this.

Frequency
10

15

Age Distribution of Participants in Cancer Study

45

50

55
60
Patients age at start of exp.

65

70

Frequency
cancer.dta

Figure 1.13: Final histogram of age

To finish our first Stata session, we need to close Stata. Do this with . File . Exit.

16

1.5

Chapter 1. Getting Started

Conventions

We have already mentioned some of the conventions that well use throughout the rest
of this book. We thought it might be convenient to list them all in one place, should
you want to refer to them quickly.
First, we will indicate some things by a change of font.
Typewriter font We use this font when something would be meaningful to Stata as
input. We also use it to indicate Stata output.
Stata commands, as in
Type the describe command into the Command window.
Variable names, as in
We want to see a histogram of age before we decide.
Folder and file names, as in
The file survey.dta is in the C:\DATA directory (or folder). Stata assumes
that .dta will be the extension, so you can use just survey if you prefer.
Sans serif font We use this font to indicate menu items (in conjunction with the . symbol), button names, and particular keys.
Menu items, as in
Go to the . Data . Variable utilities . Rename variable menu.
Buttons that can be clicked, as in
Remember, if you are working on a dialog box, it will now be up to you to
click OK or Submit, whichever you prefer.
Keys, as in
The PgUp and PgDn keys will move you backward and forward through the
commands in the Review window.
Some functions require the use of the Shift, Ctrl, or Alt keys, which are held
down while the second key is pressed. For example, Alt-F will open the . File
menu.
Quotes We will use double quotes when we are talking about labels in a general way,
but as with single quotes, well use the typewriter font to indicate a specific label
in a dataset. For example,
If you decide to label this variable ageAge at first birth, then you would enter
Age at first birth into the text box.
Capitalization Stata is case sensitive. So, summarize is a Stata command, whereas
Summarize is not and will generate an error if you use it. Stata also recognizes
capitalization in variable names, so agegroup, Agegroup, and AgeGroup will be
three different variables. Although you can certainly use capital letters in variable

1.6 Chapter Summary

17

names, you will probably find yourself making more typographical errors if you
do. We have found using all lower-case letters combined with good variable and
value labeling to be the best practice.
If you remember from the text, well also capitalize the names of the various
Stata windows, but we do not set them off by using a different font. So, we will
type commands into the Command window and look at the output in the Results
window.
When we talk about labeled elements of a dialog box, we wont use a special font,
but well capitalize the label as it is capitalized on the dialog box.

1.6

Chapter Summary

We covered the following topics in this chapter.


The Stata interface, and how you can customize it.
How to open a sample Stata dataset.
The parts of a dialog box and the use of the OK and Submit buttons.
How to summarize the variables
How to create a simple histogram and how to modify it.
What the font and punctuation conventions well use throughout the book are.

1.7

Exercises

Some of these exercises involve little or no writing. They are simply things you can
do to make sure you understand the material. Other exercises may require a written
answer.
1. Open Stata, maximize the screen, rearrange the windows, then save these windowing preferences using the . Prefs menu. Close Stata and reopen it. Did you
have any trouble doing this? If you did, view the movies at
(http://www.ats.ucla.edu/stat/stata/seminars/stata9/gui.htm).
2. You can copy and paste text to and from Stata as you wish. You should try
highlighting some text in Statas Results window, copying it to the clipboard, and
pasting it into another program, such as your word processor. To copy highlighted
text, you can use the . Edit . Copy menu or, as indicated on the menu, Ctrl-C.
You will probably need to change the font to a monospaced font (e.g., Courier),
and you may need to reduce its size of the font (e.g., 9 point) after pasting it
to prevent the lines from wrapping. You may wish to experiment copying Stata
output into your word processor now, so you know which font size and typeface

18

Chapter 1. Getting Started


work best. It may help to use a wider margin such as 1 inch on each side. After
you highlight material in the Results window, press the right mouse button. You
can save this output in several formats including HTML. Press the right mouse
button, then select HTML. Switch to your word processor. Press Ctrl-v to paste
what you copied. In your word processor you should use a monospaced font such
as Courier. Otherwise, columns will not line up correctly.
3. Stata has posted all the datasets used to illustrate how to do procedures in its
manuals. You can access the manual datasets from within Stata by going to the
. File . Example datasets menu, which will open Statas Viewer. Click on Stata
9 manual datasets and then click on Users Guide [U].
The Viewer works much like a web browser, so you can click on any of the links to
the list of datasets. Scroll down to Chapter 25 and select states3.dta. This opens
a dataset that is used for Chapter 25 of the Users Guide. Run two commands,
describe and summarize. What is the variable divorce rate and what is the
mean (average) divorce rate for the 50 states?
4. Open the cancer dataset. Create histograms for age using a bin widths of 1, 3,
and 5. Use the right mouse button and copy each graph to the clipboard and then
paste it into your word processor. Does the overall shape of the histogram change
as the bins get wider? How?
5. UCLA has a Stata portal that has a great deal of helpful material about Stata.
You might want to browse the collection now just to get an idea of what topics
there are. The URL for the main UCLA Stata page is
http://www.ats.ucla.edu/stat/stata/
In particular, you might want to look at the links listed under Learning Stata. On
the Stata Starter Kit page, youll find a link to Class notes with movies. These
demonstrate using Statas commands rather than the menu system. The topics
we will cover in the first few chapters of this book are also covered on the UCLA
web page using the commands. Each movie is about 25 minutes long. If you rely
on a modem for internet connectivity, you might want not to view the movies

Entering Data
2.1
2.2
2.3
2.4

Creating a dataset
An example questionnaire
Develop a coding system
Entering data
2.4.1

2.5
2.6
2.7
2.8

2.1

Labeling values

Saving your dataset


Checking the data
Chapter Summary
Exercises

Creating a dataset

In this chapter, well learn how to create a dataset. Data entry and verification can be
tedious, but it is essential to get accurate data. More than one analysis has turned out
to be wrong because of errors in the data.
In the first chapter we worked with some sample datasets. Most of the analyses in
this book use datasets that have already been created for you and and are available on
our web page or they may already be installed on your computer. If you must, you can
come back to this chapter when you need to create a dataset. You should look through
the sections in this chapter that deal with documenting and labeling your data, at least,
as not all prepared datasets will be well documented or labeled.
The ability to create and manage a dataset is valuable. People who know how to
create a dataset are valued members of any research team. For small projects, collecting,
entering, and managing the data can be quite straightforward. For large, complex
studies that extend over many years, managing the data and its documentation can be
as complex as many of the analyses that may be conducted.
Stata, and most other statistical software almost always requires data that is laid out
in a grid like a table, with rows and columns. Each row contains data for an observation
(which is often a subject or a participant in your study), and each column contains the
19

20

Chapter 2. Entering Data

measurement of something of interest about the observation such as age. This will be
familiar to you if youve used a spreadsheet.
Variables and Items
When working with questionnaire data, an item from the questionnaire almost
always corresponds to a variable in the dataset. So, if you ask for a respondents
age, youll have a variable called age. Many questionnaire items are designed to
be combined together to make scales or composite measures of some sort, and a
new variables will be created to contain those scales, but there is no corresponding
item on the questionnaire. A questionnaire may also have a single item where the
respondent is asked to mark all that apply, and it is fairly common for there to
be a variable for each category that could be marked under that item. The terms
item and variable are often used interchangeably, but they are not synonyms.

There is more to a dataset than just a table of numbers. Datasets usually contain
labels that help the researcher use the data more easily and efficiently. In a dataset,
the columns correspond to variables, and variables must be named. We can also attach
a descriptive label to each variable, which will often appear in the output of statistical
procedures. Since most data will be numbers, we can also attach value labels to the
numbers, so its clear what the numbers mean.
It is extremely helpful to pick descriptive names for each item. We call these
variable names. For example, we might call a variable that contains peoples responses to a question about their states schools q23, but it is often better to use a
more descriptive name, such as schools. If there were two questions about schools, it
might make sense to name them schools1 and schools2 than q23a and q23b. If many
items are to be combined into scales and are not intended to be used alone, then it
may be clearer to use names that correspond to the questionnaire items for the original
variables and reserve the descriptive names for the composite scores. No matter what
the logic of your naming, try to keep names short. You will probably have to type them
often, and short names offer fewer opportunities for typographical errors.
Even with relatively descriptive variable names, it is usually helpful to attach a
longer, and hopefully more descriptive, label to each variable. We call these variable
labels to distinugish them from variable names. For example, you might want to
label the variable schools with the label Public school rating. The variable label
gives us a clearer understanding of the data stored in the variable.
For some variables, the meaning of the data is obvious. If you measure peoples height
in inches, when you see the values, you wont wonder a lot about what they mean. Most
of us have run across questions that ask us if we Strongly agree, Agree, Disagree,
or Strongly disagree, or that ask us to answer some question with a Yes or No.
This sort of data is usually coded as numbers such as 1 for Yes and 2 for No, and
it can make understanding tables of data much easier if the value labels are printed in
the output in addition to (or instead of) the numbers. When we label the values we call

2.2 An example questionnaire

21

these value labels.


If your project is just a short homework assignment, you could consider skipping the
labeling (though your instructor would no doubt appreciate anything that makes your
work easier to understand). For any serious work, though, clear labeling will make your
work much easier in the long run.

2.2

An example questionnaire

Weve talked about datasets in general, now lets move on to creating one. Suppose we
conducted a survey of 20 people and asked each of them six questions, which are shown
in the example questionnaire shown in Table 2.1.
What is your gender?
Male
Female
How many years of education have you completed?
08
9
10
11
12
13
14
15
16
17
18
19
20 or more
How would you rate public schools in your state?
very poor
poor
okay
good
very good
How would you rate public schools in the community you lived in as a teenager?
very poor
poor
okay
good
very good
How would you rate the severity of sentences of criminals to prison?
much too lenient
somewhat too lenient
about right
somewhat too harsh
much too harsh
How liberal or conservative are you?
very liberal
somewhat liberal
moderate
somewhat conservative
very conservative
Table 2.1: Example questionnaire
Our task is to convert the answers to the questions on this questionnaire into a
dataset we can use with Stata.

22

2.3

Chapter 2. Entering Data

Develop a coding system

Statistics is most often done with numbers, so we need to have a numeric coding system
for the answers to our questions. Stata can use numbers or words. For example, we
could just enter Female if a respondent checked that box. However, it is usually better
to use some sort of numeric coding, so you might type 1 if the respondent checked the
Male box on the questionnaire and 2 if Female was checked. This means we need to
assign a number to enter for each possible response for each of the items on the survey.
We will also need a short variable name for the variable that will contain the
data for each item. Variable names can only contain upper- and lower-case letters,
numerals, and the underscore character, and they can be up to 32 characters long. It
is generally better to keep your variable names 10 characters or shorter, and 8 or fewer
is best. Variable names cannot contain a space and should start with a letter. Along
with the Variable name, compose a Variable label that will help relate the variable
back to the original questionnaire.
If it is appropriate, you should explain the relationship between any numeric codes
and the responses as they appeared on the questionnaire. For an example of all this put
together, see the example codebook for our questionnaire that appears in Table 2.2.
The codebook is a translation tool. It translates the numeric codes in your dataset
back into the questions you asked your participants and the choices you gave them for
answers. Regardless of whether you gather data with a computer aided interviewing
system or with paper questionnaires, the codebook is an essential part of being able to
make sense of your data and your analyzes. If you do not have a codebook, you might
not realize that everyone with eight or fewer years of education is coded the same way,
with an 8. That may have an impact on how you use that variable in later analyzes.
We have added an id variable to identify each respondent. In simple cases, we
just number the questionnaires sequentially. If we have a sample of 5,000 people we
will number the questionnaires from 1 to 5000. Write the identification number on the
original questionnaire, and record it in the dataset. If we discover a problem in our
datasetsomebody with a coded value of 3 for gender, saywe can go back to the
questionnaire and determine what the correct value should be. Some researchers will
eventually destroy the link between the ID and the questionnaire as a way to protect
human participants. It is good to keep the original questionnaires in a safe place that
is locked and accessible only by members of the research team.
Some data will be missingpeople refuse to answer certain questions, interviewers
forget to ask questions, equipment failsthe reasons are many, and we need a code to
indicate that data is missing. If we know why the answer is missing, we will want to
record the reason, too, so we may want different codes that correspond to the different
reasons the data might be missing.
On surveys, respondents may refuse to answer, may not express an opinion, or may
not have been asked a particular question because of the answer to an earlier question.
In this case, we might code invalid skip (interviewer error) as 5, valid skip (not

2.3 Develop a coding system

Question
Identification number

23

Variable Name

Value Labels

Code

id

Record in order

1 to 20

gender

Male
Female
No Answer

1
2
9

0-8
920
No Answer

8
9 to 20
9

What is your gender?

How many years of education have you completed?


education

Rate public schools in your state.


sch st

Very poor
Poor
Okay
Good
Very good
No Answer
Rate public schools in the community you lived in as a teenager.
sch com
Very poor
Poor
Okay
Good
Very good
No Answer
Rate severity of sentences of criminals to prison. Are they?
prison
Much too lenient
Too lenient
About right
Too harsh
Much too harsh
No Answer
How liberal or conservative are you?
conserv
Very liberal
Liberal
Moderate
Conservative
Very conservative
No Answer
Table 2.2: Example codebook

1
2
3
4
5
9
1
2
3
4
5
9
1
2
3
4
5
9
1
2
3
4
5
9

24

Chapter 2. Entering Data

applicable to this respondent) as 4, refused to answer as 3, 2 as dont know,


and 1 as missing (any other reason). For example, adolescent boys should not be
asked when they had their first menstrual period, so we would enter a 4 for that
question if the respondent is male. It is important that we pick values that can never be
a valid response. Stata has multiple values that are treated as missing, and well show
you how to use them, but for our example questionnaire, well only use one missing
value code, 9.
Well be entering the data ourselves, so after we administered the questionnaire to
our sample of 20 people, we prepared a coding sheet that will be used to enter the
data. The coding sheet originates from the days when data was entered by professional
keypunch operators, but it can still be very useful or necessary. When you create a
coding sheet, you are converting the data from the format used on the questionnaire to
the format that will actually be stored in the computer (a table of numbers). The more
the format of the questionnaire differs from a table of numbers, the more likely that a
coding sheet will help prevent errors.
There are other reasons you may want to use a coding sheet. For example, there
may be confidential information on your questionnaires that should be shown to as few
people as possible; if there are many open-ended questions for which extended answers
were collected but that will not be entered, working from the original questionnaires
can be unwieldy.
In general, if you transcribe the data from the questionnaires to the coding sheet,
the decision about which responses go in which columns is yours. This will reduce errors
that might arise from uncertainty by the people who do the data entry, who may not
have the information needed to make those decisions properly.
Whether you enter directly from the questionnaires or create a coding sheet will be
dependent largely on the study and on the resources that are available to you. Some
experience with data entry is valuable because it will give you a better sense of the
problems you may encounter, whether it is you or someone else who enters the data, so
we have created a coding sheet for our example questionnaire, shown in Table 2.3.
Since we are not reproducing the 20 questionnaires in this book, it may be helpful to
examine how we entered the data from one of them. We will use the ninth questionnaire.
Weve assigned an id of 9, as can be seen in the first column. Reading from left to
right, in the second column, for gender, weve recorded a 2 to indicate a woman and
for education weve recorded a response of 16 years. This woman rates schools in her
state and in the community in which she lived as a teenager as very good, which we can
see from the 5s in the fourth and fifth columns. She indicated that she thinks prison
sentences are too long and that she considers herself very liberal (4 and 1 in the sixth
and seventh columns).

2.3 Develop a coding system

25

id

gender

education

sch st

sch com

prison

conserv

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

2
1
1
1
2
2
1
2
2
1
1
2
2
1
2
1
1
2
2
1

15
12
16
8
12
18
17
14
16
20
12
11
18
16
16
17
12
12
14
13

4
2
3
9
3
4
3
2
5
4
2
9
5
5
5
4
2
2
4
3

5
3
4
1
3
5
4
3
5
4
2
1
5
5
5
3
2
2
5
3

4
1
3
9
3
5
2
1
4
5
1
1
9
5
4
4
9
2
4
5

2
5
2
5
3
1
4
5
1
2
5
3
9
1
2
3
1
2
4
5

Table 2.3: Example coding sheet

26

2.4

Chapter 2. Entering Data

Entering data

Well use Statas Data Editor to enter our data from the coding sheet. The Data Editor
provides an interface that is similar to a spreadsheet, but it has some features that are
particularly suited to creating Stata datasets. Before opening the Data Editor, you
might save any open file and then enter the command clear in the Command window.
This will give you a fresh Data Editor in which to enter data. Only enter the clear
command if you want to start with a new dataset that has nothing in it. To open the
Data Editor, click on the . Data . Data Editor menu. There is also a shortcut on the
toolbar for opening the Data Editor (it looks like a spreadsheet), or you can use Ctrl-7, if
you prefer, or you can type edit into the Command window. The Data Editor window
is shown in Figure 2.1.

Figure 2.1: Data editor window

Data is entered in the white columns, just as in other spreadsheets. Well enter the
data for the first respondent to our example survey, then stop. In the first white cell
under the first column, enter the identification number of the first respondent. We enter
a 1. Press the Tab key to move across to the next cell, where we will enter the value
from the second column of the coding sheet. This value is 2 because the first case is a
woman. You press the Tab and continue until you have entered all the values for the
first participant. After youve entered the number from the last number in the first row
of the coding sheet, press Return/Enter, which will move you down to the second row of
the Data Editor. Press the Home key to return to the first column.
Lets interrupt our data entry here, after we have the data entered for the first
respondent. When we create a new dataset by entering data into the Data Editor,
Stata assigns each column a default Variable name. As youve no doubt noticed, the
columns are called var1, var2, var3, . . . , var7. We want to change these to the names
we recorded in our codebook. We also composed a label for each variable as part of our
codebook, and we should enter those, too.
Double click on the grey cell at the top of the first column. This cell now contains
the Variable name var1. Your double clicking on this cell opens the dialog box shown
in Figure 2.2, in which we have already entered the variable name, id, and the label,
Identification number. Enter these values in the boxes.

2.4 Entering data

27

Figure 2.2: Variable name and variable label

In addition to the variable name and variable label, the dialog box includes a box
where you specify the format and it shows %8.0g. This is the default format for numeric
data, and it instructs Stata to try to print numbers within 8 columns and to round
numbers to the nearest whole number, that is, with no numbers to the right of the
decimal place. Stata will adjust the format if you enter a number wider than 8 columns
in the Data Editor.
If you have string data (that is, data that can contain letters, numbers, or symbols),
youll see a format that looks like %9s, which indicates a variable with 9 characters that
are strings (the s). More detailed information about formats can be found at the books
web site.
I entered the letter l for the number 1
A common mistake is to enter a letter instated of a numeric value. This can happen
if you accidently type the letter l rather than the number 1 or you might simply
type the wrong key. When this happens Stata will make the variable a string
variable meaning that is is no longer a numeric variable. If you find you mistake and
correct it with the correct numeric value, Stata will still think it is a string variable.
There is a simple command that will force Stata to change the variable from a
string variable back to a numeric variable. The command is destring varlist,
replace. For example, if to change the variable id from string to numeric you
would type destring id, replace.

All of our data are numeric, none of our data contain decimals, and none is wider
than eight digits, so we can leave the format alone. Go ahead and click OK, then
rename and label the other variables in the dataset. Once you have defined a variable
as numeric, Stata will warn you if you try to enter data that isnt a number, which can
help reduce errors. Be careful to always enter the number 1 and not the lowercase letter

28

Chapter 2. Entering Data

l.
When you learn how to write Stata programs in do-files, you might want to enter
the labels by direct commands without using the menu. If we wanted to do this with
the variable prison we would enter the following commands.
. label variable prison "Harshness of prison sentences"

2.4.1

Labeling values

It is a good idea to enter the value labels we described in Table 2.2. Some people do
not do this and try to remember what the values mean. You can remember that 1
is the code for a male and 2 the code for a female, but if the questionnaire contains
hundreds of items, remembering all the values would be impossible. You could rely
on the codebook, but checking the codebook each time you come across a variable for
which youre not sure of the coding will take valuable time, often when you are facing
a deadline.
Creating value labels pays off if you are going to use the dataset more than a few
times, or if you share a dataset with others. Output will be nicely labeled, which will
make it much easier for you to read and make sense of your results. Value labels are
especially helpful for those on the team who may not work with the data regularly, or
who work with some parts of the dataset irregularly.
Value labels are used when the numbers have some language equivalent, such as a
scale from strongly disagree to strongly agree. If we record peoples ages, there is
no label we could sensibly use for 16. There can be times when the number of values
to be labeled is daunting (diagnostic codes, for example). In many cases, the labels
are available in usable form, and its just a matter of finding them. Ask colleagues,
librarians, and data providers.
Labeling values in Stata is done in two stages. In the first stage, we create the
labeling scheme where the text and the value to which it is attached are saved. In the
second stage, we assign the labeling scheme to the variable. Labeling schemes are given
names, which can be the same as variable names or different. When it makes sense to
do so, you may want to give the variable and the labeling scheme the same name just
to keep things simple. At other times you may choose a name that is descriptive of the
type of scale or measurement being labeled. The first step involves creating a scheme
for each set of labels in your data set.
We will create a scheme called sex using 1 for male and 2 for female. We will create
a scheme called rating using 1 for very poor, 2 for poor, 3 for okay, 4 for good, 5 for
very good, and -9 for missing. We will create a scheme called harsh using 1 for much
too lenient, 2 for too lenient, 3 for about right, 4 for too harsh, 5 for much too harsh,
and -9 for missing. Finally, we will create a scheme called conserv using 1 for very
liberal, 2 for liberal, 3 for moderate, 4 for conservative, 5 for very conservative, and -9
for missing.

2.4.1 Labeling values

29

At the bottom of Figure 2.2, just about OK you see Define/Modify. Clicking on this
opens a new dialog in which we define schemes. This new dialog appears in Figure 2.3
along with how we define the scheme sex.

Figure 2.3: Define schemes dialog box


Click Define in Figure 2.3, which opens the Define label dialog box in which we
enter the name for the new labeling scheme, sex. The name can contain only letters,
numerals, and the underscorethe same characters that can be used for variable names.
Well call it sex, because we will use it for our gender variables. As soon as we click on
OK a new dialog box opens.
We enter the first value, 1 and text for the label, male. When we click on OK we
enter the second value, 2 and the text for its label, female. After clicking OK we enter
the third value, -9 and its label, missing and click OK. Since this is the last value for
this scheme, we are back in the menu in Figure 2.3 and here we click Add. This adds
these value labels to the scheme sex. Using this process we define the other schemes.
When we are finished Figure 2.3 becomes Figure 2.4.

Figure 2.4: Define schemes dialog box


Close this dialog box and return we go back to the origianl dialog in Figure 2.2. This
time click OK and then click at the top of the second column where it says var2. This

30

Chapter 2. Entering Data

opens a dialog where we enter the variable name, gender, the variable label, Gender of
participant, and then we can click the arrow to get a list of the value label schemes
where we click on sex to assign that scheme to the variable gender. This is the second
set in which we assign the value label schemes to the variables. The menu for doing this
for gender is in Figure 2.3. We do this for each variable to assign the scheme rating
to both sch st and sch com, the scheme harsh to the variable prison, and the scheme
conserv to the variable conserv.
Going through this series of menus can be confusing. This is one case where it may
be easier to simply enter the commands. The key thing to remember is that it is a two
step process. The first step defines a different scheme for each set of labels. The second
step assigns these to the appropriate variables. Here is how we would enter the first
step:
.
.
.
.
.
.
.
.
.

label
label
label
label
label
label
label
label
label

define
define
define
define
define
define
define
define
define

sex 1 "male" 2 "female" -9 "missing"


harsh 1 "much too lenient" 2 "too lenient"
harsh 3 "about right", 4 "too harsh", add
harsh 5 "much too harsh" -9 "missing", add
rating 1 "very favorable" 2 "Favorable"
rating 3 "unfavorable" 4 "very unfavorable", add
rating -9 "missing", add
conserv 1 "very liberal" 2 "liberal" 3 "moderate"
conserv 4 "conservative" 5 "very conservative", add

Why do we show the dot prompt with these commands?


When we show a listing of Stata commands we place a dot and a space in front
of each command. When you enter these in the command window, you enter the
command itself and not the dot prompt or space. We include these because Stata
always shows commands this way in the Results window. Also, Stata manuals and
many other books about Stata follow this convention.

Note that we added the add option to the end of the second label define command,
which indicates that we are adding labels to the con labeling scheme and not trying
to redefine it. What happens if we make a mistake in how we entered a label? In the
definitions we just entered we defined the rating scheme value of 2 the text Favorable.
Sine we have not capitalized any of the other labels we might want to change this to
lower case, favorable. To do this, we enter the following command with the option of
modify. Stata tries to protect us against ourselves and wont let us change the definition
without explicitly including the modify option.
. label define rating 2 "favorable", modify

The second step is to assign these value labels to the appropriate variables. If we do
not want to go through the menus we can enter these commands directly:
. label values gender sex
. label values sch_st rating
. label values sch_com rating

2.5 Saving your dataset

31

. label values prison harsh


. label values conserv conserv

You may have experience with other statistical packages that do not use this two
step process. Assigning value labels to each variable, one at a time, would be reasonable
for a dataset that just has a few variables such as our example. When you are dealing
with complex datasets there may be dozens of variables that have the same scheme such
as strongly agree to strongly disagree or yes/no options. A complex data set might have
just 10 schemes, but 200 or more items. Once you have defined each scheme you can
assign them to all the variables for which they apply. This is an efficient approach for
all but the simplest datasets.

2.5

Saving your dataset

If you look at the Results window in Stata you can see that the menu has done a lot
of work for you. The Results window shows that a lot of commands have been run to
label the variables. These also appear in the Review window. The Variables window
lists all of your variables. Weve created our first dataset.
Up till now, Stata has been working with our data in memory, and we still need to
save the dataset as a file, which we can do from the . File . Save As menu. This will
open the familiar Windows dialog box to choose where you would like to save your file.
Once the data has been saved to a file, you can exit Stata or use another dataset and
not lose your changes.
Notice that Stata will give the file the .dta extension if you dont specify one. It is
important that you use this extension, as that is what identifies it as a Stata dataset.
You can give it any first name you want, but dont change the extension. Lets call this
file firstsurvey and let Stata add the .dta extension for us. If you look in the Results
window, youll see that Stata has printed something like
. save "C:\DATA\firstsurvey.dta"

It is a good idea to always look to make sure that you spelled the name as you
thought, and that it got saved into the directory you intended. Note, also, that Stata
added quotation marks around the filename. If you type a save command into the
Command window yourself, you must use the quotes around the file name only if there
is a space in it, however, if you always use the quotes around the filename, youll never
get an error by forgetting to use them.
Stata works with one dataset at a time. If you wish to work with a different dataset,
the new one will replace the current one in memory. When you open the new dataset,
Stata will clear the current one from memory first. Stata knows when you have changed
your data, and you will be prompted if you attempt to replace unsaved data with new
data. It is, however, a very good idea to get in the habit of saving before reading new
data because Stata will not prompt you if the only things that have changed are labels
and notes.

32

Chapter 2. Entering Data

A Stata dataset contains more than just the numbers, as weve seen in the work
weve done so far. A Stata dataset contains
The numerical data
The variable names
The variable labels
Missing values
Formats for printing data
Dataset label
Notes
Sort order and the name of the sort variables
In addition to the labeling weve already seen, you can store notes in the dataset,
you can attach notes to particular variables, which can be a very convenient way to
document changes, particularly if more than one person is working on the dataset. To
see all the things you can insert in a dataset, enter ubject)command, help labelthe
command help label. If the dataset has been sorted, that information is also stored.
We have now created our first dataset and saved it to disk, so lets close Stata. This
is done with the . File . Exit menu item.

2.6

Checking the data

Weve created a dataset, now we want to check the work we did defining the dataset.
Checking for the accuracy of our data entry is also our first statistical look at the data.
To open the dataset, use the . File . Open menu. Locate the firstsurvey dataset that
we saved in the preceding section, and click OK.
Lets run a couple of commands that will characterize our dataset and the data
stored in it in slightly different ways. We created a codebook to use when creating our
dataset. Stata provides a convenient way to reproduce much of that information, which
is useful if you want to check to make sure you entered the information correctly. It is
also very useful if you share the dataset with someone who doesnt have access to the
original codebook. Use the . Data . Describe Data . Describe Data Contents (Codebook)
menu to generate Statas codebook from the current dataset. Note that you can specify
variables for which you want a codebook entry, or you can get a complete codebook by
not specifying any variables. Try it first without specifying any variables.

2.6 Checking the data

33
Scrolling the Results

When Stata displays your result, it does so one screen at a time. When there is
more output than will fit in the Results window, Stata prints what will fit and
displays more in the lower-left corner of the Results window. Stata calls this
a more condition in its documentation. To continue, you can click on the more
in the Results window, but most of us find it easier to just press the space bar. If
you dont want to see all of the output, that is, you want to cancel displaying the
rest of it, you can press the q key. If you do not like having the output pause this
way, you can turn this paging off, by typing the command set more off into the
Command window. All of the output generated will print without pause until you
either change this setting back or quit and restart Stata. To restore the paging,
use the set more on command. To see what else you can set, enter the command
help set and you will get a long list of things you can customize.

Lets take a look at the codebook entry for gender as Stata produced it. For this
example, you can either scroll back in the Results window, or you can return to the
. Data . Describe Data . Describe Data Contents (Codebook) menu and put gender into
the Variables box, as weve done in the examples that follow.
. codebook gender
gender

participants gender
type:
label:
range:
unique values:
tabulation:

numeric (byte)
sex
[1,2]
2
Freq.
10
10

Numeric
1
2

units:
missing .:

1
0/20

Label
male
female

Lets go over the display for gender. The first line lists the variable name, gender,
and the variable label, participants gender. Next, the type of the variable, which is
numeric (byte) is shown. The value label scheme associated with this variable is sex.
The range of this variable (shown as the lowest value, then the highest value) is from 1
to 2, there are two unique values, and we have 0 missing values in our 20 cases. This is
followed by a table showing the frequencies, values, and labels. We have 10 cases with
a value of 1, and these are labeled male, and 10 cases with a value of 2 and these are
labeled female.
This is an excellent way to check your categorical variables, like gender and prison.
If, in the tabulation, you saw three values for gender or six values for prison, you
would know that there are errors in the data. Looking at these kinds of summaries of
your variables will often help you detect data-entry errors, which is why it is crucial to

34

Chapter 2. Entering Data

review your data after entry.


Lets also look at the display for education.
. codebook education
education

Years of education

type:
range:
unique values:
mean:
std. dev:
percentiles:

numeric (byte)
[8,20]
10
14.45
2.94645
10%
11.5

units:
missing .:

25%
12

50%
14.5

1
0/20

75%
16.5

90%
18

Because this variable takes on a larger number of values, Stata does not give us a
tabulation of the individual values. Instead, Stata gives us the range (8 to 20) along
with the mean, which is 14.45, and the standard deviation, which is 2.95. It also gives
the percentiles. The 50th percentile is usually called the median and in this case the
median is 14.5.
With a large dataset the complete codebook may be more information than you
need. For a quicker, condensed overview there is an alternate command, describe.
Enter the command into the Command window, or use the . Data . Describe Data
. Describe Variables in Memory menu, which will display the dialog box in Figure 2.5.

Figure 2.5: Describe dataset dialog box

If you do not list any variables in the Variable box, then all the variables will be
included in the output, as shown in the following output:

2.6 Checking the data

35

. describe
Contains data from C:\DATA\firstsurvey.dta
obs:
20
vars:
7
size:
240 (99.9% of memory free)
variable name

storage
type

display
format

id
gender
education
sch_st
sch_com

int
byte
byte
byte
byte

%8.0g
%8.0g
%8.0g
%9.0g
%9.0g

prison
conserv

byte
byte

%16.0g
%17.0g

value
label
sex
approve
approve
length
con

28 May 2004 09:39

variable label
Identification number
participants gender
Years of education
Rating of schools in your state
Ratings of schools in your
community of origin
Rating of prison sentences
Conservativism/liberalism

Sorted by:

As we can see by examining the output, the path to the data file is shown (directory
and file name), as is the number of observations, the number of variables, the size of
the file, and the last date and time the file was written (also called a time stamp. It
is possible to attach a label to a dataset, too, and if we had done so, it would have
been shown above the time stamp. Below the information that summarizes the dataset
is information on each variable: its name, its storage type (there are several types of
numbers, and string), the format Stata uses when printing values for the variable, the
value label scheme associated with it, and the variable label. This is clearly not as
complete a description of each variable as provided by codebook, but it is often just
what you want.
One last thing we will do to check the dataset is to get a table of the standard
summary, or descriptive, statistics we usually want to see with numeric data. Use the
. Data . Describe Data . Summary Statistics menu. If you do not specify any variables,
well get summary statistics for all the variables.
. summarize
Variable

Obs

Mean

id
gender
education
sch_st
sch_com

20
20
20
18
20

10.5
1.5
14.45
3.444444
3.5

prison
conserv

17
19

3.176471
2.947368

Std. Dev.

Min

Max

5.91608
.5129892
2.946452
1.149026
1.395481

1
1
8
2
1

20
2
20
5
5

1.550617
1.544657

1
1

5
5

Each variable appears as a row, with its name, the number of observations (which varies
because we have missing values for some of the participants), the mean, the standard
deviation, the minimum value, and maximum value. Once again, we want to examine

36

Chapter 2. Entering Data

these values to see if there is any indication that we may have errors in the data. For
example, if the maximum value shown for education is 30, you might want to verify
that value. Another good check to perform here is to make sure that we have properly
converted all of the codes we used for missing values when we entered the data into
Stata missing values. Had we not done so, you would see minimum values listed that
were negative numbers. We chose negative numbers to make it obvious when we look
at summaries like these whether we have converted them into missing values or not.
How long would it take you to compute the means and standard deviations for all
of these variables by hand, even for just the 20 observations we have? Here we got the
result almost instantly. Using the computer really does save time, and a lot of it, when
doing statistics. The more proficient you become with the computer and Stata, the
easier it will be to generate useful statistics quickly and painlessly.

2.7

Chapter Summary

We covered the following topics in the chapter.


How to create a codebook for a questionnaire.
How to develop a system for coding the data.
Entering the data into a Stata dataset.
Adding variable names and variable labels.
Creating value label schemes and associating them with variables.
Converting our missing value codes to Statas missing values.
Summarizing our dataset and our variables to check for errors and to get our first
statistical look at the data.

2.8

Exercises
1. From the coding sheet sheet shown in Table 2.3 translate the numeric codes back
into the answers given on the questionnaire by using the coding sheet in Table 2.2.
2. Using the menu, attach a descriptive label to the dataset we created in this chapter.
What command is printed into the Results window by the menu? You find this
menu under . Data . Labels.
3. Using the menu, experiment with Statas notes features. You can attach notes to
the dataset and to specific variables. See if you can attach notes, display them, and
remove them. What might the advantage be to keeping notes inside the dataset?
4. Administer your own small questionnaire to students in your class. Enter the data
into a Stata Dataset, label the variables and values.

Preparing Data for Analysis


3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9

3.1

Introduction
Plan your work
Create value labels
Reverse code variables
Create and modify variables
Create scales
Save some of your data
Summary
Exercises

Introduction

A substantial portion of any research project is spent preparing for analysis. Exactly
how much will depend on the data. If you collect and enter your own data, a majority of
the actual time you spend on the project will not be analyzing the data, it will be getting
it ready to analyze. If you are doing secondary analysis and are using well-prepared
data, substantially less time may be spent.
Stata has an entire manual on data management that is nearly 500 pages long, Data
Management Reference Manual . If your data is always going to be prepared for you
and you will only do statistics and graphs with Stata, you might skip this chapter. We
will only cover a few of Statas capabilities for managing data. Some, perhaps many, of
the data management tasks you perform you will be able to anticipate and plan, but
others will just arise as you progress through a project and youll need to deal with
them on the spot.

3.2

Plan your work

The data we will be using is from the U.S. Department of Labor, Bureau of Labor Statistics, National Longitudinal Survey of Youth, 1997, which we will abbreviate as NLSY97. We selected a set of four items to measure how adolescents feel
37

38

Chapter 3. Preparing Data for Analysis

about their mother and another set of four parallel items to measure how adolescents
feel about their father. We used an extraction program available from the web page
(http://www.bls.gov/nls/quex/y97quexcbks.htm) and extracted a Stata dictionary
that contained a series of commands for Constructing the Stata dataset we will use
along with the raw data in an ASCII format. We created a dataset called relate.dta.
In this chapter we will learn how to work with this dataset to create two variables,
namely, Youth Perception of Mother and Youth Perception of Father.
What is a Stata dictionary file?
A Stata dictionary file is a special file that contains Stata commands and raw
data in plain text called ASCII format. We will not cover this type of file here,
but you use this to import the raw data into a Stata dataset. Using the menu
. File . Import opens a menu where we select ASCII data in fixed format with
a dictionary. This takes us to a window in which we Browse for where we stored
the dictionary file. In this case, we find relate.dct. The extension .dct is always
used for a dictionary file. Sometimes the dictionary will be in one file and the raw
ASCII data will be in a separate file. Here they are both in the same file so we
do not need to point to an ASCII data file. Clicking OK creates the Stata dataset.
We can then save it using the name relate.dta. The dictionary we used provides
the variable names and variable labels. Had the variable labels not been included
in the dictionary, we would have created them prior to saving the dataset. If you
need to create a dictionary file yourself you can learn how to do that in the Data
Management Reference Manual.

It is a very good idea to make an outline of the steps you will need to take going
from data collection to analysis. The outline should include what needs to be done
to collect or obtain the data, enter (or read) and label the data, make any necessary
changes to the data prior to analysis, create any composite variables, and finally, create
an analysis-ready version (or versions) of the dataset.
An outline serves two vital functions: It provides direction so you dont lose your
way in what can often be a complicated process with many details, and it provides you
with a set of benchmarks for measuring progress. For a large project, this latter function
should not be underestimated, as it is the key to successful project management.
Our project outline, which is for a simple project on which only one person is working,
is shown in Table 3.1.
This is a pretty skeletal outline, but it is a fine start for a small project like this.
The larger the project, the more detail needed for the plan. Now that we have the
NLSY-supplied codebook to look at, we can fill in some more details of our plan.
First, we will run a describe command on the dataset we created. Lets just show
the results for the four items measuring the adolescents perception of their mother.
From the documentation on the dataset we can determine that these items are called

3.2 Plan your work

39

Consult NLSY97 documentation, Appendix 9, to determine which variables are needed (see page 23 in this Appendix that is a PDF file).
http://www.bls.gov/nls/quex/y97quexcbks.htm
Download the data and codebook (done for you as relate.dct and relate.cdb.
This was done for you.
Create a basic Stata dataset. This was done for you, see relate.dta.
Create variable and value labels.
Generate tables for variables to compare against the codebook to check for errors.
Convert missing-value codes to Stata missing values.
Reverse code those variables that need it; verify.
Copy variables not reversed to named variables; verify.
Create the scale variable.
Save the analysis-ready copy of the dataset.
Table 3.1: Sample project task list
R3483600, R3483700, R3483800, and R3483900. Thus our command is: describe
R3483600 R3483700 R3483800 R3483900. This produces the following description of
the data:
. describe R3483600 R3483700 R3483800 R3483900
variable name

storage
type

display
format

R3483600
R3483700
R3483800

float
float
float

%9.0g
%9.0g
%9.0g

R3483900

float

%9.0g

value
label

variable label
MOTH PRAISES R DOING WELL 1999
MOTH CRITICIZE RS IDEAS 1999
MOTH HELPS R WITH WHAT IMPT TO
R 1999
MOTH BLAMES R FOR PROBS 1999

These variable names are not very helpful and we might want to change them later.
For example, we my rename R3483600 to mopraise. The variable labels make sense,
but notice that there are no value labels. We will read the codebook we downloaded
when we downloaded this dataset. It is in a file called relate.cbk. Table 3.2 shows
what the downloaded codebook looks like for the R3483600 variable:
This is very helpful. We see that a code of 0 means Never, a code of 1 means
Rarely, etc. We need to make value labels so our data will have nice value labels like
these. Also, notice that there are different missing values. They use a -1 for Refusal,
-2 for Dont know, -4 for Valid skip, and -5 for Non-interview. We are puzzled

40

Chapter 3. Preparing Data for Analysis


-------------------------------------------------------------------------------R34836.00
[YSAQ-022]
Survey Year: 1999
MOTHER PRAISES R FOR DOING WELL
How often does she praise you for doing well?
118
235
917
1546
1701
------4517

0
1
2
3
4

NEVER
RARELY
SOMETIMES
USUALLY
ALWAYS

Refusal(-1)
20
Dont Know(-2)
3
TOTAL
=========>4540
VALID SKIP(-4)
3669
NON-INTERVIEW(-5)
775

Table 3.2: NLSY97 sample codebook entries


by the absence of a -3 code. Searching the documentation we learn that this was used
when there was an interviewer mistake or an invalid skip. We didnt have any of
these for this item, but might for others. We need to distinguish between these types
of missing values. The refusals and the dont know responses are people who should
have answered the question but didnt answer or didnt know what they thought. By
contrast, notice there are 3,669 valid skips. These are people who were not asked the
question for some reason. We need to check this out and if we read the documentation
we learn that these people were not asked because they were over 14 years old and only
youth 12-14 were asked this series of questions. Finally, there were 775 youth who were
not interviewed. We need to check this out as well. This is 1999 data, the third year
the data was collected and the researchers were unable to locate these youth or they
refused to participate in the third year of data collection. In any event, we need to
define missing values in a way that keeps track of these distinctions.
How does our Stata dataset look compared to the codebook we downloaded. We
can run the command codebook. If you are following this you might do this now. We
will restrict it to just the single variable R3483600, by entering the command codebook
R3483600 in the Command window.The problem with our data is apparent when you
look at this codebook, the absence of value labels makes this for our actual data uninterpretable.
. codebook R3483600
R3483600

MOTH PRAISES R DOING WELL 1999


type:
range:

numeric (float)
[-5,4]

units:

3.3 Create value labels


unique values:
tabulation:

41
9
Freq.
775
3669
3
20
118
235
917
1546
1701

missing .:

0/8984

Value
-5
-4
-2
-1
0
1
2
3
4

Before we add value labels so our codebook looks more like the codebook we downloaded, we need to define the missing values in our dataset. Because all of the variables
have the same negative values coded for the different types of missing values, we can assign missing values in a single command, mvdecode all,mv(-5=.a\-4=.b\-3=.c\-2=.d\-1=.e)
. mvdecode _all, mv(-5=.a\-4=.b\-3=.c\-2=.d\-1=.e)
R3483600: 4467 missing values generated
R3483700: 4469 missing values generated
R3483800: 4468 missing values generated
R3483900: 4470 missing values generated
R3485200: 5608 missing values generated
R3485300: 5611 missing values generated
R3485400: 5610 missing values generated
R3485500: 5611 missing values generated
R3828100: 775 missing values generated
R3828700: 775 missing values generated

3.3

Create value labels

We will add value labels including labels for the missing values. We know from the
dataset documentation that we will be reverse coding some of the variables because
some items are stated positively (praises) and some are stated negatively (blames), so
when we create value labels, we will want labels for the original coding and labels for
the reversed coding. Weve chosen to call the labels for the parental relations items
often and those for the reversed items often r. The text of the labels is taken from
the codebook file. Here are the commands to add the labels. You can, if you wish, use
the menus to do this.
. label define often 0 "Never" 1 "Rarely" 2 "Sometimes" 3 "Usually" 4 "Always"
. label define often .e "Non-interview" .d "Valid skip" .c "Invalid skip", add
. label define often .b "Dont know" .a "Refusal", add
. label define often_r 4 "Never" 3 "Rarely" 2 "Sometimes" 1 "Usually" 0 "Always"
. label define often_r .e "Non-interview" .d "Valid skip" .c "Invalid skip", add
. label define often_r .b "Dont know" .a "Refusal", add

Weve entered value labels, but we havent yet attached them to any variables. The
next task is to assign the value labels to the variables. This can be done in several

42

Chapter 3. Preparing Data for Analysis

ways. You can open the Data Editor, then right click in any cell in each of the columns,
to bring up the menu, then select the Assign Value Label to Variable, and choose the
appropriate value label from the list. Weve done this for R3483600 and the results
appear in Figure 3.1. Notice that after we do this for R3483600 the value labels appear
in the Data Editor for this variable.

Figure 3.1: Create new variable

Sometimes these value labels are what you want to see in the Data editor and sometimes you need to see the actual numbers the value labels represent. If you want to see
the numbers instead of the labels, there is a simple command you enter in the Command
window, edit, nolabel. This opens the Data editor and shows the numerical values
and the missing codes.
Another way you can assign the value labels is through the menu system. You could
use the . Data menu, . Labels, . Label values, . Assign value labels to variable, as we did
in Chapter 2. If you wanted to enter the commands directly, you could type commands
into the Command Window:
.
.
.
.

label
label
label
label

values
values
values
values

R3483600
R3483800
R3485200
R3485400

often
often
often
often

We have used the value label called often for the positively stated items. We will
assign the often r scheme to the other items after we reverse code them.

3.4 Reverse code variables

3.4

43

Reverse code variables

We have two items for the mother and two for the father that are stated negatively
and we should reverse code these. The R3483700 item refers to the mother criticizing
the adolescent and the R3485300 item refers to the father criticizing the adolescent.
The R3483900 and R3485500 refer to the mother and father blaming the adolescent for
problems. For these items, an adolescent that reports this always happens would mean
the adolescent had a low rating of his or her parent. We always want a higher score to
signify more of the variable. A score of 0 for never on this pair of items should be the
the highest score on these items, i.e. 4, and a response of always should be the worse
response and would have the lowest score, i.e., 0.
It is good to organize all the variables that need reverse coding into groups based on
their coding scheme. It is also a good idea to create new variables instead of reversing
the originals. This insures that we have the original information if any question arises
about what we have done. We recommend that you again write things out before you
start working, as weve done in Table 3.3.

Old value

New value

0
1
2
3
4

4
3
2
1
0

Table 3.3: Reverse coding plan

For a small dataset like the one we are using as our example, this may seem like
overkill. When you are involved in a large-scale project, the experience gained from
working out how to organize these matters on small datasets will stand you in very
good stead, particularly if you need to assign tasks to several people and try to keep
them all straight.
There are several ways that this recoding can be done, but well use recode right
now. From the . Data menu, choose . Create or change variables, . Other variable
transformation commands, . Recode categorical variable.
The recode command, and its dialog box, are straightforward. You provide the
name of the variable to be recoded, then you give at least one rule for how it should
be recoded. Recoding rules are enclosed in parentheses, with an old value (or values)
listed on the left, then an equals sign, then the new value. The simplest kind of rule
lists one value on either side of the equals sign. So, for example, the first recoding rule
we would use to reverse code our variables would be (0=4).

44

Chapter 3. Preparing Data for Analysis

Figure 3.2: Recode: specifying recode rules on the main tab

Figure 3.3: Recode: specifying new variable name on the options tab

3.4 Reverse code variables

45
More on recoding rules

More than one value can be listed on the left side of the equals, as in (1 2 3=0),
which would recode values of 1, 2, and 3 as 0. Usually more than one rule will be
wanted, but occasionally you may just want to collapse one end of a scale, as with
(5/max=5) which recodes everything from 5 up to but not including the missing
values as 5, which might be used to collapse the highest income categories into one,
for example. There is also a corresponding version to recode from the smallest value
up to some value. For example, (min/8=8) might be used to recode the values for
highest attained grade in school if you wanted the same score for everybody with
fewer than 9 years of education coded the same way.

When you use recode, you can choose to recode the existing variables or create
new variables that have the new coding scheme. We strongly recommend creating new
variables. The dialog boxes shown in Figures 3.2 and 3.3 show how this is done for all
four variables being recoded. The first list is the original variables on the Main tab, as
shown in Figure 3.2. The second list is the names of the new variables to be created on
the Options tab, as shown in 3.3. We started the started the variables for mothers with
an mo and the variables for fathers with fa. We added an r at the end of each variable
to remind us that they are reverse coded. Stata will pair up the names, starting with
the first name in each list, and recode according to the rules stated. You must have the
same number of variable names in each list. The output from the selections shown in
Figures 3.2 and 3.3 is:
. recode R3483700 R3483900 R3485300 R3485500 (4=0) (3=1) (2=2) (1=3) (0=4), genera
> te(mocritr moblamer facritr facrtm)
(3230 differences between R3483700 and mocritr)
(4037 differences between R3483900 and moblamer)
(2562 differences between R3485300 and facritr)
(3070 differences between R3485500 and facrtm)

The output above shows how Stata presents output that is too long to fit on one
line: it moves down a line and inserts a > character at the beginning of the new line.
This differentiates command continuation text from the output of the command. If you
are trying to use the command in output as a model for typing a new command, you
should not type the > characters. Notice the first line ends with genera and you would
delete the > and spaces so the word would be generate
The recode command labels the variables it creates. The text is generic, and you
may wish to replace it. For example, the mocritr variable was labeled with the text:
RECODE of R3483700 (MOTH CRITICIZE RS IDEAS 1999)

That tells you the original variable name (and its label in parentheses), but it does
not tell you how it was recoded. It may be worth the additional effort to create more
explicit variable labels, as in

46

Chapter 3. Preparing Data for Analysis


.
.
.
.

label
label
label
label

variable
variable
variable
variable

mocritr "Mother criticizes


moblamer "Mother blames R,
facritr "Father criticizes
fablamer "Father blames R,

R, R3483700 reversed"
R3483900 reversed"
R, R3485300 reversed"
R3485500 reversed"

The recode command is quite useful. Missing values can be specified either as
original values or as new values, and a value label can be created as part of the recode
command. You can also recode a range (say from 1 to 5) easily, which is useful for
recoding things like ages into age groups. Anytime the resulting values are easily listed
(most often integers, but they could be noninteger values), it is worth thinking about
using the recode command.

3.5

Create and modify variables

There are three primary commands used to create and modify variables: generate and
egen are used to create new variables, and replace is used to modify the values of
existing variables. Well illustrate the use of each of these commands in this section,
but they can do much more than we are able to show.
The simplest kind of variable creation is where you simply assign the same value
to every observation in the new variable, which is called creating a constant variable.
The next simplest kind of variable creation is where you simply copy the values from
one variable into a new variable. One of the tasks in Table 3.1 was to create copies
of some of the original variables that have more descriptive names than those supplied
with NLSY97, and well do that now.
To create a new variable using the menus, go to . Data menu, . Create or change
variables, . Create new variable, which will bring up the dialog box shown in Figure 3.4.
The dialog is pretty simple. Theres a place for you to enter the name of the new
variable, and theres a place for you to enter what the new variable should equal. We
want to create a variable called id and it should contain whatever is in the variable
R0000100.
You could enter the generate command directly in the Command window to do the
same thing, as in, as shown by the command that the dialog box generates:
. generate id

= R0000100

To create a constant variable, simply replace the variable name on the left side of the
equals sign with the value you would like the variable to have.
. generate one

= 1

Stated generally, the way to assign values to a new variable with the generate
command is simple: you put the command itself, then the name of the new variable,
then an equals sign, and finally you type out some expression (or rule, or formula) that
tells generate the value to put in the new variable. Here is what the commands to
rename the remaining items that did not need to have their coding reversed look like.

3.5 Create and modify variables

47

Figure 3.4: Create new variable

.
.
.
.
.
.
.

generate
generate
generate
generate
generate
generate
generate

id = R0000100
sex = R0346300
age = R0346600
mapraise = R34836000
mahelp = R3483800
fapraise = R3485200
fahelp = R3485400

The generate expression can be as simple as a single number or the name of another
variable, as shown above, or it can be quite complicated. You can put variable names,
arithmetic signs, and many kinds of functions into expressions. Table 3.4 shows the
arithmetic symbols that can be used. You should take note that attempts to do arithmetic with missing values will lead to missing values. So, in the addition example in
Table 3.4, if sibscore were missing (say it was a single-child household), then the whole
sum in the example would be set to missing for that observation. From the command
window you can type help generate and you will see several additional examples of
what can be done. .
Symbol

Operation

Example

+
*
/
^

addition
subtraction
multiplication
division
exponentiation (x2 )

mscore + fscore + sibscore


balance - expenses - penalty
income * .75
expenses/income
x^2

Table 3.4: Arithmetic symbols

For more complicated expressions, the order in which things happen can be important, and parentheses are used to control the order in which things are done. Parenthe-

48

Chapter 3. Preparing Data for Analysis

ses contain expressions, too, and those are calculated before the expressions outside of
parentheses. Parentheses are never wrong. They might be unnecessary to get Stata to
calculate the correct value, but they arent wrong. If you think they make an expression
easier to read or understand, use as many as you need.
Fortunately, the rules can be made pretty simple. Things are done reading from left
to right, and the order in which things inside expressions are calculated is:
1. Do everything in parentheses. If one set of parentheses contains another set, do
the inside set first.
2. Exponentiate (raise to a power)
3. Multiply and divide
4. Add and subtract
Lets step through an example:
. generate example = weight/.45*(5+1/age^2)

When Stata looks at the expression to the right of the equals sign, it notices the parentheses, so it looks inside those. There it sees that it first has to square age, then divide
1 by the the result, then add 5 to that result. Once it is done with all the stuff in
parentheses, it starts reading from left to right, so it divides weight by .45 and then
multiplies that by the final thing it got from calculating what was inside parentheses.
That final value is put into a variable called example.
Stata doesnt care about spaces in expressions, but they can help readability. So,
for example, instead of writing something like we just did, we can use some spaces to
make the meaning clearer, as in,
. generate example = weight/.45 * (5 + 1/age^2)

If you wanted to be even more explicit, it would not be wrong to write


. generate example = (weight/.45) * (5 + 1/(age^2))

Lets take another look at reverse coding. We reverse coded variables by using a set
of explicit rules like (0=4). But we could accomplish the same thing using arithmetic.
Since this is a relatively simple problem, well use it to introduce some of the things you
may need to be concerned about with more complex problems.
Reversing a scale is swapping the ends of the scale around. Our scale is 0 to 4, so
we can swap the ends around by subtracting the current value from 4. If the original
value is 4, and we subtract it from 4, weve made it a 0, which is the rule we specified
with recode. If the original value is 0, and we subtract it from 4, weve made it a 4,
which is the rule we specified with recode.
This scale starts at 0, so to reverse it, you just subtract each value from the largest
value in the scale, in this case 4. So, if our scale was 0 to 6, wed subtract each value

3.5 Create and modify variables

49

from 6; if it was 0 to 3, wed subtract from 3. If the scale starts at 1 instead of 0,


then you need to add one to the largest value before subtracting. So, for a 1 to 5 scale
(6 1 = 5 and 6 5 = 1), you would subtract from 6; for a 1 to 3 scale, youd subtract
from 4 (4 3 = 1 and 4 1 = 3).
What weve said so far is correct, as far as it goes, but we arent taking into account
missing values and/or their codes. Our missing value codes are 1 to 5, and if we
subtract those from 4 along with the item responses that arent missing value codes,
well end up with 4 (1) = 5 to 4 (5) = 9. So, we would need to add a second
mvdecode command for the reversed variables. Or, we could first convert our missing
values, then do the arithmetic to reverse the scale.
Lets just work with one variable, since this is an example. Weve already applied
the mvdecode command, so lets reverse code R3485300 and call it Facritr (remember
that Stata is case-sensitive, so facritr and facritr are different names). We use the
same dialog that is shown in Figure 3.4, except enter 4 - R3485300 into the Contents
of new variable box instead of just R3485300. The output is
. generate float facritr = 4 - R3485300
(5611 missing values generated)

Whenever you see that missing values are generated (there are 5,611 of them in this
example!), it is a good idea to make sure you know why. The variables we are working
with only have a small set of values they can take, so we can compare the original
variable with the new variable in a table and see what got turned into what. From
the . Statistics menu, choose . Summaries, tables, & tests, . Tables, . Two-way tables
with measures of association, which will bring up the dialog shown in Figure 3.5. Weve
selected the two variables, and it is worth noting that weve checked two of the options
boxes. We are interested in the actual values the variables take, so we selected the
option to suppress the value labels, and we are interested in the missing values, so we
check the box to have them included in the table.
. tabulate facritr R3485300, miss
RECODE of
R3485300
(FATH
CRITICIZE
IDEAS
1999)
0
1
0
1
2
3
4
.a
.b
.d
.e

0
0
0
0
1,120
0
0
0
0

Total

1,120

0
0
0
1,078
0
0
0
0
0
1,078

FATH CRITICIZE IDEAS 1999


2
3
4
0
0
811
0
0
0
0
0
0
811

0
247
0
0
0
0
0
0
0
247

117
0
0
0
0
0
0
0
0
117

.a

.b

Total

0
0
0
0
0
775
0
0
0

0
0
0
0
0
0
4,816
0
0

117
247
811
1,078
1,120
775
4,816
4
16

4,816

8,984

775

50

Chapter 3. Preparing Data for Analysis

Figure 3.5: Two-way tabulation dialog

RECODE of
R3485300
(FATH
CRITICIZE
IDEAS
1999)

FATH CRITICIZE IDEAS


1999
.d
.e

0
1
2
3
4
.a
.b
.d
.e

0
0
0
0
0
0
0
4
0

0
0
0
0
0
0
0
0
16

Total
HLI10
117
247
811
1,078
1,120
775
4,816
4
16

Total

16

8,984

From the table generated, we quickly see that everything happened as we anticipated.
Those adolescents who had a score of 0 on the original variable (column with a 0 at the
top, now have a score of 4 on the new variable. There are 1,120 of these observations.
We need to check the other combinations as well. Also note that the missing value codes
were all transferred to the new variable. Our new variable facitr is a reverse coding
of our old variable R3485300.

3.6 Create scales

51
Beyond egen

We have seen how generate and egen cover a wide variety of functions. You can
always enter the command help generate or help egen to see a description of
all the functions that are available. Sometimes this is not enough. Nicholas J.
Cox has written and continues to update an command called egenmore that adds
many additional functions that are helpful for data management. You can enter
ssc install egenmore, replace in your Command window and this will install
these added capabilities. The option replace will check for updates he has added
if egenmore is already installed on your machine. If you then enter help egenmore
you will get a description of all the capabilities Cox has added.

3.6

Create scales

We are finally ready to calculate our scales. There are two scale variables that will
be constructed, one for the adolescents perception of his or her mother and one the
adolescents perception of his or her father. With four items, each scored from 0 to
4, the scores could range from 0 to 16 points. Higher scores indicate a more positive
relationship with the parent.
At first glance, this is straightforward. We just add the variables together, which
gives us
. generate ymorelate = mopraise + mocritr + mohelp + moblamer
. generate yfarelate = fapraise + facritr + fahelp + fablamer

Before we settle on this, we should understand what will happen in the case of missing
values, and we should check through the documentation to find out what they did about
missing values. The first thing to do is to determine how many observations answered,
one, two, three, or all four of the items.This introduces a new command and dialog box:
egen shown in Figure 3.6.
From the . Data menu, select . Create or change variables, . Create new variable
(extended) to display the egen dialog. As you can see, there is a box into which you put
the name of the variable to be created which we will call mamissing because this will
tell us how many of the items about mothers that each adolescent missed answering.
The next step is to select the function from the list. In this case, we want to look among
the row functions for one that will count missing values, which appears as . Row number
of missing values. Last, we enter the items about the mother in the box provided. These
selections will generate the command
. egen mamissing = rowmiss(mapraise macritr mahelp mablamer)

If we then look at a tabulation of the mamissing variable, the rows will be numbered 0
to 4, and each entry will indicate how many observations have that many missing values.

52

Chapter 3. Preparing Data for Analysis

Figure 3.6: The extended generate dialog

For example, there are 4,510 observations for which all four items were answered (no
missing values), 6 for which there are three items answered (one missing value), 2 for
which there are 2 answered items (two missing values), and 4,466 for which there are
no answered items.
. tab mamissing
mamissing

Freq.

Percent

Cum.

0
1
2
4

4,510
6
2
4,466

50.20
0.07
0.02
49.71

50.20
50.27
50.29
100.00

Total

8,984

100.00

From this, we can see that there are a total of 8 observations for which there is some,
but not complete, data, There should be little doubt what to do for the 4,510 cases with
complete information, nor for the 4,466 with no information (all missing). The problem
with using the sum of the items as our score on ymorelate is that the 8 observations
with partial data will be dropped, i.e., given a value of missing.
There are several solutions. One thing we might decide to do is to compute the
mean of the items that are answered. We can do this with the egen command. We can
get this from the egen menu using the . Data menu, select . Create or change variables,
. Create new variable (extended) to display the egen dialog as we displayed in Figure 3.5.
However, this time we pick will call our scale mameana and pick row mean from the list
of egen functions. This generates a mean of the items that were answered, regardless

3.6 Create scales

53

of how many were answered, as long as at least one item was answered. If we do a
tabulation for mameana we have 4,518 observations. This mean that everybody who
answered at least one item has a mean of the items they answered.
Another solution is to have some minimum number of items, say 75% or 80% of
them. We might only include people who answered at least 3 of the items. These
people would have fewer than 2 missing items. We can go back to our menu for egen
and rename the variable we are computing mameanb. Clicking on the tab for by/if/in
we can stipulate the condition that mamissing < 2 as shown in Figure 3.7.

Figure 3.7: The extended generate dialog

The Stata command this generates is egen float mameanb = rowmean(mopraise


mocritr mohelp moblamer) if mamissing < 2. This creats a variable, mameanb, that
is the mean of the items for each observation which has at least 3 of the 4 items answered.
Doing a tabluation on this shows that there are 4,516 observations with a valid score.
We were able to keep the 6 cases that answered all but one of the item and we dropped
the 2 cases that were missing more than one item.

54

Chapter 3. Preparing Data for Analysis


Setting how much output is in the Results window
If you are working in a class and doing homework is the extent of your Stata
usage, you might not run out of space in the Results window. If you are working
on a research project, or need to generate more than a few models and see their
results, you will almost certainly want to increase the number of lines that you can
scroll back in the Results window. You do this from the . Prefs menu, . General
Preferences. Click on the Windowing tab. The default size for the scrollback buffer
is 32000 characters. Make it at least 200000, which is about .2 MB. You may also
want to turn the More feature off using the command set more off. If you enter
the command help set you will find many other things you can personalize about
the Stata interface.
Deciding among different ways to do something
When you need to decide among two or more different ways of performing a data
management taskwhich is what weve been doing this whole chapterit is usually
better to choose based on how easy it will be for someone to read and understand
the record rather than based on the number of commands it takes or based on
some notion of computer efficiency. The computing of the row mean has a major
advantage over computing the sum of items. The mean is on the same 0-4 scale that
the items have and this often makes more sense than a scale that has a different
range. For example, if you had 9 items that ranged from 1 for strongly agree to
5 for strongly disagree, a mean of 4 would tell you that the person has averaged
agree over the set of items. This might be easier for users to understand than a
total or sum score of 36 on a scale that ranges from 9 (1 on each item) to 45 (5 on
each item).

3.7

Save some of your data

There will be many occasions when you wish to save only part of your data. You may
wish to save only some variables, or you may wish to save only some observations.
Perhaps you have created variables that were needed as part of a calculation but do
not need to be saved, or perhaps you wish to work only with the created scales not the
original items. You might want to drop some observations based on some characteristic,
such as sex or parents educational level or geographic area.
To drop variables from your dataset use . Data menu, choose . Variable utilities,
. Eliminate variables, which will display the dialog box shown in Figure 3.8. To drop
observations from your dataset use . Data, . Variable utilities, . Eliminate observations,
which will display the dialog box shown in Figure 3.9.

3.7 Save some of your data

Figure 3.8: Selecting variables to drop

Figure 3.9: Selecting observations with an expression

55

56

Chapter 3. Preparing Data for Analysis

Sometimes it is easier to list the variables you wish to keep instead of those you wish
to drop. As you can see from Figure 3.8, thats done from the same dialog box. Click
the Keep button, then check the box to keep variables, and fill in the list of variables
you want to keep.
Often the criterion for selecting what to save to a file is not the variable, but the
observation. For example, there might be a large, master dataset, and you wish to
work only with those 14 and younger. Again, you have the choice to specify which
observations by specifying which to keep or which to drop. To select by observation,
select the Keep or Drop button, then fill in the expression. Figure 3.9 show the dialog
filled in to keep respondents age 14 and younger.
Were going to save two versions of our dataset. Well save one that contains all the
variables, and well save one from which weve dropped the original R0 variables. The
commands to do this are:
.
.
.
.
.
.
.

* Save the master dataset before dropping variables


save nlsy97_mstr, replace .
* Drop the original items because we no longer need them
drop R0000100-R0538200
* Reorder the variables when saving to make them easier to read
order id sex age ymorelate yfarelate m* f*
save nsly97_fp, replace

Note two things about the save command: Weve left off the .dta extension, and
weve used the replace option. Stata automatically appends .dta to the file name, so
we can leave it off. We use the replace option so that if we rerun these command, we
wont get an error because nsly97 fp.dta already exists.

3.8

Summary

This chapter has covered a lot of ground. You may need to refer back to this chapter
when you are creating your own datasets. Weve covered how to label variables and
create value labels for each possible response to an item. In learning how to create a
scale we covered reverse coding of items that were stated negatively and how to create
and modify items. We also covered how to create a scale and work with missing values,
especially where some people who have missing values are included in the scale and
some are excluded. Finally, we covered how to save parts of a file whether the parts
were selected items or selected observations.
In the next chapter well look at how to create a file containing Stata commands that
can be run as a group. Our webpage has such a command file for the current chapter
(c3.do. By recording all the commands into a program we have a record of what we
did and we can re-run it or modify it in the future. Well also see how to record both
output and commands in files.

3.9 Exercises

3.9

57

Exercises
1. Open the relate.dta Stata dataset. The variable R3828700 is the sex of the
adolescent. It is coded as a 1 for males and 2 for females. There are a number
of people who are missing because they dropped out of the study after a previous
wave of data collection. These people have a missing value on R3828700 of -5.
Modify the variable so that the code of -5 will be recognized by Stata as a missing
value. Then do a tabulation of R3828700.
2. Using the result of the first exercise, generate a new variable called male that has
a value of 1 for males and a value of 0 for females. Do a two way table of male
and R3828700 that shows you did this correctly.
3. Using the result of the second exercise, assign value labels to male so a 1 says Male
and a 0 says female. Do a tabulation of the variable male that shows the labeling.
Next, enter the command set numlabel, add in the Command window. Then
repeat the tabulation of the variable male. What is the difference?
4. The relate.dta dataset is data from the third year of a long term study. Because
of this, some of the youth who were adolescents the first year (1997) are over
18. Assign a missing value on R3828100 (age) for those with a code of -5.Drop
observations that are 18 or over. Keep only R0000100, R3483600, R3483800,
R3485200, R3485400, R3828700. Save this as a dataset called positive.dta.
5. Using the positive.dta dataset, assign missing values to the four items R3483600,
R3483800, R3485200, R3485400. Copy these four items to four new items called
mompraise, momhelp, dadpraise, dadhelp.
6. Create a scale called parents of how youth relate to their parents using the four
items (R3483600, R3483800, R3485200, R3485400). Do this separately for boys
(if R3828700 == 1) and girls (if R3828700 == 2). Use the row mean function
to create your scale. Do a tabulation of parents.

Author index
B
Bureau of Labor Statistics, Appendix 9
. . . . . . . . . . . . . 39
C
Cox, N. J., egenmore . . . . . . . . . . . . . . . 51
M
Maker, trouble . . . . . . . . . . . . . . . . . . . . . . . 1
S
StataCorp,Data Management reference
Manual . . . . . . . . . . . . . . . . . . . . 37

Subject index
C
codebook . . . . . . . . . . . . . . . . . . . . . . . 22, 40
coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Command Window . . . . . . . . . . . . . . . . . . 4
command, clear . . . . . . . . . . . . . . . . . . . . . 26
command, codebook . . . . . . . . . . . . . . . . 32
command, describe . . . . . . . . . . . . . . . . . 34
command, destring . . . . . . . . . . . . . . . . . 27
command, driven . . . . . . . . . . . . . . . . . . . . 2
command, drop observations . . . . . . . . 54
command, drop variables . . . . . . . . . . . 54
command, edit, nolabel . . . . . . . . . . . . . 42
command, egen . . . . . . . . . . . . . . . . . . . . . 46
command, egen rowmiss . . . . . . . . . . . . 51
command, egenmore . . . . . . . . . . . . . . . . 51
command, generate . . . . . . . . . . . . . . . . . 46
command, generate math functions. .47
command, generate rowmean . . . . . . . 53
command, histogram. . . . . . . . . . . . . . . .10
command, keep observations . . . . . . . . 54
command, keep variables. . . . . . . . . . . .54
command, label define . . . . . . . . . . 30, 41
command, label values . . . . . . . . . . 30, 42
command, more . . . . . . . . . . . . . . . . . . . . 33
command, mvdecode . . . . . . . . . . . . 41, 49
command, replace . . . . . . . . . . . . . . . . . . 46
command, saving . . . . . . . . . . . . . . . . . . . 31
command, set more off . . . . . . . . . . . . . . 54
command, variable labels . . . . . . . . . . . 28
conventions used in the book . . . . . . . 16
D
Data Editor . . . . . . . . . . . . . . . . . . . . . . . . 26
Data Editor, nolabels . . . . . . . . . . . . . . . 42
dataset, contents. . . . . . . . . . . . . . . . . . . .32
dataset, create . . . . . . . . . . . . . . . . . . . . . . 19
Datasets used, cancer.dta . . . . . . . . . . . . 8

dialog box, submit vs ok . . . . . . . . . . . . 15


Dialog box, summarize . . . . . . . . . . . . . . . 9
dictionary file . . . . . . . . . . . . . . . . . . . . . . . 38
Do-files, introduction . . . . . . . . . . . . . . . . 2
E
entering data . . . . . . . . . . . . . . . . . . . . . . . 26
exit Stata. . . . . . . . . . . . . . . . . . . . . . . . . . .15
F
format, numeric . . . . . . . . . . . . . . . . . . . . 27
format, string . . . . . . . . . . . . . . . . . . . . . . . 27
G
generate, math functions . . . . . . . . . . . . 47
GUI interface, Preferences . . . . . . . . . . . 3
H
help, videos . . . . . . . . . . . . . . . . . . . . . . . . . 18
help, web base . . . . . . . . . . . . . . . . . . . . . . 18
M
missing values . . . . . . . . . . . 22, 24, 41, 49
missing values, types . . . . . . . . . . . . . . . . 40
more . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
N
NLSY97 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
O
Open, Stata installed dataset, sysuse. .7
Open,existing dataset . . . . . . . . . . . . . . . . 7
options, add . . . . . . . . . . . . . . . . . . . . . . . . 30
options, modify . . . . . . . . . . . . . . . . . . . . . 30
P
Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . 4

62
project outline . . . . . . . . . . . . . . . . . . . . . . 38
R
recoding, ranges . . . . . . . . . . . . . . . . . . . . 45
Results Window, scroll size . . . . . . . . . 54
Results, more . . . . . . . . . . . . . . . . . . . . . . . 54
Results, Window . . . . . . . . . . . . . . . . . . . . . 4
reverse code variables . . . . . . . . . . . . . . . 43
Review Window . . . . . . . . . . . . . . . . . . . . . 4
S
scale creation . . . . . . . . . . . . . . . . . . . . . . . 51
schemes, s1monochrome . . . . . . . . . . . . 12
Stata Screen . . . . . . . . . . . . . . . . . . . . . . . . . 3
summarize . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
sysuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
U
UCLA Stata portal . . . . . . . . . . . . . . . . . 18
V
value labels. . . . . . . . . . . . . . . . . .20, 28, 41
value labels, assign. . . . . . . . . . . . . . . . . .42
variable label . . . . . . . . . . . . . . . . . . . . . . . 20
variable name . . . . . . . . . . . . . . . . . . . 20, 22
variable names . . . . . . . . . . . . . . . . . . . . . . 20
Variables Window . . . . . . . . . . . . . . . . . . . 4
W
Working directory . . . . . . . . . . . . . . . . . . . 7

Subject index

S-ar putea să vă placă și