
An Introduction to Stata

Hemanshu Kumar
August 2015

1 The Stata Layout

The Stata screen comprises four windows:


Results: the biggest window is where Stata shows the commands you
enter and the results they generate.
Command: the window found at the bottom of the screen by default.
This is where you type in commands.
Review: this window displays the history of commands entered in
the command window.
Variables: If there is data loaded, then this window displays the
variables in the dataset.
These windows can be unpinned, moved around, etc. Feel free to experiment!
In addition, note the toolbar on top. In particular, the Data, Graphics
and Statistics menus contain a wide variety of commands which are very
useful for most data summary and description purposes.

2 Some Preliminary Notes


Stata can be used interactively, by entering commands one at a time
in the command window, or using the menus on the toolbar. However,
for larger tasks, it is best to put all the commands together in a file
and run them together. This has the advantage that you can easily
replicate your work later, as well as make changes easily if you make
mistakes. The file which stores your commands is a simple text file,
and is given the extension .do. More on this later.
Stata works by loading a (single) dataset into the computer's memory.
All the commands you give work with the data in memory, and the
data in the file is left untouched. The file is only changed when you
explicitly ask for the data to be saved to file. This minimizes the
chance that any mistakes you make permanently destroy your data.
Stata variables are actually entire vectors, with one value for each
observation. If you want to store a simple number or string, it is
more appropriate to use scalars. Stata also provides a simple way to
store matrices of numbers. Scalars and matrices are held in memory
alongside the dataset, but are not part of the dataset itself.
If you need to give Stata a command that involves a filename and/or
directory name that contains spaces, you must enclose the file/directory
name in a pair of double quotation marks ("").
Stata commands and variable names are case-sensitive.
In any command, Stata is generally insensitive to the number of
consecutive spaces.

3 Command Syntax

The essential syntax of a Stata command looks as follows:

.[prefix_cmd:] command [varlist] [if] [in] [, options]
where the portions enclosed in square brackets are optional. We shall
see various examples of Stata commands ahead; you should return here to
see how they fit into this schema, and as a guide to know how to make your
own modifications to the commands.
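
As a rough illustration of how real commands map onto this schema, here is a minimal sketch. It assumes the example dataset auto that ships with Stata (loaded with sysuse), which is not otherwise used in this guide. The commands are written as they would appear in a do file, without the leading dot.

```stata
* Load the example "auto" dataset that ships with Stata (an assumption of this sketch)
sysuse auto, clear

* prefix_cmd: bysort foreign | command: summarize | varlist: price mpg
* if: price > 5000           | option: detail
bysort foreign: summarize price mpg if price > 5000, detail

* command: list | varlist: make price | in: observations 1 to 5 | option: clean
list make price in 1/5, clean
```

Each piece in square brackets in the schema can be dropped independently; typing just summarize is also a valid command.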

4 Some Preliminary Steps

For the bulk of this document, we will assume that you are working interactively in Stata, by typing commands in the command window.

4.1 Getting Started

Once we have launched Stata from Windows, we take immediate note of two
things on the screen:
the status bar at the bottom mentions the current working directory
of Stata on its left corner. Say we want to use and save datasets in
the directory C:\My Documents\C003. For this, just type
.cd "C:\My Documents\C003"
in the command window, without the initial dot (.). I will include the dot
at the beginning of each command, since this is the way the command
is displayed in the Results window. However, it is not typed in by the
user. In this command, cd is short for change directory. The double
quotes in the command we typed are only necessary when the name
of the directory contains spaces.
the results window typically states that Stata has allocated 1.0MB of
memory for data. In addition, in Intercooled Stata, the default matrix
size (matsize) is 200. Note that matsize limits the size of matrices,
and hence the number of variables that can enter a single estimation
command; it does not limit the number of variables in the dataset.
While these defaults are enough for small datasets, we might occasionally
need more. To change the memory allocation to 100MB, for example, and
matsize to 300, we can type
.set memory 100m
.set matsize 300
It is important to execute these commands before creating any variables
or loading any dataset into Stata's memory. Once a dataset has
been loaded and/or variables created, Stata does not let you fiddle
with memory because it would destroy the dataset in memory.
Note: The set memory command is obsolete starting with Stata 12,
since Stata now manages memory automatically.

4.2 Getting Help

Stata has an extensive help system. You can get help on Stata commands
at any time by typing
.help commandname
As an example, try asking for help on matsize. In addition, you can
search Stata's help system for any word(s) of your choice. For example, try
.search memory
In the help on a command, you will notice a portion of the command
underlined. This tells you the extent to which you can abbreviate that command in Stata. For example, in setting memory allocation above, we could
have typed just
.set mem 100m
.set mat 300

4.3 Logging your Work

Especially when using Stata in interactive mode, it is a good idea to keep
a log of your work: both the commands you entered and the results Stata
gave. To begin a log, either go to File > Log > Begin... and specify the
log filename and location, or in the command window, type
.log using mylog, fileformat
where of course you can replace mylog with any filename of your choice.
Stata can create logs in one of two formats: SMCL, which has rich formatting
but can only be read by Stata, and text files, which have almost
no formatting, but can be read by any editor such as Notepad, Wordpad,
Word, etc. By default, Stata uses SMCL logs. However, we can specify an
option in the above command, to use the text format. If we merely typed
in
.log using mylog
Stata would start a log file called mylog.smcl in the SMCL format, while
if we typed
.log using mylog, text
it would start a log file called mylog.log, which is a simple text file. This
also highlights one aspect of the standard syntax of Stata commands: the
options with a command are specified by entering a comma (,) after the main
command.
To suspend recording to a log file at any time, type
.log off
To resume a suspended log, just type
.log on
And to close a log file completely, type
.log close
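
The logging commands above can be combined into a short do-file sketch like the one below. The capture prefix and the replace option are not discussed in this guide; they are included here on the assumption that you want the script to be re-runnable without errors when a log is already open or mylog.log already exists.

```stata
* Close any log that is already open; capture suppresses the error if none is
capture log close

* Start a plain-text log, overwriting mylog.log if it already exists
log using mylog, text replace

display "this line is recorded"
log off
display "this line is not recorded"
log on
display "recording has resumed"

log close
```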

5 Using Stata in Batch Mode

5.1 Working with Do Files

Arguably the most professional way to use Stata is to do all your work
with do files. A do file is nothing but a text file that contains a series of
commands. When the file is run in Stata, the commands are processed
together in a batch. And since the do file is a simple text file, it can be
opened by any text editor, such as Windows' native Notepad or Wordpad
programs. I personally prefer to use a program called WinEdt.
However, Stata has its own do file editor as well, and to pull it up, you
can just type
.doedit
in the command window, or alternatively click on the envelope icon in
the toolbar near the top of Stata's main window. As an example of our first
do file, you could type into the editor
* this is my first ever do file
clear
Use the Ctrl + S keyboard shortcut to save this file with your chosen
name. Notice that the default file extension is .do. To execute this file, you
can either click the icon for "Do current file" in the toolbar of the do file
editor, or in Stata's command window, you can type
.do filename
where Stata assumes the .do extension to filename if it is not specified.

5.2 Comments

It is good programming practice to include extensive comments. This allows
others to understand your code, as well as for you yourself to make sense of
it at a later date. In Stata, single-line comments can be included by starting
the line with an asterisk (*), as you can see in the do file above. Long
comments that span multiple lines are also possible; in this case, the comment
should begin with /* and end with */ . In addition, you can also include
a comment at the end of a line which contains a Stata command. Such a
comment must be separated from the command by // as in the following
example:
.set obs 40 // this changes the number of observations to 40
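
The three comment styles can be seen side by side in the following sketch of a do file. To the best of my knowledge, the // style is only recognized inside do files and not when typed interactively in the command window, so treat this as something to run with do rather than line by line.

```stata
* a single-line comment, starting with an asterisk

/* a long comment that
   spans several lines */

clear
set obs 40    // this changes the number of observations to 40
```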

5.3 Backward Compatibility

Since new versions of Stata are constantly coming out, it is quite conceivable
that a do file you write today may not work a few months or years down
the line. The solution is to specify the Stata version you created the file in,
at the top of the file, using the version command. This is what you see in
the do file example in section 5.1 above.
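
A minimal sketch of a do file that declares its version is given below. The version number 14 is only an assumption for illustration; you should replace it with the version of Stata you are actually writing the file in, since asking for a version newer than the one installed produces an error.

```stata
* this is my first ever do file
version 14    // assumed version number: use the one you are running
clear
```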

5.4 Clearing Memory

The clear command is a very useful command that also often finds place
at the beginning of do files. In recent versions of Stata, it clears the
memory of all variables and observations, along with their value labels.
To also remove other Stata structures such as scalars, matrices,
and so on, use clear all.

6 Generating Data

In this guide, we will consider a situation where you are interested in creating
a dataset from scratch, rather than using a pre-existing one.
Stata thinks of its dataset as being comprised of several variables, all of
which have the same number of observations. In a spreadsheet or matrix
representation, the variables comprise the columns and the observations sit
on individual rows. The first thing to do when generating a dataset is to
tell Stata how many observations the variables will have.
.set obs #
where # should be replaced with the requisite positive integer. This
command is usually given when there are no pre-existing variables in memory
(see footnote 1).
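
As a small sketch of building a dataset from scratch, the do file below sets the number of observations and then fills in two variables. The variable names id and idsq are made up for illustration, and _n (the current observation number) is explained in the section on creating variables.

```stata
clear
set obs 5                      // five observations, no variables yet
generate id = _n               // 1, 2, ..., 5
generate double idsq = id^2    // a second variable built from the first
list

set obs 8                      // increase the count; the new rows hold missing values
list
```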

7 Saving your Data

Remember that Stata works with data in memory. Until you explicitly ask
Stata to save the data, no change will be made to any file on disk. Saving
your dataset in Stata's own proprietary format is simplicity itself. Suppose
you want to save to a file called mywork.dta in the working directory. You
need to type:
.save mywork
Notice that we did not need to specify the .dta extension; Stata adds it
automatically. If a file called mywork.dta already exists, Stata will promptly
give you an error. If you are sure you want to overwrite the existing file with
the dataset in memory, you should add the replace option to the command:
.save mywork, replace
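
The round trip from memory to disk and back can be sketched as follows. The use command, which loads a saved dataset, is not covered in this guide and is included here only to show that the file was written; the variable name id is hypothetical.

```stata
clear
set obs 3
generate id = _n

save mywork, replace    // writes mywork.dta in the working directory
use mywork, clear       // loads it back, discarding the data in memory
describe
```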

8 Taking a First Look at your Data

8.1 Browse command

Having loaded in your dataset, you will find the Variables window populated
by the various variable names that were found in your data. If you
imported a text file into Stata, Stata would have converted the variable
names to lowercase even if they were originally not so. By convention,
variable names in Stata are written entirely in lowercase, and no Stata
variable name can begin with a number.
Perhaps the first thing to do is to just look at the spreadsheet of your
data. This is achieved by typing
.browse
As an aside, note that the browse command can be abbreviated to as
little as br.
Footnote 1: If there are variables in memory, then the set obs command can be
used to increase the number of observations in the dataset. In this case, the
new observations will all have missing values.

The Data Browser pane that opens up allows you to look at your data,
but not edit it. Also, while the Data Browser is open, no other commands
can be executed by Stata. These are for your protection! Stata strongly
discourages direct editing of data; you should use commands, so that you
have a better track of what changes are made, and are forced to change
data in a consistent manner. If you are sure you want to manually change
the data, you can always use the edit command.
If you have missing observations/cells in your data, these are recorded
as a single dot (.) for numeric variables.

8.2 Conditions, Ranges and Variable Lists

Sometimes, you may wish to look at only part of your data. For example,
you might have a variable country, which stores names of various countries,
and you may wish to see only those observations for which country takes
on the value India. To do this, type
.br if country == "India"
Most commands in Stata accept the if argument. Notice that this is not
an option: it does not need to be preceded by a comma. if executes a
command for those observations for which the succeeding logical expression
holds true. You should note that in a logical expression for equality, we
must use a double equals sign (==). See help operator for more.
Instead of performing a command (such as browse) for a set of observations
which satisfy a condition, if we want to execute it over some specific
range of observation numbers, we can do the following:
.br in 50/l
(where l is the lowercase L). This would browse the observation numbers
from 50 to the last. (f, for the first observation, is also available as a special
character.) Notice the use of the forward slash (/) to give an observation
range.
We could also choose to browse only a subset of the variables in our data.
Suppose our dataset had five variables, country, year, gdp, gnp, exrate,
listed in that order in our Variables window. Then
.br country year gdp gnp if year>1990
would show us the specified variables for the data from after 1990. Notice
that in a list of variables, the individual variables are separated by spaces.
Stata also allows us to abbreviate variable names as long as it can
uniquely identify a variable from its abbreviation. In addition, wildcards
such as * and ? are permitted. Thus, the same result as above could be
obtained by typing
.br c y g* if y>1990
You can also use - to shorten a list of variables, using the order in the
Variables window. Thus, the same result as above could also be obtained by
.br c-gnp if y>1990
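
The same ideas can be tried without opening the Data Browser by using list, which prints observations in the Results window. The sketch below assumes the example dataset auto that ships with Stata, since the country dataset described above is hypothetical.

```stata
sysuse auto, clear

list make price if foreign == 1      // condition: foreign cars only
list make price in 70/l              // range: observation 70 to the last
list make price in f/5               // range: first observation to the fifth
list m* in 1/3                       // wildcard: make and mpg
list make-rep78 in 1/3               // hyphen: make, price, mpg, rep78
```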

8.3 Other data description commands

Typing just
.describe
gives you a basic summary of your data, including the source dataset, the
number of observations and variables, the amount of memory in use, and
a list of all variables with their respective storage types, display formats
and labels (more later about labels). To describe only specific variables, the
syntax is
.describe varlist
where varlist is a list of variables.
For numeric variables, the inspect command provides a useful first pass
at the nature of the data: it gives a small histogram, tells you the number
of unique and missing values, and the number of values which are positive/zero/negative and integer or not.
For categorical (nominal) variables, we can quickly obtain a frequency
distribution of the data using the tabulate command.
For any variable, its values (if desired, for a specified range of observations) in the dataset can be obtained using the command list.
Basic descriptive statistics for cardinal variables can be obtained with
the summarize command. For example,
.sum gnp gdp if year<=1990
(where sum is the abbreviated version of summarize) provides the mean,
standard deviation and minimum and maximum values of gnp and gdp for
years up to and including 1990. Adding the detail option to the command
provides a larger set of statistics, including quantiles, skewness and kurtosis.
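
A quick tour of these description commands might look like the sketch below. It again assumes the example dataset auto that ships with Stata, so the variable names differ from the gdp and gnp example in the text.

```stata
sysuse auto, clear

describe                      // overview of the whole dataset
describe price mpg            // only the listed variables
inspect mpg                   // quick look at a numeric variable
tabulate foreign              // frequency distribution of a categorical variable
list make mpg in 1/5          // values for a range of observations
summarize price mpg if foreign == 0
summarize price, detail       // adds percentiles, skewness and kurtosis
```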

9 Creating and Deleting Variables

9.1 Creating Variables

We often need to create new variables in a dataset which operate on the
existing variables to give us our quantity of interest. Suppose for example
that we are not interested in GDP itself, but in its logarithm. We could then
create a new variable, say called lgdp, by using the generate command:
.generate lgdp = ln(gdp)
where the function ln() gives the natural log. Suppose we change our mind
and decide we want the log to the base 10 instead. We can then replace
our variable as follows:
.replace lgdp = log10(gdp)
If we had used the generate command instead, Stata would have given
an error, since it is not sure whether we realize that the variable lgdp already
exists, and would get overwritten.
As another example, suppose we want to create a variable to store the
growth rate of GDP. We could then do
.generate growth = (gdp[_n]-gdp[_n-1])/gdp[_n-1]
To understand what we are doing, we need to realize that Stata generates
each observation of the new variable growth, one at a time. At any given
time during that process, Stata uses a temporary variable _n to store the
current observation number. [Note again, Stata is case sensitive. In
particular, _N is an entirely different variable, one which stores the total number
of observations. Further, the leading underscore is a common feature of a
lot of Stata's internal hidden variables.]
To access a specific observation, we need to enclose the observation number
in square brackets. Thus the nth observation of the variable growth
is generated by differencing gdp[_n] and gdp[_n-1] (its value in the last
period), and dividing by the latter.
As you might guess, the first observation of our new variable growth
should be missing. You should browse to satisfy yourself that this is
indeed the case.
Also use help functions to see the range of operations on offer.
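
To see generate, replace and the _n subscript working together, here is a small self-contained sketch. The three rows of year and gdp figures are invented purely for illustration and are typed in with the input command, which is not otherwise covered in this guide.

```stata
clear
input year gdp
2001 100
2002 110
2003 121
end

generate lgdp = ln(gdp)        // natural log
replace lgdp = log10(gdp)      // overwrite with log to the base 10

* growth rate relative to the previous observation;
* the first observation has no predecessor, so its growth is missing
generate growth = (gdp[_n] - gdp[_n-1]) / gdp[_n-1]
list

display _N                     // total number of observations
```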

9.2 Deleting Variables

We can use drop varlist to drop a specific list of variables, or use keep
varlist to drop all variables except those in the specified list.
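
A brief sketch of both commands, assuming the example dataset auto that ships with Stata:

```stata
sysuse auto, clear

drop headroom trunk        // removes these two variables
keep make price mpg        // removes everything except these three
describe
```

Since both commands only change the data in memory, the file on disk is unaffected unless you save afterwards.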

9.3 Macros

Macros come in two types, global and local. Global macros, once defined, are
available anywhere in Stata. Local macros exist solely within the program
or do-file in which they are defined. If that program or do-file calls another
program or do-file, the local macros previously defined temporarily cease to
exist, and their existence is reestablished when the calling program regains
control. When a program or do-file ends, its local macros are permanently
deleted.
For example,
to create a local containing a value, you could type:
.local x = 4
Locals can be numeric or text strings. For example, you could type
.local y = "Hello"

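The list of macros in memory at any time, and their values, can be obtained using
.macro list
To use the contents of a local, enclose its name between a left single
quote (`) and a right single quote ('). For example, the contents of the
two locals above can be displayed using
.disp `x' "`y'"
where the string local is additionally wrapped in double quotes.
To delete any local from memory (for example y), we can type
.macro drop _y
since Stata stores local macros internally with a leading underscore.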
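
The following sketch puts the macro commands together. It is meant to be run as a single do file: because locals disappear when the do file that defined them ends, running the lines one at a time from the do file editor would lose x and y between steps.

```stata
local x = 4
local y = "Hello"

display `x' + 1        // the macro is expanded before display runs
display "`y', world"   // string contents go inside double quotes

macro list             // locals appear with a leading underscore: _x, _y
macro drop _y          // removes the local y
macro list
```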

10 Using Stata as an Advanced Calculator

10.1 Simple Math

We can do simple math on the command line by using the display command. display will simply output the result of the computation for us. For
example, we can ask Stata to
.display 5 + ln((3-1)/2)
and Stata will just output 5.

10.2 Using Scalars

For more complex mathematics, however, we would ideally like to store
results in some variables. However, Stata's variables are entire vectors,
containing one value for each observation in the dataset. The purpose is
better served by using scalars.
To create a scalar containing the same result as above, you can type, for
example:
.scalar myfirst = 5 + ln((3-1)/2)
Scalars can be numeric or text strings. For example, you could type
.scalar hk = "Hemanshu Kumar"
The list of scalars in memory at any time, and their values, can be obtained using
.scalar list
To delete any scalar from memory (for example hk), we can type
.scalar drop hk
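
As a short sketch, the scalar commands above can be run in sequence as follows; the last display line is an added illustration of reusing a stored scalar in a later expression.

```stata
scalar myfirst = 5 + ln((3-1)/2)
scalar hk = "Hemanshu Kumar"

display myfirst
display hk
display myfirst * 2    // scalars can be used in later expressions

scalar list
scalar drop hk
```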
