Notes On The SAS Data Step and An Introduction To Simulation


W. John Braun
University of Western Ontario
Department of Statistical and Actuarial Sciences
Chapter 1
Introduction
1.1 Introduction to Data Analysis and Simulation
Given a set of data, one wishes to analyze it appropriately in order to make a decision or to
acquire some new insights into the population from which the data was extracted.
A data set is a collection of letters (characters) and/or numbers each representing infor-
mation in the form of measurements, counts or labels. The data sets we will consider in
this course will usually be in case-by-variable format which is a rectangular array (or ma-
trix) of data, where each row represents a set of measurements taken on a single subject or
case. Each column of the data set refers to a specific variable, such as age, gender or annual
income.
There are many different types of analysis that are possible. A few of them should be
familiar from an earlier course, such as simple regression analysis or ANOVA. Other kinds
of analyses will be introduced in this course. In all cases, the analysis of a data set involves
one or more of the following:
• checking for errors, missing values, etc. (data cleaning)
• graphical displays
• estimation
• prediction
• control
• measuring uncertainty
• statistical testing
• interpreting results
In order to be able to analyze a data set satisfactorily, a computer package is usually
necessary. Several are available, such as SPSS, Minitab, S-Plus and R. This course will focus
mainly on the use of SAS, and the goal of this set of notes on the SAS Data Step is to teach
you how to use SAS to simulate different kinds of data. Simulated data are generated by
the computer according to a pre-specified probability model, such as a normal distribution
or a t-distribution, or perhaps, something much more complicated. The way in which the
simulated data are generated is designed to make the data appear to be random, though in
fact, they are not truly random.
There are at least 2 reasons for learning how to simulate data: first, it gives you a way
of making up data for your own future exercises so that you can test out different SAS
analysis procedures, and you will be able to find out what kinds of data are appropriate
for a given procedure; second, knowing how to simulate a set of data is a step towards
understanding what kind of structure underlies the data or the mathematical model which is
being studied as an approximation to the real population. Thus, we will first be using SAS
to create artificial data of different types. Later on, we will learn how to use SAS procedures
to analyze real data; the artificial data can then be used for practice.
1.2 Introduction to SAS
You are about to be introduced to one of the most commonly used statistical packages: SAS
(Statistical Analysis System). Many companies use SAS, especially in the pharmaceutical
industry. Certain insurance companies and banks are also happy to have employees who can
use SAS to analyze data.
SAS is a software system for data analysis. SAS has been (and is continuing to be)
developed at the SAS Institute in Research Triangle Park at Cary, North Carolina. We will
be using SAS Version 9.3 in this course. SAS has been in development for over 30 years,
and it now has capabilities to perform hundreds of kinds of data analyses. A number of
extensions, such as IML, have also been developed which give SAS even more flexibility and
power.
In this course, we will only learn the basics. What you learn here will give you the ability
to self-learn the rest of the system as needed.
In these notes, we will begin our introduction to the SAS system by showing how to get it
started in the computing lab in WSC 256 and how to use the graphical user interface. Then,
the Data Step will be considered in some detail. Matters of input/output and flow control
will be discussed. The main application will be to the generation of random numbers and
the creation of artificial data. The very important issue of documentation for SAS programs
will be considered briefly.
1.3 Accessing SAS at Western
We will begin by learning how to run SAS jobs in the Windows environment. In practice,
SAS is often run on Unix platforms, in which case the procedures for running SAS jobs
differ from what will be described here, but the content of the SAS programs is almost
identical.
To invoke SAS in the lab (Room 256 WSC), begin by logging into the network using your
UWO id and password. Proceed through the following steps as illustrated in Figure 1.1:
1. Click on the Windows icon and choose All Programs.
2. Scroll down to the STATISTICS folder and click on it.
3. Click on the SAS folder and choose SAS 9.3.
You will then see the Program Editor window and the Log Window. You should see
something similar to what is shown in Figure 1.2.
The Program Editor is ready for you to type in a SAS program or to open an existing
program (using the File Menu).
Figure 1.1: Locating the SAS program on the lab's Windows system.
Figure 1.2: What should appear on the computer screen after invoking SAS 9.3.
1.4 Main Components of a SAS program
1. DATA step - for reading and manipulating data. Sometimes programming is done in
this step.
2. PROC step - for analyzing data. A SAS procedure is used to conduct the analysis on
data that is contained in a SAS dataset prepared during the DATA step. Thus, the
PROC step usually follows a DATA step.
Chapter 2
The Data Step
2.1 Some Definitions
1. Data Value - a single measurement. e.g. the height of a person (Joe).
2. Observation - a set of data values for the same individual. e.g. name, height, weight,
age and sex of Joe.
3. Variable - a set of data values for the same measurement. e.g. the heights of 10 different
people.
4. Data set - a collection of observations. We usually think of the observations as being
the rows of the data set, while the variables make up the columns of the data set.
2.1.1 Example
Consider the following data set which consists of 4 observations on 5 different
variables (NAME, HEIGHT, WEIGHT, AGE, SEX).
NAME HEIGHT WEIGHT AGE SEX
JOE 149 54 13 M
MARY 151 60 28 F
SUE 154 45 21 F
TOM 174 72 26 M
Here, we have 3 numeric variables (HEIGHT, WEIGHT, AGE) and 2 character variables
(NAME, SEX).
2.1.2 Exercise
Consider the following data set:
TEMPERATURE  PRESSURE  MINIMUM WIND SPEED  MAXIMUM WIND SPEED
32           101.5     21                  42
31           101.3     15                  28
30           101.8     7                   35
24           101.2     12                  23
21           100.8     4                   22
22           100.9     18                  27
1. How many variables are there?
2. How many observations on each variable?
The Data Step is the point in the SAS program at which one or more SAS data sets are
created. These data sets may be read in from external files or created from within the SAS
program itself. It should be noted that a single SAS program can consist of more than one
Data Step, though we shall find a single Data Step sufficient for present purposes.
The Data Step consists of a sequence of statements, each ending with a semi-colon. These
statements are primarily concerned with the construction of data sets and the management
of data.
2.2 Data
The first line of the Data Step consists of the Data statement. This statement indicates that
a data step is starting, and it tells SAS the name of the SAS data set which is being created.
Syntax:
DATA setname;
The data set name is a word which is somehow descriptive of the data set with which it
is associated. It must consist of at most 32 letters, digits and/or underscores. The first
character must be a letter or an underscore.
2.2.1 Examples
The following statement tells SAS that a SAS data set called WEATHER is going to
be created.
DATA WEATHER;
The following statement tells SAS that a SAS data set called GRADES98 is going to
be created.
DATA GRADES98;
Some programming applications do not involve a data set. The following statement
tells SAS to begin a data step without creating a data set.
DATA _NULL_;
This type of data statement frees up memory that would possibly be used unnec-
essarily. We will use it when doing simulations.
2.3 Numeric Assignment
The Assignment statement is used for creating new variables and modifying existing vari-
ables.
Syntax:
varname = value;
Naming Variables in SAS: A variable name must begin with a letter or an underscore and may
be 1 to 32 characters long (older versions of SAS allowed at most 8), e.g. NAME HEIGHT
WEIGHT AGE SEX. If we have two samples of heights, we could label the 2 height variables
HEIGHT1 and HEIGHT2. 1HEIGHT and 2HEIGHT are not valid variable names.
2.3.1 Example
TEMP = -21.7;
The above statement assigns the value -21.7 to the variable TEMP.
2.3.2 Example
We can create a SAS data set called WEATHER consisting of one observation on each
of 4 variables using the following sequence of assignment statements. Figure 2.1
shows what this should look like on your computer screen.
DATA WEATHER;
DATE = 22;
PRESSURE= 100.55;
WIND = 19;
TEMP = -21.7;
RUN;
QUIT;
When the program has run (as shown, for example, by pressing the Runner
button, in Figure 2.2), the resulting SAS data set is as follows:
WEATHER
DATE PRESSURE WIND TEMP
22 100.55 19 -21.7
Note that the data set is not actually visible in the output. In fact, no output is actually
available; clicking on the Output button at the bottom of the screen opens the Output
window, but nothing appears there, as indicated in the bottom panel of Figure 2.2.
Figure 2.1: Entering commands into the Editor window to assign data values to a number of variables.
Figure 2.2: To execute lines of SAS code, press the Runner button as shown in the top panel. What appears
on the screen after the lines of SAS code have been successfully executed: a record of what was done in the
log window. In this case, no errors were reported.
The problem is that we have simply created a SAS dataset which is held internally by
the program. In order to see it, we would need to explicitly ask for it somehow. Later, we
will see how to do this.
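As a minimal sketch of one way to ask for it, the Print Procedure (discussed in Section 2.4) can be run after the Data Step. The DATA= option is used here only to name the data set explicitly.

```sas
DATA WEATHER;
  DATE = 22;
  PRESSURE = 100.55;
  WIND = 19;
  TEMP = -21.7;
RUN;

/* Print the data set to the Output window; NOOBS suppresses the
   observation-number column. */
PROC PRINT DATA=WEATHER NOOBS;
RUN;
QUIT;
```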
A simpler way to read in data involves the Input and Datalines statements. The
following lines of code, upon execution, will produce the same SAS dataset as before.
DATA WEATHER;
INPUT DATE PRESSURE WIND TEMP;
DATALINES;
22 100.55 19 -21.7
;
RUN;
QUIT;
A major advantage of this approach is that it allows us to read in more than one
observation on the variables specified by the Input statement. This is accomplished by inserting
additional lines of data, noting that the data will be read into the resulting SAS dataset case
by case, where each case consists of observations on each of the Input variables.
2.3.3 Example
Create a SAS data set called GRADES98 containing the following data:
ID EXAM FINAL
3237332 58 61
4136229 71 68
2838823 43 49
2881266 62 58
The following lines of code will give the required SAS dataset.
DATA GRADES98;
INPUT ID EXAM FINAL;
DATALINES;
3237332 58 61
4136229 71 68
2838823 43 49
2881266 62 58
;
RUN;
QUIT;
2.4 INFILE and INPUT: Importing Data from an External File
Often, a data set has been entered into a text file, for example, from a spreadsheet or data
editor, or perhaps from another SAS program. The INFILE statement is used in the Data
Step to tell SAS where to find the data. Then, the INPUT statement specifies how to assign
the data values to specific variables in the newly created SAS dataset.
Syntax:
INFILE 'filename';
INPUT var1 var2 ... varn;
2.4.1 Example
Suppose the data set of the exercise in the previous section had been previously
entered into a file called weather.dat. We can produce a SAS data set called
WEATHER by executing the following program.
/* Example of reading data */
DATA WEATHER;
INFILE 'weather.dat';
INPUT TEMP PRESSURE MINWIND MAXWIND;
PROC PRINT NOOBS; /* This statement is NOT necessary, but it
allows one to see the contents of the SAS
data set in the Output window. */
RUN; /* This statement IS necessary. The program
will not run otherwise. */
QUIT;
The PROC PRINT statement invokes the Print Procedure which prints the SAS dataset
to the Output window. In this case, it consists of a single case on the four given variables.
It is pictured in Figure 2.3.
Figure 2.3: Output from the SAS Print Procedure. In this case, the single case of the SAS dataset WEATHER
has been printed to the Output window.
2.5 Comments and Documentation
It is often important to add documentation to any computer programs which you create.
Comment statements should be used to describe program contents. Proper documenta-
tion allows you or other users to read and understand your program more easily. This is
particularly useful if the program is to be updated later.
In SAS, there are two forms of comment statements:
1. /* comment */
e.g.
/* The variable RADIUS measures the
cross-sectional radius of each tree at a distance of 1 meter from
the ground. */
2. * comment;
e.g.
* The variable RADIUS measures the cross-sectional
radius of each tree at a distance of 1 meter from the
ground.;
A useful form of documentation includes a statement at the beginning of the program
consisting of the title of the program, the name of the programmer, and the date (dates
of later revisions are important as well). Sometimes variables are defined here. A brief
description of the purpose of the program is useful as well. In the body of the program, it is
often useful to explain any special commands used there.
2.5.1 Example
The following lines would make up a SAS file:
/* Descriptive Analysis of a Sample of Four Individuals
By P. Brooks
January 15, 2007
This program computes the mean and standard deviation for the
height, weight and age of a random sample of people.
Variables: HEIGHT = height in centimeters.
WEIGHT = weight in kilograms.
AGE = age in years. */
DATA SIZES;
INFILE 'sizes.dat';
INPUT HEIGHT AGE WEIGHT;
PROC MEANS MEAN STD;
* The extra arguments produce only the sample mean and
sample standard deviation for each variable;
RUN;
2.6 File and Put
The FILE statement is used to specify an external output file.
Syntax:
FILE 'filename';
The PUT statement causes SAS to print to the external file named in an earlier FILE
statement.
Syntax:
PUT varname1 varname2 ...;
2.6.1 Example
The following lines cause SAS to print the values 22, 100.55, 19, -21.7 to a file
called weather.txt.
DATA WEATHER;
FILE 'weather.txt';
INPUT DATE PRESSURE WIND TEMP;
PUT DATE PRESSURE WIND TEMP;
DATALINES;
22 100.55 19 -21.7
;
RUN;
QUIT;
Each occurrence of a Put statement causes the current value of the relevant variables to
be output to the file named in the File statement.
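When a Data Step contains no FILE statement, PUT writes to the SAS log instead, which is convenient for quick checks. The following is a minimal sketch; the variable X and the message text are made up for illustration.

```sas
DATA _NULL_;
  X = 3;
  PUT 'The value of X is ' X;   /* writes: The value of X is 3 */
  PUT X=;                       /* named output, writes: X=3   */
RUN;
```

Text in quotes is written exactly as typed, while a variable name is replaced by its current value.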
2.6.2 Example
DATA _NULL_;
FILE 'GRADES.08';
IF _N_=1 THEN PUT '2008 GRADES'; /* _N_ counts the observations
as they are input to the dataset */
LENGTH NAME $ 8; /* This Length statement ensures that the
variable NAME can contain values up to
8 characters in length. */
INPUT NAME $ GRADE; /* The $ tells SAS that NAME is a character
variable. */
PUT NAME GRADE;
DATALINES;
JOE 57.5
MARY 83
JENNIFER 64.5
;
RUN;
QUIT;
This produces a file called GRADES.08 containing the lines
2008 GRADES
JOE 57.5
MARY 83
JENNIFER 64.5
Note that the use of DATA _NULL_ results in no SAS dataset being created.
2.6.3 Exercises
1. Write out the contents of the file epa.dat produced by the following:
DATA _NULL_;
FILE 'epa.dat';
PUT 'SOME MILEAGE MEASUREMENTS';
LENGTH CAR $ 13;
CAR = 'BUICK CENTURY';
DISTANCE = 540;
FUEL = 40;
PUT CAR DISTANCE FUEL;
CAR = 'HONDA CRX';
DISTANCE = 720;
FUEL = 30;
PUT CAR DISTANCE FUEL;
RUN;
QUIT;
2. Check your answer by executing the above lines on a computer.
3. Was a SAS data set created? Check this by adding the line PROC PRINT NOOBS;
(then look in the Output window and the Log file for more information.)
4. Reorganize the program so that it uses the Datalines statement.
2.7 Arithmetic
SAS can be used as a calculator to perform simple arithmetic.
1. Addition:
varname = varname1 + varname2;
2. Subtraction:
varname = varname1 - varname2;
3. Multiplication:
varname = varname1 * varname2;
4. Division:
varname = varname1 / varname2;
5. Power (varname1 raised to the power varname2):
varname = varname1 ** varname2;
6. Modular arithmetic:
varname = MOD(varname1, varname2);
This computes the remainder resulting from division of varname1 by varname2 and
assigns this value to varname.
2.7.1 Example
DATA _NULL_;
/* some examples of arithmetic calculations */
FILE 'arith.out';
X = 15; Y = 6;
SUM = X + Y;
DIFF = X - Y; /* DIFF = DIFFERENCE */
PRODUCT = X * Y;
QUOTIENT = X/Y;
POWER = X ** Y;
REMAIND = MOD(X,Y); /* REMAIND = REMAINDER */
PUT X Y SUM DIFF PRODUCT;
PUT QUOTIENT POWER REMAIND;
RUN;
QUIT;
Execution of the above SAS program produces a file called arith.out which
contains the following lines:
15 6 21 9 90
2.5 11390625 3
2.7.2 Exercises
1. What are the contents of the file convert.tmp produced by the following
program?
DATA _NULL_;
FILE 'convert.tmp';
TEMPC = 20;
TEMPF = TEMPC*1.8 + 32;
PUT TEMPC 'degrees Celsius = ' TEMPF 'degrees Fahrenheit.';
RUN;
QUIT;
2. Suppose X = 45, Y = 32, and Z = 7. Find the value of the variable ANSWER
in each of the following:
(a) ANSWER = X - Y;
(b) ANSWER = Z ** Z;
(c) ANSWER = MOD(X,Y);
(d) ANSWER = MOD(Y,Z);
(e) ANSWER = MOD(X,Y)+ MOD(X,Z);
3. Using the fact that 1 mile = 1.6 kilometers, write a complete SAS program
which converts a distance of 26 miles into kilometer units, and which prints
the following into a file called convert.dst:
A distance of 26 miles
is the same as a distance of 41.6
kilometers.
The Floor Function
Syntax:
varname = FLOOR(varname1);
This statement assigns the greatest integer less than or equal to varname1 to the variable
varname. For example, the greatest integer less than or equal to 27.34 is 27, and the greatest
integer less than or equal to -16.4 is -17.
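The behaviour for positive, negative and whole-number arguments can be checked with a short sketch that writes to the SAS log. The variable names A, B and C are arbitrary.

```sas
DATA _NULL_;
  A = FLOOR(27.34);   /* rounds down to 27                       */
  B = FLOOR(-16.4);   /* rounds down (away from zero) to -17     */
  C = FLOOR(5);       /* a whole number is returned unchanged    */
  PUT A B C;          /* writes: 27 -17 5                        */
RUN;
```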
2.7.3 Example
DATA _NULL_;
X = 47.39;
Y = FLOOR(X);
The value of Y is 47.
2.7.4 Exercises
1. Write out the contents of the file arith.dat produced by
DATA _NULL_;
FILE 'arith.dat';
X = -42.49;
Y = FLOOR(X);
PUT X Y;
RUN;
QUIT;
2. Modify the above program to compute the greatest integer less than or equal to
(a) 0.47.
(b) -0.47.
(c) W, where W = 32X, and X = 0.217.
Chapter 3
If: Controlling Flow of Operations
The IF statement is very important in database management. It is used to control the flow
of operations which are applied to variables depending on the values of relevant variables.
In other words, if a certain variable takes on a certain value, a certain operation might be
performed; otherwise, the operation is not performed or a different operation is performed
in its place.
Syntax:
IF (condition) THEN (SAS statement);
ELSE (SAS statement);
SAS evaluates the condition to determine whether it is true or false. If the condition is true,
SAS proceeds to carry out the SAS statement. The ELSE statement is optional. It provides
an alternative action if the condition is false.
Possible conditions to test are
varname GE constant, varname LE constant
varname < constant, varname > constant
varname = constant, varname NE constant
Testing the first condition above amounts to testing whether the variable with name
varname is greater than or equal to the specified constant (another variable name could be
used here as well). The second condition listed concerns less than or equal, and the last
condition involves testing for inequality.
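As a minimal sketch of a condition in use, the following lines test each of three made-up grades against a pass mark of 50 and write the result to the SAS log. The variable name GRADE and the pass mark are assumptions chosen for illustration.

```sas
DATA _NULL_;
  INPUT GRADE;
  IF GRADE GE 50 THEN PUT GRADE 'pass';
  ELSE PUT GRADE 'fail';
  DATALINES;
49
50
73
;
RUN;
/* The log shows:
   49 fail
   50 pass
   73 pass   */
```

Because GE means "greater than or equal to", a grade of exactly 50 passes; with the condition GRADE > 50 it would not.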
3.0.5 Example Coding
The variable SEX can take values 'M' and 'F'. It is sometimes more convenient
to code this variable numerically using 1 for males and 0 for females. The IF
statement can be used to do this as follows:
IF SEX = 'M' THEN SEXCODE = 1;
ELSE SEXCODE = 0;
In other words, if the variable SEX takes the value 'M', then the new variable
SEXCODE takes the value 1. Otherwise, SEXCODE takes the value 0.
3.0.6 Example Outlier Detection
Suppose X is a variable whose mean is MU and standard deviation is SIGMA. We may
decide that the value of X is to be considered outlying if it is more than 3 standard
deviations from MU. The following SAS lines determine if the value of X is outlying.
The variable OUTLIER is assigned the value 1 if X is an outlier, and it is assigned
the value 0 if X is not an outlier.
OUTLIER = 0;
Z = (X - MU)/SIGMA;
IF Z > 3 THEN OUTLIER = 1;
ELSE IF Z < -3 THEN OUTLIER = 1;
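The two tests above can also be combined into one using the absolute value function ABS. The sketch below is an equivalent variant rather than a replacement; the values MU = 50 and SIGMA = 10 and the three data values are made up for illustration.

```sas
DATA _NULL_;
  MU = 50; SIGMA = 10;
  INPUT X;
  Z = (X - MU)/SIGMA;
  /* |Z| > 3 covers both Z > 3 and Z < -3 */
  IF ABS(Z) > 3 THEN OUTLIER = 1;
  ELSE OUTLIER = 0;
  PUT X Z OUTLIER;
  DATALINES;
55
85
12
;
RUN;
/* The log shows:
   55 0.5 0
   85 3.5 1
   12 -3.8 1   */
```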
3.0.7 Exercises
1. Execute the following program and view the contents of the file demog.dat.
DATA DEMOGRAP;
FILE 'demog.dat';
INPUT SEX $;
IF SEX = 'M' THEN SEXCODE = 1;
ELSE SEXCODE = 0;
PUT SEXCODE;
DATALINES;
M
F
M
M
F
;
RUN;
QUIT;
2. The following data has been recorded over a period of 5 hours at a switch:
0,1,1,1,0. The switch is off when the value of the above variable (called
testcode) is 0, and on when the value is 1.
Write a SAS program which assigns the value 'on' to the variable test when
the testcode value is 1 and 'off' when testcode is 0.
3. A random variable X has mean 14 and variance 49. Write a SAS program
which determines which of the following values of X are outliers: 15, 23, -8,
31, 17. The results should be output to a file called outliers.ex.
Chapter 4
DOing things repeatedly
The DO statement is often useful for simulation. It is also sometimes useful in other kinds
of data preparation and analysis.
4.1 Simple DO
The simple DO statement (which is usually used in association with an IF statement) tells
SAS to execute a set of SAS statements. This set of statements is usually referred to as a
DO group.
Syntax:
DO;
SAS statements
END;
4.1.1 Example
DATA _NULL_;
FILE 'do.eg';
INPUT X Y;
IF X > Y THEN DO;
Z1 = X+Y;
Z2 = X-Y;
END;
ELSE DO;
Z1 = X-Y;
Z2 = X+Y;
END;
PUT X Y Z1 Z2;
DATALINES;
3 4
5 4
;
RUN;
QUIT;
Executing the above program results in a file called do.eg which contains the
following:
3 4 -1 7
5 4 9 1
4.2 Iterative DO
The iterative DO statement tells SAS to perform a computation several times.
Syntax:
DO varname = constant1 TO constant2 BY constant3;
SAS statements
END;
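A minimal sketch of this syntax, writing to the SAS log rather than to a file. The BY clause may be omitted, in which case the increment is 1, and a negative increment counts downwards. The index names I and J are arbitrary.

```sas
DATA _NULL_;
  DO I = 1 TO 3;          /* increment defaults to 1         */
    PUT I;                /* writes 1, 2, 3 on separate lines */
  END;
  DO J = 10 TO 0 BY -5;   /* negative increment               */
    PUT J;                /* writes 10, 5, 0                  */
  END;
RUN;
```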
4.2.1 Example
Suppose we wish to add up all the numbers from 1 to 100. The following SAS
program does this for us:
DATA _NULL_;
NUMSUM = 0; /* NUMSUM is the variable which will
ultimately contain the sum we are
interested in.*/
DO INDEX = 1 TO 100;
NUMSUM = NUMSUM + INDEX; /* At each iteration of the DO group,
the current value of INDEX is added to
the current value of NUMSUM. */
END;
FILE 'sum.100';
PUT NUMSUM;
RUN;
QUIT;
The file sum.100 will then contain the value 5050, which is the sum of the first 100
integers.
4.2.2 Example
Suppose we wish to add up all the even numbers between 1 and 101. The following
SAS program does this for us:
DATA _NULL_;
NUMSUM = 0;
DO INDEX = 2 TO 100 BY 2;
NUMSUM = NUMSUM + INDEX;
END;
FILE 'even.sum';
PUT NUMSUM;
RUN;
QUIT;
The file even.sum will then contain the value 2550, which is the sum of the first
50 even numbers.
4.2.3 Exercises
1. Write a SAS program which calculates the sum of all multiples of 3 between 1
and 121. Ans. 2460
2. Modify the above program so that it calculates the sum of all integers from 51
through 100. Ans. 3775
3. Modify the above program so that it calculates the sum of all squares from 1
to 100.
4. Modify the above program so that it calculates the sum of square roots of even
numbers between 1 and 101.
5. Modify the above program so that it calculates 20! (the product of all integers
between 1 and 20).
4.3 DO While (optional)
In order to use the iterative DO, one needs to know the number of times the computation is
to be performed. Often, this number is not known beforehand. Instead, one might require
that the computation is performed while a particular condition is satisfied.
Syntax:
DO WHILE (condition);
SAS statements
END;
The SAS statements in the DO group are executed as long as the condition is found to
be true. The condition is tested once before the beginning of each loop. The first time that
the condition is found to be false, the DO group statements are no longer executed and SAS
moves on beyond the END; statement.
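As a small sketch of this behaviour, the following lines double a number until it is no longer below 1000 and count the doublings. The variable names X and N are arbitrary, and the result is written to the SAS log.

```sas
DATA _NULL_;
  X = 1;
  N = 0;
  DO WHILE (X < 1000);   /* tested before every pass       */
    X = 2*X;
    N = N + 1;
  END;
  PUT N X;               /* writes: 10 1024                */
RUN;
```

The loop stops after the tenth pass because 1024 is the first power of 2 that fails the test X < 1000. Note that X has already passed the limit by the time the loop ends.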
4.3.1 Example
Suppose we want to determine the largest value of n so that
\[ \sum_{i=1}^{n} i^2 < 10000. \]
One approach to this problem is to successively add terms to the sum, while the
sum is less than 10000, and to stop accumulating as soon as the sum reaches or exceeds this
amount. The following statements accomplish this:
DATA _NULL_;
NUMSUM = 0;
INDEX=0;
DO WHILE (NUMSUM < 10000);
INDEX=INDEX+1;
NUMSUM = NUMSUM + INDEX**2;
END;
INDEX=INDEX-1;
FILE 'sum.out';
PUT INDEX;
RUN;
QUIT;
The final value of INDEX is the solution n. This single number should be contained
in the file sum.out after executing the above lines of code.
4.3.2 Exercises
1. Write a SAS program which finds the largest n satisfying
\[ \sum_{i=1}^{n} i^3 < 20000. \]
2. Write a SAS program which finds the largest n satisfying n! < 100000.
3. Write a SAS program which finds the smallest n satisfying n! > 100000.
Chapter 5
Simulation
5.1 Generation of Pseudorandom Numbers
We begin our discussion of simulation with a brief exploration of the mechanics of pseudo-
random number generation. Pseudorandom numbers are useful in simulation studies.
We will briefly describe a common method for simulating independent uniform random
variables on the interval [0,1]. A multiplicative congruential random number generator
produces a sequence of pseudorandom numbers, $u_0, u_1, u_2, \ldots$, which are approximately
independent uniform random variables on the interval [0,1]. We now describe how to construct
such a generator.
Let m be a large integer, and let b be another integer which is smaller than m. b is often
somewhere around the square root of m. To begin, an integer $x_0$ is chosen between 1 and
m. $x_0$ is called the seed. It is best chosen in some non-systematic manner.
Once the seed has been chosen, the generator proceeds as follows:
\[ x_1 = b x_0 \pmod{m} \]
\[ u_1 = x_1/m. \]
$u_1$ is the first pseudorandom number. Dividing by m ensures that the number lies between
0 and 1. Note that it takes some value between 0 and 1. If m and b are chosen properly, it
is difficult to predict the value of $u_1$, given the value of $x_0$ only. The second pseudorandom
number is then obtained in the same manner:
\[ x_2 = b x_1 \pmod{m} \]
\[ u_2 = x_2/m. \]
$u_2$ is another pseudorandom number, which is approximately independent of $u_1$. The method
continues using the following formulas:
\[ x_n = b x_{n-1} \pmod{m} \]
\[ u_n = x_n/m. \]
This method produces numbers which are in reality non-random, but if done properly,
the numbers appear to be random (i.e. unpredictable).
Different values of b and m give rise to pseudorandom number generators of varying
quality. If they are not chosen with some care, then the generator will produce numbers that
do not appear to be random. A number of statistical tests have been developed for assessing
the quality of a pseudorandom number generator.
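To see what a poor choice looks like, consider the following sketch, which uses deliberately tiny values (b = 3, m = 16, seed 1) chosen only for illustration. The output is written to the SAS log.

```sas
/* A deliberately bad multiplicative congruential generator */
DATA _NULL_;
  B = 3;
  M = 16;
  X = 1;                 /* seed */
  DO I = 1 TO 8;
    X = MOD(B*X, M);
    U = X/M;
    PUT I X U;
  END;
RUN;
/* The log shows:
   1 3 0.1875
   2 9 0.5625
   3 11 0.6875
   4 1 0.0625
   5 3 0.1875
   6 9 0.5625
   7 11 0.6875
   8 1 0.0625   */
```

The sequence returns to the seed after only 4 steps and then repeats, so only 4 distinct values are ever produced. A usable generator needs m and b chosen so that the cycle is very long.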
5.1.1 Example
The following lines of SAS create a file called RANDOM.DAT which contains 5
pseudorandom numbers based on the multiplicative congruential generator:
\[ x_n = 171 x_{n-1} \pmod{30269} \]
\[ u_n = x_n/30269 \]
with initial seed $x_0 = 23121$.
/* Rudimentary Pseudorandom Number Generator */
DATA _NULL_;
FILE 'RANDOM.DAT';
B = 171;
M = 30269;
SEED = 23121;
X = SEED;
DO I = 1 TO 5;
X = MOD(B*X, M);
U = X/M;
PUT X U;
END;
RUN;
QUIT;
The results which are stored in the file RANDOM.DAT are as follows. The first column
consists of the integers $x_1, x_2, \ldots, x_5$. The second column consists of numbers
ranging between 0 and 1. These are the uniform pseudorandom numbers, $u_1, u_2, \ldots, u_5$
(shown here rounded to 5 decimal places; the file itself contains more digits).
18721 0.61849
23046 0.76137
5896 0.19479
9339 0.30853
22981 0.75923
A related operation is used internally by SAS to produce pseudorandom numbers auto-
matically with the function UNIFORM.
5.1.2 Example
The following lines of SAS create a file called RANDOM.DAT which contains 50
uniform pseudorandom numbers based on the SAS generator UNIFORM with initial seed
$x_0 = 27218$.
/* Example demonstrating use of SAS RNG with fixed seed. */
DATA _NULL_;
SEED = 27218;
FILE 'RANDOM.DAT';
DO I = 1 TO 50;
U = UNIFORM(SEED);
PUT U;
END;
RUN;
QUIT;
It is often of interest to look at the distribution of a set of pseudorandom numbers.
For the numbers generated in the previous example, we would proceed as follows:
DATA RANDOM;
INFILE 'RANDOM.DAT';
INPUT U;
PROC CHART;
VBAR U;
RUN;
QUIT;
The bars of the histogram should all be roughly the same height, if the numbers
are really uniformly distributed.
5.1.3 Exercises
1. Generate 200 random numbers using the generator from the first example with
an initial seed of 2018.
2. Write a program (or modify the second program in the second example) which
produces a histogram of the numbers produced in the previous exercise.
3. Generate 200 random numbers using the SAS UNIFORM generator from example
2 with an initial seed of 2018. Produce a histogram of this simulated data.
4. Modify the generator of the first example so that it produces 200 random
numbers from the generator
\[ x_n = 172 x_{n-1} \pmod{30307} \]
with initial seed $x_0 = 17218$.
5. Generate 1000 pseudorandom numbers using the SAS function UNIFORM, and
store them in a file called UNIF.DAT.
6. Modify the above program to simulate the random variable Y = 1/(U +
1) where U is a uniform random variable on the interval [0,1]. Specifically,
generate 1000 values of this random variable and put them in a file called
RANDOM.DAT.
Also, plot the histogram of the random numbers $y_1, \ldots, y_{1000}$. Since Y is no
longer a uniform random variable, the histogram will not be flat any longer;
what is the shape of the distribution?
7. Write a program which generates 100 independent observations on a uniformly
distributed random variable on the interval [0, 100]. Estimate the mean, vari-
ance and standard deviation of such a uniform random variable.
8. Use the FLOOR function together with UNIFORM to simulate 100 random in-
tegers between 0 and 99.
5.2 Simulation of Bernoulli Trials
A Bernoulli trial is an experiment in which there are 2 possible outcomes. For example, a
light bulb may work or it may not work; these are the only possibilities. For another example,
consider a student who guesses on a multiple choice test question which has 5 options; the
student may guess correctly with probability 0.2 and incorrectly with probability 0.8.
Suppose we would like to know how well such a student would do on a multiple choice
test consisting of 100 questions. We can get an idea by using simulation:
Each question corresponds to an independent Bernoulli trial with probability of success
equal to 0.2. We can simulate the correctness of the student for each question by generating
an independent uniform random number. If this number is less than .2, we say that the
student guessed correctly; otherwise, we say that the student guessed incorrectly.
This will work because the probability that a uniform random variable is less than .2 is
exactly .2, while the probability that a uniform random variable exceeds .2 is exactly .8,
which is the same as the probability that the student guesses incorrectly. Thus, the uniform
random number generator is simulating the student. The SAS version of this is as follows:
DATA _NULL_;
SEED = 12883;
FILE 'STUDENT.ANS';
PUT 'CORRECT U';
DO QUESTION = 1 TO 100;
U = UNIFORM(SEED);
IF U < .2 THEN CORRECT = 1;
ELSE CORRECT = 0;
PUT CORRECT U;
END;
RUN;
QUIT;
The first column of the file STUDENT.ANS contains the results of the student's guesses. A 1
is recorded each time the student correctly guesses the answer, while a 0 is recorded each
time the student is wrong. The second column records the value of the variable U; note
that whenever its value is less than .2, the value of CORRECT is 1, and when U takes a value
exceeding .2, the value of CORRECT is 0.
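For comparison with a simulated test, the expected score can be worked out directly, assuming as above that the 100 guesses are independent with success probability 0.2. Here S is a label introduced only for this calculation; it denotes the total number of correct guesses.

```latex
\begin{align*}
E[S] &= np = 100 \times 0.2 = 20, \\
\operatorname{Var}(S) &= np(1-p) = 100 \times 0.2 \times 0.8 = 16, \\
\operatorname{SD}(S) &= \sqrt{16} = 4.
\end{align*}
```

Adding up the first column of STUDENT.ANS should therefore usually give a score within a few multiples of 4 of 20; a total far outside that range would suggest an error in the program.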
5.2.1 Exercises
1. Write a SAS program which simulates a student guessing at a True-False test
consisting of 40 questions.
2. Write a SAS program which simulates 500 light bulbs, each of which has
probability .99 of working.
3. Write a SAS program which simulates a binomial random variable Y with
parameters n = 25 and p = .4. (Y is the sum of 25 independent Bernoulli
random variables with p = .4.)
Now, modify the program so that it generates 100 of these binomial random
variables and writes them to a file called binom.dat. In order to do this,
you will need to nest one DO group inside another.
Write another program which reads the data from binom.dat into a SAS
data set and produces a histogram. Estimate the mean and variance using
PROC MEANS. Compare these estimates with their theoretical counterparts.
Recall that the theoretical mean of a binomial random variable is np and
the theoretical variance is np(1 − p).
5.3 The Logistic Model
In many biostatistical applications, interest centers on a dose-response relationship. For
example, what dosage of a carcinogenic substance will produce cancer in a given percentage
of a population? One would expect that higher dosages of carcinogen will yield higher rates
of cancer. A first attempt at modelling this kind of relationship might be
$$p = \beta_0 + \beta_1 x$$
where p is the proportion of the population that would acquire cancer at dosage x;
$\beta_0$ and $\beta_1$ are constants. This model is linear, and will almost have the
correct behaviour if $\beta_1$ is positive. However, it will give values of p outside the
interval [0, 1] if x is too large or too small.
The logistic model is often used as an alternative to handle this kind of problem. It
is based on the logit transformation which maps values in (0, 1) to $(-\infty, \infty)$. The logit
transformation is given by $\eta(p) = \log(p/(1 - p))$. Its inverse is given by the logistic function
$p(\eta) = \exp(\eta)/(1 + \exp(\eta))$.
We can then model the dose-response relationship with
$$\eta(p) = \beta_0 + \beta_1 x$$
where $\beta_0$ and $\beta_1$ are constants. This model says that when the dosage is x, the proportion
of the population acquiring cancer will be p, where
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$
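As a rough illustration of the logistic formula, the following sketch evaluates p at a few dosages and checks that applying the logit to p recovers the linear predictor. The constants B0 and B1, the dosage grid, and the variable names ETA and CHECK are illustrative assumptions; since no FILE statement is used, the PUT output goes to the SAS log.

```sas
/* Sketch: evaluate the logistic function at a few dosages. */
DATA _NULL_;
   B0 = -1.5;  B1 = 0.7;             /* assumed illustrative constants */
   DO X = 0 TO 4;
      ETA = B0 + B1*X;               /* linear predictor */
      P = EXP(ETA)/(1 + EXP(ETA));   /* logistic function: always in (0,1) */
      CHECK = LOG(P/(1 - P));        /* logit of P should reproduce ETA */
      PUT X= ETA= P= CHECK=;
   END;
RUN;
```

With a positive B1, the printed values of P increase with X but never leave the interval (0, 1), which is the behaviour the linear model could not guarantee.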
Example
Write SAS code to simulate the responses of 20 subjects who have been exposed to
varying amounts of carcinogen under the logistic model assumption with $\beta_0 = -1.5$
and $\beta_1 = 0.7$. Assume that the dosages are given by x = 0.1, 0.2, . . . , 2.0. Output
should be printed to a file called doseresponsesim.txt.
DATA _NULL_;
SEED = 81818; B0 = -1.5; B1 = 0.7;
FILE 'doseresponsesim.txt';
PUT 'Response Dosage';
DO X = 0.1 TO 2.0 BY 0.1;
U = UNIFORM(SEED);
TMP = EXP(B0 + B1*X);
P = TMP/(1+TMP);
IF U < P THEN CANCER = 1;
ELSE CANCER = 0;
PUT CANCER X;
END;
RUN;
QUIT;
Upon running the code, it should be clear that as x increases, the incidence of
cancer increases (i.e. the incidence of 1's in the first column of simulated data
increases).
Exercises
1. Run the code for the logistic model given in the above example. Then change the slope
parameter $\beta_1$ to −0.7. How does this affect the pattern in the response?
2. Modify the code given in the example so that dosages are given by 1.5, 1.7, 1.9, . . . , 3.5.
3. Modify the example code so that the output enters a SAS dataset called DOSERESP.
Next, use the PLOT procedure to plot CANCER against X. Experiment with various
values of $\beta_0$ and $\beta_1$ in order to see how these values affect the pattern of response.
5.4 Binomial Random Numbers
The RANBIN function can be used to automatically generate binomial random numbers.
Syntax:
Y = RANBIN(seed,n,p);
The seed is any positive integer, while n and p are the binomial parameters. The function
assigns a random binomial realization to the variable Y.
5.4.1 Example
Suppose 12% of a large population has recently been infected by a virus whose
incubation period is 2 weeks long, but whose presence can be detected by a blood
test. Suppose random testing for the virus is conducted, and 15 individuals are
tested each hour. Simulate the number of positive test results for each hour over
a 24-hour period. Record the simulated numbers of positive test results in a file
called viruscounts.txt.
Since 15 individuals are tested each hour and each individual has a 0.12 probability
of being infected, independent of the state of the other individuals, the number
of positive test results in one hour is a binomial random variable with n = 15
and p = 0.12. To simulate the numbers of positive test results for each hour in a
24-hour period, we need to generate 24 binomial random numbers:
/* Simulation of infected individuals */
DATA _NULL_;
SEED = 3728;
N = 15;
P = .12;
FILE 'viruscounts.txt';
PUT 'HOUR NUMBER OF INFECTED';
DO HOUR = 1 TO 24;
INFECTED = RANBIN(SEED,N,P);
PUT HOUR INFECTED;
END;
RUN;
QUIT;
5.4.2 Exercises
1. Generate 1000 binomial variates with n = 18 and p = .75 using RANBIN. Then use
PROC MEANS to estimate the average and variance. Compare with the theoretical mean
and variance. Repeat for binomial variates with n = 50 and p = .4.
2. Generate 50 binomial variates $B_1, B_2, \ldots, B_{50}$, having n = 20 and where p satisfies
$$\eta(p) = \log(p/(1 - p)) = 2.0 + 0.5x$$
where x = 0.1, 0.2, 0.3, . . . , 5.0. Use the PLOT procedure to plot B against x and note
the pattern of plotted points.
3. Refer to the previous question. Calculate the expected value of $B_i$, for i = 1, 2, . . . , 50.
Plot these expected values against x.
5.5 Poisson Random Numbers
We can generate Poisson random numbers using SAS with the RANPOI function. It is similar
to the RANBIN function, but there is only one parameter instead of two.
Syntax:
Y = RANPOI(seed, lambda);
In this case, lambda is the mean of the Poisson random variable.
5.5.1 Example
Suppose traffic accidents occur at an intersection with a mean of 3.7 per year.
Simulate the annual number of accidents for a 10-year period, assuming that the
numbers occurring from year to year are independent.
/* Example of Poisson variate generation -- Simulation of Traffic
Accidents */
DATA _NULL_;
SEED = 497765;
LAMBDA = 3.7;
FILE 'ACCIDENT.DAT';
PUT 'YEAR NUMBER OF ACCIDENTS';
DO YEAR = 1 TO 10;
ACCIDENT = RANPOI(SEED, LAMBDA);
PUT YEAR ACCIDENT;
END;
RUN;
QUIT;
5.5.2 Exercises
1. Modify the above program to simulate the number of accidents per year for
15 years, when the average rate is 2.8 accidents per year.
2. Simulate the number of surface defects in the finish of a sports car for 20 cars,
where the mean is 1.2 defects per car.
3. Estimate the mean and variance of a Poisson random variable whose mean
rate is 7.2 by simulating 1000 such variates and using PROC MEANS. Compare
with the theoretical values, recalling that the variance and mean are equal for
Poisson random variables.
4. A commonly used model is the Poisson regression model
$$\log(\lambda) = \beta_0 + \beta_1 x$$
where $\beta_0$ and $\beta_1$ are constants. Take $\beta_0 = 3$ and $\beta_1 = 0.5$, and suppose
x = 0.1, 0.2, 0.3, . . . , 4.0. Calculate the corresponding values of $\lambda$. (Store these
values in a SAS variable called lambda.)
5. Refer to the previous question. Simulate Poisson random variates which have
these $\lambda$ values as their means. Plot the Poisson variates against the corresponding values of x.
5.6 Exponential Random Numbers
The exponential distribution can be used as a simple model for the time until a component
fails, or until a light bulb burns out.
A random variable T has an exponential distribution with mean $\lambda$ if
$$P(T \le t) = 1 - e^{-t/\lambda}$$
for any non-negative t. The mean or expected value of T is $\lambda$ and the variance of T is
$\lambda^2$.
The simplest way to simulate exponential random variables is to generate a uniform
random variable U on [0,1], and set
$$1 - e^{-T/\lambda} = U.$$
Solving this for T, we have
$$T = -\lambda \log(1 - U).$$
It can be shown that T defined in this way has an exponential distribution with mean $\lambda$. The
SAS function RANEXP can be used to generate random exponential variates with mean 1.
Syntax:
T = RANEXP(seed);
This produces an exponential variate T having mean 1. To change the mean to lambda, we
must use
T = lambda * RANEXP(seed);
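The two routes to an exponential variate described in this section can be placed side by side. The following is a minimal sketch, not a definitive implementation: the seed, the value of LAMBDA and the variable names T1 and T2 are arbitrary assumptions, and the output is written to the SAS log. T1 and T2 are different random draws, so they will not agree value by value; both simply follow an exponential distribution with mean LAMBDA.

```sas
/* Sketch: exponential variates by inverse transform and by RANEXP. */
DATA _NULL_;
   SEED = 2468;                     /* arbitrary seed (assumption) */
   LAMBDA = 2.5;                    /* desired mean (assumption) */
   DO I = 1 TO 5;
      U = UNIFORM(SEED);            /* uniform random number */
      T1 = -LAMBDA*LOG(1 - U);      /* inverse transform: solve 1 - exp(-T/LAMBDA) = U */
      T2 = LAMBDA*RANEXP(SEED);     /* built-in mean-1 generator, rescaled */
      PUT I= T1= T2=;
   END;
RUN;
```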
5.6.1 Example
/* SIMULATION OF N EXPONENTIAL LAMBDA RANDOM VARIATES */
DATA _NULL_;
SEED = 12238;
LAMBDA = 2.5;
N = 10;
FILE 'EXPO.RVS';
DO I = 1 TO N;
T = RANEXP(SEED)*LAMBDA;
PUT T;
END;
RUN;
QUIT;
5.6.2 Exercises
1. Suppose that a certain type of battery has a lifetime which is exponentially
distributed with mean 55 hours. Simulate 1000 such lifetimes to estimate the
mean and variance of the lifetime for this type of battery. Compare with the
theoretical mean and variance.
2. The central limit theorem says that the sample mean for a random sample
of size n from a population with mean $\mu$ and variance $\sigma^2$ is approximately
normally distributed with mean $\mu$ and variance $\sigma^2/n$, where the approximation
improves as n increases.
The following programs provide a demonstration for the case where the
underlying population is exponentially distributed:
/* PROGRAM 1: Computation of averages of samples of size N coming
from exponential lambda populations */
DATA _NULL_;
SEED = 12238;
LAMBDA = 2.5;
NSAMPLES = 1000; /* We are going to simulate NSAMPLES
independent samples of size N, computing the average
in each case. */
N = 10;
FILE 'EXPO.AVG';
DO NSAMPLE = 1 TO NSAMPLES;
TSUM = 0;
DO I = 1 TO N;
T = RANEXP(SEED)*LAMBDA;
TSUM = TSUM + T; /* Accumulating the sample
values to form a sum */
END;
TAVG = TSUM/N; /* TAVG = average of the current
sample. */
PUT TAVG; /* Storing sample averages for
use in next program where they will be
plotted as a histogram. */
END;
RUN;
QUIT;
/* PROGRAM 2: Histogram of averages to demonstrate CLT */
DATA EXPO_AVG;
INFILE 'EXPO.AVG';
INPUT TAVG;
PROC CHART;
VBAR TAVG;
PROC MEANS MEAN VAR;
VAR TAVG;
/* We've included this procedure to compare
the mean and variance of the averages with what is
expected by the theory */
RUN;
QUIT;
Run the above programs for N = 3, 6, 10, 20, 30, 40. Note how the histogram
begins to resemble the familiar bell-shaped curve as N increases. How large
would you say N should be in order for the normal approximation to be con-
sidered accurate, when the underlying population is exponential?
5.7 Normal Random Numbers
Standard normal random variables can be generated using the RANNOR function in SAS.
Syntax:
Z = RANNOR(seed);
This produces a value of a normal random variable Z which has mean 0 and variance 1.
Recall that if X has mean $\mu$ and variance $\sigma^2$, then
$$X = \mu + \sigma Z$$
where Z has mean 0 and variance 1. Therefore, to simulate a random variable X having
mean mu and standard deviation sigma, use
X = mu + sigma*RANNOR(seed);
5.7.1 Example
Use simulation to estimate P(Z < 1.25) where Z is a standard normal random
variable.
Idea: Simulate a large number (say, 1000) of standard normal random variates and
compute the proportion that lie below 1.25.
DATA _NULL_;
FILE 'NORMAL.PRB';
SEED = 19218;
N = 1000;
VALUE = 1.25;
COUNT = 0;
DO I = 1 TO N;
Z = RANNOR(SEED);
IF Z < VALUE THEN COUNT = COUNT + 1;
END;
PROBEST = COUNT/N;
PUT 'AN EMPIRICAL ESTIMATE OF P(Z < ' VALUE ') IS ' PROBEST;
RUN;
QUIT;
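To judge how good the simulated estimate is, it can be compared with the exact probability. The short sketch below uses the PROBNORM function (described in the reference chapter) and writes to the SAS log; the variable name EXACT is an assumption made for illustration.

```sas
/* Sketch: exact value of P(Z < 1.25) for comparison with PROBEST. */
DATA _NULL_;
   EXACT = PROBNORM(1.25);   /* standard normal distribution function */
   PUT 'EXACT VALUE OF P(Z < 1.25) IS ' EXACT;
RUN;
```

The exact value is roughly 0.894. With N = 1000 simulated variates the standard error of the estimate is about 0.01, so an empirical estimate within a couple of hundredths of this value is what one should expect.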
5.7.2 Exercises
1. Simulate 100 normal random variates having mean 51 and standard deviation
5.2. Compute the average and standard deviation of your simulated sample
and compare with the theoretical values.
2. Simulate 1000 standard normal random variates Z, and use your simulated
sample to estimate
(a) P(Z > 2.5).
(b) P(0 < Z < 1.645).
(c) P(1.2 < Z < 1.45).
(d) P(1.2 < Z < 1.3).
Compare with the theoretical values (i.e. consult a normal table).
3. Using the fact that a $\chi^2$ random variable on 1 degree of freedom has the same
distribution as the square of a standard normal random variable, simulate 100
independent values of such a $\chi^2$ random variable, and estimate its mean and
variance. (Compare with the theoretical values: 1, 2.)
4. A $\chi^2$ random variable on n degrees of freedom has the same distribution as
the sum of the squares of n independent standard normal random variables. Simulate a
$\chi^2$ random variable on 8 degrees of freedom, and estimate its mean and variance.
(Compare with the theoretical values: 8, 16.)
5. A commonly used model is the simple regression model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where $\beta_0$ and $\beta_1$ are constants. $\varepsilon$ is a normal random variable with mean 0 and
variance $\sigma^2$. Take $\beta_0 = 3$ and $\beta_1 = 0.5$, and suppose x = 0.1, 0.2, 0.3, . . . , 4.0.
(a) Simulate 40 independent normal variates $\varepsilon$, supposing $\sigma = 0.4$. (Store
these values in a SAS variable called epsilon.)
(b) Simulate the corresponding values of y. (Store these values in a SAS
variable called y.)
(c) Plot the normal variates against the corresponding values of x. Note the
pattern on the plot.
6. Re-do the previous question using $\sigma = 1.5$.
7. Repeat, using $\beta_0 = 5$ and $\beta_1 = 2$.
Chapter 6
REFERENCE: Other Data Step
Functions
A SAS DATASET
X1 X2 X3 X4
-1 3 2 2.3
0.1 4 -1 2.1
0.5 -1 -7 2.4
1.9 -1.7 -4 1.9
- used in some of the examples below.
6.1 Arithmetic Functions
ABS(X) - returns the absolute value of X: |X|.
EXAMPLE: Y=ABS(X1); (Y = 1 0.1 0.5 1.9).
MAX(X1,X2,...,XN) - returns the largest value among the values of the arguments.
EXAMPLE: Y=MAX(X1,X2,X3,X4); (Y = 3 4 2.4 1.9).
MIN(X1,X2,...,XN) - returns the smallest value among the values of the arguments.
EXAMPLE: Y=MIN(X1,X2,X3,X4); (Y = -1 -1 -7 -4).
MOD(N1,N2) - returns the remainder when N1 is divided by N2. The result has the same sign as N1.
EXAMPLE: Y=MOD(X1,X2); (Y = -1 0.1 0.5 0.2).
SIGN(X) - returns the sign of X, or 0, if X is 0.
EXAMPLE: Y=SIGN(X1); (Y= -1 1 1 1)
SQRT(X) - returns the square root of X: $\sqrt{X}$. When X is negative, it returns a missing
value (.).
EXAMPLE: Y=SQRT(X1); (Y = . 0.31622 0.70710 1.37840).
6.2 Truncation Functions
CEIL(X) - returns the smallest integer greater than or equal to X.
FLOOR(X) - returns the largest integer less than or equal to X.
INT(X) - returns the same value as FLOOR(X), if X is positive, and returns the same
value as CEIL(X), if X is negative.
ROUND(X,Z) - returns the value of X rounded to the nearest unit of Z.
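The difference between these functions shows up most clearly for negative arguments. The following sketch, which writes to the SAS log, applies all four to 2.7 and -2.7; the test values and the variable names C, F, I, R and R2 are illustrative assumptions.

```sas
/* Sketch: compare the truncation functions on a positive and a negative value. */
DATA _NULL_;
   DO X = 2.7, -2.7;
      C = CEIL(X);        /* rounds up, towards plus infinity */
      F = FLOOR(X);       /* rounds down, towards minus infinity */
      I = INT(X);         /* drops the fractional part, towards zero */
      R = ROUND(X, 1);    /* nearest multiple of 1 */
      PUT X= C= F= I= R=;
   END;
   R2 = ROUND(2.74, 0.1); /* nearest multiple of 0.1 */
   PUT R2=;
RUN;
```

For X = 2.7 the four functions give 3, 2, 2 and 3; for X = -2.7 they give -2, -3, -2 and -3. The last statement rounds 2.74 to the nearest tenth, giving 2.7.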
6.3 Special Mathematical Functions
EXP(X): $e^X$.
GAMMA(X): the complete gamma function, $\int_0^\infty t^{X-1} e^{-t}\,dt$.
LOG(X): the natural logarithm of X.
LOG2(X): the logarithm to the base 2 of X.
LOG10(X): the logarithm to the base 10 of X.
6.4 Trigonometric and Hyperbolic Functions
ARCOS(X): inverse cosine of X.
ARSIN(X): inverse sine of X.
ATAN(X): inverse tangent of X.
COS(X): cosine of X.
COSH(X): hyperbolic cosine of X.
SIN(X): sine of X.
SINH(X): hyperbolic sine of X.
TAN(X): tangent of X.
TANH(X): hyperbolic tangent of X.
6.5 Statistical functions
CSS(X1,X2,...,XN): the corrected sum of squares $\sum_{i=1}^{N} X_i^2 - N\bar{X}^2$
CV(X1,X2,...,XN): the coefficient of variation - the standard deviation of $X_1, \ldots, X_N$
divided by the mean of $X_1, \ldots, X_N$. (SAS reports this ratio as a percentage, i.e. multiplied by 100.)
MEAN(X1,...,XN): $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$
EXAMPLE: Y = MEAN(X1,X2,X3,X4); (Y = 1.575 1.3 -1.275 -0.475).
N(X1,...,XN): number of nonmissing arguments.
EXAMPLE: Y=N(.,4.1,.,3.7,5.7); (Y = 3).
NMISS(X1,...,XN): number of missing values.
EXAMPLE: Y=NMISS(.,4.1,.,3.7,5.7); (Y = 2).
RANGE(X1,...,XN): maximum minus the minimum.
EXAMPLE: Y=RANGE(X1,X2,X3,X4); (Y = 4 5 9.4 5.9).
STD(X1,...,XN): standard deviation.
STDERR(X1,...,XN): standard error (standard deviation divided by $\sqrt{N}$).
SUM(X1,...,XN): $\sum_{i=1}^{N} X_i$
USS(X1,...,XN): uncorrected sum of squares $\sum_{i=1}^{N} X_i^2$
VAR(X1,...,XN): variance
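These functions operate across the arguments within a single observation, not down a column of the data set. A minimal sketch, assuming the four-row data set shown at the start of this chapter is typed in with DATALINES (the variable names AVG, RNG, SS, CSSQ and CHECK are illustrative), is:

```sas
/* Sketch: row-wise statistical functions applied to the sample data set. */
DATA _NULL_;
   INPUT X1 X2 X3 X4;
   AVG   = MEAN(X1, X2, X3, X4);    /* average of the four values in this row */
   RNG   = RANGE(X1, X2, X3, X4);   /* maximum minus minimum */
   SS    = USS(X1, X2, X3, X4);     /* uncorrected sum of squares */
   CSSQ  = CSS(X1, X2, X3, X4);     /* corrected sum of squares */
   CHECK = SS - 4*AVG**2;           /* should agree with CSSQ up to rounding */
   PUT AVG= RNG= CSSQ= CHECK=;
   DATALINES;
-1 3 2 2.3
0.1 4 -1 2.1
0.5 -1 -7 2.4
1.9 -1.7 -4 1.9
;
RUN;
```

One line is written to the SAS log for each observation. The AVG and RNG values should reproduce the MEAN and RANGE examples listed above, and CHECK illustrates the formula given for CSS.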
6.6 Probability functions
The following functions can be used to determine various probabilities. The syntax is similar
to that used for the random number generator functions.
GAMINV(P,eta): returns the value of x such that
$$P = \frac{\int_0^x t^{\eta - 1} e^{-t}\,dt}{\Gamma(\eta)}$$
($0 \le P < 1$, and $\eta > 0$).
POISSON(lambda,N): returns the probability that an observation from a Poisson
distribution is less than or equal to N. $\lambda$ is the mean parameter.
i.e. POISSON(lambda,N) = $\sum_{j=0}^{N} e^{-\lambda}\lambda^j / j!$
PROBBNML(p,n,m): returns the probability that an observation from a binomial
distribution with parameters p and n is less than or equal to m.
i.e. PROBBNML(p,n,m) = $\sum_{j=0}^{m} \binom{n}{j} p^j (1 - p)^{n-j}$.
PROBCHI(x,nu): returns the probability that a random variable with a chi-square
distribution on $\nu$ degrees of freedom falls below x.
PROBF(x,ndf,ddf): returns the probability that a random variable with an F distribu-
tion on ndf numerator degrees of freedom and ddf denominator degrees of freedom falls
below x.
PROBGAM(x,eta): returns the probability that a random variable with a gamma
distribution with shape parameter $\eta$ falls below x.
i.e. PROBGAM(x,eta) = $\frac{\int_0^x t^{\eta - 1} e^{-t}\,dt}{\Gamma(\eta)}$.
PROBIT(x): returns the inverse of the standard normal cumulative distribution function.
i.e. If X is a standard normal random variable, then x is the probability that X will
take on a value less than PROBIT(x).
PROBNORM(x): returns the probability that a standard normal random variable will fall
below x.
PROBT(x,nu): returns the probability that a random variable with Student's t
distribution on $\nu$ degrees of freedom will fall below x.
TINV(p,nu): returns the pth quantile (0 < p < 1) of the Student's t distribution on $\nu$ degrees of
freedom.
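Because the discrete functions return cumulative probabilities, a point probability or an upper-tail probability has to be obtained by subtraction. The following sketch illustrates this with PROBBNML; the parameter values and the variable names ATMOST2, EXACTLY2 and MORETHAN2 are assumptions chosen for illustration, and the output goes to the SAS log.

```sas
/* Sketch: cumulative, point and upper-tail binomial probabilities. */
DATA _NULL_;
   P = .12;  N = 15;                         /* illustrative binomial parameters */
   ATMOST2   = PROBBNML(P, N, 2);            /* P(Y <= 2) */
   EXACTLY2  = ATMOST2 - PROBBNML(P, N, 1);  /* P(Y = 2) = P(Y <= 2) - P(Y <= 1) */
   MORETHAN2 = 1 - ATMOST2;                  /* P(Y > 2) */
   PUT ATMOST2= EXACTLY2= MORETHAN2=;
RUN;
```

The same subtraction idea applies to the POISSON function.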
6.6.1 Example
Find the probability that a random variable with a t distribution on 8 degrees of freedom is
less than 1.4.
i.e. P(T < 1.4) =? where T is t-distributed on 8 d.f. The following program writes the
correct probability into the file PROB.T.
DATA _NULL_;
FILE 'PROB.T';
PROB = PROBT(1.4, 8);
PUT PROB;
RUN;
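PROBT and TINV are inverses of one another, which gives a simple way to check a calculation like the one above. This is a small sketch under that assumption; the variable name BACK is hypothetical and the output is written to the SAS log.

```sas
/* Sketch: TINV undoes PROBT for the same degrees of freedom. */
DATA _NULL_;
   PROB = PROBT(1.4, 8);    /* P(T < 1.4) on 8 d.f. */
   BACK = TINV(PROB, 8);    /* quantile for that probability; should be very close to 1.4 */
   PUT PROB= BACK=;
RUN;
```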
6.6.2 Exercises
1. Compute the probability that a Poisson random variable with mean rate 11.4
takes on values less than
(a) 1.
(b) 2.
(c) 5.
(d) 11.
(e) 15.
(f) 21.
2. Repeat the previous question for a binomial random variable with p = .45 and
n = 24.
3. The time that it takes a bus to arrive at the next stop is normally distributed
with mean 10.4 minutes and standard deviation 1.2. Compute the probabilities
that the bus will arrive in less than
(a) 5 minutes.
(b) 8 minutes.
(c) 10.5 minutes.
(d) 12.5 minutes.
(e) 13.1 minutes.
(f) 15.2 minutes.
