
What is logistic regression?

logistic regression: a type of regression used when the dependent variable is binary or ordinal.

A lot of statistics is concerned with predicting the value of a continuous variable: blood pressure, intelligence, oxygen levels, wealth and so on. This kind of statistics dominates undergraduate courses and social science analysis. But what do you do if your dependent variable is binary?

What if, for example, you're running a medical study where you want to predict whether someone will live or die in a particular treatment regime? In this case, your dependent variable, survival, can only have two values. It isn't continuous, it's binary. In the past, an accepted way around this was to simply use standard linear regression and treat the binary dependent variable as if it were continuous. If the two values were coded as 0 and 1, then any predicted value of .5 or above would be treated as a 1, and anything below .5 would be treated as a 0. However, this approach is no longer considered acceptable, as it has several problems.

What's wrong with regressing against binary dependent variables?

The first problem is apparent to even a casual observer: the predicted values have no meaning. If your dependent variable can only be zero or one (such as alive or dead), then a predicted value of 3 would be treated as "alive". But what else does it tell you? It isn't a probability or likelihood, or even a percentage. It has no real-world interpretation. Equally meaningless is comparing different predicted values to each other.

An even more serious problem is that such an analysis violates many assumptions of linear regression. For example, the assumption of homoscedasticity won't hold. Homoscedasticity means that the variance around the dependent variable is similar for all values of the independent variable. The variance of a binary variable is PQ, where P is the probability of a 1 and Q = 1 − P is the probability of a 0. (Another assumption that is violated is that the residuals Y − Y′ are normally distributed.)
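To make the violation concrete, here is the standard worked fact behind that claim: for a binary variable Y with P(Y = 1) = p, the variance is

Var(Y) = p(1 − p)

Since the fitted p changes with the independent variables, the residual variance must change with them too. At p = .5 the variance is .25, but at p = .9 it is only .09, so constant variance (homoscedasticity) is impossible.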

A better solution is to use either discriminant analysis (DA) or logistic regression.

Advantages and disadvantages of logistic regression

Logistic regression has several advantages over discriminant analysis:

* It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group

* It does not assume a linear relationship between the IV and DV

* It may handle nonlinear effects

* You can add explicit interaction and power terms

* The DV need not be normally distributed.

* There is no homogeneity of variance assumption.

* Normally distributed error terms are not assumed.

* It does not require that the independents be interval.

* It does not require that the independents be unbounded.

With all this flexibility, you might wonder why anyone would ever use discriminant analysis or any other method of analysis. Unfortunately, the advantages of logistic regression come at a cost: it requires much more data to achieve stable, meaningful results. With standard regression and DA, 20 data points per predictor is typically considered the lower bound. For logistic regression, at least 50 data points per predictor are necessary to achieve stable results.

How is logistic regression done?

We've already talked about why you can't regress against the binary variable (0-1 values), so that's out. What about the probability of a 1? This would be a number between 0 and 1. In fact, I have seen research papers where the authors have done just that. However, it's not good practice. The probability is continuous, but it's still bounded between 0 and 1: values of 1.1 or −3 make no sense. What's needed is a continuous, unbounded dependent variable.

The dependent variable in a logistic regression is the log of the odds. Or in mathspeak,

ln(p/(1-p))

This is known as the logit. This is the dependent variable against which independent variables are regressed.
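A quick check that the logit meets the requirements above: for p = .5, ln(.5/.5) = ln(1) = 0; for p = .9, ln(.9/.1) = ln(9) ≈ 2.2; for p = .1, ln(.1/.9) ≈ −2.2. As p approaches 0 or 1, the logit runs off to −∞ or +∞, so it is continuous and unbounded: exactly the kind of dependent variable we said we needed.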

Interpreting the results of a logistic regression

At first glance, logistic regression results look familiar, especially to someone who knows standard regression: there is a regression equation, complete with coefficients for all the variables. However, these coefficients predict the logit, not the dependent variable itself! To find the probability of a 1 given all your predictors, you have to convert the predicted logit back to a predicted probability. Let's say that your regression equation is

Logit = a + bX1 + cX2

Formula for converting logit to probabilities

If your regression equation is Logit = a + bX1 + cX2 etc., then the first step is to calculate the logit using that formula. The logit is then converted into a probability using this formula:

p = exp(Logit) / (1 + exp(Logit))

This number gives you the probability of a 1, given the current configuration of all the predictors. For example, it might give the probability of survival given various lifestyle factors, or the probability of contracting a disease.
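As a small sketch of this back-conversion in SAS (the coefficients and predictor values below are invented purely for illustration, not taken from any real model):

data predprob;
   a = -1.5; b = 0.8; c = 0.2;          /* hypothetical coefficients */
   x1 = 1;   x2 = 2.5;                  /* hypothetical predictor values */
   logit = a + b*x1 + c*x2;             /* predicted logit: -0.2 here */
   p = exp(logit) / (1 + exp(logit));   /* predicted probability: about 0.45 */
   put logit= p=;
run;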

Effect size

Logistic regression looks a lot like standard regression, so people who are familiar with regression ask, "what's the R value?"

In standard regression, R (or R squared in particular) gives you an idea of how powerful your equation is at predicting the variable of interest. An R close to 1 is a very strong prediction, whereas a small R, closer to zero, indicates a weak relationship.

There is no direct equivalent of R for logistic regression.

However, to keep people happy who insist on an R value, statisticians have come up with several R-like measures for logistic regression. They are not R itself; R has no meaning in logistic regression.

Some of the better known ones are:

Cox and Snell's R-Square

Pseudo-R-Square

Hagle and Mitchell's Pseudo-R-Square

Uses

Logistic regression is perfect for situations where you are trying to predict whether something "happens" or not. A patient survives a treatment, a person contracts a disease, a student passes a course. These are binary outcome measures. It is particularly useful where the dataset is very large, and the predictor variables do not behave in orderly ways, or obey the assumptions required of discriminant analysis.

Key Concepts About Logistic Regression

Logistic regression is used to assess the likelihood of a disease or health condition as a function of a risk factor (and covariates). Both simple and multiple logistic regression assess the association between independent variable(s) (Xi), sometimes called exposure or predictor variables, and a dichotomous dependent variable (Y), sometimes called the outcome or response variable. Logistic regression analysis tells you how much an increment in a given exposure variable affects the odds of the outcome.

Simple logistic regression is used to explore associations between one (dichotomous) outcome and one (continuous, ordinal, or categorical) exposure variable. Simple logistic regression lets you answer questions like, "how does gender affect the probability of having hypertension?"

Multiple logistic regression is used to explore associations between one (dichotomous) outcome variable and two or more exposure variables (which may be continuous, ordinal or categorical).

The purpose of multiple logistic regression is to let you isolate the relationship between the exposure variable and the outcome variable from the effects of one or more other variables (called covariates or confounders).

Multiple logistic regression lets you answer the question, "how does gender affect the probability of having hypertension, after accounting for (or unconfounded by, or independent of) age, income, etc.?" This process, accounting for covariates or confounders, is also called adjustment.

Comparing the results of simple and multiple logistic regression can help to answer the question, "how much did the covariates in the model alter the relationship between exposure and outcome (i.e., how much confounding was there)?"

Research Question

In this module, you will assess the association between gender (the exposure variable) and the likelihood of having hypertension (the outcome). You will look at both simple logistic regression and then multiple logistic regression. The multiple logistic regression will include the covariates of age, cholesterol, body mass index (BMI) and fasting triglycerides. This analysis will answer the question, what is the effect of gender on the likelihood of having hypertension – after controlling for age, cholesterol, BMI, and fasting triglycerides?

Dependent Variable and Independent Variables

As noted, the dependent variable Yi for a logistic regression is dichotomous, which means that it can take on one of two possible values. NHANES includes many questions where people must answer either "yes" or "no", questions like "has the doctor ever told you that you have congestive heart failure?". Or, you can create dichotomous variables by setting a threshold (e.g., "diabetes" = fasting blood sugar > 126) or by combining information from several variables. In this module, you will create a dichotomous variable called "hyper" based on two variables: measured blood pressure and use of blood pressure medications. In SUDAAN, SAS Survey, and Stata, the dependent variable is coded as 1 (for having the outcome) and 0 (for not having the outcome). In this example, people with hypertensive measured blood pressure or reported use of blood pressure medication would have a hyper value of 1, while people with neither condition would have a value of 0.
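A minimal sketch of how such a variable could be built in a SAS DATA step follows. The data set and variable names (nhanes, bpsys, bpmeds) and the 140 mmHg cutoff are placeholders for illustration, not the module's actual names or definition:

data analysis;
   set nhanes;                                        /* source data set (assumed name) */
   if bpsys = . and bpmeds = . then hyper = .;        /* undetermined stays missing */
   else if bpsys >= 140 or bpmeds = 1 then hyper = 1; /* has the outcome */
   else hyper = 0;                                    /* has neither condition */
run;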

The independent variables Xj can be dichotomous (e.g., gender, "high cholesterol"), ordinal (e.g., age groups, BMI categories), or continuous (e.g., fasting triglycerides).

Logit Function

Since you are trying to find associations between risk factors and a condition, you need a formula that will allow you to link these variables. The logit function that you use in logistic regression is also known as the link function because it connects, or links, the values of the independent variables to the probability of occurrence of the event defined by the dependent variable.

Logit Model

The logit model is

log(pi / (1 − pi)) = β0 + β1X1i + β2X2i + … + βkXki

In the logit formula above, E(Yi) = pi means that the expected value of Yi equals the probability that Yi = 1. In this case, 'log' indicates the natural log.


Output of Logistic Regression

The statistics of primary interest in logistic regression are the beta coefficients (β), their standard errors, and their p-values. Like other statistics, the standard errors are used to calculate confidence intervals around the beta coefficients.

The interpretation of the beta coefficients for different types of independent variables is as follows:

If Xj is a dichotomous variable with values of 1 or 0, then the coefficient β represents the log of the odds ratio for the event, comparing a person with Xj = 1 to a person with Xj = 0. In a multivariate model, this coefficient is the independent effect of variable Xj on Yi after adjusting for all other covariates in the model.

If Xj is a continuous variable, then e^β represents the odds ratio for the event, comparing a person with Xj = m + 1 to an individual with Xj = m. In other words, for every one-unit increase in Xj, the odds of having the event Yi change by a factor of e^β, adjusting for all other covariates in a multivariate model.
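As a worked example with a hypothetical coefficient: if the estimated beta for a dichotomous exposure is 0.693, the odds ratio is e^0.693 ≈ 2.0, meaning the odds of the outcome are about twice as high for a person with Xj = 1 as for a person with Xj = 0.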

A summary table about interpretation of beta coefficients is provided below:

Table: What does the coefficient mean?

Continuous variables (e.g., height, weight, LDL):

* In simple logistic regression: the change in the log odds of the dependent variable per 1-unit change in the independent variable.

* In multiple logistic regression: the change in the log odds of the dependent variable per 1-unit change in the independent variable, after controlling for the confounding effects of the covariates in the model.

Categorical (also known as discrete) variables (e.g., sex, with two subgroups, men and women; this example uses men as the reference group):

* In simple logistic regression: the difference in the log odds of the dependent variable for one value of the categorical variable vs. the reference group (for example, between women and the reference group, men).

* In multiple logistic regression: the same difference, after controlling for the confounding effects of the covariates in the model.

It is easy to transform the coefficients into a more interpretable format, the odds ratio, as follows:

odds ratio = e^β (that is, exponentiate the beta coefficient)

Odds and odds ratios are not the same as risk and relative risks.


Odds and probability are two different ways to express the likelihood of an outcome.

Here are their definitions and some examples.

Table of differences between odds and probability

Odds: (# of times something happens) / (# of times it does NOT happen)

* Getting heads in one flip of a coin: 1/1 = 1 (or 1:1)

* Getting a 1 in a single roll of a die: 1/5 = 0.2 (or 1:5)

Probability: (# of times something happens) / (# of times it could happen)

* Getting heads in one flip of a coin: 1/2 = .5 (or 50%)

* Getting a 1 in a single roll of a die: 1/6 ≈ .17 (or about 17%)

Few people think in terms of odds. Many people equate odds with probability and thus equate odds ratios with risk ratios. When the outcome of interest is uncommon (i.e. it occurs less than 10% of the time), such confusion makes little difference, since odds ratios and risk ratios are approximately equal. When the outcome is more common, however, the odds ratio increasingly overstates the risk ratio. So, to avoid confusion, when event rates are high, odds ratios should be converted to risk ratios. (Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians’ referrals for cardiac catheterization. N Engl J Med 1999;341:279–83) There are simple methods of conversion for both crude and adjusted data. (Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280:1690-1691. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989-991)
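For crude data, the correction described by Zhang and Yu is RR = OR / ((1 − P0) + (P0 × OR)), where P0 is the outcome risk in the reference (unexposed) group. Here is a minimal SAS sketch of that arithmetic; the odds ratio and baseline risk are invented for illustration:

data or_to_rr;
   oddsratio = 2.5;   /* odds ratio from the logistic model (illustrative) */
   p0 = 0.30;         /* outcome risk in the unexposed group (illustrative) */
   rr = oddsratio / ((1 - p0) + (p0 * oddsratio)); /* approximate risk ratio: about 1.72 */
   put oddsratio= p0= rr=;
run;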

The following formulas demonstrate how you can go between probability and odds:

odds = p / (1 − p)

p = odds / (1 + odds)

Key Concepts About Setting Up a Logistic Regression in NHANES

Univariate analyses examine the relationship between the outcome variable and one other variable, while multivariate analyses examine the relationship between the outcome and two or more variables. Simple logistic regression is used for univariate analyses, while multiple logistic regression is used for multivariate analyses. To run univariate and multivariate logistic regression in SAS-callable SUDAAN, SAS, and Stata, you will need to provide three things: the correct weight, the appropriate procedure, and a model statement.

Determine the appropriate weight for the data used

It is always important to check all the variables in the model, and use the weight of the smallest common denominator. In the example of univariate analysis, the 4-year MEC weight is used, because the hypertension variable is from the MEC examination. In the multivariate analysis example, the 4-year MEC morning subsample weight is used, because the fasting triglycerides variable is from the morning fasting subsample from the lab component, which is the smallest common denominator for all variables in the model.

Examples

Simple logistic regressions for gender, age, cholesterol, and BMI:

Because these analyses use 4 years of data and include variables that come from both the household interview and the MEC (e.g., blood pressure, BMI, HDL cholesterol), the MEC 4-year weight (wtmec4yr) is the right one.

Simple logistic regression for fasting triglyceride:

Because this analysis uses 4 years of data and fasting triglycerides were only measured on the morning subsample, the MEC morning fasting subsample 4-year weight (wtsaf4yr) is the right one.

Multiple logistic regression:

Because this analysis uses 4 years of data and includes variables from the household interview, the MEC, and the morning subsample of the MEC, the weight for the smallest group, the morning fasting subsample 4-year weight (wtsaf4yr), is the right one.

See the Weighting module for more information on weighting and combining weights.

Determine the appropriate procedure

You can run logistic regression with stand-alone SUDAAN, SAS-callable SUDAAN, SAS Survey procedure, or Stata Survey commands. However, note that each version of SUDAAN, SAS-callable SUDAAN, and SAS Survey procedures has its own unique commands for executing logistic regression analysis. You need to use the correct command for the software that you are using. Please also note that different versions of SAS and SUDAAN use slightly different statements to specify categorical variables and reference groups. Make sure that you are using the correct commands for the version of software on your computer.

If you use:

* the stand-alone version of SUDAAN, the procedure is logistic

* SAS-callable SUDAAN, the procedure is rlogist

* SAS Survey procedures, the procedure is surveylogistic

Be sure you are using the correct procedure name because SAS also has a procedure logistic, which is used with simple random samples and not complex datasets like NHANES. Using logistic in SAS will yield different results from stand-alone SUDAAN.

Provide a model statement

Remember that when you run logistic regression analyses, you must provide a model statement to specify the dependent variable and independent variable(s), and you can have only one model statement each time you run a logistic regression analysis.
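To make this concrete, here is a minimal sketch of what the model statement might look like in the SAS Survey procedure. The design variables (sdmvstra, sdmvpsu) and covariate names are assumptions based on typical NHANES naming, not code taken from this module:

proc surveylogistic data=analysis;
   strata  sdmvstra;                        /* survey design strata (assumed name) */
   cluster sdmvpsu;                         /* survey design PSUs (assumed name) */
   weight  wtsaf4yr;                        /* morning fasting subsample 4-year weight */
   class   riagendr (ref='1') / param=ref;  /* gender, men as the reference group */
   model   hyper (event='1') = riagendr ridageyr lbxtc bmxbmi lbxtr;  /* the one model statement */
run;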

Do Loops

Do-loops are used to repeat a sequence of operations. Here is an example.

Each row of data in the file iris.dat contains 12 measurements: measurements of four characteristics for each of the 3 species of iris. The 4 measurements are to be read into the 4 variables m1-m4 and an additional variable species will contain a number designating the type of iris from which the measurements were taken, say, 1 for iris setosa, 2 for versicolor, 3 for virginica.

data iris;
   infile 'iris.dat';
   do species = 1 to 3;
      input m1-m4 @;   /* read the next four measurements, holding the line */
      output;          /* write one observation for this species */
   end;
run;

The do loop reads in turn the 4 measurements for each type in a given row of the data set and defines the variable species appropriately.

Note that the do loop MUST BE paired with an end statement to mark the end of the loop.

The output statement is used to write the current values of the variables to the data set being created. If the output statement above were omitted, only the last group of measurements would appear in the SAS data set.

The counter for a do-loop can take values specified by a comma delimited list. Examples:

do sex='m','f'; ... end; do parity=1,3,5; ... end;

A counterless do-loop can be used to have more complicated processing after an if-branch. Example:

data old; input age sex income; if age<30 then do; if income<20000 then cat=1; if income>=20000 and income<30000 then cat=2; end; if age>=30 then do; ....

______________________________________________________________________________

1. Making a copy of a SAS data set

Manipulating data is possible in the initial DATA step, when you read in or access data for the first time, or in subsequent DATA steps, when you modify or replace an existing data set. Assume that data set MYDATA exists and you wish to create new variables, drop variables, subset the data set, or perform some other manipulation of it. This requires a new DATA step in which you have to make the information stored in MYDATA available to SAS. The easiest way to do this is with the SET command. There are two forms:

data mydata; set mydata; <more statements> run;

and

data newdata; set mydata; <more statements> run;

The SET command adds the contents of a data set to a DATA step. In the first example the data set being created (MYDATA) and the data set being added are the same. This works, since SAS does not assign a name to the data set until the DATA step completes successfully; in the interim, a dummy data set name is used. The statement data mydata is interpreted as "create a new data set; if the DATA step executes without errors, call it mydata". Prior to completion of the DATA step, the data set mydata exists in its unmodified, original form. This construction, using the name of an existing data set in the DATA and SET commands, is a convenient way to modify an existing data set. If you wish to modify mydata but store the results in a new data set, use the syntax in the second example, where the names of the data sets in the DATA and SET commands differ. Upon completion of the DATA step a new data set named newdata is created; mydata remains unchanged.

2. Creating new variables

2.1. Transformations

SAS' power makes it unnecessary to perform most data manipulations outside of The SAS System. Creation of new variables, calculations with existing variables, subsetting of data, sorting, etc. are best done within SAS. Transformations in the narrow sense use the built-in mathematical functions available in The SAS System. An example DATA step would be:

data survey; set survey;
   x     = ranuni(123);   /* a Uniform(0,1) random number   */
   lny   = log(inc);      /* the natural logarithm (base e) */
   logy  = log10(inc);    /* the log to base 10             */
   rooty = sqrt(inc);     /* the square root                */
   expy  = exp(inc/10);   /* the exponential function       */
   cos   = cos(x);        /* the cosine function            */
   sin   = sin(x);        /* the sine function              */
   tan   = tan(x);        /* the tangent function           */
   cosz  = cos(z);        /* z is not defined yet           */
   z     = ranuni(0);
run;

This DATA step calculates various transformations of the income variable. Watch out for the subtle difference between the log() and log10() functions. In mathematics, the natural logarithm (logarithm with base e) is usually abbreviated ln, while log is reserved for the logarithm with base 10. SAS does not have a ln() function; the natural log is calculated by the log() function. The RANUNI() function generates a random number from a Uniform(0,1) distribution. The number in parentheses is the seed of the random number generator. If you set the seed to a non-zero number, the same random numbers are generated every time you run the program. The expressions (function calls) on the right-hand side of the '=' sign must involve existing variables: either variables already in the data set being SET, or variables created previously. For example, the statement tan = tan(x); functions properly, since x has been defined prior to the call to the tan() function. The variable cosz, however, will contain missing values only, since its argument, the variable z, is not defined prior to use. Here is a printout of the data set survey after this DATA step.

OBS  ID  SEX  AGE  INC  R1  R2  R3     X        LNY      LOGY     ROOTY
 1    1   F    35   17   7   2   2  0.75040  2.83321  1.23045  4.12311
 2   17   M    50   14   5   5   3  0.17839  2.63906  1.14613  3.74166
 3   33   F    45    6   7   2   7  0.35712  1.79176  0.77815  2.44949
 4   49   M    24   14   7   5   7  0.78644  2.63906  1.14613  3.74166
 5   65   F    52    9   4   7   7  0.12467  2.19722  0.95424  3.00000
 6   81   M    44   11   7   7   7  0.77618  2.39790  1.04139  3.31662
 7    2   F    34   17   6   5   3  0.96750  2.83321  1.23045  4.12311
 8   18   M    40   14   7   5   2  0.71393  2.63906  1.14613  3.74166
 9   34   F    47    6   6   5   6  0.53125  1.79176  0.77815  2.44949
10   50   M    35   17   5   7   5  0.14208  2.83321  1.23045  4.12311

OBS    EXPY     COS      SIN      TAN     COSZ     Z
 1   5.47395  0.73142  0.68193  0.93234    .    0.32091
 2   4.05520  0.98413  0.17745  0.18031    .    0.90603
 3   1.82212  0.93691  0.34957  0.37312    .    0.22111
 4   4.05520  0.70637  0.70784  1.00208    .    0.39808
 5   2.45960  0.99224  0.12434  0.12532    .    0.18769
 6   3.00417  0.71359  0.70056  0.98174    .    0.43607
 7   5.47395  0.56736  0.82347  1.45141    .    0.26370
 8   4.05520  0.75579  0.65481  0.86639    .    0.55486
 9   1.82212  0.86217  0.50661  0.58760    .    0.86134
10   5.47395  0.98992  0.14160  0.14304    .    0.86042

For more information about mathematical, trigonometric, and other functions, see the Help Files (go to Help > Extended Help, then select SAS System Help, SAS Language, SAS Functions, Function Categories).

2.2. Operators

The most important operators are listed in the following table. The smaller the group number, the higher the precedence of the operator. For example, in the expression y = 3 * x + 4; multiplication is carried out before addition, since the group number of the multiplication operator is less than that of the addition operator.

Group  Type of Operator  DATA Step Operator  Description                                   Example
  0                      ( )                 expression in parentheses is evaluated first  y = 3*(x+1);
  1    Math              **                  raises argument to a power                    y = x**2;
  2    Math              +, -                indicates a positive or negative number       y = -x;
  3    Math              *                   multiplication                                y = x * z;
  4    Math              +                   addition                                      y = x + 3;
       Math              -                   subtraction                                   z = y - 3*x;
  5    String            ||                  string concatenation                          name = firstname || lastname;
  6    Set               in                  whether value is contained in a set           y = x in (1,2,3,4);
                                                                                           if gender in ('F','M');
  7    Logical           =, eq               equals                                        if x = 12;
       Logical           <>, ne              does not equal                                if x ne 5;
       Logical           >, gt               greater than                                  if sin(x) > 0.4;
       Logical           <, lt               less than                                     if cos(x) < sin(z);
       Logical           >=, ge              greater than or equal
       Logical           <=, le              less than or equal
  8    Logical           and                 logical and                                   if (a=b) and (sin(x)>0.3);
       Logical           or                  logical or                                    if (a=b) or (sin(x) < 0.3);
       Logical           not                 logical not                                   if not (a=b);

2.3. Algebra with logical expressions

Algebra with logical expressions is a nifty trick. Like many other computing packages, logical comparisons in SAS return the numeric value 1 if true, 0 otherwise. This feature can be used in DATA steps elegantly. Imagine you need to create a new variable agegr grouping ages in the survey examples. The first group comprises ages between 0 and 25 years, the second group between 26 and 40 years and the third all individuals age 41 and older.

data survey; set survey; agegr = (age <= 25) + 2*((age > 25) and (age <= 40)) + 3*(age > 40); run;

For individuals less than or equal to 25 years old, only the first logical comparison is true, and the resulting algebraic expression is agegr = 1 + 0 + 0;. For those between 26 and 40 years old, the second expression is true and the expression yields agegr = 0 + 2*1 + 0;. Finally, for those above 40 years old, you get agegr = 0 + 0 + 3*1;. Using algebra with logical expressions is sometimes easier and more compact than using if .. then .. else constructs. The if .. then syntax that accomplishes the same as the one-liner above is

data survey; set survey; if age <= 25 then agegr = 1; else if age <= 40 then agegr = 2; else agegr = 3;

run;

3. Dropping variables from a data set

Variables are dropped from a data set with the DROP DATA step statement or the DROP= option. To drop the variables r1, r2, and r3 from the survey data the two syntax constructs are

data survey; set survey; drop r1 r2 r3; run;

if you use the DROP statement and

data survey; set survey(drop=r1 r2 r3); run;

if you use the DROP= option. The end result of the two versions is the same: the variables r1, r2, and r3 are no longer part of the data set survey when the DATA step completes. There is a subtle difference between the two, however. When you use the DROP statement inside the DATA step, all variables in survey are initially copied into the new data set being created, and the variables being dropped are available in the DATA step itself; dropping takes place only at completion of the DATA step. When you list variables in a DROP= option, as in the second example, the variables are not copied (SET) into the data set. This version is slightly faster, since the interim data set being manipulated is smaller, but it precludes you from using the variables r1, r2, r3 anywhere in the DATA step. For example, if you want to calculate a new variable as the sum of r1, r2, r3 before dropping them, you have to use

data survey; set survey; total = r1 + r2 + r3; drop r1 r2 r3; run;

If you would use

data survey; set survey(drop=r1 r2 r3); total = r1 + r2 + r3; run;

total would contain missing values, since r1, r2, r3 are not known after survey has been copied into the new data set. If many variables form a numbered list, such as r1, r2, r3, etc., you can use a shortcut to describe the elements in the list:

data survey; set survey(drop=r1-r3); run;

The complementary DATA step statement and data set option to DROP (DROP=) are the KEEP statement and KEEP= option. Instead of dropping the variables listed, only the variables listed after KEEP (KEEP=) are kept in the data set; all others are eliminated. If you use the KEEP= data set option, variables not listed are not copied into the new data set. The next statement eliminates all variables except age and inc.

data survey; set survey(keep=age inc); run;

4. Dropping observations from a data set (Subsetting data)

Dropping observations (subsetting data) means retaining only those observations that satisfy a certain condition. This is accomplished with IF and WHERE statements as well as the WHERE= data set option. For example, to keep observations of individuals more than 35 years old, use

data survey; set survey; if age > 35; run;

SAS evaluates the logical condition for each observation and, upon successful completion of the DATA step, deletes those observations for which the condition is not true. An alternative syntax construction is IF <condition> THEN DELETE;:

data survey; set survey; if age <= 35 then delete; /* or use: if not (age > 35) then delete; */ run;

If you use this construction, the condition has to be reversed of course. The WHERE statement functions exactly like the first IF syntax example:

data survey; set survey; where age > 35; run;

The advantage of the WHERE construction is that it can be used as a data set option:

data survey; set survey(where=(age > 35)); run;

Only those observations for which the expression in parentheses is true are copied into the new data set. If a lot of observations must be deleted, this is much faster than using the WHERE or IF statement inside the DATA step. The other advantage of subsetting data with the WHERE= data set option is that it can be combined with any procedure. For example, if you want to print the data set for only those age 35 and above you can use

proc print data=survey(where=(age >= 35)); run;

without having to create a new data set containing the over 35 year old survey participants first. To calculate sample means, standard deviations, etc. for 1994 yield data from a data set containing multiple years:

proc means data=yielddat(where=(year = 1994)); run;

5. Setting and merging multiple data sets

Setting data sets means concatenating their contents vertically. Merging data means combining two or more data sets horizontally. Imagine two SAS data sets; the first contains n1 observations and v1 variables, the second n2 observations and v2 variables. When you set the two data sets, the new data set will contain n1+n2 observations and the union of the variables from the two data sets. Variables that are not in both data sets receive missing values for observations from the data set where the variable is not present. An example will make this clearer.

data growth1;
   input block trtmnt growth @@;
   year = 1997;
   datalines;
1 1 7.84   2 1 8.69   3 1 8.11   4 1 7.74   5 1 8.35
1 2 6.78   2 2 6.69   3 2 6.95   4 2 6.41   5 2 6.64
1 3 6.79   2 3 6.79   3 3 6.79   4 3 6.43   5 3 6.61
;
run;

 

proc print data=growth1(obs=10); run;

Data set growth1 contains 15 observations and four variables (BLOCK, TRTMNT, GROWTH, YEAR). Only the first 10 observations are displayed (OBS=10 data set option).

OBS  BLOCK  TRTMNT  GROWTH  YEAR
 1     1      1      7.84   1997
 2     2      1      8.69   1997
 3     3      1      8.11   1997
 4     4      1      7.74   1997
 5     5      1      8.35   1997
 6     1      2      6.78   1997
 7     2      2      6.69   1997
 8     3      2      6.95   1997
 9     4      2      6.41   1997
10     5      2      6.64   1997

The next data set (growth2) contains 10 observations. The variable YEAR is not part of the data set growth2.

data growth2;
   input block trtmnt growth @@;
   datalines;
1 4 6.64   2 4 6.57   3 4 6.78   4 4 6.54   5 4 6.48
1 5 7.31   2 5 7.65   3 5 7.26   4 5 6.98   5 5 7.39
;
run;

To combine the data sets vertically, use the SET data set statement and list the data sets you wish to combine. The data sets are placed in the new data set in the order in which they appear in the SET statement. In this example, the observations in growth1 go first followed by the observations in growth2.

data growth; set growth1 growth2; run; proc print data=growth; run;

The combined data set has four variables and 25 observations; the variable year contains missing values for all observations from growth2, since year was not present in that data set.

OBS  BLOCK  TRTMNT  GROWTH  YEAR
 1     1      1      7.84   1997
 2     2      1      8.69   1997
 3     3      1      8.11   1997
 4     4      1      7.74   1997
 5     5      1      8.35   1997
 6     1      2      6.78   1997
 7     2      2      6.69   1997
 8     3      2      6.95   1997
 9     4      2      6.41   1997
10     5      2      6.64   1997
11     1      3      6.79   1997
12     2      3      6.79   1997
13     3      3      6.79   1997
14     4      3      6.43   1997
15     5      3      6.61   1997
16     1      4      6.64      .
17     2      4      6.57      .
18     3      4      6.78      .
19     4      4      6.54      .
20     5      4      6.48      .
21     1      5      7.31      .
22     2      5      7.65      .
23     3      5      7.26      .
24     4      5      6.98      .
25     5      5      7.39      .

Merging data sets is usually done when the data sets contain the same observations but different variables, while setting is reasonable when data sets contain different observations but the same variables. Consider the survey example and assume that the baseline information (id, sex, age, inc) are in one data set, while the subjects ratings of product preference (r1, r2, r3) are contained in a second data set.

DATA baseline;
INPUT id sex $ age inc;
DATALINES;
 1 F 35 17
17 M 50 14
33 F 45  6
49 M 24 14
65 F 52  9
81 M 44 11
 2 F 34 17
18 M 40 14
34 F 47  6
50 M 35 17
;
DATA rating;
INPUT r1 r2 r3;
DATALINES;
7 2 2
5 5 3
7 2 7
7 5 7
4 7 7
7 7 7
6 5 3
7 5 2
6 5 6
5 7 5
;
run;

Merging the two data sets in a DATA step combines the variables and observations horizontally. If the first data set has n1 observations and v1 variables and the second data set has n2 observations and v2 variables, the merged data set will have max(n1,n2) observations. Observations not present in the smaller data set are patched with missing values. The number of variables in the combined data set depends on whether the two data sets share some variables. If a variable is present in both data sets, its values are retained from the data set in the merge list that contains the variable last. If the rating data set above contained a variable ID, the values of ID in the merged data set would come from the rating data set.

data survey; merge baseline rating; run; proc print data=survey; run;

OBS  ID  SEX  AGE  INC  R1  R2  R3
 1    1   F    35   17   7   2   2
 2   17   M    50   14   5   5   3
 3   33   F    45    6   7   2   7
 4   49   M    24   14   7   5   7
 5   65   F    52    9   4   7   7
 6   81   M    44   11   7   7   7
 7    2   F    34   17   6   5   3
 8   18   M    40   14   7   5   2
 9   34   F    47    6   6   5   6
10   50   M    35   17   5   7   5

6. Sorting your data

If your data are entered or read in some "ordered" fashion, one could consider them ordered. For example, the data set growth in the example above appears sorted first by the variable TRTMNT and, for each value of TRTMNT, by BLOCK. As far as SAS is concerned, the data are simply ordered by these variables, but not sorted. A data set is not sorted unless you process it with the SORT procedure. The basic syntax of PROC SORT is

proc sort data=yourdata; by <variable list>; run;

<variable list> is the list of variables by which to sort the data set. If this list contains more than one variable, SAS sorts the data set by the variable listed first. Then, for each value of this variable, it sorts the data set by the second variable. For example

by block trtmnt;

will cause SAS to sort by BLOCK first. All observations with the same value of variable BLOCK are then sorted by variable TRTMNT. By default, variables are sorted in ascending order. To reverse the sort order add the key-word DESCENDING before the name of the variable you want to be arranged in descending order. For example

by descending block trtmnt;

will sort the data in descending order of BLOCK and all observations of the same block in ascending order of TRTMNT. Why is sorting so important and how does it differ from arranging (reading) data in an ordered sequence to begin with? When you sort data with PROC SORT, SAS adds hidden variables for each variable in the BY statement. For example the code

proc sort data=growth; by block trtmnt; run;

sorts data set growth by BLOCK and TRTMNT. The hidden variables added to the data set are

first.block

last.block

first.trtmnt

last.trtmnt

You are not able to see these variables or print them out; however, they can be accessed in DATA steps. These are logical variables containing the values 0 and 1 only. In a group of observations with the same value of BLOCK, first.block takes on the value 1 for the first observation in the group (and 0 otherwise), while last.block takes on the value 1 for the last observation in the group (and 0 otherwise). With some trickery (one way to do this is sketched after the printout), I made first.block and last.block visible in the printout of the sorted data set growth:

 

                           first.   last.
OBS   BLOCK   TRTMNT       block    block   GROWTH
  1     1       1            1        0      7.84
  2     1       2            0        0      6.78
  3     1       3            0        0      6.79
  4     1       4            0        0      6.64
  5     1       5            0        1      7.31
  6     2       1            1        0      8.69
  7     2       2            0        0      6.69
  8     2       3            0        0      6.79
  9     2       4            0        0      6.57
 10     2       5            0        1      7.65
 11     3       1            1        0      8.11
 12     3       2            0        0      6.95
 13     3       3            0        0      6.79
 14     3       4            0        0      6.78
 15     3       5            0        1      7.26
 16     4       1            1        0      7.74
 17     4       2            0        0      6.41
 18     4       3            0        0      6.43
 19     4       4            0        0      6.54
 20     4       5            0        1      6.98
 21     5       1            1        0      8.35
 22     5       2            0        0      6.64
 23     5       3            0        0      6.61
 24     5       4            0        0      6.48
 25     5       5            0        1      7.39
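One way to pull off that trick, sketched here as an assumption about the approach rather than the author's actual code, is to copy the automatic flags into ordinary variables inside a DATA step:

data show;
   set growth;
   by block;                 /* the BY statement makes first./last. available */
   firstblk = first.block;   /* copy the hidden flags into real variables */
   lastblk  = last.block;
run;
proc print data=show; run;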

A convenient way to process data in procedures is to use BY-processing. This is only possible if the data contain the hidden variables first.whatever and last.whatever. Before you can use BY-processing, the data must be sorted accordingly.

The next code example shows how the first.whatever and last.whatever variables can be used in DATA steps. Here only the first observation for each block is output to the data set. This trick allows counting the number of unique values of BLOCK in the data set:

data blocks; set growth; by block; if first.block; run; proc print data=blocks; run;

OBS  BLOCK  TRTMNT  GROWTH  YEAR
 1     1      1      7.84   1997
 2     2      1      8.69   1997
 3     3      1      8.11   1997
 4     4      1      7.74   1997
 5     5      1      8.35   1997

Notice that the SET statement is followed by the BY statement. It is the presence of the BY statement in the DATA step that forces SAS to copy the first.block and last.block variables from the sorted data set. When the BY statement is omitted, first.block cannot be accessed.

The SORT procedure has many nifty aspects. Like for any other procedure you can access help and explore the syntax by entering help procedurename into the little white box in the top left corner of the SAS application workspace. Then click on the checkmark to the left of the box. Here are some other uses of PROC SORT:

1. By default the data set being sorted is replaced with the sorted version at completion of PROC SORT. To prevent this, use the OUT= option.

proc sort data=growth out=newdata; by descending block; run;

sorts the data set growth by descending levels of BLOCK but leaves the original data untouched. The sorted data is written instead to a data set called newdata.

2. Sometimes you want to identify how many combinations of the levels of certain variables are in your data set. One technique to determine this is shown above, a DATA step using the first.xxx and/or last.xxx variables. More conveniently, you can use the NODUPKEY option. If there are multiple observations for a certain combination of the sort variables, only the first one is retained:

proc sort data=growth out=blocks nodupkey; by block; run;

6.1. By-processing in PROC steps

By-processing in general refers to the use of the BY statement in PROC or DATA steps. If a BY statement

by <variable list>

appears in a PROC or DATA step the data set(s) being processed must be sorted as indicated by the variable list. However, if you sort data with

by block tx a b ;

for example, by processing is possible with any of the following:

by block; by block tx; by block tx a; by block tx a b;

since a data set sorted by BLOCK and TX is also sorted by BLOCK, and so forth. You cannot, however, change the order in which the variables appear in the list, or omit a variable ranked higher in the sort order. BY statements such as

by b; by tx a; by a block;

and so forth will create errors at execution.

A BY statement in a procedure causes SAS to execute the procedure separately for all combinations of the BY variables. For example,

proc means data=growth; var growth; run;

will calculate mean, standard deviation, etc. for the variable growth across all observations in the data set. The statements

proc means data=growth; var growth; by block; run;

will calculate these descriptive statistics separately for each level of the BLOCK variable in the data set.

6.2. By-merging of sorted data sets (sorted matching)

By-merging is a powerful tool to merge data according to variables contained in both data sets. The easiest case is one-to-one merging, where each data set in the merge contributes at most one observation for each combination of the variables by which the data sets are matched. Consider the following example: in a spacing study, two levels of P fertilization (0 and 25 lbs/acre) and three levels of row spacing (40, 80, 120 cm) are applied. Yield data are collected in 1996 and 1997. The data sets for the two years are shown below; each contains 17 observations. To calculate the yield difference between the two years, the data sets need to be merged. Upon closer inspection one sees, however, that the observations do not have the same P and SPACE variable arrangement. There are only two observations for P=0, SPACE=40 in the 1996 data set, whereas there are three observations for this combination in the 1997 data set. Conversely, replicate 2 for P=25, SPACE=120 in 1996 is not represented in the 1997 data set.

/* 1996 yield data */
data spacing1;
   input P space rep yield96;
   datalines;
 0  40 1 57
 0  40 2 58
 0  80 1 57
 0  80 2 58
 0  80 3 56
 0 120 1 49
 0 120 2 54
 0 120 3 53
25  40 1 53
25  40 2 45
25  40 3 46
25  80 1 54
25  80 2 50
25  80 3 48
25 120 1 63
25 120 2 57
25 120 3 53
;
run;

/* 1997 yield data */
data spacing2;
   input P space rep yield97;
   datalines;
 0  40 1 35
 0  40 2 28
 0  40 3 29
 0  80 1 38
 0  80 2 29
 0  80 3 27
 0 120 1 10
 0 120 2 25
 0 120 3 34
25  40 1 24
25  40 2 24
25  40 3 17
25  80 1 25
25  80 2 31
25  80 3 29
25 120 1 44
25 120 3 28
;
run;

The following DATA step merges the data incorrectly. It matches observation by observation, and since both data sets contain the variables P, SPACE, and REP, the values for these variables are pulled from spacing2, the last data set in the list.

data spacing; merge spacing1 spacing2; run; proc print data=spacing; run;

OBS   P   SPACE  REP  YIELD96  YIELD97
  1   0     40    1      57       35
  2   0     40    2      58       28
  3   0     40    3      57       29
  4   0     80    1      58       38
  5   0     80    2      56       29
  6   0     80    3      49       27
  7   0    120    1      54       10
  8   0    120    2      53       25
  9   0    120    3      53       34
 10  25     40    1      45       24
 11  25     40    2      46       24
 12  25     40    3      54       17
 13  25     80    1      50       25
 14  25     80    2      48       31
 15  25     80    3      63       29
 16  25    120    1      57       44
 17  25    120    3      53       28

Yield measurements in the two years are matched up correctly for the first two observations, but incorrectly for the third and all following observations. Since the P, SPACE, and REP variables contained in data set spacing1 were overwritten by the variables in spacing2, the problem is not at all obvious at first glance. To correctly merge the two data sets, we sort them both, so that each observation is properly identified:

proc sort data=spacing1; by p space rep; run; proc sort data=spacing2; by p space rep; run;

Then merge them using BY-processing:

data spacing; merge spacing1 spacing2; by p space rep; run; proc print data=spacing; run;

This produces:

OBS   P   SPACE  REP  YIELD96  YIELD97
  1   0     40    1      57       35
  2   0     40    2      58       28
  3   0     40    3       .       29
  4   0     80    1      57       38
  5   0     80    2      58       29
  6   0     80    3      56       27
  7   0    120    1      49       10
  8   0    120    2      54       25
  9   0    120    3      53       34
 10  25     40    1      53       24
 11  25     40    2      45       24
 12  25     40    3      46       17
 13  25     80    1      54       25
 14  25     80    2      50       31
 15  25     80    3      48       29
 16  25    120    1      63       44
 17  25    120    2      57        .
 18  25    120    3      53       28

The observations are now matched up properly.

By-merging is also useful to match many-to-one relationships in data sets. Assume that data set rabbits contains the change in ear temperature at treatment and 4 days after treatment, for five rabbits in each of three treatment groups:

data rabbits;
   input Treat Rabbit day0 day4;
   datalines;
1 1 -0.3 -0.2
1 2 -0.5  2.2
1 3 -1.1  2.4
1 4  1.0  1.7
1 5 -0.3  0.8
2 1 -1.1 -2.2
2 2 -1.4 -0.2
2 3 -0.1 -0.1
2 4 -0.2  0.1
2 5 -0.1 -0.2
3 1 -1.8  0.2
3 2 -0.5  0.0
3 3 -1.0 -0.3
3 4  0.4  0.4
3 5 -0.5  0.9
;
run;

 

A second data set contains the treatment means averaged across the five rabbits for each treatment group:

data means;
   input treat day0mn day4mn;
   datalines;
1 -0.24  1.38
2 -0.58 -0.52
3 -0.68  0.24
;
run;

 

You want to calculate the deviation between each observation and the respective treatment mean. That means one observation in data set means must be matched with five observations in data set rabbits. This is done easily with by-merging: sort both data sets by TREAT and merge them by TREAT.

proc sort data=rabbits; by treat; run; proc sort data=means; by treat; run; data deviate; merge rabbits means; by treat; dev0 = day0 - day0mn; dev4 = day4 - day4mn; run; proc print data=deviate; run;

 
 

which produces:

OBS  TREAT  RABBIT  DAY0  DAY4  DAY0MN  DAY4MN   DEV0   DEV4
  1    1      1     -0.3  -0.2   -0.24    1.38  -0.06  -1.58
  2    1      2     -0.5   2.2   -0.24    1.38  -0.26   0.82
  3    1      3     -1.1   2.4   -0.24    1.38  -0.86   1.02
  4    1      4      1.0   1.7   -0.24    1.38   1.24   0.32
  5    1      5     -0.3   0.8   -0.24    1.38  -0.06  -0.58
  6    2      1     -1.1  -2.2   -0.58   -0.52  -0.52  -1.68
  7    2      2     -1.4  -0.2   -0.58   -0.52  -0.82   0.32
  8    2      3     -0.1  -0.1   -0.58   -0.52   0.48   0.42
  9    2      4     -0.2   0.1   -0.58   -0.52   0.38   0.62
 10    2      5     -0.1  -0.2   -0.58   -0.52   0.48   0.32
 11    3      1     -1.8   0.2   -0.68    0.24  -1.12  -0.04
 12    3      2     -0.5   0.0   -0.68    0.24   0.18  -0.24
 13    3      3     -1.0  -0.3   -0.68    0.24  -0.32  -0.54
 14    3      4      0.4   0.4   -0.68    0.24   1.08   0.16
 15    3      5     -0.5   0.9   -0.68    0.24   0.18   0.66

This type of merge also works well if the two data sets do not have the same levels of the BY variables. Assume, for example, that the third observation is missing from data set means:

data means;
   input treat day0mn day4mn;
   datalines;
1 -0.24  1.38
2 -0.58 -0.52
;
run;

To keep, when the data sets are merged, only those observations for which a mean value exists, you can use the IN= option:

proc sort data=rabbits; by treat; run; proc sort data=means; by treat; run; data deviate; merge rabbits means(in=y); by treat; if y; dev0 = day0 - day0mn; dev4 = day4 - day4mn;

run; proc print data=deviate; run;

which produces

OBS  TREAT  RABBIT  DAY0  DAY4  DAY0MN  DAY4MN   DEV0   DEV4
  1    1      1     -0.3  -0.2   -0.24    1.38  -0.06  -1.58
  2    1      2     -0.5   2.2   -0.24    1.38  -0.26   0.82
  3    1      3     -1.1   2.4   -0.24    1.38  -0.86   1.02
  4    1      4      1.0   1.7   -0.24    1.38   1.24   0.32
  5    1      5     -0.3   0.8   -0.24    1.38  -0.06  -0.58
  6    2      1     -1.1  -2.2   -0.58   -0.52  -0.52  -1.68
  7    2      2     -1.4  -0.2   -0.58   -0.52  -0.82   0.32
  8    2      3     -0.1  -0.1   -0.58   -0.52   0.48   0.42
  9    2      4     -0.2   0.1   -0.58   -0.52   0.38   0.62
 10    2      5     -0.1  -0.2   -0.58   -0.52   0.48   0.32

7. Formatting

 

By formatting we mean the display of variable names and observations on SAS printouts in a user-defined form. By default if a data set is printed, SAS labels the columns with the variable names and displays the values of the variables with their actual content. For readability, it is often advisable to replace the short mnemonic variable names with more descriptive captions and to display observations differently. In the preceding example, the variable TREAT refers to a treatment applied in the experiment. Without explicitly knowing what treat=1, treat=2, etc. refers to, printouts may be hard to read. Rather than re-entering the data and changing the contents of treat to a descriptive string, you can instruct SAS to display the contents of the variable differently, without changing its contents.

7.1. Labels

Labels are descriptive strings associated with a variable. Labels are assigned in data steps with the LABEL statement. The following data set is from an experiment concerned with the accumulation of thatch in creeping bentgrass turf under various nitrogen and thatch management protocols.

data chloro;
   label block  = 'Experimental replicate'
         nitro  = 'Nitrogen Source'
         thatch = 'Thatch management system'
         chloro = 'Amount chlorophyll in leaves (mg/g)';
   input block nitro thatch chloro;
   datalines;
1 1 1 3.8
1 1 2 5.3
1 1 3 5.9
<and so forth>
;
run;
proc print data=chloro label;
run;

 

produces

                                    Thatch        Amount
      Experimental   Nitrogen     management    chlorophyll in
OBS    replicate      Source        system      leaves (mg/g)
 1         1             1             1             3.8
 2         1             1             2             5.3
 3         1             1             3             5.9

etc.

The LABEL option of PROC PRINT instructs SAS to display labels instead of variable names where labels exist. It is not necessary to assign labels to all the variables in the data set.

7.2. Formats

While labels replace variable names in displays, formats substitute for the values of a variable. Many formats are predefined in The SAS System (check the SAS/Language or SAS User's Guide manuals or the online help files). We are concerned here with user-defined formats, which are created in PROC FORMAT. The next example creates two formats, NI and YEA. The definition of a format commences with the VALUE keyword followed by the format name; then follows a list of the form value = 'some string'. If a variable is associated with a format and SAS encounters a given value, it displays the string instead. For example, values of 2 are displayed as Amm. sulf. for a variable associated with the format NI.

proc format; value Ni 1='Urea' 2='Amm. sulf.' 3='IBDU' 4='Urea(SC)'; value Yea 1='2 years' 2='5 years' 3='8 years';

run;

The association between variable and format is done in a DATA step (it is also possible to create this association in certain procedures such as PROC FREQ, PROC PRINT, etc.) with the FORMAT statement.

data chloro;
   label block  = 'Experimental replicate'
         nitro  = 'Nitrogen Source'
         thatch = 'Thatch management system'
         chloro = 'Amount chlorophyll in leaves (mg/g)';
   format nitro ni. thatch yea.;
   input block nitro thatch chloro;
   datalines;
1 1 1 3.8
1 1 2 5.3
1 1 3 5.9
<and so forth>
;
run;

The name of the format follows the name of the variable to be formatted, and the format name must be followed by a period; otherwise SAS will assume that the format name is a variable name. Printing this data set after assigning labels and formats produces a much more pleasing printout.

                                    Thatch        Amount
      Experimental   Nitrogen     management    chlorophyll in
OBS    replicate      Source        system      leaves (mg/g)
 1         1           Urea         2 years          3.8
 2         1           Urea         5 years          5.3
 3         1           Urea         8 years          5.9

etc.