
Document K6.1

Introduction to SAS *

Introduction  1
Documentation  1
The SAS DATA step  1
Creating and naming a dataset - the DATA statement  2
Preparing data for SAS  2
Entering data to SAS and naming variables - the INPUT statement  3
Reading data from a file - the INFILE statement  5
Including data with SAS commands - the CARDS statement  5
Reading from a SAS dataset - the SET statement  5
Calculations - the assignment statement  6
Conditional evaluation and selection - the IF, THEN and ELSE statements  7
Giving labels to variables - the LABEL statement  8
Displaying data - the PUT statement  8
Controlling output format - the FORMAT statement  9
Executing instructions - the RUN statement  10
Recoding data  10
Example DATA steps  10
The SAS PROC step  11
Analysing data in subgroups - the BY statement  11
Controlling output format - the FORMAT statement  11
Printing titles on output from procedures - the TITLE statement  11
Printing footnotes - the FOOTNOTE statement  12
Descriptive statistics  12
Frequency tables and cross-tabulations - the FREQ procedure  12
Descriptive statistics - the MEANS procedure  14
Statistics about a single variable - the UNIVARIATE procedure  15
Correlations - the CORR procedure  17
Complex tables - the TABULATE procedure  19
Student’s T-Test  21
Drawing simple diagrams with SAS  23
Bar charts - the CHART procedure  24
Scattergrams - the PLOT procedure  26
High-quality graphics  28
Bar charts - the GCHART procedure  28
Scattergrams - the GPLOT procedure  31
Maps - the GMAP procedure  33
Regression - the REG procedure  34
The MODEL statement  35
The OUTPUT statement  36
Example  36
The CLASS statement  39
The MODEL statement  39
The MEANS statement  40
Example  40
General linear modelling - the GLM procedure  44
The CLASS statement  44
The MODEL statement  45
The ID statement  45
The MEANS statement  45
The RANDOM statement  45
The OUTPUT statement  45
Example  46
Miscellaneous useful procedures  46
Sorting a dataset - the SORT procedure  46
Defining your own formats - the FORMAT procedure  47
Printing the values of variables - the PRINT procedure  49
References and further information  52

* We are grateful to the University of Liverpool Computer Laboratory for permitting us to use their original user guide as a basis for this one.

Introduction

SAS (Statistical Analysis System, usually pronounced as a single syllable ‘sass’) provides facilities for the manipulation and analysis of both numeric and character data using a variety of techniques. The statistical facilities range from production of frequency tables and bar charts to multivariate regression and analysis of variance. The facilities for data input and manipulation give control over how data is to be read, and provide for the sorting and merging of datasets and calculations on the data. This variety of facilities makes SAS suitable for a wide range of applications including survey analysis, data processing, and analysis of designed experiments.

A series of SAS statements is often divided into DATA steps and PROC steps. The division is between those statements which input data, manipulate it, select subgroups, create new variables etc (the DATA step), and those which analyse the data and produce output (the PROC step). The section ‘The SAS DATA step’ describes some of the functions possible in the DATA step, the section ‘The SAS PROC step’ describes some of the statements which can occur in any PROC step, while the sections ‘Descriptive statistics’ to the end describe some specific PROC statements.

SAS is available locally on CMS and CSA. A PC version may be purchased from the Computing Service.

Documentation

SAS is described in several manuals, amounting to many hundreds of pages. For this document it has been necessary to omit much of the material and select only a small portion of the facilities available. Even when a facility is described here it will not be described in full and many options may be ignored. This means that you should check with the full manuals if there is something you want to do that is not described in the following sections. Details of manuals may be found on page 52. On CMS, type HELP SAS for details of the CMS version.

The SAS DATA step

A SAS DATA step is a set of statements which set up the data, for example read the data, manipulate it, select subgroups or create new variables. This section describes some of the functions possible in the DATA step.

A DATA step starts with a DATA statement and ends with a RUN statement (or with a further DATA or a PROC statement).

Comments can appear anywhere in a SAS statement provided they appear within the delimiters /* (to start a comment) and */ (to end one). For example:

DATA STOCK;
INPUT CODE QUANTITY PRICE /*in dollars*/ STOREBIN;

Creating and naming a dataset - the DATA statement

The DATA statement has the general form:

DATA dataset options ;

Note that, as with all SAS statements, the DATA statement ends with a semicolon. The ‘dataset’ part names the dataset being created by this DATA step, either by the input of data or by processing an existing dataset. ‘dataset’ has two parts: a library name (which may be omitted) and the dataset name. These names follow the same rules as other names used in SAS, which are:

¤ The name must not be more than eight characters long.

¤ It may contain only the letters A to Z, the digits 0 to 9 and the underline character.

¤ It must not start with a digit.

Examples of valid names are NEW_ONE, OLDDATA2 and _EX_A1.

If a library name is specified it is separated from the dataset name by a full stop. If no library name is given, it is named WORK, which is a temporary library that is deleted when you leave SAS. Examples of valid DATA statements are:

DATA KEEP.GNP82;

DATA PRICES;

DATA E_4_DATA;

If you miss out ‘dataset’ altogether, SAS gives the dataset the name WORK.DATAn, where n is 1, 2, 3 and so on, depending on how many such datasets you have created so far.

If you specify a library name you can save the library and use it in a later SAS session rather than having to set it up again. Note however that only the library name SASUSER is set up for you by SAS. If you wish to use any other library name then you must define it both to the system you are using and to SAS (using a LIBNAME statement). This is described in more detail in Reference A or B.
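As a rough sketch of how a library other than SASUSER might be defined and then used, the statements below assign the library name KEEP with a LIBNAME statement and store a dataset in it. The directory shown is hypothetical, the way a library is identified differs between systems (see Reference A or B), and the variable names and data values are invented for illustration.

```sas
/* Hypothetical directory: the form of this name depends on your system */
LIBNAME KEEP '/home/user/sasdata';

/* The dataset GNP82 is stored in library KEEP and survives the session */
DATA KEEP.GNP82;
  INPUT YEAR GNP;      /* illustrative variables, not from this guide */
  CARDS;
1981 254
1982 277
RUN;
```

A later session would need the same LIBNAME statement before KEEP.GNP82 could be read again.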

Preparing data for SAS

SAS can read either fixed format or free format data (or even data which mixes fixed and free format in some cases). Fixed format requires that values be in fixed positions in each line of the data, while free format requires only that values be separated by a blank. Most people find it easier to prepare their data in free format. SAS can read text data (for example names) as well as numbers, and it has special provision for times and dates.

Free format data must follow certain rules:

¤ Each value must be separated from the next by at least one space or by the end of a line.

¤ Decimal points must be included where they are required.

¤ Any missing data must be represented by a full stop (.) and not by a blank (full stop is the standard symbol for a missing value in SAS).

¤ If you are using text data then no item of text may be longer than eight characters and the data must be enclosed in quotes.

If your data satisfies all these conditions, you can use the free format method of reading data. If not, you must use fixed format, or possibly a mixture of free and fixed format, as described below.
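A minimal sketch of numeric data laid out according to these rules is shown below. The dataset name SIZES and the values are invented for illustration; the INPUT and CARDS statements used here are described later in this section.

```sas
DATA SIZES;
  INPUT CASE_NO AGE HEIGHT WEIGHT;
  CARDS;
1 31 1.88 82
2 . 1.60 60
3 24 1.90 .
RUN;
```

Each value is separated by a space, decimal points are typed where needed, and the two missing values (the age of case 2 and the weight of case 3) are typed as full stops rather than left blank.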

Entering data to SAS and naming variables - the INPUT statement

A SAS dataset is made up of a number of cases or observations, each of which contains a value (either measurements or calculations) for each variable in the dataset.

If you are inputting new data, rather than processing an existing dataset, use the INPUT statement to specify:

¤ the names of the variables

¤ the order of variables in each complete set of variables (or case or observation)

¤ whether variables are numeric or contain characters

¤ how the variables are laid out on the input line

The names of the variables follow the same rules as those for dataset names given on page 2, for example HEIGHT, TOT_WAGE, AVE3. A variable is assumed to be numeric unless it is followed by a dollar sign ($) in the INPUT list. The dollar sign is not part of its name and is not typed when using the variable in other SAS statements.

The order in which the variables appear in the INPUT statement is their order in the dataset (although not necessarily their order in the input data, as some forms of fixed format input can read the data in a different order to the one in which it was typed).

If you have a list of variable names ending in numeric suffixes, for example VAR1, VAR2, VAR3 etc, SAS allows an abbreviated form for their specification:

INPUT CASENUM VAR1-VAR12 DETAILS;

This form may also be used to specify such a list in procedures. For example:

PROC MEANS; VAR AGE1-AGE7; RUN;

For more details of PROC MEANS see page 14.

Where variable names do not follow such a pattern you can still use an abbreviated form when specifying them in a procedure, but it is a little different. If the INPUT statement for the dataset was:

INPUT CASE SEX AGE TESTA TESTB INCOME MARSTAT REGION;

you could type:

PROC MEANS; VAR AGE--INCOME; RUN;

to obtain means on the variables AGE, TESTA, TESTB and INCOME. The double hyphen indicates a list of variable names.
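The two kinds of abbreviated list can be tried out with a sketch such as the following. The dataset name SCORES, the variables Q1 to Q3 and all the data values are invented for illustration.

```sas
DATA SCORES;
  INPUT CASE AGE TESTA TESTB INCOME Q1-Q3;
  CARDS;
1 34 12 15 9500 1 2 2
2 41 10 14 12000 3 1 2
3 29 11 . 8700 2 2 1
RUN;

PROC MEANS DATA=SCORES;
  VAR AGE--INCOME;   /* AGE, TESTA, TESTB and INCOME, in dataset order */
RUN;

PROC MEANS DATA=SCORES;
  VAR Q1-Q3;         /* Q1, Q2 and Q3, by numeric suffix */
RUN;
```

The single hyphen relies on the numeric suffixes, whereas the double hyphen relies on the order in which the variables were defined in the dataset.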

Free format - list input

To input data which satisfies the rules of free format input given on page 2 you need only list the variables, for example:

INPUT CASE_NO SEX $ AGE HEIGHT WEIGHT;

SEX has been declared as a character variable so that M and F can be used to denote male and female, rather than using numeric codes. As with all SAS statements, the INPUT statement ends with a semicolon.

Fixed format - column input

There are two ways of specifying fixed format input. The easier is column input. The columns containing the value for the variable are specified after the variable name, for example:

INPUT REF 1-6 NAME $ 7-26 AGE 27-28 SEX 59 HEIGHT 64-66 .2;

This specifies the name and type of the variables and how they are laid out:

REF     numeric, columns 1 to 6
NAME    character, columns 7 to 26
AGE     numeric, columns 27 to 28
SEX     numeric, column 59
HEIGHT  numeric, columns 64 to 66; the last two columns (65 and 66) are assumed to be after a decimal point (which need not be typed)

If you always type decimal points where they are needed then you have no need to use the type of column format given for HEIGHT. Instead you could specify simply HEIGHT 64-66 and type in all decimal points.

Anything typed in columns other than those mentioned in the INPUT statement is ignored. If your data does not fit on one line you must indicate when the second or third line begins. For example, if the data has the name on one line and the address on the next:

INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;

The #2 means that subsequent column numbers refer to line 2.
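As an illustration, a complete DATA step reading two lines per case might look like the sketch below. The dataset name PEOPLE and the names and addresses are invented; CARDS and PROC PRINT are described elsewhere in this guide.

```sas
DATA PEOPLE;
  INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;
  CARDS;
JANE SMITH
12 HIGH STREET, ANYTOWN
JOHN BROWN
3 MILL LANE, OTHERTON
RUN;

PROC PRINT DATA=PEOPLE;
RUN;
```

The four data lines produce two cases, each with a NAME taken from the first line of the pair and an ADDRESS taken from the second.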

Fixed format - formatted input

Formatted input is the second way to input fixed format data. It is the only way of inputting certain unusual types of data and of declaring that data represents a time or date so that it can be printed appropriately later. The data layout used above would be specified as follows using SAS informats (an informat describes how a value is to be read; a format describes how it is to be printed):

INPUT @1 REF 6. NAME $20. AGE 2. @59 SEX 1. @64 HEIGHT 3.2;

¤ ‘@1’ specifies that REF starts in column 1; the column pointer is written before the variable to which it applies

¤ ‘6.’ means that it is 6 digits long with no figures after the decimal point

¤ since NAME follows immediately after REF there is no need to specify the column at which it starts (column 7 is assumed)

¤ ‘$20.’ indicates that it is a character value 20 characters long

¤ AGE is two digits long with no decimal point, so its informat is ‘2.’

¤ SEX does not immediately follow AGE, and so ‘@59’ is needed to show where SEX starts, and ‘1.’ specifies that it takes up one column

¤ HEIGHT takes up three columns, the last two of which are after the decimal point, so its informat is ‘3.2’ rather than ‘3.’, but ‘3.’ would be quite adequate if you had typed the decimal points in the data.

If there is more than one line of data for each case use the ‘#n’ notation, for example:

INPUT @1 NAME $20. #2 @1 ADDRESS $60.;

Mixing free and fixed format

SAS allows the different styles of describing how the data is arranged to be specified on the same INPUT statement. For example, suppose you had data which consisted of a name of up to 20 letters and seven measurements, and you wished to type the measurements in free format but could not use free format for the name because it had more than 8 characters. If you put the name first you could describe this data quite simply:

INPUT NAME $ 1-20 M1 M2 M3 M4 M5 M6 M7;

The name starts in column 1 and the measurements can be typed in free format after column 20. If any name is 20 characters long, a space between the end of the name and the first measurement is advisable.

Multiple cases on a line

If the list of variables ends with @@, SAS does not expect each case to start on a new line. For example:

DATA LINES;
INPUT A B @@;
CARDS;
1 2 1 4 1 5 6 4 2 3 6 7 10 8 3 6
RUN;

The data for eight cases is typed on one line. For an explanation of CARDS see page 5 and for RUN see page 10.

Reading data from a file - the INFILE statement

If data is to be read from a file you must include an INFILE statement to specify the name of the file. There are different ways of specifying the file according to which system you are using. See Reference A or B for further details.
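On systems where a file can be named directly in quotes, a DATA step reading from a file might look like the sketch below. The file name survey.dat and the variable list are hypothetical, and other systems (CMS, for example) identify the file differently, so treat this only as an outline of where the INFILE statement goes.

```sas
DATA SURVEY;
  INFILE 'survey.dat';            /* hypothetical file name */
  INPUT CASENO SEX $ AGE INCOME;  /* illustrative free format layout */
RUN;
```

The INFILE statement comes before the INPUT statement, and no CARDS statement is needed because the data is not included with the SAS statements.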

Including data with SAS commands - the CARDS statement

If the data is to be included with the SAS statements the CARDS statement appears just before the data. It has the format:

CARDS;

If CARDS is being used then only the data itself and a RUN statement (see page 10) should follow it. Any transformations of the data must appear before the CARDS statement.

Reading from a SAS dataset - the SET statement

If the data is to be read from an existing dataset the SET statement specifies the dataset to be used. A DATA step should normally contain either a SET statement or an INPUT statement. The format of the SET statement is:

SET dataset;

For example:

SET GHS_DATA;

SET SASUSER.ANSWERS;

Calculations - the assignment statement

Assignment statements have no keyword to identify them. They are used to perform calculations on the data and have the general form:

variable = expression;

The expression contains numbers, variable names and arithmetic operators. The multiplication operator (*) must always appear between two quantities which are to be multiplied together.

The symbols used for arithmetic operators are:

** exponentiation, for example (1 + R)**T

* multiplication, for example TAXABLE * RATE

/ division, for example C367 / C45

+ addition, for example A1 + B9

- subtraction, for example GROSS - TAX

Brackets may be used to make your meaning clear. Examples of assignment statements are:

AVEINC = INCOME / NUMFAM;

B1 = (A7 + C9 - EXPENSES)*0.54;

GRPPAY = PAY / 1000;

SAS has rules for the order in which it evaluates expressions by giving priorities (or precedence) to each operator, as follows:

1 Bracketed expressions

2 Exponentiation

3 Multiplication and Division

4 Addition and Subtraction

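If there are operators of equal precedence SAS works from left to right (except for exponentiation, which SAS evaluates from right to left). Because multiplication has a higher priority than addition, an expression like: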

A + B * C

is evaluated by SAS as A + (B * C). If you wish to add A and B before multiplying by C then you must use brackets:

(A + B) * C

If you are in doubt about how SAS will evaluate a complex expression then either insert brackets or split it into simpler expressions and use several assignment statements to build up the full expression.
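One way to check how an expression is evaluated is to try it on known values. The sketch below uses DATA _NULL_, a standard SAS dataset name (not otherwise covered in this guide) that runs a DATA step without creating a dataset; the variable names and values are invented for illustration.

```sas
DATA _NULL_;
  A = 2; B = 3; C = 4;
  X1 = A + B * C;      /* multiplication first: 2 + 12 */
  X2 = (A + B) * C;    /* brackets first: 5 * 4 */
  X3 = A * B ** 2;     /* exponentiation first: 2 * 9 */
  PUT X1= X2= X3=;     /* written to the SAS log */
RUN;
```

The log line should read X1=14 X2=20 X3=18, matching the priorities listed above.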

SAS expressions can also include SAS functions. These provide many facilities including square roots of numbers, logarithms, sines and cosines, probabilities, etc. A list of the more commonly used functions is given below.

ABS      absolute value (that is, the value ignoring sign)
MAX      maximum of a list of values
MIN      minimum of a list of values
SQRT     square root
INT      gives the integer part of the value, that is it discards the decimal part
ROUND    rounds off to the nearest whole number
NORMAL   gives a random number from a normal distribution with mean 0 and standard deviation 1
UNIFORM  gives a random number from a uniform distribution in the range 0 to 1

There are also functions for manipulating dates and times and for character variables.

Functions are used by giving the value on which they are to operate in brackets following the name, for example:

BIG = MAX(A,B,C);

S = SQRT(S2/N);

RX = ROUND(X) + C;

Z = M + S*NORMAL(0);
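The non-random functions in the list can be checked against a known value with a short sketch like this one. DATA _NULL_ is a standard SAS dataset name that avoids creating a dataset; the variable names and the test value are invented for illustration.

```sas
DATA _NULL_;
  X = -7.6;
  A = ABS(X);           /* 7.6: the sign is ignored */
  I = INT(X);           /* -7: the decimal part is discarded */
  R = ROUND(X);         /* -8: the nearest whole number */
  BIG = MAX(3, 9, 4);   /* 9 */
  SMALL = MIN(3, 9, 4); /* 3 */
  S = SQRT(16);         /* 4 */
  PUT A= I= R= BIG= SMALL= S=;   /* written to the SAS log */
RUN;
```

Note the difference between INT and ROUND for a negative value: INT simply drops the decimal part, whereas ROUND moves to the nearest whole number.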

Conditional evaluation and selection - the IF, THEN and ELSE statements

The IF statement is used when an action is to be carried out on only some of the cases being processed. For example, you may wish to take special action if data is missing, or do calculations differently for people in work and those unemployed, or you may wish to exclude certain groups. The general form of the IF statement is:

IF condition THEN statement1; ELSE statement2;

‘statement1’ is acted upon if ‘condition’ is true, otherwise ‘statement2’ is acted upon. The ‘condition’ is usually of the form:

expression comparator expression

where the expression is as described on page 6, and ‘comparator’ is one of the following:

EQ or = equals

NE or ^= not equal to

GT or > greater than

NG or ^> not greater than

LT or < less than

NL or ^< not less than

GE or >= greater than or equal to

LE or <= less than or equal to

Note that NL and GE are equivalent and so are NG and LE.

These simple conditions can be linked with the words AND and OR. NOT may be used to change the meaning of a condition.

When using a complex condition you should use brackets to make your meaning clear. Examples of conditions are:

AGE LE 15 AND WORKSTAT EQ 1

NOT (SEX = 1 OR NISTAMP < 2)

MARSTAT EQ 1 OR MARSTAT >= 3

Examples of IF statements are:

IF A > B THEN X=A;

IF MONTH >= 3 AND MONTH LT 6 THEN SEASON = 'SPRING';

IF EXPENSES GT (EARNINGS - TAX) THEN DEBT = 1;

IF A > B THEN X=A;
ELSE X=B;

The above examples have shown only assignment statements following THEN and ELSE, but other statements can also be used, for example DELETE (which omits the case from the dataset) or PUT (which can print values; see page 8). For example, if you already had a dataset and wished to set up another one which only contained men over retirement age, then you might have:

IF SEX NE 'M' OR AGE LT 65 THEN DELETE;

or

IF NOT (SEX EQ 'M' AND AGE GE 65) THEN DELETE;

A special value used to indicate missing values is the full stop (which can also be used in your data for the same purpose). Suppose your data had been prepared with ‘9’ indicating a missing value for the variable MARSTAT (which is marital status); you could replace this with a missing value symbol by:

IF MARSTAT = 9 THEN MARSTAT = . ;

To eliminate cases with important data missing:

IF INCOME = . AND EXPENSES = . THEN DELETE;

Giving labels to variables - the LABEL statement

The LABEL statement allows you to define labels for variables, which will be used by various procedures to document the output. A label may be up to 40 characters long, for example:

LABEL INCOME='ANNUAL INCOME INCLUDING STATE BENEFITS';

Displaying data - the PUT statement

The PUT statement enables you to print out values as the DATA step is being processed. It has equivalent styles to the three forms of the INPUT statement (list, columns, and formatted). Only simple ways of using PUT are described here.

PUT - list style

This can be simply a list of the variable names whose values you wish to be printed, for example:

PUT NAME AGE MARSTAT;

in which case the values are printed with one space between each value, and each case starts on a new line.

If you follow the name of the variable with an equals sign, the value is labelled with the name of the variable, for example:

PUT REF= AGE=;

The output from this statement is:

REF=103 AGE=56

If a value is missing a full stop is printed to represent it.

You may also print text with PUT. For example, to check the validity of the data and print an error message if a mistake is found:

IF AGE LT 16 AND WORK EQ 1 THEN PUT 'UNDER AGE ' NAME AGE= WORK=;
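A check of this kind would normally sit inside a DATA step alongside other IF statements. The following is a minimal sketch, assuming an existing dataset called SURVEY that contains the variables NAME, AGE and WORK; the dataset names and the new variable RETIRED are invented for illustration.

```sas
DATA CHECKED;
  SET SURVEY;                  /* hypothetical existing dataset */
  IF AGE = . THEN DELETE;      /* omit cases with no age recorded */
  IF AGE LT 16 AND WORK EQ 1 THEN
    PUT 'UNDER AGE ' NAME AGE= WORK=;   /* message written to the SAS log */
  IF AGE GE 65 THEN RETIRED = 1;
  ELSE RETIRED = 0;
RUN;
```

Because DELETE stops processing of the current case, the later statements are only applied to cases where AGE is present.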

PUT - column style

To lay out the data in regular columns use column style. For example:

PUT CASE_NO 1-8 HEIGHT 11-14 .2 WEIGHT 16-19 .1;

prints the case number in columns 1 to 8, the height in columns 11 to 14 with a decimal point in column 12, and the weight in columns 16 to 19 with a decimal point in column 18.

You can print text by specifying how many blanks are to be left between the last field printed and the text. For example:

IF NRUNS GE 100 THEN PUT BATSMAN 1-20 +2 'SCORED A CENTURY';

prints the name in the first 20 columns, skips 2 columns and then prints the text.

PUT - formatted style

In this style the name of each variable is followed by the name of the format in which it is to be printed. This format may be a standard SAS format or one which you have defined (see the description of PROC FORMAT in the section on page 47). For example, using the standard currency formats, the statement:

PUT DEBTS DOLLAR7.2 ASSETS DOLLAR9.2;

produces:

$130.00 $245.45

PUT - mixed style

In the same way that styles can be mixed with INPUT you can mix styles in PUT statements. For example:

PUT COMPANY 5-30 +1 'HAS LOW ASSETS ' FUNDS DOLLAR8.2;

Controlling output format - the FORMAT statement

The FORMAT statement associates a format with a variable for printing. The association lasts until the session ends, not just for the DATA step. If you use your own formats (declared using PROC FORMAT, see the section on page 47) they must have been declared before they are used. Examples of FORMAT statements are:

FORMAT HEIGHT 4.2;

FORMAT WEEK_PAY DOLLAR6.2;

FORMAT FILMYEAR ROMAN12.;

The last example will print FILMYEAR in Roman numerals allowing 12 spaces for the value.

Executing instructions - the RUN statement

The RUN statement is used to end both DATA steps and PROC steps, and shows that the statements in the step are complete and should be executed. The format of the statement is simply:

RUN;

It is not essential to end a DATA step with RUN because the step is executed when SAS meets a DATA or PROC statement, but it is certainly tidier to use RUN, especially when typing commands at the terminal.

Recoding data

Sometimes you may wish to group data or do some relabelling of values. This can be done by a series of IF statements, but can also be done quite conveniently with a format and the PUT function. See page 48 for details.

Example DATA steps

Using data within SAS statements and list input

DATA ONE;
INPUT REF_NUM SEX $ AGE HEIGHT WEIGHT;
LABEL HEIGHT='HEIGHT IN METRES';
LABEL WEIGHT='WEIGHT IN KILOGRAMS';
CARDS;
101 M 31 1.88 82
102 F 26 1.6 60
103 M 24 1.9 75.5
150 M 38 1.87 76
RUN;

Using data within SAS statements and column input

DATA SURVEY;
INPUT CASENO 1-6 SEX $ 7 AGE 8-10 MARSTAT $ 11 INCOME 12-18 .2;
PAYPERYR = INCOME/AGE;
PUT CASENO PAYPERYR INCOME AGE;
LABEL MARSTAT='MARITAL STATUS';
CARDS;
000001M026S0741200
000002F056S2568000
100247M092M0403909
RUN;

For an example of reading data from a file, see Reference A or B.

The SAS PROC step

The PROC step starts with a PROC statement and ends with RUN (or by meeting a DATA or PROC statement). There are many varieties of PROC statement, each one providing a different SAS facility. The following sections describe specific PROC statements, but some of the statements which can occur in any PROC step are described in this section.

Analysing data in subgroups - the BY statement

A procedure can produce analyses for subgroups rather than for the whole data if a BY statement is included in the PROC step and the data is sorted on the variable or list of variables specified (for details of how to sort data see the description of SORT on page 46). For example, to produce separate mean values of income for men and women use the procedure MEANS, and include the statement:

BY SEX;

within the PROC step. To produce tables for men and women in different age groups, use:

BY SEX AGE_GRP;

Controlling output format - the FORMAT statement

The FORMAT statement gives a format for printing to variables used in the PROC step. It has the same layout as the FORMAT statement used in a DATA step, but while the DATA step associates a format with a variable for the whole SAS session, its use in a PROC step associates the format with the variable only for the duration of that step. An example is:

FORMAT INCOME DOLLAR7.2;

Printing titles on output from procedures - the TITLE statement

The TITLE statement prints a title on the output from a procedure. It can appear anywhere but is most useful in a PROC step. The title can be several lines long. The first line can be numbered 1 or left unnumbered, as you wish, but any following lines must be numbered. For example, a one-line title could be either:

TITLE 'ANALYSIS OF ANTIGEN LEVELS';

or:

TITLE1 'ANALYSIS OF ANTIGEN LEVELS';

If there are several lines in the title, the second and subsequent lines must be numbered. For example:

TITLE 'ATTITUDES TO OUT-PATIENT CARE';
TITLE3 'DELAY IN RECEIVING APPOINTMENTS';

This would give a title of three lines (the second line, TITLE2, is assumed to be blank). You can redefine TITLE3 later without changing TITLE1; a new TITLE statement suppresses only that numbered line and any lines with higher numbers.

If you are using a graphics device for output then there are many extra options for this statement.
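The BY, FORMAT and TITLE statements described above can be combined in a single PROC step. The sketch below assumes an existing dataset called SURVEY containing the variables SEX, CASENO and INCOME; the dataset name and the title text are invented, and the SORT and PRINT procedures are described later in this guide.

```sas
/* The data must be sorted on the BY variable first */
PROC SORT DATA=SURVEY;
  BY SEX;
RUN;

PROC PRINT DATA=SURVEY;
  BY SEX;                      /* a separate listing for each sex */
  VAR CASENO INCOME;
  FORMAT INCOME DOLLAR10.2;    /* applies only within this PROC step */
  TITLE 'HOUSEHOLD SURVEY';
  TITLE2 'INCOME LISTED SEPARATELY FOR EACH SEX';
RUN;
```

PROC PRINT is used here because it prints the values themselves, so the effect of the FORMAT statement is easy to see in the output.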

Printing footnotes - the FOOTNOTE statement

The FOOTNOTE statement prints notes at the foot of the output page. It can appear anywhere but is most useful in a PROC step. Like a title, a footnote can be several lines long. The first line can be numbered 1 or left unnumbered, as you wish, but any subsequent lines must be numbered. For example, a one-line footnote could be either:

FOOTNOTE '1985 figures, Pounds Sterling';

or:

FOOTNOTE1 '1985 figures, Pounds Sterling';

If there are several lines in the footnote the second and subsequent lines must be numbered. For example:

FOOTNOTE 'Data obtained from official sources';
FOOTNOTE3 'Estimated 12% under-reporting';

This would give a footnote of three lines (the second line, FOOTNOTE2, is assumed to be blank). You can redefine FOOTNOTE3 later without changing FOOTNOTE1; a new FOOTNOTE statement suppresses only that numbered line and any lines with higher numbers.

Descriptive statistics

This section describes some of the procedures in SAS for descriptive and other statistics. These are FREQ, MEANS, UNIVARIATE, CORR and TABULATE.

Frequency tables and cross-tabulations - the FREQ procedure

The FREQ procedure produces tables; one-way, two-way, three-way, etc. One-way tables are normally called frequency tables, while two-way or more are often called cross-tabulations. The PROC step for FREQ starts with the statement:

PROC FREQ options;

The options may be omitted entirely:

PROC FREQ;

The option DATA= specifies which dataset is to be used (if omitted the most recently created dataset is used). For example:

PROC FREQ DATA=LIB83.GNPFIGS;

The TABLES statement

The TABLES statement specifies which variables are to be analysed and the sort of tables to be produced. It has the form:

TABLES tablerequests / options ;

If no options are specified the / is not required. For one-way tables, give the name of the variable or variables required. For multi-way tables list the variables required separated by asterisks, for example:

AGE_GRP*SEX

or

INC_GRP*MARSTAT*CITY

The TABLES statement has shorthand forms for specifying tables. For example:

HEIGHT -- EXAMS

specifies all the variables from HEIGHT to EXAMS (inclusive) in the dataset.

QN32*(QN01 QN02 QN03)

specifies the three tables QN32*QN01, QN32*QN02 and QN32*QN03.

QUALS*(LABVOTE -- SDPVOTE)

combines the two shorthand methods.

If no options are given, the content of a table is the frequency, the percentage of the total number in the table, the percentage of the number in the row, and the percentage of the number in the column (the last two are only printed for cross-tabulations). The content can be changed by specifying options in the TABLES statement. Some useful options are NOPERCENT, which suppresses printing of overall percentages; NOROW, which suppresses row percentages, and NOCOL, which suppresses column percentages.

Examples

¤ PROC FREQ; TABLES MARSTAT NUM_KIDS; TABLES HOUSING*REGION / NOROW NOCOL; TABLES INJURIES*(SHIFT MONTH); RUN;

¤ PROC FREQ; TABLES AREA HOUSING VAR01-VAR10; RUN;

¤ PROC FREQ DATA=RENTED; TABLES AREA*HOUSING / NOPERCENT; TABLES REPAIRS*TENURE; RUN;

¤ To obtain frequencies for the number of children (NOOFCH) and a cross-tabulation of SEX and marital status (MARSTAT), type:

PROC FREQ; TABLES NOOFCH SEX*MARSTAT; RUN;

This could produce the following output:

                              Cumulative  Cumulative
NOOFCH   Frequency   Percent   Frequency     Percent
----------------------------------------------------
     .           1         .           .           .
     0           9      37.5           9        37.5
     1           2       8.3          11        45.8
     2           6      25.0          17        70.8
     3           5      20.8          22        91.7
     4           1       4.2          23        95.8
     7           1       4.2          24       100.0

TABLE OF SEX BY MARSTAT

SEX       MARSTAT

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|       4|  Total
---------+--------+--------+--------+--------+
        1|      3 |      7 |      0 |      2 |     12
         |  12.50 |  29.17 |   0.00 |   8.33 |  50.00
         |  25.00 |  58.33 |   0.00 |  16.67 |
         |  42.86 |  63.64 |   0.00 |  50.00 |
---------+--------+--------+--------+--------+
        2|      4 |      4 |      2 |      2 |     12
         |  16.67 |  16.67 |   8.33 |   8.33 |  50.00
         |  33.33 |  33.33 |  16.67 |  16.67 |
         |  57.14 |  36.36 | 100.00 |  50.00 |
---------+--------+--------+--------+--------+
Total           7       11        2        4       24
            29.17    45.83     8.33    16.67   100.00

Frequency Missing = 1

Descriptive statistics - the MEANS procedure

The procedure MEANS prints the mean, standard deviation, and maximum and minimum values of a variable. The format of the MEANS statement is:

PROC MEANS options;

The option which is most likely to be required is DATA= to specify the dataset to be used, for example:

PROC MEANS DATA=OLD_DATA;

As is usual, the most recently created dataset is used if no DATA= option is specified.

The option MAXDEC= may also be useful as it specifies how many decimal places (0 to 8) are to be printed in the results. For example:

PROC MEANS MAXDEC=4;

PROC MEANS DATA=INT_DATA MAXDEC=0;

You can also specify the statistics to be produced by MEANS.
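For instance, statistics can be requested by listing their keywords on the PROC MEANS statement. The sketch below uses the standard keywords N, MEAN, STD, MIN and MAX; the dataset name OLD_DATA is taken from the example above, and the variables AGE and INCOME are assumed to exist in it.

```sas
PROC MEANS DATA=OLD_DATA N MEAN STD MIN MAX MAXDEC=2;
  VAR AGE INCOME;   /* assumed to be numeric variables in OLD_DATA */
RUN;
```

Only the statistics named are printed, in the order in which the keywords are given. The full manuals list the other keywords that are available.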

The VAR statement

The VAR statement specifies the variables which are to be analysed, for example:

VAR AGE INCOME HEIGHT WEIGHT;

If no VAR statement is used all the numeric variables are processed.

Examples

¤ PROC MEANS; VAR HEIGHT WEIGHT; RUN;

¤ PROC MEANS DATA=BPAIN MAXDEC=4; VAR C7HGHT ILIAC CHEST; RUN;

¤ To obtain the mean values for AGE and INCOME, type:

PROC MEANS; VAR AGE INCOME; RUN;

This could produce the following output:

N Obs  Variable    N    Minimum    Maximum       Mean    Std Dev
----------------------------------------------------------------
   20  AGE        17      18.00      39.00      29.00       6.67
       INCOME     19    3900.00    9800.00    6820.50    1773.43
----------------------------------------------------------------

There are three missing values for AGE (the complete dataset has 20 cases) and one for INCOME.

Statistics about a single variable - the UNIVARIATE procedure

The UNIVARIATE procedure can provide very detailed statistics on a variable as well as plots to illustrate the distribution of values. The statistics produced include the mean, sum, standard deviation, variance, maximum, minimum, median, mode, quartiles, percentiles, and the five highest and lowest values.

The PROC UNIVARIATE statement has the form:

PROC UNIVARIATE options;

Useful options are:

FREQ

produces a frequency table giving the frequency, percentage and cumulative percentage for each value.

NORMAL

tells SAS to test if the distribution of the variable is close to a Normal (Gaussian) distribution. It is sometimes important to know whether a distribution is very different from Normal, as several statistical techniques give misleading results on such variables.

PLOT

gives information on whether the variable is normally distributed, by drawing a Normal probability plot and a bar chart.

DATA=

specifies which dataset is to be analysed if you do not wish to use the most recently created one.

Example statements are:

PROC UNIVARIATE;

PROC UNIVARIATE DATA=ORIGDATA PLOT FREQ;

PROC UNIVARIATE FREQ NORMAL;

The VAR statement

The VAR statement specifies which variables are to be analysed. For example:

VAR INCOME;

VAR MALE_POP FEML_POP OAP_POP RATEABLE AREA;

The variables must be numeric. If the VAR statement is omitted all numeric variables in the dataset are analysed.

Examples

¤ PROC UNIVARIATE; VAR INCOME; BY SEX; RUN;

¤ PROC UNIVARIATE DATA=UK_82 PLOT; TITLE ANALYSIS OF MONTHLY FIGURES; VAR IMPORTS EXPORTS EMIGRATE; RUN;

¤ You can use UNIVARIATE to test whether the distribution of AGE is normal.

PROC UNIVARIATE NORMAL; VAR AGE; RUN;

produces the output:

UNIVARIATE

Variable=AGE

                 Moments

N              20      Sum Wgts        20
Mean           29      Sum            580
Std Dev  6.672804      Variance  44.52632
Skewness -.029524      Kurtosis  -1.07339
USS         17666      CSS            846
CV       23.00967      Std Mean  1.492084
T:Mean=0  19.4359      Prob>|T|    0.0001
Sgn Rank      105      Prob>|S|    0.0001
Num ^= 0       20
W:Normal .9518337      Prob<W       0.407

UNIVARIATE

Variable=AGE

            Quantiles(Def=5)

100% Max     39        99%         39
 75% Q3      35        95%         39
 50% Med     29        90%       38.5
 25% Q1      24        10%       19.5
  0% Min     18         5%       18.5
                        1%         18
Range        21
Q3-Q1        11
Mode         25

                Extremes

Lowest    Obs      Highest    Obs
    18  (   8)          36  (   1)
    19  (   9)          36  (   3)
    20  (  10)          38  (  14)
    22  (  11)          39  (  19)
    23  (  12)          39  (  20)

The distribution is not significantly different from Normal. The mean is significantly different to zero.
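The same test can be run separately for each subgroup by combining the NORMAL option with a BY statement. The following is only a sketch: the dataset HEIGHTS and its values are invented, and the data are sorted with PROC SORT first because BY processing expects the cases to be in order of the BY variable.

```sas
/* Hypothetical data: heights in cm for two groups coded 1 and 2 */
DATA HEIGHTS;
  INPUT SEX HEIGHT;
  CARDS;
1 178
2 165
1 183
2 159
1 171
2 168
1 176
2 162
1 180
2 170
1 174
2 164
;
RUN;

/* BY processing needs the cases in order of the BY variable */
PROC SORT DATA=HEIGHTS;
  BY SEX;
RUN;

/* Test each group for Normality and draw the plots */
PROC UNIVARIATE DATA=HEIGHTS NORMAL PLOT;
  VAR HEIGHT;
  BY SEX;
RUN;
```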

Correlations - the CORR procedure

The CORR procedure calculates the correlation between variables. It uses the product-moment (Pearson) definition of correlation, which is not appropriate for some types of variable, or Spearman’s and Kendall’s definitions which are more suitable for rankings and positions. Basic statistics like the mean are also printed for the variables used. The PROC CORR statement has the form:

PROC CORR options;

If no options are specified the most recently created dataset is used and Pearson correlations are calculated. To change the dataset used specify the DATA= option. To request a different correlation coefficient use the SPEARMAN or KENDALL option. These can be used in combination with each other and with PEARSON. Examples of PROC CORR statements are:

PROC CORR;

PROC CORR DATA=FRENCH KENDALL SPEARMAN;

PROC CORR PEARSON SPEARMAN KENDALL;

The VAR and WITH statements

There are different ways of specifying the correlations you wish to calculate. If you use a VAR statement and not a WITH statement (see below) coefficients are printed for all possible pairs of variables in the list. For example:

VAR A B C;

gives the correlations between A and B, B and C and A and C.

Omitting the VAR statement is equivalent to including one with all the numeric variables in the set specified.

If the WITH statement is used it modifies the way in which the VAR statement is obeyed. The variables in the VAR statement are treated as one list and those in the WITH statement as another, and coefficients are calculated for all pairs, taking one from each list. For example:

VAR AGE HEIGHT WEIGHT; WITH LIFT1 LIFT2 LIFT3;

produces the correlation of AGE with LIFT1, LIFT2 and LIFT3; the correlation of HEIGHT with LIFT1, LIFT2 and LIFT3; and the correlation of WEIGHT with LIFT1, LIFT2 and LIFT3.
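A complete step using both statements might look like the sketch below. The dataset LIFTS and its values are invented for illustration; the variable names follow the example above.

```sas
/* Hypothetical data for six people */
DATA LIFTS;
  INPUT AGE HEIGHT WEIGHT LIFT1 LIFT2 LIFT3;
  CARDS;
21 175 70 80 85 82
34 180 82 95 97 99
45 168 77 70 72 69
29 172 68 78 80 81
52 177 85 65 66 68
38 183 90 98 101 97
;
RUN;

/* Each VAR variable is paired with each WITH variable: nine pairs in all */
PROC CORR DATA=LIFTS PEARSON SPEARMAN;
  VAR AGE HEIGHT WEIGHT;
  WITH LIFT1 LIFT2 LIFT3;
RUN;
```

Because both PEARSON and SPEARMAN are requested, a table of coefficients is printed for each definition.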

Examples

¤ PROC CORR; VAR ENGLISH MATHS PHYSICS; RUN;

¤ PROC CORR DATA=OPINION KENDALL SPEARMAN; VAR SOCGROUP; WITH THEATRE CINEMA FOOTBALL CONCERTS; RUN;

¤ A correlation of AGE and INCOME can be obtained by typing:

PROC CORR; VAR AGE INCOME; RUN;

which would produce the following result:

VARIABLE     N        MEAN     STD DEV         SUM     MINIMUM     MAXIMUM

AGE         22     41.0909     18.0262      904.00     19.0000     80.0000
INCOME      24   5799.9583   2547.5247   139199.00   1750.0000   9754.0000

PEARSON CORRELATION COEFFICIENTS / PROB > |R| UNDER H0:RHO=0 / NUMBER OF OBSERVATIONS

                 AGE      INCOME

AGE          1.00000    -0.33609
              0.0000      0.1363
                  22          21

INCOME      -0.33609     1.00000
              0.1363      0.0000
                  21          24

Complex tables - the TABULATE procedure

The TABULATE procedure produces tables and gives far more control over their layout than the FREQ procedure (see page 12). The entries in the tables can be means, standard deviations etc, rather than just counts. The options to the PROC TABULATE statement include the usual DATA=. Another important option is FORMAT, which defines how values are to be printed in the tables. For example, the statement:

PROC TABULATE FORMAT=6.3;

allows two spaces before the decimal point, one for the decimal point and three after it (making six in all). If no format is specified, it is assumed to be 12.2, that is twelve spaces for the values with nine places before the decimal point and two after it.

The CLASS statement

The CLASS statement specifies the variables which will be used to define the rows and columns of tables. For example:

CLASS SEX AGEGRP REGION;

The VAR statement

The VAR statement specifies the variables which will be used to form the entries in the cells of the tables. For example:

VAR AGE INCOME;

The TABLE statement

The TABLE statement can be extremely complex, and only some of the possible specifications are described here. Any variable appearing in a TABLE statement must have appeared in a preceding CLASS or VAR statement.

The simplest sort of table is like those produced by PROC FREQ. For example:

TABLE SEX, REGION;

produces a two-way table showing the frequency of each combination of SEX and REGION. Note the use of a comma rather than an asterisk.

TABLE SEX RACE, REGION;

produces a frequency table of SEX by REGION with a table of RACE by REGION joined to the bottom.

TABLE REGION, SEX RACE;

produces tables of SEX and RACE side by side. By using the FORMAT= option to reduce the width of the columns you can put several cross-tabulations side by side.

To produce marginal totals use the keyword ALL. For example:

TABLE(REGION ALL), SEX RACE;

gives totals by adding each region together.

TABLE(REGION ALL), (SEX ALL RACE ALL);

gives totals for everything.
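A complete step producing marginal totals could be sketched as follows. The dataset PEOPLE and the codes used for REGION, SEX and RACE are invented for illustration.

```sas
/* Hypothetical data: one line per person */
DATA PEOPLE;
  INPUT REGION SEX RACE;
  CARDS;
1 1 1
1 2 2
1 2 1
2 1 3
2 1 1
2 2 2
2 2 3
1 1 2
;
RUN;

/* Counts of SEX and RACE side by side for each region, with totals */
PROC TABULATE DATA=PEOPLE FORMAT=6.0;
  CLASS REGION SEX RACE;
  TABLE (REGION ALL), (SEX ALL RACE ALL);
RUN;
```

The ALL in the first dimension adds a totals row below the regions, and each ALL in the second dimension adds a totals column after SEX and after RACE.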

To produce percentages rather than the original counts use:

TABLE REGION, (SEX*PCTN RACE*PCTN);

A comma starts a new level of the table; an asterisk starts a nesting. The statement:

TABLE REGION, SEX*RACE;

gives a table with each row representing a region. Each row contains a count of the people of each sex, split into racial groups:

 

              S1                S2
RG1      R1   R2   R3      R1   R2   R3
RG2      R1   R2   R3      R1   R2   R3

As well as arranging tables into a concise form, TABULATE can display statistics. For example:

TABLE (AGE*MEAN INCOME*MEAN), REGION;

shows the means for AGE and INCOME for each region. Other statistics which may be requested include:

STD

for standard deviation

MIN

for minimum

MAX

for maximum

SUM

for total

PCTSUM

for the percentage of the sum of values

PCTN

for percentages, as shown above
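Several of these statistics can be requested in the same table. The sketch below assumes an invented dataset, STAFF, and crosses each analysis variable with the statistic wanted.

```sas
/* Hypothetical data: region code, age and income */
DATA STAFF;
  INPUT REGION AGE INCOME;
  CARDS;
1 23 5200
1 45 7800
1 31 6100
2 38 6900
2 27 4800
2 52 9100
;
RUN;

/* One row per statistic and one column per region */
PROC TABULATE DATA=STAFF FORMAT=10.2;
  CLASS REGION;
  VAR AGE INCOME;
  TABLE (AGE*MEAN AGE*STD INCOME*MIN INCOME*MAX), REGION;
RUN;
```

Note that AGE and INCOME appear in the VAR statement because they supply the cell values, while REGION appears in the CLASS statement because it defines the columns.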

A table request like:

TABLE (INCOME*MAX), AGEGRP, SEX;

includes the highest income for all combinations of AGEGRP and SEX in the output.

Examples

¤ PROC TABULATE; CLASS REGION SEX MARSTAT; TABLE (SEX ALL MARSTAT ALL),REGION; RUN;

¤ PROC TABULATE FORMAT=6.2; CLASS REGION SEX MARSTAT; TABLE (SEX*PCTN MARSTAT*PCTN),REGION; RUN;

¤ PROC TABULATE; CLASS REGION; VAR AGE INCOME; TABLE (AGE*MEAN INCOME*MEAN), REGION; RUN;

¤ PROC TABULATE; CLASS SEX MARSTAT; VAR AGE; TABLE (AGE*MEAN), MARSTAT, SEX; RUN;

This last example produces the following output:

MEAN OF AGE
+-----------------+-------------------------+
|                 |           SEX           |
|                 |-------------------------|
|                 |     1      |     2      |
|-----------------+------------+------------|
|MARSTAT          |            |            |
|-----------------|            |            |
|1                |       24.33|       26.33|
|-----------------+------------+------------|
|2                |       37.17|       41.00|
|-----------------+------------+------------|
|3                |           .|       42.00|
|-----------------+------------+------------|
|4                |       59.50|       75.00|
+-------------------------------------------+

Student’s T-Test

The TTEST procedure tests whether two groups have the same mean value for a particular variable. The t-test was devised by an author who wrote under the pseudonym Student. Note that the other use for Student’s t-test, the comparison of the means of two variables (known as a paired t-test), must be done in a different way (see page 23 for details). The PROC TTEST statement has the form:

PROC TTEST options;

As usual the option DATA= specifies a dataset other than the one most recently created.

The CLASS statement

The CLASS statement specifies the variable identifying the groups to be compared. Since the procedure can only deal with two groups, the variable must have only two values. You must specify a CLASS statement.

The VAR statement

The VAR statement specifies the variables on which the test is to be carried out. If you specify more than one variable a t-test is performed on each. If this statement is omitted a t-test is performed on all the numeric variables in the dataset except the one specified in the CLASS statement.

Example

Suppose you are comparing the crop obtained from tomato plants, some of which have been treated with a fertiliser. The yield of tomatoes is in a variable CROP; the variable FERTIL contains 1 if no fertiliser was used and 2 if it was. To perform a t-test on the two groups:

DATA TOMS; INPUT CROP FERTIL; CARDS;
12.3 1
11.6 1
14.5 2
RUN; PROC TTEST; CLASS FERTIL; VAR CROP; RUN;

SAS gives the mean and other information for each group as well as the t value, the degrees of freedom, the significance assuming unequal variances, and the significance assuming equal variances. In each case the test is a two-sided one. Following the table in which these values appear is the result of an F test on the equality of the variances. The output from the above statements is:

TTEST PROCEDURE

VARIABLE: CROP

FERTIL    N          MEAN       STD DEV     STD ERROR    MINIMUM    MAXIMUM

1         6   12.35000000    0.63482281    0.25916533      10.11      11.22
2         5   14.36000000    0.53665631    0.24000000      14.20      15.31

VARIANCES         T       DF    PROB>|T|

UNEQUAL     -5.6905      9.0      0.0003
EQUAL       -5.5957      9.0      0.0003

FOR H0: VARIANCES ARE EQUAL, F’= 1.40 DF=(5,4) PROB > F’= 0.7669

Five plants were treated with fertiliser out of the eleven used. The means are significantly different whether equal or unequal variances are assumed. The F test shows that the variances are not significantly different.
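If the grouping variable has more than two values, one approach is to build a dataset holding only the two groups of interest before calling TTEST. The sketch below assumes invented data with three treatments; the dataset names TOMS3 and TWOGRPS are hypothetical.

```sas
/* Hypothetical yields for three treatments (FERTIL = 1, 2 or 3) */
DATA TOMS3;
  INPUT CROP FERTIL;
  CARDS;
12.3 1
11.6 1
12.9 1
14.5 2
14.1 2
15.0 2
13.2 3
13.8 3
12.7 3
;
RUN;

/* Keep only the two groups to be compared */
DATA TWOGRPS;
  SET TOMS3;
  IF FERTIL = 1 OR FERTIL = 2;
RUN;

/* FERTIL now has only two values, as TTEST requires */
PROC TTEST DATA=TWOGRPS;
  CLASS FERTIL;
  VAR CROP;
RUN;
```

The IF statement without THEN keeps only the cases for which the condition is true.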

Paired T-Test

The TTEST procedure cannot test for two variables having the same mean (a paired t-test). However, this test can be done using the MEANS procedure, which can test if the mean of a variable is zero. One variable is subtracted from the other and the mean of the difference is tested to see if it is zero. If it is not significantly different from zero, the two variables do not have significantly different means. The following statements illustrate the procedure:

DATA TS; INPUT TEST1 TEST2;
DIFF=TEST2 - TEST1; CARDS;
34 45
36 44
57 62
RUN; PROC MEANS MEAN T PRT; VAR DIFF; RUN;

The options MEAN, T and PRT print the mean, the t-test value for the test of the mean being zero, and the corresponding probability. The output is:

Analysis Variable : DIFF

N Obs           MEAN          T    PROB>|T|
-----------------------------------------------------
   20     6.85714286       8.45      0.0001
-----------------------------------------------------

This shows that the variables have means which are significantly different.
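To see where the t value comes from, the standard error of the mean can be requested as well, since T is the mean divided by its standard error. This is a sketch with invented scores for five cases; STDERR and N are further statistic keywords of PROC MEANS.

```sas
/* Hypothetical paired scores for five cases */
DATA TS;
  INPUT TEST1 TEST2;
  DIFF = TEST2 - TEST1;
  CARDS;
34 45
36 44
57 62
41 47
50 58
;
RUN;

/* T is MEAN divided by STDERR; PRT is the two-sided probability */
PROC MEANS DATA=TS N MEAN STDERR T PRT;
  VAR DIFF;
RUN;
```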

Drawing simple diagrams with SAS

Using the CHART and PLOT procedures, SAS can draw pictures on the screen which can then be printed on a printer. The procedures GCHART, GPLOT, GMAP, etc produce higher-quality pictures but require special facilities in order to produce a copy on paper. They are described on page 28.

Bar charts - the CHART procedure

The CHART procedure draws vertical or horizontal bar charts (histograms), pie diagrams, block charts and star charts. They all give a visual appreciation of your data which may help you understand it better. Only the method of producing histograms and pie charts is described here. The PROC CHART statement has the form:

PROC CHART options;

The options include DATA= to specify the dataset to be used if you do not wish to use the most recently created dataset. Example statements are:

PROC CHART;

PROC CHART DATA=OLDSTATS;

Any number of charts may be requested within the same CHART procedure (see the examples on page 25).

The VBAR statement

To produce a vertical bar chart (a histogram with the bars drawn vertically) specify the variables to be used with a VBAR statement, which has the form:

VBAR variablelist / options;

If no options are specified the / is omitted. The options available include DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING and NOZEROS.

DISCRETE

draws a bar for each value of the variable. If you do not use this option the range of values is divided into groups by automatic choice of midpoints or by your own choice (see MIDPOINTS below) and a bar is drawn for each sub-range.

MIDPOINTS=values

specifies the midpoints of the ranges into which the distribution is to be split. For example MIDPOINTS=10 50 100 produces three bars: one for values below 30, one for values from 30 up to 75, and one for values over 75 (the boundaries lie halfway between adjacent midpoints). If the MIDPOINTS option is not specified, SAS splits the range of values into a number of intervals.

SUMVAR=variable

means that the bars represent the sum of the variable ‘variable’ for cases with that value of the VBAR variable. For example, VBAR CITY/ SUMVAR=INCOME; gives a chart with bars representing the total income for each city in the data. If TYPE=MEAN is also specified the mean value of ‘variable’ is used instead of the sum.

TYPE=type

specifies what the bars are to represent and has several different choices. The one of most interest is MEAN when it is used with SUMVAR.

NOZEROS

omits entries for empty categories, avoiding gaps in the chart.

MISSING

treats missing values as a valid category and draws a bar for them.

Examples of VBAR statements are:

VBAR MARSTAT/MISSING NOZEROS;

VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000 20000;

VBAR GNP EXPORTS IMPORTS;
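The following sketch shows MIDPOINTS and SUMVAR with TYPE=MEAN in a complete step. The dataset WAGES, the city names and the values are all invented for illustration.

```sas
/* Hypothetical data: city, age and income */
DATA WAGES;
  INPUT CITY $ AGE INCOME;
  CARDS;
LEEDS 23 5200
LEEDS 47 7800
YORK 35 6100
YORK 58 8900
HULL 29 4800
HULL 41 7200
;
RUN;

PROC CHART DATA=WAGES;
  /* Midpoints 20, 40 and 60 put the boundaries between bars at 30 and 50 */
  VBAR AGE / MIDPOINTS=20 40 60;
  /* One bar per city, showing the mean income rather than the total */
  VBAR CITY / SUMVAR=INCOME TYPE=MEAN;
RUN;
```

Each boundary is halfway between two adjacent midpoints, so an age of 29 falls in the first bar and an age of 35 in the second.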

The HBAR statement

The HBAR statement is just like the VBAR statement except that it produces histograms with the bars horizontal rather than vertical. The options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE and NOZEROS all apply as described above. Examples are:

HBAR HEIGHT WEIGHT;

HBAR SEX/SUMVAR=ACCIDENT;

The PIE statement

The PIE statement draws a pie chart which illustrates the relative frequency of values by presenting them as slices of a cake or pie. The statement format is:

PIE variablelist / options;

If no options are requested the / is omitted. The options DISCRETE, MIDPOINTS, SUMVAR, TYPE and MISSING all apply (as described on page 24) but NOZEROS does not. An example is:

PIE DEPT AREA REGION;

If neither DISCRETE nor MIDPOINTS is specified the pie has three slices.

Examples

¤ PROC CHART; VBAR NOOFSONS / MIDPOINTS= 1 2 3 4; PIE MARSTAT / MISSING DISCRETE; RUN;

¤ PROC CHART DATA=SALES; HBAR MONTH/SUMVAR=VALUE DISCRETE; HBAR AGENCY REGION; RUN;

¤ The following statements produce a horizontal bar chart for SEX and a vertical bar chart for MARSTAT:

PROC CHART; HBAR SEX/DISCRETE; VBAR MARSTAT/DISCRETE; RUN;

The output is:

FREQUENCY BAR CHART

SEX                                               FREQ   CUM.   PERCENT      CUM.
                                                         FREQ             PERCENT

1     |****************************************     10     10     50.00     50.00
2     |****************************************     10     20     50.00    100.00
      ----------+---------+---------+---------+
               2.5        5        7.5       10
                       FREQUENCY

FREQUENCY BAR CHART

FREQUENCY
10 +    *****
   |    *****
 8 +    *****
   |    *****
 6 +    *****
   |    *****     *****
 4 +    *****     *****
   |    *****     *****     *****
 2 +    *****     *****     *****
   |    *****     *****     *****
   +------------------------------------
          1         2         3
                 MARSTAT

Scattergrams - the PLOT procedure

The PLOT procedure produces a plot of one variable against another. Such diagrams are known as scatterplots, scattergrams or scatter diagrams, as they show the scatter of the cases in the sample. The PROC PLOT statement has the form:

PROC PLOT options;

The most important option is DATA= which specifies a dataset other than the one most recently created.

The PLOT statement

The PLOT statement specifies which variables are to be plotted against each other. Its format is:

PLOT plotrequests / options;

If no options are specified the / is omitted.

A plot request can have several parts. The simplest form is ‘var*var’, for example AGE*EXAM meaning a plot with AGE on the vertical axis and EXAM on the horizontal axis.

A point is marked by a letter, which shows how many cases lie on that point (to within the accuracy of the plot and given the size of the character indicating the point). A indicates one case, B means two cases, up to Z which indicates 26 or more cases at that point.

To specify a symbol to mark the points instead of letters, use var*var='symbol'. For example:

Y*X='.'

causes each point to be marked by a full stop regardless of how many cases are represented there.

You can produce a sort of three-dimensional plot by marking each point with the values of another variable, by specifying var*var=var. For example:

CONTENT*INCOME=MARSTAT

prints the value of MARSTAT (which may be numeric or character) at each point on the plot of CONTENT by INCOME. If more than one case is mapped to the same point the value of the first case is used. Note that only the first character is used from the value of the variable, and so if the values of MARSTAT were SINGLE, MARRIED and SEPARATED, you could not distinguish between SINGLE and SEPARATED.

Several plots can be specified in the same PLOT statement, for example:

PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;
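The different forms of plot request can be tried on a small dataset. In this sketch the dataset PUPILS and its values are invented; SEX is a character variable so that M or F is printed at each point.

```sas
/* Hypothetical data: sex, height in cm and weight in kg */
DATA PUPILS;
  INPUT SEX $ HEIGHT WEIGHT;
  CARDS;
M 172 68
F 160 55
M 181 77
F 165 59
M 169 64
F 158 52
;
RUN;

PROC PLOT DATA=PUPILS;
  /* Letters A, B, ... show how many cases lie on each point */
  PLOT HEIGHT*WEIGHT;
  /* The first character of SEX marks each point */
  PLOT HEIGHT*WEIGHT=SEX;
  /* Every point is marked with an asterisk */
  PLOT HEIGHT*WEIGHT='*';
RUN;
```

In each request HEIGHT is on the vertical axis and WEIGHT on the horizontal axis.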

A useful option is OVERLAY, which causes several plots to be produced on the same axes so that they can be compared, for example:

PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;

Examples

¤ PROC PLOT; PLOT MATHS*ENGLISH; RUN;

¤ PROC PLOT DATA=SUNSPOTS; PLOT NUM*MONTH='*'; PLOT NUM*MONTH='+' UHFPROBS*MONTH='-' /OVERLAY; RUN;

¤ PROC PLOT; PLOT INCOME*AGE=SEX; RUN;

gives the following plot of INCOME against AGE using the value of SEX as the symbol.

[Plot of INCOME*AGE. Symbol is value of SEX (1 or 2). NOTE: 1 obs hidden. Vertical axis: INCOME, 0 to 10000. Horizontal axis: AGE, 18 to 39.]

The message ‘1 obs hidden’ means that two points coincided (to within the accuracy of the graph). Normally this would be indicated by using a different symbol, but here the symbol is the value of SEX.

High-quality graphics

As well as the simple graphics produced by CHART and PLOT, SAS can draw high-quality graphics with smooth lines, colours and shading. These facilities are documented in the ‘SAS/GRAPH Guide’. See Reference A or B for producing graphics on output devices. This section describes some of the procedures.

Bar charts - the GCHART procedure

The GCHART procedure draws the same sort of pictures as CHART but on graphics devices. Many more options are available because of the extra facilities on a graphics device, and so there are more parts to the PROC step. The PROC GCHART statement has the form:

PROC GCHART DATA=dataset GOUT=dataset;

where the most recently created dataset will be used if DATA= is omitted. The GOUT= option is also optional, and is used to save the graphical output as a dataset which can be redrawn later by the GREPLAY procedure.

The VBAR statement

The VBAR statement specifies the variables for which a vertical bar chart is to be drawn. The general form is:

VBAR variablelist / options;

If no options are requested the / is omitted. The options include DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING, NOZEROS, CAXIS and CTEXT.

DISCRETE

draws a bar for each value of the variable. If you do not use this option the range of values is divided into groups by automatic choice of midpoints or by your own choice (see MIDPOINTS below) and a bar is drawn for each sub-range.

MIDPOINTS=values

specifies the midpoints of the ranges into which the distribution is to be split. For example MIDPOINTS=10 50 100 produces three bars: one for values below 30, one for values from 30 up to 75, and one for values over 75 (the boundaries lie halfway between adjacent midpoints). If the MIDPOINTS option is not specified, SAS splits the range of values into a number of intervals.

SUBGROUP=svar

divides each bar into sections to show the distribution of ‘svar’ within each category of the variable named on the VBAR statement.

GROUP=gvar

draws several bars for each value of the variable mentioned in the VBAR statement, one for each value of ‘gvar’.

SUMVAR=variable

means that the bars represent the sum of the variable ‘variable’ for cases with that value of the VBAR variable. For example, VBAR CITY/ SUMVAR=INCOME; gives a chart with bars representing the total income for each city in the data. If TYPE=MEAN is also specified the mean value of ‘variable’ is used instead of the sum.

TYPE=type

specifies what the bars are to represent and has several different choices. The one of most interest is MEAN when it is used with SUMVAR.

NOZEROS

omits entries for empty categories, avoiding gaps in the chart.

MISSING

treats missing values as a valid category and draws a bar for them.

CAXIS=colour

draws the axis in the specified colour.

CTEXT=colour

draws the text on the chart in the specified colour.

Examples of VBAR statements are:

VBAR MARSTAT/MISSING NOZEROS SUBGROUP=SEX;

VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000 GROUP=AGE;

The HBAR statement

The HBAR statement is just like the VBAR statement except that it produces histograms with the bars horizontal rather than vertical. The options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE, GROUP, SUBGROUP, NOZEROS, CTEXT and CAXIS all apply as described above. Examples are:

HBAR HEIGHT WEIGHT / GROUP=MARSTAT;

HBAR SEX/SUMVAR=ACCIDENT SUBGROUP=AGE;

The PATTERN statement

With HBAR and VBAR you can specify how the bars are to be coloured or shaded using PATTERN statements. If SUBGROUP= is specified in a HBAR or VBAR statement you may wish to specify the patterns of shading to be used to distinguish the subgroups. Normally each bar is shaded by cross-hatching, and if it is necessary to distinguish groups each is coloured using the colours in order.

Suppose you have two groups and want the first to be a solid blue bar and the second to be black (the background colour). Specify:

PATTERN1 V=S C=BLUE; PATTERN2 V=E;

where V=S means use solid colour, V=E means empty, and C= specifies the colour to be used. Other types of pattern which involve hatching of various kinds are described and illustrated in the appropriate SAS/GRAPH Guide.
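Put together with a SUBGROUP= option, the PATTERN statements might be used as in the sketch below. It assumes SAS/GRAPH and a graphics device are available; the dataset STAFF and its codes are invented for illustration.

```sas
/* Hypothetical data: marital status and sex codes */
DATA STAFF;
  INPUT MARSTAT SEX;
  CARDS;
1 1
1 2
2 1
2 2
2 2
3 1
;
RUN;

/* First subgroup solid blue, second subgroup left empty */
PATTERN1 V=S C=BLUE;
PATTERN2 V=E;

/* One bar per marital status, divided into sections by SEX */
PROC GCHART DATA=STAFF;
  VBAR MARSTAT / DISCRETE SUBGROUP=SEX;
RUN;
```

The PATTERN statements are applied to the subgroups in order, so PATTERN1 is used for the first value of SEX and PATTERN2 for the second.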

The PIE statement

The PIE statement draws a pie chart which illustrates the relative frequency of values by presenting them as slices of a cake or pie. The statement format is:

PIE variablelist / options;

If no options are requested the / is omitted. The options DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING and CTEXT all apply (as described on page 30) but SUBGROUP, GROUP, NOZEROS and CAXIS do not. If you do not specify DISCRETE or MIDPOINTS then SAS uses its own method to divide the data. An example is:

PIE DEPT AREA REGION;

The FILL option is used only with the PIE statement. It may be set to X or SOLID:

FILL=SOLID

means the sectors of the pie-chart are to be filled in with solid colour. SAS calculates how many colours are needed and takes them in order from the colours available on the graphics device. If only one colour is available a uniformly shaded disk is drawn.

FILL=X

means the sectors are to be filled in by cross-hatching. If colours are available the slices will be coloured.

Examples

¤ PROC GCHART; VBAR NOOFSONS / MIDPOINTS= 1 2 3 4; PIE MARSTAT / MISSING DISCRETE FILL=X; RUN;

¤ PROC GCHART DATA=SALES; HBAR MONTH/SUMVAR=VALUE DISCRETE; HBAR AGENCY REGION; RUN;

Scattergrams - the GPLOT procedure

The GPLOT procedure plots scatter diagrams with a choice of patterns, filling, and plot symbols, and the option of fitted regression lines of various types. Only some of the facilities are described here. The general form of the PROC GPLOT statement is:

PROC GPLOT DATA=dataset GOUT=dataset UNIFORM;

If DATA= is omitted the most recently created dataset is used. If GOUT= is specified then the picture is saved and may be redrawn by GREPLAY. The UNIFORM option may be useful if you are using the BY statement to plot pictures for several subgroups, because it forces the use of the same scale for all the plots, so that comparisons may be made.

The PLOT statement

The PLOT statement specifies which variables are to be plotted against each other. Its format is:

PLOT plotrequests / options;

If no options are specified the / is omitted.

A plot request can have several parts. The simplest form is just var*var, for example AGE*EXAM, meaning a plot with AGE on the vertical axis and EXAM on the horizontal axis.

A point is marked by a plus sign. If several cases are plotted at the same point (not necessarily identical but the same to within the accuracy of the plot) it appears that cases are missing because there are fewer plus signs on the plot than there are cases.

A SYMBOL statement (see below) can be used to specify another symbol to mark the points instead of a plus sign.

You can produce a sort of three-dimensional plot by marking each point with the values of another variable by specifying var*var=var. For example:

CONTENT*INCOME=MARSTAT

prints the value of MARSTAT (which may be numeric or character) at each point on the plot of CONTENT by INCOME. If more than one case is mapped to the same point the value of the first case is used. Note that only the first character is used from the value of the variable, and so if the values of MARSTAT were SINGLE, MARRIED and SEPARATED, you could not distinguish between SINGLE and SEPARATED.

Several plots can be specified in the same PLOT statement, for example:

PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;

A useful option is OVERLAY, which causes several plots to be produced on the same axes so that they can be compared, for example:

PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;

You may also specify CAXIS=colour to specify the colour of the axis, and CTEXT=colour to specify the colour of the text.

In order to overlay plots where the vertical scales are very different you can use PLOT and PLOT2. The horizontal scales must be the same but the right-hand vertical scale is for the variable specified for PLOT2. For example:

PLOT HEIGHT*AGE; PLOT2 WEIGHT*AGE;

The SYMBOL statement

The SYMBOL statement defines the symbols to be used in the plot and specifies whether any regression fitting is to be carried out. A different SYMBOL statement can be included for each plot in the GPLOT procedure. The SYMBOL statement has several optional parts, as described below. To specify the colour of the symbol use C=colour, for example C=BLUE.

To specify the symbol use V=symbol, for example V=1. The letters A to W and the digits 0 to 9 may be used as symbols. There are also special symbols which are represented by such characters as * or <. A table showing how they appear is given in the SAS/GRAPH Guide. If no symbols are to be drawn for the points use V=NONE.

To specify the sort of line to be used use the ‘L=number’ option. If L=1 a solid line is used. There are also 31 types of dotted or dashed lines, which are shown in the SAS/GRAPH Guide.

The interpolation facilities to draw lines connecting the points include the following:

I=JOIN

connects the points by straight lines.

I=SPLINE

uses a cubic spline method to fit a smooth line to the points.

I=SMxx

is used when the data is widely spread, so that a normal cubic spline would look very jagged. It fits a smooth curve to the points, but the points may not all lie on the curve (unlike a normal spline curve, which passes through every point). The value ‘xx’ determines how closely the curve is to be fitted to the points. A value of 1 makes it follow the points quite closely, while a value of 99 produces a smooth curve which may miss many of the points.

I=Rxxxxxxx

is used when a regression line is to be fitted to the data. The characters which follow the R are:

L

linear regression is to be used

Q

quadratic regression is to be used

C

cubic regression is to be used

0

the regression line is forced through the origin. If you want a constant term in the regression omit this character. If it appears it is the second character after the R.

CLMnn

draws lines representing the confidence limits on the regression for the mean predicted values where the confidence limits (nn) may be at 90%, 95% or 99%, for example CLM95. These characters follow the type of regression and the optional constant term. The style of line used for the confidence lines is determined by adding 1 to the line style used for the main plot. For example if the line is drawn with line style 2 (small dashes) the confidence lines are drawn with style 3 (medium dashes).

CLInn

draws lines representing the confidence limits on the regression for the individual values, where the confidence limits (nn) may be at 90%, 95% or 99%, for example CLI90. Other details are as for CLMnn above.

Examples of complete specifications for interpolation are I=RQ, I=RL0, I=RLCLI90 and I=RC0CLM95.
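As an illustration of how such a specification is built up, the sketch below fits a quadratic regression with 90% confidence limits for individual values. It assumes SAS/GRAPH and a graphics device are available; the dataset TRIAL and its values are invented.

```sas
/* Hypothetical data: response measured on successive days */
DATA TRIAL;
  INPUT DAYS DOSE;
  CARDS;
1 2.1
2 2.9
3 4.2
4 4.8
5 6.3
6 6.9
7 8.4
8 8.8
;
RUN;

/* I=RQCLI90 reads as: R regression, Q quadratic, CLI90 90% limits for individual values */
SYMBOL1 V=STAR C=BLUE L=1 I=RQCLI90;

/* DOSE on the vertical axis, DAYS on the horizontal axis */
PROC GPLOT DATA=TRIAL;
  PLOT DOSE*DAYS;
RUN;
```

Because L=1 is used for the fitted curve, the confidence lines are drawn with line style 2.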

Examples

¤ PROC GPLOT; PLOT TESTA*TESTB; RUN;

¤ PROC GPLOT DATA=NWEST; PLOT POLLUT1*TOWN INCID*TOWN/OVERLAY; SYMBOL1 V=NONE L=2 C=RED I=SPLINE; SYMBOL2 V=NONE L=2 C=BLUE I=SPLINE; RUN;

¤ PROC GPLOT; PLOT DOSE*DAYS; SYMBOL V=* L=1 I=RLCLM95; RUN;

Maps - the GMAP procedure

The GMAP procedure produces maps illustrating the values of variables for the areas on the map. If the map dataset already exists you need to know how the areas are identified. SAS provides maps of the United States and Canada, which are described in the SAS/GRAPH Guide. However, these are not installed automatically so you must check to see if they are installed on the machine you are using. Maps of the counties of the United Kingdom and Ireland are also available. If you do not already have a map in a suitable form then you may need help in converting your map into a dataset. Contact the Computing Service Advisory Service for assistance.

The PROC GMAP statement specifies the map dataset as well as the response dataset containing the data to be shown on the map, for example:

PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN;

If no DATA= option appears the most recently created dataset is used.

The ALL option specifies that all areas in the map are to be drawn even if there is no value for that area. Normally only areas for which data exists are drawn, and the map is scaled to fill the space available.

Four types of map can be drawn; choropleth, surface, block and prism. Examples of each are given in the SAS/GRAPH Guide. Only choropleth maps are described below.

The CHORO statement

The CHORO statement specifies that a choropleth map is to be drawn, and gives the name of the variable to be used. A choropleth map is one where the areas of the map are shaded or coloured to indicate the value of the variable for each area. The form of the CHORO statement is:

CHORO variablelist;

A choropleth map is drawn for each variable specified. Options available with the statement include DISCRETE, LEVELS and MIDPOINTS.

DISCRETE

means the data is a set of discrete values rather than a continuous variable. Each value is represented separately, unless you have a very large number of values (or have also specified LEVELS or MIDPOINTS).

LEVELS=n

means SAS is to divide the data into n groups of the same size, and shade the map accordingly.

MIDPOINTS=list

means the data is to be divided at the values specified. You do not have to list every value, for example:

MIDPOINTS = 10 TO 100 BY 10

The ID statement

The ID statement specifies the variable which ties together the areas and the values of the response variable. The variable must have the same name in the response dataset as in the map dataset. The form is:

ID variable;

For example:

ID CNUM;
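The statements described above might be combined as in the following sketch. It is an illustration only: SASUSER.MERSEY, POPLAN and CNUM are the names used in the text, while POPDENS is a hypothetical response variable assumed to exist in POPLAN, and the value given to LEVELS is arbitrary.

```sas
/* Sketch only: POPDENS is a hypothetical variable in POPLAN. */
PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN ALL;
   ID CNUM;                    /* ties map areas to response values   */
   CHORO POPDENS / LEVELS=5;   /* LEVELS controls the shading groups  */
RUN;
QUIT;
```

The ALL option is included so that areas with no value in POPLAN are still drawn. Replace LEVELS=5 with DISCRETE if POPDENS holds a small set of distinct codes.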

Regression - the REG procedure

Given a situation in which one or more variables seem to control the behaviour of another (for example, blood pressure given weight, age and activity), it is possible to build an equation which expresses the relation numerically. This relation is only approximate in any real situation, but you can measure how closely it fits the data and decide whether or not it is useful for prediction.

The variable whose behaviour you are trying to explain is called the dependent variable, and those variables being used for the explanation (or ‘model’) are called the independent variables. In mathematical terms there is a dependent variable Y which you are trying to predict from values of independent variables X1, X2, ... Xn, using the equation:

Y = B0 + B1*X1 + B2*X2 + ... + Bn*Xn + eps

where ‘eps’ is the error involved in using such a simple model. The procedure calculates the values of B0, B1, B2 ... Bn, so that ‘eps’ is as small as possible over the known values of Y, X1, X2 ... Xn.

The values B0, B1 etc are called the parameters of the regression equation. The term B0 is known as the constant term or the intercept and is the value of Y when X1, X2 ... Xn are all zero.

The values of Y which were recorded when the survey was done or the experiment was performed are known as the observed values. The values you would obtain by putting the values of X1, X2 ... Xn into the regression equation are called the predicted values. The difference between the predicted value and the observed value is called the residual value. The square of the correlation of the observed values and the predicted values is called the coefficient of determination or just the r-square value. It can be regarded as the fraction of the variability of Y explained by the equation.

The PROC REG statement has the form:

PROC REG options;

The DATA= option specifies that a dataset other than the one most recently created is to be used. The SIMPLE option gives simple descriptive statistics on each of the variables used in the procedure. Examples of PROC REG statements are:

PROC REG;

PROC REG DATA=SHELLS;

PROC REG SIMPLE;

PROC REG DATA=SASUSER.SAVED SIMPLE;

The MODEL statement

The variables to be used are specified with the MODEL statement. For example, to express the cost of producing a motor car (variable CARCOST), given the hourly wage rate of workers on the production line (HOURLY), the cost of steel (STEEL) and the price of electricity (ELECTRIC):

MODEL CARCOST = HOURLY STEEL ELECTRIC;

CARCOST is the dependent variable and HOURLY, STEEL and ELECTRIC are the independent variables.

The MODEL statement has several options. For example:

MODEL CARCOST = HOURLY STEEL ELECTRIC / NOINT;

forces the equation to have no constant term: the intercept is set to zero, so CARCOST is zero when all the other variables are zero. This is not entirely realistic for this model, as there are other costs involved in building a car. However, if all the sources of expense were included in the model, you would expect the cost of the car to be zero when all the factors contributing to the cost were zero.

To check how well the solution fits, you can print the values of the dependent variable along with the value the regression equation predicts, by specifying the P option, for example:

MODEL CARCOST = HOURLY STEEL ELECTRIC / P;

The R option prints extra information indicating whether the predicted values are significantly different from the observed values. This can be useful for spotting unusual cases in the data or for showing a pattern in the residuals indicating that the model has systematic errors, for example that a linear model is not appropriate and one with squares or cubes of values should be used instead.
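As a rough illustration of the last point, the sketch below adds a squared term to the car cost model and requests both the P and R listings. It assumes a hypothetical dataset CARS holding the variables named in the text; CARS2 and HOURLYSQ are likewise invented for the example.

```sas
/* Sketch only: CARS, CARS2 and HOURLYSQ are hypothetical names. */
DATA CARS2;
   SET CARS;
   HOURLYSQ = HOURLY*HOURLY;   /* squared term to allow a curved relation */
RUN;

PROC REG DATA=CARS2;
   MODEL CARCOST = HOURLY HOURLYSQ STEEL ELECTRIC / P R;
RUN;
```

If the residuals listed by the R option no longer show a pattern once HOURLYSQ is included, the extra term is probably worth keeping.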

The OUTPUT statement

To analyse the predicted or residual values you must write them to a SAS dataset. Having done that you can use any of the facilities of SAS, especially the graphical ones, to examine them. The OUTPUT statement allows you to write these and other values to a dataset.

The OUTPUT statement must specify the name of a dataset. This can be a permanent dataset (for example SASUSER.PREDVALS) or a temporary one (for example PREDVALS). The information to be written follows the dataset name. The keywords PREDICTED and RESIDUAL (which can be abbreviated to P and R) specify that these values are to be written and give them names. For example:

PROC REG; MODEL Y = X Z/NOINT; OUTPUT OUT=SAVED P=PY R=RY;

The output dataset contains all the variables from the input dataset (whether or not they were used to calculate the regression equation) as well as the ones specified by P or R. If the regression has multiple dependent variables you must specify predicted and residual variable names for each dependent variable.

Example

The following statements look at the relation of age to income, by saving the predicted values and plotting them to compare with the observed values:

PROC REG;
MODEL INCOME=AGE;
OUTPUT OUT=SAVE P=PINCOME;
RUN;
PROC PLOT;
PLOT INCOME*AGE='O' PINCOME*AGE='P' / OVERLAY;
RUN;

The output is:

DEP VARIABLE: INCOME

ANALYSIS OF VARIANCE

                     SUM OF          MEAN
SOURCE     DF       SQUARES        SQUARE    F VALUE    PROB>F
MODEL       1   15069036.69   15069036.69      2.419    0.1363
ERROR      19     118335771    6228198.45
C TOTAL    20     133404807

ROOT MSE    2495.636    R-SQUARE    0.1130
DEP MEAN    6034.476    ADJ R-SQ    0.0663
C.V.         41.3563

PARAMETER ESTIMATES

                    PARAMETER       STANDARD      T FOR H0:
VARIABLE   DF        ESTIMATE          ERROR    PARAMETER=0    PROB > |T|
INTERCEP    1      7958.69044     1351.63099          5.888        0.0001
AGE         1    -47.42781593    30.49099533         -1.555        0.1363

The significance probability for the test that the coefficient for AGE is zero (0.1363) is well above the usual 0.05 level, so AGE is not a very good predictor of INCOME. The plot also shows that the fit is a very poor one:

[Plot: INCOME*AGE (symbol O) overlaid with PINCOME*AGE (symbol P). Vertical axis: PREDICTED VALUE, 1000 to 10000. Horizontal axis: AGE, 19 to 75. NOTE: 7 OBS HAD MISSING VALUES, 3 OBS HIDDEN]
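The summary figures in the regression output above can be checked by hand from the sums of squares and the parameter estimates. The working below is only a sketch of that check, using the printed values; the subscripted names are labels introduced here, not SAS output.

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{model}}}{SS_{\text{total}}}
     = \frac{15069036.69}{133404807} \approx 0.1130 \\
F &= \frac{MS_{\text{model}}}{MS_{\text{error}}}
   = \frac{15069036.69}{6228198.45} \approx 2.419 \\
t_{\text{AGE}} &= \frac{-47.42781593}{30.49099533} \approx -1.555,
\qquad t_{\text{AGE}}^{2} \approx F
\end{align*}
```

With a single independent variable the square of the T value for AGE equals the F VALUE, which is why both tests report the same probability of 0.1363.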

Analysis of variance - the ANOVA procedure

The ANOVA procedure is restricted to analysing balanced designs, that is those experiments where there are the same number of replicate observations for each combination of factors. If your data does not satisfy this condition see page 44 for details of a general linear model. ANOVA can deal with one or many response variables and so can do multivariate analysis of variance.

As usual you may specify the dataset to be used with the DATA= option, for example:

PROC ANOVA DATA=GRASSES;

The CLASS statement

The factors in the design are declared using the CLASS statement, for example:

CLASS STRAIN HERBCIDE;

You must have a CLASS statement and it must precede the MODEL statement described below.

The MODEL statement

The MODEL statement specifies the dependent variable (sometimes called the response variable) and how it is thought to be related to the independent variables (the factors). You can specify several dependent variables, in which case SAS treats them together in a multivariate analysis. The specification of the model is more complex than with the REG procedure as you can include interaction effects between variables as well as the variables themselves.

Suppose you have a dependent variable Y with factors A, B and C. To fit only the factors with no interactions type:

MODEL Y = A B C;

To allow an interaction term between B and C, use:

MODEL Y = A B C B*C;

To include all possible interactions, type:

MODEL Y = A B C B*C A*B A*C A*B*C;

Since this is such a common model, SAS allows you to write this in the shorter form:

MODEL Y = A|B|C;

You can specify a mixture of these, for example

MODEL Y = A B|C|D;

where only the main effect of A is used but the full interactions of B, C and D are required.

If a factor B is nested within another factor A, type:

MODEL Y = A B(A);

This occurs when not all values of B are observed for each value of A, and so you do not have a ‘crossed’ model. For example, if you were comparing teaching methods in different schools then the teachers would only teach in one school, and so any teacher effect would be nested within the school effect.
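The teaching example might be written as in the sketch below. All the names (TEACHING, SCHOOL, TEACHER, SCORE) are hypothetical, and the design is assumed to be balanced, as ANOVA requires.

```sas
/* Sketch only: TEACHING, SCHOOL, TEACHER and SCORE are hypothetical names. */
PROC ANOVA DATA=TEACHING;
   CLASS SCHOOL TEACHER;
   MODEL SCORE = SCHOOL TEACHER(SCHOOL);   /* teachers nested within schools */
RUN;
```

Both factors still appear in the CLASS statement; only the MODEL statement records that TEACHER is nested within SCHOOL.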

The MEANS statement

Having established that not all groups have the same mean, you might like to know which groups are different from other groups. This can be done using the MEANS statement. Suppose the MODEL is:

MODEL CROP = VARIETY FIELD VARIETY*FIELD;

To look at the effect of VARIETY in more detail, type:

MEANS VARIETY;

The mean and standard deviation for each value of VARIETY are shown. You can also specify various tests to investigate whether these means are significantly different. These include the Scheffe test, Duncan’s test, Tukey’s test, and Least Significant Difference (LSD). To request a Scheffe multiple comparison test on VARIETY, type:

MEANS VARIETY/SCHEFFE;

This will then show in detail how the group means differ.

Example

Suppose a biochemist is interested in the effect of a new herbicide on the mortality of plants. Fifty plants were placed in each of twelve pots containing nutrient solution. After ten days’ growth, three of the pots were sprayed with herbicide and three were left as controls. After a further ten days three more pots were sprayed and the remaining three designated as controls. Thus two factors were considered - herbicide treatment and age of plant - each treatment combination being replicated three times.

The analysis produces an analysis of variance table assessing the significance of the herbicide spraying, the age of plants and their interaction. The experimenter was also interested in calculating least significant differences for comparison of main effect means. The following statements enter the data and carry out the analysis:

DATA HERB;
/* DATA INPUT SPECIFYING EACH FACTOR LEVEL EXPLICITLY */
INPUT AGE HERBICID SURVIVOR;
CARDS;
1 1 20
1 1 18
1 1 23
1 2 11
1 2 12
1 2 15
2 1 40
2 1 43
2 1 39
2 2 35
2 2 37
2 2 32
PROC ANOVA;
CLASS AGE HERBICID;
MODEL SURVIVOR = AGE HERBICID AGE*HERBICID;
MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;

This produces the following output:

ANALYSIS OF VARIANCE PROCEDURE

CLASS LEVEL INFORMATION

CLASS       LEVELS    VALUES
AGE              2    1 2
HERBICID         2    1 2

NUMBER OF OBSERVATIONS IN DATA SET = 12

ANALYSIS OF VARIANCE PROCEDURE

DEPENDENT VARIABLE: SURVIVOR

SOURCE             DF    SUM OF SQUARES     MEAN SQUARE    F VALUE    PR > F
MODEL               3     1486.25000000    495.41666667      92.89    0.0001
ERROR               8       42.66666667      5.33333333
CORRECTED TOTAL    11     1528.91666667

R-SQUARE        C.V.      ROOT MSE    SURVIVOR MEAN
0.972094      8.5270    2.30940108      27.08333333

SOURCE             DF          ANOVA SS    F VALUE    PR > F
AGE                 1     1344.08333333     252.02    0.0001
HERBICID            1      140.08333333      26.27    0.0009
AGE*HERBICID        1        2.08333333       0.39    0.5494

ANALYSIS OF VARIANCE PROCEDURE

T TESTS (LSD) FOR VARIABLE: SURVIVOR

NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
NOT THE EXPERIMENTWISE ERROR RATE

ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747

MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.

T GROUPING      MEAN    N    AGE
         A    37.667    6    2
         B    16.500    6    1

ANALYSIS OF VARIANCE PROCEDURE

T TESTS (LSD) FOR VARIABLE: SURVIVOR

NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
NOT THE EXPERIMENTWISE ERROR RATE

ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747

MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.

T GROUPING      MEAN    N    HERBICID
         A    30.500    6    1
         B    23.667    6    2

 

ANALYSIS OF VARIANCE PROCEDURE

MEANS

AGE    HERBICID    N      SURVIVOR
1      1           3    20.3333333
1      2           3    12.6666667
2      1           3    40.6666667
2      2           3    34.6666667
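The least significant difference reported in the output can be reproduced by hand. The working below is a sketch using the usual formula for equal group sizes, with the critical value of T, the MSE and the group size N=6 taken from the listing.

```latex
\begin{align*}
\text{LSD} &= t_{0.025,\,8}\,\sqrt{\frac{2\,\text{MSE}}{n}}
            = 2.306 \times \sqrt{\frac{2 \times 5.33333}{6}}
            \approx 2.306 \times 1.3333 \approx 3.0747 \\
\bar{y}_{\text{AGE}=2} - \bar{y}_{\text{AGE}=1} &= 37.667 - 16.500 = 21.167 > 3.0747 \\
\bar{y}_{\text{HERBICID}=1} - \bar{y}_{\text{HERBICID}=2} &= 30.500 - 23.667 = 6.833 > 3.0747
\end{align*}
```

Both differences exceed the least significant difference, which is why the two levels of each factor are given different letters (A and B) in the T GROUPING column.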

An alternative way of inputting the data using DO loops is shown below. It saves specifying factor levels individually. The @@ symbol is used to specify that several cases occur on one line. Note that each case requires an explicit OUTPUT statement.

DATA HERB;
/* DATA INPUT SPECIFYING FACTORS AND LEVELS THROUGH LOOPS */
DO AGE = 1 TO 2;
DO HERBICID = 1 TO 2;
DO REPLICAT = 1 TO 3;
INPUT SURVIVOR @@;
OUTPUT;
END;
END;
END;
CARDS;
20 18 23
11 12 15
40 43 39
35 37 32
RUN;
PROC ANOVA;
CLASS AGE HERBICID;
MODEL SURVIVOR = AGE HERBICID AGE*HERBICID;
MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;

General linear modelling - the GLM procedure

The GLM procedure allows analysis of the general linear model. This views analysis of variance, regression and several other techniques as transformations of the simple model described above for regression. This enables it to carry out many sophisticated analyses which are not available elsewhere in SAS. Although the specification looks very much like the REG or ANOVA procedures, the output is quite different. A very important aspect of GLM is that it can perform an analysis of variance on unbalanced designs. For balanced designs ANOVA is to be preferred.

As usual you may specify the dataset to be used with the DATA= option, for example:

PROC GLM DATA=SASUSER.GRASSES;

The CLASS statement

If variables to be used in the model are to be regarded as factors, that is having a small number of defined categories, they should be specified in a CLASS statement. If they do not appear in a CLASS statement SAS assumes that a regression type model is appropriate, which is not true for factors. If the CLASS statement is used then it must precede the MODEL statement.

If none of the variables in the model appears in a CLASS statement, regression is being used. If all the variables in the model appear in a CLASS statement, analysis of variance is being used. If only some of the variables in the model appear in a CLASS statement, analysis of covariance is being used.
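The three cases can be sketched as follows. The dataset CROPS and the variables YIELD, VARIETY and RAINFALL are hypothetical names chosen for illustration; VARIETY is assumed to be a factor and RAINFALL a continuous measurement.

```sas
/* Sketch only: CROPS, YIELD, VARIETY and RAINFALL are hypothetical names. */

/* Regression: no CLASS statement */
PROC GLM DATA=CROPS;
   MODEL YIELD = RAINFALL;
RUN;

/* Analysis of variance: every variable in the model is a factor */
PROC GLM DATA=CROPS;
   CLASS VARIETY;
   MODEL YIELD = VARIETY;
RUN;

/* Analysis of covariance: a factor plus a continuous variable */
PROC GLM DATA=CROPS;
   CLASS VARIETY;
   MODEL YIELD = VARIETY RAINFALL;
RUN;
```

Only the CLASS statement changes between the second and third steps; the MODEL statement decides which variables take part.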

The MODEL statement

The MODEL statement for GLM is essentially the same as for ANOVA in the way that it describes the model. However the options which can be specified are not identical. As with REG the NOINT option specifies that the model is to have no constant term. Similarly the P option asks for information to be produced on the predicted and residual values. For example:

MODEL X3 = X1 X2 / NOINT P;

There is no R option. You can specify several dependent variables to carry out a multivariate analysis.

The ID statement

If the P option is specified in the MODEL statement you may wish to label the cases. The ID statement specifies the variable name to be used to label the observations. For example:

ID NAME;

The MEANS statement

As with ANOVA you can request multiple comparisons on the group means, for example:

MEANS MAKER/DUNCAN;

This shows in detail how the group means differ.

The RANDOM statement

The RANDOM statement specifies that a factor in the model is to be considered as a random effects factor rather than a fixed effects factor. For example, the number of days on which it rained during a trial period may be a very important factor, but it must be regarded as a random effects factor rather than a fixed effects one. If used, the RANDOM statement must appear after the MODEL statement, for example:

CLASS A B;
MODEL Y = A B;
RANDOM B;

The OUTPUT statement

To analyse the predicted or residual values you must write them to a SAS dataset. Having done that you can use any of the facilities of SAS, especially the graphical ones, to examine them. The OUTPUT statement allows you to write these values.

The OUTPUT statement must specify the name of a dataset. This can be a permanent dataset (for example SASUSER.PREDVALS) or a temporary one. The information to be written follows the dataset name. The keywords PREDICTED and RESIDUAL (which can be abbreviated to P and R) specify that these values are to be output and give them names. For example:

PROC GLM; MODEL Y = X Z/NOINT; OUTPUT OUT=SAVED P=PY R=RY;

The output dataset contains all the variables from the input dataset (whether or not they were used in the model) as well as the names specified by P or R. If the analysis has multiple dependent variables you must specify predicted and residual variable names for each dependent variable.

Example

DATA MILEAGE;
INPUT MPH MPG @@;
CARDS;
20 15.4 30 20.2 40 25.7
60 24.8
RUN;
PROC GLM;
MODEL MPG=MPH / P CLM;
OUTPUT OUT=PP P=MPGPRED R=RESID;
RUN;
PROC PLOT DATA=PP;
PLOT MPG*MPH='A' MPGPRED*MPH='P' / OVERLAY;
RUN;

Miscellaneous useful procedures

This section describes some further procedures which you are likely to need but which do not form a logical group. These are the procedures for sorting data, for declaring formats, and for listing the contents of a dataset.

Sorting a dataset - the SORT procedure

The SORT procedure is used to sort a dataset on one or more variables. It is necessary to have the data sorted if the BY statement is to be used on the dataset (see page 11). The general form of the SORT statement is:

PROC SORT options;

where the options include:

DATA=dataset

specifies the dataset to be sorted. If it is omitted the most recently created dataset will be used.

OUT=dataset

specifies the output dataset. If it is omitted the input dataset is overwritten by the sorted version.

NODUPLICATES

specifies that the data is checked after it has been sorted and any exact duplicates are dropped from the output data set. This checking uses all the variables in the data, not just the ones used for the sorting.

EQUALS

specifies that the original order of cases is preserved if they have identical sort key values. If EQUALS is not specified then the order may be changed.

The BY statement

The BY statement specifies the variable or variables to be used as a key or keys to order the data. For example:

BY AGEGRP;

specifies that the data is to be sorted according to the values of the variable AGEGRP. Unless otherwise specified, the values are arranged with the lowest values first, that is in ascending order. To have data sorted in descending order, that is with the high values first, insert DESCENDING before the variable name, for example:

BY DESCENDING INCOME;

If more than one variable is specified the first one mentioned is the most important, the next the second most important etc. For example:

BY AGEGRP DESCENDING GRADE REGION;

sorts the data so that the first case has the lowest value of REGION within the highest value of GRADE within the lowest value of AGEGRP. The last case has the highest value of REGION within the lowest value of GRADE within the highest value of AGEGRP.

Examples

¤ PROC SORT; BY SEX; RUN;

¤ PROC SORT DATA=NATION OUT=REGION; BY REGION CITY; RUN;
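A further sketch shows a sorted copy being written with OUT= and then used with a BY statement in a later procedure. The dataset names SURVEY and SORTED are hypothetical; AGEGRP, GRADE and INCOME are the variable names used in the text and are assumed to exist in SURVEY.

```sas
/* Sketch only: SURVEY and SORTED are hypothetical dataset names. */
PROC SORT DATA=SURVEY OUT=SORTED EQUALS;
   BY AGEGRP DESCENDING GRADE;
RUN;

/* The same BY statement can now be used, because SORTED is in that order. */
PROC MEANS DATA=SORTED;
   VAR INCOME;
   BY AGEGRP DESCENDING GRADE;
RUN;
```

Writing to SORTED leaves SURVEY in its original order, which is useful if other steps depend on it.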

Defining your own formats - the FORMAT procedure

The FORMAT statement described on pages 9 and 11 associates a format with a variable. This may be a standard SAS format or one constructed by the user. These constructed formats are defined by the FORMAT procedure. You must use the FORMAT procedure to give labels to individual values of a variable, using the VALUE statement.

When a format is named in a FORMAT statement its name always ends with a full stop. This is how formats are recognised. The full stop is not used when the format is defined in the FORMAT procedure.

The VALUE statement

To label individual values of a variable or ranges of values use the VALUE statement. The format is given a name and then the values and corresponding labels are declared. For example:

VALUE YESNO
1='YES'
2='NO'
3='MISSING';

Ranges of values are specified using a hyphen, for example:

VALUE NATIONS
1, 3-16, 28='WESTERN BLOC'
2, 17-21='EASTERN BLOC'
22-27, 29='NON-ALIGNED';

The keyword OTHER may be used to catch any values not explicitly mentioned, for example:

VALUE AGEFMT
18-30='YOUNG'
31-55='MIDDLE'
56-80='OLD'
OTHER='MISSING';

The keywords LOW and HIGH may be used to specify the ends of ranges, that is the lowest and highest values.

Using PUT with formats

The PUT function recodes a variable in accordance with a format. For example:

PROC FORMAT;
VALUE AGEGRP 0-18='1'
19-30='2'
31-50='3'
51-65='4'
66-100='5';

DATA SURV;
INPUT AGE SEX INCOME;
AGEGRP=PUT(AGE,AGEGRP.);
CARDS;

RUN;

The variable AGEGRP is set to 1 whenever AGE is 0 to 18, 2 whenever AGE is 19 to 30, etc. AGEGRP is a character variable even though all its values are digits. Note that the PUT function is quite different to the PUT Statement (see page 8).
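If a numeric code is needed instead, one common approach is to convert the character result with the INPUT function. The sketch below assumes that SURV has been created as above with single-digit values in AGEGRP; SURV2 and AGENUM are hypothetical names.

```sas
/* Sketch only: SURV2 and AGENUM are hypothetical names. */
DATA SURV2;
   SET SURV;
   AGENUM = INPUT(AGEGRP, 1.);   /* numeric copy of the one-character code */
RUN;
```

AGENUM can then be used wherever a numeric variable is required, while AGEGRP remains available as a character variable.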

Example

To produce a frequency table for income with the data grouped into a small number of categories, a suitable format is declared with VALUE and then the FORMAT statement is used within the FREQ procedure to assign the format to the variable:

PROC FORMAT;
VALUE INCFMT LOW-1499='<1,500'
1500-2499='<2,500'
2500-5999='<6,000'
6000-9999='<10,000'
10000-HIGH='10,000+';

PROC FREQ;
TABLES INCOME;
FORMAT INCOME INCFMT.;
RUN;

This produces a frequency table with only five categories, which are labelled as described. Note the full stop after the format name in the FORMAT statement.

Printing the values of variables - the PRINT procedure

The PRINT procedure prints the values of variables, and produces simple reports by use of the BY, PAGEBY and SUM statements. The PROC PRINT statement has the form:

PROC PRINT options;

The options include:

DATA=dataset

specifies the dataset to be used. If it is missing the most recently created dataset is used.

N

outputs the number of cases at the end of the data.

UNIFORM

specifies that the same layout is to be used for each page. If it is not included, SAS outputs as many variables as possible on a page, which may result in different numbers of variables on different pages.

DOUBLE

specifies double spacing in the output.

LABEL

requests that the labels for the variables (see page 8) be used to head the columns of output rather than the names of the variables.

The VAR statement

The VAR statement specifies the variables whose values are to be printed, for example:

VAR AGE GRADE;

If no VAR statement is used all the variables are printed.

The ID statement

Normally the number of the observation or case in the dataset is used to identify it in the output. The ID statement means that the values of the specified variable are to be used instead. For example:

ID NAME;

The BY and PAGEBY statements

The BY statement specifies that the procedure is to operate on the subgroups defined by the values of a variable. The PAGEBY statement may be used with the BY statement to start a new page when a BY variable changes. For example:

BY A B C;
PAGEBY A;

causes a new page to be started when the value of A changes.

BY A B C;
PAGEBY B;

causes a new page to be started whenever A or B changes, because PAGEBY triggers a new page for the variable specified and for any earlier variables in the BY statement list.

The SUM statement

If a variable appears in a SUM statement the total of the values is also produced. If a BY statement has also been used totals are printed for the subgroups (provided there is more than one case in the subgroup).

Examples

¤ PROC PRINT; RUN;

¤ PROC PRINT DOUBLE DATA=SICKNESS; VAR GRADE AGE DAYSSICK; ID NAME; BY DEPT; PAGEBY DEPT; SUM DAYSSICK; RUN;

¤ The following statements list the data by sex with income summed over men and women separately:

PROC SORT; BY SEX; RUN; PROC PRINT; VAR AGE MARSTAT NOOFCH INCOME; BY SEX; SUM INCOME; ID CASENO; RUN;

The output is as follows:

---------------------------- SEX=1 -----------------------

CASENO    AGE    MARSTAT    NOOFCH    INCOME
101        35          2         4      9754
103        53          .         0      7560
104        39          4         2      8500
106        38          2         3      8210
107        49          2         7      9607
108        27          1         0      8895
110        21          1         0      2954
114        25          1         0      5650
115        80          4         .      1750

 

116

43

2