Document K6.1
Introduction to SAS *
Introduction ... 1
Documentation ... 1
The SAS DATA step ... 1
Creating and naming a dataset - the DATA statement ... 2
Preparing data for SAS ... 2
Entering data to SAS and naming variables - the INPUT statement ... 3
Reading data from a file - the INFILE statement ... 5
Including data with SAS commands - the CARDS statement ... 5
Reading from a SAS dataset - the SET statement ... 5
Calculations - the assignment statement ... 6
Conditional evaluation and selection - the IF, THEN and ELSE statements ... 7
Giving labels to variables - the LABEL statement ... 8
Displaying data - the PUT statement ... 8
Controlling output format - the FORMAT statement ... 9
Executing instructions - the RUN statement ... 10
Recoding data ... 10
Example DATA steps ... 10
The SAS PROC step ... 11
Analysing data in subgroups - the BY statement ... 11
Controlling output format - the FORMAT statement ... 11
Printing titles on output from procedures - the TITLE statement ... 11
Printing footnotes - the FOOTNOTE statement ... 12
Descriptive statistics ... 12
Frequency tables and crosstabulations - the FREQ procedure ... 12
Descriptive statistics - the MEANS procedure ... 14
Statistics about a single variable - the UNIVARIATE procedure ... 15
Correlations - the CORR procedure ... 17
Complex tables - the TABULATE procedure ... 19
Student’s T-Test ... 21
Drawing simple diagrams with SAS ... 23
Bar charts - the CHART procedure ... 24
Scattergrams - the PLOT procedure ... 26
High-quality graphics ... 28
Bar charts - the GCHART procedure ... 28
Scattergrams - the GPLOT procedure ... 31
Maps - the GMAP procedure ... 33
Regression - the REG procedure ... 34
The MODEL statement ... 35
The OUTPUT statement ... 36
Example ... 36
Analysis of variance - the ANOVA procedure ... 39
The CLASS statement ... 39
The MODEL statement ... 39
The MEANS statement ... 40
Example ... 40
General linear modelling - the GLM procedure ... 44
The CLASS statement ... 44
The MODEL statement ... 45
The ID statement ... 45
The MEANS statement ... 45
The RANDOM statement ... 45
The OUTPUT statement ... 45
Example ... 46
Miscellaneous useful procedures ... 46
Sorting a dataset - the SORT procedure ... 46
Defining your own formats - the FORMAT procedure ... 47
Printing the values of variables - the PRINT procedure ... 49
References and further information ... 52
* We are grateful to the University of Liverpool Computer Laboratory for permitting us to use their original user guide as a basis for this one.
Introduction
SAS (Statistical Analysis System, usually pronounced as a single syllable ‘sass’) provides facilities for the manipulation and analysis of both numeric and character data using a variety of techniques. The statistical facilities range from production of frequency tables and bar charts to multivariate regression and analysis of variance. The facilities for data input and manipulation give control over how data is to be read, and provide for the sorting and merging of datasets and calculations on the data. This variety of facilities makes SAS suitable for a wide range of applications including survey analysis, data processing, and analysis of designed experiments.
A series of SAS statements is often divided into DATA steps and PROC steps. The division is between those statements which input data, manipulate it, select subgroups, create new variables etc (the DATA step), and those which analyse the data and produce output (the PROC step). The section ‘The SAS DATA step’ describes some of the functions possible in the DATA step, the section ‘The SAS PROC step’ describes some of the statements which can occur in any PROC step, while the sections ‘Descriptive Statistics’ to the end describe some specific PROC statements.
SAS is available locally on CMS and CSA. A PC version may be purchased from the Computing Service.
Documentation
SAS is described in several manuals, amounting to many hundreds of pages. For this document it has been necessary to omit much of the material and select only a small portion of the facilities available. Even when a facility is described here it will not be described in full and many options may be ignored. This means that you should check with the full manuals if there is something you want to do that is not described in the following sections. Details of manuals may be found on page 52. On CMS, type HELP SAS for details of the CMS version.
The SAS DATA step
A SAS DATA step is a set of statements which set up the data, for
example read the data, manipulate it, select subgroups or create new variables. This section describes some of the functions possible in the DATA step.
A DATA step starts with a DATA statement and ends with a RUN
statement (or with a further DATA or a PROC statement).
K6.1 (10.90)
page 1
Comments can appear anywhere in a SAS statement provided they appear within the delimiters /* (to start a comment) and */ (to end one). For example:
DATA STOCK; INPUT CODE QUANTITY PRICE /*in dollars*/ STOREBIN;
Creating and naming a dataset - the DATA statement
The DATA statement has the general form:
DATA dataset options ;
Note that, as with all SAS statements, the DATA statement ends with a semicolon. The ‘dataset’ part names the dataset being created by this DATA step, either by the input of data or by processing an existing dataset. ‘dataset’ has two parts: a library name (which may be omitted) and the dataset name. These names follow the same rules as other names used in SAS, which are:
¤ The name must not be more than eight characters long.
¤ It may contain only the letters A to Z, the digits 0 to 9 and the underline character.
¤ It must not start with a digit.
Examples of valid names are NEW_ONE, OLDDATA2 and _EX_A1.
If a library name is specified it is separated from the dataset name by a full stop. If no library name is given, the library is taken to be WORK, which is a temporary library that is deleted when you leave SAS. Examples of valid DATA statements are:
DATA KEEP.GNP82;
DATA PRICES;
DATA E_4_DATA;
If you miss out ‘dataset’ altogether, SAS gives the dataset the name WORK.DATAn, where n is 1, 2, 3, ... depending on how many such datasets you have created so far.
If you specify a library name you can save the library and use it in a later SAS session rather than having to set it up again. Note however that only the library name SASUSER is set up for you by SAS. If you wish to use any other library name then you must define it both to the system you are using and to SAS (using a LIBNAME statement). This is described in more detail in Reference A or B.
Preparing data for SAS
SAS can read either fixed format or free format data (or even data which mixes fixed and free format in some cases). Fixed format requires that values be in fixed positions in each line of the data, while free format requires only that values be separated by a blank. Most people find it easier to prepare their data in free format. SAS can read text data (for example names) as well as numbers, and it has special provision for times and dates.
Free format data must follow certain rules:
¤ Each value must be separated from the next by at least one space or by the end of a line.
¤ Decimal points must be included where they are required.
¤ Any missing data must be represented by a full stop (.) and not by a blank (full stop is the standard symbol for a missing value in SAS).
¤ If you are using text data then no item of text may be longer than eight characters and the data must be enclosed in quotes.
If your data satisfies all these conditions, you can use the free format method of reading data. If not, you must use fixed format, or possibly a mixture of free and fixed format, as described below.
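As a minimal sketch of the free format rules above, the following DATA step reads three cases of numeric data typed with the statements. The dataset name FREEFMT and the variable names are hypothetical, chosen only for illustration; the statements used (INPUT, CARDS and RUN) are described later in this section.

```sas
DATA FREEFMT;
  /* CASE_NO, HEIGHT and WEIGHT are hypothetical numeric variables */
  INPUT CASE_NO HEIGHT WEIGHT;
  CARDS;
1 1.88 82
2 1.60 .
3 1.90 75.5
RUN;
```

Values are separated by single spaces, decimal points are typed where needed, and the missing WEIGHT for case 2 is shown by a full stop rather than a blank.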
Entering data to SAS and naming variables - the INPUT statement
A SAS dataset is made up of a number of cases or observations, each of which contains a value (either measurements or calculations) for each variable in the dataset.
If you are inputting new data, rather than processing an existing dataset, use the INPUT statement to specify:
¤ the names of the variables
¤ the order of variables in each complete set of variables (or case or observation)
¤ whether variables are numeric or contain characters
¤ how the variables are laid out on the input line
The names of the variables follow the same rules as those for dataset names given on page 2, for example HEIGHT, TOT_WAGE, AVE3. A variable is assumed to be numeric unless it is followed by a dollar sign ($) in the INPUT list. The dollar sign is not part of its name and is not typed when using the variable in other SAS statements.
The order in which the variables appear in the INPUT statement is their order in the dataset (although not necessarily their order in the input data, as some forms of fixed format input can read the data in a different order to the one in which it was typed).
If you have a list of variable names ending in numeric suffixes, for example VAR1, VAR2, VAR3 etc, SAS allows an abbreviated form for their specification:
INPUT CASENUM VAR1-VAR12 DETAILS;
This form may also be used to specify such a list in procedures. For example:
PROC MEANS; VAR AGE1-AGE7; RUN;
For more details of PROC MEANS see page 14.
Where variable names do not follow such a pattern you can still use an abbreviated form when specifying them in a procedure, but it is a little different. If the INPUT statement for the dataset was:
INPUT CASE SEX AGE TESTA TESTB INCOME MARSTAT REGION;
you could type:
PROC MEANS; VAR AGE--INCOME; RUN;
to obtain means on the variables AGE, TESTA, TESTB and INCOME. The double hyphen indicates a list of variable names.
Free format - list input
To input data which satisfies the rules of free format input given on page 2 you need only list the variables, for example:
INPUT CASE_NO SEX $ AGE HEIGHT WEIGHT;
SEX has been declared as a character variable so that M and F can be used to denote male and female, rather than using numeric codes. As with all SAS statements, the INPUT statement ends with a semicolon.
Fixed format - column input
There are two ways of specifying fixed format input. The easier is column input. The columns containing the value for the variable are specified after the variable name, for example:
INPUT REF 1-6 NAME $ 7-26 AGE 27-28 SEX 59 HEIGHT 64-66 .2;
This specifies the name and type of the variables and how they are laid out:
REF      numeric, columns 1 to 6
NAME     character, columns 7 to 26
AGE      numeric, columns 27 to 28
SEX      numeric, column 59
HEIGHT   numeric, columns 64 to 66, the last two columns (65 and 66) are assumed to be after a decimal point (which need not be typed)
If you always type decimal points where they are needed then you have no need to use the type of column format given for HEIGHT. Instead you could specify simply HEIGHT 64-66 and type in all decimal points.
Anything typed in columns other than those mentioned in the INPUT statement is ignored. If your data does not fit on one line you must indicate when the second or third line begins. For example, if the data has the name on one line and the address on the next:
INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;
The #2 means that subsequent column numbers refer to line 2.
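A minimal sketch of a complete DATA step using this two-line layout is given below. The dataset name ADDRBOOK and the data lines are invented for illustration; CARDS and RUN are described later in this section.

```sas
DATA ADDRBOOK;
  /* Line 1 of each case holds NAME, line 2 holds ADDRESS */
  INPUT NAME $ 1-20 #2 ADDRESS $ 1-60;
  CARDS;
JANE SMITH
12 HIGH STREET, NEWTOWN
JOHN BROWN
4 MILL LANE, OLDTOWN
RUN;
```

The four data lines give two cases, because each case takes up two lines.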
Fixed format - formatted input
Formatted input is the second way to input fixed format data. It is the only way of inputting certain unusual types of data and of declaring that data represents a time or date so that it can be printed appropriately later. The data layout used above would be specified as follows using SAS informats (an informat describes how a value is to be read; a format describes how it is to be printed):
INPUT @1 REF 6. NAME $20. AGE 2. @59 SEX 1. @64 HEIGHT 3.2;
¤ ‘@1’ specifies that REF starts in column 1
¤ ‘6.’ means that it is 6 digits long with no figures after the decimal point
¤ since NAME follows immediately after REF there is no need to specify the column at which it starts (column 7 is assumed)
¤ ‘$20.’ indicates that it is a character value 20 characters long
¤ AGE is two digits long with no decimal point
¤ SEX does not immediately follow AGE, and so ‘@59’ is needed to show where SEX starts, and ‘1.’ specifies that it takes up one column
¤ HEIGHT takes up three columns, the last two of which are after the decimal point, so its informat is ‘3.2’ rather than ‘3.’, but ‘3.’ would be quite adequate if you had typed the decimal points in the data.
If there is more than one line of data for each case use the ‘#n’ notation, for example:
INPUT @1 NAME $20. #2 @1 ADDRESS $60.;
Mixing free and fixed format
SAS allows the different styles of describing how the data is arranged to be specified on the same INPUT statement. For example, suppose you had data which consisted of a name of up to 20 letters and seven measurements, and you wished to type the measurements in free format but could not use free format for the name because it had more than 8 characters. If you put the name first you could describe this data quite simply:
INPUT NAME $ 1-20 M1 M2 M3 M4 M5 M6 M7;
The name starts in column 1 and the measurements can be typed in free format after column 20. If any name is 20 characters long, a space between the end of the name and the first measurement is advisable.
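The following sketch shows data laid out for such a mixed INPUT statement. The dataset name MEASURES, the names and the measurements are all invented for illustration.

```sas
DATA MEASURES;
  /* NAME is read from columns 1 to 20; M1 to M7 are read in free format */
  INPUT NAME $ 1-20 M1 M2 M3 M4 M5 M6 M7;
  CARDS;
CHRISTOPHER ROBINSON 4.1 3.9 5.0 4.4 4.8 5.2 4.6
ANN LEE              3.7 4.0 4.2 . 4.5 4.9 5.1
RUN;
```

The first name fills all 20 columns, so a space is left in column 21 before the first measurement. The fourth measurement for the second case is missing and is typed as a full stop.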
Multiple cases on a line
If the list of variables ends with @@ SAS does not expect each case to start on a new line. For example:
DATA LINES;
INPUT A B @@;
CARDS;
1 2 1 4 1 5 6 4 2 3 6 7 10 8 3 6
RUN;
The data for eight cases is typed on one line. For an explanation of CARDS see page 5 and for RUN see page 10.
Reading data from a file - the INFILE statement
If data is to be read from a file you must include an INFILE statement to specify the name of the file. There are different ways of specifying the file according to which system you are using. See Reference A or B for further details.
Including data with SAS commands - the CARDS statement
If the data is to be included with the SAS statements the CARDS statement appears just before the data. It has the format:
CARDS;
If CARDS is being used then only the data itself and a RUN statement (see page 10) should follow it. Any transformations of the data must appear before the CARDS statement.
Reading from a SAS dataset - the SET statement
If the data is to be read from an existing dataset the SET statement specifies the dataset to be used. A DATA step should normally contain either a SET statement or an INPUT statement. The format of the SET statement is:
SET dataset;
For example:
SET GHS_DATA;
SET SASUSER.ANSWERS;
Calculations - the assignment statement
Assignment statements have no keyword to identify them. They are used to perform calculations on the data and have the general form:
variable = expression;
The expression contains numbers, variable names and arithmetic operators. The multiplication operator (*) must always appear between two quantities which are to be multiplied together.
The symbols used for arithmetic operators are:
** exponentiation, for example (1 + R)**T
* multiplication, for example TAXABLE * RATE
/ division, for example C367 / C45
+ addition, for example A1 + B9
- subtraction, for example GROSS - TAX
Brackets may be used to make your meaning clear. Examples of assignment statements are:
AVEINC = INCOME / NUMFAM;
B1 = (A7 + C9 - EXPENSES)*0.54;
GRPPAY = PAY / 1000;
SAS has rules for the order in which it evaluates expressions by giving priorities (or precedence) to each operator, as follows:
1 Bracketed expressions
2 Exponentiation
3 Multiplication and Division
4 Addition and Subtraction
If there are operators of equal precedence SAS works from left to right. This means that an expression like:
A + B * C
is evaluated by SAS as A + (B * C). If you wish to add A and B before multiplying by C then you must use brackets:
(A + B) * C
If you are in doubt about how SAS will evaluate a complex expression then either insert brackets or split it into simpler expressions and use several assignment statements to build up the full expression.
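The precedence rules can be checked with a short sketch such as the one below. The dataset name PRECED and the values are invented for illustration. The step contains neither INPUT nor SET, so it is executed once, and it uses the PUT statement (described on page 8) to print the results.

```sas
DATA PRECED;
  A = 2; B = 3; C = 4;
  X = A + B * C;      /* multiplication first: 2 + 12 = 14 */
  Y = (A + B) * C;    /* brackets first: 5 * 4 = 20 */
  Z = A * B ** 2;     /* exponentiation first: 2 * 9 = 18 */
  PUT X= Y= Z=;       /* prints X=14 Y=20 Z=18 */
RUN;
```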
SAS expressions can also include SAS functions. These provide many facilities including square roots of numbers, logarithms, sines and cosines, probabilities, etc. A list of the more commonly used functions is given below.
ABS       absolute value (that is, the value ignoring sign)
MAX       maximum of a list of values
MIN       minimum of a list of values
SQRT      square root
INT       gives the integer part of the value, that is it discards the decimal part
ROUND     rounds off to the nearest whole number
NORMAL    gives a random number from a normal distribution with mean 0 and standard deviation 1
UNIFORM   gives a random number from a uniform distribution in the range 0 to 1.
There are also functions for manipulating dates and times and for character variables.
Functions are used by giving the value on which they are to operate in brackets following the name, for example:
BIG = MAX(A,B,C);
S = SQRT(S2/N);
RX = ROUND(X) + C;
Z = M + S*NORMAL(0);
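As a small sketch, the non-random functions in the list can be tried on fixed values and the results printed with PUT (described on page 8). The dataset name FUNCS and the values are invented for illustration.

```sas
DATA FUNCS;
  A = ABS(-3.5);        /* 3.5 */
  B = MAX(4, 9, 2);     /* 9 */
  C = MIN(4, 9, 2);     /* 2 */
  D = SQRT(16);         /* 4 */
  E = INT(7.9);         /* 7, the decimal part is discarded */
  F = ROUND(7.6);       /* 8, the nearest whole number */
  PUT A= B= C= D= E= F=;
RUN;
```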
Conditional evaluation and selection - the IF, THEN and ELSE statements
The IF statement is used when an action is to be carried out on only some of the cases being processed. For example, you may wish to take special action if data is missing, or do calculations differently for people in work and those unemployed, or you may wish to exclude certain groups. The general form of the IF statement is:
IF condition THEN statement1; ELSE statement2;
‘statement1’ is acted upon if ‘condition’ is true, otherwise ‘statement2’ is acted upon. The ‘condition’ is usually of the form:
expression comparator expression
where the expression is as described on page 6, and ‘comparator’ is one of the following:
EQ or = equals
NE or ^= not equal to
GT or > greater than
NG or ^> not greater than
LT or < less than
NL or ^< not less than
GE or >= greater than or equal to
LE or <= less than or equal to
Note that NL and GE are equivalent and so are NG and LE.
These simple conditions can be linked with the words AND and OR. NOT may be used to change the meaning of a condition.
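The sketch below shows IF, THEN and ELSE with conditions linked by AND, OR and NOT in a complete DATA step. The dataset name WORKERS, the variables FLAG and LOWPAY, and the data are invented for illustration.

```sas
DATA WORKERS;
  INPUT AGE WORKSTAT INCOME;
  /* Comparators may be written as words or as symbols */
  IF AGE LE 15 AND WORKSTAT EQ 1 THEN FLAG = 1;
  ELSE FLAG = 0;
  /* NOT reverses the meaning of the bracketed condition */
  IF NOT (INCOME >= 5000 OR AGE < 18) THEN LOWPAY = 1;
  ELSE LOWPAY = 0;
  CARDS;
14 1 1200
30 1 4000
45 2 9000
RUN;
```

With this data only the first case gets FLAG = 1, and only the second case gets LOWPAY = 1: the first case is under 18 and the third has an income of at least 5000.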
When using a complex condition you should use brackets to make your meaning clear. Examples of conditions are:
AGE LE 15 AND WORKSTAT EQ 1
NOT (SEX = 1 OR NISTAMP < 2)
MARSTAT EQ 1 OR MARSTAT >= 3
Examples of IF statements are:
IF A > B THEN X=A;
IF MONTH >= 3 AND MONTH LT 6 THEN SEASON = 'SPRING';
IF EXPENSES GT (EARNINGS - TAX) THEN DEBT = 1;
IF A > B THEN X=A;
ELSE X=B;
The above examples have shown only assignment statements following THEN and ELSE, but other statements can also be used, for example DELETE (which omits the case from the dataset) or PUT (which can print values; see page 8). For example, if you already had a dataset and wished to set up another one which only contained men over retirement age, then you might have
IF SEX NE 'M' OR AGE LT 65 THEN DELETE;
or
IF NOT (SEX EQ 'M' AND AGE GE 65) THEN DELETE;
A special value used to indicate missing values is the full stop (which can also be used in your data for the same purpose). Suppose your data had been prepared with ‘9’ indicating a missing value for the variable MARSTAT (which is marital status). You could replace this with a missing value symbol by:
IF MARSTAT = 9 THEN MARSTAT = . ;
To eliminate cases with important data missing:
IF INCOME = . AND EXPENSES = . THEN DELETE;
Giving labels to variables - the LABEL statement
The LABEL statement allows you to define labels for variables, which will be used by various procedures to document the output. A label may be up to 40 characters long, for example:
LABEL INCOME='ANNUAL INCOME INCLUDING STATE BENEFITS';
Displaying data - the PUT statement
The PUT statement enables you to print out values as the DATA step is being processed. It has equivalent styles to the three forms of the INPUT statement (list, columns, and formatted). Only simple ways of using PUT are described here.
PUT - list style
This can be simply a list of the variable names whose values you wish to be printed, for example:
PUT NAME AGE MARSTAT;
in which case the values are printed with one space between each value, and each case starts on a new line.
If you follow the name of the variable with an equals sign, the value is labelled with the name of the variable, for example:
PUT REF= AGE=;
The output from this statement is:
REF=103 AGE=56
If a value is missing a full stop is printed to represent it.
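A minimal sketch of this labelled style, including a missing value, is shown below. The dataset name CHECK and the data are invented for illustration.

```sas
DATA CHECK;
  INPUT REF AGE;
  /* Each value is printed with the name of its variable */
  PUT REF= AGE=;
  CARDS;
103 56
104 .
RUN;
```

This prints one line per case: REF=103 AGE=56 for the first case and REF=104 AGE=. for the second, where the full stop shows that AGE is missing.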
You may also print text with PUT. For example, to check the validity of the data and print an error message if a mistake is found:
IF AGE LT 16 AND WORK EQ 1 THEN PUT 'UNDER AGE ' NAME AGE= WORK=;
PUT - column style
To lay out the data in regular columns use column style. For example:
PUT CASE_NO 1-8 HEIGHT 11-14 .2 WEIGHT 16-19 .1;
prints the case number in columns 1 to 8, the height in columns 11 to 14 with a decimal point in column 12, and the weight in columns 16 to 19 with a decimal point in column 18.
You can print text by specifying how many blanks are to be left between the last field printed and the text. For example:
IF NRUNS GE 100 THEN PUT BATSMAN 1-20 +2 'SCORED A CENTURY';
prints the name in the first 20 columns, skips 2 columns and then prints the text.
PUT - formatted style
In this style the name of each variable is followed by the name of the format in which it is to be printed. This format may be a standard SAS format or one which you have defined (see the description of PROC FORMAT in the section on page 47). For example, using the standard currency formats, the statement:
PUT DEBTS DOLLAR7.2 ASSETS DOLLAR9.2;
produces:
$130.00 $245.45
PUT - mixed style
In the same way that styles can be mixed with INPUT you can mix styles in PUT statements. For example:
PUT COMPANY 5-30 +1 'HAS LOW ASSETS ' FUNDS DOLLAR8.2;
Controlling output format - the FORMAT statement
The FORMAT statement associates a format with a variable for printing. The association lasts until the session ends, not just for the DATA step. If you use your own formats (declared using PROC FORMAT, see the section on page 47) they must have been declared before they are used. Examples of FORMAT statements are:
FORMAT HEIGHT 4.2;
FORMAT WEEK_PAY DOLLAR6.2;
FORMAT FILMYEAR ROMAN12.;
The last example will print FILMYEAR in Roman numerals allowing 12 spaces for the value.
Executing instructions - the RUN statement
The RUN statement is used to end both DATA steps and PROC steps, and shows that the statements in the step are complete and should be executed. The format of the statement is simply:
RUN;
It is not essential to end a DATA step with RUN because the step is executed when SAS meets a DATA or PROC statement, but it is certainly tidier to use RUN especially when typing commands at the terminal.
Recoding data
Sometimes you may wish to group data or do some relabelling of values. This can be done by a series of IF statements, but can also be done quite conveniently with a format and the PUT function. See page 48 for details.
Example DATA steps
Using data within SAS statements and list input
DATA ONE;
INPUT REF_NUM SEX $ AGE HEIGHT WEIGHT;
LABEL HEIGHT='HEIGHT IN METRES';
LABEL WEIGHT='WEIGHT IN KILOGRAMS';
CARDS;
101 M 31 1.88 82
102 F 26 1.6 60
103 M 24 1.9 75.5
150 M 38 1.87 76
RUN;
Using data within SAS statements and column input
DATA SURVEY;
INPUT CASENO 1-6 SEX $ 7 AGE 8-10 MARSTAT $ 11 INCOME 12-18 .2;
PAYPERYR = INCOME/AGE;
PUT CASENO PAYPERYR INCOME AGE;
LABEL MARSTAT='MARITAL STATUS';
CARDS;
000001M026S0741200
000002F056S2568000
100247M092M0403909
RUN;
For an example of reading data from a file, see Reference A or B.
The SAS PROC step
The PROC step starts with a PROC statement and ends with RUN (or by meeting a DATA or PROC statement). There are many varieties of PROC statement, each one providing a different SAS facility. The following sections describe specific PROC statements, but some of the statements which can occur in any PROC step are described in this section.
Analysing data in subgroups - the BY statement
A procedure can produce analyses for subgroups rather than for the whole data if a BY statement is included in the PROC step and the data is sorted on the variable or list of variables specified (for details of how to sort data see the description of SORT on page 46). For example, to produce separate mean values of income for men and women use the procedure MEANS, and include the statement:
BY SEX;
within the PROC step. To produce tables for men and women in different age groups, use:
BY SEX AGE_GRP;
Controlling output format - the FORMAT statement
The FORMAT statement gives a format for printing to variables used in the PROC step. It has the same layout as the FORMAT statement used in a DATA step, but while the DATA step associates a format with a variable for the whole SAS session, its use in a PROC step associates the format with the variable only for the duration of that step. An example is:
FORMAT INCOME DOLLAR7.2;
Printing titles on output from procedures - the TITLE statement
The TITLE statement prints a title on the output from a procedure. It can appear anywhere but is most useful in a PROC step. The title can be several lines long. The first line can be numbered 1 or can be blank as you wish, but any following lines must be numbered. For example, a one-line title could be either:
TITLE 'ANALYSIS OF ANTIGEN LEVELS';
or:
TITLE1 'ANALYSIS OF ANTIGEN LEVELS';
If there are several lines in the title, the second and subsequent lines must be numbered. For example:
TITLE 'ATTITUDES TO OUTPATIENT CARE';
TITLE3 'DELAY IN RECEIVING APPOINTMENTS';
This would give a title of three lines (the second line, TITLE2, is assumed to be blank). You can redefine TITLE3 later without changing TITLE1; a new TITLE statement suppresses only that numbered line and any lines with higher numbers.
If you are using a graphics device for output then there are many extra options for this statement.
Printing footnotes - the FOOTNOTE statement
The FOOTNOTE statement prints notes at the foot of the output page. It can appear anywhere but is most useful in a PROC step. Like a title, a footnote can be several lines long. The first line can be numbered 1 or can be blank as you wish, but any subsequent lines must be numbered. For example, a one-line footnote could be either:
FOOTNOTE '1985 figures, Pounds Sterling';
or:
FOOTNOTE1 '1985 figures, Pounds Sterling';
If there are several lines in the footnote the second and subsequent lines must be numbered. For example:
FOOTNOTE 'Data obtained from official sources';
FOOTNOTE3 'Estimated 12% underreporting';
This would give a footnote of three lines (the second line, FOOTNOTE2, is assumed to be blank). You can redefine FOOTNOTE3 later without changing FOOTNOTE1; a new FOOTNOTE statement suppresses only that numbered line and any lines with higher numbers.
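The statements described in this section can be combined in a single PROC step. The sketch below is an illustration only: the dataset PAY and its values are invented, and it uses the SORT and PRINT procedures, which are covered in ‘Miscellaneous useful procedures’.

```sas
DATA PAY;
  INPUT CASENO SEX $ INCOME;
  CARDS;
1 M 7412
2 F 25680
3 M 4039.09
RUN;

/* The data must be sorted on SEX before BY SEX can be used */
PROC SORT DATA=PAY;
  BY SEX;
RUN;

PROC PRINT DATA=PAY;
  BY SEX;
  VAR CASENO INCOME;
  FORMAT INCOME DOLLAR10.2;   /* applies to this step only */
  TITLE 'INCOME BY SEX';
  TITLE2 'HYPOTHETICAL DATA';
  FOOTNOTE 'Figures invented for illustration';
RUN;
```

The output is printed separately for each value of SEX, with the two title lines at the top of each page and the footnote at the bottom.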
Descriptive statistics
This section describes some of the procedures in SAS for descriptive and other statistics. These are FREQ, MEANS, UNIVARIATE, CORR and TABULATE.
Frequency tables and crosstabulations - the FREQ procedure
The FREQ procedure produces tables: one-way, two-way, three-way, etc. One-way tables are normally called frequency tables, while tables of two or more ways are often called crosstabulations. The PROC step for FREQ starts with the statement:
PROC FREQ options;
The options may be omitted entirely:
PROC FREQ;
The option DATA= specifies which dataset is to be used (if omitted the most recently created dataset is used). For example:
PROC FREQ DATA=LIB83.GNPFIGS;
The TABLES statement
The TABLES statement specifies which variables are to be analysed and the sort of tables to be produced. It has the form:
TABLES table-requests / options ;
If no options are specified the / is not required. For one-way tables, give the name of the variable or variables required. For multi-way tables list the variables required separated by asterisks, for example:
AGE_GRP*SEX
or
INC_GRP*MARSTAT*CITY
The TABLES statement has shorthand forms for specifying tables. For example:
HEIGHT -- EXAMS
specifies all the variables from HEIGHT to EXAMS (inclusive) in the dataset.
QN32*(QN01 QN02 QN03)
specifies the three tables QN32*QN01, QN32*QN02 and QN32*QN03.
QUALS*(LABVOTE -- SDPVOTE)
combines the two shorthand methods.
If no options are given, the content of a table is the frequency, the percentage of the total number in the table, the percentage of the number in the row, and the percentage of the number in the column (the last two are only printed for crosstabulations). The content can be changed by specifying options in the TABLES statement. Some useful options are NOPERCENT, which suppresses printing of overall percentages; NOROW, which suppresses row percentages, and NOCOL, which suppresses column percentages.
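The effect of these options can be tried on a small dataset. The sketch below is an illustration only; the dataset HOMES and its values are invented.

```sas
DATA HOMES;
  INPUT HOUSING $ REGION $;
  CARDS;
OWNED NORTH
RENTED NORTH
OWNED SOUTH
OWNED SOUTH
RENTED SOUTH
RUN;

PROC FREQ DATA=HOMES;
  /* One-way frequency table */
  TABLES HOUSING;
  /* Two-way table without row and column percentages */
  TABLES HOUSING*REGION / NOROW NOCOL;
RUN;
```

The second table should show only the frequency and the overall percentage in each cell, because NOROW and NOCOL suppress the other two figures.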
Examples
¤ PROC FREQ;
  TABLES MARSTAT NUM_KIDS;
  TABLES HOUSING*REGION / NOROW NOCOL;
  TABLES INJURIES*(SHIFT MONTH);
  RUN;
¤ PROC FREQ;
  TABLES AREA HOUSING VAR01 - VAR10;
  RUN;
¤ PROC FREQ DATA=RENTED;
  TABLES AREA*HOUSING / NOPERCENT;
  TABLES REPAIRS*TENURE;
  RUN;
¤ To obtain frequencies for the number of children (NOOFCH) and a crosstabulation of SEX and marital status (MARSTAT), type:
  PROC FREQ;
  TABLES NOOFCH SEX*MARSTAT;
  RUN;
This could produce the following output:
Descriptive statistics - the MEANS procedure
The procedure MEANS prints the mean, standard deviation, and maximum and minimum values of a variable. The format of the MEANS statement is:
PROC MEANS options;
The option which is most likely to be required is DATA= to specify the dataset to be used, for example:
PROC MEANS DATA=OLD_DATA;
As is usual, the most recently created dataset is used if no DATA= option is specified.
The option MAXDEC= may also be useful as it specifies how many decimal places (0 to 8) are to be printed in the results. For example:
PROC MEANS MAXDEC=4;
PROC MEANS DATA=INT_DATA MAXDEC=0;
You can also specify the statistics to be produced by MEANS.
The VAR statement
The VAR statement specifies the variables which are to be analysed, for example:
VAR AGE INCOME HEIGHT WEIGHT;
If no VAR statement is used all the numeric variables are processed.
Examples
¤ PROC MEANS; VAR HEIGHT WEIGHT; RUN;
¤ PROC MEANS DATA=BPAIN MAXDEC=4; VAR C7HGHT ILIAC CHEST; RUN;
¤ To obtain the mean values for AGE and INCOME, type:
PROC MEANS; VAR AGE INCOME; RUN;
This could produce the following output:
N Obs  Variable    N    Minimum    Maximum       Mean    Std Dev
   20  AGE        17      18.00      39.00      29.00       6.67
       INCOME     19    3900.00    9800.00    6820.50    1773.43
There are three missing values for AGE (the complete dataset has 20 cases) and one for INCOME.
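The way N counts only the non-missing values can be seen in a small sketch. The dataset AGES and its values are invented for illustration.

```sas
DATA AGES;
  INPUT AGE INCOME;
  CARDS;
18 3900
. 6500
39 9800
30 .
RUN;

PROC MEANS DATA=AGES MAXDEC=2;
  VAR AGE INCOME;
RUN;
```

The dataset has four cases, but N should be reported as 3 for both AGE and INCOME because each variable has one missing value. The mean of AGE is calculated from the three values present, (18 + 39 + 30) / 3 = 29.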
Statistics about a single variable - the UNIVARIATE procedure
The UNIVARIATE procedure can provide very detailed statistics on a variable as well as plots to illustrate the distribution of values. The statistics produced include the mean, sum, standard deviation, variance, maximum, minimum, median, mode, quartiles, percentiles, and the five highest and lowest values.
The PROC UNIVARIATE statement has the form:
PROC UNIVARIATE options;
Useful options are:
FREQ 
produces a frequency table giving the frequency, percentage and cumulative percentage for each value. 
NORMAL 
tells SAS to test if the distribution of the variable is close to a Normal (Gaussian) distribution. It is sometimes important to know whether a distribution is very different from Normal, as several statistical techniques give misleading results on such variables. 
PLOT 
gives information on whether the variable is normally distributed, by drawing a Normal probability plot and a bar chart. 
DATA= 
specifies which dataset is to be analysed if you do not wish to use the most recently created one. 
Example statements are:
PROC UNIVARIATE;
PROC UNIVARIATE DATA=ORIGDATA PLOT FREQ;
PROC UNIVARIATE FREQ NORMAL;
The VAR statement
The VAR statement specifies which variables are to be analysed. For example:
VAR INCOME;
VAR MALE_POP FEML_POP OAP_POP RATEABLE AREA;
The variables must be numeric. If the VAR statement is omitted all numeric variables in the dataset are analysed.
Examples
¤ PROC UNIVARIATE; VAR INCOME; BY SEX; RUN;
¤ PROC UNIVARIATE DATA=UK_82 PLOT; TITLE ANALYSIS OF MONTHLY FIGURES; VAR IMPORTS EXPORTS EMIGRATE; RUN;
¤ You can use UNIVARIATE to test whether the distribution of AGE is Normal. The statements:
PROC UNIVARIATE NORMAL; VAR AGE; RUN;
produce the output:
The output shows that the distribution is not significantly different from Normal and that the mean is significantly different from zero.
Correlations - the CORR procedure
The CORR procedure calculates the correlation between variables. It uses the product-moment (Pearson) definition of correlation, which is not appropriate for some types of variable, or Spearman’s and Kendall’s definitions which are more suitable for rankings and positions. Basic statistics like the mean are also printed for the variables used. The PROC CORR statement has the form:
PROC CORR options;
If no options are specified the most recently created dataset is used and Pearson correlations are calculated. To change the dataset used specify the DATA= option. To request a different correlation coefficient use the SPEARMAN or KENDALL option. These can be used in combination with each other and with PEARSON. Examples of PROC CORR statements are:
PROC CORR;
PROC CORR DATA=FRENCH KENDALL SPEARMAN;
PROC CORR PEARSON SPEARMAN KENDALL;
The VAR statement
There are different ways of specifying the correlations you wish to calculate. If you use a VAR statement and not a WITH statement (see below) coefficients are printed for all possible pairs of variables in the list. For example:
VAR A B C;
gives the correlations between A and B, B and C and A and C.
Omitting the VAR statement is equivalent to including one with all the numeric variables in the set specified.
The WITH statement
If the WITH statement is used it modifies the way in which the VAR statement is obeyed. The variables in the VAR statement are treated as one list and those in the WITH statement as another, and coefficients are calculated for all pairs, taking one from each list. For example:
VAR AGE HEIGHT WEIGHT; WITH LIFT1 LIFT2 LIFT3;
produces the correlation of AGE with LIFT1, LIFT2 and LIFT3; the correlation of HEIGHT with LIFT1, LIFT2 and LIFT3; and the correlation of WEIGHT with LIFT1, LIFT2 and LIFT3.
Examples
¤ PROC CORR; VAR ENGLISH MATHS PHYSICS; RUN;
¤ PROC CORR DATA=OPINION KENDALL SPEARMAN; VAR SOCGROUP; WITH THEATRE CINEMA FOOTBALL CONCERTS; RUN;
¤ A correlation of AGE and INCOME can be obtained by typing:
PROC CORR; VAR AGE INCOME; RUN;
which would produce the following result:
VARIABLE    N        MEAN     STD DEV         SUM     MINIMUM     MAXIMUM
AGE        22     41.0909     18.0262      904.00     19.0000     80.0000
INCOME     24   5799.9583   2547.5247   139199.00   1750.0000   9754.0000

PEARSON CORRELATION COEFFICIENTS / PROB > |R| UNDER H0:RHO=0 / NUMBER OF OBSERVATIONS

              AGE     INCOME
AGE       1.00000    0.33609
           0.0000     0.1363
               22         21

INCOME    0.33609    1.00000
           0.1363     0.0000
               21         24
Complex tables - the TABULATE procedure
The TABULATE procedure produces tables and gives far more control over their layout than the FREQ procedure (see page 12). The entries in the tables can be means, standard deviations etc, rather than just counts. The options to the PROC TABULATE statement include the usual DATA=. Another important option is FORMAT, which defines how values are to be printed in the tables. For example, the statement:
PROC TABULATE FORMAT=6.3;
allows two spaces before the decimal point, one for the decimal point and three after it (making six in all). If no format is specified, it is assumed to be 12.2, that is twelve spaces for the values with nine places before the decimal point and two after it.
The CLASS statement
The CLASS statement specifies the variables which will be used to define the rows and columns of tables. For example:
CLASS SEX AGEGRP REGION;
The VAR statement
The VAR statement specifies the variables which will be used to form the entries in the cells of the tables. For example:
VAR AGE INCOME;
The TABLE statement
The TABLE statement can be extremely complex, and only some of the possible specifications are described here. Any variable appearing in a TABLE statement must have appeared in a preceding CLASS or VAR statement.
The simplest sort of table is like those produced by PROC FREQ. For example:
TABLE SEX, REGION;
produces a two-way table showing the frequency of each combination of SEX and REGION. Note the use of a comma rather than an asterisk.
TABLE SEX RACE, REGION;
produces a frequency table of SEX by REGION with a table of RACE by REGION joined to the bottom.
TABLE REGION, SEX RACE;
produces tables of SEX and RACE side by side. By using the FORMAT= option to reduce the width of the columns you can put several crosstabulations side by side.
To produce marginal totals use the keyword ALL. For example:
TABLE(REGION ALL), SEX RACE;
gives totals by adding each region together.
TABLE(REGION ALL), (SEX ALL RACE ALL);
gives totals for everything.
To produce percentages rather than the original counts use:
TABLE REGION, (SEX*PCTN RACE*PCTN);
A comma starts a new level of the table; an asterisk starts a nesting. The statement:
TABLE REGION, SEX*RACE;
gives a table with each row representing a region. Each row contains a count of the people of each sex, split into racial groups:
           S1              S2
RG1    R1  R2  R3      R1  R2  R3
RG2    R1  R2  R3      R1  R2  R3
As well as arranging tables into a concise form, TABULATE can display statistics. For example:
TABLE (AGE*MEAN INCOME*MEAN), REGION;
shows the means for AGE and INCOME for each region. Other statistics which may be requested include:
STD 
for standard deviation 
MIN 
for minimum 
MAX 
for maximum 
SUM 
for total 
PCTSUM 
for the percentage of the sum of values 
PCTN 
for percentages, as shown above 
A table request like:
TABLE (INCOME*MAX), AGEGRP, SEX;
includes the highest income for all combinations of AGEGRP and SEX in the output.
Examples
¤ PROC TABULATE; CLASS REGION SEX MARSTAT; TABLE (SEX ALL MARSTAT ALL),REGION; RUN;
¤ PROC TABULATE FORMAT=6.2; CLASS REGION SEX MARSTAT; TABLE (SEX*PCTN MARSTAT*PCTN),REGION; RUN;
¤ PROC TABULATE; CLASS REGION; VAR AGE INCOME; TABLE (AGE*MEAN INCOME*MEAN), REGION; RUN;
¤ PROC TABULATE; CLASS SEX MARSTAT; VAR AGE; TABLE (AGE*MEAN), MARSTAT, SEX; RUN;
This last example produces the following output:
Student’s T-Test
The TTEST procedure tests whether two groups have the same mean value for a particular variable. The t-test was devised by an author who wrote under the pseudonym Student. Note that the other use for Student’s t-test, the comparison of the means of two variables (known as a paired t-test), must be done in a different way (see page 23 for details). The PROC TTEST statement has the form:
PROC TTEST options;
As usual the option DATA= specifies a dataset other than the one most recently created.
The CLASS statement
The CLASS statement specifies the variable identifying the groups to be compared. Since the procedure can only deal with two groups, the variable must have only two values. You must specify a CLASS statement.
The VAR statement
The VAR statement specifies the variables on which the test is to be carried out. If you specify more than one variable a t-test is performed on each. If this statement is omitted a t-test is performed on all the numeric variables in the dataset except the one specified in the CLASS statement.
Example
Suppose you are comparing the crop obtained from tomato plants, some of which have been treated with a fertiliser. The yield of tomatoes is in a variable CROP; the variable FERTIL contains 1 if no fertiliser was used and 2 if it was. To perform a t-test on the two groups:
DATA TOMS; INPUT CROP FERTIL; CARDS;
12.3 1
11.6 1
14.5 2
RUN; PROC TTEST; CLASS FERTIL; VAR CROP; RUN;
SAS gives the mean and other information for each group as well as the t value, the degrees of freedom, the significance assuming unequal variances, and the significance assuming equal variances. In each case the test is a two-sided one. Following the table in which these values appear is the result of an F test on the equality of the variances. The output from the above statements is:
TTEST PROCEDURE
VARIABLE: CROP
FERTIL   N          MEAN      STD DEV    STD ERROR   MINIMUM   MAXIMUM
1        6   12.35000000   0.63482281   0.25916533     10.11     14.20
2        5   14.36000000   0.53665631   0.24000000     11.22     15.31

VARIANCES         T     DF   PROB > |T|
UNEQUAL     -5.6905    9.0       0.0003
EQUAL       -5.5957    9.0       0.0003

FOR H0: VARIANCES ARE EQUAL, F'= 1.40 DF=(5,4) PROB > F'= 0.7669
Five plants were treated with fertiliser out of the eleven used. The means are significantly different whether equal or unequal variances are assumed. The F test shows that the variances are not significantly different.
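The T and F values in this listing can be checked by hand from the group sizes, means and standard deviations. The DATA step below is only a sketch of that arithmetic using the standard two-sample formulas; the variable names (SP2, T_EQUAL, T_UNEQ) are invented for illustration and are not part of PROC TTEST.

```sas
* Recompute the test statistics from the summary values in the listing;
DATA _NULL_;
  N1 = 6;  M1 = 12.35;  S1 = 0.63482281;
  N2 = 5;  M2 = 14.36;  S2 = 0.53665631;
  * Pooled variance, used when equal variances are assumed;
  SP2 = ((N1-1)*S1**2 + (N2-1)*S2**2) / (N1+N2-2);
  T_EQUAL = (M1-M2) / SQRT(SP2*(1/N1 + 1/N2));
  * Separate variances, used when unequal variances are assumed;
  T_UNEQ = (M1-M2) / SQRT(S1**2/N1 + S2**2/N2);
  * Ratio of the larger variance to the smaller;
  F = MAX(S1,S2)**2 / MIN(S1,S2)**2;
  PUT T_EQUAL= T_UNEQ= F=;
RUN;
```

The values written to the log are about -5.60, -5.69 and 1.40, agreeing in size with the listing. T is negative because the mean of the first group is subtracted from by the larger mean of the second group's counterpart, that is, M1 is smaller than M2.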
Paired T-Test
The TTEST procedure cannot test for two variables having the same mean (a paired t-test). However, this test can be done using the MEANS procedure, which can test if the mean of a variable is zero. One variable is subtracted from the other and the difference is tested to see if its mean is zero. If it is not significantly different from zero, the two variables do not have significantly different means. The following statements illustrate the procedure:
DATA TS; INPUT TEST1 TEST2;
DIFF=TEST2 - TEST1; CARDS;
34 45
36 44
57 62
RUN; PROC MEANS MEAN T PRT; VAR DIFF; RUN;
The options MEAN, T and PRT print the mean, the t value for the test of the mean being zero, and the corresponding probability. The output is:
Analysis Variable : DIFF
N Obs         MEAN       T   PROB > |T|
   20   6.85714286    8.45       0.0001
This shows that the variables have means which are significantly different.
Drawing simple diagrams with SAS
Using the CHART and PLOT procedures, SAS can draw pictures on the screen which can then be printed on a printer. The procedures GCHART, GPLOT, GMAP, etc produce higher-quality pictures but require special facilities in order to produce a copy on paper. They are described on page 28.
Bar charts - the CHART procedure
The CHART procedure draws vertical or horizontal bar charts (histograms), pie diagrams, block charts and star charts. They all give a visual appreciation of your data which may help you understand it better. Only the method of producing histograms and pie charts is described here. The PROC CHART statement has the form:
PROC CHART options;
The options include DATA= to specify the dataset to be used if you do not wish to use the most recently created dataset. Example statements are:
PROC CHART;
PROC CHART DATA=OLDSTATS;
Any number of charts may be requested within the same CHART procedure (see the examples on page 25).
The VBAR statement
To produce a vertical bar chart (a histogram with the bars drawn vertically) specify the variables to be used with a VBAR statement, which has the form:
VBAR variable-list / options;
If no options are specified the / is omitted. The options available include DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING and NOZEROS.
DISCRETE 
draws a bar for each value of the variable. If you do not use this option the range of values is divided into groups by automatic choice of midpoints or by your own choice (see MIDPOINTS below) and a bar is drawn for each subrange. 
MIDPOINTS=values 
specifies the midpoints of the intervals into which the distribution is to be split. For example MIDPOINTS=10 50 100 produces three bars: one for values below 30, one for values between 30 and 75, and one for values over 75 (30 and 75 being the boundaries halfway between adjacent midpoints). If the MIDPOINTS option is not specified, SAS splits the range of values into a number of intervals.
SUMVAR=variable 
means that the bars represent the sum of the variable ‘variable’ for cases with that value of the VBAR variable. For example, VBAR CITY/ SUMVAR=INCOME; gives a chart with bars representing the total income for each city in the data. If TYPE=MEAN is also specified the mean value of ‘variable’ is used instead of the sum. 
TYPE=type 
specifies what the bars are to represent and has several different choices. The one of most interest is MEAN when it is used with SUMVAR. 
NOZEROS 
omits entries for empty categories, avoiding gaps in the chart. 
MISSING 
treats missing values as a valid category and draws a bar for them. 
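The boundaries quoted for MIDPOINTS=10 50 100 lie halfway between adjacent midpoints. As a small worked check (a sketch of the arithmetic only, not something PROC CHART requires; the names B1 and B2 are invented), a DATA step can compute them:

```sas
* Boundaries between bars lie halfway between adjacent midpoints;
DATA _NULL_;
  B1 = (10 + 50) / 2;
  B2 = (50 + 100) / 2;
  PUT B1= B2=;
RUN;
```

This writes B1=30 B2=75 to the log, matching the boundaries given in the description of MIDPOINTS.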
Examples of VBAR statements are:
VBAR MARSTAT/MISSING NOZEROS;
VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000 20000;
VBAR GNP EXPORTS IMPORTS;
The HBAR statement
The HBAR statement is just like the VBAR statement except that it produces histograms with the bars horizontal rather than vertical. The options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE and NOZEROS all apply as described above. Examples are:
HBAR HEIGHT WEIGHT;
HBAR SEX/SUMVAR=ACCIDENT;
The PIE statement
The PIE statement draws a pie chart which illustrates the relative frequency of values by presenting them as slices of a cake or pie. The statement format is:
PIE variable-list / options;
If no options are requested the / is omitted. The options DISCRETE, MIDPOINTS, SUMVAR, TYPE and MISSING all apply (as described on page 24) but NOZEROS does not. An example is:
PIE DEPT AREA REGION;
If neither DISCRETE nor MIDPOINTS is specified the pie has three slices.
Examples
¤ PROC CHART; VBAR NOOFSONS / MIDPOINTS= 1 2 3 4; PIE MARSTAT / MISSING DISCRETE; RUN;
¤ PROC CHART DATA=SALES; HBAR MONTH/SUMVAR=VALUE DISCRETE; HBAR AGENCY REGION; RUN;
¤ The following statements produce a horizontal bar chart for SEX and a vertical bar chart for MARSTAT:
PROC CHART; HBAR SEX/DISCRETE; VBAR MARSTAT/DISCRETE; RUN;
The output is:
FREQUENCY BAR CHART

SEX                                             FREQ   CUM.   PERCENT     CUM.
                                                       FREQ            PERCENT
1    |****************************************    10     10     50.00    50.00
2    |****************************************    10     20     50.00   100.00
     ----------+---------+---------+---------+
              2.5        5        7.5       10
                     FREQUENCY
Scattergrams - the PLOT procedure
The PLOT procedure produces a plot of one variable against another. Such diagrams are known as scatterplots, scattergrams or scatter diagrams, as they show the scatter of the cases in the sample. The PROC PLOT statement has the form:
PROC PLOT options;
The most important option is DATA= which specifies a dataset other than the one most recently created.
The PLOT statement
The PLOT statement specifies which variables are to be plotted against each other. Its format is:
PLOT plot-requests / options;
If no options are specified the / is omitted.
A plot request can have several parts. The simplest form is ‘var*var’, for example AGE*EXAM meaning a plot with AGE on the vertical axis and EXAM on the horizontal axis.
A point is marked by a letter, which shows how many cases lie on that point (to within the accuracy of the plot and given the size of the character indicating the point). A indicates one case, B means two cases, up to Z which indicates 26 or more cases at that point.
To specify a symbol to mark the points instead of letters, use var*var='symbol'. For example:
Y*X='.'
causes each point to be marked by a full stop regardless of how many cases are represented there.
You can produce a sort of three-dimensional plot by marking each point with the values of another variable, by specifying var*var=var. For example:
CONTENT*INCOME=MARSTAT
prints the value of MARSTAT (which may be numeric or character) at each point on the plot of CONTENT by INCOME. If more than one case is mapped to the same point the value of the first case is used. Note that only the first character is used from the value of the variable, and so if the values of MARSTAT were SINGLE, MARRIED and SEPARATED, you could not distinguish between SINGLE and SEPARATED.
Several plots can be specified in the same PLOT statement, for example:
PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;
A useful option is OVERLAY, which causes several plots to be produced on the same axes so that they can be compared, for example:
PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;
Examples
¤ PROC PLOT; PLOT MATHS*ENGLISH; RUN;
¤ PROC PLOT DATA=SUNSPOTS; PLOT NUM*MONTH='*'; PLOT NUM*MONTH='+' UHFPROBS*MONTH='-' /OVERLAY; RUN;
¤ PROC PLOT; PLOT INCOME*AGE=SEX; RUN;
gives the following plot of INCOME against AGE, using the value of SEX as the plotting symbol.
The message ‘1 obs hidden’ means that two points coincided (to within the accuracy of the graph). Normally this would be indicated by using a different symbol, but here the symbol is the value of SEX.
High-quality graphics
As well as the simple graphics produced by CHART and PLOT, SAS can draw high-quality graphics with smooth lines, colours and shading. These facilities are documented in the ‘SAS/GRAPH Guide’. See Reference A or B for producing graphics on output devices. This section describes some of the procedures.
Bar charts - the GCHART procedure
The GCHART procedure draws the same sort of pictures as CHART but on graphics devices. Many more options are available because of the extra facilities on a graphics device, and so there are more parts to the PROC step. The PROC GCHART statement has the form:
PROC GCHART DATA=dataset GOUT=dataset;
where the most recently created dataset will be used if DATA= is omitted. The GOUT= option is also optional, and is used to save the graphical output as a dataset which can be redrawn later by the GREPLAY procedure.
The VBAR statement
The VBAR statement specifies the variables for which a vertical bar chart is to be drawn. The general form is:
VBAR variable-list / options;
If no options are requested the / is omitted. The options include DISCRETE, MIDPOINTS, SUBGROUP, GROUP, SUMVAR, TYPE, MISSING, NOZEROS, CAXIS and CTEXT.
DISCRETE 
draws a bar for each value of the variable. If you do not use this option the range of values is divided into groups by automatic choice of midpoints or by your own choice (see MIDPOINTS below) and a bar is drawn for each subrange. 
MIDPOINTS=values 
specifies the midpoints of the intervals into which the distribution is to be split. For example MIDPOINTS=10 50 100 produces three bars: one for values below 30, one for values between 30 and 75, and one for values over 75 (30 and 75 being the boundaries halfway between adjacent midpoints). If the MIDPOINTS option is not specified, SAS splits the range of values into a number of intervals.
SUBGROUP=svar 
divides each bar into sections to show the distribution of ‘svar’ within each category of the variable named on the VBAR statement. 
GROUP=gvar 
draws several bars for each value of the variable mentioned in the VBAR statement, one for each value of ‘gvar’.
SUMVAR=variable 
means that the bars represent the sum of the variable ‘variable’ for cases with that value of the VBAR variable. For example, VBAR CITY/ SUMVAR=INCOME; gives a chart with bars representing the total income for each city in the data. If TYPE=MEAN is also specified the mean value of ‘variable’ is used instead of the sum. 
TYPE=type 
specifies what the bars are to represent and has several different choices. The one of most interest is MEAN when it is used with SUMVAR. 
NOZEROS 
omits entries for empty categories, avoiding gaps in the chart. 
MISSING 
treats missing values as a valid category and draws a bar for them. 
CAXIS=colour
draws the axis in the specified colour.
CTEXT=colour
draws the text on the chart in the specified colour.
Examples of VBAR statements are:
VBAR MARSTAT/MISSING NOZEROS SUBGROUP=SEX;
VBAR INCOME/MIDPOINTS=2500 5000 7500 10000 15000 GROUP=AGE;
The HBAR statement
The HBAR statement is just like the VBAR statement except that it produces histograms with the bars horizontal rather than vertical. The options DISCRETE, MISSING, MIDPOINTS, SUMVAR, TYPE, GROUP, SUBGROUP, NOZEROS, CTEXT and CAXIS all apply as described above. Examples are:
HBAR HEIGHT WEIGHT / GROUP=MARSTAT;
HBAR SEX/SUMVAR=ACCIDENT SUBGROUP=AGE;
The PATTERN statement
With HBAR and VBAR you can specify how the bars are to be coloured or shaded using PATTERN statements. If SUBGROUP= is specified in a HBAR or VBAR statement you may wish to specify the patterns of shading to be used to distinguish the subgroups. Normally each bar is shaded by cross-hatching, and if it is necessary to distinguish groups each is coloured using the colours in order.
If you have two groups and want the first to be a solid blue bar and the second to be black (the background colour), specify:
PATTERN1 V=S C=BLUE; PATTERN2 V=E;
where V=S means use solid colour, V=E means empty, and C= specifies the colour to be used. Other types of pattern which involve hatching of various kinds are described and illustrated in the appropriate SAS/GRAPH Guide.
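As a sketch of how the PATTERN statements combine with SUBGROUP=, the step below charts MARSTAT with each bar divided by SEX. It assumes the most recently created dataset contains MARSTAT and a SEX variable with two values; those assumptions are for illustration only.

```sas
* Sketch only, assuming a dataset with MARSTAT and a two-valued SEX variable;
PATTERN1 V=S C=BLUE;
PATTERN2 V=E;
PROC GCHART;
  VBAR MARSTAT / DISCRETE SUBGROUP=SEX;
RUN;
```

PATTERN1 is applied to the first subgroup and PATTERN2 to the second, so one value of SEX should appear in solid blue and the other as an empty section of each bar.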
The PIE statement
The PIE statement draws a pie chart which illustrates the relative frequency of values by presenting them as slices of a cake or pie. The statement format is:
PIE variablelist / options;
If no options are requested the / is omitted. The options DISCRETE, MIDPOINTS, SUMVAR, TYPE, MISSING and CTEXT all apply (as described on page 30) but SUBGROUP, GROUP, NOZEROS and CAXIS do not. If you do not specify DISCRETE or MIDPOINTS then SAS uses its own method to divide the data. An example is:
PIE DEPT AREA REGION;
The FILL option is used only with the PIE statement. It may be set to X or SOLID:
FILL=SOLID 
means the sectors of the pie chart are to be filled in with solid colour. SAS calculates how many colours are needed and takes them in order from the colours available on the graphics device. If only one colour is available a uniformly shaded disk is drawn. 
FILL=X 
means the sectors are to be filled in by cross-hatching. If colours are available the slices will be coloured. 
Examples
¤ PROC GCHART; VBAR NOOFSONS / MIDPOINTS= 1 2 3 4; PIE MARSTAT / MISSING DISCRETE FILL=X; RUN;
¤ PROC GCHART DATA=SALES; HBAR MONTH/SUMVAR=VALUE DISCRETE; HBAR AGENCY REGION; RUN;
Scattergrams - the GPLOT procedure
The GPLOT procedure plots scatter diagrams with a choice of patterns, filling, and plot symbols, and the option of fitted regression lines of various types. Only some of the facilities are described here. The general form of the PROC GPLOT statement is:
PROC GPLOT DATA=dataset GOUT=dataset UNIFORM;
If DATA= is omitted the most recently created dataset is used. If GOUT= is specified then the picture is saved and may be redrawn by GREPLAY. The UNIFORM option may be useful if you are using the BY statement to plot pictures for several subgroups, because it forces the use of the same scale for all the plots, so that comparisons may be made.
The PLOT statement
The PLOT statement specifies which variables are to be plotted against each other. Its format is:
PLOT plot-requests / options;
If no options are specified the / is omitted.
A plot request can have several parts. The simplest form is just var*var, for example AGE*EXAM, meaning a plot with AGE on the vertical axis and EXAM on the horizontal axis.
A point is marked by a plus sign. If several cases are plotted at the same point (not necessarily identical but the same to within the accuracy of the plot) it appears that cases are missing because there are fewer plus signs on the plot than there are cases.
A SYMBOL statement (see below) can be used to specify another symbol to mark the points instead of a plus sign.
You can produce a sort of three-dimensional plot by marking each point with the values of another variable by specifying var*var=var. For example:
CONTENT*INCOME=MARSTAT
prints the value of MARSTAT (which may be numeric or character) at each point on the plot of CONTENT by INCOME. If more than one case is mapped to the same point the value of the first case is used. Note that only the first character is used from the value of the variable, and so if the values of MARSTAT were SINGLE, MARRIED and SEPARATED, you could not distinguish between SINGLE and SEPARATED.
Several plots can be specified in the same PLOT statement, for example:
PLOT AGE*INCOME HEIGHT*WEIGHT=SEX;
A useful option is OVERLAY, which causes several plots to be produced on the same axes so that they can be compared, for example:
PLOT POP71*CITY='7' POP81*CITY='8' / OVERLAY;
You may also specify CAXIS=colour to specify the colour of the axis, and CTEXT=colour to specify the colour of the text.
In order to overlay plots where the vertical scales are very different you can use PLOT and PLOT2. The horizontal scales must be the same but the right-hand vertical scale is for the variable specified for PLOT2. For example:
PLOT HEIGHT*AGE; PLOT2 WEIGHT*AGE;
The SYMBOL statement
The SYMBOL statement defines the symbols to be used in the plot and specifies whether any regression fitting is to be carried out. A different SYMBOL statement can be included for each plot in the GPLOT procedure. The SYMBOL statement has several optional parts, as described below. To specify the colour of the symbol use C=colour, for example C=BLUE.
To specify the symbol use V=symbol, for example V=1. The letters A to W and the digits 0 to 9 may be used as symbols. There are also special symbols which are represented by such characters as * or <. A table showing how they appear is given in the SAS/GRAPH Guide. If no symbols are to be drawn for the points use V=NONE.
To specify the sort of line to be used use the ‘L=number’ option. If L=1 a solid line is used. There are also 31 types of dotted or dashed lines, which are shown in the SAS/GRAPH Guide.
The interpolation facilities to draw lines connecting the points include the following:
I=JOIN 
connects the points by straight lines. 
I=SPLINE 
uses a cubic spline method to fit a smooth line to the points. 
I=SMxx 
is used when the data is widely spread, so that a normal cubic spline would look very jagged. It fits a smooth curve to the points but, unlike a normal spline curve, the curve may not pass through every point. The value ‘xx’ determines how closely the curve is to be fitted to the points. A value of 1 makes it follow the points quite closely, while a value of 99 produces a smooth curve which may miss many of the points. 
I=Rxxxxxxx 
is used when a regression line is to be fitted to the data. The characters which follow the R are: 
L 
linear regression is to be used 
Q 
quadratic regression is to be used 
C 
cubic regression is to be used 
0 
the regression line is forced through the origin. If you want a constant term in the regression omit this character. If it appears it is the second character. 
CLMnn 
draws lines representing the confidence limits on the regression for the mean predicted values where the confidence limits (nn) may be at 90%, 95% or 99%, for example CLM95. These characters follow the type of regression and the optional constant term. The style of line used for the confidence lines is determined by adding 1 to the line style used for the main plot. For example if the line is drawn with line style 2 (small dashes) the confidence lines are drawn with style 3 (medium dashes). 
CLInn 
draws lines representing the confidence limits on the regression for the individual values, where the confidence limits (nn) may be at 90%, 95% or 99%, for example CLI90. Other details are as for CLMnn above. 
Examples of complete specifications for interpolation are I=RQ, I=RL0, I=RLCLI90 and I=RC0CLM95.
Examples
¤ PROC GPLOT; PLOT TESTA*TESTB; RUN;
¤ PROC GPLOT DATA=NWEST; PLOT POLLUT1*TOWN INCID*TOWN/OVERLAY; SYMBOL1 V=NONE L=2 C=RED I=SPLINE; SYMBOL2 V=NONE L=2 C=BLUE I=SPLINE; RUN;
¤ PROC GPLOT; PLOT DOSE*DAYS; SYMBOL V=* L=1 I=RLCLM95; RUN;
Maps - the GMAP procedure
The GMAP procedure produces maps illustrating the values of variables for the areas on the map. If the map dataset already exists you need to know how the areas are identified. SAS provide maps of the United States and Canada, which are described in the SAS/GRAPH Guide. However, these are not installed automatically so you must check to see if they are installed on the machine you are using. Maps of the counties of the United Kingdom and Ireland are also available. If you do not already have a map in a suitable form then you may need help in converting your map into a dataset. Contact the Computing Service Advisory Service for assistance.
The PROC GMAP statement specifies the map dataset as well as the response dataset containing the data to be shown on the map, for example:
PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN;
If no DATA= option appears the most recently created dataset is used.
The ALL option specifies that all areas in the map are to be drawn even if there is no value for that area. Normally only areas for which data exists are drawn, and the map is scaled to fill the space available.
Four types of map can be drawn: choropleth, surface, block and prism. Examples of each are given in the SAS/GRAPH Guide. Only choropleth maps are described below.
The CHORO statement
The CHORO statement specifies that a choropleth map is to be drawn, and gives the name of the variable to be used. A choropleth map is one where the areas of the map are shaded or coloured to indicate the value of the variable for each area. The form of the CHORO statement is:
CHORO variable-list;
A choropleth map is drawn for each variable specified. Options available with the statement include DISCRETE, LEVELS and MIDPOINTS.

DISCRETE 
means the data is a set of discrete values rather than a continuous variable. Each value is represented separately, unless you have a very large number of values (or have also specified LEVELS or MIDPOINTS). 
LEVELS=n 
means SAS is to divide the data into n+1 groups of the same size, and shade the map accordingly. 
MIDPOINTS=list 
means the data is to be divided at the values specified. You do not have to list every value, for example: 
MIDPOINTS = 10 TO 100 BY 10
The ID statement
The ID statement specifies the variable which ties together the areas and the values of the response variable. The variable must have the same name in the response dataset as in the map dataset. The form is:
ID variable;
For example:
ID CNUM;
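Putting the pieces together, a complete GMAP step might look like the sketch below. The map and response dataset names are those used above; POPDENS is a hypothetical response variable, and the example assumes that options follow a slash on the CHORO statement in the same way as on the VBAR and PLOT statements, and that ALL is given on the PROC GMAP statement.

```sas
* Sketch only, POPDENS is a hypothetical variable in the response dataset;
PROC GMAP MAP=SASUSER.MERSEY DATA=POPLAN ALL;
  ID CNUM;
  CHORO POPDENS / LEVELS=4;
RUN;
```

The ID variable CNUM must be present in both SASUSER.MERSEY and POPLAN so that each area on the map can be matched to its value of POPDENS.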
Regression  the REG procedure
Given a situation in which one or more variables seem to control the behaviour of another (for example, blood pressure given weight, age and
activity), it is possible to build an equation which expresses the relation numerically. This relation is only approximate in any real situation, but you can measure how closely it fits the data and decide whether or not it
is useful for prediction.
The variable whose behaviour you are trying to explain is called the dependent variable, and those variables being used for the explanation (or ‘model’) are called the independent variables. In mathematical terms there is a dependent variable Y which you are trying to predict from the values of independent variables X1, X2, ..., Xn, using the equation:
Y = B0 + B1*X1 + B2*X2 + ... + Bn*Xn + eps
K6.1 (10.90)
page 34
where ‘eps’ is the error involved in using such a simple model. The procedure calculates the values of B0, B1, B2, ..., Bn, so that ‘eps’ is as small as possible (in the least-squares sense) over the known values of Y, X1, X2, ..., Xn.
The values B0, B1 etc are called the parameters of the regression equation. The term B0 is known as the constant term or the intercept and is the value of Y when X1, X2, ..., Xn are all zero.
The values of Y which were recorded when the survey was done or the experiment was performed are known as the observed values. The values you would obtain by putting the values of X1, X2, ..., Xn into the regression equation are called the predicted values. The difference between the predicted value and the observed value is called the residual value. The square of the correlation of the observed values and the predicted values is called the coefficient of determination or just the r square value. It can be regarded as the fraction of the variability of Y explained by the equation.
The PROC REG statement has the form:
PROC REG options;
The DATA= option specifies that a dataset other than the one most recently created is to be used. The SIMPLE option gives simple descriptive statistics on each of the variables used in the procedure. Examples of PROC REG statements are:
PROC REG;
PROC REG DATA=SHELLS;
PROC REG SIMPLE;
PROC REG DATA=SASUSER.SAVED SIMPLE;
The MODEL statement
The variables to be used are specified with the MODEL statement. For example, to express the cost of producing a motor car (variable CARCOST), given the hourly wage rate of workers on the production line (HOURLY), the cost of steel (STEEL) and the price of electricity (ELECTRIC):
MODEL CARCOST = HOURLY STEEL ELECTRIC;
CARCOST is the dependent variable and HOURLY, STEEL and ELECTRIC are the independent variables.
The MODEL statement has several options. For example:
MODEL CARCOST = HOURLY STEEL ELECTRIC / NOINT;
forces the equation to have no constant term, that is the intercept is set to zero and CARCOST is zero when all the other variables are zero, which is not entirely realistic for the model as there are other costs involved in building a car. However if all the sources of expense were included in the model you would expect the cost of the car to be zero when all the factors contributing to the cost were zero.
To check how well the solution fits, you can print the values of the dependent variable along with the value the regression equation predicts, by specifying the P option, for example:
MODEL CARCOST = HOURLY STEEL ELECTRIC / P;
The R option prints extra information indicating whether the predicted values are significantly different from the observed values. This can be useful for spotting unusual cases in the data or for showing a pattern in the residuals indicating that the model has systematic errors, for example that a linear model is not appropriate and one with squares or cubes of values should be used instead.
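As a minimal sketch of how a squared term might be tried when the residuals suggest that a linear model is not appropriate, the statements below create the square of STEEL in a DATA step and fit both models. The dataset names CARS and CARS2 and the variable STEELSQ are hypothetical, and the choice of STEEL as the variable to square is only an illustration.

```sas
/* Hypothetical dataset CARS holding CARCOST, HOURLY, STEEL, ELECTRIC */
DATA CARS2;
  SET CARS;
  STEELSQ = STEEL*STEEL;      /* squared term built in a DATA step */
RUN;

PROC REG DATA=CARS2;
  /* linear model, with predicted values and residual information */
  MODEL CARCOST = HOURLY STEEL ELECTRIC / P R;
  /* same model with the square of STEEL added */
  MODEL CARCOST = HOURLY STEEL STEELSQ ELECTRIC / P R;
RUN;
QUIT;
```

Comparing the residual listings from the two MODEL statements gives an indication of whether the extra term removes the pattern.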
The OUTPUT statement
To analyse the predicted or residual values you must write them to a SAS dataset. Having done that you can use any of the facilities of SAS, especially the graphical ones, to examine them. The OUTPUT statement allows you to write these and other values to a dataset.
The OUTPUT statement must specify the name of a dataset. This can be a permanent dataset (for example SASUSER.PREDVALS) or a temporary one (for example PREDVALS). The information to be written follows the dataset name. The keywords PREDICTED and RESIDUAL (which can be abbreviated to P and R) specify that these values are to be written and give them names. For example:
PROC REG;
  MODEL Y = X Z / NOINT;
  OUTPUT OUT=SAVED P=PY R=RY;
The output dataset contains all the variables from the input dataset (whether or not they were used to calculate the regression equation) as well as the ones specified by P or R. If the regression has multiple dependent variables you must specify predicted and residual variable names for each dependent variable.
Example
The following statements look at the relation of age to income, by saving the predicted values and plotting them to compare with the observed values:
PROC REG;
  MODEL INCOME = AGE;
  OUTPUT OUT=SAVE P=PINCOME;
RUN;
PROC PLOT;
  PLOT (INCOME PINCOME)*AGE / OVERLAY;
RUN;
The output is:
DEP VARIABLE: INCOME

                       ANALYSIS OF VARIANCE

                      SUM OF          MEAN
SOURCE      DF       SQUARES        SQUARE    F VALUE    PROB>F

MODEL        1   15069036.69   15069036.69      2.419    0.1363
ERROR       19     118335771    6228198.45
C TOTAL     20     133404807

    ROOT MSE    2495.636     R-SQUARE    0.1130
    DEP MEAN    6034.476     ADJ R-SQ    0.0663
    C.V.         41.3563

                       PARAMETER ESTIMATES

                 PARAMETER      STANDARD     T FOR H0:
VARIABLE   DF     ESTIMATE         ERROR   PARAMETER=0   PROB > |T|

INTERCEP    1   7958.69044    1351.63099         5.888       0.0001
AGE         1 -47.42781593   30.49099533        -1.555       0.1363
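The summary figures in the listing can be cross-checked from the analysis of variance table. The arithmetic below is an added check using the usual textbook definitions; it is not part of the SAS output. It assumes n = 21 observations (inferred from the 20 total degrees of freedom) and p = 1 independent variable.

```latex
\begin{align*}
R^2 &= \frac{SS_{\mathrm{model}}}{SS_{\mathrm{total}}}
     = \frac{15069036.69}{133404807} \approx 0.1130 \\
F &= \frac{MS_{\mathrm{model}}}{MS_{\mathrm{error}}}
   = \frac{15069036.69}{6228198.45} \approx 2.419 \\
\text{Adj } R^2 &= 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
   = 1 - 0.8870 \times \frac{20}{19} \approx 0.0663 \\
\text{C.V.} &= 100 \times \frac{\text{ROOT MSE}}{\text{DEP MEAN}}
   = 100 \times \frac{2495.636}{6034.476} \approx 41.36
\end{align*}
```

An r square value of 0.1130 means that only about 11% of the variability of INCOME is explained by AGE.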
The PROB > |T| value of 0.1363 for AGE means that the hypothesis that its coefficient is zero cannot be rejected at the 5% level, so AGE is not a very good predictor of INCOME. The plot also shows that the fit is a very poor one:
[Plot of INCOME (observed) and PINCOME (predicted) against AGE, overlaid; figure not reproduced]
Analysis of variance - the ANOVA procedure
The ANOVA procedure is restricted to analysing balanced designs, that is those experiments where there are the same number of replicate observations for each combination of factors. If your data does not satisfy this condition see page 44 for details of a general linear model. ANOVA can deal with one or many response variables and so can do multivariate analysis of variance.
As usual you may specify the dataset to be used with the DATA= option, for example:
PROC ANOVA DATA=GRASSES;
The CLASS statement
The factors in the design are declared using the CLASS statement, for example:
CLASS STRAIN HERBCIDE;
You must have a CLASS statement and it must precede the MODEL statement described below.
The MODEL statement
The MODEL statement specifies the dependent variable (sometimes called the response variable) and how it is thought to be related to the independent variables (the factors). You can specify several dependent variables, in which case SAS treats them together in a multivariate analysis. The specification of the model is more complex than with the REG procedure as you can include interaction effects between variables as well as the variables themselves.
Suppose you have a dependent variable Y with factors A, B and C. To fit only the factors with no interactions type:
MODEL Y = A B C;
To allow an interaction term between B and C, use:
MODEL Y = A B C B*C;
To include all possible interactions, type:
MODEL Y = A B C B*C A*B A*C A*B*C;
Since this is such a common model, SAS allows you to write this in the shorter form:
MODEL Y = A|B|C;
You can specify a mixture of these, for example:
MODEL Y = A B|C|D;
where only the main effect of A is used but the full interactions of B, C and D are required.
If a factor B is nested within another factor A, type:
MODEL Y = A B(A);
This occurs when not all values of B are observed for each value of A, and so you do not have a ‘crossed’ model. For example, if you were comparing teaching methods in different schools then the teachers would only teach in one school, and so any teacher effect would be nested within the school effect.
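The teaching example could be written along the following lines. This is only a sketch: the dataset TEACHING and the variables SCORE, SCHOOL and TEACHER are hypothetical, and the design is assumed to be balanced (the same number of teachers in each school and of pupils for each teacher), as the ANOVA procedure requires.

```sas
/* Hypothetical dataset TEACHING: one observation per pupil, with  */
/* the pupil's SCORE and the SCHOOL and TEACHER identifiers.       */
PROC ANOVA DATA=TEACHING;
  CLASS SCHOOL TEACHER;
  MODEL SCORE = SCHOOL TEACHER(SCHOOL);  /* TEACHER nested in SCHOOL */
RUN;
QUIT;
```

No SCHOOL*TEACHER interaction appears in the model, because each teacher is observed in only one school.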
The MEANS statement
Having established that not all groups have the same mean, you might like to know which groups are different from other groups. This can be done using the MEANS statement. Suppose the MODEL is:
MODEL CROP = VARIETY FIELD VARIETY*FIELD;
To look at the effect of VARIETY in more detail type:
MEANS VARIETY;
The mean and standard deviation for each value of VARIETY are shown. You can also specify various tests to investigate whether these means are significantly different. These include the Scheffe test, Duncan’s test, Tukey’s test, and Least Significant Difference (LSD). To request a Scheffe multiple comparison test on VARIETY, type:
MEANS VARIETY / SCHEFFE;
This will then show in detail how the group means differ.
Example
Suppose a biochemist is interested in the effect of a new herbicide on the mortality of plants. Fifty plants were placed in each of twelve pots containing nutrient solution. After ten days growth, three of the pots were sprayed with herbicide and three were left as controls. After a further ten days three more pots were sprayed and the remaining three designated as controls. Thus two factors were considered - herbicide treatment and age of plant - each treatment combination being replicated three times.
The analysis produces an analysis of variance table assessing the significance of the herbicide spraying, the age of plants and their interaction. The experimenter was also interested in calculating least significant differences for comparison of main effect means. The following statements enter the data and carry out the analysis:
DATA HERB;
  /* DATA INPUT SPECIFYING EACH FACTOR LEVEL EXPLICITLY */
  INPUT AGE HERBICID SURVIVOR;
  CARDS;
1 1 20
1 1 18
1 1 23
1 2 11
1 2 12
1 2 15
2 1 40
2 1 43
2 1 39
2 2 35
2 2 37
2 2 32
PROC ANOVA;
  CLASS AGE HERBICID;
  MODEL SURVIVOR = AGE HERBICID AGE*HERBICID;
  MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;
This produces the following output:
ANALYSIS OF VARIANCE PROCEDURE

T TESTS (LSD) FOR VARIABLE: SURVIVOR
NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
      NOT THE EXPERIMENTWISE ERROR RATE
ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747
MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.

T GROUPING        MEAN      N    AGE
         A      37.667      6    2
         B      16.500      6    1

ANALYSIS OF VARIANCE PROCEDURE

T TESTS (LSD) FOR VARIABLE: SURVIVOR
NOTE: THIS TEST CONTROLS THE TYPE I COMPARISONWISE ERROR RATE,
      NOT THE EXPERIMENTWISE ERROR RATE
ALPHA=0.05 DF=8 MSE=5.33333
CRITICAL VALUE OF T=2.30600
LEAST SIGNIFICANT DIFFERENCE=3.0747
MEANS WITH THE SAME LETTER ARE NOT SIGNIFICANTLY DIFFERENT.

T GROUPING        MEAN      N    HERBICID
         A      30.500      6    1
         B      23.667      6    2

ANALYSIS OF VARIANCE PROCEDURE

MEANS

AGE    HERBICID    N      SURVIVOR
1      1           3    20.3333333
1      2           3    12.6666667
2      1           3    40.6666667
2      2           3    34.6666667
An alternative way of inputting the data using DO loops is shown below.
It saves specifying factor levels individually. The @@ symbol is used to
specify that several cases occur on one line. Note that each case requires
an explicit OUTPUT statement.
DATA HERB;
  /* DATA INPUT SPECIFYING FACTORS AND LEVELS THROUGH LOOPS */
  DO AGE = 1 TO 2;
    DO HERBICID = 1 TO 2;
      DO REPLICAT = 1 TO 3;
        INPUT SURVIVOR @@;
        OUTPUT;
      END;
    END;
  END;
  CARDS;
20 18 23 11 12 15
40 43 39 35 37 32
RUN;
PROC ANOVA;
  CLASS AGE HERBICID;
  MODEL SURVIVOR = AGE HERBICID AGE*HERBICID;
  MEANS AGE HERBICID AGE*HERBICID / LSD;
RUN;
General linear modelling - the GLM procedure
The GLM procedure allows analysis of the general linear model. This views analysis of variance, regression and several other techniques as transformations of the simple model described above for regression. This enables it to carry out many sophisticated analyses which are not available elsewhere in SAS. Although the specification looks very much like the REG or ANOVA procedures, the output is quite different. A very important aspect of GLM is that it can perform an analysis of variance on unbalanced designs. For balanced designs ANOVA is to be preferred.
As usual you may specify the dataset to be used with the DATA= option, for example:
PROC GLM DATA=SASUSER.GRASSES;
The CLASS statement
If variables to be used in the model are to be regarded as factors, that is having a small number of defined categories, they should be specified in
a CLASS statement. If they do not appear in a CLASS statement SAS
assumes that a regression type model is appropriate, which is not true for factors. If the CLASS statement is used then it must precede the MODEL statement.
If none of the variables in the model appear in a CLASS statement, regression is being used. If all the variables in the model appear in a CLASS statement, analysis of variance is being used. If some but not all of the variables in the model appear in a CLASS statement, analysis of covariance is being used.
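The three cases can be sketched as follows. The dataset TRIAL and the variables YIELD, VARIETY (a factor) and RAINFALL (a continuous measurement) are hypothetical names used only for illustration.

```sas
/* Regression: no CLASS statement, RAINFALL treated as continuous */
PROC GLM DATA=TRIAL;
  MODEL YIELD = RAINFALL;
RUN;

/* Analysis of variance: every model variable is in CLASS */
PROC GLM DATA=TRIAL;
  CLASS VARIETY;
  MODEL YIELD = VARIETY;
RUN;

/* Analysis of covariance: VARIETY is a factor, RAINFALL is not */
PROC GLM DATA=TRIAL;
  CLASS VARIETY;
  MODEL YIELD = VARIETY RAINFALL;
RUN;
QUIT;
```

In each step the CLASS statement, where present, precedes the MODEL statement.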
The MODEL statement
The MODEL statement for GLM is essentially the same as for ANOVA in the way that it describes the model. However the options which can be specified are not identical. As with REG the NOINT option specifies that the model is to have no constant term. Similarly the P option asks for information to be produced on the predicted and residual values. For example:
MODEL X3 = X1 X2 / NOINT P;
There is no R option. You can specify several dependent variables to carry out a multivariate analysis.
The ID statement
If the P option is specified in the MODEL statement you may wish to label the cases. The ID statement specifies the variable name to be used to label the observations. For example:
ID NAME;
The MEANS statement
As with ANOVA you can request multiple comparisons on the group means, for example:
MEANS MAKER / DUNCAN;
This shows in detail how the group means differ.
The RANDOM statement
The RANDOM statement specifies that a factor in the model is to be considered as a random effects factor rather than a fixed effects factor. For example, the number of days on which it rained during a trial period may be a very important factor, but it must be regarded as a random effects factor rather than a fixed effects one. If used, the RANDOM statement must appear after the MODEL statement, for example:
CLASS A B;
MODEL Y = A B;
RANDOM B;
The OUTPUT statement
To analyse the predicted or residual values you must write them to a SAS dataset. Having done that you can use any of the facilities of SAS, especially the graphical ones, to examine them. The OUTPUT statement allows you to write these values.
The OUTPUT statement must specify the name of a dataset. This can be a permanent dataset (for example SASUSER.PREDVALS) or a temporary one. The information to be written follows the dataset name. The keywords PREDICTED and RESIDUAL (which can be abbreviated to P and R) specify that these values are to be output and give them names. For example:
PROC GLM;
  MODEL Y = X Z / NOINT;
  OUTPUT OUT=SAVED P=PY R=RY;
The output dataset contains all the variables from the input dataset (whether or not they were used in the model) as well as the names specified by P or R. If the analysis has multiple dependent variables you must specify predicted and residual variable names for each dependent variable.
Example
DATA MILEAGE;
  INPUT MPH MPG @@;
  CARDS;
20 15.4 30 20.2 40 25.7
60 24.8
RUN;
PROC GLM;
  MODEL MPG = MPH / P CLM;
  OUTPUT OUT=PP P=MPGPRED R=RESID;
PROC PLOT DATA=PP;
  PLOT MPG*MPH='A' MPGPRED*MPH='P' / OVERLAY;
RUN;
Miscellaneous useful procedures
This section describes some further procedures which you are likely to need but which do not form a logical group. These are the procedures for sorting data, for declaring formats, and for listing the contents of a dataset.
Sorting a dataset - the SORT procedure
The SORT procedure is used to sort a dataset on one or more variables. It is necessary to have the data sorted if the BY statement is to be used on the dataset (see page 11). The general form of the PROC SORT statement is:
PROC SORT options;
where the options include:
DATA=dataset 
specifies the dataset to be sorted. If it is omitted the most recently created dataset will be used. 
OUT=dataset 
specifies the output dataset. If it is omitted the input dataset is overwritten by the sorted version. 
NODUPLICATES 
specifies that the data is checked after it has been sorted and any exact duplicates are dropped from the output data set. This checking uses all the variables in the data, not just the ones used for the sorting. 
EQUALS 
specifies that the original order of cases is preserved if they have identical sort key values. If EQUALS is not specified then the order may be changed. 
The BY statement
The BY statement specifies the variable or variables to be used as a key or keys to order the data. For example:
BY AGEGRP;
specifies that the data is to be sorted according to the values of the variable AGEGRP. Unless otherwise specified, the values are arranged with the lowest values first, that is in ascending order. To have data sorted in descending order, that is with the high values first, insert DESCENDING before the variable name, for example:
BY DESCENDING INCOME;
If more than one variable is specified the first one mentioned is the most important, the next the second most important etc. For example:
BY AGEGRP DESCENDING GRADE REGION;
sorts the data so that the first case has the lowest value of REGION within the highest value of GRADE within the lowest value of AGEGRP. The last case has the highest value of REGION within the lowest value of GRADE within the highest value of AGEGRP.
Examples
• PROC SORT;
    BY SEX;
  RUN;
• PROC SORT DATA=NATION OUT=REGION;
    BY REGION CITY;
  RUN;
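The ordering produced by a mixed ascending and descending key list can be checked on a small dataset. The sketch below uses made-up data in a hypothetical dataset STAFF; only the PROC SORT options and the BY statement come from the description above.

```sas
/* Hypothetical illustrative data: five cases, three key variables */
DATA STAFF;
  INPUT AGEGRP GRADE REGION;
  CARDS;
2 1 3
1 2 2
1 1 1
1 2 1
2 2 1
;
PROC SORT DATA=STAFF OUT=SORTED EQUALS;
  BY AGEGRP DESCENDING GRADE REGION;
RUN;
PROC PRINT DATA=SORTED;
RUN;
```

The sorted cases should appear in the order (1 2 1), (1 2 2), (1 1 1), (2 2 1), (2 1 3): the first has the lowest REGION within the highest GRADE within the lowest AGEGRP, and the last has the highest REGION within the lowest GRADE within the highest AGEGRP. Because OUT= is given, the original dataset STAFF is left unchanged.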
Defining your own formats - the FORMAT procedure
The FORMAT statement described on pages 9 and 11 associates a format with a variable. This may be a standard SAS format or one constructed by the user. These constructed formats are defined by the FORMAT procedure. You must use the FORMAT procedure to give labels to individual values of a variable, using the VALUE statement.
When a format is specified in the FORMAT statement it always includes a full stop. This is how formats are recognised. Because of this, format names end with a full stop when they appear in a FORMAT statement but not when they are defined in the FORMAT procedure.
The VALUE statement
To label individual values of a variable or ranges of values use the VALUE statement. The format is given a name and then the values and corresponding labels are declared. For example:
VALUE YESNO
  1='YES'
  2='NO'
  3='MISSING';
Ranges of values are specified using a hyphen, for example:
VALUE NATIONS
  1, 3-16, 28='WESTERN BLOC'
  2, 17-21='EASTERN BLOC'
  22-27, 29='NONALIGNED';
The keyword OTHER may be used to catch any values not explicitly mentioned, for example:
VALUE AGEFMT
  31-55='MIDDLE'
  18-30='YOUNG'
  56-80='OLD'
  OTHER='MISSING';
The keywords LOW and HIGH may be used to specify the ends of ranges, that is the lowest and highest values.
Using PUT with formats
The PUT function recodes a variable in accordance with a format. For example:
PROC FORMAT;
  VALUE AGEGRP
    0-18='1'
    19-30='2'
    31-50='3'
    51-65='4'
    66-100='5';
DATA SURV;
  INPUT AGE SEX INCOME;
  AGEGRP = PUT(AGE,AGEGRP.);
  CARDS;

RUN;
The variable AGEGRP is set to 1 whenever AGE is 0 to 18, 2 whenever AGE is 19 to 30, etc. AGEGRP is a character variable even though all its values are digits. Note that the PUT function is quite different from the PUT statement (see page 8).

Example 
To produce a frequency table for income with the data grouped into a small number of categories, a suitable format is declared with VALUE and then the FORMAT statement is used within the FREQ procedure to assign the format to the variable: 
PROC FORMAT;
  VALUE INCFMT
    LOW-1499='<1,500'
    1500-2499='<2,500'
    2500-5999='<6,000'
    6000-9999='<10,000'
    10000-HIGH='10,000+';
PROC FREQ;
  TABLES INCOME;
  FORMAT INCOME INCFMT.;
RUN;
This produces a frequency table with only five categories, which are labelled as described. Note the full stop after the format name in the FORMAT statement.
Printing the values of variables - the PRINT procedure
The PRINT procedure prints the contents of variables, and produces simple reports by use of the BY, PAGEBY and SUM statements. The PROC PRINT statement has the form:
PROC PRINT options;
The options include:
DATA=dataset 
specifies the dataset to be used. If it is missing the most recently created dataset is used. 
N 
outputs the number of cases at the end of the data. 
UNIFORM 
specifies that the same layout is to be used for each page. If it is not included, SAS outputs as many variables as possible on a page, which may result in different numbers of variables on different pages. 
DOUBLE 
specifies double spacing in the output. 
LABEL 
requests that the labels for the variables (see page 8) be used to head the columns of output rather than the names of the variables. 
The VAR statement
The VAR statement specifies the variables whose values are to be printed, for example:
VAR AGE GRADE;
If no VAR statement is used all the variables are printed.
The ID statement
Normally the number of the observation or case in the dataset is used to identify it in the output. The ID statement means that the values of the specified variable are to be used instead. For example:
ID NAME;
The BY and PAGEBY statements
The BY statement specifies that the procedure is to operate on the subgroups defined by the values of a variable. The PAGEBY statement may be used with the BY statement to start a new page when a BY variable changes. For example:
BY A B C;
PAGEBY A;
causes a new page to be started when the value of A changes.
BY A B C;
PAGEBY B;
causes a new page to be started whenever A or B changes, because PAGEBY triggers a new page for the variable specified and for any earlier variables in the BY statement list.
The SUM statement
If a variable appears in a SUM statement the total of the values is also produced. If a BY statement has also been used totals are printed for the subgroups (provided there is more than one case in the subgroup).
Examples
• PROC PRINT;
  RUN;
• PROC PRINT DOUBLE DATA=SICKNESS;
    VAR GRADE AGE DAYSSICK;
    ID NAME;
    BY DEPT;
    PAGEBY DEPT;
    SUM DAYSSICK;
  RUN;
• The following statements list the data by sex with income summed over men and women separately:
  PROC SORT;
    BY SEX;
  RUN;
  PROC PRINT;
    VAR AGE MARSTAT NOOFCH INCOME;
    BY SEX;
    SUM INCOME;
    ID CASENO;
  RUN;
The output is as follows:
SEX=1

CASENO    AGE    MARSTAT    NOOFCH    INCOME
101        35       2          4        9754
103        53       .          0        7560
104        39       4          2        8500
106        38       2          3        8210
107        49       2          7        9607
108        27       1          0        8895
110        21       1          0        2954
114        25       1          0        5650
115        80       4          .        1750
116        43       2