2. SAS Index
5. FILENAME
6. Sampling
7. PROC TRANSPOSE
9. PROC APPEND
10. Modify Statement
Review of SAS Processing
SAS processing is the way that the SAS language reads and transforms input data and
generates the kind of output that you request.
Processing a DATA Step: A Walkthrough
/*Sample*/
data pro;
input TeamName $ Name $ Event1 Event2 Event3;
datalines;
Yellow Sue 6 8 8
Blue Jane 9 7 8
Red John 7 7 7
Yellow Lisa 8 9 9
Red Fran 7 6 6
Blue Walter 9 8 10
;
5. At the end of each iteration, control returns to the top of the DATA step, and the
values of variables created by INPUT and assignment statements are reset to missing in
the program data vector. Note that variables that you read with a SET, MERGE,
MODIFY, or UPDATE statement are not reset to missing here.
6. SAS counts another iteration, reads the next record or observation, and executes
the subsequent programming statements for the current observation.
7. The DATA step terminates when SAS encounters the end-of-file in a SAS data set
or a raw data file.
Access Patterns
SAS procedures and statements can read observations in SAS data sets in one of the
following patterns:
sequential access: processes observations one after the other, starting at the beginning of
the file and continuing in sequence to the end of the file.
random access: processes observations according to the value of some indicator variable,
without processing previous observations.
BY-group access: groups and processes observations in order of the values of the
variables specified in a BY statement.
multiple-pass: performs two or more passes on the data when required by SAS statements
or procedures.
Several types of resource usage statistics are reported by the FULLSTIMER option,
including real time (elapsed time) and CPU time. Real time represents the clock time it
took to execute a job or step. CPU time represents the actual processing time required by
the CPU to execute the job.
The statistics reported by FULLSTIMER relate to the three critical computer resources:
I/O, memory, and CPU time (both system and user CPU time). Under many
circumstances, reducing the use of any of these three resources results in better
throughput for a particular job and a reduction in the real time used.
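As a minimal sketch, you can turn these statistics on for the session and then inspect the log of any subsequent step (SASHELP.CLASS is used here only as a convenient sample data set):

```sas
options fullstimer;

/* any subsequent step now reports real time, user CPU time,
   system CPU time, and memory usage in the log */
proc sort data=sashelp.class out=work.class_by_age;
   by age;
run;
```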
Improvement in I/O can come at the cost of increased memory consumption. In order to
understand the relationship between I/O and memory, it is helpful to know when data is
copied to a buffer and where I/O is measured. When you create a SAS data set using a
DATA step,
1. SAS copies the data from the input data set to a buffer in memory
2. one observation at a time is loaded into the program data vector
3. each observation is written to an output buffer when processing is complete
Page Size
Think of a buffer as a container in memory that is big enough for only one page of data.
The buffer size, or page size, determines the size of a single input/output buffer that SAS
uses to transfer data during processing. A page is the minimum number of bytes of data
that SAS moves between external storage and memory in one logical input/output
operation.
A page
- is the unit of data transfer between the storage device and memory
- includes the number of bytes used by the descriptor portion and the data values
- is fixed in size when the data set is created, either to a default value or to a
user-specified value.
The amount of data that can be transferred to one buffer in a single I/O operation is
referred to as the page size. Each buffer can hold one page of data.
2) BUFSIZE=
You can use the BUFSIZE= system option or data set option to control the page size of
an output SAS data set. BUFSIZE= specifies not only the page size (in bytes), but also
the size of each buffer that is used for reading or writing the SAS data set. BUFSIZE=
specifies the permanent page size for output SAS data sets.
When creating a data set, SAS uses the BUFSIZE= option to determine the page size of
the data set. If you do not specify a BUFSIZE= value, SAS selects a value that fits as
many observations as possible with the least amount of wasted space. Note that this
BUFSIZE= setting is specific to a SAS data set. If you increase the BUFSIZE= value,
more observations can be stored on a page, and the same amount of data can be accessed
with fewer I/O operations.
Note: the product of BUFNO= and BUFSIZE= is the important factor in sequential I/O
performance, rather than the specific value of either option. As BUFNO= is increased,
there is a marked reduction in I/O time and I/O count, although the cost of buffer storage
increases. As a result, elapsed times can be significantly reduced. For example, when
BUFNO=16 and BUFSIZE=6144, the results are very similar to BUFNO=4 and
BUFSIZE=23040. The total number of bytes occupied by a data set equals the page
size multiplied by the number of pages.
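For example (the library and data set names below are assumed, not from the source), BUFSIZE= can be set on the output data set while BUFNO= is applied to the input data set:

```sas
/* page size of the new data set: 6144-byte pages */
data work.copy (bufsize=6144);
   /* use 16 input buffers while reading sequentially */
   set mylib.big (bufno=16);
run;
```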
1.5.1 COMPRESS=
OPTIONS COMPRESS=YES|NO;
1. To compress all of the SAS data sets, you can add the following line at the beginning
of your SAS program.
options compress=yes;
2. To compress a single SAS data set, you can use SAS syntax similar to the following:
data two (compress=yes); /* creates a temporary compressed data set */
set one; /* reads a temporary data set */
run;
Note: at the end of a DATA step in which a compressed data set is created, SAS prints a
note in the log stating how much space was saved by compression. In most cases
compression saves space, but in a few cases it can actually increase the space
requirements.
data one (compress=char);
length x y $2;
input x y;
datalines;
ab cd
;
Consideration
However, there is no such thing as a free lunch. Saving disk space comes at the expense
of increased CPU time to compress the data as it is written and/or decompress the data as
it is read. But if your concern is I/O rather than CPU usage, compressing your data may
improve the I/O performance of your application. After a file is compressed, the setting is
a permanent attribute of the file; to uncompress a file, specify COMPRESS=NO in a
DATA step that rewrites it.
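For instance, to uncompress a file you can rewrite it with COMPRESS=NO (the data set names here are assumed):

```sas
data mylib.plain (compress=no); /* output is stored uncompressed */
   set mylib.packed;            /* input was created with COMPRESS=YES */
run;
```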
Advantages of compressing a file include:
-reduced storage requirements for the file
-fewer I/O operations necessary to read from or write to the data during processing.
Disadvantages of compressing a file are that
-more CPU resources are required to read a compressed file
-there are situations when the resulting file size may increase rather than decrease.
1.5.2 REUSE=
Specifies whether new observations are written to free space in compressed SAS data
sets.
Syntax
REUSE=YES | NO
YES tracks free space and reuses it whenever observations are added to an existing
compressed data set.
NO does not track free space. This is the default.
Specifying REUSE=NO results in less efficient usage of space if you delete or update
many observations in a SAS data set.
When you create a compressed file, you can also specify REUSE=YES (as a data set
option or system option) in order to track and reuse space. With REUSE=YES, new
observations are inserted in space freed when other observations are updated or deleted.
When the default REUSE=NO is in effect, new observations are appended to the existing
file.
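A short sketch (the data set names are assumed) of creating a compressed file that tracks and reuses freed space:

```sas
data mylib.accounts (compress=yes reuse=yes);
   set work.accounts;
run;
/* later deletions free space inside the file, and new observations
   are inserted into that free space instead of being appended */
```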
1.5.3.1 To reduce the length of character data, thereby eliminating wasted space.
Let us Look at How SAS Assigns Lengths to Character Variables
SAS character variables store data as 1 character per byte. A SAS character variable can
be from 1 to 32,767 bytes in length.
The first reference to a variable in the DATA step defines it in the program data vector
and in the descriptor portion of the data set. For example, suppose the length of a
character variable called Street has not been defined and the first value specified for
Street is Bloor. The length of Street is then set to 5, and if the next value specified for
Street is Sheppard, the value is stored as Shepp in the data set. Similarly, if the first value
specified for City is Toronto, the length of City is set to 7. If the next value for City is
specified as London, the length is still 7, and the value is padded with blanks to fill the
extra space. Keep in mind that SAS otherwise assigns a default length of 8 bytes to a
character variable.
data a;
   length city $ 7;
   input street $ 5. city $ 10-17;
cards;
Bloor Toronto
Sheppard London
Main Richmond
;
Reducing the Length of Character Data with the LENGTH Statement
You can use a LENGTH statement to reduce the length of character variables. It is useful
to reduce the length of a character variable with a LENGTH statement when you have a
large data set that contains many character variables.
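For example (the variable and data set names are assumed), a LENGTH statement placed before the SET statement shortens the stored variable; values longer than the new length are truncated, and SAS writes a warning to the log when multiple lengths are specified for a variable:

```sas
data shorter;
   length city $ 10;   /* new, shorter length */
   set longer;         /* city may have been defined as $ 40 */
run;
```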
1.5.3.2 To reduce the length of numeric variables, thereby eliminating wasted space.
In addition to conserving data storage space, reduced-length numeric variables use less
I/O, both when data is written and when it is read. For a file that is read frequently, this
savings can be significant. However, in order to safely reduce the length of numeric
variables, you need to understand how SAS stores numeric data.
The minimum length for a numeric variable is 2 bytes in mainframe environments and 3
bytes in non-mainframe environments.
Significant Digits and Largest Integer by Length for SAS Variables under Windows
Length (Bytes)   Largest Integer Represented Exactly
3                8,192 (2^13)
4                2,097,152 (2^21)
5                536,870,912 (2^29)
6                137,438,953,472 (2^37)
7                35,184,372,088,832 (2^45)
8                9,007,199,254,740,992 (2^53)
A LENGTH statement affects the length of a numeric variable only in the output data set.
Numeric variables always have a length of 8 bytes in the program data vector and during
processing.
You should assign reduced lengths to numeric variables to conserve data storage space
only if those variables have integer values. Fractional numbers lose precision if truncated.
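A sketch (the names are assumed); the reduced length here is safe only because the variable holds integers within the range shown in the table above:

```sas
data small;
   length year 4;   /* 4 bytes represent integers up to 2,097,152 exactly */
   set big;         /* year is still 8 bytes in the PDV during processing */
run;
```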
2. SAS Index
For the SET statement, the KEY= option allows you to specify an index in a
DATA step to retrieve particular observations in a data file.
For the SQL procedure, an index enables the software to process certain classes of
queries more efficiently, for example, join queries. Even though an index can reduce the
time required to locate a set of observations, especially for a large data file, there are
costs associated with creating, storing, and maintaining the index.
*The following example creates a simple index on the Simple data set.
The index is named Division, and it contains values of the Division
variable;
data simple (index=(division));
set mydata;
run;
*The following example creates two simple indexes on the Simple2 data
set. The first index is named Division, and it contains values of the
Division variable. The second index is called EmpID, and it contains
unique values of the EmpID variable;
data simple2 (index=(division empid/unique));
set mydata;
run;
When you create or use an index, you might want to verify that it has been created or
used correctly. To display information in the SAS log concerning index creation or index
usage, set the value of the MSGLEVEL= system option to I.
options msglevel=i;
data simple(index=(patid));
   input patid $;
cards;
109019
290871
;
run;
/*composite index*/
proc datasets library=indx;
   modify inx;
   index create newind=(ctnum _v2_) / nomiss;
run; quit;
Note: NOMISS excludes from the index all observations with missing values for all
index variables. UNIQUE specifies that the combination of values of the index variables
must be unique. If you specify UNIQUE and multiple observations have the same values
for the index variables, the index is not created.
The following rules of thumb indicate how large a subset of observations you can
efficiently extract from a SAS data set by using an index.
(Table of subset sizes and recommended indexing actions not reproduced here.)
Note that when you create a composite index, the first key variable should be the most
discriminating.
2.7 IDXNAME=
Directs SAS to use a specific index to satisfy the conditions of a WHERE expression.
Because the index SAS selects might not always provide the best optimization, you can
direct SAS to use one of the candidate indexes by specifying the IDXNAME= data set
option. From the list of candidate indexes, SAS selects the one that it determines will
provide the best performance, or rejects all of the indexes if a sequential pass of the data
is expected to be more efficient.
data new;
set old(idxname=age);
where age < 25;
run;
The SAS System uses the specified index if the following restrictions are true:
- The specified index must exist.
- The specified index must be suitable by having at least its first or only variable match a
condition in the WHERE expression.
- The specified index cannot conflict with BY processing requirements. That is, if a BY
statement is included, the specified index must be usable for both the BY statement and
the WHERE expression.
- The specified index cannot conflict with missing value requirements. That is, if the
specified index is created with the NOMISS option so that missing values are not
maintained in the index, the WHERE expression cannot qualify any observations that
contain missing values.
2.8 IDXWHERE=
Specifies whether SAS should use an index to process the WHERE expression,
regardless of which access method SAS estimates is faster.
Syntax
IDXWHERE=YES | NO
YES tells SAS to choose the best index to optimize a WHERE expression, and to
disregard the possibility that a sequential search of the data set might be more
resource-efficient.
NO tells SAS to ignore all indexes and satisfy the conditions of a WHERE expression
with a sequential search of the data set.
options msglevel=i;
proc print data=sale (idxwhere=no);
where department='Sales';
/*You know that Department has the value Sales in 65% of the
observations, so it is not efficient for SAS to use an index for WHERE
processing. */
run;
SAS indexes can drastically reduce the computer resources needed to extract a small
subset of observations from a large SAS data set. But before creating an index, you must
decide whether one is appropriate according to the criteria presented above. After
deciding that an index is appropriate, you have three tools to choose from to create one:
the DATASETS procedure, the SQL procedure, and the DATA step. You can exploit
indexes with the WHERE statement, the BY statement, or the KEY= option used in
either a SET or MODIFY statement. In doing so, you will increase the efficiency of the
SAS programs that use the index.
Suppose your program contains a WHERE statement that has two conditions, and
suppose that each condition references one of the first two key variables in a composite
index:
data winx;
   set trans; /* source data set name is assumed */
   where date='01Mar2006'd and
         delivery_date='02Mar2006'd;
   /* composite index defined on (Date, Delivery_Date) */
run;
Assuming that all other requirements for optimization are met, SAS can use this index to
optimize both conditions in this WHERE expression (compound optimization).
Suppose your program contains a WHERE statement that has two conditions, and that
there are three key variables in a composite index (Date, Delivery_Date, Product_ID):
data inx3;
   set trans; /* source data set name is assumed */
   where date='01jan2006'd and
         product_id='450106';
run;
In the above example, Date is the first key variable in the index. However, in this situation,
the composite index can be used to optimize only the first condition. The second
condition references the third key variable, Product_ID, but the WHERE expression does
not reference the second key variable, Delivery_Date. Without a reference to both the
first and second key variables, compound optimization cannot occur.
data inx4;
   set trans; /* source data set name is assumed */
   where delivery_date='01jan2006'd and
         product_id='450106';
run;
Now suppose your program contains a WHERE statement that references only the second
and third key variables in the composite index, as shown above. In this situation, SAS
cannot use the index for optimization at all because the WHERE statement does not
reference the first key variable.
3.2 WHERE Conditions That Cannot Be Optimized
SAS does not use an index to process a WHERE condition that contains any of the
elements listed below:
For all of the following examples, assume that the data set has simple indexes on the
variables Date, Quarter, and Quantity.
The process of retrieving data via an index (direct access) is more complicated than
sequentially processing data, so direct access requires more CPU time per observation
retrieved than sequential access does. However, for a small subset, using an index can
decrease the number of pages that SAS has to load into input buffers, which reduces the
number of I/O operations. If the subset is large enough (say, 50% or more of the
observations), sequential access is likely to be more efficient than direct access for
WHERE processing.
For example:
*In this example, the SAS data set Retail is indexed on the variable
Order_Date. ;
data _null_;
set retail;
by order_date;
run;
*In this example, the SAS data set Retail is sorted on the variable
Order_Date before it is read using the DATA step.;
data _null_;
set retail;
by order_date;
run;
*In this example, the SAS data set Retail is sorted using the SORT
procedure. The data is then read using the DATA step;
proc sort data=retail; by order_date;
run;
data _null_;
set retail;
by order_date;
run;
General Recommendations
-To conserve resources, use sort order rather than an index for BY-group processing.
-Although using an index for BY-group processing is less efficient than using sort order,
it might be the best choice if resource limitations make sorting a file difficult.
5. FILENAME Statement
You already know that you can use a FILENAME statement to associate a fileref with a
single raw data file. You can also use a FILENAME statement to concatenate raw data
files by assigning a single fileref to the raw data files that you want to combine.
Syntax:
FILENAME fileref ('external-file1' 'external-file2' ...'external-filen');
where
fileref is any SAS name that is eight characters or fewer.
'external-file' is the physical name of an external file. The physical name is the
name that is recognized by the operating environment.
Warning: All of the file specifications must be enclosed in one set of parentheses.
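For example (the file names and input layout here are assumed):

```sas
filename qtr ('jan.dat' 'feb.dat' 'mar.dat');

data all;
   infile qtr;       /* reads the three files as one continuous stream */
   input id $ sales;
run;
```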
If you are not familiar with the content and structure of your raw data files, you can use
PROC FSLIST to view them.
PROC FSLIST
PROC FSLIST displays an external file for interactive browsing. It provides a convenient
method for examining the information stored in an external file.
proc fslist file='path';
quit;
6. Sampling Data
6.1 Method
There are several ways to obtain an unbiased, random sample:
6.1.1 Simple Random Sampling
This is the equivalent of mixing all units in the population and then drawing out items
one at a time. A random number table or a random number generator can be used to
select the units for the sample.
6.1.2 Stratified Sampling
This is done by first separating units into (sub) groups, such as by product design,
manufacturing plant, production machine, date of production, lot of raw material, etc.
A random sample is then taken from each group.
6.1.3 Systematic Sampling
This is conducted by taking items at fixed intervals (such as every fifth item from a list
with random starting point).
data random;
   seed = 521; /* the seed initializes the random number stream on the first call only */
   do i = 1 to 20;
      random = uniform(seed);
      output;
   end;
run;
data a;
do i =1 to 1000;
output;
end;
run;
In simple random sampling, each unit has an equal probability of selection, and sampling
is without replacement.
/* using PROC SURVEYSELECT - stratified sampling */
proc surveyselect data=Customers method=srs n=15
seed=1953 out=Strat;
strata State Type;
run;
Note: the METHOD=SRS option requests simple random sampling without replacement.
The SEED=1953 option specifies the initial seed for random number generation, and
N=15 requests a sample of 15 units from each stratum defined by the STRATA variables.
The method of unrestricted random sampling (METHOD=URS) selects units with equal
probability and with replacement. Because units are selected with replacement, a unit can
be selected for the sample more than once.
proc surveyselect data=a method=urs rep=1 out=as2 n=100;
run;
/* OUTHITS writes one observation to the output data set for each selection,
   so a unit sampled more than once appears more than once */
proc surveyselect data=a method=urs rep=1 outhits n=100 out=as3;
run;
7. PROC TRANSPOSE
Overview
The TRANSPOSE procedure creates an output data set by restructuring the values in a
SAS data set, transposing selected variables into observations. The TRANSPOSE
procedure can often eliminate the need to write a lengthy DATA step to achieve the same
result. Further, the output data set can be used in subsequent DATA or PROC steps for
analysis, reporting, or further data manipulation.
PROC TRANSPOSE does not produce printed output. To print the output data set from
the PROC TRANSPOSE step, use PROC PRINT, PROC REPORT, or another SAS
reporting tool.
A transposed variable is a variable the procedure creates by transposing the values of an
observation in the input data set into values of a variable in the output data set.
Syntax
PROC TRANSPOSE DATA= SAS-data-set
PREFIX= name
OUT= SAS-data-set
NAME= name
LABEL= name
VAR variable-list;
ID variable;
BY variable-list;
RUN;
Some examples
data WII_test;
input Name $9. +1 ID $ session $ Test1 midterm
Final;
datalines;
Jason 0545 1 64 71 87
Duham 1236 2 81 95 91
Jeff 1167 1 65 94 92
McBane 1230 2 63 75 80
Grant 2527 2 80 76 71
Lunds 4860 1 92 40 86
Mccain 0674 1 75 78 72
;
run;
(a) Simple transpose: transpose all numeric variables.
proc transpose data=wii_test out=out_transposed;
run;
(b) Naming Transposed Variables:
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
run;
(c) Labeling Transposed Variables
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
id id;
idlabel name;
run;
(d)by group
proc sort data=wii_test; by session; run;
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
id id;
idlabel name;
var test1 midterm final;
by session;
run;
Conclusion: The TRANSPOSE procedure is very useful for transforming data from rows
to columns or columns to rows. Many users are uncertain of its effect on their data, but
by using the procedure along with the various options that are available, manipulating
data can be made much easier.
data en;
input accnt balance day @@;
cards;
901 486 1 901 985 4 903 498 2 903 498 2
;
data en2;
set en end=var;
if var=0;
run;
data b;
do i=2, 4; /*select more obs*/
set a point=i;
output;
end;
stop;
run;
In the following code sample, the DO loop assigns a value to the variable X, which is
used by the POINT= option to select every tenth observation from dat.
The POINT= option uses direct-access read mode, which means that SAS only reads
those observations that you direct it to read. In direct-access read mode, SAS does not
detect the end-of-file marker. Therefore, when you use the POINT= option in a SET
statement, you must use a STOP statement to prevent the DATA step from looping
continuously.
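The code sample referred to above did not survive in this copy; the following sketch matches the description (loop variable X, every tenth observation of dat; the NOBS= variable name is an assumption). Because NOBS= sets its variable at compile time, it can be used as the loop bound:

```sas
data tenth;
   do x = 10 to totobs by 10;
      set dat point=x nobs=totobs; /* direct access to observation x */
      output;
   end;
   stop;   /* required: SAS does not detect end-of-file with POINT= reads */
run;
```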
data newdata;
   set dsn1;
   set dsn2 key=indexvar;
run;
data _null_;
   /* NOBS= sets studnt_cnt at compile time, so it can be used before the SET executes */
   call symput('nobs_val', put(studnt_cnt, 1.));
   set student nobs=studnt_cnt;
   stop;
run;
%put &nobs_val;
Example:
data gcrew;
   input Emp_id $ last_name $30.;
cards;
00632 WHITE
01483 WONG
01996 SMITH
04064 LAGON
;
run;
data gsched;
   input Emp_id $ site_number $;
datalines;
00632 350
01996 425
04064 505
;
run;
/* keep every observation from data set gcrew, whether or not it has a
   matching observation in data set gsched */
data all_gcrew;
   merge gcrew (in=ingcrew)
         gsched (in=ingsched);
   by Emp_id;
   if ingcrew then output;
run;
9. PROC APPEND
Concatenating SAS data sets is the process of storing observations one after another until
all the data sets and their observations have been combined into one data set.
Many users perform the concatenation process using a DATA step (as shown in the
previous examples), but there are good reasons for using the APPEND procedure. If you
use the SET statement in a DATA step to concatenate two data sets, the SAS System must
process all the observations in both data sets to create a new one. The APPEND
procedure bypasses the processing of data in the original data set and adds new
observations directly to the end of the original data set. Therefore, the APPEND
procedure is more efficient than the SET statement in the DATA step for concatenating
data sets because it reads only the data in the DATA= data set.
PROC APPEND
Syntax: PROC APPEND BASE=SAS-data-set <DATA=SAS-data-set> <FORCE>;
Since PROC APPEND reads only the second data set, set BASE= to the larger data set.
PROC APPEND only reads the data in the DATA= SAS data set, not the BASE= SAS
data set. PROC APPEND concatenates data sets even though there may be variables in
the BASE= data set that do not exist in the DATA= data set.
data master;
input city $ 1-11 month $10. temp;
cards;
Honolulu August 80.7
Honolulu January 72.3
Boston July 73.3
Boston January 29.2
Duluth July 65.6
Duluth January 8.5
New York August 82.7
New York January 22.3
;
data add;
input city $ 1-11 month $10. temp;
cards;
Raleigh July 77.5
Raleigh January 40.5
Miami August 82.9
Miami January 67.2
Los Angeles August 69.5
Los Angeles January 54.5
;
run;
proc append base=master data=add;
run;
In the previous example, the DATA= add contained the same variables as the BASE=
data set (master).
You may need to append data sets when the DATA= data set contains fewer variables
than the BASE= data set; in that case, a warning message is written to the SAS log.
data master;
input city $ 1-11 month $10. temp;
cards;
Honolulu August 80.7
Honolulu January 72.3
Boston July 73.3
Boston January 29.2
Duluth July 65.6
Duluth January 8.5
New York August 82.7
New York January 22.3
;
data add;
input city $ 1-11 month $10. ;
cards;
Raleigh July
Raleigh January
Miami August
Miami January
Los Angeles August
Los Angeles January
;
run;
proc append base=master data=add;
run;
What are the advantages and/or disadvantages of the SET statement and PROC APPEND?
1. PROC APPEND is more efficient for appending two data sets, because it performs an
update in place on the BASE= data set: it just adds the observations from the DATA=
data set to the end of the BASE= data set. The observations in the BASE= data set are
not read or processed.
2. A DATA step with SET does not perform an update in place, so the original data sets
would not be damaged should the process terminate abnormally.
3. If the SAS job terminates abnormally while the APPEND procedure is processing, the
BASE= data set will be marked as damaged.
4. PROC APPEND cannot add variables to the BASE= data set; it can only add
observations to the existing structure of the BASE= data set.
5. If you need to add new variables as well as new observations to the BASE= data set, a
DATA step with a SET statement is the solution.
10. Modify Statement
Syntax
DATA SAS-data-set;
MODIFY SAS-data-set;
Statements;
RUN;
data a;
   modify a;
   class = class*1.10; /* multiply class by 1.10 for every observation, updating in place */
run;
proc print data=a; run;
data master;
modify master trans;
by school;
run;
proc print data=master; run;
If duplicate values exist in both the master and transaction data sets, you can use
PROC SQL to apply the duplicate values in the transaction data set to the duplicate
values in the master data set in a one-to-one correspondence.