
Topic Advanced Programming Techniques I

1. Optimizing System Performance

2. SAS Index

3. Using an Index for Efficient WHERE Processing

4. Using BY-Group Processing with an Index

5. FILENAME

6. Sampling

7. PROC TRANSPOSE

8. SET Statement and Beyond

9. PROC APPEND

10. Modify Statement

Review of SAS Processing
SAS processing is the way that the SAS language reads and transforms input data and
generates the kind of output that you request.
Processing a DATA Step: A Walkthrough
/*Sample*/
data pro;
input TeamName $ Name $ Event1 Event2 Event3;
datalines;
Yellow Sue 6 8 8
Blue Jane 9 7 8
Red John 7 7 7
Yellow Lisa 8 9 9
Red Fran 7 6 6
Blue Walter 9 8 10
;

The Compilation Phase


When you submit a DATA step for execution, SAS checks the syntax of the SAS
statements and compiles them, that is, automatically translates the statements into
machine code. In this phase, SAS identifies the type and length of each new variable, and
determines whether a type conversion is necessary for each subsequent reference to a
variable. During the compile phase, SAS creates the following three items:
input buffer: a logical area in memory into which SAS reads each record of raw data
when SAS executes an INPUT statement.
program data vector (PDV): a logical area in memory where SAS builds a data set, one
observation at a time. When a program executes, SAS reads data values from the input
buffer or creates them by executing SAS language statements.
descriptor information: information that SAS creates and maintains about each SAS data
set, including data set attributes and variable attributes. It contains, for example,
the name of the data set and its member type, the date and time that the data set was
created, and the number, names, and data types (character or numeric) of the variables.

The Execution Phase


By default, a simple DATA step iterates once for each observation that is being created.
The flow of action in the Execution Phase of a simple DATA step is described as follows:
1. The DATA step begins with a DATA statement. Each time the DATA statement
executes, a new iteration of the DATA step begins, and the _N_ automatic variable
is incremented by 1.
2. SAS sets the newly created program variables to missing in the program data
vector (PDV).
3. SAS reads a data record from a raw data file into the input buffer, or it reads an
observation from a SAS data set directly into the program data vector. You can
use an INPUT, MERGE, SET, MODIFY, or UPDATE statement to read a record.
4. SAS executes any subsequent programming statements for the current record.
5. At the end of the statements, an output, return, and reset occur automatically. SAS
writes an observation to the SAS data set, the system automatically returns to the

top of the DATA step, and the values of variables created by INPUT and
assignment statements are reset to missing in the program data vector. Note that
variables that you read with a SET, MERGE, MODIFY, or UPDATE statement
are not reset to missing here.
6. SAS counts another iteration, reads the next record or observation, and executes
the subsequent programming statements for the current observation.
7. The DATA step terminates when SAS encounters the end-of-file in a SAS data set
or a raw data file.

Access Patterns

SAS procedures and statements can read observations in SAS data sets in one of
following patterns:

sequential access: processes observations one after the other, starting at the beginning
of the file and continuing in sequence to the end of the file.
random access: processes observations according to the value of some indicator variable
without processing previous observations.
BY-group access: groups and processes observations in order of the values of the
variables specified in a BY statement.
multiple-pass: performs two or more passes on the data when required by SAS statements
or procedures.

1. Optimizing System Performance


1.1 Definitions: Performance Statistics
All tasks require time and space. Time and space for a computer program are composed
of CPU time, I/O time, and memory.
• I/O time: (the time your computer takes to read data into memory and write data
from the memory to your hard drive)
• Memory: the size of the work area that the CPU must devote to the operations in the
program.
• CPU time: (the time your computer takes to perform calculations, CPU-Central
Processing Unit)
• Data storage: how much space on disk or tape your data use
• Programming time: the amount of time required for the programmer to write and
maintain the program.
You can obtain these statistics by using SAS system options, which can help you measure
your job's initial performance and determine how to improve it.

1.2 System Performance

System performance is measured by the overall amount of I/O, memory, data storage, and
CPU time that your system uses to process SAS programs.

1.3 Interpreting FULLSTIMER

Several types of resource usage statistics are reported by the FULLSTIMER option,
including real time (elapsed time) and CPU time. Real time represents the clock time it
took to execute a job or step. CPU time represents the actual processing time required by
the CPU to execute the job.
The statistics reported by FULLSTIMER relate to the three critical computer resources:
I/O, memory, and CPU time (both system and user CPU time). Under many circumstances,
reducing the use of any of these three resources results in better throughput for a
particular job and a reduction in the real time used.
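
To turn these statistics on, you can set the FULLSTIMER system option; a minimal sketch
follows (the data set and BY variable names are illustrative):

options fullstimer;   /* write full resource-usage statistics to the log for every step */

proc sort data=work.big out=work.big_sorted;   /* work.big is a hypothetical data set */
   by id;
run;

options nofullstimer; /* return to the default, shorter statistics */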

1.4 Overview of Techniques for Optimizing I/O


I/O is one of the most important factors for optimizing performance. Most SAS jobs
consist of repeated cycles of reading a particular set of data to perform various data
analysis and data manipulation tasks. To improve the performance of a SAS job, you
must reduce the number of times SAS accesses disk or tape devices and reduce the
number of times it processes the data internally.

Improvement in I/O can come at the cost of increased memory consumption. In order to
understand the relationship between I/O and memory, it is helpful to know when data is
copied to a buffer and where I/O is measured. When you create a SAS data set using a
DATA step,

1. SAS copies the data from the input data set to a buffer in memory
2. one observation at a time is loaded into the program data vector
3. each observation is written to an output buffer when processing is complete

Page Size
Think of a buffer as a container in memory that is big enough for only one page of data.
The buffer size, or page size, determines the size of a single input/output buffer that SAS
uses to transfer data during processing. A page is the minimum number of bytes of data
that SAS moves between external storage and memory in one logical input/output
operation.

A page
 is the unit of data transfer between the storage device and memory
 includes the number of bytes used by the descriptor portion and the data values.
 is fixed in size when the data set is created, either to a default value or to a user-
specified value.
The amount of data that can be transferred to one buffer in a single I/O operation is
referred to as page size. Each buffer can hold one page of data.

Setting the BUFNO=, and BUFSIZE= System Options


The following SAS system options can help you reduce the number of disk accesses that
are needed for SAS files, though they might increase memory usage.
1) BUFNO=
You can use the BUFNO= system or data set option to control the number of buffers that
are available for reading or writing a SAS data set. Also, BUFNO is the number of page
buffers to allocate for the data set. By increasing the number of buffers, you can control
how many pages of data are loaded into memory with each I/O transfer.
You can specify the number of buffers; the default is 1. The value can range from 1 to
the maximum number of buffers available in your operating environment. SAS uses the
BUFNO= option to adjust the number of open page buffers when it processes a SAS data
set. Increasing this option's value can improve your application's performance by
allowing SAS to read more data with fewer passes; however, your memory usage
increases. Therefore, the greater the number of page buffers, the more memory is
required. The buffer number is not a permanent attribute of the data set and is valid only
for the current step or SAS session.
To minimize I/O consumption, when you work with a small data set, allocate as many
buffers as there are pages in the data set so that the entire data set can be loaded into
memory.
Note: Using BUFNO= can speed up execution time by limiting the number of
input/output operations that are required for a particular SAS data set. The improvement
in execution time, however, comes at the expense of increased memory consumption.

2) BUFSIZE=
You can use the BUFSIZE= system option or data set option to control the page size of
an output SAS data set. BUFSIZE= specifies not only the page size (in bytes), but also
the size of each buffer that is used for reading or writing the SAS data set. BUFSIZE
specifies the permanent buffer page size for processing output SAS data sets.
When creating a data set, SAS uses the BUFSIZE= option to determine the page
size of the data set. If you do not specify a BUFSIZE= option, SAS selects a value that
contains as many observations as possible with the least amount of wasted space. Note,
this BUFSIZE setting is specific to a SAS data set. If you increase the BUFSIZE= value,
more observations can be stored on a page, and the same amount of data can be accessed
with fewer I/Os.
Note: the product of BUFNO and BUFSIZE is the important factor in sequential I/O
performance rather than the specific value of either option. As BUFNO is increased, there
is a marked reduction in I/O time and I/O count, although the cost of buffer storage
increases. As a result, elapsed times can be significantly reduced. For example, when
BUFNO=16 and BUFSIZE=6144, the results are very similar to BUFNO=4 and
BUFSIZE=23040. The total number of bytes occupied by a data set equals the page
size multiplied by the number of pages.
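
For example, the following sketch sets a larger page size when a data set is written and
reads it back with more buffers; the data set names and the specific values are
illustrative:

data work.trans(bufsize=16384);   /* 16K page size for the new data set */
   set work.raw;                  /* work.raw is a hypothetical input data set */
run;

data work.summary;
   set work.trans(bufno=10);      /* hold 10 pages in memory per I/O transfer */
run;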

I/O Performance Techniques


1) Read only data that is needed when using an INPUT statement.
2) Use KEEP / DROP statements or KEEP= / DROP= data set options to retain only
desired variables.
3) Use a WHERE statement or WHERE= data set option to subset rows of data.
4) Use data compression for large data sets using the COMPRESS= data set option.
5) Use indexes to optimize the retrieval of data.
6) Use the DATASETS procedure COPY statement to copy data sets with indexes.
7) Create data sets
8) Use LENGTH statements.
9) Use the OBS= and FIRSTOBS= data set options.
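
A brief sketch that combines several of the techniques above (the library, data set, and
variable names are illustrative): only the needed variables are read, the rows are subset
with WHERE=, and OBS= limits the number of observations for a test run.

data work.subset;
   set mylib.claims(keep=patid visit_date cost       /* retain only the desired variables */
                    where=(visit_date >= '01jan2006'd) /* subset rows as they are read */
                    obs=1000);                        /* limit observations while testing */
run;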

1.5 Techniques for Optimizing Storage Performance


1.5.1 Compression
Why?
Compressing a file is a process that reduces the number of bytes required to represent
each observation. Compressing data reduces I/O and disk space but increases CPU time.
You can compress data files to save space. Storing your data this way means that more
CPU time is needed to decompress the observations as they are made available to SAS.
When?
The length of each variable is established when the data set is created. It is possible that
the byte length of a variable is longer than that which is required to hold its value. Since
compression works only on character variables, data sets with only a few small character
variables are not good candidates for compression.
Data sets that contain many long character variables generally are excellent candidates
for compression.
How?
There are two methods by which to implement data set compression in the SAS system.
The SAS system has an option to compress data sets at creation by either specifying
(COMPRESS=yes) in the data step or by using OPTIONS COMPRESS=YES.
Use the COMPRESS= data set option to compress an individual file; the entire observation,
with all of its variables, is compressed. Use the COMPRESS= data set option only when
you are creating a SAS data file. This option is placed in parentheses adjacent to the
name of the new data set being created. The system option causes ALL data sets created to
be compressed. In my view, the data set option is preferable.
For Example:
Using the COMPRESS option on temporary SAS Data Sets
The temporary SAS data sets are by default stored in the SAS WORK space. By using the
COMPRESS option in SAS, you can reduce the size of your data sets. General forms of
the COMPRESS option are:

OPTIONS COMPRESS=YES|NO;

1. To compress all of the SAS data sets, you can add the following line at the beginning
of your SAS program.
options compress=yes;

2. To compress a single SAS data set, you can use SAS syntax similar to the following:
data two (compress=yes); /* creates a temporary compressed data set */
set one; /* reads a temporary data set */
run;

Note: At the end of a DATA step in which a compressed data set is created, SAS prints in
the log how much space is saved by compression. In most cases compression saves space,
but in a very few cases it can actually increase space requirements.
data one (compress=char);
length x y $2;
input x y;
datalines;
ab cd
;

Consideration
However, there is no such thing as a free lunch. Saving disk space comes at the expense
of increased CPU time to compress the data as it is written and/or decompress the data as
it is read. But if your concern is I/O, and not CPU usage, compressing your data may
improve the I/O performance of your application. After a file is compressed, the setting
is a permanent attribute of the file; to uncompress it, re-create the file in a DATA step
with COMPRESS=NO specified.

Deciding Whether to Compress a Data File


Not all data files are good candidates for compression. However, compression can be
beneficial when the data file has one or more of the following properties:
A. How often is the data set going to be used?
Infrequently used data sets are better candidates for compression than data sets that are
used often.
B. How many character variables are in the data set?
Data sets without long character variables are poor candidates for compression.
C. It is large.
D. It contains many values that have repeated characters.
E. It contains many missing values.
F. It contains repeated values in variables that are physically stored next to one another.
G. In character data, the most frequently encountered repeated value is the blank. Long
text fields, such as comments and addresses, often contain repeated blanks.
Note: A data file is not a good candidate for compression if it has
1. few repeated characters
2. small physical size
3. few missing values
4. short text strings.

Advantages of compressing a file include:
-reduced storage requirements for the file
-fewer I/O operations necessary to read from or write to the data during processing.
Disadvantages of compressing a file are that
-more CPU resources are required to read a compressed file
-there are situations when the resulting file size may increase rather than decrease.

1.5.2 REUSE=
It specifies whether new observations are written to free space in compressed SAS data
sets.
Syntax
REUSE=YES | NO
YES tracks free space and reuses it whenever observations are added to an existing
compressed data set.
NO does not track free space. This is the default.

Specifying REUSE=NO results in less efficient usage of space if you delete or update
many observations in a SAS data set.
When you create a compressed file, you can also specify REUSE=YES (as a data set
option or system option) in order to track and reuse space. With REUSE=YES, new
observations are inserted in space freed when other observations are updated or deleted.
When the default REUSE=NO is in effect, new observations are appended to the existing
file.
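
For example, a compressed data set that tracks and reuses freed space might be created as
in the following sketch (the library and data set names are illustrative):

data mylib.orders(compress=yes reuse=yes);   /* compressed file that reuses freed space */
   set work.orders_new;                      /* hypothetical input data set */
run;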

1.5.3 Specifying Variable Lengths

1.5.3.1 To reduce the length of character data, thereby eliminating wasted space.
Let us Look at How SAS Assigns Lengths to Character Variables
SAS character variables store data as 1 character per byte. A SAS character variable can
be from 1 to 32,767 bytes in length.
The first reference to a variable in the DATA step defines it in the program data vector
and in the descriptor portion of the data set. For example, if the length of a character
variable called Street has not been defined and if the first value that is specified for Bloor.
The length of Street is set to 5. Then, if the next value specified for Street is Sheppard,
the value is stored as Shepp in the data set. Similarly, if the first value specified for City
is Toronto, the length of City is set to 7. If the next value for City is specified as London,
the length is still 7, and the value is padded with blanks to fill the extra space. Keep in
mind that SAS assigns a default length of 8 bytes to the variable.
data a;
length city $ 7;
input street $ 5. city $ 10-17 ;
cards;
Bloor    Toronto
Sheppard London
Main     Richmond
;

Reducing the Length of Character Data with the LENGTH Statement
You can use a LENGTH statement to reduce the length of character variables. It is useful
to reduce the length of a character variable with a LENGTH statement when you have a
large data set that contains many character variables.

1.5.3.2 To reduce the length of numeric variables, thereby eliminating wasted space.
In addition to conserving data storage space, reduced-length numeric variables use less
I/O, both when data is written and when it is read. For a file that is read frequently, this
savings can be significant. However, in order to safely reduce the length of numeric
variables, you need to understand how SAS stores numeric data.

Let us Look at How SAS Stores Numeric Variables


A SAS numeric variable can be from 2 to 8 bytes or 3 to 8 bytes in length, depending on
your operating environment. The default length for a numeric variable is 8 bytes.

The minimum length for a numeric variable is 2 bytes in mainframe environments and 3
bytes in non-mainframe environments.

Significant Digits and Largest Integer by Length for SAS Variables under Windows

Length in Bytes    Largest Integer Represented Exactly    Exponential Notation

      3                                8,192                     2**13
      4                            2,097,152                     2**21
      5                          536,870,912                     2**29
      6                      137,438,953,472                     2**37
      7                   35,184,372,088,832                     2**45
      8                9,007,199,254,740,992                     2**53

Reducing the Length of Numeric Variables with the LENGTH Statement


You can use a LENGTH statement to assign a length from 2 to 8 bytes to numeric
variables. Remember, the minimum length of numeric variables depends on the operating
environment. Also, keep in mind that the LENGTH statement affects the length of a

numeric variable only in the output data set. Numeric variables always have a length of 8
bytes in the program data vector and during processing.

You should assign reduced lengths to numeric variables to conserve data storage space
only if those variables have integer values. Fractional numbers lose precision if truncated.
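
A minimal sketch of assigning reduced lengths, assuming the variables hold only small
integer values (the data set and variable names are illustrative):

data work.scores;
   length age 3           /* integers up to 8,192 are stored exactly in 3 bytes */
          visit_count 4;  /* integers up to 2,097,152 are stored exactly in 4 bytes */
   set work.scores_raw;   /* hypothetical input data set */
run;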

1.6 Techniques for Optimizing Memory Usage


If memory is a critical resource, several techniques can reduce your dependence on
increased memory. However, most of them also increase I/O processing or CPU usage.

1.7 Techniques for Optimizing CPU Performance


1.7.1 Reducing CPU Time by Using More Memory or Reducing I/O
Executing a single stream of code takes approximately the same amount of CPU time
each time that code is executed. Optimizing CPU performance in these instances is
usually a tradeoff. For example, you might be able to reduce CPU time by using more
memory, because more information can be read and stored in one operation, but less
memory is available to other processes.

1.7.2 Storing a Compiled Program for Computation-Intensive DATA Steps


Another technique that can improve CPU performance is to store a DATA step that is
executed repeatedly as a compiled program rather than as SAS statements. Any technique
that reduces the number of I/O operations can also have a positive effect on CPU usage.
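
A minimal sketch of the stored compiled DATA step feature with the PGM= option (the
library, data set, and variable names are illustrative):

*store the compiled program instead of executing it immediately;
data work.monthly / pgm=mylib.stored_step;
   set mylib.transactions;        /* hypothetical input data set */
   revenue = price * quantity;    /* hypothetical computation */
run;

*later, execute the stored, already-compiled program;
data pgm=mylib.stored_step;
run;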
2. SAS Index
2.1 Definition
An index is an optional file that you can create for a SAS data file to provide direct
access to specific observations; it is analogous to a search function. A good index
allows your programs to quickly access the subset of SAS observations that you need
from a large SAS data set. For a small subset, using an index can decrease the
number of pages that SAS has to load into input buffers, which reduces the number of I/O
operations. This can dramatically improve the speed and efficiency of your SAS
programs.

2.2 Benefits of an Index


In general, SAS can use an index to improve performance in the following situations:
 For WHERE processing, an index can provide faster and more efficient access to
a subset of data.

 For BY processing, using an index to process a BY statement might not always be


more efficient than simply sorting the data file. Therefore, using an index for a
BY statement is generally for convenience, not for performance.

 For the SET statement, the KEY= option allows you to specify an index in a
DATA step to retrieve particular observations in a data file.

For the SQL procedure, an index enables the software to process certain classes of
queries more efficiently, for example, join queries. Even though an index can reduce the
time required to locate a set of observations, especially for a large data file, there are
costs associated with creating, storing, and maintaining the index.

The main benefits of using an index include the following:


-provides fast access to a small subset of observations
-returns values in sorted order

2.3 Types of Indexes


When you create an index, you designate which variable(s) to index. An indexed variable
is called a key variable. You can create two types of indexes:
2.3.1 A simple index, which consists of the values of one variable, which can be
character or numeric. When you create a simple index, SAS assigns the name of the key
variable as the name of the index.
2.3.2 A composite index, which consists of the values of more than one variable,
which can be character, numeric, or a combination. The values of these key variables are
concatenated to form a single value.

2.4 To Create an Index


You can either create an index when you create a data file, or create an index for an
existing data file. The data file can be either compressed or uncompressed. For each data
file, you can create one or multiple indexes. Once an index exists, SAS treats it as part of
the data file. That is, if you add or delete observations or modify values, the index is
automatically updated.

2.4.1 INDEX= data set option in the DATA statement.


Syntax:
DATA SAS-data-file (INDEX=(index-specification-1</UNIQUE>
                           <...index-specification-n</UNIQUE>>));
SAS stores the name of a composite index exactly as you specify it in the INDEX=
option. Therefore, if you want the name of your index to begin with a capital letter, you
must specify the name with an initial capital letter in the INDEX= option. You can create
an index on a SAS data file. The UNIQUE option guarantees that values for the key
variable or the combination of a composite group of variables remain unique for every
observation in the data set.

*The following example creates a simple index on the Simple data set.
The index is named Division, and it contains values of the Division
variable;

data simple (index=(division));
set mydata;
run;

*The following example creates two simple indexes on the Simple2 data
set. The first index is named Division, and it contains values of the
Division variable. The second index is called EmpID, and it contains
unique values of the EmpID variable;
data simple2 (index=(division empid/unique));
set mydata;
run;

*The following example creates a composite index on the Composite data


set. The index is named Empdiv, and it contains concatenated values of
the Division variable and the EmpID variable;
data composite (index=(Empdiv=(division empid)));
set mydata;
run;

When you create or use an index, you might want to verify that it has been created or
used correctly. To display information in the SAS log concerning index creation or index
usage, set the value of the MSGLEVEL= system option to I.
options msglevel=i;
data simple(index=(patid));
input patid $;
cards;
109019
290871
;

2.4.2 Using the DATASETS Procedure


The DATASETS procedure provides statements that allow you to create and delete
indexes. In the following example, the MODIFY statement identifies the data file, the
INDEX DELETE statement deletes two indexes, and the two INDEX CREATE
statements specify the variables to index, with the first INDEX CREATE statement
specifying the options UNIQUE and NOMISS:
proc datasets library=mylib;
modify employee;
index delete salary age;
index create empnum / unique nomiss;
index create names=(lastname firstname) nomiss;
run;
Note: If you delete and create indexes in the same step, place the INDEX DELETE
statement before the INDEX CREATE statement so that space occupied by deleted
indexes can be reused during index creation.
libname indx "c:\";
***PROC DATASETS***;
proc sort data=inx out=indx.inx;
by ctnum;
run;
/*simple index*/;
proc datasets library=indx;
modify inx;
index create ctnum / unique nomiss ;
run; quit;
/*composite index*/;
proc datasets library=indx;
modify inx;
index create newind=(ctnum _v2_) / nomiss ;
run; quit;

Note: NOMISS excludes from the index all observations with missing values for all
index variables. UNIQUE specifies that the combination of values of the index variables
must be unique. If you specify UNIQUE and multiple observations have the same values
for the index variables, the index is not created.

2.4.3 Using the SQL Procedure


The SQL procedure supports index creation and deletion and the UNIQUE option.
The DROP INDEX statement deletes indexes. The CREATE INDEX statement specifies
the UNIQUE option, the name of the index, the target data file, and the variable(s) to be
indexed. Let us see in PROC SQL in detail.
****SQL****;
proc sql;
create unique index ctnum on indx.inx(ctnum);
create index nv on indx.inx(ctnum, _v2_);
quit;

2.5 Determining Whether SAS Is Using an Index


It is not always possible or more efficient for SAS to use an existing index to access
specific observations directly. An index is not used
 with a subsetting IF statement in a DATA step
 with particular WHERE expressions
 if SAS determines it is more efficient to read the data sequentially.

2.6 Guidelines for Creating Indexes


An index exists to improve performance. However, an index conserves some resources at
the expense of others. Therefore, you must consider costs associated with creating, using,
and maintaining an index. When you are deciding whether to create an index, you must
consider CPU cost, I/O cost, buffer requirements, and disk space requirements.
Data File Considerations
 For a small data file, sequential processing is often just as efficient as index
processing. Do not create an index if the data file page count is less than three
pages. It would be faster to access the data sequentially. To see how many pages
are in a data file, use the CONTENTS procedure.
 Consider the cost of an index for a data file that is frequently changed.
 Create an index when you intend to retrieve a small subset of observations from a
large data file (recommended range 3%-33%).
 To reduce the number of I/Os performed when you create an index, first sort the
data by the key variable. Then to improve performance, maintain the data file in
sorted order by the key variable. If the data file has more than one index, sort the
data by the most frequently used key variable.

The following table provides rules of thumb for the proportion of observations that you
may efficiently extract from a SAS data set using an index.

____________________________________________________________________
Subset Size       Indexing Action
____________________________________________________________________
3% - 15%          An index will definitely improve program performance
16% - 20%         An index will probably improve program performance
21% - 33%         An index might improve or it might worsen program performance
34% - 100%        An index will not improve program performance
____________________________________________________________________

Index Use Considerations


 Keep the number of indexes per data file to a minimum to reduce disk storage and
to reduce update costs.
 Consider how often your applications will use an index. An index must be used
often in order to make up for the resources that are used in creating and
maintaining it.
 When you create an index to process a WHERE expression, do not try to create
one index that is used to satisfy all queries.
 Data set is relatively large
 Data set not frequently updated
 Usually less than 33% of entire data set
 Data frequently subset by values of the indexed variable
 Data page should be more than three pages

Key Variable Candidates


In most cases, multiple variables are used to query a data file. However, it probably
would be a mistake to index all variables in a data file, as certain variables are better
candidates than others. The ideal index key will satisfy the following criteria.
 The variables to be indexed should be those that are used most often. That is,
your application should require selecting small subsets from a large file, and the
most common selection variables should be considered as candidate key
variables.
 A variable is a good candidate for indexing when the variable can be used to
precisely identify the observations that satisfy a WHERE expression. That is, the
variable should be discriminating.

Note that when you create a composite index, the first key variable should be the most
discriminating.

2.7 IDXNAME=
Directs SAS to use a specific index to satisfy the conditions of a WHERE expression.
Because the index SAS selects might not always provide the best optimization, you can
direct SAS to use one of the candidate indexes by specifying the IDXNAME= data set
option. From the list of candidate indexes, SAS selects the one that it determines will
provide the best performance, or rejects all of the indexes if a sequential pass of the data
is expected to be more efficient.
data new;
set old(idxname=age);
where age < 25;
run;

The SAS System uses the specified index if the following restrictions are true:
 The specified index must exist.
 The specified index must be suitable by having at least its first or only variable
match a condition in the WHERE expression.
 The specified index cannot conflict with BY processing requirements. That is, if a
BY statement is included, the specified index must be usable for both the BY
statement and the WHERE expression.
 The specified index cannot conflict with missing value requirements. That is, if
the specified index is created with the NOMISS option so that missing values are
not maintained in the index, the WHERE expression cannot qualify any
observations that contain missing values.

2.8 IDXWHERE=
specifies whether or not SAS should use an index to process the WHERE expression, no
matter which access method SAS estimates is faster.

Syntax
IDXWHERE=YES | NO
YES tells SAS to choose the best index to optimize a WHERE expression, and to
disregard the possibility that a sequential search of the data set might be more resource-
efficient.
NO tells SAS to ignore all indexes and satisfy the conditions of a WHERE expression
with a sequential search of the data set.

options msglevel=i;
proc print data=sale (idxwhere=no);
where department='Sales';
/*You know that Department has the value Sales in 65% of the
observations, so it is not efficient for SAS to use an index for WHERE
processing. */
run;

2.9 Advantage and Disadvantage of Index Creation


An INDEX will:
-reduce the time required to find observations meeting conditions in a WHERE clause
-eliminate the need to sort a Data Set upon which BY group processing will be applied
Drawbacks:
-Additional CPU utilization and storage are required
-The INDEX will need to be repaired (using PROC DATASETS) if the values of the
indexed variables change.
-Using an index requires additional memory for buffers into which the index pages and
code are loaded for processing.
Conclusion:

SAS Indexes can be used to drastically reduce the computer resources needed to extract a
small subset of observations from a large SAS data set. But, before creating an index, you
must decide if one is appropriate in accordance to the criteria presented above. After
deciding that an index is appropriate, you have three tools to choose from to create one:
the DATASETS procedure, the SQL procedure, and the DATA step. You can exploit
indexes with the WHERE statement, the BY statement, or the KEY statement used in
either a SET or MODIFY statement. In doing so, you will be increasing the efficiency of
your SAS programs that use the index.

3. Using an Index for Efficient WHERE Processing


3.1 For WHERE processing, an index can provide faster and more efficient access to a
subset of data.
data winx;
set dates;
where date_id='02Mar2006'd;
run;
Even if a WHERE statement has multiple conditions, SAS can use either a simple index
or a composite index to optimize just one of the conditions. For example, suppose your
program contains a WHERE statement that has two conditions, and suppose that the data
set has one index, as shown below:
data winx;
set orders;   /* "orders" is an assumed input data set; simple index defined on Delivery_Date */
where date='01Mar2006'd and
delivery_date='02Mar2006'd;
run;

Assuming that all other requirements for optimization are met, SAS can use this index to
optimize the second condition in this WHERE expression.
Suppose your program contains a WHERE statement that has two conditions, and
suppose that each condition references one of the first two key variables in a composite
index:
data winx;
set orders;   /* assumed input data set; composite index defined on (Date, Delivery_Date) */
where date='01Mar2006'd and
delivery_date='02Mar2006'd;
run;

Suppose your program contains a WHERE statement that has two conditions, and that
there are three key variables in a composite index (Date, Delivery_Date, Product_ID):
data inx3;
set orders;   /* assumed input data set with the composite index described above */
where date='01jan2006'd and
product_id='450106';
run;
In the above example, Date is the first key variable in the index. However, in this situation,
the composite index can be used to optimize only the first condition. The second
condition references the third key variable, Product_ID, but the WHERE expression does
not reference the second key variable, Delivery_Date. Without a reference to both the
first and second key variables, compound optimization cannot occur.
data inx4;
set orders;   /* assumed input data set with the composite index described above */
where delivery_date='01jan2006'd and
product_id='450106';
run;

Now suppose your program contains a WHERE statement that references only the second
and third key variables in the composite index, as shown above. In this situation, SAS
cannot use the index for optimization at all because the WHERE statement does not
reference the first key variable.
3.2 WHERE Conditions That Cannot Be Optimized
SAS does not use an index to process a WHERE condition that contains any of the
elements listed below:
For all of the following examples, assume that the data set has simple indexes on the
variables Date, Quarter, and Quantity.

Element in WHERE Condition                                        Example

any function other than TRIM or SUBSTR                            where weekday(date)=2;
a SUBSTR function that searches a string beginning at any
  position after the first                                        where substr(quarter,4,1)='1';
the sounds-like operator (=*)                                     where quarter=*'1900Q0';
arithmetic operators                                              where quantity=quantity+1;
a variable-to-variable condition                                  where quantity gt threshold;

The process of retrieving data via an index (direct access) is more complicated than
sequentially processing data, so direct access requires more CPU time per observation
retrieved than sequential access. However, for a small subset, using an index can decrease
the number of pages that SAS has to load into input buffers, which reduces the number of
I/O operations. If the subset is large enough, such as 50% or more of the observations,
sequential access is likely to be more efficient than direct access for WHERE processing.

4. Using BY-Group Processing with an Index


BY-group processing is a method of processing observations from one or more SAS data
sets that are grouped or ordered by the values of one or more common variables. You can
use BY-group processing in both DATA steps and PROC steps.
However, using BY-group processing with an index has two disadvantages:
-It is generally less efficient than sequentially reading a sorted data set because
processing BY groups typically means retrieving the entire file.
-It requires storage space for the index.
Note:
A BY statement does not use an index if the BY statement includes the DESCENDING or
NOTSORTED option or if SAS detects that the data file is physically stored in sorted
order on the BY variables.
If you use a MODIFY statement, the data does not need to be ordered. However, your
program might run more efficiently with ordered data.

For example;
*In this example, the SAS data set Retail is indexed on the variable
Order_Date. ;

data _null_;
set retail;
by order_date;
run;

*In this example, the SAS data set Retail is sorted on the variable
Order_Date before it is read using the DATA step.;
data _null_;
set retail;
by order_date;
run;

*In this example, the SAS data set Retail is sorted using the SORT
procedure. The data is then read using the DATA step;
proc sort data=retail; by order_date;
run;
data _null_;
set retail;
by order_date;
run;

General Recommendations
-To conserve resources, use sort order rather than an index for BY-group processing.
-Although using an index for BY-group processing is less efficient than using sort order,
it might be the best choice if resource limitations make sorting a file difficult.

Using the NOTSORTED Option


You can also use the NOTSORTED option with a BY statement to create ordered or
grouped reports without sorting the data. The NOTSORTED option can appear anywhere
in the BY statement and is useful if you have data that is in logical categories or
groupings such as chronological order.
data a;
input age gender $ @@;
cards;
12 F 13 M 23 M 11 F 23 . 17 F
;
proc means data=a;
var age; by gender notsorted;
run;
Note:
-The NOTSORTED option turns off sequence checking. If your data is not grouped,
using the NOTSORTED option can produce a large amount of output.
-The NOTSORTED option cannot be used with the MERGE or UPDATE statements.

5. FILENAME Statement
You already know that you can use a FILENAME statement to associate a fileref with a
single raw data file. You can also use a FILENAME statement to concatenate raw data
files by assigning a single fileref to the raw data files that you want to combine.

Syntax:
FILENAME fileref ('external-file1' 'external-file2' ...'external-filen');
where

 fileref is any SAS name that is eight characters or fewer.
 'external-file' is the physical name of an external file. The physical name is the
name that is recognized by the operating environment.
Warning: All of the file specifications must be enclosed in one set of parentheses.

filename num1 ('F:\Advance\Advance Tech 1\f1.txt'
               'F:\Advance\Advance Tech 1\f2.txt'
               'F:\Advance\Advance Tech 1\f3.txt');
data num;
infile num1;
input school grade num;
run;

If you are not familiar with the content and structure of your raw data files, you can use
PROC FSLIST to view them.

PROC FSLIST
It displays an external file for interactive browsing. This routine provides a convenient
method for examining the information stored in an external file.
proc fslist file='path';
quit;

6. Sampling Data
6.1 Method
There are several ways to obtain an unbiased, random sample:
6.1.1 Simple Random Sampling
This is the equivalent of mixing all units in the population and then drawing out items
one at the time. A random number table or number generator can be used to select the
units for the sample.
6.1.2 Stratified Sampling
This is done by first separating units into (sub) groups, such as by product design,
manufacturing plant, production machine, date of production, lot of raw material, etc.
A random sample is then taken from each group.
6.1.3 Systematic Sampling
This is conducted by taking items at fixed intervals (such as every fifth item from a list
with random starting point).

6.2 GENERATING RANDOM NUMBERS


The validity of a random sample depends on generating valid random numbers. Random
numbers are usually created by a pseudo random number generator. This pseudo random
number generator creates a sequence of numbers between 0 and 1 using an arithmetic
process. In Base SAS, the pseudo random number generator is contained in the RANUNI
and UNIFORM functions; these are the same function known by two different names. The
generation of random numbers is controlled by the seed, which is a number that starts the
random number generator. Random-number functions generate streams of random numbers
from an initial starting point, called a seed, that either the user or the computer clock
supplies. A seed must be a nonnegative integer with a value less than 2**31-1 (or
2,147,483,647). Controlling the seed is useful in testing and simulation studies because
it makes the stream reproducible.

data random;

do i = 1 to 20;
seed=521 ;
random = uniform (seed);
output;
end;
run;

6.3 ASSOCIATING A RANDOM NUMBER WITH A RANDOM INTEGER
(SPECIAL SAS FUNCTIONS: CEIL, INT, FLOOR)
DATA FUNCTION;
DO I=1 TO 15;
X=7-I-.3;
INT=INT(X);
CEIL=CEIL(X);
FLOOR=FLOOR(X);
OUTPUT;
END;
RUN;
Three different and related functions used are: INT ( ), CEIL ( ), and FLOOR ( ). Each of
these functions converts a real number to an integer in different ways. INT ( ) takes the
integer part of a number (e.g. INT ( 7.7) = 7 ); CEIL( ) takes the next largest integer for a
number ( e.g. CEIL(7.7) = 8); and FLOOR( ) takes the next lowest integer of a number
(e.g. FLOOR(7.7) = 7).
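
Putting these pieces together, a common sketch for drawing a random integer between 1 and
N multiplies a uniform random number by N and applies CEIL (here N=100; the seed value is
arbitrary):

data random_int;
   seed = 1234;
   do i = 1 to 10;
      pick = ceil(100 * ranuni(seed));  /* ranuni returns a value in (0,1), so pick is an integer from 1 to 100 */
      output;
   end;
run;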

6.4 PROC SURVEYSELECT


The SURVEYSELECT procedure provides a variety of methods for selecting probability-
based random samples. The procedure can select a simple random sample or a sample
according to a complex multistage sample design that includes stratification, clustering,
and unequal probabilities of selection.
The SURVEYSELECT procedure provides methods for both equal probability sampling
and probability proportional to size (PPS) sampling. In equal probability sampling, each
unit in the sampling frame, or in a stratum, has the same probability of being selected for
the sample. In PPS sampling, a unit's selection probability is proportional to its size
measure.
The SURVEYSELECT procedure provides the following equal probability sampling
methods:
 simple random sampling
 unrestricted random sampling (with replacement)
 systematic random sampling
 sequential random sampling

data a;
do i =1 to 1000;
output;
end;
run;

/*using PROC SURVEYSELECT-Simple Random Sampling*/;


proc surveyselect data=a method=srs n=100 out=as;
run;

In simple random sampling, each unit has an equal probability of selection, and sampling
is without replacement.
/*using PROC SURVEYSELECT-Stratified Sampling*/;
proc surveyselect data=Customers method=srs n=15
seed=1953 out=Strat;
strata State Type;
run;

/*using PROC SURVEYSELECT-Stratified Sampling with Control Sorting*/;


proc surveyselect data=Customers method=sys seed=1234
rate=.02 out=SampleControl;
strata State;
control Type Usage;
run;

Note: the METHOD=SYS option requests systematic random sampling. The SEED=1234
option specifies the initial seed for random number generation. The RATE=.02 option
specifies a sampling rate of 2% for each stratum.

The method of unrestricted random sampling (METHOD=URS) selects units with equal
probability and with replacement. Because units are selected with replacement, a unit can
be selected for the sample more than once.
proc surveyselect data=a method=urs rep=1 out=as2 n=100;
run;
proc surveyselect data=a method=urs rep=1 outhits n=100 out=as3;
run;

7. PROC TRANSPOSE
Overview
The TRANSPOSE procedure creates an output data set by restructuring the values in a
SAS data set, transposing selected variables into observations. The TRANSPOSE
procedure can often eliminate the need to write a lengthy DATA step to achieve the same
result. Further, the output data set can be used in subsequent DATA or PROC steps for
analysis, reporting, or further data manipulation.
PROC TRANSPOSE does not produce printed output. To print the output data set from
the PROC TRANSPOSE step, use PROC PRINT, PROC REPORT, or another SAS
reporting tool.
A transposed variable is a variable the procedure creates by transposing the values of an
observation in the input data set into values of a variable in the output data set.
Syntax
PROC TRANSPOSE DATA=SAS-data-set
               OUT=SAS-data-set
               PREFIX=name
               NAME=name
               LABEL=name;
   VAR variable-list;
   ID variable;
   IDLABEL variable;
   BY variable-list;
RUN;

Some examples
data WII_test;
input Name $9. +1 ID $ session $ Test1 midterm
Final;
datalines;
Jason     0545 1 64 71 87
Duham     1236 2 81 95 91
Jeff      1167 1 65 94 92
McBane    1230 2 63 75 80
Grant     2527 2 80 76 71
Lunds     4860 1 92 40 86
Mccain    0674 1 75 78 72
;
run;
(a) simple transpose: transpose all numeric variables.
proc transpose data=wii_test out=out_transposed;
run;
(b) Naming Transposed Variables:
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
run;
(c) Labeling Transposed Variables
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
id id;
idlabel name;
run;
(d) by group
proc sort data=wii_test; by session; run;
proc transpose data=wii_test out=out_transposed
name=test prefix=ID;
id id;
idlabel name;
var test1 midterm final;
by session;
run;

Conclusion: The TRANSPOSE procedure is very useful when used to transform data
from rows to columns or columns to rows. Many are uncertain of its effects on their data,
but by using the procedure along with the various options that are available, manipulating
data can be made much easier.

8. SET Statement and Beyond


The SET statement is one of the most frequently used statements in the SAS system.
Proper use of the SET statement is one of the key techniques to improving the efficiency
of SAS programs. We will look at these SET statement options as follows:
END = var
KEY = index
NOBS = var
POINT = var
IN=logic name
Note: Unlike the SET statement data set options, these options are not enclosed in
parentheses.

Using the END= Option

data en;
input accnt balance day @@;
cards;
901 486 1 901 985 4 903 498 2 903 498 2
;
data en2;
set en end=var;
if var=0;
run;

Using the POINT= Option


The value of the variable that is named by the POINT= option should be an integer that is
greater than zero and less than or equal to the number of observations in the SAS data set.
SAS uses the value to point to a specific observation in the SET statement. You must
assign this value within the program so that the POINT= variable has a value when the
SET statement begins execution. Also, in order for SAS to read different observations
into the sample, the value of the POINT= variable must change during execution of the
DATA step.
data a;
input x y @@;
cards;
901 1 902 2 903 3 904 4 905 5
;
data b;
a=2; /*select a given obs*/
set a point=a;
output;
stop;
run;

data b;
do i=2, 4; /*select more obs*/
set a point=i;
output;
end;
stop;
run;

In the following code sample, the DO loop assigns a value to the variable X, which is
used by the POINT= option to select every tenth observation from dat.

Notice that the following example is not a complete step.


do x=1 to 120 by 10;
set dat point=x;
output;
end;

The POINT= option uses direct-access read mode, which means that SAS only reads
those observations that you direct it to read. In direct-access read mode, SAS does not
detect the end-of-file marker. Therefore, when you use the POINT= option in a SET
statement, you must use a STOP statement to prevent the DATA step from looping
continuously.
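
A complete version of that step might look like the following sketch (it assumes the data
set dat contains at least 120 observations):

data sample;
   do x=1 to 120 by 10;
      set dat point=x;   /* direct access: read observations 1, 11, 21, ..., 111 only */
      output;
   end;
   stop;                 /* required, because POINT= never detects end-of-file */
run;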

Using the KEY= Option


The KEY = option retrieves observations from an indexed data set based on the index
key, which can be either a simple key or a composite key. If no observations are found in
the data sets that match the value of the key variable(s), then an error condition occurs.

Data newdata;
Set dsn1;
Set dsn2 key=indexvar;
run;
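
Because an unmatched key raises an error condition, a fuller sketch usually checks the
automatic variable _IORC_ and resets _ERROR_ (the data set, index, and variable names
here are illustrative):

data matched;
   set work.orders;                  /* hypothetical transaction data */
   set work.customers key=custid;    /* work.customers is assumed to be indexed on custid */
   if _iorc_ = 0 then output;        /* key value found: keep the combined observation */
   else do;
      _error_ = 0;                   /* suppress the error message for the unmatched key */
      /* a fuller program would also reset the variables read from work.customers */
   end;
run;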

Using the NOBS= Option


The NOBS = option creates a variable which contains the total number of observations in
the input data set(s). If multiple data sets are listed in the SET statement, the value in the
NOBS= variable is the total number of observations in all the listed data sets.
Example:
data student;
input student_id name $ @@;
cards;
1 John 2 Mary 3 Peter 4 Charlie 5 Sarah 6 Kate 7 Mathew 8 David 9 Clare
;
run;

/*create temporary variable studnt_cnt and assign the value as the number
of observations in the data set student*/
/*in the call symput we refer to the temporary variable*/
data student_cnt;
call symput('nobs_val',put(studnt_cnt,1.));
set student nobs=studnt_cnt;
run;
%put &nobs_val;

/**use symget() function to retrieve the value of the macro variable**/


data student2;
length nobs_val2 3;
nobs_val2=symget('nobs_val');
run;

Using the IN= option


Use the IN= option to create a Boolean variable that is set to 1 ('true') when the data
set contributed data to the current observation. For example, when the IN= variable's
value is 1, you can assign the data set's name to a new variable to record where the
observation came from (see the sketch after the example below).

Example:
Data gcrew;
Input Emp_id $ last_name $ 30.;
Cards ;
00632 WHITE
01483 WONG
01996 SMITH
04064 LAGON
;
RUN;

Data gsched;
Input Emp_id $ site_number $ ;
Datalines ;
00632 350
01996 425
04064 505
;
RUN;
/*keep all observations from data set gcrew, whether or not they have a
matching observation in data set gsched*/
DATA all_gcrew;
Merge gcrew (in=ingcrew)
Gsched (in=ingsched);
By Emp_id ;
if ingcrew then output;
Run;
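
To record which data set contributed an observation, as described above, a small sketch
might assign a label based on the IN= variables (the variable name source is new here):

data all_gcrew2;
   merge gcrew  (in=ingcrew)
         gsched (in=ingsched);
   by Emp_id;
   length source $ 12;
   if ingcrew and ingsched then source='both';
   else if ingcrew then source='gcrew only';
   else source='gsched only';
run;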

9. PROC APPEND
Concatenating SAS data sets is the process of storing observations one after another until
all the data sets and their observations have been combined into one data set.
Many users perform the concatenation process using a DATA step (as shown in the
previous examples), but there are good reasons for using the APPEND procedure. If you
use the SET statement in a DATA step to concatenate two data sets, the SAS System must
process all the observations in both data sets to create a new one. The APPEND
procedure bypasses the processing of data in the original data set and adds new
observations directly to the end of the original data set. Therefore, the APPEND
procedure is more efficient than the SET statement in the DATA step for concatenating
data sets because it reads only the data in the DATA= data set.
PROC APPEND
Syntax: PROC APPEND BASE=SAS-data-set <DATA=SAS-data-set> <FORCE>;
Since PROC APPEND reads only the second data set, set BASE= to the larger data set.
PROC APPEND only reads the data in the DATA= SAS data set, not the BASE= SAS
data set. PROC APPEND concatenates data sets even though there may be variables in
the BASE= data set that do not exist in the DATA= data set.
data master;
input city $ 1-11 month $10. temp;
cards;
Honolulu August 80.7
Honolulu January 72.3
Boston July 73.3
Boston January 29.2
Duluth July 65.6
Duluth January 8.5
New York August 82.7
New York January 22.3
;
data add;
input city $ 1-11 month $10. temp;
cards;
Raleigh July 77.5
Raleigh January 40.5
Miami August 82.9
Miami January 67.2
Los Angeles August 69.5
Los Angeles January 54.5
;
run;
proc append base=master data=add;
run;

In the previous example, the DATA= add contained the same variables as the BASE=
data set (master).

You may need to append data sets when the DATA= data set contains fewer variables
than the BASE= data set; in that case, a warning message is written to the SAS log.
data master;
input city $ 1-11 month $10. temp;
cards;
Honolulu August 80.7
Honolulu January 72.3
Boston July 73.3
Boston January 29.2
Duluth July 65.6
Duluth January 8.5
New York August 82.7
New York January 22.3
;
data add;
input city $ 1-11 month $10. ;
cards;
Raleigh July
Raleigh January
Miami August
Miami January
Los Angeles August
Los Angeles January
;
run;
proc append base=master data=add;
run;
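
The opposite case, where the DATA= data set contains a variable that the BASE= data set
lacks, stops with an error unless the FORCE option is specified; FORCE appends the
observations, drops the extra variable, and writes a warning to the log. A brief sketch,
supposing the Add data set carried an extra variable:

proc append base=master data=add force;
run;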

What are the advantages or/and disadvantages of SET statement and PROC APPEND?

1. PROC APPEND is more efficient for appending two data sets because it performs an
update in place on the BASE= data set: it simply adds the observations from the
DATA= data set to the end of the BASE= data set.
The observations in the BASE= data set are not read or processed.
2. The DATA with SET does not perform an update in place, thus the original data
sets would not be damaged should the process terminate abnormally.
3. If the SAS job terminates abnormally while the APPEND procedure is processing,
the BASE= data set will be marked as damaged.
4. PROC APPEND cannot add variables to the BASE= data set, it can only add
observations to the existing structure of the BASE= data set.
5. If you need to add new variables and observations to the BASE= data set, a DATA
step with a SET statement is the solution.

10. Modify Statement


Using the MODIFY statement, you can update
 every observation in a data set
 observations using a transaction data set and a BY statement
 observations located using an index.

Syntax
DATA SAS-data-set;
MODIFY SAS-data-set;
Statements;
RUN;

/*every observation in a data set */


data a;
input school class;
cards;
1 30
2 30
3 30
;
data a;
modify a;
class=class*1.10;
run;
proc print data=a; run;

/*observations using a transaction data set and a BY statement*/


***Duplicate values of BY variables in Master file
***If duplicate values of the BY variable exist
***in the master data set, only the first observation
***in the group of duplicate values is updated;
data trans;
input school class;
cards;
1 30
2 32
;
data master;
input school class;
cards;
1 20
1 21
2 26
2 28
3 37
;

data master;
modify master trans;
by school;
run;
proc print data=master; run;

***Duplicate values of BY variables in Trans file


***If duplicate values of the BY variable exist in the transaction data
set, the duplicate values overwrite each other so that the last value in
the group of duplicate transaction values is the result in the master
data set ;
data trans;
input school class;
cards;
1 30
1 34
2 32
;
data master;
input school class;
cards;
1 20
2 26
3 23
;

data master;
modify master trans;
by school;
run;
proc print data=master; run;

If duplicate values exist in both the master and transaction data sets, you can use
PROC SQL to apply the duplicate values in the transaction data set to the duplicate
values in the master data set in a one-to-one correspondence.
