How SAS Concatenates Data Sets with the Same Variables

Let's see how SAS processes the DATA step to concatenate the data sets listed in the SET statement.

data empsall1;
set empsdk empsfr;
run;
During compilation, SAS reads the descriptor portion of the first data set, empsdk, and determines that it
has three variables. SAS also determines the attributes of the variables. Then SAS creates the PDV with
slots for the three variables.

PDV
First Gender Country
$8 $8 $8

SAS then looks at the second data set, empsfr, to see if it has additional variables that must be added to
the PDV. Here, empsfr has no additional variables, so SAS makes no further changes to the PDV. At the
bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the new SAS
data set empsall1 is created.

empsall1
First Gender Country

Now SAS is ready to execute the DATA step and create the data portion of the output data set. To start,
SAS initializes the PDV. This means that SAS sets the value of each variable to missing. Remember that
SAS makes a pass, or iteration, through the DATA step for each observation that's read from an input
data set. Consider this: Which observation does SAS look at first? SAS reads the first observation
in empsdk, the first data set specified in the SET statement. SAS reads the values directly into the PDV.
At the bottom of the DATA step, SAS writes the data from the PDV to the output data set as the first
observation.

empsall1
First Gender Country
Lars M Denmark

Now SAS returns to the top of the DATA step for the next iteration. Because SAS continues reading
observations from the same input data set, SAS does not reinitialize the PDV. SAS reads the second
observation in empsdk into the PDV and then writes the data to the output data set as the second
observation.

empsall1
First Gender Country
Lars M Denmark
Kari F Denmark

Returning to the top of the DATA step, SAS now reads the third observation from empsdk into the PDV
and then writes it to the output data set. At the top of the DATA step, SAS reaches the end of the file.

empsall1
First Gender Country
Lars M Denmark
Kari F Denmark
Jonas M Denmark

SAS reinitializes the PDV before switching to the second data set.

PDV
First Gender Country

Now SAS reads the first observation from the data set empsfr into the PDV, and then writes the values to
the output data set as the fourth observation. SAS reads in and writes out each observation in the second
data set until it reaches the end of that file. When SAS finishes executing the DATA step, the output data
set is complete.

empsall1
First Gender Country
Lars M Denmark
Kari F Denmark
Jonas M Denmark
Pierre M France
Sophie F France

Knowing the Structure and Contents of Your Data

When you combine data sets horizontally, or match-merge data sets, you might want to ask the question:
What is the relationship between observations in the input data sets? The observations can be related in
several different ways.

In a one-to-one relationship, a single observation in one data set is related to one, and only one,
observation in another data set based on the values of one or more common variables. For example,
suppose two data sets contain employee identification numbers for the same group of employees. Each
employee ID number appears once in each data set, and each observation in one data set has one
matching observation in the other data set.

one              two
A   B   ID       ID   D   E
        1        1
        2        2
        3        3

In a one-to-many relationship, a single observation in one data set is related to one or more observations
in another data set.

one              two
A   B   ID       ID   D   E
        1        1
        2        1
                 2

In a many-to-one relationship, multiple observations in one data set are related to one observation in
another data set.

one              two
A   B   ID       ID   D   E
        1        1
        1        2
        2

In a many-to-many relationship, multiple observations in one data set are related to multiple observations
in another data set.

one              two
A   B   ID       ID   D   E
        1        1
        1        1
        2        2

Sometimes, the data sets have non-matches. At least one observation in one of the data sets is unrelated
to any observation in another data set based on the values of one or more common variables.

one              two
A   B   ID       ID   D   E
        1        2
        2        3
        4        4
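
One way to find out which relationship you have is to count how often each value of the common variable occurs in each data set. The sketch below is only an illustration: it assumes two data sets named one and two that share a variable named ID, as in the diagrams, and uses the ID values from the non-match example. The other variables are left out, and N is a hypothetical column name.

```sas
/* Illustrative data: only the ID values from the non-match diagram */
data one;
   input ID;
   datalines;
1
2
4
;
run;

data two;
   input ID;
   datalines;
2
3
4
;
run;

proc sql;
   /* ID values that occur more than once in data set one */
   select ID, count(*) as N
      from one
      group by ID
      having count(*) > 1;

   /* ID values that occur more than once in data set two */
   select ID, count(*) as N
      from two
      group by ID
      having count(*) > 1;

   /* ID values in one that have no match in two */
   select ID from one
   except
   select ID from two;
quit;
```

If the first two queries return no rows, each ID is unique in its data set, which points to a one-to-one relationship apart from any non-matches that the last query lists.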

Now take a look at the data sets for your scenario: empsau and phoneh.

empsau phoneh
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1793
Kylie F 121151 121151 +61(2)5555-1849
Birin M 121152 121152 +61(2)5555-1665

Think about this: Which variable can you use to match-merge these data sets? You can use
the EmpID variable for the match-merge. And what about this: Do these data sets have a one-to-one
relationship? Yes, each data set contains the same three employee ID numbers. One last question: How
many variables will the new data set empsauh contain? Empsauh will contain four variables: the first two
variables come from the empsau data set. The third variable, the BY variable, is common to the two input
data sets. And the last variable comes from the phoneh data set.

empsauh
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1793
Kylie F 121151 +61(2)5555-1849
Birin M 121152 +61(2)5555-1665

The MERGE and BY Statements in the DATA Step

You can use the DATA step to merge multiple data sets into a single data set. Instead of the SET
statement, you use the MERGE statement.

DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;

The MERGE statement joins observations from two or more SAS data sets into single observations, so
you normally specify at least two data sets in the MERGE statement. If you specify only one data set, SAS
treats the MERGE statement like a SET statement.

In this example, we'll specify the data sets empsau and phoneh as the data sets to merge. Next, the BY
statement indicates a match-merge. You specify the common variable or variables to match, which in this
case is EmpID.
data empsauh;
merge empsau phoneh;
by EmpID;
run;
The BY variables must be common to all data sets, and the data sets must be sorted by the variables
listed in the BY statement. What can you use to sort the data sets? You can use PROC SORT to sort
the empsau and phoneh data sets by the common variable EmpID. Fortunately, the two data sets that
you're working with are already sorted on the BY variable, EmpID.
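
If the data sets were not already in order, the sorting step could look like the minimal sketch below. It assumes that empsau and phoneh already exist in the current library and sorts each one in place.

```sas
/* Sort each input data set by the BY variable */
proc sort data=empsau;
   by EmpID;
run;

proc sort data=phoneh;
   by EmpID;
run;

/* Match-merge the sorted data sets */
data empsauh;
   merge empsau phoneh;
   by EmpID;
run;
```

PROC SORT replaces the input data set unless you name a different output data set with the OUT= option.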

How SAS Performs a One-to-Many Match-Merge

To match-merge these data sets, you write the same kind of DATA step that you wrote before.

data empphones;
merge empsau phones;
by EmpID;
run;
The DATA statement identifies the output data set as empphones. The MERGE statement lists the two
input data sets, and the BY statement specifies the BY variable EmpID, which SAS uses to combine the
observations. All observations that have the same value of the BY variable are in the same BY group.
Notice that you don't need to sort these data sets because they are already in order by EmpID.

Let's see how SAS processes this DATA step when the data sets have a one-to-many relationship. At the
end of the compilation phase, SAS has created the PDV as well as the descriptor portion of the output
data set. The output data set isn't shown here; you'll have a chance to see it later. The PDV has five
variables: First and Gender from empsau; EmpID, which appears in both data sets;
and Type and Phone from phones.

PDV
First Gender EmpID Type Phone

To start, SAS sets the values in the PDV to missing.

PDV
First Gender EmpID Type Phone
.

Now, at the start of the execution phase, SAS is ready to combine observations. SAS looks at the first
observation in each of the two data sets to determine which BY group should appear first in the output
data set. Here's a question. Do the EmpID values match? Yes. These two observations have the same
BY value, so they are in the same BY group.
empsau phones
First Gender EmpID EmpID Type Phone
Togar M 121150 121150 Home +61(2)5555-1793
Kylie F 121151 121150 Work +61(2)5555-1794
Birin M 121152 121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
data empphones;
merge empsau phones;
by EmpID;
run;
The DATA step reads the values from the two data sets into the PDV, in the order they appear in the
MERGE statement. The PDV now contains data for Togar's home phone number.

PDV
First Gender EmpID Type Phone
Togar M 121150 Home +61(2)5555-1793

SAS then writes the contents of the PDV to the output data set as the first observation. SAS is now ready
to start another iteration of the DATA step.

At the beginning of each DATA step iteration, SAS reinitializes any new variables in the PDV. In this
example, SAS does not reinitialize any variables, because they all come from the input data sets.
However, if the DATA step had an assignment statement that created new variables, SAS would reset the
values of the new variables to missing in the PDV.

SAS now moves to the second observation in each data set. Do the EmpID values match?

No, they don't. Does either EmpID match the EmpID in the PDV? Yes. The second observation
in phones is in the same BY group. The second observation in phones contains Togar's work phone
number. The observation in empsau has a different BY value. This observation is in a different BY group,
and SAS will process it in the next iteration.

Now, SAS reads the values of Type and Phone from the observation in phones into the PDV. These new
values replace the previous values of Type and Phone in the PDV. However, the values of First, Gender,
and EmpID remain the same as before.

PDV
First Gender EmpID Type Phone
Togar M 121150 Work +61(2)5555-1794

Finally, SAS writes the values in the PDV to the output data set, creating the second observation.

Once again, SAS retains the values in the PDV. In empsau, SAS is still looking at the second observation
because this data has not been read to the PDV. However, in phones, SAS moves down to the third
observation.

empsau phones
First Gender EmpID EmpID Type Phone
Togar M 121150 121150 Home +61(2)5555-1793
Kylie F 121151 121150 Work +61(2)5555-1794
Birin M 121152 121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666

Do the EmpID values match? Yes, these observations are in the same BY group. Does
either EmpID match the EmpID in the PDV? No. These observations are in a new BY group, so SAS sets
all the values in the PDV to missing.

PDV
First Gender EmpID Type Phone
.

SAS reads the values from the empsau observation, and then the phones observation, into the PDV.

PDV
First Gender EmpID Type Phone
Kylie F 121151 Home +61(2)5555-1849

Then SAS writes the PDV values to the output data set as the third observation. The DATA step continues
executing in this way until SAS reaches the end of file for both data sets. Here is the final output data set.
Notice that the output data set has one observation for each phone number, so an employee with several
phone numbers appears more than once.

empphones
First Gender EmpID Type Phone
Togar M 121150 Home +61(2)5555-1793
Togar M 121150 Work +61(2)5555-1794
Kylie F 121151 Home +61(2)5555-1849
Birin M 121152 Work +61(2)5555-1850
Birin M 121152 Home +61(2)5555-1665
Birin M 121152 Cell +61(2)5555-1666
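
To follow this walkthrough in the SAS log, a sketch like the one below may help. It assumes the input data sets contain exactly the observations shown in the tables. The DATALINES steps and the $15. informat for Phone are illustrative choices, and the PUT statement is added only for tracing.

```sas
/* Illustrative input data, typed in from the tables above */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phones;
   input EmpID Type $ Phone :$15.;
   datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;
run;

data empphones;
   merge empsau phones;
   by EmpID;
   /* Write the contents of the PDV to the log on every iteration */
   put _all_;
run;
```

Each line that PUT writes to the log corresponds to one iteration of the DATA step, so the retained values of First and Gender within a BY group are visible there.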

How SAS Match-Merges Data Sets with Non-Matches

By default, the DATA step includes both the matching and non-matching observations in a merged data
set. Let's see how SAS processes the DATA step in this scenario. We'll start at the beginning of the
execution phase. SAS has already created the PDV and the descriptor portion of the output data set with
the four variables First, Gender, EmpID, and Phone.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348
data empsauc;
merge empsau phonec;
by EmpID;
run;
SAS looks at the first observation in each data set to determine which BY group should appear first.
These two observations are in the same BY group. So SAS reads the values from the current observation
in each data set into the PDV.

PDV
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795

Then SAS writes the contents of the PDV to the output data set as the first observation.

empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795

The values remain in the PDV as SAS begins the next iteration of the DATA step.

PDV
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795

SAS now looks at the second observation in both data sets.


empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

Do these EmpID values match? No, they don't. Does either EmpID match the EmpID in the PDV? No.
Neither of these observations is in the same BY group as the PDV. This is the first non-matching
observation that SAS has identified. Because the current observations are not in the same BY group as
the values in the PDV, SAS reinitializes the PDV.

PDV
First Gender EmpID Phone
.

Now think about this. Which EmpID value comes first sequentially? In the current observations,
the EmpID value ending in 151 comes before the value ending in 152. So SAS reads the second
observation in empsau into the PDV. In the PDV, Phone is still set to missing because there is no phone
number for this employee in phonec.

PDV
First Gender EmpID Phone
Kylie F 121151

SAS writes the data in the PDV to the output data set as the second observation.

empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151

Once again, SAS returns to the top of the DATA step and moves down to the third observation
in empsau.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

In phonec, SAS is still looking at the second observation. Do the EmpID values match? Yes, they do.
Does either EmpID match the EmpID in the PDV? No. These two observations are not in the same BY
group as the values in the PDV, so SAS reinitializes the PDV.

PDV
First Gender EmpID Phone
.

Then, SAS reads the values from the empsau observation, and then the phonec observation into the
PDV.

PDV
First Gender EmpID Phone
Birin M 121152 +61(2)5555-1667

SAS writes the data to the output data set as the third observation.

empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151
Birin M 121152 +61(2)5555-1667

Once again, SAS returns to the top of the DATA step. SAS has reached the end of the file in empsau, but
not in phonec. SAS looks at the third observation in phonec.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

Does this EmpID match the EmpID in the PDV?

PDV
First Gender EmpID Phone
Birin M 121152 +61(2)5555-1667

No, this observation is not in the same BY group, so SAS reinitializes the PDV.

PDV
First Gender EmpID Phone
.

SAS reads the values from the phonec observation into the PDV,

PDV
First Gender EmpID Phone
121153 +61(2)5555-1348

and then writes the data to the output data set as the fourth observation.

empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151
Birin M 121152 +61(2)5555-1667
121153 +61(2)5555-1348

SAS returns to the top of the DATA step. SAS has reached the end of file in both data sets.
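
The non-match scenario can be reproduced with a minimal sketch like the one below. It assumes the input data sets contain exactly the observations shown in the tables. The DATALINES steps and the $15. informat for Phone are illustrative choices.

```sas
/* Illustrative input data, typed in from the tables above */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
   input EmpID Phone :$15.;
   datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

/* By default, matches and non-matches are all written to the output */
data empsauc;
   merge empsau phonec;
   by EmpID;
run;

proc print data=empsauc;
run;
```

The printed data set should have four observations, with a missing Phone for EmpID 121151 and missing First and Gender values for EmpID 121153, as in the final table above.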

Using the IN= Data Set Option

You can use the IN= data set option in a MERGE statement to identify which input data sets contributed
to each observation in your output. After a SAS data set name, you specify the IN= option in parentheses,
followed by a valid SAS variable name.

MERGE SAS-data-set (IN=variable)...

When you specify the IN= option after an input data set in the MERGE statement, SAS creates a
temporary numeric variable that indicates whether the data set contributed data to the current
observation. The temporary variable has two possible values. If the value of the variable is 0, it indicates
that the data set did not contribute to the current observation. If the value of the variable is 1, the data set
did contribute to the current observation.

In this example, the IN= option is specified after each of the input data sets.
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;
We want to know when either of these data sets contributes to the current observation. We've chosen
the variable names Emps and Cell. Here's another example using just E and P as the variable names.
data empsauc;
merge empsau(in=E)
phonec(in=P);
by EmpID;
run;
This last example shows how you can use the IN= option on just one of the data sets in a MERGE
statement.
data empsauc;
merge empsau(in=AU)
phonec;
by EmpID;
run;

How SAS Processes the IN= Data Set Option

Here's how SAS processes the IN= data set option in the DATA step.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;
During the execution phase, SAS creates a temporary variable in the PDV for each
instance of the IN= data set option in your code. Each time SAS reads data into the
PDV, SAS assigns a value to the temporary variables Emps and Cell to indicate whether
the associated data set contributed data to the current observation.
PDV
First Gender EmpID Emps Phone Cell

In the first iteration, both data sets contributed to the data that is in the PDV, so the
value of both temporary variables is 1. We have a match.
PDV
First Gender EmpID Emps Phone Cell
Togar M 121150 1 +61(2)5555-1795 1

In the second iteration, the data set phonec did not contribute, so the value of the
temporary variable Cell is set to 0. We have a non-match.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

PDV
First Gender EmpID Emps Phone Cell
Kylie F 121151 1 0

In the third iteration, both data sets contributed to the data that is in the PDV, so SAS
assigns the value 1 to both Emps and Cell. We have another match.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

PDV
First Gender EmpID Emps Phone Cell
Birin M 121152 1 +61(2)5555-1667 1

Take a moment to look at the values after the fourth iteration.

empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348

What are the values of Emps and Cell for this data? This data is a non-match that comes
from phonec but not from empsau. SAS sets the variable Emps to 0 and Cell to 1.
PDV
First Gender EmpID Emps Phone Cell
121153 0 +61(2)5555-1348 1

You might be wondering whether variables that are created with the IN= data set option
appear in the output data set. These variables are only available during execution. As
you can see by looking at the partial output data set here, SAS does not write these
temporary variables to the output data set.

empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151
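
Because the IN= variables are not written to the output data set, one way to inspect them is to copy them into ordinary variables with assignment statements. The sketch below is an illustration only: FromEmps and FromCell are hypothetical variable names, and the input data is typed in from the tables above.

```sas
/* Illustrative input data, typed in from the tables above */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
   input EmpID Phone :$15.;
   datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

data empsauc;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   /* Copy the temporary IN= flags into variables that are kept */
   FromEmps = Emps;
   FromCell = Cell;
run;

proc print data=empsauc;
run;
```

For this data, FromEmps and FromCell should show the same pairs of values as the PDV in the walkthrough: 1 and 1, 1 and 0, 1 and 1, and then 0 and 1.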
