Robert G. Brookshire
For most social scientists, data come in a rectangular form similar to a spreadsheet,
where columns represent variables and rows are observations. For a few years now,
however, a more complex view of data has been evolving in the fields of computer
science and management information systems. The purposes of this paper are to in-
troduce this relational view of data to social scientists and to argue that this way of
looking at data can be much more powerful than the traditional view. The first part
of the paper introduces the terminology and concepts of the relational model. This
is followed by a discussion of relational operators, normalization, and the entity-
Social Science Computer Review 11:2, Summer 1993. Copyright © 1993 by Duke University
Press. CCC 0894-4393/93/$1.50
tools that researchers should maintain. Second, users of data ought
to be able to communicate with the designers and managers of data-
bases, the people who are custodians of the data files. Having a com-
mon set of concepts and language promotes the good working rela-
tionships that are necessary for ongoing research. Third, for many
kinds of data that social scientists must analyze, the traditional flat
file is inadequate. Data that encompass several different levels or
units of analysis, or contain time-dependent measures of varying
length, can be represented only very clumsily with rectangular data
structures. The relational model can simplify the storage and re-
trieval of these kinds of data. Finally, the relational model is a
somewhat different way of looking at data, one that can lead to
more comprehensive, flexible, and insightful kinds of data analysis.
Relational Operators
Tables in a relational database are manipulated through the use of
a
which has one fewer row than President because a relation may not
have duplicate rows. Similarly, the operation project President(State)
would yield:
State
New York
New Jersey
California
Select. The select operator, also called theta-select, is used to sub-
set the rows of a table according to an equality or inequality condi-
tion. The operation select President(B_Term < 1920) yields:
F_Name     L_Name     State       B_Term
Theodore   Roosevelt  New York    1901
Woodrow    Wilson     New Jersey  1913
Millard    Fillmore   New York    1850
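The project and select operators can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a relation is modeled as a set of tuples, the helper names `project` and `select` are hypothetical, and because the full President relation is not reproduced here, the rows are assumed (the three rows of the select result plus one California row added so that the projection on State matches the table above).

```python
# Column names follow the paper; the rows themselves are assumed.
COLUMNS = ("F_Name", "L_Name", "State", "B_Term")

president = {
    ("Theodore", "Roosevelt", "New York", 1901),
    ("Woodrow", "Wilson", "New Jersey", 1913),
    ("Millard", "Fillmore", "New York", 1850),
    ("Richard", "Nixon", "California", 1969),  # assumed row, not from the paper
}


def project(relation, columns, keep):
    """Keep only the named columns; the set drops duplicate rows."""
    idx = [columns.index(name) for name in keep]
    return {tuple(row[i] for i in idx) for row in relation}


def select(relation, columns, predicate):
    """Keep only the rows for which the condition holds."""
    return {row for row in relation if predicate(dict(zip(columns, row)))}


states = project(president, COLUMNS, ["State"])
before_1920 = select(president, COLUMNS, lambda r: r["B_Term"] < 1920)
```

Using a set for the result mirrors the rule that a relation may not contain duplicate rows: New York appears twice in the assumed input but only once in `states`.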
Join. The join operator, also called theta-join, is used to join the
rows of two relations according to an equality or inequality condition.
If we have a relation called States, as in the following:
S_Name      Admitted
New York    1788
New Jersey  1787
Arizona     1912
Hawaii      1959
the operation join President(State = S_Name)States would result
in:
Relation P                       Relation VP
F_Name  L_Name      Born         F_Name  L_Name     Born
George  Washington  1732         John    Adams      1735
John    Adams       1735         Thomas  Jefferson  1743
James   Madison     1751         Aaron   Burr       1756
James   Monroe      1758
The union of P and VP would give the relation:
F_Name L_Name Born
George Washington 1732
John Adams 1735
James Madison 1751
James Monroe 1758
Thomas Jefferson 1743
Aaron Burr 1756
Their intersection yields the relation:
F_Name L_Name Born
John Adams 1735
Their difference produces the relation:
F_Name L_Name Born
George Washington 1732
James Madison 1751
James Monroe 1758
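Because relations are sets of rows, the three results above map directly onto Python's built-in set operators. The sketch below simply restates the relations P and VP as sets of tuples; the variable names are illustrative.

```python
# Relations P and VP from the text, one tuple per row.
p = {
    ("George", "Washington", 1732),
    ("John", "Adams", 1735),
    ("James", "Madison", 1751),
    ("James", "Monroe", 1758),
}
vp = {
    ("John", "Adams", 1735),
    ("Thomas", "Jefferson", 1743),
    ("Aaron", "Burr", 1756),
}

union = p | vp          # rows in P, in VP, or in both
intersection = p & vp   # rows in both P and VP
difference = p - vp     # rows in P that are not in VP
```

These operators only make sense when both relations have the same columns in the same order, as P and VP do here.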
Division. Relational division is rather complicated and is best ex-
plained through the use of an example. Consider the following re-
lation, Committee, which shows Senate committees and their
members:
C_Name             M_Name     M_Rank  Party
Judiciary          Kennedy    2       D
Armed Services     Kennedy    4       D
Small Business     Nunn       2       D
Armed Services     Nunn       1       D
Armed Services     Warner     1       R
Foreign Relations  Kassebaum  3       R
We have a second relation, Senators, which has only some names of
senators:
M_Name
Kennedy
Nunn
Warner
The division operation is equivalent to asking the question, "What
committee contains all the members listed in the relation Senators?"
The divisor is Senators, the dividend is Committee, and the
result is the quotient:
C_Name          M_Rank  Party
Armed Services  4       D
Armed Services  1       D
Armed Services  1       R
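The division example can be traced step by step in Python. This is a sketch of one way to compute the quotient shown above, not a general-purpose division operator: it first finds the committees whose membership includes every name in Senators, then keeps the remaining columns of the matching rows. The variable names are illustrative.

```python
# Committee rows from the text: (C_Name, M_Name, M_Rank, Party).
committee = [
    ("Judiciary", "Kennedy", 2, "D"),
    ("Armed Services", "Kennedy", 4, "D"),
    ("Small Business", "Nunn", 2, "D"),
    ("Armed Services", "Nunn", 1, "D"),
    ("Armed Services", "Warner", 1, "R"),
    ("Foreign Relations", "Kassebaum", 3, "R"),
]
senators = {"Kennedy", "Nunn", "Warner"}  # the divisor

# Collect the set of member names for each committee.
members = {}
for c_name, m_name, _rank, _party in committee:
    members.setdefault(c_name, set()).add(m_name)

# A committee qualifies only if it contains every senator in the divisor.
qualifying = {c for c, names in members.items() if senators <= names}

# Quotient as presented in the text: the non-divisor columns of matching rows.
quotient = [
    (c_name, rank, party)
    for c_name, m_name, rank, party in committee
    if c_name in qualifying and m_name in senators
]
```

Here `qualifying` holds the single committee that answers the question, and `quotient` reproduces the three-row table above in the same order.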
The remainder of the division operation is what is left of Committee:
ing and deleting keys, because a change to a primary key must also
be reflected in changes to any foreign keys that are equivalent to it.
There are many other operators, the description of which is beyond
the scope of this paper.
Normalization
Normalization is the process of designing a database to eliminate
certain kinds of redundancy in the information that is maintained
in the relations. To this end, certain rules have been defined for several
"normal forms" for relations. These normal forms, from the
simplest to the most complex, are first, second, and third normal
forms; Boyce-Codd normal form; and fourth and fifth normal forms.
Each form adds additional requirements to the one that precedes it.
For example, a relation in second normal form meets all the require-
ments of first normal form, plus some others. A detailed exposition
of these topics with an annotated bibliography can be found in Date
(1986), and a good brief overview is available in Kent (1983).
First normal form. First normal form (1NF) is the basic form for
relational data. To be in 1NF, a relation must have no repeating rows
and each row must have the same number of columns. Most social
science data are in first normal form, although there are a few com-
monly used data sets that are not.
Second and third normal forms. Second and third normal forms
deal with relations in which the primary key is composed of more
than one field. Consider the following relation, Alliances:
Country        Alliance  Date  Capital
United States  OAS       1948  Washington, DC
Mexico         OAS       1948  Mexico City
United States  NATO      1949  Washington, DC
Canada         NATO      1949  Ottawa
The primary key for this relation is the combination of the two col-
umns Country and Alliance. The column Capital, however, con-
tains redundant information in that the capital must be repeated for
each occurrence of each country. Second normal form (2NF) removes
this redundancy. To put this data into 2NF, the relation should be
decomposed into two relations:
Country        Alliance  Date
United States  OAS       1948
Mexico         OAS       1948
United States  NATO      1949
Canada         NATO      1949
and
Country        Capital
United States  Washington, DC
Mexico         Mexico City
Canada         Ottawa
More formally, in 2NF only columns that contain information about
the entity defined in the key should be contained in the relation.
Because Capital provides information only about the country, not
the country and its alliance, it should be stored in another relation.
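The 2NF decomposition of Alliances can be expressed as two projections. The following Python sketch uses the rows from the text; the names `membership` and `capitals` are illustrative, and the final check is only a sanity test that each country maps to a single capital, which is the dependency that justifies moving Capital out of the relation.

```python
# Alliances rows from the text: (Country, Alliance, Date, Capital).
alliances = [
    ("United States", "OAS", 1948, "Washington, DC"),
    ("Mexico", "OAS", 1948, "Mexico City"),
    ("United States", "NATO", 1949, "Washington, DC"),
    ("Canada", "NATO", 1949, "Ottawa"),
]

# Projection on the key (Country, Alliance) plus the column that depends on all of it.
membership = {(country, alliance, date) for country, alliance, date, _ in alliances}

# Projection on Country and Capital; the set stores each capital once per country.
capitals = {(country, capital) for country, _, _, capital in alliances}

# Capital depends on Country alone if no country appears with two capitals.
one_capital_per_country = len(capitals) == len({country for country, _ in capitals})
```

The capital of the United States is stored twice in the original relation but only once in `capitals`.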
In 3NF, this requirement is extended to include columns that con-
tain information about nonkey data as well. Consider the following
relation, which contains information about senators:
Senator Party State Capital
Kennedy D Massachusetts Boston
Kerry D Massachusetts Boston
Nunn D Georgia Atlanta
Dole R Kansas Topeka
Kassebaum R Kansas Topeka
Warner R Virginia Richmond
Because the relation contains data about senators, the key is the
column Senator. The column Capital, however, does not provide in-
formation about the senators, but about the states they represent.
It is redundant to repeat the capital for each occurrence of state.
To put this relation into 3NF, it should be decomposed into two
relations:
Senator Party State
Kennedy D Massachusetts
Kerry D Massachusetts
Nunn D Georgia
Dole R Kansas
Kassebaum R Kansas
Warner R Virginia
and
State Capital
Massachusetts Boston
Georgia Atlanta
Kansas Topeka
Virginia Richmond
In order to be in 3NF, then, each column in each row must "provide
a fact about the key, the whole key, and nothing but the key"
(Kent, 1983, p. 120). The truly devout add to this definition, "so help
me, Codd." For both 2NF and 3NF, if the original relation is required
for analysis, it can be easily reconstituted by using the relational
operators on the two new relations.
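As an illustration of that last point, the original senators relation can be rebuilt from its two 3NF relations with an equality join on State. The sketch below is a minimal Python version; `natural_join` is a hypothetical helper that assumes the shared column is the last column of the left relation and the first column of the right one.

```python
# The two 3NF relations from the text.
senators = {
    ("Kennedy", "D", "Massachusetts"),
    ("Kerry", "D", "Massachusetts"),
    ("Nunn", "D", "Georgia"),
    ("Dole", "R", "Kansas"),
    ("Kassebaum", "R", "Kansas"),
    ("Warner", "R", "Virginia"),
}
state_capitals = {
    ("Massachusetts", "Boston"),
    ("Georgia", "Atlanta"),
    ("Kansas", "Topeka"),
    ("Virginia", "Richmond"),
}


def natural_join(left, right):
    """Pair rows whose shared column matches, keeping that column once."""
    return {l + r[1:] for l in left for r in right if l[-1] == r[0]}


# Rows of (Senator, Party, State, Capital), as in the original relation.
reconstituted = natural_join(senators, state_capitals)
```

Each capital is stored once in `state_capitals` and repeated only in the joined result, where the repetition is derived rather than maintained by hand.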
Boyce-Codd normal form. Boyce-Codd normal form (BCNF) is an
extension of 3NF. Consider the relation:
Senator Committee Chairman
Kennedy Judiciary Biden
Kennedy Armed Services Nunn
Warner Armed Services Nunn
Kassebaum Foreign Relations Pell
This relation has two possible keys, the combination of Senator and
Committee, and the combination of Committee and Chairman. If
we choose one of these combinations as the primary key, we are
and
Country Language
Canada English
Canada French
Switzerland French
Switzerland Italian
Switzerland German
United Kingdom English
Note that the blank entries in the relations are eliminated when the
data are transformed to 4NF. As with all the previous examples, we
can recover the original relations, if necessary, through operation on
the relations in 4NF.
Fifth normal form. Consider the relationships between importers
and exporters of agricultural products. A country may export a prod-
uct to one or more countries. A country may import a product from
one or more countries. Countries may import or export more than
one product, and many countries can import or export the same
Second,
Producer       Importer
United States  Russia
United States  China
Australia      Russia
France         Russia
France         Japan
Third,
Importer Product
Russia wheat
China wheat
Russia rice
Japan wheat
As with the other normal form relations, the original relation can
be recovered through operations on the decomposed relations. A relation
in 5NF is also in 1NF, 2NF, 3NF, BCNF, and 4NF.
Entity-Relationship Diagrams
As can be seen from these small examples, the depiction of rela-
tional data can quickly become quite cumbersome. Several tech-
niques have been developed to diagram relational data, the most
popular of which is the entity-relationship diagram (Chen, 1976).
The symbols used in the entity-relationship diagram are not stan-
dardized. This paper employs those presented by Eliason (1990),
which are similar to Chen's.
In an entity-relationship diagram (ERD), an entity is anything for
which data are stored. It is symbolized by a box. A relationship be-
tween two entities is symbolized by a diamond that is connected to
the two entities. Labels in the symbols identify the entities and
their relationships. The lines connecting the entities and relations
are labeled with 1, N, or M to indicate the degree of the relationship.
Both N and M indicate more than one and are used to indicate that
the degree of the relationship is not necessarily equal on both sides.
The way these symbols are used will become clear in the example
below.
Each entity and each relationship represents a relation or table.
Chen (1976) called these entity relations and relationship relations.
A list of the columns or attributes of the entity and relationship
relations can be presented in the diagram, with the primary keys
identified by having their names underlined. Figure 1 presents an
example of an ERD.
In Figure 1, each representative serves on more than one (M) com-
mittee, and each committee has more than one representative (N)
serving on it. On the other hand, each representative represents
only one state, whereas a state may have more than one (M) repre-
sentative.
This database will contain four tables or relations, one for each
entity and one for the relationship between Committee and Representative.
The State relation, whose primary key is S_Name, contains
in addition the columns, or attributes, Capital and Population.
The Representative relation, whose primary key is R_Name, also
contains the columns Age, Party, and District and the foreign key
S_Name. The Committee relation, whose primary key is C_Name,
also has the attributes Chair and Mtg_Rm.
The relation Serves On has three columns. Two contain the primary
key, which is composed of R_Name and C_Name. The relation
also has the column Rank. The relation Represents contains
only the key composed of R_Name and S_Name. It is not necessary
to store this relation, because we can project it from Representative.
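The four relations described above could be declared as follows. This is a hedged sketch using Python's standard `sqlite3` module, not a schema given in the paper: the column types, the table name `Serves_On`, and all inserted rows are assumptions chosen for illustration, and the data are placeholders rather than real representatives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE State (
    S_Name     TEXT PRIMARY KEY,
    Capital    TEXT,
    Population INTEGER
);
CREATE TABLE Representative (
    R_Name   TEXT PRIMARY KEY,
    Age      INTEGER,
    Party    TEXT,
    District INTEGER,
    S_Name   TEXT REFERENCES State(S_Name)  -- foreign key to State
);
CREATE TABLE Committee (
    C_Name TEXT PRIMARY KEY,
    Chair  TEXT,
    Mtg_Rm TEXT
);
CREATE TABLE Serves_On (
    R_Name TEXT REFERENCES Representative(R_Name),
    C_Name TEXT REFERENCES Committee(C_Name),
    "Rank" INTEGER,
    PRIMARY KEY (R_Name, C_Name)  -- composite key of the relationship relation
);
""")

# Placeholder rows (hypothetical, for illustration only).
conn.execute("INSERT INTO State VALUES ('State X', 'Capital X', 1000000)")
conn.executemany(
    "INSERT INTO Representative VALUES (?, ?, ?, ?, ?)",
    [("Rep A", 50, "D", 1, "State X"), ("Rep B", 45, "R", 2, "State X")],
)
conn.execute("INSERT INTO Committee VALUES ('Committee 1', 'Rep A', 'Room 100')")
conn.executemany(
    "INSERT INTO Serves_On VALUES (?, ?, ?)",
    [("Rep A", "Committee 1", 1), ("Rep B", "Committee 1", 2)],
)

# Represents is not stored; it is projected from Representative.
represents = conn.execute(
    "SELECT R_Name, S_Name FROM Representative ORDER BY R_Name"
).fetchall()
```

Only the many-to-many relationship gets a table of its own. The one-to-many relationship between State and Representative is carried by the foreign key, which is why Represents can be projected instead of stored.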
Through the use of relational operators, we could gather a lot of
An Example Database
The National Crime Surveys are conducted by the Bureau of the
Census for the Justice Department's Bureau of Justice Statistics and
contain information about the crimes suffered by households and
individuals (U.S. Department of Justice, Bureau of Justice Statistics,
1991). As distributed by the ICPSR, the data are composed of three
types of records: data about households, individuals, and criminal
incidents. The data are organized hierarchically. Each household
record is associated (by a common variable) with one or more indi-
vidual records that describe the persons age 12 and older who com-
pose the household. The individual records are likewise associated
with incident records, which describe criminal incidents suffered
by the individuals in the household. Not every individual has an
incident record, but some individuals have many of them.
This hierarchical structure is a reasonably efficient way to store
the data but does not correspond to the logical structure of the
data.¹ The logical structure of the data is shown in an entity-relationship
diagram in Figure 2. Each household is composed of one or
more individuals. Each individual may have been a victim of one or
more criminal incidents. Some incidents, such as a burglary, are
suffered by the household as a whole, however, rather than by in-
dividuals separately. How should these incidents be treated?
Under the hierarchical model used by the ICPSR, households are
composed of individuals, and individuals suffer criminal incidents.
Every incident, then, has to be tied to a single, specific individual.
For crimes against individuals, such as assault, if there is more than
one victim in a particular incident, there will be a separate incident
record for each victim. Crimes against households, however, are
represented by a single individual record, which is associated either
with the main respondent for the household or with the individual
who reports it.
The relational model, in contrast, allows the household to relate
directly to the incident without the intervention of the individual,
Discussion
Thinking about data in a relational form has many benefits. Rather
than forcing all the attributes (variables) in a relation (data set) to be
characteristics of one type of record (case), a relational view of data
allows a database to contain many types of records, some of which
describe entities like voters, countries, senators, and so on, whereas
others describe the relationships among these entities. This concept
of keeping the data that describe the relationships among entities
separate from the data about the entities themselves can be very
liberating and cause us to look at our data in new ways.
The study of multinational corporations involves two entities,
the corporations and the countries in which the corporations do
business. The attributes of countries should obviously be stored in
one relation, and the attributes of the corporations, as multination-
Conclusion
This paper has only scratched the surface of the relational model of
data. It has introduced the major concepts of the model, however.
This should enable readers to communicate clearly with database
managers and other computer professionals. It has also provided the
tools so that readers can design reasonably complex relational databases
of their own. Finally, I hope that it has suggested ways in
which the relational model can help to change the way we view
data. It is especially important that we learn to include relation-
ships in our view of data, as objects of study that deserve to have,
and can have, data all their own.
Notes
Robert G Brookshire is assistant professor of information and decision sciences at
References
Chen, P. P. 1976. The entity-relationship model: Toward a unified view of data. ACM
Transactions on Database Systems 1 (1): 9-36.
Codd, E. F. 1970. The relational model of data for large shared data banks.
Communications of the ACM 13 (6): 377-87.
Codd, E. F. 1979. Extending the relational database model to capture more meaning. ACM
Transactions on Database Systems 4 (4): 457-75.
Codd, E. F. 1982. Relational database: A practical foundation for productivity.
Communications of the ACM 25 (2): 109-17.
Codd, E. F. 1990. The relational model for database management: Version 2. Reading,
MA: Addison-Wesley.
Date, C. J. 1986. An introduction to database systems. Vol. 1, 4th ed. Reading, MA:
Addison-Wesley.
Eliason, A. L. 1990. Systems development: Analysis, design, and implementation. 2d
ed. Glenview, IL: Scott, Foresman.
Kent, W. 1983. A simple guide to five normal forms in relational database theory.
Communications of the ACM 26 (2): 120-25.
U.S. Dept. of Justice, Bureau of Justice Statistics. 1991. National crime surveys:
National sample, 1986-1990 (near-term data). 3d ICPSR ed. Ann Arbor: Inter-university
Consortium for Political and Social Research.
Vaughan-Nichols, S. J. 1990. Relational databases: The real story. Byte 15 (13): 321-25.