Documente Academic
Documente Profesional
Documente Cultură
LAUREN B. DOYLE
System Development Corporation, Santa Monica, California
similarly each other label correlates maximally with one other label. Figure 2
shows a somewhat lesser degree of dependence, because some labels, such as
"Thomas" and "John", have fruitful coordinations with several other labels.
It becomes evident that masses of data which consistently lead to the kind
of coordination behavior shown in Figure 2 permit and encourage the formation
mmmmmm
mmmmmm
m
\ \
Ill
,d
~f
amimn
d,
mmmmmmmmm
mmmmmmmg
\\
immmmmmmm
Fig. 2 A Degree of DependenceLeadingto Hierarchy Formation
558 LAUREN B. DOYLE
of hierarchies. Under the category " T h o m a s " are subcategories such as "Jeffer-
son" and " H u x l e y " . When one begins to encounter large numbers of cases
where there are Jeffersons which are not Thomases, or Shakespeares which are
not Williams, then one finds hierarchical classification cumbersome, though
still useful.
Of course, in this example there is admittedly no significance in the fact that
Beecham, Huxley, and Jefferson have the same first name, and therefore no point
in dealing with the hierarchical s t r u c t u r e - - i t is sufficient to deal only with the
subcategories in this case. However, and this is the important point about
hierarchical schemes, whenever the labels for information items consistently
configure themselves as in Figure 2, there is undoubtedly some significance to
the categories as well as to the subcategories. In such a case, the subcategorized
items are empirically established as a species of the categorized items, and thus
hierarchies become capable of providing semantic guidance to searchers.
Now let us enlarge the number of labels and tallies in the Figure 2 example in
order to compare it quantitatively with the results of the random label assign-
ment described earlier. Suppose we assume 1000 labels, 250 of which are first
names, and 750 of which are last names. Let us also assume that we have con-
trived, as in the Figure 2 example, to choose names such t h a t each first name
"belongs" to several other last names, in that several well-known people have
the same first name.
We will go through the text of books and magazines until a million occurrences
of names corresponding to combinations of the 1000 labels have been picked out.
As in the random assignment example, there are 499,500 possible categories
(500,500 if such names as " H u m b e r t H u m b e r t " are taken into account). How-
ever, because of the extreme statistical dependence which has been in, posed, we
now find that only 750 of the 499,500 categories will contain tallies (this assumes
that the names have been chosen cleverly enough that such cross-connections
as "Pancho Shakespeare" or "Ludwig van Bonaparte" are never encountered
in text). The categories with tallies should have an average of 1333 each, and
the mode should be in the neighborhood of 1333. This contrasts with a mode of
2 in the random assignment example. B y traveling from the statistically inde-
pendent extreme to the statistically dependent extreme, we have decreased
precision by a factor of about 650.
W h a t can one do to make the retrieval precision of data with statistically
dependent attributes as good as that for data with independent attributes? One
way is to increase the number of ordinates. For example, in looking for the name
" T h o m a s Jefferson", we could also look for some third attribute (such as "Monti-
cello" or "Louisiana Purchase") to allow us to make three-term instead of two-
term coordinations.
There are an infinite number of possible third attributes for each of the 750
frequently occurring names, but let's assume t h a t only k of them actually
apply to each name, on the average. If we stick to the extreme degree of statistical
dependence we initially chose (e.g. "Monticello" would never be associated in
text with any other name except "Jefferson"), precision will probably be im-
proved by a factor of k. If k is not large, precision will still be poor and a fourth
S E M A N T I C R O A n MAPS FOR L I T E R A T U R E S E A R C H E R S 559
I I I . Topical Intersections
The statistically dependent placement of words in text (and hence of labels
on documents) is a natural consequence of the way people think and communi-
cate. The preceding section has pictured hypothetical examples to show t h a t
word association can have detrimental effects on retrieval system performance.
There is, however, no reason why phenomena whose causes are traceable to the
nature of h u m a n thought should adversely affect the retrieval of information.
Such phenomena can indeed be put to work to aid retrieval: the same cognitive
processes which lead to the highly correlated placement of words in text can
also lead to the recognition of such pairs b y literature searchers. And so we are
going to take a close look at highly correlated word p a i r s - - t o see if we can get
any ideas about how to use them.
Table I shows ten "topical intersections" formed b y word distributions in
600 abstracts of SDC 5 internal documents. A topical intersection is a library
subset in which every document contains both of two designated topical words or
terms.
Since abstracts are brief summaries of articles, it is assumed t h a t whenever a
content word appears even once in the abstract, it is a topical clue to the con-
tents of the article. Five hundred such topical words were selected for study,
though at the time of selection no particular experiments were in mind. A com-
puter program was written to process the keypunched text of the 600 abstracts
and to print out data on all topical intersections, i.e. on all possible pairings of
the 500 words with each other. The data were: a~, the n u m b e r of abstracts in
which one word of a pair appears; no, the number of abstracts in which the
other word of the pair appears; and a3, the number of abstracts in which both
words appear.
( If one assumes that the probability of occurrence of one word in an abstract
is totally independent of the probability of occurrence of the other word, one
can compute the expected size of the intersection as ala2/A, the expected n u m b e r
5 System Development Corporation
TABLE 1
Column 1 5 6 7* 8* 9 10
Totals 85 78 23 10 13
* Column 7 includes cases where both words are m the same sentence; colmnn 8, cases where words are in different sentences.
SEMANTIC ROAD MAPS FOR LITERATURE SEARCHERS 563
programs in columns 5 and 6, where it was syntactically tied to the word "mode[".
Other words for which this was true were "manual", "problem", "production",
and "analysis". Meanings in columns 5 and 6 were specific and consistent.
Indeed, the size of the numbers in these columns shows that the unexpectedly
large sizes of the intersections are due mainly to these adjacent or near-adjacent
occurrences. We draw the following tentative conclusions:
(a) Some of the strongest word correlations in text are correlations in adjacent
or near-adjacent placement.
(b) A likely cognitive explanation for this is the tendency of the mind to
handle word pairs (and often larger groups of adjacent words) as if they were
one word; naturally the meanings of the component words will be dependent
because they are subordinate parts of a larger semantic unit.
Cases without close syntactic connection between component words were
assigned to columns 7-10, partly on the basis of whether semantic relationships
agreed with those in columns 5 and 6. Columns 7 and 8 contain instances where
the component words were used in the same senses as in columns 5 and 6, but
where the "adjective" component was omitted because an antecedent made
things clear. As an example, the phrase "program revision and checkout of the
new model" could have been written "program revision and checkout of the
new program model", which of course would not have been necessary. Columns
9 and 10 contain cases where different relationships existed between the com-
ponent words, either because one of the words did not pertain to the other
component, on an antecedent basis or otherwise, or because one of the com-
ponents had an entirely different meaning.
If the strongest correlations encountered are those of ad2acent placement of
words in text, some light is shed on the remarks in Section II about coordination
indexing. Adjacent word correlations, we would expect, would be the first ones to
disturb the operation of a coordination system, their effects setting in even in
small collections. Fortunately, these effects are also easiest to handle--by the
simple expedient of forming a term out of the two components.
In the 600-abstract "library" of Table I, even adjacent correlations would
not lead to intersections large enough to constitute a bother. The intersections
of Table I are atypically large in this collection because the component words
occurred in at least 50 abstracts. As practiced information searchers know, these
are the kinds of words to avoid using, if possible; they are "poor precision"
words. In the 600-abstract library, typically consulted two-word intersections
would contain about three abstracts each. However, if a library retains a special-
ized nature, intersections will grow in proportion to library size (probably
somewhat less than direct proportion). At the 5000-document level, formation of
two- and three-word terms would seem to be desirable, leading to greater pre-
cision per search ordinate and less work by the searcher.
What happens if the specialized library grows to 50,000 documents? Are there
residual correlations which cannot be dealt with by term formation? Inspection
of some of the nonadjacent correlations in the 600 abstracts seems to indicate
that there are (see also [9]). The 600-abstract corpus is not large enough to
SEMANTIC ROAD MAPS FOR LITERATURE SEARCHERS 565
Problems(s)
350 9I-
Operat(e,s) " ~
~. ("e) ,
II Oon,s) '
(A,r)O,v,s,on(s)v I I v ~ ~....
Input(s)
V ~~LI- Dtreif(s'ed)(or)(l°n)
250-- Field Function(s) Area
~ l,At (ly)
Flight(s) S,te(s)
Meeti| ng(s)Request(s'ed)V
(rag) "~-~ A 'ier /A'"u'Pe°tA 65-
Evaloahon
Tesl(s,ed)
200-- ~(mg) 52-
Modlf(ymed)
A Ocahon)
A~rcraft
Faclht (y, les) !
Package(s)
Feedbacl
150-- /V • V %~ Lib'rary •. 39-
of above
(Continuations
~ curves)
Conference
n (s
~
~
~'~,~
Generat(e,s) L'
(or)|~
26-
RCAF Tabular(e,s'~'~'"
50~ (atlon) 13-
Addendum
I0- )-
FIG. 3
Solid Curve Index Number of other word types coexisting with the words plotted.
Dashed Curve Index- N u m b e r of abstracts containing the word
NOTE" Points on solid curve occur in order of frequency of a b s t r a c t s c o n t a i m n g t h e word.
These frequencies are shown by the dashed curve for all but the ten m o s t widely distrib-
u t e d words. In general, the lower the point on the sohd curve (with respect to its neigh-
boring points) the b e t t e r it is as a subject indicator. Thus, words such as " a d d e n d u m "
a n d " t a b u l a t i o n " are revealed to be highly particular subject indicators, at least for this
library. (See discussion )
568 LAUREN B. DOYLE
These words, however, should suffice for an initial test of the measure, and we
can count the number of word types for any given word by counting the number
of topical intersections in which the word is involved. This is done in Figure 3
for more than 100 of the most frequent of the 500 selected words. N and P could
not be held constant, but the measure can be obtained by dividing the solid
curve (number of topical intersections) by the value of some function of the
dashed curve (number of abstracts in which a word appears) at the ordinate
which represents a given word. Words are arranged from left to right in order
of frequency. Note that not all of the points on the solid curve are labelled, but
only the extreme values, to give readers an idea of which subject words turned
out to be the "best" (i.e. troughs on the curve) and which "worst" (peaks).
Figure 3 admittedly does not answer the question, "Is this a good test for
subiect words?" It is not clear why one should think of "characteristic" and
"request" as being better subject words than "equipment" and "input". How-
ever, an answer of sorts is possible because six months before it was possible to
count topical intersections, the author subjectively partitioned words into five
categories, in accordance with what their value as subject words seemed to be.
We are therefore able to compare these value judgments with the results of the
topical intersection count:
* The noncorrespondence in over-all form between the two Figure 3 curves indicates
t h a t direct p r o p o r t m n a l i t y does not apply. At the same time, ~t is not known w h a t the
functional relationship should be. Therefore, the author " d i s t o r t e d " the values t a k e n from
the dashed curve, (mostly those from the upper end where a " s a t u r a t i o n effect" appears
to set in) in order to bring its gross form into line with t h a t of the solid curve, and
"aboves" and " b e l o w s " were picked on this basis.
Only one out of eight of the words judged best by the author were above
average in number of topical intersections (and therefore poor). Words judged
worst had more topical intersections than average in two out of three cases.
The author was at first reluctant to include Figure 3 and a discussion thereof
in this article, not only because it did not seem to have a direct bearing on the
trend of the discussion, but also because the experiment was mainly a matter
of personal curiosity satisfaction, and not a carefully planned and reasoned
experiment. However, two realizations led to a change of heart:
a. The hypothesis that the best subject words should be those with the
SEMANTIC ROAD MAPS FOR LITERATURE SEARCHERS 569
strongest proximal correlations is much to the point, because these are the very
words that people will try to use in literature searching, with obvious implica-
tions for retrieval precision. In other words, if the hypothesis is true, we are
assured that the residual correlations left over after elimination of adjacent
correlations (by term consolidation) will operate in the same way as the adjacent
ones did, i.e. the word combinations which correlate most strongly will be the
ones most often used in search request statements.
b. It is possible that the "judgment of the computer" as to what constitutes
a good subject word is actually better than that of a human. The cases where
the results of the "topical intersection test" of Figure 3 differed from the judg-
ment of the author may not have been so much due to the insufficiency of the
test as to the faulty human judgment involved. A word like "meeting", though
capable of being used in many senses, may actually be used predominantly in a
single sense in a given corpus, and may therefore be a useful, accurate ordinate
for retrieval. How could human judgment possibly single out such a case?
Now it is true that "a good subject word" is not merely a selective word but
one which also has meaning to human searchers which is related to what it
selects. Therefore exploiting the "judgment of the computer" depends on our
being able to inform searchers what the predominant sense of the word actually
is. Section V describes one way in which this might be done indirectly.
This section has analyzed some topical intersections and has come up with
the conclusion that there are at least three more or less independent kinds of
redundancy in text: language redundancy, documentation redundancy, and
reality redundancy. We think we know what to do about the first two kinds in
formulating a retrieval system. For language redundancy, we combine the
offending words into terms. For documentation redundancy, we combine the
offending documents into bundles. What can we do with reality redundancy?
This will be the topic of the following section.
relationship between the terms of his request. This is one of the reasons (though
not the major reason) why searchers using coordination systems tend to use
meaningful, familiar combinations.
When the "unexpected" reveals itself in the text of a document, what form
does it take? Does it involve new combinations of words? Usually no. A physical
scientist, for example, who reports a new phenomenon does so in terms of the
usual words that people in his field would use. Another scientist, to inform
himself about this revelation, could dig up this report only by following standard
search paths.
This is why hierarchies have been, and still are, important. They provide
familiar conceptual grooves by which people can steer themselves to areas of
interest. Sometimes, however, the "unexpected" does occur on a search request
level. It is quite true that the recent expansion of knowledge, especially in inter-
disciplinary directions, has given rise to unexpected new pathways of access to
information. Provision for these pathways has proved exceedingly troublesome
for maintainers of clasmfication systems.
Actually, the new pathways are only initially unexpected. If the pathways
persist, they soon will be in the "expected" category in the minds of literature
searchers. Those who have devised methods such as coordination indexing,
where everything is potentially related to everything else, have possibly over-
emphasized the importance of unexpected relationships. Though there have
been and will continue to be a vast number of new and unexpected conjunctions
of ideas, it will nevertheless be true that at any given moment the number of
unexpected relationships will be quite small in comparison to the number of
familiar ones. For a layman conducting a search, the number of unexpected
conjunctions may be appreciable; but not large enough to preclude tipoff signals
to the searcher as a normal part of system operation.
For these reasons, it is probably psychologically sound to give a literature
searcher some kind of a structure, whether hierarchical or otherwise, rather than
to give him an infinitely flexible and therefore totally disorganized array of
terms. Such a structure might have to be revised every so often in order to keep
up with the changing topical spectrum; in the past this has not been feasible,
but we hope computers can change the picture. The hbrarian's need for flexibility
of organization and the searcher's need for semantic guidance are difficult to
reconcile only in comtemporary practice, and not in principle.
We are ready to discuss the setting up of a master framework, a kind of
"semantic road map" that people can use as a guide to the literature. The
framework will presumably be derived partially or wholly by computer analysis
of text on a library-sized scale. We do not know exactly what will be involved in
such use of a computer, but this need not prohibit discussion of basic principles
and above all should not inhibit empirical efforts to find out what the principles
are.
We would like the framework to express whatever we can find out about the "reality
redundancy" in documentation since, as has been pointed out earlier, the very
cognitive processes which cause this redundancy will also enable hterature searchers
to read correct meaning into the redundant combinations.
SEMANTIC ROAD MAPS FOR LITERATURE SEARCHERS 571
In some areas, document word frequencies s will provide a basis for organization;
in other areas meaningful organization will not be possible, and simple serial
searching and recognition by humans of automatically generated proxies for
documents will have to be relied upon. Some arteriole-sized elements will be
"unexpected", but, as reasoned .earlier, they will not be unexpected for long in
the minds of searchers.
The capillaries are the smallest elements, and should have the most interesting
function. Ideally, we hope these linkages will answer the searcher's question:
" W h a t does this document talk about that no other document talks about?"
An unusual link such as fish-transuranium just about answers the question.
Unfortunately, in most cases, individual documents will only reinforce already
well-populated linkages; and in many other cases unique linkages will be formed,
but will be due to unusual choices of words by authors.
Generally speaking, our hope that capillaries will allow us to make clean-cut
discriminations between topically close documents will not work out, and some
work must be done by the searcher. The main task at this level, therefore, is to
give the searcher enough information about each undiscriminated document to
permit him to narrow his focus by recognition.
Before ending this section on associations, it is necessary to tie up an important
loose end. Several times earlier it was stated that in a coordination system one
cannot cope with lack of precision by bringing additional ordinates into play;
bringing in a fourth ordinate, or even a third ordinate, is overwhelmingly likely
to lead to inaccuracy, that is, the rejection of relevant documents.
The cause of such a universal state of affairs resides in the fact that both
librarians and authors generate words and terms from their own "psychological
association maps." If librarians assigning descriptors to documents religiously
stick to words used by an author, then the pattern of descriptor assignment is
determined by the associations of authors. If librarians improvise, adding
descriptors judged relevant, then the associations of librarians determine de-
scriptor patterns. In either case, a decision has to be made about "where to
stop" in assigning labels.
If labels were not associated, such a decision would be easy. But, unfortunately,
labels imply each other, in an associative sense. " E y e " implies "glasses" and
"production" implies "factory". With one search ordinate, the implications (and
hence the empirical associations in text) are not very strong. " E y e " can imply
"hurricane", "needle", "retina", and many other terms which are grossly
unrelated to each other and therefore might give valid discrimination in search-
ing. Suppose, however, one has three search ordinates, such as "camera",
"satellite", and "weather". Now certain other labels become very strongly
implied, such as "Tiros", "stabilization", "telemetering", etc. "Stabilization",
for example, refers to the elementary fact that one needs to point a camera in
order to take a picture. The searcher who discovers that his three search ordinates
8 If one deals with, say, the dozen most frequent content words in a document, an enor-
mous amount of information (i e. in the sense of "number of bits") is at hand, more than
adequate to position any document proxy in the proper place (or places) on a map
574 L A U R E N B. D O Y L E
have yielded a hundred references, and who tries to use "stabilization" (the
aspect which interests him) as a means of narrowing down, assumes that the
motivation (of the librarian) to apply that label must have been proportional
to the extent to which stabilization was discussed.
If every author writing a document could be caused to either fully discuss
something or not discuss it, such an assumption would be justified. In reality,
chances are that nearly half of the 100 authors in this case make some reference to
the stabilization problem, and very few of them have stabilization as a principal
topic. Thus the probability that a librarian will apply the label "stabilization" to
a document is likely to be governed, we hypothesize, more strongly by
the philosophy of label application of that librarian on that particular day than
by its appropriateness (which after all can be judged only in awareness of the
contents of the other 99 documents; such an awareness is difficult enough in a
hierarchical system, but it is doubly difficult in a system where librarians are
out of touch with subcategories).
Thus, a single word is too intrinsically crude as a discriminating device. The
reason it is not crude as a second ordinate and often not crude as a third ordinate
is that it takes two and sometimes three ordinates to get one "in the right ball-
park" on the association map implicit in the pattern of label assignment. Once
the position on this map has been determined, there is more lost than gained by
the blind use of further ordinates.
Now it is time to take a look at "what it might be like" to search the literature
in 1965. The sample system about to be presented is completely arbitrary, except
insofar as it follows the principles outlined herein. The final form and use of
association maps can be determined only by extensive research and experimenta-
tion.
Shown in Figure 4 are a "capillary" region of an association map (upper left)
and some proxies of documents from which the associations in this capillary were
derived (middle and lower right). These maps are true empirical maps, derived
by easily programmable heuristics from the same data (600 abstracts) that went
to make up Table I. In the lower left-hand corner is a "psychological" association
map made in advance of the empirical map for purposes of comparison of form.
The map was made by the author, more or less as a doodle, imagining the associa-
tions that might be found in a corpus of articles which talk about the weather
(e.g. on agriculture, meteorology, navigation, military operations, etc.). Un-
fortunately, the thought did not occur (until too late) to use the same semantic
starting point and reservoir of material as was used in the empirical map.
The empirical map was generated by finding the eight words correlating most
strongly with " o u t p u t " in occurrence in the same abstracts, and then finding in
turn three other words correlating with each of the eight words and also positively
correlating with " o u t p u t " . The rejects among these secondary associations
(listed at the upper right of Figure 4) had either negative or uncertain positive
correlations with " o u t p u t " , even though they had high correlations with those
words directly associated with " o u t p u t " .
In general, the correlations were proximal. Adjacent correlations are indicated
by double-lined arrows pointing toward the second word of the pair. Three-word
terms such as "adjacent manual facilities" can be sometimes picked out. A line
crossing the base of the arrow at right angles indicates that the word at the base
is ahvays the beginning of a term; in other words, there cannot be the term "ad-
jacent manual inputs", for example. The numbers on the paths connecting with
" o u t p u t " show the number of abstracts containing both words of the connected
pair.
The five proxies are made up mainly of words from the empirical map shown.
The notation is similar, with the added feature that sometimes it is possible to
introduce syntactic connectives (again by means of easily programmable heuris-
tics) ;9 in such a case, single-lined arrows are used to show which is the word fol-
lowing the connective. The proxies were derived by applying simple word selec-
tion priority rules to the corresponding abstracts; in a real system the proxies
would probably be formed out of the most frequent topical words in the cor-
responding documents.
Assume that a literature searcher is seated at a TV-like display console, which
9 An example of such a heuristic is a routine which computes the number of letters sepa-
rating two words in text; if the number of letters is not greater than, say, 15, and if there
are no intervening punctuation marks, the corresponding words can be extracted as "con-
nectives." The heuristics actually employed would probably have to be more complicated
than this.
576 LAUREN B. D O Y L E
he can operate by means of buttons and typewriter keys. Assume, also, that
somehow (we are not equipped to discuss how) he is steered through the
"arteries" and "arterioles" of a enormous empirical map and that it becomes
evident to him that "output" is a word which will lead to documents of interest.
(note: the word "output", being common, would have to exist at many points on
EMPIRICALASSOCIATIONMAP
2ndorderassoclahon
(Arr)Defense REJECPS
/ Req....... ts(s) ~ Fac,l,t,es
COMPILATION I Sector(s) (Negabveoruncertain
pOSlhvecorrelation
I N ADJACENT ~" J¢ withOUTPUT)
I
Ground-- DATA.~ ,~ ~
\o RI ~ / Inputs--lntelllgPnce
~//MANUAL
'° Inputs--Room
S,mulatmn
1%..%1
- RADAR ;
T P~mat(s)
OUTPU~ 27 /
Function--Supervisor
Functlon--Pos(tlon
AN/GPST2/ 7/ / \~ ~ / Capabdtty--Frame
Capablhty--Mlsode
Evaluahon--Cnterlon
/ / CAPABILITY FUNCTION(S)~Pe...... ' Evaluatmn--KCADS
E clse / I Ad~p,at.on/ I XCnmb.t
,or / E,u,pm.nt Evaluation--Boston
Prelect Modihcabon Plow SSTP Center Evaluahon--Room
Evaluatlen--tmcoln(Lab)
Data--lechmcal
FIVEDOCUMENTPROXIES
E E
MANUAL l [~] NewYork,~.-Sector
OUTPUT
\ ~ INPUTS~,-iR;;m']
~ L.__..a '1"
e . . . . . . | andds Req.......
i to
/
Air Defense Air ifense COMPleTION
Personnel t ~ I MANUAL
FUNCTIONS--,,,-~y;~s] Sector COMPILATION\ | / D. . .A. . .,_T_ / r
.System| ~ [ /MANUAL
, XcCenter
SSTP "--:',
ombat.,~ [SA?'i:
.... ,.m
xc;L2X!,/\
A A--......_OUTPUTS__ iNPUTS
....... \ I /
OBTPUTS I
from ~ t~,.1. INPUTS
RADAR/
r. J
PSYCHOLOGICALASSOCIATIONMAP
Requirements
Seed
A~r Tropical I
/ \ / Relatwe /Fuel COMP,LAT,ON ADJACENT
Phnt(,ng)/TEMP~ATURE\ / J
\
,~,..FreezOng)
HUMIDITY /ICE - - Equipment
DtTA\r;c-i / I Eacd.,.es
I ~ \ ' 1 •;d'ts/MANUALf
\ Seas°hiSn~w X / / /Control Sector ~\!l' / /
\ "CqLD~','WEAIHER--VISIBILIIY- - Ceiling ..|_~ OUTPUTS--
\ ISAOEI
l.°__J
FROST
/\ ~ "WIND~ " Velmty
r-----~
Y,e,d<s)l /\ --so,,
. .Program
[~]
. . . . . . . . . "~
Lu-~Y-
!
-d
MANUAL
Forecast FRONT STORM(Y,S)
/\/1\
Baromet(er) (Low) Flight Radar
Oelvelopmenti OU,TPUT~
I = ! \of~NPUT
(nc) Pressure Group I I '~
a t M |Ii "JEquipment
~sem.,er
I F'*'71
,PGDA,
~-~-~
L ........ •
FXG 4
SEMANTIC ROAD MAPS FOR LITERATURE SEARCHERS 577
the map; but the searcher has been led to his particular occurrence of " o u t p u t "
by a process superficially resembling coordination.)
The controls of his console enable him to bring the Figure 4 map onto the
screen. He is now able to see primary and secondary associations to the word
" o u t p u t " , and the chances are that some of these will "ring a bell." Note that,
instead of depending on his imagination to think up a search request, he is de-
pending on his recognzlion of semantic relationships. We have put him in close
touch with the contents of the library, quahtatively and quantitatively.
When he singles out a particular intersection for further investigation, he is
informed how many documents are in the intersection; in other words, he knows
whether he has arrived at a capillary or is still in an arteriole, by the size of this
number, and he then knows whether to try for a map with smaller intersections
or to start requesting individual document proxies. In this case, assume he singles
out the intersection "output-manual", hoping that he will get some information
on the output of manual air defense sectors.
Using typewriter keys, he indicates his interest in this intersection by typing
"manual". The computer program can, at this point, generate the proxies and
put them in storage slots ready to be called onto the screen, either collectively or
individually, by the searcher. Of the five proxies shown in Figure 4, the top three
refer to "manual inputs" rather than "manual sectors", and the word " o u t p u t s "
refers to still other things. The proxy at the lower right uses "manual" in a com-
pletely irrelevant sense, and also contains several words not associated with the
" o u t p u t " capillary in any way. The proxy labelled "304" is the only one that has
a chance of being relevant. It does not say "manual sectors", but it does say
"manual facilities", a more general t e r m - - t h i s is an example of a document which
might be passed over in a coordination search involving use of the ordinate
"manual sector". Here is a case illustrating the advantage of recognition over
search request formulation.
As it turns out, document 304 appears to talk about outputs of SAGE DC's
(Direction Centers) and not that of manual facilities. However, by the very
mention of "manual facilities" the document has a chance of being relevant, and
the searcher has the option of calling in more information about the document.
At this point, it is quite appropriate to make use of abstracts (auto or otherwise).
In a literature search like this, the author m a y individually inspect 100 or more
proxies, in various topical intersections of more or less equal promise, and m a y
pick out a dozen or so proxies of potential interest, reading the corresponding 12
abstracts.
One might think, " B u t surely it's easier to read 88 abstracts than to look at 88
of those things in Figure 4." Such a contention would ignore several factors:
(1) The proxies can be evaluated at a glance when one gets used to the nota-
tion and knows what to look for; the eye lights immediately on the words "man-
ual" and " o u t p u t " , which are always in the same spatial location with respect to
each other. What we are after at this stage of the retrieval process is speedy and
confident rejection of irrelevant documents. Rejections, in most cases, should
578 LAUREN B. DOYLE
take less time than reading the corresponding abstracts (which, remember, all
would contain the words "manual" and "output".
(2) P r o x i e s can be designed to c o n s p i c u o u s l y i n d i c a t e t h e t h i n g s a b o u t a d o c u -
m e n t w h i c h m a k e it different f r o m its closest t o p i c a l r e l a t i v e s ; n o t e t h a t for each
p r o x y in F i g u r e 4, a d o t t e d line is d r a w n a r o u n d all w o r d s n o t a p p e a r i n g in t h e
c a p i l l a r y d i s p l a y . I n s y s t e m s where m o r e i n f o r m a t i o n a b o u t a d o c u m e n t is a v a i l -
a b l e in s t o r a g e t h a n is shown in t h e p r o x y , s e a r c h e r s can r e q u e s t as m u c h a d d i -
t i o n a l i n f o r m a t i o n as is r e q u i r e d to m a k e d i s t i n c t i o n s b e t w e e n r e l a t e d d o c u m e n t s .
(3) T h e proxies are s u b j e c t to flexible c o n t r o l in b o t h g e n e r a t i o n a n d g r o u p i n g .
S u c h flexibility lends itself to e v o l u t i o n a r y processes.
M e t h o d s of this k i n d n o t o n l y h e l p c l a r i f y t h e p i c t u r e in a specific search, b u t
c a n serve as t h e basis of h i g h l y s y s t e m a t i c a n d r a p i d browsing. I n such a w a y we,
indeed, fulfill t h e goal d e s c r i b e d in Section I, of " i n c r e a s i n g t h e m e n t a l c o n t a c t
b e t w e e n t h e r e a d e r a n d t h e i n f o r m a t i o n s t o r e . " W e give t h e s e a r c h e r t h e a d -
v a n t a g e o u s illusion of p r o c e e d i n g a l o n g o r d i n a t e s of m e a n i n g in his b r o w s i n g ,
r a t h e r t h a n along t h e u s u a l l y i r r e l e v a n t o r d i n a t e s of p a g e l e n g t h a n d shelf loca-
tion.
REFERENCES
1. MOOERS, C . N . The next twenty years in information retrmval Amer. Documentatwn
11 (1960), 229-236.
2. LvRN, H. P. Auto-encoding of documents for information retrieval systems. In
Modern Trends ~n Documentation, Proceedings of a symposium held at the University
of Southern California, April, 1958, Pergamon Press Ltd., New York, pp. 45-58.
3a. EDMUNDSON,H. P., OSWALD,V. A., JR., AND WYLLYS, R . E . Automatic indexing and
abstracting of the contents of documents. Planning Research Corp. Document PRC
R-126, ASTIA AD No. 231606, Los Angeles, 1959.
3b. EDMVNDSON, H. P. AND W~LLYS, R . E . Automatic abstracting and indexing--survey
and recommendations. Comm. A C M 4 (May, 1961).
4. TA~BE, MO~T~MEm Machines and classification in the organization of information.
Documentation, Inc., Technical Report No. 2, ASTIA AD No. 22426, Washington,
D. C., 1953.
5. FOSKET$, D . J . The construction of a faceted classification for a special subject.
Preprints of papers for the International Conference on Scientific Information,
Area 5, Washington, D. C., Nov., 1958, 53-74.
6. VICKERY, B . C . Developments in subject indexing, J Documentatwn 2 (1955), 1-12
7. BARTON, A. R., SCHA~Z, V. L , AND CAPLAN, L N. Information retrieval on a high-
speed computer. Proc. Western Joint Comp Conf., San Francisco, Mar., 1959, 77-80.
8. JONKER, FREDERICK. The descriptive continuum: a generalized theory of indexing.
Preprints of papers for the International Conference on Scmntific Information,
Area 6, Washington, D. C., pp 21--41.
9. DOYLE, L. B. Programmed interpretation of text as a basis for information re-
trieval systems. Proc. Western Joint Comp. Conf , San Francisco, Mar., 1959, 60-63.
10. RICHMOND, PHYLLIS A. Hierarchical definition. Amer. Documentatwn 11 (Apr. 1960),
91-96.
11. HERNER, SAVL. Retrieval questions most susceptible to mechanization. Third Insti-
tute on Information Storage and Retrieval, Washington, D. C., February, 1961 (in
press).