
Thesaurus Construction

1
Thesaurus defined
• A thesaurus is a reference work that lists words grouped
together according to similarity of meaning (containing
synonyms and sometimes antonyms), in contrast to a
dictionary, which provides definitions for words and
generally lists them in alphabetical order.
• A compilation of terms showing synonymous,
hierarchical, and other relationships and dependencies, the
function of which is to provide a standardized, controlled
vocabulary for information storage and retrieval.

2
Purpose of a thesaurus
• To provide a map of a given field of knowledge, indicating how
concepts or ideas about concepts are related to one another, which
helps an indexer or a searcher to understand the structure of the field.
• To provide a standard vocabulary for a given subject field which will
ensure that indexers are consistent when they are making index entries to
an information storage and retrieval system.
• To provide a system of references between terms which will ensure
that only one term from a set of synonyms is used for indexing one
concept.

3
• To provide a guide for users of the system
so that they choose the correct term for a
subject search.
• A further desirable purpose is to provide a means
by which the use of terms in a given subject
field may be standardized.

4
Features of thesauri
1) Coordination level
2) Term relationships
3) Number of entries for each term
4) Specificity of vocabulary
5) Control on term frequency of class members
6) Normalization of vocabulary

5
Coordination level
1) The construction of phrases from individual terms.
2) Two coordination options: pre-coordination and post-coordination.
3) A precoordinated thesaurus can contain phrases. The
advantage is that the vocabulary is very precise.
4) The disadvantage is that the searcher has to be aware of
the phrase construction rules employed.
5) Precoordination is more common in manually constructed
thesauri.

6
7) A postcoordinated thesaurus does not allow phrases.
Instead, phrases are constructed while searching.
8) The advantage is that the user need not worry about the
exact ordering of the words in the phrase.
9) The disadvantage is that search precision may fall.
10) Automatic thesaurus construction usually implies
postcoordination.
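
A minimal sketch of the trade-off between the two options, assuming a toy collection. The documents, phrases, and function names below are illustrative assumptions, not part of the original slides.

```python
# Toy index: document id -> set of single-word index terms (illustrative data).
DOCS = {
    1: {"information", "retrieval", "system"},
    2: {"retrieval", "practice", "information", "theory"},
    3: {"storage", "system"},
}

# Pre-coordinated entries: phrases built by the indexer in advance (illustrative).
PHRASE_INDEX = {
    "information retrieval": {1},
    "information theory": {2},
}


def postcoordinated_search(*terms):
    """Combine single terms at search time with a Boolean AND."""
    return {doc_id for doc_id, words in DOCS.items()
            if all(term in words for term in terms)}


def precoordinated_search(phrase):
    """Look up an exact phrase; the searcher must know how it was constructed."""
    return PHRASE_INDEX.get(phrase, set())


# Word order does not matter for the post-coordinated search.
post_hits = postcoordinated_search("retrieval", "information")
pre_hits = precoordinated_search("information retrieval")
```

In this toy data the post-coordinated search also returns document 2, which contains both words but not as the phrase "information retrieval", while the pre-coordinated lookup returns only document 1. This illustrates the loss of precision noted above.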

7
Term Relationships

Three categories of term relationships:


(a) Equivalence relationships
(b) Hierarchical relationships
(c) Nonhierarchical relationships

8
Equivalence relationships
Equivalence relations include both synonymy and
quasi-synonymy.
For example: genetics and heredity (overlapping meanings); harshness and
tenderness (opposite ends of the same property, treated as equivalent for retrieval)

9
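• storage batteries UF secondary batteries
• secondary batteries USE storage batteries
• stability UF instability
• instability USE stability
• UF (Used For): listed under the preferred term, naming the non-preferred terms it replaces
• USE: listed under a non-preferred term, pointing to the preferred term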
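
A minimal sketch of how these references might be stored, using the two examples from the slide. The dictionary layout and the function name `preferred_term` are assumptions made for illustration.

```python
from collections import defaultdict

# USE references: non-preferred term -> preferred term (examples from the slide).
USE = {
    "secondary batteries": "storage batteries",
    "instability": "stability",
}

# UF (Used For) is the inverse of USE, so it can be derived rather than stored.
UF = defaultdict(list)
for non_preferred, preferred in USE.items():
    UF[preferred].append(non_preferred)


def preferred_term(term):
    """Return the one term that should be used for indexing or searching."""
    return USE.get(term, term)
```

Routing both indexers and searchers through `preferred_term` is one way to ensure that only one term from a set of synonyms is used for a concept.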

10
Hierarchical relationships
A typical example of a hierarchical relation is
genus-species, such as “dog” and “German
shepherd.”
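
A minimal sketch of how a genus-species hierarchy could be used to expand a query with more specific terms. Only the "dog" and "German shepherd" pair comes from the slide; the other entries and the function names are illustrative assumptions.

```python
# Broader term -> narrower terms. Only "dog" / "German shepherd" is from the
# slide; the remaining entries are illustrative assumptions.
NARROWER = {
    "mammal": ["dog", "cat"],
    "dog": ["German shepherd", "poodle"],
}


def narrower_closure(term):
    """Return the term followed by every term below it in the hierarchy."""
    result = [term]
    for child in NARROWER.get(term, []):
        result.extend(narrower_closure(child))
    return result


def broader_term(term):
    """Return the immediate genus of a term, or None for a top-level term."""
    for parent, children in NARROWER.items():
        if term in children:
            return parent
    return None
```

A search for "dog" expanded with `narrower_closure("dog")` would also match documents indexed only under "German shepherd".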

11
Nonhierarchical relationships

Nonhierarchical relationships also identify
conceptually related terms. There are many
examples, including thing-part, such as “bus” and
“seat”, and thing-attribute, such as “rose” and
“fragrance”.

12
• windows RT houses
• skates RT skating
• railway construction RT railway
• seawater RT corrosion

13
Wang, Vandendorpe, and Evens (1985) provide an
alternative classification of term relationships
consisting of:
(1) parts-wholes
(2) collocation relations
(3) paradigmatic relations
(4) taxonomy and synonymy
(5) antonymy relations

14
(1) Parts-wholes

Parts and wholes include examples such as set-element
and count-mass.

15
(2) Collocation relations

Collocation relates words that frequently co-occur
in the same phrase or sentence.
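
A minimal sketch of how collocation candidates could be found by counting word pairs that share a sentence. The sentences, the whitespace tokenization, and the function name are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Illustrative sentences; any tokenized text collection could be used instead.
SENTENCES = [
    "strong tea is served hot",
    "she drinks strong tea",
    "a powerful computer runs fast",
]


def cooccurrence_counts(sentences):
    """Count how many sentences contain each unordered pair of words."""
    counts = Counter()
    for sentence in sentences:
        # A set counts each pair once per sentence; sorting fixes the pair order.
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts


counts = cooccurrence_counts(SENTENCES)
```

Pairs with high counts, such as ("strong", "tea") in this toy data, are candidates for a collocation relation. A real system would also need to filter out very frequent function words.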

16
(3) Paradigmatic relations
• Paradigmatic relations relate words that
have the same semantic core, like “moon”
and “lunar”, and are somewhat similar to
Aitchison and Gilchrist’s quasi-synonymy
relationship.

17
(4) Taxonomy and synonymy
• Taxonomy and synonymy are self-explanatory
and refer to the classical
relations between terms.

18
(5) Antonymy relations

19
Number of entries for each term
1. It is generally preferable to have a single entry
for each thesaurus term. However, this is seldom
achieved because of homographs: words with the same
spelling but different meanings.
2. In a manually constructed thesaurus such as
INSPEC, this problem is resolved by the use of
parenthetical qualifiers, as in the pair of
homographs bonds (chemical) and bonds
(adhesive).
3. However, this is hard to achieve automatically.

20
(homographs)
• Mercury (metal), Mercury (planet)
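
A minimal sketch of how entries with parenthetical qualifiers could be parsed so that homographs are grouped under one spelling. The regular expression and function names are assumptions for illustration; the example entries are taken from the slides.

```python
import re
from collections import defaultdict

ENTRIES = [
    "bonds (chemical)",
    "bonds (adhesive)",
    "Mercury (metal)",
    "Mercury (planet)",
    "genetics",
]

# A term followed by one qualifier in parentheses at the end of the entry.
QUALIFIED = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualifier(entry):
    """Split 'term (qualifier)' into its parts; qualifier is None if absent."""
    match = QUALIFIED.match(entry)
    if match:
        return match.group("term"), match.group("qualifier")
    return entry, None


def homographs(entries):
    """Map each spelling that has more than one qualified sense to its senses."""
    senses = defaultdict(list)
    for entry in entries:
        term, qualifier = split_qualifier(entry)
        if qualifier is not None:
            senses[term].append(qualifier)
    return {term: quals for term, quals in senses.items() if len(quals) > 1}
```

This only reads qualifiers that a human has already assigned. Deciding automatically which sense a word has in a document is the hard part mentioned above and is not attempted here.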

21
Specificity of vocabulary
1. Specificity is a function of the precision associated with the component
terms.
2. A highly specific vocabulary is able to express the subject
in great depth and detail. This promotes precision in
retrieval.
3. The disadvantage is that the size of the vocabulary grows.
Also, specific terms tend to change more rapidly than
general terms.
4. Therefore, such vocabularies tend to require more regular
maintenance.
5. High specificity implies a high coordination level, and the user
has to be more concerned with the rules for phrase
construction.

22
Control on term frequency of class members
1. Salton and McGill have stated that in order to maintain a good
match between documents and queries, it is necessary to ensure
that terms included in the same thesaurus class have roughly
equal frequencies.
2. The total frequency in each class should also be roughly similar.
3. These constraints are imposed to ensure that the probability of a
match between a query and a document is the same across
classes.
4. Terms within the same class should be equally specific, and the
specificity across classes should also be the same.
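
A minimal sketch of how the two frequency constraints could be checked. The slides say only "roughly equal", so the factor-of-two threshold is an assumption, as are the class names, frequencies, and function name.

```python
# Collection frequencies per term, grouped by thesaurus class (illustrative data).
CLASSES = {
    "vehicles": {"car": 40, "automobile": 35},
    "boats": {"boat": 30, "dinghy": 2},
}


def roughly_equal(values, ratio=2.0):
    """True if the largest value is at most `ratio` times the smallest.

    The ratio of 2.0 is an arbitrary assumption, not a published threshold.
    """
    values = list(values)
    return min(values) > 0 and max(values) <= ratio * min(values)


# Constraint 1: terms inside one class should have similar frequencies.
within_class = {name: roughly_equal(freqs.values())
                for name, freqs in CLASSES.items()}

# Constraint 2: the total frequency of each class should also be similar.
class_totals = {name: sum(freqs.values()) for name, freqs in CLASSES.items()}
across_classes = roughly_equal(class_totals.values())
```

In this toy data the "boats" class fails the first check because "dinghy" is far rarer than "boat", and the class totals (75 and 32) fail the second.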

23
Normalization of vocabulary
1. Normalization rules govern issues such as the singular or plural form of
terms, the ordering of terms within phrases, spelling, capitalization,
transliteration, abbreviations, initials, acronyms, and punctuation.
2. The advantage is that variant forms are mapped into base
expressions, thereby bringing consistency to the vocabulary.
3. The disadvantage is that, in order to be used effectively, the user has
to be well aware of the normalization rules used.
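
A minimal sketch of mapping variant forms to a base expression. The specific rules (lowercasing, dropping punctuation, treating hyphens as spaces) and the small variant table are assumptions chosen for illustration; a real thesaurus would define its own rules.

```python
import re

# Spelling variants and abbreviations mapped to a base form (illustrative).
VARIANTS = {
    "colour": "color",
    "ir": "information retrieval",
}


def normalize(term):
    """Map a variant form of a term to a base expression."""
    term = term.lower().strip()
    term = re.sub(r"[^\w\s-]", "", term)  # drop punctuation except hyphens
    term = re.sub(r"[\s-]+", " ", term)   # hyphens and runs of spaces -> one space
    words = [VARIANTS.get(word, word) for word in term.split()]
    return " ".join(words)
```

For example, `normalize("  Colour   Printers ")` yields `"color printers"`. A searcher who does not know that hyphens are treated as spaces, or that "colour" is mapped to "color", may not predict the base form, which is the disadvantage noted above.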

24
Manual Thesaurus Construction
• Define the boundaries of the subject area
– Identify central subject areas and peripheral ones
– Partition the domain into divisions or subareas
• Identify desired characteristics
• Collect terms for each subarea
– Sources include indexes, encyclopedias, handbooks, textbooks,
journals, abstracts, catalogs, and existing thesauri or
vocabulary systems
– Also consult subject experts and potential users

25
Manual Thesaurus Construction
(continued)
• Analyze each term for its related vocabulary
– Including synonyms, broader and narrower terms,
definitions, and scope notes
• Organize terms and relationships into a hierarchical
structure
• Review and refine for consistency
• Invert the structured thesaurus to produce an
alphabetical arrangement of entries
• Test the thesaurus
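
A minimal sketch of the inversion step, assuming the structured thesaurus is held as a mapping from each broader term to its narrower terms. The labels BT (broader term) and NT (narrower term) are the conventional abbreviations; the data and function name are illustrative assumptions.

```python
from collections import defaultdict

# Structured (hierarchical) form: broader term -> narrower terms (illustrative).
HIERARCHY = {
    "mammal": ["dog", "cat"],
    "dog": ["German shepherd"],
}


def invert(hierarchy):
    """Produce an alphabetical list of entries, each with its BT and NT terms."""
    entries = defaultdict(lambda: {"BT": [], "NT": []})
    for broader, narrower_terms in hierarchy.items():
        for narrower in narrower_terms:
            entries[broader]["NT"].append(narrower)
            entries[narrower]["BT"].append(broader)
    # Sort case-insensitively so capitalized terms are not listed first.
    return dict(sorted(entries.items(), key=lambda item: item[0].lower()))


alphabetical = invert(HIERARCHY)
```

Each term now has its own entry, so a user can look up "dog" directly and see both its broader term and its narrower terms.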

26
Manual Thesaurus Construction
(continued)
• Conclusion:
– Involves a group of individuals and a variety of
resources
– Needs to be maintained to ensure viability and
effectiveness
– Must reflect any changes in the terminology of the
area
=> An art and a science

27
Automatic Thesaurus Construction
• From document collections
– Use a collection of documents as the source for
thesaurus construction
– Apply statistical procedures to identify
important terms as well as relationships
– Use computationally simpler methods to
identify the more important semantic
knowledge
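
A minimal sketch of one statistical procedure of this kind: terms that occur in the same documents are treated as related, measured by the cosine similarity of their document-occurrence vectors. The choice of cosine similarity, the toy documents, and the function names are assumptions; the slides do not name a specific method.

```python
import math

# Illustrative document collection.
DOCS = {
    "d1": "thesaurus construction uses term statistics",
    "d2": "automatic thesaurus construction from documents",
    "d3": "boolean retrieval of documents",
}


def term_vectors(docs):
    """Map each term to a vector of its occurrence counts per document."""
    doc_ids = sorted(docs)
    vectors = {}
    for i, doc_id in enumerate(doc_ids):
        for word in docs[doc_id].split():
            vectors.setdefault(word, [0] * len(doc_ids))[i] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = term_vectors(DOCS)
similarity = cosine(vectors["thesaurus"], vectors["construction"])
```

Here "thesaurus" and "construction" occur in exactly the same documents, so their similarity is close to 1, while "thesaurus" and "boolean" never co-occur and score 0. Term pairs above some chosen threshold could then be grouped into a thesaurus class.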

28
Automatic Thesaurus Construction
(continued)
• Merge existing thesauri
– Merge two or more thesauri into a single unit
– The merger should not violate the integrity of any
component thesaurus
– e.g., augment MeSH from SNOMED
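
A minimal sketch of a merge that preserves every relation of each component, assuming each thesaurus is a simple mapping from a term to its related terms. The data is invented for illustration and does not reflect the actual contents of MeSH or SNOMED; the function names are also assumptions.

```python
def merge(primary, secondary):
    """Union two thesauri without removing any relation from either one."""
    merged = {term: set(related) for term, related in primary.items()}
    for term, related in secondary.items():
        merged.setdefault(term, set()).update(related)
    return merged


def preserves(component, merged):
    """Integrity check: every relation of the component survives the merge."""
    return all(set(related) <= merged.get(term, set())
               for term, related in component.items())


# Hypothetical component thesauri.
thesaurus_a = {"heart": {"cardiac muscle"}, "lung": {"bronchus"}}
thesaurus_b = {"heart": {"aorta"}, "kidney": {"nephron"}}

merged = merge(thesaurus_a, thesaurus_b)
```

A plain union like this cannot detect conflicts, such as the same term being preferred in one thesaurus and non-preferred in the other. Real merging would need to resolve those cases explicitly.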

29
Automatic Thesaurus Construction
(continued)
• User-generated thesaurus
– Uses the term relationships that users apply in search strategies
– Captures knowledge from users’ searches
– e.g., TEGEN (Thesaurus Generating system)
• The types of Boolean operators between terms
• The type of query modification
• User feedback included

30
