
This article was downloaded by: [Stony Brook University]

On: 19 October 2014, At: 16:35


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,
37-41 Mortimer Street, London W1T 3JH, UK

Cataloging & Classification Quarterly


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/wccq20

The Issue of Word Division in Cataloging Chinese Language Titles

Jie Huang MLIS (a) & Kathleen J. M. Haynes PhD (b)
(a) Catalog Department, University of Oklahoma Libraries, University of Oklahoma, Norman, OK, 73019 E-mail:
(b) School of Library and Information Studies, University of Oklahoma, Norman, OK, 73019
Published online: 03 Feb 2009.

To cite this article: Jie Huang MLIS & Kathleen J. M. Haynes PhD (2004) The Issue of Word Division in Cataloging Chinese
Language Titles, Cataloging & Classification Quarterly, 38:1, 27-42, DOI: 10.1300/J104v38n01_04

To link to this article: http://dx.doi.org/10.1300/J104v38n01_04


The Issue of Word Division
in Cataloging Chinese Language Titles
Jie Huang
Kathleen J. M. Haynes

ABSTRACT. This study addresses how syllable or word division in
bibliographic records of Chinese materials affects title keyword
searches. Title keyword searches with both syllable division and word
division are conducted in OCLC, RLIN, and Peking University Library
(PKUL), and results are compared in terms of recall and precision. It is
found that with both OCLC and RLIN, the recall and precision percentages
vary greatly depending on whether the syllables of a search keyword are aggregated.
In contrast, for PKUL, the recall and precision percentages remain high and
are identical either way. The findings suggest that PKUL has two advantages
over OCLC and RLIN that would reduce human errors in word division
in cataloging and searching. [Article copies available for a fee from
The Haworth Document Delivery Service: 1-800-HAWORTH. E-mail address:
<docdelivery@haworthpress.com> Website: <http://www.HaworthPress.com>
© 2004 by The Haworth Press, Inc. All rights reserved.]

KEYWORDS. Syllable division, word division, Chinese romanization,
title keyword search, recall, precision

Jie Huang, MLIS, is Cataloger and Assistant Professor, Catalog Department, Uni-
versity of Oklahoma Libraries, University of Oklahoma, Norman, OK 73019 (E-mail:
lilyh@ou.edu). Kathleen J. M. Haynes, PhD, is Professor Emerita, School of Library
and Information Studies, University of Oklahoma, Norman, OK 73019.
Cataloging & Classification Quarterly, Vol. 38(1) 2004
http://www.haworthpress.com/web/CCQ
© 2004 by The Haworth Press, Inc. All rights reserved.
Digital Object Identifier: 10.1300/J104v38n01_04

After over two decades of discussion and comparison, the Library of Con-
gress (LC) arrived at a historic decision, namely to adopt Pinyin as its standard
system for Chinese romanization and convert all the Chinese documents from
Wade-Giles (WG) to Pinyin. While this move has been warmly welcomed
in the library community and beyond, a potential problem
has been noted with the adopted approach to conversion in terms of word divi-
sion. For decades, LC had been hesitant to convert Chinese records from
Wade-Giles to Pinyin because of lack of internationally accepted standardiza-
tion in word division. The concern is that inconsistencies in title word division
in cataloging (and indexing thereafter) will cause problems in retrieving. This
study addresses the issue of word division of Chinese language materials,
which has aroused much discussion (e.g., Arsenault 2000a, 2000b, 2002;
Groom 1997; Huang 2002; Mair 2000, 2001a, 2001b; Melzer 1996a, 1996b,
1997; Studwell et al. 1993). The common purpose of the discussion is to find
the most effective and efficient approach to serving end users of Chinese lan-
guage materials. It is hoped that this study will eventually contribute to the im-
provement of effectiveness and efficiency of access to these materials.

LITERATURE REVIEW

The bulk of literature related to Chinese romanization concentrates on issues
in three areas: personal names, place names, and titles (e.g., Anderson
1980; Lin 1988; Tao and Cole 1991; Lau and Wang 1991, 1993; Harrison
1992; Hu 1994; Hiatt 1998; Teng 1998; Arsenault 1998, 2000a, 2000b, 2002).
Of these three areas, the third one, the romanization of titles, poses the most
difficulty. Specifically, the difficulty exists in the issue of word division,
which was a major obstacle to conversion from Wade-Giles to Pinyin.
At face value, “word division” simply means “to divide words from each
other.” In a written English sentence, every word is divided by a space from
the word preceding or following it. In a Chinese written sentence, however,
characters are rendered next to each other, with no visual cues as to where a
word starts or ends. From the linguistic point of view, a Chinese character can
be a Chinese word, but it can also be a morpheme within a Chinese word. To
put it differently, a Chinese word can consist of one character and it is then a
monosyllabic word; it can also consist of more than one character and it is then
a disyllabic or polysyllabic word. Very often, a disyllabic or polysyllabic word
is a compound word, with each character as a morpheme of the compound. A
Chinese compound can consist of more than two morphemes, but most Chinese
compounds consist of only two morphemes, represented by two characters
in the written language, and they are disyllabic in the spoken language.

A compound in Chinese, no matter how many syllables or characters it has,
is still a single word defined by its “syntactic and semantic independence and
integrity” (Li and Thompson 1981, 13). To illustrate the point the examples
given in (1), cited from Xiandai Hanyu Cidian (Modern Chinese dictionary)
and Han-Ying Cidian (A Chinese-English dictionary), should suffice:

(1) a. 东 dōng ‘east’
b. 西 xī ‘west’
c. 东西 dōngxi ‘thing; (referring to a person or animal) creature’

In (1a) and (1b), dōng and xī are both monosyllabic one-character words,
respectively meaning “east” and “west.” In (1c), dongxi, which means “east
and west” character by character, is a disyllabic two-character compound
word meaning “thing.”
In Pinyin, where the Chinese language is romanized with alphabetic letters,
the normal practice is to separate words, including compound words, from
each other with a space. In the case of compounds, there is no space or hyphen
between the syllables. The problem with this practice is that occasionally peo-
ple are not so sure whether a certain disyllabic or polysyllabic unit is actually a
compound word or a phrase, so there arises the issue of word division. But in
general, a good Chinese dictionary published in China should provide neces-
sary information. One may ask the question why Pinyin should practice word
division in the first place, instead of separating all the individual syllables just
like Wade-Giles. The answer to this question is that word division reduces the
chance of ambiguity caused by homophony. In Chinese it is very common that
different characters have the same sound. For example, given in (2) and (3) be-
low are some characters cited from Xiandai Hanyu Cidian (Modern Chi-
nese dictionary) and Han-Ying Cidian (A Chinese-English dictionary),
which are also monosyllabic words, that happen to share the same pronunciation
with 东 dōng ‘east’ and 西 xī ‘west,’ with the same or different tones.

(2) a. ‘east’
b. ‘winter’
c. ‘radon’
d. ‘direct; superintend; director; trustee; a surname’
e. ‘understand; know’
f. ‘move; stir’
g. ‘freeze; jelly; be frostbitten’
h. ‘hole; cave; cavity’
i. ‘fear’

(3) a. ‘west’
b. ‘inhale; breathe in; draw; suck (liquids); absorb; attract’
c. ‘practice; exercise; review; be used to; habit; a surname’
d. ‘seat; place; feast; banquet; dinner; a surname’
e. ‘wash; bathe; baptize; redress; kill and loot; develop (photos)’
f. ‘happy; delighted; pleased; a happy event; pregnancy; like’
g. ‘drama; play; show; make fun of; joke’
h. ‘system; series; department (in a college); relate to; tie; fasten’
i. ‘thin; slender; fine; exquisite; delicate; careful; detailed; minute’

Note that the above lists are for illustrative purposes only and are far
from exhaustive. A count of the entries in the popular dictionary,
Xiandai Hanyu Cidian (Modern Chinese dictionary), which does not include
many rare characters and words, suggests that twenty-two characters, regard-
less of their tone, share the sound of dong. The number of characters that share
the sound of xi hits one hundred and twenty-five. Therefore, when Chinese
characters are romanized into Pinyin, the probability of ambiguity increases
greatly, especially when diacritic tone marks are left out, which is the normal
practice in cataloging Chinese materials in the library community. Aggregat-
ing syllables together into disyllabic or polysyllabic words, i.e., practicing
word division, can decrease homophonous ambiguity greatly. It actually
solves the problem of homophonous ambiguity 95% of the time (Arsenault
2000, 38). For instance, the syllables dong and xi (regardless of the tone) are
each shared pronunciation of many characters. So when they are separated,
they each can represent many characters (or one-character words) with various
meanings. Thus, there is a high probability of ambiguity. However, once they
are combined into a single unit dongxi, this disyllabic sound can represent
only two compound words: dōngxi 东西 ‘thing’ and dòngxī 洞悉 ‘know
clearly; understand thoroughly.’ As a result, there is a very low probability of
ambiguity. That is why aggregating individual syllables into disyllabic or
polysyllabic compounds is a more desirable practice than leaving them sepa-
rate in cataloging Chinese titles. It should greatly improve readability of the
romanized titles.
In his recommendation regarding the issue of word division, the chair of the
LC Pinyin Task Group, Melzer (1996b), suggests that, in the absence of an in-
ternational standard, LC will follow the practice of the National Library of
Australia in separating all individual syllables from each other except in the
cases of personal names, geographic locations, and certain proper nouns. The
essence of this recommendation for word division is actually to opt for syllable
division over word division. There is no doubt that this is the most economical
and trouble-saving approach to conversion, because Wade-Giles is
characterized by the same separation of individual syllables. The question is,
however, whether this is the best approach for users. Some have argued that non-aggregated
titles, that is, titles with syllable division, would provide more access
points and, therefore, have a greater possibility of being found (see, e.g.,
Studwell, Wang, and Wu 1993; Lo and Miller 1991). On the other hand, Mair
(2000, 2001a, 2001b) raises serious criticisms against the currently adopted
approach. Among many reasons cited, the strongest one against syllable divi-
sion consists in the fact that Pinyin without word division greatly increases the
amount of ambiguity caused by homophony (see also, Huang 2002 for a de-
tailed discussion). Arsenault (2000a) investigates if following the polysyllabic
method (i.e., word-division searches) significantly improves retrieval efficiency
(i.e., shorter complete-task time) and effectiveness (i.e., higher success
rate of retrieving items sought), over the monosyllabic method (i.e., sylla-
ble-division searches), in item-specific or known-item searching within online
bibliographic databases. The findings suggest that aggregation of monosylla-
bles in the Chinese-language title fields does improve retrieval efficiency in
database title searches.

RESEARCH DESIGN AND PROCEDURE

For this study two research questions were raised: (1) How does syllable or
word division in bibliographic records of Chinese materials affect title key-
word searches? (2) To what extent do differences in search results arise from
different ways of cataloging Chinese materials with Pinyin? To answer these
questions three databases, OCLC, RLIN, and Peking University Library
(PKUL), were selected for this research. The first two are large union catalog
databases popular among academic research libraries in the United States.
Both of them contain large pools of bibliographic records of Chinese materi-
als. RLIN, however, is different from OCLC in that a great proportion of its
Chinese-language records already contain aggregator characters with which
syllables forming words are joined with a “soft” link (Arsenault 2000a, 92,
116; Mair 2001a). The advantage of this feature of RLIN is that it supports
“word searches,” namely, searches in which syllables of words are aggregated
as one search term. In addition to OCLC and RLIN, a database in China, where
Pinyin originates, was examined to gain a comparative perspective. Therefore,
PKUL was selected as the third database to study. According to its web site,
PKUL is the largest university library in Asia, holding a collection of over 4.6
million. It differs from both OCLC and RLIN in an important aspect. That is,
at PKUL, Pinyin only serves as one of the modes in retrieving whereas its Chi-
nese records shown on the computer screen are rendered in Chinese charac-
ters. On the other hand, in both OCLC and RLIN the Chinese records are all in
Pinyin. Both of these databases may also provide Chinese-character software
support, but this feature is not widely available. This difference means that
word division in cataloging is not an issue with PKUL as it is with OCLC and
RLIN. But, how PKUL handles Pinyin as a search mode and whether word or
syllable division in title keyword searches affects results are the questions that
interest the researchers.
Three keywords were selected as search terms in the title mode in the three
databases. These are not very common words so that the pools of records re-
trieved were not too large. On the other hand, they are not so rare that they yield
very small pools of records for study. The three keywords were chosen from
three different semantic fields: shuǐdào 水稻 ‘paddy rice,’ yíngyǎng 营养 ‘nutrition,’
and xiūcí 修辞 ‘rhetoric.’ Each of these three words is composed of
two syllables. According to the popular Chinese dictionary Xiandai Hanyu
Cidian (Modern Chinese dictionary), those syllables are shared by varying
numbers of characters, or homophones, when the tones are ignored. To be
more exact, there are six characters sharing the syllable shui, 24 sharing dao,
58 sharing ying, 32 sharing yang, 26 sharing xiu, and 30 sharing ci. The fact
that multiple characters or homophones share the same syllable is represented
symbolically in (4).
(4) a. shui = X1, X2, X3, … X6
b. dao = Y1, Y2, Y3, … Y24
c. ying = X1, X2, X3, … X58
d. yang = Y1, Y2, Y3, … Y32
e. xiu = X1, X2, X3, … X26
f. ci = Y1, Y2, Y3, … Y30

However, when the syllables are aggregated, the chances of homophony are
very low even when the tones are ignored, as listed below:

(5) a. shuidao: (i) “paddy rice”; (ii) “waterway; water course; water route”
b. yingyang: (i) “nutrition”; (ii) “Mexican silver dollar (rarely used)”
c. xiuci: (i) “rhetoric”

In (5a), both (i) and (ii) are relatively common referents. In (5b), only (i) is a
common referent while (ii) is an extremely rare one. The aggregated syllables
in (5c) can only mean “rhetoric.” Again, (4) and (5) illustrate the importance
of aggregation, that is, combination of syllables into lexical units, in the
romanization of Chinese. It greatly reduces chances of ambiguity caused by
homophony.
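
The contrast between (4) and (5) can be made concrete with a little arithmetic. The Python sketch below multiplies the homophone counts given in (4) to obtain an upper bound on the character pairs that a toneless, syllable-divided string could spell, and sets that figure beside the number of words listed in (5) for the aggregated form. The product is an illustrative measure assumed here, not one used in the study, and the names `HOMOPHONES`, `AGGREGATED_WORDS`, and `character_pairs` are hypothetical.

```python
from math import prod

# Homophone counts per toneless syllable, taken from (4).
HOMOPHONES = {"shui": 6, "dao": 24, "ying": 58, "yang": 32, "xiu": 26, "ci": 30}

# Number of words each aggregated form can stand for, taken from (5).
AGGREGATED_WORDS = {
    ("shui", "dao"): 2,
    ("ying", "yang"): 2,
    ("xiu", "ci"): 1,
}


def character_pairs(syllables):
    """Upper bound on the character sequences the separated syllables could spell.

    Assumption: every homophone of one syllable may combine with every
    homophone of the other; most such pairs are not real words.
    """
    return prod(HOMOPHONES[s] for s in syllables)


for syllables, words in AGGREGATED_WORDS.items():
    separated = " ".join(syllables)
    aggregated = "".join(syllables)
    print(f"{separated}: up to {character_pairs(syllables)} character pairs; "
          f"{aggregated}: {words} word(s)")
```

Under this assumption the separated syllables admit hundreds of character pairs, while each aggregated form corresponds to at most two dictionary words.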
The title keyword searches of the above three words were conducted re-
spectively in OCLC, RLIN, and PKUL by using both word-division and sylla-
ble-division methods. A total of 6,632 records were retrieved in eighteen
searches. The results were examined to distinguish relevant records from irrel-
evant ones. The criterion for determining if a record is relevant or irrelevant was
to see if the record title or the series title (if the record is one volume of a series)
contains the keyword searched, e.g., shuidao or shui dao. Basically, there are
two kinds of irrelevant records: (a) those titles that contain homophonous
words; and (b) those titles that contain homophonous syllables. Again, taking
shuidao or shui dao for instance, the examples in (6) illustrate relevant and irrelevant
records. In these examples, the first line presents a title as it appears in
the record: namely, the title is divided into syllables, with each syllable
separated from the next by a space. The second line presents the same title
with syllables aggregated into words. Beneath it, the third line provides a
word-by-word gloss and the fourth and last line is an English translation in
single quotes. The bold type highlights the parts where homophony takes
place.

(6) a. Shui dao zai pei
       Shuidao zaipei
       paddy-rice cultivation
       ‘The cultivation of paddy rice’

    b. Ru hai shui dao ji hua
       Ru hai shuidao jihua
       into ocean water-courses plan
       ‘Plan for water courses into the ocean’

    c. Cong shan shui dao shan shui hua
       Cong shanshui dao shanshuihua
       from mountain-and-waters to mountain-and-water paintings
       ‘From mountains and waters to mountain-and-water paintings (i.e., landscape paintings)’

    d. Shui li dian li ji shu bao dao
       Shuili dianli jishu baodao
       hydraulic electric technology report
       ‘Report on hydraulic and electric technology’

    e. Cong huo dao shui
       Cong huo dao shui
       from fire to water
       ‘From fire to water’

    f. Wo bu zhi dao wo shi shui
       Wo bu zhidao wo shi shui
       I not know I am who
       ‘I don’t know who I am’

Example (6a) is a title about the cultivation of paddy rice, so it is a relevant record. In
(6b) shui dao means “water courses” or “waterways,” which is homophonous
with the word that means “paddy rice.” It is therefore retrieved but is an irrele-
vant record. In (6c) shui is part of a compound that means “mountains and wa-
ters” whereas dao is a preposition meaning “to.” So it is again an irrelevant
record. By the same token, (6d-f) all contain syllables homophonous with shui
dao ‘paddy rice.’ So they are all irrelevant records too. The titles from PKUL
are all in Chinese characters, so it is easy to tell whether they are relevant or
irrelevant.

After the relevant records are distinguished from the irrelevant ones, the
percentages for recall and precision are calculated. Recall and precision are
the two most common measures for evaluating effectiveness in information
retrieval (see, e.g., Hagler 1997; Harman 1997; Jones and Willett 1997b; Keen
1997; Lancaster 1997, 1998; Saracevic et al. 1997). Generally speaking, recall
refers to “the ability to retrieve useful items” whereas precision refers to “the
ability to avoid useless ones” (Lancaster 1998, 3). Recall and precision are of-
ten defined by the following formulas (Harman 1997, 251-52):

\[
\text{Recall} = \frac{\text{number of relevant items retrieved}}{\text{total number of relevant items in collection}}
\]

\[
\text{Precision} = \frac{\text{number of relevant items retrieved}}{\text{total number of items retrieved}}
\]

Precision is straightforward to determine: according to the formula, it is the
number of relevant records retrieved divided by the total number retrieved.
Recall is harder to determine, because there is practically no way to know how
many relevant records are not retrieved from the database. However, a simple criterion
is applicable for the limited purpose of the present study. As Lancaster points
out, relevance in information retrieval can and should be defined by “the satis-
faction of some information need” (1998, 3). Users of libraries “convert an in-
formation need into some form of ‘search strategy,’” no matter how simple or
elaborate that strategy may be (Lancaster 1998, 1-2). For this study, in which
“keyword in the title” method is employed in the searches, all those records
whose titles contain the keyword searched were, therefore, regarded as rele-
vant, because they satisfied the specific information need of the searches. In
other words, if a title (including series title) does not contain the keyword
searched, it is considered as irrelevant. When the keyword in Pinyin, i.e., the
two syllables, is entered in a non-aggregated fashion, the pool of records retrieved
should, at least theoretically, cover all those titles (including series titles)
in the database that contain these two syllables, no matter in what way
and regardless of the meaning, as illustrated by (6) earlier. The total retrieved
in this manner minus those irrelevant records should be the total possible rele-
vant records in the database that can be retrieved. So, for the purpose of the
present study, this number should exhaust all relevant records in the database
and, therefore, constitutes 100% recall of the records whose titles contain the
keyword searched.
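
The convention just described can be sketched in a few lines of Python. The function names and the counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; the one assumption carried over from the text is that the syllable-divided search retrieves every title containing the keyword, so that its relevant records serve as the 100% recall baseline.

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of the retrieved records that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Share of the relevant records in the database that were retrieved."""
    return relevant_retrieved / total_relevant


def baseline_relevant(syllable_retrieved: int, syllable_irrelevant: int) -> int:
    """Total relevant records, estimated from the syllable-divided search.

    Assumption from the text: the non-aggregated search covers all titles
    containing the two syllables, so retrieved minus irrelevant is taken
    to be every relevant record in the database.
    """
    return syllable_retrieved - syllable_irrelevant


# Hypothetical counts, not taken from the study.
total_relevant = baseline_relevant(syllable_retrieved=400, syllable_irrelevant=250)
p = precision(relevant_retrieved=30, total_retrieved=40)
r = recall(relevant_retrieved=30, total_relevant=total_relevant)
print(f"baseline relevant = {total_relevant}, precision = {p:.1%}, recall = {r:.1%}")
```

With these made-up numbers, an aggregated search that retrieves 40 records, 30 of them relevant, has high precision but recovers only a fifth of the 150 relevant records.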

RESULTS AND ANALYSIS

The results obtained from the three databases for each of the three key-
words searched are presented in Tables 1-3. As shown in Table 1, “OCLC
shuidao” yielded 100% precision, but its recall percentage is very poor,
only 6.5%, which is derived from the relevant records retrieved (13) divided by the total
relevant in the database (200). Note that the total relevant in the database (200)
is derived from the total retrieved for “OCLC shui dao” (505) minus the total
irrelevant retrieved (305). The total of 305 irrelevant records retrieved is due
to the sources of error demonstrated by (6) earlier. Of these 305, notably, 131
titles contain the homophonous word shuidao ‘waterway; water course; water
route’ (see (6b) for an example). The remaining 174 are due to sources of error
demonstrated by (6c-f).
At this point, it is worth digressing to ask a question: If the Chinese records
in OCLC are cataloged with non-aggregated Pinyin, why can we retrieve 13
records after we search on shuidao, which has two syllables aggregated into
one unit? The researchers examined all 13 records and found that the cataloger
had included an “other title” field that is rendered in aggregated Pinyin, i.e.,
with word division, while their “title” proper is rendered in non-aggregated
Pinyin, i.e., with syllable division. That is why they were retrieved after the
aggregated shuidao is entered.
In RLIN, as shown in Table 1, the aggregated shuidao has yielded a total of
127 retrieved records, 98 of them being relevant. Precision reaches 77.2%
while recall is as high as 87.5%. The relatively high percentages in both recall
and precision with RLIN are due to the fact that “a great proportion of the Chinese-language
records in that database contain aggregator characters in the
Romanized fields” (Arsenault 2000a, 116). It is worth noting that, when the
two syllables are separated in another search in RLIN, although recall improves
from 87.5% (98 relevant records) to 100% (112 relevant records), precision
falls considerably from 77.2% to 23.3%. That is to say, one has to go
through 481 records to find 112 of them relevant.

TABLE 1. Results for Searches on Shuidao and Shui Dao in the Three Databases

Database/keyword   Retrieved   Relevant   Precision %          Recall %
OCLC shuidao       13          13         13/13 = 100%         13/200 = 6.5%
OCLC shui dao      505         200        200/505 = 39.6%      100%
RLIN shuidao       127         98         98/127 = 77.2%       98/112 = 87.5%
RLIN shui dao      481         112        112/481 = 23.3%      100%
PKUL shuidao       69          63         63/69 = 91.3%        100%
PKUL shui dao      69          63         63/69 = 91.3%        100%

TABLE 2. Results for Searches on Yingyang and Ying Yang in the Three Databases

Database/keyword   Retrieved   Relevant   Precision %          Recall %
OCLC yingyang      46          14         14/46 = 30.4%        14/512 = 2.7%
OCLC ying yang     1033        512        512/1033 = 49.6%     100%
RLIN yingyang      165         110        110/165 = 66.7%      110/391 = 28.1%
RLIN ying yang     1208        391        391/1208 = 32.4%     100%
PKUL yingyang      85          84         84/85 = 98.8%        100%
PKUL ying yang     85          84         84/85 = 98.8%        100%

TABLE 3. Results for Searches on Xiuci and Xiu Ci in the Three Databases

Database/keyword   Retrieved   Relevant   Precision %          Recall %
OCLC xiuci         34          34         34/34 = 100%         34/561 = 6.1%
OCLC xiu ci        1004        561        561/1004 = 55.9%     100%
RLIN xiuci         290         290        290/290 = 100%       100%
RLIN xiu ci        1004        10         10/1004 = 0.97%      10/290 = 3.4%
PKUL xiuci         207         203        203/207 = 98.1%      100%
PKUL xiu ci        207         203        203/207 = 98.1%      100%
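
The percentages reported in Table 1 can be recomputed from the raw counts. The Python sketch below copies the retrieved and relevant counts from Table 1 and, following the convention described earlier, takes the relevant records found by each database's syllable-divided search as that database's recall baseline. The names `TABLE_1`, `percentages`, and `RESULTS` are hypothetical.

```python
# (database, query, retrieved, relevant retrieved), copied from Table 1.
TABLE_1 = [
    ("OCLC", "shuidao", 13, 13),
    ("OCLC", "shui dao", 505, 200),
    ("RLIN", "shuidao", 127, 98),
    ("RLIN", "shui dao", 481, 112),
    ("PKUL", "shuidao", 69, 63),
    ("PKUL", "shui dao", 69, 63),
]


def percentages(rows):
    """Return {(database, query): (precision %, recall %)}, rounded to one decimal."""
    # Recall baseline per database: the relevant records found by the
    # syllable-divided query (the one containing a space).
    baseline = {db: relevant for db, query, _, relevant in rows if " " in query}
    results = {}
    for db, query, retrieved, relevant in rows:
        precision = round(100 * relevant / retrieved, 1)
        recall = round(100 * relevant / baseline[db], 1)
        results[(db, query)] = (precision, recall)
    return results


RESULTS = percentages(TABLE_1)
for (db, query), (p, r) in RESULTS.items():
    print(f"{db} {query}: precision {p}%, recall {r}%")
```

The OCLC baseline of 200 is the same figure obtained in the text by subtracting the 305 irrelevant records from the 505 retrieved for shui dao.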
The most remarkable finding is that, with the third database, that of PKUL,
the results remain exactly the same whether the two syllables are joined
(shuidao) or separated (shui dao). In both cases, the total retrieved is 69, with
63 of them being relevant. Thus, both recall and precision percentages are very
high, 100% and 91.3% respectively. These results show that, with this data-
base, it does not matter how the end-user enters the Pinyin keyword.
Compared with Table 1, two things are distinctly different in Table 2 and
Table 3. First, for “OCLC yingyang” in Table 2, precision percentage is quite
low (30.4%) while it is 100% for “OCLC shuidao.” This is because 32 records
out of the total 46 retrieved are about a historic place called Yingyang, which is
homophonous with yingyang ‘nutrition’ under search. Second, it is worth noting
that in Table 3 “RLIN xiu ci” actually produced as many as
15,942 records. Only the first 1,004 records were used for analysis. Surprisingly,
just 10 records out of the first 1,004 were relevant, by chance or not,
resulting in an extremely low precision percentage (0.97%). The recall per-
centage is very low as well (3.4%). In sharp contrast, it is shown in Tables 2
and 3 again that PKUL database is not sensitive to aggregation or non-aggre-
gation of the two syllables. Recall percentage is 100% for both, and precision
percentage is 98.8% and 98.1% respectively in Table 2 and Table 3.
In summary, the study shows that with both OCLC and RLIN, whether syllables
are aggregated or not in the search makes a marked difference to the results.
In OCLC, aggregation of the syllables of the keyword results in very high
precision. It is 100% precision for both shuidao ‘paddy rice’ and xiuci ‘rheto-
ric.’ Although the counterpart percentage for yingyang ‘nutrition’ is relatively
low, only 30.4%, this relatively low percentage results from the fact that all the
“irrelevant” records are about a place of historical interest, Yingyang, which is
homophonous with yingyang ‘nutrition.’ If we ignore this confounding factor,
the precision for yingyang is also 100%. On the other hand, when the two
syllables of the keywords are aggregated, the recall percentages are very poor:
6.5% for shuidao ‘paddy rice,’ 2.7% for yingyang ‘nutrition,’ and 6.1% for
xiuci ‘rhetoric.’ This is because, as noted earlier, OCLC title keyword searches
in aggregated Pinyin only retrieve those records that are cataloged with an
“other title” field in aggregated Pinyin. But the overwhelming majority of Chi-
nese records do not have “other titles” in aggregated Pinyin in OCLC.
In RLIN, whether the syllables are aggregated into words in title keyword
searches also makes a noticeable difference in precision percentages. When
the syllables of the searched keywords are aggregated, the precision percent-
ages are 77.2% for shuidao ‘paddy rice,’ 66.7% for yingyang ‘nutrition,’ and
100% for xiuci ‘rhetoric.’ But the corresponding percentages drop to 23.3%
for shui dao ‘paddy rice,’ 32.4% for ying yang ‘nutrition,’ and 0.97% for xiu ci
‘rhetoric’ when the syllables of the keywords are not aggregated. One fact that
works against searching title keywords in non-aggregated Pinyin in both
OCLC and RLIN is that it usually retrieves large numbers of records with rela-
tively poor precision that may not be practically manageable. Thus, in this
study, the total retrieved records are 505 from OCLC and 481 from RLIN for
shui dao ‘paddy rice,’ 1,033 from OCLC and 1,208 from RLIN for ying yang
‘nutrition,’ and 1,004 from OCLC and 15,942 from RLIN for xiu ci ‘rhetoric.’
These results show that the argument that non-aggregated titles, i.e., titles with
syllable division, would provide more access points and, therefore, have a
greater possibility of being found (see, e.g., Studwell, Wang, and Wu 1993; Lo
and Miller 1991) is flawed, because it upsets the
balance between recall and precision, resulting in impractically large numbers
of titles retrieved with very poor precision. It is therefore not
user-friendly at all.

The most significant finding of this study is that with PKUL the marked
difference in aggregation observed in OCLC and RLIN does not exist.
There, whether or not the syllables are aggregated in “keywords in title” searches is
inconsequential. The results remain exactly the same: 69 retrieved with 91.3%
precision for shuidao or shui dao; 85 retrieved with 98.8% precision for
yingyang or ying yang; and 207 retrieved with 98.1% precision for xiuci or xiu ci. For the
purpose of this study, it can be assumed that its recall approaches 100% too.
All factors considered, PKUL produces the best results of all three databases.
At this point, a question to ask is: Why does PKUL yield much higher recall
and precision percentages than OCLC and RLIN, regardless of whether the
keywords entered in “keywords in title” searches are aggregated? Readers are
referred back to (6), where the bold-typed parts in the example titles can be
abstracted into the following configuration, characterized by the presence or
absence of two features: sequence and adjacency. If X precedes Y, then they
are sequential (i.e., Sequence: Yes). If, instead, Y precedes X, then they are not
sequential (i.e., Sequence: No). Adjacency here means that X and Y are not
separated by one or more other elements (e.g., XZY); an intervening space
does not count.

(7)  Configuration        Sequence   Adjacency
     a. X Y (= XY1)       Yes        Yes
     b. X Y (= XY2)       Yes        Yes
     c. X Y (= WX Y)      Yes        Yes
     d. X …Y              Yes        No
     e. YX                No         Yes
     f. Y…X               No         No

This configuration of X and Y is interpreted as follows. In (7a), “X Y” or
“XY1” stands for shuidao ‘paddy rice’ in (6a), which is a right target for re-
trieval. In (7b), “X Y” or “XY2” stands for shuidao ‘waterway; water course;
water route’ in (6b), which is not a right target for retrieval but is a “reasonable
error” because X and Y are both sequential and adjacent. In (7c), “X Y” or,
more exactly, “WX Y” stands for shanshui dao ‘mountains-and-waters to…’
in (6c), which is not a right target for retrieval either but is again a “reasonable
error” because X and Y are both sequential and adjacent (note: the space does
not affect their status as being adjacent to each other). In (7d-f), however, the
combinations of X and Y are not right targets for retrieval, and they are “unrea-
sonable errors” due to the absence of sequence and/or adjacency.
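The two features can be made concrete with a short Python sketch. It assumes a title is already split into a list of syllables, so a space never counts as a separator, and it looks only at the first occurrence of each syllable. The function name `classify` and the abstract syllables W and Z are illustrative.

```python
from typing import Optional, Sequence, Tuple


def classify(title: Sequence[str], x: str, y: str) -> Optional[Tuple[bool, bool]]:
    """Return (sequence, adjacency) for the first X and the first Y in a title.

    The title is a list of syllables, so spaces never count as separators.
    Returns None when either syllable is missing from the title.
    """
    if x not in title or y not in title:
        return None
    i, j = title.index(x), title.index(y)
    sequence = i < j               # X precedes Y
    adjacency = abs(i - j) == 1    # no other syllable between X and Y
    return sequence, adjacency


# Abstract titles mirroring (7c)-(7f); W and Z are arbitrary other syllables.
EXAMPLES = {
    "c": ["W", "X", "Y"],
    "d": ["X", "Z", "Y"],
    "e": ["Y", "X"],
    "f": ["Y", "Z", "X"],
}

for label, title in EXAMPLES.items():
    print(label, classify(title, "X", "Y"))
```

Only a result of `(True, True)` corresponds to a right target or a “reasonable error”; any other combination is an “unreasonable error” in the sense used above.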
In both OCLC and RLIN, when “XY” is entered in a title keyword search,
the databases will treat it as one keyword. On the other hand, when “X Y” (i.e.,
with a space in between) is entered, the databases will treat this as two separate
keywords, regardless of their sequence and/or adjacency. Therefore, all the rec-
ords with the configuration of (7a-f), instantiated by (6a-f), will be retrieved.
That is why the precision percentage is low for both OCLC and RLIN when
Pinyin keywords are entered in a non-aggregated fashion. In general, when the
two syllables of the search term are not aggregated in these two databases, the
user is faced with the problem that an impractically large number of records is
retrieved with a relatively low precision percentage. In other words, the bal-
ance between recall and precision is lost. This is because these databases actu-
ally treat “X Y” as two separate keywords “X” and “Y.” For that reason, they
retrieve all the records whose titles in part match each of the four cases in (8)
characterized by two features: sequence and adjacency (ignoring the space).
Obviously, only those titles that in part match (8a) have the potential of being a
relevant record whereas all those matching cases (8b-d) will be irrelevant rec-
ords.
(8) a. XY (Sequence: Yes; Adjacency: Yes)
b. X …Y (Sequence: Yes; Adjacency: No)
c. YX (Sequence: No; Adjacency: Yes)
d. Y…X (Sequence: No; Adjacency: No)
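The behavior described for OCLC and RLIN amounts to a Boolean AND over space-separated keywords. The Python sketch below illustrates that behavior on abstract titles; it is an assumption-laden model of the observed results, not the actual indexing code of either database, and the function name and sample titles are hypothetical.

```python
from typing import List


def and_keyword_search(titles: List[str], query: str) -> List[str]:
    """Return titles that contain every space-separated keyword of the query,
    in any order and at any distance (a plain Boolean AND)."""
    keywords = query.split()
    hits = []
    for title in titles:
        words = set(title.split())  # each space-delimited unit is one keyword
        if all(k in words for k in keywords):
            hits.append(title)
    return hits


# Abstract titles: the first four match (8a)-(8d); the last is an aggregated record.
TITLES = ["X Y", "X Z Y", "Y X", "Y Z X", "XY W"]

print(and_keyword_search(TITLES, "X Y"))  # all four non-aggregated titles
print(and_keyword_search(TITLES, "XY"))   # only the aggregated title
```

Entering “X Y” retrieves every case in (8), whereas entering “XY” matches only records in which the cataloger aggregated the two syllables.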

In PKUL, in contrast, when either “XY” or “X Y” (i.e., with or without a space
between the two syllables) is entered in a title keyword search, the database
imposes two conditions: sequence and adjacency (ignoring the space). That is
to say, both conditions have to be fulfilled for a particular record to be
retrieved. In other words, the PKUL database will only retrieve those records
with the configuration of (8a) while filtering out (8b-d). That is why its
precision percentage is very high compared with OCLC and RLIN.
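A minimal Python sketch of a search that enforces sequence and adjacency while ignoring spaces is given below. It models only the behavior described here; how PKUL actually implements its index is not stated in this study. The function name and sample titles are hypothetical, and collapsing spaces is a simplification that could create spurious matches across syllable boundaries in real Pinyin.

```python
from typing import List


def phrase_search(titles: List[str], query: str) -> List[str]:
    """Return titles in which the query syllables occur in order and next to
    each other, treating spaces in both query and title as irrelevant."""
    needle = "".join(query.split())  # "X Y" and "XY" both become "XY"
    if not needle:
        return []
    return [t for t in titles if needle in "".join(t.split())]


# Abstract titles covering aggregated, non-aggregated, and the (8b)-(8d) cases.
TITLES = ["X Y", "XY W", "W X Y", "X Z Y", "Y X", "Y Z X"]

print(phrase_search(TITLES, "X Y"))
print(phrase_search(TITLES, "XY"))
```

Both queries return the same three titles, which mirrors the finding that aggregation makes no difference in PKUL.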
It is worth pointing out that, although only three keywords were studied and
a quantitative method was used, the difference found between PKUL on the
one hand and OCLC and RLIN on the other is qualitative, not quantitative.
That is, in all three tests, PKUL returned exactly the same results whether or
not the two syllables of the keyword were aggregated, whereas this is not the
case with either OCLC or RLIN. The same results are expected if other
keywords are tested. The difference in retrieval results reflects a fundamental
difference in how cataloging and indexing are handled.

CONCLUSION

The findings of this study shed light on the issue of word division under de-
bate in the library community concerned with Chinese records. Catalogers
may interpret and segment titles in wrong ways, resulting in inconsistencies in
cataloging. Furthermore, users may enter search terms in wrongly aggregated
Pinyin. These inconsistencies in word division between cataloger-generated
records, and between these records and user-input queries, will arguably affect
retrieval in a negative way. The PKUL approach to cataloging and retrieving,
characterized by sequence and adjacency, resolves many of the problems as
discussed. It is strongly recommended that this approach to cataloging and re-
trieving be adopted by the library community at large.
Regarding the issue of word division, there is no doubt that aggregation of
syllables into words should greatly benefit library users by improving the
readability of Chinese titles in the Pinyin mode. At present, both OCLC and
RLIN already offer optional software support for reading Chinese characters.
With this feature available, aggregation of syllables into words may not be as
crucial. In reality, however, the ability to read original Chinese scripts is still
not widely available in the library community. Most end users rely on the
Pinyin mode in their searches. Therefore, word division is a crucial factor that
directly affects the effectiveness and efficiency of searches by end users. The
importance of word division in cataloging and indexing still needs to be
studied, and it is hoped that the library community concerned with Chinese
language materials will continue research that will lead to a better
understanding of the issue.

Received: December, 2002
Revised: February, 2003
Accepted: May, 2003

WORKS CITED

Anderson, James. 1980. Cataloging and classification of Chinese language library
materials. In Cataloging and classification of non-Western materials: Concerns,
issues, & practices, ed. Mohammed M. Aman, 93-129. Phoenix: Oryx Press.
Arsenault, Clément. 1998. “Conversion of Wade-Giles to Pinyin: An estimation of ef-
ficiency improvement in retrieval for item-specific OPAC searches.” Canadian
Journal of Information and Library Science 23: 1-28.
______. 2000a. Word division in the transcription of Chinese script in the title fields of
bibliographic records. Ph.D. diss., University of Toronto.
______. 2000b. Testing the impact of syllable aggregation in romanized fields of Chi-
nese language bibliographic records. In Dynamism and stability in knowledge orga-
nization, eds. Clare Beghtol, Lynne C. Howarth, and Nancy J. Williamson, 143-49.
Würzburg, Germany: Ergon Verlag.
______. 2002. “Pinyin romanization for OPAC retrieval: Is everyone being served?”
Information Technology and Libraries 21: 45-50.
Groom, Linda. 1997. “Converting Wade-Giles cataloging to Pinyin: The development
and implementation of a conversion program for the Australian National CJK ser-
vice.” Library Resources & Technical Services 41: 254-63.
Hagler, Ronald. 1997. The bibliographic record and information technology (3rd ed.).
Chicago and London: American Library Association/Ottawa: Canadian Library
Association.
Harman, Donna. 1997. The TREC conferences. In Readings in information retrieval,
eds. Karen Sparck Jones and Peter Willett, 247-56. San Francisco, CA: Morgan
Kaufmann Publishers.
Harrison, Scott Edward. 1992. “Chinese names in English.” Cataloging & Classifica-
tion Quarterly 15: 3-14.
Hiatt, Robert Miller. 1998. “Chinese place names.” Chinese Librarianship: An Inter-
national Electronic Journal 5: 1-5. Retrieved April 14, 2002 on the World Wide
Web: http://www.whiteclouds.com/iclc/cliej/cl5hiatt.htm.
Hu, Qianli. 1994. “How to distinguish and catalog Chinese personal names.” Catalog-
ing & Classification Quarterly 19: 29-60.
Huang, Jie. 2002. The issue of word division in cataloging Chinese language materials.
Master’s thesis. The University of Oklahoma.
Jones, Karen Sparck, and Peter Willett, eds. 1997a. Readings in information retrieval.
San Francisco, CA: Morgan Kaufmann Publishers.
Keen, E. Michael. 1997. Presenting results of experimental retrieval comparisons. In
Readings in information retrieval, eds. Karen Sparck Jones and Peter Willett,
217-22. San Francisco, CA: Morgan Kaufmann Publishers.
Lancaster, F. W. 1997. MEDLARS: Report on the evaluation of its operating
efficiency. In Readings in information retrieval, eds. Karen Sparck Jones and Peter
Willett, 223-46. San Francisco, CA: Morgan Kaufmann Publishers.
______. 1998. Indexing and abstracting in theory and practice (2nd ed.). Champaign,
IL: University of Illinois Graduate School of Library and Information Science.
Lau, Shuk-Fong, and Vicky Wang. 1991. “Chinese personal names and titles: Prob-
lems in cataloging and retrieval.” Cataloging & Classification Quarterly 13: 45-65.
______. 1993. “Chinese personal names and titles: Issues in cataloging and retrieval.”
Encyclopedia of Library and Information Science 52: 47-64.
Li, Charles N., and Sandra A. Thompson. 1981. Mandarin Chinese: A functional refer-
ence grammar. Berkeley, CA: University of California Press.
Lin, Joseph C. 1988. “Chinese names containing a non-Chinese given name.” Catalog-
ing & Classification Quarterly 9: 69-81.
Lo, Karl K., and R. Bruce Miller. 1991. “Computers and romanization of Chinese bib-
liographic records.” Information Technology and Libraries 10: 221-33.
Mair, Victor H. 2000. “Pinyin orthographical rules for libraries.” Chinese Librarian-
ship: An International Electronic Journal 10: 1-3. Retrieved April 14, 2002 on the
World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl10mair.htm.
______. 2001a. “Pinyin orthographical rules for libraries: A follow-up.” Chinese Li-
brarianship: An International Electronic Journal 11: 1-7. Retrieved April 14, 2002
on the World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl11mair.htm.
______. 2001b. “Pinyin orthographical rules for libraries: A recent literature review.”
Chinese Librarianship: An International Electronic Journal 11: 1-3. Retrieved
April 14, 2002 on the World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl11mair2.htm.
Melzer, Philip A. 1996a. “Pinyin romanization: New developments and possibilities.”
Journal of East Asian Libraries 109: 91-92.
______. 1996b. “Pinyin romanization: Word division recommendation.” Chinese Li-
brarianship: An International Electronic Journal 2: 1-3. Retrieved April 14, 2002
on the World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl2phil.htm.
______. 1997. “Library of Congress converting to Pinyin for Chinese romanization.”
Chinese Librarianship: An International Electronic Journal 4: 1-2. Retrieved
April 14, 2002 on the World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl4phil2.htm.
Saracevic, Tefko, Paul Kantor, Alice Y. Chamis, and Donna Trivison. 1997. A study of
information seeking and retrieving: Background and methodology. In Readings in
information retrieval, eds. Karen Sparck Jones and Peter Willett, 175-90.
San Francisco, CA: Morgan Kaufmann Publishers.
Studwell, William E., Rui Wang, and Hong Wu. 1993. “A tale of two decades: The
controversy over the choice of a Chinese language romanization system in Ameri-
can cataloging practice.” Cataloging & Classification Quarterly 18: 117-24.
Tao, Hanyu, and Charles Cole. 1990. “Wade-Giles or Hanyu Pinyin: Practical issues in
the transliteration of Chinese titles and proper names.” Cataloging & Classification
Quarterly 12: 105-17.
Teng, Ju-yen. 1998. “A few thoughts on Hiatt’s three principles on Chinese place
names.” Chinese Librarianship: An International Electronic Journal 5: 1-2. Retrieved
April 14, 2002 on the World Wide Web: http://www.whiteclouds.com/iclc/cliej/cl5teng.htm.
