
Jointly published by Akadémiai Kiadó, Budapest, and Springer, Dordrecht
Scientometrics, Vol. 65, No. 2 (2005) 245-263

A comparison of two bibliometric methods for mapping of the research front

BO JARNEVING

Swedish School of Library and Information Science, Borås (Sweden)

This paper builds on previous research concerned with the classification and specialty mapping
of research fields. Two methods are put to the test in order to decide whether their mappings of
the research front of a science field differ significantly. The first method
was based on document co-citation analysis where papers citing co-citation clusters were assumed
to reflect the research front. The second method was bibliographic coupling where likewise citing
papers were assumed to reflect the research front. The application of these methods resulted in two
different types of aggregations of papers: (1) groups of papers citing clusters of co-cited works and
(2) clusters of bibliographically coupled papers. The comparison of the two methods as to
mapping results was pursued by matching word profiles of groups of papers citing a particular co-
citation cluster with word profiles of clusters of bibliographically coupled papers. Findings
suggested that the research front was portrayed in two considerably different ways by the methods
applied. It was concluded that the results in this study would support a further comparative study
of these methods on a more detailed and qualitative ground. The original data set encompassed
73,379 articles from the fifty most cited environmental science journals listed in Journal Citation
Report, science edition, downloaded from the Science Citation Index on CD-ROM.

Introduction

The objective of this research is to compare two bibliometric methods for the
mapping of the subject content of a field's research front. Thus, this paper pertains to
the corpus of research concerned with the classification and specialty mapping of
research fields that was initially launched in the 1970s by the Institute for Scientific
Information (ISI). In BRAAM et al. (1991), with reference to PRICE (1965), the authors
describe the constitution of the research front by referring to the way researchers direct
references to a small and select part of the more recent literature. Co-citation analysis is
then seen as a way of identifying high-density areas in a citation network by clustering
highly co-cited documents, constituting the intellectual base of a discipline. Thus, these
high-density areas mirror the research front documents which could be identified and
grouped by their citing relations to co-citation clusters and their subject content
elaborated and compared through word profile analysis (ibid.). Another definition of the
research front is given in PERSSON (1994), where the research front is said to consist of
those documents that are similar in terms of citing the same literature, that is, similar in
the meaning of being bibliographically coupled by common references, and current,
relatively highly cited documents are seen as the intellectual base.

Received June 13, 2005

Address for correspondence:
BO JARNEVING
Swedish School of Library and Information Science, 501 90 Borås, Sweden
E-mail: bo.jarneving@hb.se

0138-9130/US $ 20.00
Copyright 2005 Akadémiai Kiadó, Budapest
All rights reserved
B. JARNEVING: Mapping of the research front

According to FRANKLIN & JOHNSTON (1988, p. 328), a co-citation bibliometric
model defines coherent research problem areas by classifying and grouping current
scientific papers through their common referencing to clusters of highly cited and
highly co-cited works. The basic unit of this model is the co-citation cluster, which is
made up of two components: (1) a set of highly cited and co-cited referenced works
called the base literature and (2) a set of articles that referenced those, called the
specialty's published current literature. What are actually clustered are the cited works.
Furthermore, the base literature is considered to represent the cores of theories and
methods and the citing articles describe the research front of the problem area at the
time period under investigation. With reference to this model, one could in a similar
way describe the constituents of a cluster containing bibliographically coupled papers
as: (1) a set of referenced works of which each is common to at least one pair of source
articles and (2) the set of articles that referenced them. The difference in this case is that
what are actually clustered are the citing articles. Then the citing articles could be said
to be part of a research front and the cited works constitute the base literature. In both
cases the clustering of the literature is originally based on a referencing consensus.
In this paper, two measures of document similarity are applied: bibliographic
coupling strength and co-citation strength. With a starting point in the clustering of
papers related by bibliographic coupling and in the clustering of co-cited documents,
two methods are put to test, resulting in two different types of aggregations of papers:
(1) groups of papers citing clusters of co-cited papers, i.e., a specialty's current
literature and (2) clusters of bibliographically coupled papers. What is compared then
are groups of current citing papers that originate from approximately the same period of
time. In the case of papers grouped by citing a particular co-citation cluster, the only
condition to fulfil is to cite a reference in that co-citation cluster. Also, we could expect
that groups of papers citing co-citation clusters are not pairwise disjoint, hence papers
citing more than one co-citation cluster might be considered bibliographically coupled
on a higher level.
Consequently, these different principles of grouping might lead to different mapping
results when applied, and the question is whether these results would deviate
considerably from one another. If so, this would imply good reasons to further
investigate qualitative differences of results emanating from each method and to
elaborate underlying causes on a more detailed level. However, the purpose of this
paper only encompasses the quantitative and preliminary study of such a possible
significant deviation, hence this study is solely indicative.


It should be underlined that this study is based on the assumption that the
aggregation of significant title words in cluster word profiles could describe the subject
content of a cluster in a valid way.

Method

Two methods for the mapping of the research front are compared: (1) the clustering
of co-cited references, where papers citing a particular co-citation cluster are grouped
through this association, and (2) the clustering of bibliographically coupled papers. The
comparison of the two methods as to mapping results is pursued by matching word
profiles of groups of papers citing a particular co-citation cluster with word profiles of
clusters of bibliographically coupled papers. In the following text of the method part,
the two underlying methods, co-citation analysis and bibliographic coupling, are
outlined, and the method for word profile analysis is described, as well as
the cluster method applied. Furthermore, the practical application of the methods, including
the threshold settings used for noise reduction, is accounted for. The method part
ends with a description of the data collection and of the features of the data,
including the features of the distributions that result when these thresholds are applied.

Bibliographic coupling

Bibliographic coupling was introduced to the scientific community by KESSLER through
a number of papers at the beginning of the 1960s, starting with two reports from
Massachusetts Institute of Technology (MIT), Lincoln Laboratory, in 1960 and 1961. It
was primarily described as a method for grouping technical and scientific papers,
facilitating scientific information provision and document retrieval. In the first report a
general outline of the context in which an indexing method concerned with countable
indicators based on references might operate was given, and in the second, the
definition of bibliographic coupling was stated as: a single item of reference shared
by two papers is defined as a unit of coupling between them. Based on this unit, two
graded criteria of coupling were defined.
Criterion A: A number of papers constitute a related group $G_A$ if each member
of the group has at least one reference (one coupling unit) in common with a
given test paper, $P_0$. The coupling strength between $P_0$ and any member of $G_A$ is
measured by the number of coupling units between them. $G_A^n$ is that portion of
$G_A$ that is linked to $P_0$ through n coupling units. (According to this criterion
there need not be any coupling between the members of $G_A$, only between them
and $P_0$.)

Criterion B: A number of papers constitute a related group $G_B$ if each member
of the group has at least one coupling unit with every other member of the group.
The coupling strength of $G_B$ is measured by the number of coupling units
between its members. Criterion B differs from criterion A in that it forms a closed
structure of interrelated papers, whereas criterion A forms an open structure of
papers related to a test paper.
Hence, if groups or clusters of papers associated by bibliographic coupling units are
regarded as parts of a larger network, they could be considered sub-graphs: groups of
papers fulfilling the conditions of criterion A could be considered incomplete
sub-graphs, whereas groups of papers fulfilling the conditions of criterion B could be
considered complete sub-graphs with a maximal level of internal connectedness.
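
The two criteria can be illustrated with a minimal Python sketch. The toy reference sets and the function names below are hypothetical and serve only to show the open structure of criterion A against the closed structure of criterion B.

```python
from itertools import combinations

# Hypothetical toy data: paper id -> set of cited references.
references = {
    "P0": {"r1", "r2", "r3"},
    "P1": {"r1", "r2", "r9"},
    "P2": {"r3", "r7"},
    "P3": {"r7", "r8"},
}

def coupling_units(a, b):
    """Number of references shared by papers a and b (coupling units)."""
    return len(references[a] & references[b])

def criterion_a_group(test_paper):
    """Papers sharing at least one reference with the test paper, with their strength."""
    return {
        p: coupling_units(test_paper, p)
        for p in references
        if p != test_paper and coupling_units(test_paper, p) >= 1
    }

def satisfies_criterion_b(group):
    """True if every pair of papers in the group shares at least one reference."""
    return all(coupling_units(a, b) >= 1 for a, b in combinations(group, 2))

group_a = criterion_a_group("P0")
closed = satisfies_criterion_b(["P0", "P1", "P2"])
```

In this toy case P1 and P2 are both coupled to the test paper P0 but not to each other, so together with P0 they form a criterion A group without fulfilling criterion B.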

Co-citation

Co-citation, a measure related to bibliographic coupling, was introduced in 1973 by
SMALL. This form of document coupling was defined as the frequency with which two
documents are cited together. The co-citation strength is then defined as the number of
identical citing items. Small also gives a more formal definition of co-citation:
If A is the set of papers which cites document a and B is the set which cites b,
then $A \cap B$ is the set which cites both a and b. The number of elements in
$A \cap B$, that is $n(A \cap B)$, is the co-citation frequency. The relative co-citation
frequency could be defined as $n(A \cap B) / n(A \cup B)$.
Small stated that unlike bibliographic coupling which links source documents, co-
citation links cited documents and is, therefore, analogous to a measure of descriptor or
word association. In measuring co-citation strength, one measures the degree of relationship or
association between papers as perceived by the population of citing authors.
Hence, to be strongly co-cited, two earlier works must be cited together by a large number of
authors. Drawing on the analogy regarding descriptor or word association, Small argued
that due to the dependence on authors, co-citation patterns can change over time, just as
vocabulary co-occurrences can change over time as subject fields evolve. Furthermore,
Small noted that bibliographic coupling is a fixed and permanent relationship because it
depends on references contained in coupled documents. That is, once two documents
are published, their coupling is established through their references, whereas the
co-citation strength between any two documents will vary over time. Another notable
difference between these two modes of coupling is that when two papers are frequently
co-cited, they are necessarily also frequently cited individually. This means that some
aspect of quality, prominence or visibility could be assigned to those documents that are

selected for the analysis of co-citation relations as some statistical stability, in terms of a
satisfactory number of co-citations, is necessary for a meaningful analysis. No such
aspect is necessarily associated with the analysis of bibliographic coupling relations.
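
The set-based definition of co-citation frequency given earlier in this section can be illustrated with a short Python sketch. The citing data and function names are hypothetical; the relative frequency is computed as the size of the intersection divided by the size of the union of the two sets of citing papers.

```python
# Hypothetical toy data: citing paper -> set of cited documents.
citing = {
    "p1": {"a", "b", "c"},
    "p2": {"a", "b"},
    "p3": {"a"},
    "p4": {"b", "d"},
}

def citers_of(doc):
    """Set of papers that cite the given document."""
    return {paper for paper, refs in citing.items() if doc in refs}

def cocitation_frequency(a, b):
    """n(A ∩ B): number of papers citing both a and b."""
    return len(citers_of(a) & citers_of(b))

def relative_cocitation_frequency(a, b):
    """n(A ∩ B) / n(A ∪ B), or 0.0 if neither document is cited."""
    union = citers_of(a) | citers_of(b)
    return len(citers_of(a) & citers_of(b)) / len(union) if union else 0.0

freq_ab = cocitation_frequency("a", "b")
rel_ab = relative_cocitation_frequency("a", "b")
```

Here documents a and b are cited together by two of the four papers that cite either of them, so the co-citation frequency is 2 and the relative frequency is 0.5.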

Cluster analysis

Cluster analysis involves techniques that produce classifications from data that are
initially unclassified. The problem that cluster analysis is designed to solve is
typically the following one: given a sample of n objects, each of which has a score on p
variables, devise a scheme for grouping the objects into classes so that similar ones
are in the same class. The method must be completely numerical and the number of
classes unknown (MANLY, 1994, p. 128). Cluster analysis methods have evolved from
many disciplines and the appropriateness of any cluster method is related to each
discipline through the research questions and types of data thought to be useful in
building a classification. Hence, in the process of selecting a suitable cluster method,
the applications and advantages of different cluster methods applied in the field of
bibliometrics, as reflected by its literature, have been considered. In the bibliometric
context, hierarchical agglomerative methods have often been applied. These
methods start with a matrix of distances or proximity values showing similarity or
dissimilarity. All objects (here papers) begin alone in groups of size one, and groups that
are close (similar) are merged. The different hierarchical agglomerative cluster methods
are distinguished by the way in which close is defined. In this study, a single
link routine implemented in the bibliometric software Bibexcel was applied. The
defining feature of this method is that the distance between groups is defined as that of
the closest pair of individuals, where only pairs consisting of one individual from each
group are considered (EVERITT et al., 2001, p. 57). Single link methods are easy to
program and to apply to large data sets, which is probably why they have been used in
bibliometric research when the task has been to break down large amounts of
bibliographic data (LEYDESDORFF, 1987). Also, the single link method seems to have been
successfully used by several researchers in the context of document co-citation analysis
(e.g. SMALL & GRIFFITH, 1974; SMALL & SWEENEY, 1985; BRAAM et al., 1991).
It should be noted that the application of the single link method for the clustering of
co-cited references and bibliographically coupled papers mostly leads to clusters that
could be considered incomplete sub-graphs. Hence the internal cluster structure should in
most cases be less coherent, in terms of the general level of connectedness, than that of
clusters or groups that could be considered complete sub-graphs, as would be the case
for clusters of bibliographically coupled papers that fulfil the conditions of KESSLER's
criterion B (1963). However, in this study there is no objective of fulfilling these
criteria, and bibliographic coupling is solely regarded as a method for associating similar
documents with one another, as is co-citation analysis.
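
As an illustration of why single link clusters tend to be incomplete sub-graphs, the following Python sketch groups items by merging every pair whose similarity reaches a fixed threshold, which is equivalent to cutting a single link hierarchy at that level. It is not the Bibexcel routine; the function name, the toy pairs and the threshold value are assumptions made for the example.

```python
from collections import defaultdict

def single_link_clusters(pairs, threshold):
    """Cluster items by linking every pair whose similarity >= threshold.

    pairs: iterable of (item_a, item_b, similarity).
    Items that never take part in a qualifying pair are left unclustered.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(set)
    for item in list(parent):
        clusters[find(item)].add(item)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))

# Hypothetical similarity pairs.
pairs = [("A", "B", 0.9), ("B", "C", 0.6), ("C", "D", 0.2), ("E", "F", 0.7)]
result = single_link_clusters(pairs, threshold=0.5)
```

In this toy case A and C end up in the same cluster although they are linked only through B, which is the chaining behaviour that distinguishes single link clusters from complete sub-graphs.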

Applications of methods

The starting point of the practical application of the method of bibliographic
coupling was to acquire a large number of relevant cited references and sources without
adding noise to the information. The first step was to exclude all references cited only
once, as these cannot contribute to a bibliographic coupling. This was done primarily
in order to speed up the computing. Next, all bibliographic couplings were computed.
As the significance or value of a bibliographic coupling unit between two papers could
be assumed to be inversely related to the combined lengths of the reference lists of both
papers, a function that normalizes for the length of reference lists is needed. We define
such a function as:
$$CS_{ij} = \frac{r_{ij}}{\sqrt{r_i \, r_j}} \tag{1}$$

where
$CS_{ij}$ = coupling strength between paper i and paper j
$r_{ij}$ = the number of references common to both i and j
$r_i$ = the number of references in the reference list of paper i
$r_j$ = the number of references in the reference list of paper j

The interval is [0, 1] and $r_i = r_j = r_{ij}$ gives the maximum value.
Finally, the threshold of normalized coupling strength was set to include all pairs of
bibliographically coupled papers within the third quartile when sorted by normalized
coupling strength. The list of pairs of bibliographically coupled papers was then applied
as input data to the cluster routine.
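
A minimal Python sketch of the normalized coupling strength follows. It assumes that the number of shared references is divided by the square root of the product of the two reference list lengths, an assumption consistent with the stated interval of [0, 1]; the function name and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def normalized_coupling(references):
    """Return {(i, j): r_ij / sqrt(r_i * r_j)} for every coupled pair of papers."""
    strengths = {}
    for i, j in combinations(sorted(references), 2):
        shared = len(references[i] & references[j])
        if shared:  # uncoupled pairs are not listed
            strengths[(i, j)] = shared / sqrt(len(references[i]) * len(references[j]))
    return strengths

# Hypothetical toy data: paper id -> set of cited references.
refs = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "b", "c", "d"},
    "p3": {"d"},
    "p4": {"x"},
}
cs = normalized_coupling(refs)
```

Papers p1 and p2 have identical reference lists and reach the maximum value, whereas p3 shares its single reference with a list of four and receives a lower strength.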
The first two steps in co-citation analysis are to decide on a citation threshold and a
co-citation threshold. The objective is primarily to avoid coincidental citations and co-
citations and to ensure that an adequate similarity structure is obtained. The following
design for noise reduction was applied:
1. First, all random citations were excluded by deleting all references cited
only once.
2. Next, all citation frequencies were converted to fractional counts, meaning
that each reference was assigned a weight corresponding to the length of
the reference list. This was deemed necessary as considerable variations
were found.
3. The first value preceding the third quartile in a list of the rank ordered sums
of fractional counts was then sought and used as a citation threshold.

The creation of clusters that could be considered complete sub-graphs is, e.g., feasible by the application of
the complete linkage cluster method, where the largest distance between a candidate object and any object of
the existing cluster is sought, meaning that any candidate must be within a certain level of similarity to all
members of that cluster.

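
The first two steps of the noise reduction design can be sketched in Python as follows. The sketch assumes that the weight given to a reference is the reciprocal of the length of the citing paper's reference list, which is a common reading of fractional counting but is not spelled out in the text; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def fractional_citation_counts(reference_lists):
    """Sum, per cited reference, a weight of 1/len(list) over every citing paper.

    References cited only once are dropped, as in step 1.
    Assumption: the weight uses the full length of each reference list.
    """
    raw = defaultdict(int)
    for refs in reference_lists.values():
        for ref in set(refs):
            raw[ref] += 1

    fractional = defaultdict(float)
    for refs in reference_lists.values():
        unique = set(refs)
        for ref in unique:
            if raw[ref] > 1:
                fractional[ref] += 1.0 / len(unique)
    return dict(fractional)

# Hypothetical toy data: citing paper -> reference list.
lists = {"p1": ["a", "b"], "p2": ["a", "b", "c", "d"], "p3": ["a"]}
counts = fractional_citation_counts(lists)
```

In the toy data, reference a is cited from lists of length two, four and one, so its fractional count is 0.5 + 0.25 + 1 = 1.75, while c and d are cited only once and are excluded.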
In the next stage the co-citation strength for all pairs of references was calculated.
As the significance or value of a co-citation could be assumed to be inversely related to
the citation frequency of a cited paper, a function that normalizes for the citation
frequencies is needed. We define such a function as:
$$CS_{i,j} = \frac{coc_{i,j}}{\sqrt{cit_i \, cit_j}} \tag{2}$$

where
$CS_{i,j}$ = co-citation strength between document i and j
$coc_{i,j}$ = the number of co-citations between document i and j
$cit_i$ = the number of citations for document i
$cit_j$ = the number of citations for document j

The interval is [0, 1] and $cit_i = cit_j = coc_{i,j}$ gives the maximum value.
Finally, the threshold of normalized co-citation strength was set to include all pairs
of co-cited papers within the third quartile when sorted by normalized co-citation
strength. The list of pairs of co-cited papers was then applied as input data to the cluster
routine.
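
The normalized co-citation strength can be sketched in the same way. The sketch assumes that the number of co-citations is divided by the square root of the product of the two citation frequencies, consistent with the stated interval of [0, 1]; the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def normalized_cocitation(reference_lists):
    """Return {(i, j): coc_ij / sqrt(cit_i * cit_j)} for every co-cited pair."""
    citations = Counter()
    cocitations = Counter()
    for refs in reference_lists.values():
        unique = sorted(set(refs))  # sorted so each pair has one canonical order
        citations.update(unique)
        cocitations.update(combinations(unique, 2))
    return {
        (i, j): coc / sqrt(citations[i] * citations[j])
        for (i, j), coc in cocitations.items()
    }

# Hypothetical toy data: citing paper -> reference list.
lists = {
    "p1": ["a", "b", "x", "y"],
    "p2": ["a", "b", "x", "y"],
    "p3": ["a"],
    "p4": ["a"],
}
strengths = normalized_cocitation(lists)
```

Documents x and y are always cited together and reach the maximum value, whereas the pair a and b is weakened by the two papers that cite a alone.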

Word-profile analysis

Before any comparisons of the intellectual content of clusters could be made, the
question of how to operationalize the concept of subject-similarity between clusters had
to be dealt with. In FRANKLIN & JOHNSTON (1988, p. 329) some examples of which
elements in bibliographic descriptions to use when indexing a research area are given:
author names, organizations and significant words and phrases from titles. To this list
one could also add references, journal names, index terms and classification codes.
Such elements describe the content of the single document and, when aggregated, the
content of a specific database or a discipline. These are by nature appropriate elements
to use when building a science field map (NOYONS, 1999, p. 18), and are appropriate
when classifying a field of science into disciplines and sub-disciplines (specialties). One
of the reasons for using title words is that there is no drop-out, as opposed to keywords,
indexing terms et cetera, as each bibliographic description contains a complete title.

Furthermore, title words have a high topicality and usually aim at a precise description
of the subject content of the article. The disadvantage of using title words is that they
have to be standardized when counted, and also carefully selected, as not all
words are meaningful in the sense of describing a topic when treated as single words.
Thus, content bearing title words from the citing articles, assembled to word profiles for
each cluster, were assumed to describe the subject content and subject relatedness
between clusters. By creating a word profile for each selected cluster based on selected
title words from clustered journal articles, the subject similarity between clusters could
be established. In the first stage all obviously insignificant words such as prepositions,
pronouns etc., were excluded through a stop-wordlist and some verbs were also
excluded when they were considered of no direct relevance to the content of the articles.
In order to aggregate words with a similar meaning without including words with a
different meaning, a frequency list with all the remaining title words was manually
checked. Words with a similar meaning were then standardized to one form (PETERS et
al., 1995) (Table 1). The problem of homonyms and synonyms, and the effect of
different approaches to the unification of words into one form, made it important to
carefully corroborate this unification with full texts and bibliographic descriptions;
hence any automated method, such as the application of a stemming algorithm, was
considered insufficient.

Table 1. Example of the title word standardization

Unedited title words    Standardized word form
Biodegradability        Biodegrad
Biodegradable           Biodegrad
Biodegradation          Biodegrad
Biodegrade              Biodegrad

As the sizes of clusters as well as the length of titles affect the likelihood of common
title words, a function that normalizes for the number of words contained in each cluster is
needed. We define such a function as:

$$Sim_{i,j} = \frac{t_{i,j}}{\sqrt{t_i \, t_j}} \tag{3}$$

where
$Sim_{i,j}$ = the similarity between cluster i and cluster j
$t_{i,j}$ = the number of unique title words common to both cluster i and cluster j
$t_i$ = the number of unique title words in cluster i
$t_j$ = the number of unique title words in cluster j

The interval is [0, 1] and $t_i = t_j = t_{i,j}$ gives the maximum value.
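
The construction and comparison of word profiles can be sketched in Python as follows. The stop-word list, the standardization table (following the pattern of Table 1) and the example titles are hypothetical, and the similarity is assumed to be the number of shared unique words divided by the square root of the product of the two profile sizes.

```python
from math import sqrt

STOP_WORDS = {"of", "the", "in", "and", "a", "on", "by"}  # hypothetical stop-word list
STANDARD_FORMS = {  # manually compiled standard forms, as in Table 1
    "biodegradability": "biodegrad",
    "biodegradable": "biodegrad",
    "biodegradation": "biodegrad",
    "biodegrade": "biodegrad",
}

def word_profile(titles):
    """Set of standardized, content-bearing title words for one cluster."""
    words = set()
    for title in titles:
        for token in title.lower().split():
            token = token.strip(".,:;()")
            if token and token not in STOP_WORDS:
                words.add(STANDARD_FORMS.get(token, token))
    return words

def profile_similarity(profile_i, profile_j):
    """t_ij / sqrt(t_i * t_j); 0.0 if either profile is empty."""
    if not profile_i or not profile_j:
        return 0.0
    return len(profile_i & profile_j) / sqrt(len(profile_i) * len(profile_j))

cluster_i = word_profile(["Biodegradation of phenol in soil", "Biodegradable polymers"])
cluster_j = word_profile(["Biodegradability of polymers in sediment"])
similarity = profile_similarity(cluster_i, cluster_j)
```

The manual mapping stands in for the hand-checked standardization described above; it is used here instead of a stemming algorithm because the text considers automated stemming insufficient.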


Data

The compilation of a population of journal articles aimed at the creation of a set of
articles representative of the field selected for the test of methods, not too restricted in
terms of number of units or in time, in order to possibly increase the generality of
findings. Hence, the process of sampling had its point of departure in a large set of
papers encompassing a decade's publications. From the Science Citation Index on CD-ROM,
a total of 73,379 articles from the fifty most cited environmental science journals
listed in Journal Citation Report (JCR), science edition, 2000 were downloaded in a
predefined format and gathered in one file. A small proportion of articles did not
contain references and were therefore excluded. N of the population of selected
environmental journal articles was then 72,372. In order to achieve a manageable set of
articles, the original set had to be reduced. This was achieved by partitioning the
initial file into ten subsets, each containing the total number of articles for a certain
publication year (from 1991 to 2000 inclusive). Then, from each subset,
approximately ten percent of the articles were randomly selected.
Finally all sets were gathered in one file containing a total of 7,239 articles with their
references. In order to assess to what degree the random sample reflected the
distribution of articles over journal titles in the population, an association measure was
calculated from the chi-squared distance between one vector from the population and one
from the sample, having the same dimensionality. This measure, suggested
in AHLGREN et al. (2003) as an association measure in the context of author co-citation
analysis, measures the distance between two author rows $A_1$ and $A_2$ in a co-citation
matrix ($CC_{ij}$) and is given by

$$d^2(A_1, A_2) = \sum_{k=1}^{N} \left( \frac{CC_{1k}}{CC_1} - \frac{CC_{2k}}{CC_2} \right)^2$$

where $CC_j = \sum_{i=1}^{N} CC_{ji}$ with j = 1, 2

The squared distance between the two vectors was 0.000341246, a seemingly low
value. An association measure is then achieved by subtracting the squared distance from
the maximum value for any N: $2 - d^2(A_1, A_2)$. Hence, the association value between
these vectors is 1.999658754. Thus, we can conclude that the random set of articles
reflects the original set of articles in terms of the relative distribution of articles over
journal titles to a great extent.

The vectors of the same dimensionality referred to in the text are the journal frequencies derived from (1) the
original set of articles and the journal frequencies derived from (2) the segmented set of articles. For both
vectors the dimensions are 50, corresponding to the number of unique journal titles.
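
The association measure can be sketched in Python as follows, assuming that the squared distance is the sum of squared differences between the two vectors after each has been divided by its own total. The journal frequencies below are hypothetical and do not reproduce the data of this study.

```python
def squared_profile_distance(row1, row2):
    """Sum over k of (row1[k]/sum(row1) - row2[k]/sum(row2)) ** 2."""
    total1, total2 = sum(row1), sum(row2)
    return sum((x / total1 - y / total2) ** 2 for x, y in zip(row1, row2))

def association(row1, row2):
    """2 minus the squared distance; 2 is the maximum squared distance."""
    return 2.0 - squared_profile_distance(row1, row2)

# Hypothetical journal frequencies for four journals.
population = [400, 300, 200, 100]
proportional_sample = [40, 30, 20, 10]

same_shape = association(population, proportional_sample)
disjoint = association([1, 0], [0, 1])
```

An exactly proportional sample gives the maximum association of 2, and two vectors with no journal in common give 0, so a value such as 1.999658754 lies very close to the proportional case.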


All further analysis was based on this segmented set of publications. An important
aspect in all bibliometric analyses of this type is whether the field of science under
investigation could be reflected solely by journal type sources or not. This should be a
prerequisite, as monographs are not indexed in the ISI citation databases as source
documents, though they appear as cited references. In this case, the approximate
proportion of document types, non-journals and journals, was 49,000/194,000 for non-
journals and 135,000/194,000 for journals. There was a drop-out of nearly 10,000
references due to clerical variations not handled by the software applied for these
calculations. Though less exact, this still gives a clear indication of this field's dependence
on the journal as a channel for formal scientific communication.
In general, the objective of quantitative data analysis is to provide a sparse sketch of
data, reduced to a few parameters, aiming at the description of the underlying structure,
and in most cases bibliometric data present highly skewed distributions. Here the
evenness of citations distributed over references in the initial sample is tested by measuring
the match between a power law distribution and the observed frequency data.
Pareto's law was used to estimate the evenness of the distribution of
citations over references. Using the program Lotka, 1.02 (ROUSSEAU & ROUSSEAU,
2000), the most suitable parameters for a distribution of the form $f(y) = C / y^{\beta}$ are
determined, where f(y) denotes the relative number of sources (references) with
production (citation frequency) y, and C and $\beta$ are parameters depending on the subject
field being analyzed. These parameters are not independent but related through the
requirement that

$$\sum_{k=1}^{\infty} \frac{C}{k^{\beta}} = 1$$

The higher the beta-value, the higher the concentration.
It was found that the C-value was 0.7468 and the beta-value 2.5065. Here about 75
percent of all references are cited only once and only a few references are highly cited.
This indicates a large portion of noise in the sample of references and a small
proportion of central works. In fact, only 1.96 percent of all references are cited ten
times or more during the period. In order to be able to make a valid comparison
between environmental sciences and another field of science, a subset of the
initial environmental file was created in order to match a set of articles assigned the
journal subject category Applied Chemistry in the JCR science edition. Thus a set
containing 29,477 environmental articles, published between 1991 and 2000 inclusive
in the first eight journals ranked by total citations in the 2001 science edition of JCR,
was established and compared with a set containing 26,449 articles from
the field of Applied Chemistry, published between 1991 and 2000 inclusive in the first ten
journals ranked by total citations in the 2001 science edition of JCR. The subset
of the environmental file was given a C-value of 0.7557 and a beta-value of 2.5488.
This could then be compared with the values of the set of articles from the field of
Applied Chemistry, which had a C-value of 0.746 and a beta-value of 2.5029.
Clearly, the evenness or concentration is pretty much the same for both science fields,
suggesting that the field of environmental sciences does not deviate to any greater
extent from other science fields in this aspect.
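
The dependence between the two parameters can be checked numerically. Since the relative frequencies must sum to one, C equals the reciprocal of the sum of $k^{-\beta}$ over all k. The Python sketch below approximates that sum with a finite number of terms plus an integral estimate of the remainder; it is a plain check of the constraint, not the fitting procedure of the Lotka program, and the function name and number of terms are assumptions.

```python
def normalizing_constant(beta, terms=100_000):
    """Approximate C = 1 / sum_{k>=1} k**(-beta) for beta > 1."""
    partial = sum(k ** -beta for k in range(1, terms + 1))
    # Integral estimate of the remaining terms beyond `terms`.
    tail = (terms + 0.5) ** (1.0 - beta) / (beta - 1.0)
    return 1.0 / (partial + tail)

c_environment = normalizing_constant(2.5065)
c_subset = normalizing_constant(2.5488)
```

For a distribution of this form, f(1) = C, so the C-value can also be read as the share of references cited exactly once, in line with the figure of about 75 percent given above.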

Partitions and distributions

Preparing data for the computing of couplings, all random citations were first
excluded by deleting all references cited only once, which reduced 160,706 unique
references to 19,138; the total number of references (N) was then 52,677. The clustering
of bibliographic couplings was based on a total of 7,239 source articles. The resulting
output of this operation was 22,621 coupled pairs of source articles. Setting the threshold of
normalized coupling strength to include all pairs within the third quartile, a total of
5,655 pairs were included in the further analysis and a total of 3,898 source articles
were bibliographically coupled. Excluding all clusters containing fewer than three source
articles resulted in an additional loss of 624 source articles, and a total of 3,274 source
articles were clustered and distributed over 430 clusters. Finally, only clusters with at
least ten documents were included, which resulted in a final set of 88 clusters (Table 2).
The total number of source articles belonging to the set of selected clusters was 1,691.
Both the arithmetic mean and the median publication year for these articles were 1996.
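
The pair counts reported above are consistent with retaining the quarter of coupled pairs that have the highest normalized coupling strength (5,655 of 22,621). Under that reading, which is an assumption, the selection can be sketched in Python as follows; the function name is hypothetical and ties at the cut-off are not treated specially.

```python
def top_quartile_pairs(pairs):
    """Keep the quarter of (pair, strength) entries with the highest strength."""
    ranked = sorted(pairs, key=lambda item: item[1], reverse=True)
    return ranked[: len(ranked) // 4]

# Hypothetical input with as many pairs as reported in the text.
toy_pairs = [(i, i / 22621) for i in range(22621)]
selected = top_quartile_pairs(toy_pairs)
```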

Table 2. The distribution of clusters of bibliographically coupled papers over cluster sizes

Cluster size    Frequency
70-77           1
60-69           3
50-59           0
40-49           1
30-39           5
20-29           17
10-19           61
0-9             342

Using the first value preceding the third quartile in a list of rank ordered sums of
fractional counts as a citation threshold reduced 14,353 unique references to a
remaining 4,785. The normalised value of co-citation strength preceding q3 was used as
the co-citation threshold, and a total of 6,575 pairs were finally selected for clustering.
The resulting cluster structure encompassed a total of 687 clusters of varying sizes
(Table 3). In all, 4,173 references were clustered. Finally, a cluster size threshold of 9
was set, which resulted in a final cluster set of 96 co-citation clusters, containing 1,138
cited references.


Table 3. The distribution of co-citation clusters over cluster sizes

Cluster size    Frequency
20              4
19              2
18              1
17              4
16              5
15              7
14              6
13              9
12              12
11              24
10              22
9               30
8               38
7               64
6               85
5               97
4               107
3               170

Next, the number of papers citing references in the 96 co-citation clusters was 2,411
of which 2,094 were unique. Both the arithmetic mean and the median publication year
for these papers were 1996. Half of all groups of citing publications are within the
interval of 20-39 publications inclusive (Table 4). Comparing the sizes of groups of
papers citing the selected 96 co-citation clusters with the size distribution of
bibliographic coupling clusters, we can see that most bibliographic coupling
clusters fall within a lower size interval, given the cluster size threshold of nine
papers (Table 2 and Table 4). It is also of interest to estimate to what degree source
articles cite references in more than one cluster. It was found that a vast majority cited
the references of a certain cluster exclusively (Table 5). Such central source
publications would probably better reflect the specific character of the subject content
of clusters, as opposed to peripheral sources citing several different clusters (BRAAM et
al., 1991). Computing the intersection between the set of papers citing co-citation
clusters and the set of papers in clusters of bibliographically coupled papers, 612
articles distributed over 167 clusters were found. Of the 1,691 papers belonging to
clusters of bibliographically coupled papers 36 percent were also citing at least one of
the selected co-citation clusters. Of 2,094 articles citing at least one of the selected co-
citation clusters, 29 percent could be found in clusters of bibliographically coupled
papers.
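
The two overlap percentages follow from the counts stated in this paragraph. A small Python sketch of the arithmetic (variable names are illustrative only):

```python
# Counts reported in the text.
overlap = 612          # papers found in both sets
coupled_papers = 1691  # papers in clusters of bibliographically coupled papers
citing_papers = 2094   # unique papers citing the selected co-citation clusters

share_of_coupled = round(100 * overlap / coupled_papers)
share_of_citing = round(100 * overlap / citing_papers)

print(share_of_coupled)  # 36
print(share_of_citing)   # 29
```

The same 612 papers thus make up a larger share of the smaller set, which is why the two percentages differ.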

The papers that were part of at least one group of papers that cite references in at least one co-citation cluster
and part of one cluster with bibliographically coupled papers.


Table 4. The distribution of groups of papers citing co-citation clusters over group sizes

Group size    Number of groups
60-68         1
50-59         3
40-49         5
30-39         12
20-29         48
10-19         26
0-9           1

Table 5. The distribution of papers over specific numbers of cited co-citation clusters

Number of papers    Percent of papers    Number of clusters cited
1808                86                   1
257                 12                   2
28                  1                    3
1                   0                    4
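
The percentage column of Table 5 can be recomputed from the paper counts. This is a short Python sketch with illustrative names, assuming the percentages are rounded to whole numbers:

```python
# Number of co-citation clusters cited -> number of papers, from Table 5.
table5 = {1: 1808, 2: 257, 3: 28, 4: 1}

unique_papers = sum(table5.values())
shares = {k: round(100 * n / unique_papers) for k, n in table5.items()}

print(unique_papers)  # 2094
print(shares)         # {1: 86, 2: 12, 3: 1, 4: 0}
```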

Result

When the word profile similarities between the 88 clusters of bibliographically
coupled papers and the 96 groups of papers that cite a particular co-citation cluster were
computed, a total of 8,420 links were found. The difference between the maximum number of
links and the total number of links computed was 28. The distribution of normalized word
profile values was found to be skewed, with a small proportion of cluster-group pairs
with a strong similarity (Figure 1). Approximately 80 percent of all normalized word
profile values within the range of 0.20-0.67 inclusive are contained within the upper
four percentiles.
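
The link counts can be verified with simple arithmetic. In this Python sketch the variable names are illustrative:

```python
n_coupling_clusters = 88  # clusters of bibliographically coupled papers
n_citing_groups = 96      # groups of papers citing a co-citation cluster
links_found = 8420

max_links = n_coupling_clusters * n_citing_groups
missing_links = max_links - links_found

print(max_links)      # 8448
print(missing_links)  # 28
```

On the reading that the maximum includes zero similarities, the 28 missing links would be cluster-group pairs without any word profile similarity; this interpretation is an assumption.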

The maximum number of links is obtained by multiplying the number of co-citation clusters with the
number of bibliographic coupling clusters, including zero similarities.

Some assumptions can be made. If the methods were relatively equal, the result of
applying one method would generally be mirrored by the result of applying the other.
Each group of citing papers produced by one method should then be mirrored by a cluster
or group of papers produced by the other method, and in some cases by more than one
cluster or group, when different clusters or groups of papers with similar subject content
reflect the same specialty. Here, two cases resulting in two such sets were examined. In
the first case, each cluster of bibliographically coupled papers was matched with each
group of papers that cite a particular co-citation cluster and was paired with the group to
which it had its highest value of word profile similarity. In the second case, each group
of papers citing a particular co-citation cluster was matched with each cluster of
bibliographically coupled papers and was paired with the cluster to which it had its
highest value of word profile similarity (Table 6). In both cases, a pair had to have a
word profile similarity greater than a stipulated value in order to be accepted.
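
The two matching cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the study: the similarity matrix is hypothetical, the function names `best_matches` and `transpose` are invented for the example, ties are resolved in favour of the first maximum, and the stipulated value is treated as an inclusive lower bound.

```python
def best_matches(similarity, threshold):
    """For each row, return (row, best column, value) if the best value meets the threshold."""
    pairs = []
    for i, row in enumerate(similarity):
        j = max(range(len(row)), key=lambda col: row[col])  # first maximum wins ties
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


def transpose(matrix):
    """Swap rows and columns so that groups become rows."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical similarities: rows = coupling clusters, columns = citing groups.
sim = [
    [0.62, 0.10, 0.05],
    [0.35, 0.33, 0.02],
    [0.08, 0.12, 0.21],
]

# First case: each coupling cluster picks its most similar citing group.
first_case = best_matches(sim, 0.30)
# Second case: each citing group picks its most similar coupling cluster;
# results are reordered to (cluster, group, value) for comparison.
second_case = [(i, j, v) for j, i, v in best_matches(transpose(sim), 0.30)]

print(first_case)   # [(0, 0, 0.62), (1, 0, 0.35)]
print(second_case)  # [(0, 0, 0.62), (1, 1, 0.33)]
```

In this toy example cluster 1 is paired with group 0 in the first case, while group 1 is paired with cluster 1 in the second, which shows why the two cases need not yield the same set of pairs.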

Figure 1. The word profile similarity between 88 clusters of bibliographically coupled papers and 96 groups
of papers citing a particular co-citation cluster. The curve shows the distribution of normalized word profile
values at percentiles of the total number of pairs and is based on a total of 8,420 cluster-group pairs

As a rule of thumb, it was decided that clusters were similar enough if their word profile
similarity fell within the interval from 0.30 to the maximum score (0.67), inclusive.
Applying this criterion, in the first case 36 clusters of bibliographically coupled papers
were reflected by at least one group of papers citing a particular co-citation cluster.
This meant that 41 percent of the clusters of bibliographically coupled papers were
mirrored by groups of papers citing a particular co-citation cluster. The proportion of
these clusters that shared their paired group with at least one other cluster was
53 percent. As for the second case, 32 groups of papers citing a particular co-citation
cluster formed pairs with clusters of bibliographically coupled papers. This meant that
33 percent of the groups of papers citing a particular co-citation cluster were mirrored
by clusters of bibliographically coupled papers. The proportion of these groups that
shared their paired cluster with at least one other group was 28 percent.


Table 6. The distribution of word profile similarity values over cluster-group pairs. Two cases: the first where
each cluster of bibliographically coupled papers was matched with each group of papers that cite a particular
co-citation cluster and was paired with the group to which it had its highest value of word profile similarity
and the second where each group of papers citing a particular co-citation cluster was matched with each
cluster of bibliographically coupled papers and was paired with the cluster to which it had its highest value of
word profile similarity. A columns hold the index numbers of clusters of bibliographically coupled
papers, B columns hold the index numbers of groups of papers that cite a particular co-citation cluster
and C columns hold the word profile similarity values of the pairs

     First case             Second case
  A     B     C          A     B     C
 87   110   0.67        87   110   0.67
  2   158   0.60         2   158   0.60
  4    90   0.58         4    90   0.58
 29   126   0.58        29   126   0.58
 56   143   0.56        56   143   0.56
 78   155   0.55        78   155   0.55
 68   104   0.52        68   104   0.52
 61   127   0.52        61   127   0.52
 33   142   0.50        33   142   0.50
 48   100   0.49        48   100   0.49
 54   112   0.47        54   112   0.47
 14   157   0.47        14   157   0.47
 25   176   0.45        25   176   0.45
 83   111   0.44        83   111   0.44
 75   140   0.43        48   165   0.44
 41    92   0.43        75   140   0.43
 22   180   0.42        41    92   0.43
 47   111   0.42        22   180   0.42
 52   109   0.40        41   135   0.40
  6   128   0.37        52   109   0.40
 27   114   0.36         6   128   0.37
 44   180   0.36        27   114   0.36
 55   166   0.35        55   166   0.35
 46   156   0.35        46   156   0.35
 88   127   0.35        16   113   0.33
 16   113   0.33        47    93   0.32
 72    96   0.32        72    96   0.32
 62   112   0.32        41   169   0.32
 84   126   0.32         2   152   0.31
 57   157   0.32        84   103   0.30
  1   126   0.31        74    89   0.30
 30   126   0.31        83   107   0.30
 80   157   0.30         .     .     .
 58   176   0.30         .     .     .
 74    89   0.30         .     .     .
 20   158   0.30         .     .     .
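
The proportions reported for the two cases can be recomputed from the A and B columns of Table 6. The Python sketch below is illustrative only; the helper name `shared_share` is an assumption, and "paired with a common group (or cluster)" is read as sharing one's partner with at least one other pair.

```python
from collections import Counter

# (coupling cluster A, citing group B) pairs transcribed from Table 6.
first_case = [
    (87, 110), (2, 158), (4, 90), (29, 126), (56, 143), (78, 155),
    (68, 104), (61, 127), (33, 142), (48, 100), (54, 112), (14, 157),
    (25, 176), (83, 111), (75, 140), (41, 92), (22, 180), (47, 111),
    (52, 109), (6, 128), (27, 114), (44, 180), (55, 166), (46, 156),
    (88, 127), (16, 113), (72, 96), (62, 112), (84, 126), (57, 157),
    (1, 126), (30, 126), (80, 157), (58, 176), (74, 89), (20, 158),
]
second_case = [
    (87, 110), (2, 158), (4, 90), (29, 126), (56, 143), (78, 155),
    (68, 104), (61, 127), (33, 142), (48, 100), (54, 112), (14, 157),
    (25, 176), (83, 111), (48, 165), (75, 140), (41, 92), (22, 180),
    (41, 135), (52, 109), (6, 128), (27, 114), (55, 166), (46, 156),
    (16, 113), (47, 93), (72, 96), (41, 169), (2, 152), (84, 103),
    (74, 89), (83, 107),
]


def shared_share(pairs, position):
    """Percent of pairs whose partner at `position` also appears in another pair."""
    counts = Counter(pair[position] for pair in pairs)
    shared = sum(1 for pair in pairs if counts[pair[position]] > 1)
    return round(100 * shared / len(pairs))


# First case: share of the 88 coupling clusters that were paired, and share
# of those pairs whose citing group (column B) is used by more than one cluster.
first_mirrored = round(100 * len(first_case) / 88)
first_shared = shared_share(first_case, 1)

# Second case: share of the 96 citing groups that were paired, and share
# of those pairs whose coupling cluster (column A) is used by more than one group.
second_mirrored = round(100 * len(second_case) / 96)
second_shared = shared_share(second_case, 0)

print(len(first_case), first_mirrored, first_shared)     # 36 41 53
print(len(second_case), second_mirrored, second_shared)  # 32 33 28
```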


Considering the mathematical properties of the function for the normalization of
word profile similarity, a minimal word profile similarity of 0.50 might be a more
reasonable standard for high cluster similarity, the condition to fulfill being the
mirroring of the subject content of another cluster or group. When the required word
profile similarity was raised to 0.50, only nine pairs were formed, and on this
level all pairs were disjoint. An overview of the number of pairs and clusters/groups at
different threshold levels of normalized word profile similarity is given in Table 7.

Table 7. The distribution of pairs and clusters/groups over intervals of normalized word profile similarity,
where a pair is made up of one cluster of bibliographically coupled papers and one group of papers citing a
particular co-citation cluster. Only the lower bound of the interval is noted in the table and the upper bound is
implicitly the highest value of the distribution (0.67). Note that groups and clusters in pairs are not pairwise
disjoint below the threshold of 0.49

Threshold    Number of pairs    Number of clusters/groups
0.59         2                  4
0.49         9                  18
0.39         21                 39
0.29         54                 68
0.19         407                141
0.09         4043               183
0.00         8420               184
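
The three highest threshold levels of Table 7 can be reproduced from the pairs listed in Table 6. The Python sketch below rests on two assumptions that are not stated in the source: that the lower bounds are exclusive (so that 0.49 corresponds to a similarity of at least 0.50), and that the distinct pairs of Table 6 with a similarity of 0.40 or more are all the pairs that exist at those levels. Similarities are stored as integer hundredths to avoid floating point comparisons, and the function name `summarize` is illustrative.

```python
# Distinct (A, B, similarity in hundredths) pairs with similarity >= 0.40,
# collected from both cases of Table 6.
pairs = [
    (87, 110, 67), (2, 158, 60), (4, 90, 58), (29, 126, 58), (56, 143, 56),
    (78, 155, 55), (68, 104, 52), (61, 127, 52), (33, 142, 50), (48, 100, 49),
    (54, 112, 47), (14, 157, 47), (25, 176, 45), (83, 111, 44), (48, 165, 44),
    (75, 140, 43), (41, 92, 43), (22, 180, 42), (47, 111, 42), (52, 109, 40),
    (41, 135, 40),
]


def summarize(pairs, lower_bound):
    """Return (number of pairs, number of clusters/groups, pairwise disjoint?)."""
    kept = [(a, b) for a, b, s in pairs if s > lower_bound]  # exclusive bound (assumption)
    clusters = {a for a, _ in kept}
    groups = {b for _, b in kept}
    disjoint = len(clusters) == len(kept) and len(groups) == len(kept)
    return len(kept), len(clusters) + len(groups), disjoint


for bound in (59, 49, 39):
    print(bound / 100, summarize(pairs, bound))
# 0.59 (2, 4, True)
# 0.49 (9, 18, True)
# 0.39 (21, 39, False)
```

Under these assumptions the counts agree with the first three rows of Table 7, and pairs stop being pairwise disjoint once the threshold drops below 0.49, as the table caption notes.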

Summary and discussion

It was assumed that the method of data collection and the partition of data into a
set of source articles that was manageable from a statistical point of view resulted in a
final set of papers representative of the original population of papers. However, the further
partitions of data related to distributions of citation frequency, co-citation strength,
bibliographic coupling strength and cluster size, all being variables in the process of
partitioning large data sets in this context, are accomplished in a more or less heuristic
way, as there usually is no precise definition of the variable values that would lead to
the most valid threshold and partition. A good example of this is the setting of a citation
frequency threshold in co-citation analysis, where the question is which threshold of
citation frequency would imply that the signal is clearly discernible. Another
example is the setting of a minimum or maximum cluster size. If we assume that the
goal of clustering is the arrival at some kind of meaningful classification by
summarising data in a small number of groups of objects, a confused pattern of
numerous small clusters would not contribute to such a goal. Usually maximum size has
been used as a limiting parameter (e.g. SMALL & SWEENEY, 1985), but here a minimum
size of ten objects was applied. The approach chosen in this study regarding threshold
settings was to relate thresholds, if possible, to the features of the original distributions.
It should also be recognized that the choice of cluster method might affect results
considerably.

Cf. Equation (3). The normalized word profile similarity takes values in the interval [0, 1], and
t_i = t_j = t_ij gives the maximum value.

The chosen approach to partition and threshold settings resulted in two sets
of citing papers, which in turn were partitioned into sub-sets (groups or clusters). This
final partitioning resulted in approximately the same number of sub-sets (groups or
clusters) for both sets and in the same average publication year. An important finding
was that 86 percent of all papers citing co-citation clusters cited a particular cluster
exclusively. Hence, in general the co-citation cluster structure seemed to distinguish
research themes from each other as reflected by citing papers. When we look at the
intersection between the set of papers citing co-citation clusters and the set of papers in
clusters of bibliographically coupled papers, 36 percent of the set of papers in clusters
of bibliographically coupled papers overlap with the set of papers citing co-citation
clusters, and inversely, 29 percent of the papers citing co-citation clusters could also be
found in the set of papers in clusters of bibliographically coupled papers. However, the
low degree of overlap does not in itself imply that we deal with two separate research
front landscapes, as two papers need not differ in subject content merely because they
are not identical.
At the next stage of the empirical process, when clusters of bibliographically
coupled papers were compared with groups of papers citing co-citation clusters as to
subject similarity, only a small proportion of all cluster-group pairs had a strong word
profile similarity in relation to the whole distribution of word profile similarity
values, and less than four percent were in the interval of 0.20-0.67 (highest value). On
a more detailed level, within the interval of 0.30-0.67, only 41 percent of all clusters of
bibliographically coupled papers could be mirrored by a group of papers citing a
particular co-citation cluster and only 33 percent of the groups of papers citing a
particular co-citation cluster could be mirrored by a cluster of bibliographically coupled
papers. When raising the threshold of word profile similarity to include only pairs with
a similarity of at least 0.50, the number of clusters/groups was reduced to 18, forming
nine pairs. Reaching a conclusion as to whether the application of the compared
methods of research front mapping would in general lead to similar results is
to some extent obstructed by the obvious problem of defining at which level two objects
could be considered similar. Even though the results of this study suggest that we in
fact deal with two different research front landscapes, the nature of the differences between
these landscapes must be described on a detailed level before any final conclusion can
be drawn. Moreover, values of several variables could vary, which may obstruct the
interpretation of results. Though this probably is a general problem for mapping
studies, the situation is complicated by the fact that we deal with two different types of
data, co-citations and bibliographic couplings. A reviewer of this paper commented on
this issue, suggesting that applications of methods that might lead to more controlled
experiments should be elaborated. Such an application could be to use a fixed set of
cited references for both methods. For instance, we could use the set of cited references
that link bibliographically coupled papers (with a citation frequency threshold of one) or
use a set of cited references delimited by some citation threshold appropriate for co-
citation analysis. In both cases we would avoid any other thresholds. However, such
approaches are probably more suited to smaller scale studies (smaller research fields)
where a certain amount of noise could be tolerated. It should also be noted that a fixed
set may imply deviations from optimal method applications in terms of loss of
significant associations between citing papers or the adding of insignificant associations
between cited references.
In conclusion, the results of this study support a further comparative study of
these methods on a detailed level and on more qualitative grounds.

To illustrate this problem, just consider a few important variables: (1) method of clustering, (2) measure of
similarity and (3) threshold of coupling strength, and let us allow for three values for each variable (which one
in a sense could consider reasonable). Then we would end up with 27 different mappings.

References

AHLGREN, P., JARNEVING, B., ROUSSEAU, R. (2003), Requirements for a co-citation similarity measure, with
special reference to Pearson's correlation coefficient. Journal of the American Society for Information
Science & Technology, 54 (6) : 550-560.
BRAAM, R. R., MOED, H. F., VAN RAAN, A. F. J. (1988), Mapping of Science: Critical Elaboration and New
Approaches: A Case Study in Agricultural Biochemistry. In: Informetrics 87/88. EGGHE, L., ROUSSEAU,
R. (Eds), Amsterdam: Elsevier Science Publishers. Also published in a more extended version:
BRAAM, R. R., MOED, H. F., VAN RAAN, A. F. J. (1987), Mapping of Science: Critical Elaboration and
New Approaches: A Case Study in Agricultural Biochemistry. Research report to the Netherlands
Advisory Council for Science Policy (RAWB), Leiden.
BRAAM, R. R., MOED, H. F., VAN RAAN, A. F. J. (1991), Mapping science by combined co-citation and word
analysis 1: structural aspects. Journal of the American Society for Information Science, 42 (4) : 233-251.
EVERITT, B. S., LANDAU, S., LEESE, M. (2001), Cluster Analysis. Fourth edition. London: Arnold.
FRANKLIN, J. J., JOHNSTON, R. (1988), Co-citation bibliometric modelling for S&T and R&D management.
In: Handbook of Quantitative Studies of Science and Technology. A. F. J. VAN RAAN (Ed.), Amsterdam:
North Holland.
GARFIELD, E., MORTON, V., MALIN, V. (1975), A system for automatic classification of scientific literature.
In: Essays of an Information Scientist. Vol. 2, pp. 356-365, 1974-76. Reprinted from Journal of the
Indian Institute of Science, 57 (2) : 61-74.
GARFIELD, E., MALIN, M. V., SMALL, H. (1978), Citation data as science indicators. In: Towards a Metric of
Science. The Advent of Science Indicators. Y. ELKANA, J. LEDERBERG, R. MERTON, A. THACKRAY,
H. ZUCKERMAN (Eds), New York: John Wiley & Sons.
JARNEVING, B. (2001), The cognitive structure of current cardiovascular research. Scientometrics, 50
(3) : 365-389.

KESSLER, M. M. (1960), An Experimental Communication Center for Scientific and Technical Information.
Massachusetts Institute of Technology, Lincoln Laboratory.
KESSLER, M. M. (1961), An Experimental Study of Bibliographic Coupling between Technical Papers.
Massachusetts Institute of Technology, Lincoln Laboratory.
KESSLER, M. M. (1963), Bibliographic coupling between scientific papers. American Documentation, 14 (1) :
10-25.
LEYDESDORFF, L. (1987), Various methods for the mapping of science. Scientometrics, 11 : 295-324.
MANLEY, B. F. J. (1994), Multivariate Statistical Methods: A Primer. Second edition. London: Chapman &
Hall.
MCCAIN, K. W. (1990), Mapping authors in intellectual space: a technical overview. Journal of the American
Society for Information Science, 41 (6) : 433-443.
NOYONS, E. C. M. (1999), Bibliometric Mapping as a Science Policy and Research Management Tool.
Leiden: DSWO Press.
PERSSON, O. (1994), The intellectual base and research front of JASIS 1986-1990. Journal of the American
Society for Information Science, 45 (1) : 31-38.
PETERS, H. P. F., BRAAM, R. R., VAN RAAN, A. F. J. (1995), Cognitive resemblance and citation relations in
chemical engineering publications. Journal of the American Society for Information Science, 46
(1) : 9-21.
PRICE, D. J. DE SOLLA (1965), Networks of scientific papers. Science, 149 : 510-515.
ROUSSEAU, B., ROUSSEAU, R. (2000), LOTKA: A program to fit a power law distribution to observed
frequency data. Cybermetrics, 4 (1) : paper 4. Available:
http://www.cindoc.csic.es/cybermetrics/articles/v4i1p4.html
SMALL, H. (1973), Co-citation in the scientific literature: a new measure of the relationship between two
documents. Journal of the American Society for Information Science, 24 (July-August) : 265-269.
SMALL, H., GRIFFITH, B. (1974), The structure of scientific literatures, I: identifying and graphing specialities.
Science Studies, 4 : 17-40.
SMALL, H., SWEENEY, E. (1985), Clustering the Science Citation Index using cocitations, I. A comparison of
methods. Scientometrics, 7 (3-6) : 391-409.
SHARABCHIEV, Y. T. (1988), Comparative analysis of two methods of cluster analysis of bibliographic
references. (In Russian). Nauchno-Tekhnicheskaya Informatsiya, 2 : 25-28.
