Chaomei Chen

Mapping Scientific Frontiers

The Quest for Knowledge Visualization

Second Edition
Chaomei Chen
College of Information Science and Technology
Drexel University
Philadelphia, Pennsylvania
USA

ISBN 978-1-4471-5127-2 ISBN 978-1-4471-5128-9 (eBook)


DOI 10.1007/978-1-4471-5128-9
Springer London Heidelberg New York Dordrecht

Library of Congress Control Number: 2013944066

© Springer-Verlag London 2003, 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Foreword

Mapping science at first glance appears to be an oxymoron: how can we map something as abstract as science? Scientific knowledge seems to occupy an intellectual realm which can only be glimpsed by the mind of the highly trained specialist.
Yet this book demonstrates that the discipline of science mapping has been going
on for a number of years and is indeed beginning to flourish with new results and
insights. In this endeavor we may only be in the stage of the early explorers who
drew out the first crude geographic maps of the then known physical world. We
might argue that science mapping is simply the logical progression of map making
from the physical to the intellectual world.
It is not necessary to make the case for the importance of science in the modern world, the object of our map-making efforts, even though our society has
at times minimized the role of scientific research, neglected to provide adequate
funding and support, and attempted to retard its educational programs. What
more important artifact of human intelligence could we focus on than the current
scientific landscape? Of course science can be used for good or ill, and science
confers incredible power on those who master its principles. On occasion individual
scientists will abuse the trust we place in them in their quest for recognition, as
Chaomei Chen documents here in his study of “retractions” of scientific articles.
But despite these aberrations, science is the gateway to understanding of our place
in the universe, and the foundation of our social and economic well-being.
Despite the fact that the language we use to describe science is replete with
spatial metaphors such as “field” and “area” of research, when we actually go about
trying to create a map of science, we soon realize that the procedures used to make
geographic maps no longer apply. We must deal with the abstract relations and
associations among entities such as scientific ideas, specialties, fields or disciplines
whose very existence may be open to debate. Thomas Kuhn described these shifting
historical disciplines in a revealing interview: “Look … you must not use later
titles for [earlier] fields. And it’s not only the ideas that change, it’s the structure of
the disciplines working on them.” (Kuhn 2000, p. 290) Here we realize that science
in any historical period, and indeed the current period, is a terra incognita. Are we justified in seeking a spatial representation of such abstract and perhaps hypothetical entities? Are our brains hardwired to take what is relational and project it in real space?
Perhaps science mapping is difficult to grasp because three conceptual steps are required before it makes sense, two of which involve some degree of mathematical manipulation. First, a unit of analysis must be chosen to serve as the elementary particles of our science universe. Second, a measure of association between the units must be defined. Third, a means must be found for depicting the units and their relations in a low-dimensional space (usually two dimensions) that can be perceived by humans. Once these intellectual leaps are made, science mapping seems natural and even inevitable.
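To make these three steps concrete, here is a minimal sketch in Python, assuming a toy corpus in which each paper is reduced to the set of references it cites. The labels and numbers are hypothetical, and classical multidimensional scaling stands in for whatever projection a real system might use.

```python
import numpy as np

# Step 1: choose a unit of analysis -- here, cited references (hypothetical labels).
docs = [
    {"Kuhn1962", "Merton1973", "Price1965"},
    {"Kuhn1962", "Price1965", "Garfield1955"},
    {"Merton1973", "Garfield1955"},
    {"Kuhn1962", "Garfield1955"},
]
units = sorted(set().union(*docs))

# Step 2: define a measure of association -- co-citation counts converted
# into a cosine similarity between the units' occurrence profiles.
occ = np.array([[ref in d for ref in units] for d in docs], dtype=float)
cooc = occ.T @ occ                       # co-citation counts
norms = np.sqrt(np.diag(cooc))
sim = cooc / np.outer(norms, norms)      # cosine similarity in [0, 1]

# Step 3: depict the units in two dimensions via classical MDS on the
# squared dissimilarities (1 - similarity).
d2 = (1.0 - sim) ** 2
j = np.eye(len(units)) - np.ones_like(d2) / len(units)
b = -0.5 * j @ d2 @ j                    # double-centered Gram matrix
vals, vecs = np.linalg.eigh(b)           # eigenvalues in ascending order
xy = vecs[:, -2:] * np.sqrt(np.clip(vals[-2:], 0, None))

for ref, (x, y) in zip(units, xy):
    print(f"{ref:12s} {x:+.3f} {y:+.3f}")
```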
The established scholarly disciplines of the history of science, sociology of
science and philosophy of science have long been regarded as providing the essential
tools for understanding the origin and evolution of scientific thought, its social
institutions, and its philosophical underpinnings. With the exception of some early
work in the sociology of science, the methods used by these disciplines have been
largely qualitative in nature. History of science has dealt largely with constructing
narratives using the methods of general history, and philosophy of science with
logical foundation and epistemology. Sociology of science, as exemplified by one
of its founders Robert Merton, was both strongly theoretical and open to the
use of quantitative evidence. This approach was also taken up by early social
network researchers who studied so-called invisible colleges. In the 1970s, however, sociology of science turned away from quantitative methods to embrace more radical social theories of science, such as the social construction of scientific knowledge, and largely abandoned the quantitative approach of the earlier sociologists.
More recently, primarily as the result of the availability of large databases of
scientific publications and citations and partly a reaction against constructivist
sociology, a discipline emerged which has opened up a new way to study the evolution of science. This field has been called variously scientometrics, informetrics
and bibliometrics. These terms reflect not only the focus on quantitative methods
upon which it is built, but also its origins in what was once called library science.
It cannot be claimed that the upstart discipline has achieved acceptance in the
academic world, particularly on the part of the more established disciplines, and
institutionalization in the form of university programs and academic positions has
only just begun. Critics of scientometrics have claimed that a focus on the scientific
literature as its primary source material too severely limits the data on which studies
of science can be based. On the other hand, the increasing availability of the full text of scientific papers in computer-readable formats opens up many new types of data for analysis which, when used in tandem with the standard online databases, go far beyond what has been possible using the standard indexes alone. Combined with
software packages such as those Chaomei Chen has pioneered, a powerful new tool for
the analysis of science has come into being. There is every indication that the new
field is here to stay and is exerting more and more influence on policy, even though
a rapprochement and integration with the traditional fields of history, sociology and
philosophy is probably a long way off.
Chaomei Chen’s book is important because it builds on many of the concepts and
findings of history, sociology and philosophy of science, but at the same time adds a
new dimension. As an example of the power of the new methods, the skeptical reader should consult Chapter 8, which presents a case study of recent work on induced pluripotent stem cells and shows how mapping can inform historical studies as well as assist medical researchers in getting an overview of a research area. Here we see
the strength of the new methods for exploring and tracking the internal structure of
revolutionary developments in contemporary science.
His book also draws on an even broader disciplinary framework, from computer science to information science, and particularly information visualization. In the first edition
Chaomei Chen commented on the disciplines that contribute ideas to science
mapping. “Different approaches to mapping scientific frontiers over recent years
are like streams running from several different sources…. A lot of work needs
to be done to cultivate knowledge visualization as a unifying subject matter that
can join several disciplines.” (Chen 2003, p. vii) This remains true even today
when scientometrics, computer science, and network science continue to evolve in a
strangely independent manner yet often dealing with the same underlying data and
issues. This may be an inevitable side effect of the barriers between disciplines, but
hopefully this book will help bridge these various streams.
As an example of the relevance of history of science, Chaomei Chen comments
that the work of Thomas Kuhn was an important backdrop to mapping because
one could think of the unfolding of a revolution in science as a series of cross-sectional maps that at some point undergoes a radical structural transformation.
Cross sectional thinking is also very much encouraged in the history of science
because historians are exhorted to understand the ideas of a historical period by
entering its mind-set, “to think as they did” (Kuhn 1977, p. 110), and not interpret
older science in terms of our “current” understanding. This is a difficult requirement
because once we know that a new discovery or finding has occurred it is extremely
difficult for us not to be influenced by it, and our first impulse is to find precursors
and antecedents. As analysts we need to take care not to allow the present to distort
the past.
As an example of how various cross-currents converge in science mapping we
could point out the tension between psychological factors, as exemplified by Kuhn’s
gestalt switching as a way of looking at conceptual change, and social forces such
as collegial networks and invisible colleges. Do social relations determine cognitive
relations, or vice versa? In Stanley Milgram’s early work (1967) on social networks,
human subjects were required to think about what acquaintances their acquaintances
had several steps removed. In Don Swanson’s work (1987) on undiscovered public knowledge, discoveries are made by seeking concepts that are indirectly related through intermediate concepts even though they are not directly connected. Thus the same type of
thinking is involved in both the social and intellectual tasks. If we are dealing with
words or references as our mapping units, then psychology clearly enters the picture
because an author’s memory and recall are involved in the associative process.
But that memory and recall are also influenced by what authors have seen other
authors or colleagues say. If we map individual scientists in their co-author relations,
then social factors must come into play but psychological factors also contribute to
the selection of co-authors. Thus social and psychological factors are inexorably
intertwined in both the social and intellectual structure of science.
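Swanson’s reasoning also lends itself to a simple computational reading. The sketch below echoes his well-known fish-oil and Raynaud’s example, but the set of term links is invented for illustration; it searches for intermediate “B” concepts that bridge two concepts with no direct link of their own.

```python
# Hypothetical term links distilled from two disjoint literatures.
links = {
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's disease"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "Raynaud's disease"),
}

def neighbors(term):
    """Terms directly linked to `term` in either direction."""
    return {b for a, b in links if a == term} | {a for a, b in links if b == term}

a, c = "fish oil", "Raynaud's disease"
# Candidate discoveries: B-terms connecting A and C, where A and C
# themselves share no direct link in the literature.
if (a, c) not in links and (c, a) not in links:
    bridges = neighbors(a) & neighbors(c)
    print(f"Indirect bridges between '{a}' and '{c}':", bridges)
```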
The competition in science mapping between the various choices for unit
of analysis such as words, references, authors, journals, etc. and the means of
associating them such as co-word, co-citation, co-authorship, direct citation, etc.
seems to boil down to the types of structures and level of relations we want to
observe. To better understand the role of discovery in specialty development we
might turn to co-citations because many discoveries are associated with specific
papers and authors. On the other hand, if we want to include broader societal, non-
scholarly factors then we might turn to co-words which can more readily capture
public or political sentiments external to science. Journals, a yet broader unit of
analysis, might best represent whole fields or disciplines. Choice of a unit of analysis
also depends on the historical period under investigation. Document co-citation
is probably not feasible prior to 1900 due to the absence of standard referencing
practice. However, name co-mention within the texts of scientific papers and books
is still very feasible for earlier periods. It is instructive to try to imagine how we
would carry out a co-mention or other kind of mapping for some earlier era, say
for scientific literature in the eighteenth century and whether we would be able to
identify the schools of thought and rival paradigms active during the period.
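As a toy illustration of how a name co-mention unit might be counted for such an earlier era, the following sketch tallies pairs of scientists’ names appearing in the same passage; the names are real but the passages are invented.

```python
from collections import Counter
from itertools import combinations

names = ["Lavoisier", "Priestley", "Scheele", "Cavendish"]
passages = [
    "Priestley and Scheele independently prepared the new air ...",
    "Lavoisier repeated the experiments of Priestley ...",
    "Cavendish communicated his result to Priestley ...",
]

comention = Counter()
for text in passages:
    present = sorted(n for n in names if n in text)
    for pair in combinations(present, 2):
        comention[pair] += 1        # two names co-mentioned in one passage

print(comention.most_common())
```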
Another important issue is the interpretation of maps. We know that the network of associations that underlies maps is high-dimensional, and that projection into two dimensions is inevitably an approximation that can place weakly related units close
together. This argues for the need to pay close attention to the links themselves, which give rise to the two-dimensional solution in the first place and which we can think of as the neurons of the World Brain (Garfield 1968) we are trying to visualize.
Only by knowing what the links signify can we gain a better understanding of what
the maps represent. This will involve looking more deeply at the context in which
the linking takes place, and seeking new ways of representing and categorizing
those relationships, for example, by function or type such as logical, causal, social,
hypothetical, metaphorical, etc. One positive development in this direction, as
described in the final chapter, is the advent of systems for “visual analytics” that
allow us to more deeply probe the underpinnings of maps with the ultimate goal of
supporting decision making.
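One way to act on this caution is to flag regions of a map where the projection misleads. The sketch below, assuming an association matrix sim and a two-dimensional layout xy such as those produced in the earlier sketch, lists pairs of units drawn close together despite weak underlying association; the thresholds are arbitrary.

```python
import numpy as np

def misleading_pairs(sim, xy, dist_tol=0.2, sim_tol=0.2):
    """Pairs placed close together in 2-D despite weak association."""
    out = []
    n = len(sim)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(xy[i] - xy[j])
            if d < dist_tol and sim[i, j] < sim_tol:
                out.append((i, j, d, sim[i, j]))
    return out
```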
Part of what is exciting about science mapping is that the landscape is continually
changing: every year there is a new crop of papers and the structure changes as new
areas emerge and existing areas evolve or die off. Some will find such a picture
unsettling and would prefer to see science as a stable and predictable enterprise, but
as Merton has argued (2004), serendipity is endemic to science, and thus also to
science maps. We do not yet know if discovery is in any way predictable, if there
are recognizable antecedents or conditions, or whether discovery or creativity can be
engineered to happen at a quicker pace. But because discoveries are readily apparent
in maps after they occur, we also have the possibility of studying maps for previous
time periods to look for structural antecedents.
This is a wide-ranging book on information visualization, with a specific focus on science mapping. Science mapping is still in its infancy, and many intellectual challenges remain to be investigated, many of which are outlined in the final
chapter. In this new edition Chaomei Chen has provided an essential text, useful
both as a primer for new entrants and as a comprehensive overview of recent
developments for the seasoned practitioner.

Henry Small
SciTech Strategies, Inc.

References

Chen C (2003) Mapping scientific frontiers. Springer, London
Garfield E (1968) “World Brain” or Memex? Mechanical and intellectual requirements for universal bibliographic control. In: Montgomery EB (ed) The foundations of access to knowledge. Syracuse University Press, Syracuse, pp 169–196. http://garfield.library.upenn.edu/essays/v6p540y1983.pdf
Kuhn TS (1977) The essential tension. University of Chicago Press, Chicago
Kuhn TS (2000) The road since structure. University of Chicago Press, Chicago
Merton RK, Barber E (2004) The travels and adventures of serendipity. Princeton University Press, Princeton
Milgram S (1967) The small world problem. Psychol Today 2:60–67
Swanson DR (1987) Two medical literatures that are logically but not bibliographically connected. J Am Soc Inf Sci 38:228–233
Preface for the 2nd Edition

The first edition of Mapping Scientific Frontiers (MSF) was published over 10 years
ago in 2002. Since then, a lot has changed. Social media has flourished to an extent that we have never seen before. News, debates, hoaxes, and scholarly blogs all fight for attention on Facebook (launched in 2004), YouTube (2005), and Twitter (2006),
which are made ubiquitously accessible by popular mobile devices such as iPhone
(2007) and iPad (2010).
Over the past 10 years, remarkable scientific breakthroughs have been made,
for example, Grigori Perelman’s proof of the century-old Poincaré Conjecture in
2002, the Nobel Prize winning research on induced pluripotent stem cells (iPSCs)
by Shinya Yamanaka and his colleagues since 2006, and the recent discovery of the
Higgs Boson in 2012 at CERN.
The big sciences continue to get bigger. Large-scale data collection efforts for
scientific research such as the Sloan Digital Sky Survey (SDSS) (2000–2014) in
astronomy represent one of many sources of big data. As old scientific fields
transform themselves, new ones emerge. Visual analytics entered our horizon in
2005 as a new field and has played a critical role ever since in advancing the science
and technology for solving practical issues, especially when we deal with situations that are full of complex, uncertain, incomplete, and potentially conflicting data. A representative case is concerned with maintaining the integrity of the scientific literature itself. The sheer growth of publications can easily overshadow the rise in retractions. What can be done to maintain a trustworthy body of scientific
knowledge?
What is the role that Mapping Scientific Frontiers has played? According to
Google Scholar, it has been cited by 235 scientific publications on the web. These
publications are in turn cited by an even broader range of articles. These articles
allow us to catch a glimpse of the context in which research in science mapping has
been evolving. Interestingly, the citation profile appears to show two stages. The
first one ranges from 2002 to 2008 and the second one from 2009 to the present
(Fig. 1). Citations in the first stage peaked in 2007, whereas citations in the second
stage were evenly distributed for the first 3 years. A study of citation data in the Web
of Science revealed a similar pattern.


Fig. 1 The citation profile of Mapping Scientific Frontiers (Source: Google Scholar)

What is the citation pattern telling us? The nature of the set of articles that cited
Mapping Scientific Frontiers as a whole can be analyzed in terms of how they are
in turn cited by subsequently published articles. In particular, we turn to articles
that have strong citation bursts, or abruptly increased citation rates, during the time
span of 2002–2013. Figure 2 shows 25 articles of this type. Articles in the first
stage shared a unique focus on information visualization and citation analysis. The
original motivation of Mapping Scientific Frontiers was indeed to bridge the two fields across the boundaries of different disciplines.
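The notion of a citation burst can be illustrated with a much simpler stand-in than the Kleinberg burst-detection algorithm that CiteSpace actually employs: flag any year whose citation count jumps well above a short running baseline. The counts below are made up for illustration.

```python
def burst_years(counts, window=3, factor=2.0):
    """Years whose count is at least `factor` times the recent average."""
    flagged = []
    for i in range(window, len(counts)):
        year, c = counts[i]
        baseline = sum(c for _, c in counts[i - window:i]) / window
        if baseline > 0 and c >= factor * baseline:
            flagged.append(year)
    return flagged

citations = [(2002, 3), (2003, 4), (2004, 5), (2005, 6), (2006, 14),
             (2007, 30), (2008, 18), (2009, 20), (2010, 21)]
print(burst_years(citations))   # -> [2006, 2007]
```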
The second stage is dominated by a series of publications dedicated to global science maps at disciplinary levels, as opposed to the document level in the first stage. The most influential work in the second stage in terms of citation burst is a 2009 Scientometrics article by Alan L. Porter and Ismael Rafols on the interdisciplinarity of science. The second highest citation burst is attributed to a 2010 article published in the Journal of the American Society for Information Science and Technology by Ismael Rafols, Alan L. Porter, and Loet Leydesdorff on science
overlay maps. We are still in the second stage. In terms of the scale and the unit
of analysis, the study of interdisciplinary interactions is a profound and potentially
fruitful way to better understand the dynamics of scientific frontiers.
In addition to the conceptual and theoretical development, researchers today
have a much wider range of choice than before in terms of computational tools
for analyzing, visualizing, and exploring patterns and trends in scientific literature. Notable examples include CiteSpace, HistCite, VOSviewer, and Sci2 for
scientometric studies and science mapping; GeoTime, Jigsaw, and Tableau for
visual analytics; and Gephi, Alluvial Maps, D3, and WebGL for more generic
information visualization. Today, a critical mass is taking shape and gathering strength as visual analytic tools, data sources, and exemplars of in-depth and longitudinal studies become increasingly accessible and interoperable. Mapping
Scientific Frontiers has reached a new level with a broad range of unprecedented
opportunities to impact scientific activity across so many disciplines.
The second edition of Mapping Scientific Frontiers brings you some of the most
profound discoveries and advances in the study of scientific knowledge and the
dynamics of its evolution. Some of the new additions are highlighted as follows:

Fig. 2 A citation analysis of Mapping Scientific Frontiers reveals two stages of relevant research.
Red bars indicate intervals of citation burst

• The Sloan Digital Sky Survey (SDSS) is featured in Chap. 2 in the context of what a map of the Universe may reveal.
• In Chap. 3, a series of new examples of visualizing thematic evolution over time is illustrated, including the widely known ThemeRiver, the elegant TextFlow, and the versatile Alluvial Maps.
• Chapter 8 is a new chapter. It introduces a framework of predictive analysis and demonstrates how it can be applied to a fast-advancing field such as regenerative medicine, highlighting the work on induced pluripotent stem cells (iPSCs) that was awarded the 2012 Nobel Prize in Medicine. Chapter 8 also addresses
practical implications of the retraction of a scientific publication. The second half
of Chap. 8 is devoted to the design, construction, and analysis of global science
maps, including our own new design of dual-map overlays.
• Chapter 9 is also a new chapter. It outlines some of the most representative
visual analytic tools such as GeoTime and Jigsaw. It also describes major analytic
features of CiteSpace.
The first edition concludes with ten challenges ahead. It is valuable to revisit these challenges identified over 10 years ago and see what has changed and what has newly emerged.
The second edition finishes with a set of new challenges and milestones ahead
for mapping scientific frontiers.

Villanova, Pennsylvania, USA
15 April 2013

Chaomei Chen
Acknowledgements

The second edition in part reflects the result of a continuous research effort
that I have been engaged in since the publication of the first edition. I’d like
to acknowledge the support and contributions of my colleagues, students, and
collaborators in various joint projects and publications, in particular, including my
current and former students Timothy Schultz (Drexel University, USA), Jian Zhang
(IBM Shanghai, China), and Donald A. Pellegrino (The Dow Chemical Company,
USA), collaborators such as Pak Chung Wong (PNNL, USA), Michael S. Vogeley
(Drexel University, USA), Alan MacEachren (Penn State University, USA), Jared
Milbank (Pfizer, USA), Loet Leydesdorff (The Netherlands), Richard Klavans,
Kevin Boyack, and Henry Small (SciTech Strategies, USA), and Hong Tseng (NIH,
USA).
As a Chang Jiang Scholar, I have had the opportunity to work with the WISELab at Dalian University of Technology, China, since 2008. I’d like to acknowledge the
collaboration with Zeyuan Liu, Yue Chen, Zhigang Hu, and Shengbo Liu. Yue Chen
is currently leading an ambitious effort to translate the second edition into Chinese.
I particularly appreciate the opportunity to work with Rod Miller, Chief Strategy
Officer at the iSchool of Drexel, and Paul Dougherty, Licensing Manager at the
Office of Technology Commercialization of Drexel University, over numerous
fruitful discussions of various research topics.
I’d like to acknowledge the support of sponsored research and grants from the
National Science Foundation (IIS-0612129, NSFDACS-10P1303, IIP 1160960),
Department of Homeland Security, Pfizer, and IMS Health. I’d also like to take the
opportunity to express my gratitude and appreciation to the hosts of my talks and
keynote speeches, including Michael Dietrich, History of Biology, Woods Hole,
MA; Stephanie Shipp, Institute for Defense Analyses (IDA), Washington, D.C.;
Paula Fearon, NIH, Bethesda, MD; David Chavalarias, Mining the Digital Traces
of Science (MDTS), Paris, France; and Josiane Mothe, Institut de Recherche en
Informatique de Toulouse, France.


Special thanks to Beverley Ford, Editorial Director for Computer Science, Springer London Ltd, for her initiative and encouragement in getting the second edition on my agenda, and to Ben Bishop, Senior Editorial Assistant, Springer, for his clear and professional guidance that ensured a smooth and enjoyable process of preparation.
As always, my family, Baohuan, Calvin, and Steven, gives me all the strength, the courage, and the inspiration.
Contents

1 The Dynamics of Scientific Knowledge  1
  1.1 Scientific Frontiers  2
    1.1.1 Competing Paradigms  6
    1.1.2 Invisible Colleges  10
    1.1.3 Conceptual Revolutions  11
    1.1.4 TRACES  16
  1.2 Visual Thinking  20
    1.2.1 Gestalt  20
    1.2.2 Famous Maps  22
    1.2.3 The Tower of Babel  23
    1.2.4 Messages to the Deep Space  25
    1.2.5 “Ceci n’est pas une Pipe”  30
    1.2.6 Gestalt Psychology  34
    1.2.7 Information Visualization and Visual Analytics  35
  1.3 Mapping Scientific Frontiers  39
    1.3.1 Science Mapping  40
    1.3.2 Cases of Competing Paradigms  41
  1.4 The Organization of the Book  43
  References  45
2 Mapping the Universe  47
  2.1 Cartography  47
    2.1.1 Thematic Maps  52
    2.1.2 Relief Maps and Photographic Cartography  53
  2.2 Terrestrial Maps  54
  2.3 Celestial Maps  56
    2.3.1 The Celestial Sphere Model  58
    2.3.2 Constellations  63
    2.3.3 Mapping the Universe  66
  2.4 Biological Maps  77
    2.4.1 DNA Double Helix  77
    2.4.2 Acupuncture Maps  79
    2.4.3 Genomic Maps  81
    2.4.4 A Map of Influenza Virus Protein Sequences  82
  References  84
3 Mapping Associations  85
  3.1 The Role of Association  85
    3.1.1 As We May Think  86
    3.1.2 The Origin of Cognitive Maps  87
    3.1.3 Information Visualization  91
  3.2 Identifying Structures  91
    3.2.1 Topic Models  91
    3.2.2 Pathfinder Network Scaling  93
    3.2.3 Measuring the Similarity Between Images  95
    3.2.4 Visualizing Abstract Structures  101
    3.2.5 Visualizing Trends and Patterns of Evolution  107
  3.3 Dimensionality Reduction  111
    3.3.1 Geometry of Similarity  113
    3.3.2 Multidimensional Scaling  114
    3.3.3 INDSCAL Analysis  119
    3.3.4 Linear Approximation – Isomap  121
    3.3.5 Locally Linear Embedding  124
  3.4 Concept Mapping  127
    3.4.1 Card Sorting  127
    3.4.2 Clustering  128
  3.5 Network Models  131
    3.5.1 Small-World Networks  131
    3.5.2 The Erdös-Renyi Theory  133
    3.5.3 Erdös Numbers  134
    3.5.4 Semantic Networks  135
    3.5.5 Network Visualization  136
  3.6 Summary  138
  References  139
4 Trajectories of Search  143
  4.1 Footprints in Information Space  143
    4.1.1 Traveling Salesman Problem  144
    4.1.2 Searching in Virtual Worlds  146
    4.1.3 Information Foraging  148
    4.1.4 Modeling a Foraging Process  149
    4.1.5 Trajectories of Users  154
  4.2 Summary  160
  References  161
5 The Structure and Dynamics of Scientific Knowledge  163
  5.1 Matthew Effect  164
  5.2 Maps of Words  167
    5.2.1 Co-Word Maps  167
    5.2.2 Inclusion Index and Inclusion Maps  168
    5.2.3 The Ontogeny of RISC  170
  5.3 Co-Citation Analysis  172
    5.3.1 Document Co-Citation Analysis  172
    5.3.2 Author Co-Citation Analysis  180
  5.4 HistCite  190
  5.5 Patent Co-Citations  193
  5.6 Summary  195
  References  197
6 Tracing Competing Paradigms  201
  6.1 Domain Analysis in Information Science  201
  6.2 A Longitudinal Study of Collagen Research  203
  6.3 The Mass Extinction Debates  206
    6.3.1 The KT Boundary Event  206
    6.3.2 Mass Extinctions  209
  6.4 Supermassive Black Holes  218
    6.4.1 The Active Galactic Nuclei Paradigm  218
    6.4.2 The Development of the AGN Paradigm  219
  6.5 Summary  224
  References  225
7 Tracking Latent Domain Knowledge  227
  7.1 Mainstream and Latent Streams  228
  7.2 Knowledge Discovery  229
    7.2.1 Undiscovered Public Knowledge  230
    7.2.2 Visualizing Latent Domain Knowledge  234
  7.3 Swanson’s Impact  239
  7.4 Pathfinder Networks’ Impact  240
    7.4.1 Mainstream Domain Knowledge  241
    7.4.2 Latent Domain Knowledge  242
  7.5 BSE and vCJD  248
    7.5.1 Mainstream Domain Knowledge  248
    7.5.2 The Manganese-Copper Hypothesis  254
  7.6 Summary  255
  References  256
8 Mapping Science  259
  8.1 System Perturbation and Structural Variation  259
    8.1.1 Early Signs  260
    8.1.2 A Structural Variation Model  262
    8.1.3 Structural Variation Metrics  265
    8.1.4 Statistical Models  269
    8.1.5 Complex Network Analysis (1996–2004)  271
  8.2 Regenerative Medicine  274
    8.2.1 A Scientometric Review  275
    8.2.2 The Structure and Dynamics  277
    8.2.3 System-Level Indicators  281
    8.2.4 Emerging Trends  286
    8.2.5 Lessons Learned  288
  8.3 Retraction  290
    8.3.1 Studies of Retraction  293
    8.3.2 Time to Retraction  296
    8.3.3 Retracted Articles in Context  297
    8.3.4 Autism and Vaccine  301
    8.3.5 Summary  304
  8.4 Global Science Maps and Overlays  304
    8.4.1 Mapping Scientific Disciplines  305
    8.4.2 Interdisciplinarity and Interactive Overlays  308
    8.4.3 Dual-Map Overlays  312
  References  316
9 Visual Analytics  321
  9.1 CiteSpace  321
  9.2 Jigsaw  325
  9.3 Carrot  329
  9.4 Power Grid Analysis  329
  9.5 Action Science Explorer (iOpener)  331
  9.6 Revisit the Ten Challenges Identified in 2002  332
  9.7 The Future  338
  References  339

Index  341
Abbreviations

2dF Two-degree field spectrograph
AAG The Association of American Geographers
ACA Author co-citation analysis
AGN Active galactic nuclei
ANT Actor Network Theory
ASE Action Science Explorer
ASIS&T American Society for Information Science and Technology
BSE Bovine Spongiform Encephalopathy
CAS Complex adaptive system
CBIR Content-based image retrieval
CfA Harvard-Smithsonian Center for Astrophysics
CJD Creutzfeldt-Jakob Disease
CKL Centrality divergence
CL Cluster linkage
DCA Document co-citation analysis
DMO Dual-map overlay
DNA Deoxyribonucleic acid
DoD Department of Defense
EOBT Expert Opinion on Biological Therapy
GCS Global Citation Score
GSA Generalized Similarity Analysis
HDF Hubble Deep Field
HMM Hidden Markov Model
HUDF Hubble Ultra Deep Field
iPSC Induced pluripotent stem cell
KT Boundary Cretaceous-Tertiary Boundary
LCS Local Citation Score
LGL Large Graph Layout
LLE Locally linear embedding
LSI Latent Semantic Indexing
MCR Modularity change rate
MDS Multidimensional scaling
MST Minimum spanning tree
NB Negative binomial
NVAC National Visualization and Analytics Center
PCA Principal component analysis
PFNET Pathfinder network
PNNL Pacific Northwest National Laboratory
PTSD Post-traumatic stress disorder
SCI Science Citation Index
SDSS Sloan Digital Sky Survey
SOM Self-organizing maps
SSCI Social Science Citation Index
SVD Singular value decomposition
TRACES Technology in Retrospect and Critical Events in Science
TREC The Text Retrieval Conference
TSP Traveling salesman problem
USPTO The United States Patent and Trademark Office
WECC The Western Power Grid
WoS The Web of Science
XDF eXtreme Deep Field
ZINB Zero-inflated negative binomial
List of Figures

Fig. 1 The citation profile of Mapping Scientific Frontiers


(Source: Google Scholar) .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xii
Fig. 2 A citation analysis of Mapping Scientific Frontiers
reveals two stages of relevant research. Red bars
indicate intervals of citation burst . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . xiii
Fig. 1.1 Conceptual change: a new conceptual system #2 is
replacing an old one #1 . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 12
Fig. 1.2 Computer-generated “best fit” of the continents. There
are several versions of this type of fit maps credited to
the British geophysicists E.C. Bullard, J.E. Everett,
and A.G. Smith .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 14
Fig. 1.3 Wegener’s conceptual system (top) and the
contemporary one (bottom) .. . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 15
Fig. 1.4 The conceptual structure of Wegener’s continental drift theory . . . 16
Fig. 1.5 The conceptual structure of Wegener’s opponents . . . . . . . . . . . . . . . . . 17
Fig. 1.6 Pathways to the invention of the video tape recorder
(© Illinois Institute of Technology) . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 19
Fig. 1.7 Alexander Fleming’s penicillin mould, 1935 (©
Science Museum, London) . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 21
Fig. 1.8 Minard’s map (Courtesy of http://www.napoleonic-
literature.com) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 22
Fig. 1.9 Map of Cholera deaths and locations of water pumps
(Courtesy of National Geographic).. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 24
Fig. 1.10 The Tower of Babel (1563) by Pieter Bruegel.
Kunsthistorisches Museum Wien, Vienna. (Copyright
free, image is in the public domain) .. . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 25
Fig. 1.11 The Tower of Babel by Maurits Escher (1928) .. . . . . . . . . . . . . . . . . . . . 26
Fig. 1.12 The gold-plated aluminum plaque on Pioneer
spacecraft, showing the figures of a man and a woman
to scale next to a line silhouette of the spacecraft . . . . . . . . . . . . . . . . . . 27

xxiii
xxiv List of Figures

Fig. 1.13 Voyagers’ message . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 29


Fig. 1.14 Instructions on Voyager’s plaque .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 30
Fig. 1.15 René Magritte’s famous statement . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 31
Fig. 1.16 The first X-ray photograph, produced by Röntgen in
1895, showing his wife’s hand with a wedding ring . . . . . . . . . . . . . . . 32
Fig. 1.17 A Gestalt switch between figure and ground. Does the
figure show a vase or two faces? . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 33
Fig. 1.18 Is this a young lady or an old woman? . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 35
Fig. 2.1 Scenes in the film Powers of Ten (Reprinted from
http://www.powersof10.com/film © 2010 Eames Office) .. . . . . . . . . 48
Fig. 2.2 The procedure of creating a thematic map .. . . . .. . . . . . . . . . . . . . . . . . . . 49
Fig. 2.3 The visual hierarchy. Objects on the map that are most
important intellectually are rendered with the greatest
contrast to their surroundings. Less important elements
are placed lower in the hierarchy by reducing their
edge contrasts. The side view in this drawing further
illustrates this hierarchical concept .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 52
Fig. 2.4 Four types of relief map: (a) contours, (b) contours
with hill shading, (c) layer tints, and (d) digits
(Reprinted from http://www.nottingham.ac.uk/
education/maps/relief.html#r5) .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 53
Fig. 2.5 A Landsat photograph of Britain (left). Central London
(right) is shown as the blue area near to the lower right
corner. The Landsat satellite took the photo on May
23rd, 2001 (Reprinted from http://GloVis.usgs.gov/
ImgViewer.jsp?path=201&row=24&pixelSize=1000) . . . . . . . . . . . . . 54
Fig. 2.6 Ptolemy’s world map, re-constructed based on his
work Geography c. 150 (© The British Library http://
www.bl.uk/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 55
Fig. 2.7 A road map and an aerial photograph of the
Westminster Bridge in London . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 55
Fig. 2.8 London Underground map conforms to the
geographical configuration . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 56
Fig. 2.9 London underground map does not conform to the
geographical configuration . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 57
Fig. 2.10 Atlas with the celestial sphere on his shoulders. This
is the earliest surviving representation of the classical
constellations (Courtesy of www.cosmopolis.com) . . . . . . . . . . . . . . . . 59
Fig. 2.11 Most of the 48 classical constellation figures are
shown, but not the stars comprising each constellation.
The Farnese Atlas, 200 BC from the National Maritime
Museum, London.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 59
Fig. 2.12 Constellations in the northern Hemisphere in 1795s.
The Constellations of Eratosthenes.. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 60
List of Figures xxv

Fig. 2.13 Constellations in the southern hemisphere in 1795s.


The Constellations of Eratosthenes.. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 61
Fig. 2.14 The painting of constellations by an unknown artist in
1575 on the ceiling of the Sala del Mappamondo of the
Palazzo Farnese in Caprarola, Italy. Orion the Hunter
and Andromeda are both located to the right of the
painting (Reprinted from Sesti 1991) . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 62
Fig. 2.15 Left: M-31 (NGC-224) – the Andromeda Galaxy;
Right: The mythic figure Andromeda . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 64
Fig. 2.16 Perseus and Andromeda constellations in John
Flamsteed’s Atlas Coelestis (1729) (Courtesy of http://
mahler.brera.mi.astro.it/) .. . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 65
Fig. 2.17 Taurus and Orion in John Flamsteed’s Atlas Coelestis
(1729) (Courtesy of http://mahler.brera.mi.astro.it/) . . . . . . . . . . . . . . . 65
Fig. 2.18 Orion the Hunter (Courtesy of http://www.cwrl.utexas.
edu/syverson/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 66
Fig. 2.19 Large-scale structures in the Universe (Reprinted from
Scientific American, June 1999).. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 67
Fig. 2.20 The CfA Great Wall – the structure is 500 million
light-years across. The Harvard-Smithsonian Center
for Astrophysics redshift survey of galaxies in the
northern celestial hemisphere of the universe has
revealed filaments, bubbles, and, arching across the
middle of the sample .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 68
Fig. 2.21 Slice through the Universe (Reprinted from Scientific
American, June 1999) .. . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 69
Fig. 2.22 Flying through in the 3D universe map (Courtesy of
http://msowww.anu.edu.au/) .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 69
Fig. 2.23 Part of the rectangular logarithmic map of the
universe depicting major astronomical objects beyond
100 mpc from the Earth (The full map is available at
http://www.astro.princeton.edu/universe/all100.gif.
Reprinted from Gott et al. 2005) . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 71
Fig. 2.24 A map of the universe based on the SDSS survey data
and relevant literature data from the web of science.
The map depicts 618,223 astronomic objects, mostly
identified by the SDSS survey, including 4 space
probes (A high resolution version of the map can be
found at http://cluster.cis.drexel.edu/cchen/projects/
sdss/images/2007/poster.jpg) . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 72
Fig. 2.25 The design of the circular map of the universe .. . . . . . . . . . . . . . . . . . . . 72
Fig. 2.26 The types of objects shown in the circular map of the universe . . . 73
Fig. 2.27 The center of the circular map of the universe ... . . . . . . . . . . . . . . . . . . . 73
xxvi List of Figures

Fig. 2.28 Major discoveries in the west region of the map. The
2003 Sloan Great Wall is much further away from us
than the 1989 CfA2 Great Wall . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 74
Fig. 2.29 The Hubble Ultra Deep Field (HUDF) is featured on
the map of the universe . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 75
Fig. 2.30 SDSS quasars associated with citation bursts . . .. . . . . . . . . . . . . . . . . . . . 76
Fig. 2.31 A network of co-cited publications based on the SDSS
survey. The arrow points to an article published in
2003 on a survey of high redshift quasars in SDSS II.
A citation burst was detected for the article. . . . .. . . . . . . . . . . . . . . . . . . . 76
Fig. 2.32 The original structure of DNA’s double helix
(Reprinted from Watson 1968) . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 79
Fig. 2.33 Ear acupacture point map. What is the best organizing
metaphor? (Courtesy of http://www.auriculotherapy-
intl.com/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 80
Fig. 2.34 Musculoskeletal points (©1996 Terry Oleson, UCLA
School of Medicine. http://www.americanwholehealth.
com/images/earms.gif) .. . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 81
Fig. 2.35 Caenorhabditis elegans gene expression terrain map
created by VxInsight, showing three-dimensional
representation of 44 gene mountains derived from 553
microarray hybridizations and consisting of 17,661
genes (representing 98.6 % of the genes present on the
DNA microarrays) (Reprinted from Kim et al. 2001) . . . . . . . . . . . . . . 82
Fig. 2.36 114,996 influenza virus protein sequences (Reprinted
from Pellegrino and Chen 2011) . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 83
Fig. 3.1 Liberation by Escher. Rigid triangles are transforming
into more lively figures (© Worldofescher.com).. . . . . . . . . . . . . . . . . . . 86
Fig. 3.2 The scope of the Knowledge of London, within which
London taxi drivers are supposed to know the most
direct route by heart, that is, without resorting to the
A–Z street map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 90
Fig. 3.3 Nodes a and c are connected by two paths. If r D 1,
Path 2 is longer than Path 1, violating the triangle
inequality; so it needs to be removed.. . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 95
Fig. 3.4 A Pathfinder network of the 20-city proximity data .. . . . . . . . . . . . . . . 96
Fig. 3.5 A Pathfinder network of a group of related concepts .. . . . . . . . . . . . . . 96
Fig. 3.6 Visualization of 279 images by color histogram .. . . . . . . . . . . . . . . . . . . 98
Fig. 3.7 Visualization of 279 images by layout . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 99
Fig. 3.8 Visualizations of 279 images by texture . . . . . . . .. . . . . . . . . . . . . . . . . . . . 100
Fig. 3.9 Valleys and peaks in ThemeView (© PNNL) . . .. . . . . . . . . . . . . . . . . . . . 102
Fig. 3.10 A virtual landscape in VxInsight . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 102
Fig. 3.11 A virtual landscape of patent class 360 for a period
between 1980 and 1984 in VxInsight. Companies’
names are color-coded: Seagate-red, Hitachi-green,
Olympus-blue, Sony-yellow, IBM-cyan, and
Philips-magenta (Courtesy of Kevin Boyack) . .. . . . . . . . . . . . . . . . . . . . 103
Fig. 3.12 A SOM-derived base map of the literature of
geography (Reprinted from Skupin 2009) . . . . . .. . . . . . . . . . . . . . . . . . . . 104
Fig. 3.13 The process of visualizing citation impact in the
context of co-citation networks (© 2001 IEEE) . . . . . . . . . . . . . . . . . . . . 105
Fig. 3.14 The design of ParadigmView (© 2001 IEEE) . .. . . . . . . . . . . . . . . . . . . . 107
Fig. 3.15 Examples of virtual landscape views (© 2001 IEEE) . . . . . . . . . . . . . . 108
Fig. 3.16 Streams of topics in Fidel Castro’s speeches and other
documents (Reprinted from Havre et al. 2000) .. . . . . . . . . . . . . . . . . . . . 109
Fig. 3.17 The evolution of topics is visualized in TextFlow
(Reprinted from Cui et al. 2011) . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 110
Fig. 3.18 Alluvial map of scientific change (Reprinted from
Rosvall and Bergstrom 2010) .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 111
Fig. 3.19 Load a network in .net format to the alluvial map generator .. . . . . . 112
Fig. 3.20 An alluvial map generated based on networks
of co-occurring terms in publications related to
regenerative medicine. The top 300 most frequently
occurring terms are chosen each year . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 112
Fig. 3.21 An alluvial map of popular tweet topics identified as
Hurricane Sandy was approaching . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 113
Fig. 3.22 An alluvial map of co-occurring patterns of chemical
compound fragments .. . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 113
Fig. 3.23 The simplest procedure of generating an MDS map . . . . . . . . . . . . . . . 115
Fig. 3.24 A geographic map showing 20 cities in the US
(Copyright © 1998–2012 USATourist.com, LLC
http://www.usatourist.com/english/tips/distances.html) .. . . . . . . . . . . 115
Fig. 3.25 An MDS configuration according to the mileage chart
for 20 cities in the US . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 116
Fig. 3.26 The mirror image of the original MDS configuration,
showing an overall match to the geographic map,
although Orlando and Miami should be placed further
to the south . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 116
Fig. 3.27 The procedure of generating an MST-enhanced MDS
map of the CRCARS data. Nodes are placed by MDS
and MST determines explicit links . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 117
Fig. 3.28 An MDS configuration of the 406 cars in the CRCARS
data, including an MST overlay. The edge connecting
a pair of cars is coded in grayscale to indicate the
strength of similarity: the darker, the stronger the
similarity. The MST structure provides a reference
framework for assessing the accuracy of the MDS
configuration (Courtesy of http://www.pavis.org/) . . . . . . . . . . . . . . . . . 118
Fig. 3.29 The procedure of journal co-citation analysis described
in Morris and McCain (1998) . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 118
Fig. 3.30 Cluster solution for SCI co-citation data (Reproduced
from Morris and McCain (1998). Note that “Comput
Biol Med” and “Int J Clin Monit Comput” belong to
different clusters) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 119
Fig. 3.31 SCI multidimensional scaling display with cluster
boundaries (Reproduced from Morris and McCain
(1998). Note the distance between “Comput Biol Med”
and “Int J Clin Monit Comput” to the left of this MDS
configuration).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 120
Fig. 3.32 Individual differences scaling results of two red-green
color-deficient subjects. The Y axis is not as fully
extended as for normal subjects . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 122
Fig. 3.33 SCI weighted individual differences scaling display
(Reproduced from Morris and McCain 1998) . .. . . . . . . . . . . . . . . . . . . . 122
Fig. 3.34 SSCI weighted individual differences scaling display
(Reproduced from Morris and McCain 1998) . .. . . . . . . . . . . . . . . . . . . . 123
Fig. 3.35 The Swiss-roll data set, illustrating how Isomap
exploits geodesic paths for nonlinear dimensionality
reduction. Straight lines in the embedding (the blue
line in part a) now represent simpler and cleaner
approximations to the true geodesic paths than do the
corresponding graph paths (the red line in part b)
(Reproduced from Tenenbaum et al. (2000) Fig. 3.
http://www.sciencemag.org/cgi/content/full/290/5500/
2319/F3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 124
Fig. 3.36 Face images varying in pose and illumination (Fig.
1A) (Reprinted from Tenenbaum et al. 2000).. .. . . . . . . . . . . . . . . . . . . . 125
Fig. 3.37 Isomap (K = 6) applied to 2,000 images of a
hand in different configurations (Reproduced from
Supplemental Figure 1 of Tenenbaum et al. (2000)
http://isomap.stanford.edu/handfig.html) . . . . . . .. . . . . . . . . . . . . . . . . . . . 126
Fig. 3.38 The color-coding illustrates the
neighborhood-preserving mapping discovered
by LLE (Reprinted from Roweis and Saul 2000) .. . . . . . . . . . . . . . . . . . 126
Fig. 3.39 The procedure used for concept mapping .. . . . . .. . . . . . . . . . . . . . . . . . . . 128
Fig. 3.40 An MDS-configured base map of topical statements
and ratings of importance shown as stacked bars.. . . . . . . . . . . . . . . . . . 129
Fig. 3.41 Hierarchical cluster analysis divided MDS coordinates
into nine clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 130
Fig. 3.42 A structural hole between groups a, b and c (Reprinted
from Burt 2002) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 132
Fig. 3.43 A visualization of a co-citation network associated
with research in regenerative medicine. The colors
indicate the time of publication .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 138
Fig. 4.1 Three Traveling Salesman tours in German cities:
the 45-city Alten Commis-Voyageur tour (green),
Groetschel's 120-city tour (blue), and the latest
15,112-city tour (red) (Courtesy of http://www.math.
princeton.edu/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 145
Fig. 4.2 Knowledge garden . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 146
Fig. 4.3 A scene in StarWalker in which two users are exploring
the semantically organized virtual space . . . . . . . .. . . . . . . . . . . . . . . . . . . . 146
Fig. 4.4 More users gathering in the scene . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 147
Fig. 4.5 A site map produced by seePOWER. The colored
contours represent the hit rate of a web page. The
home page is the node in the center (Courtesy of http://
www.compudigm.com/) . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 147
Fig. 4.6 Modeling trails of information foragers in thematic spaces .. . . . . . . 149
Fig. 4.7 Legend for the visualization of foraging trails . . .. . . . . . . . . . . . . . . . . . . . 154
Fig. 4.8 Relevant documents for Task A in the ALCOHOL
space (MST) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 155
Fig. 4.9 Overview first: user jbr’s trails in searching the alcohol
space (Task A) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 155
Fig. 4.10 Zoom in ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 157
Fig. 4.11 Details on demand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 157
Fig. 4.12 Overview first, zoom in, filtering, detail on demand.
Accumulative trajectory maps of user jbr in four
consecutive sessions of tasks. Activated areas in each
session reflect the changes of the scope (clockwise:
Task A to Task D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 158
Fig. 4.13 Synthesized trails. The trajectory of the optimal path
over the original path of user jbr . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 159
Fig. 5.1 An inclusion map of research in mass extinction based
on index terms of articles on mass extinction published
in 1990. The size of a node is proportional to the total
number of occurrences of the word. Links that violate
first-order triangle inequality are removed (threshold = 0.75) . . . . . . . . . . . . . . 169
Fig. 5.2 The co-word map of the period of 1980–1985 for the
debate on RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 171
Fig. 5.3 The co-word map of another period: 1986–1987 for
the debate on RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 171
Fig. 5.4 A document co-citation network of publications in
Data and Knowledge Engineering . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 173
Fig. 5.5 Citation analysis detected a vital missing citation from
Mazur’s paper in 1962 to Rydon’s paper in 1952 . . . . . . . . . . . . . . . . . . 174
Fig. 5.6 A global map of science based on document co-citation
patterns in 1996, showing a linked structure of nested
clusters of documents in various disciplines and
research areas (Reproduced from Garfield 1998) .. . . . . . . . . . . . . . . . . . 176
Fig. 5.7 Zooming in to reveal a detailed structure of
biomedicine (Reproduced from Garfield 1998).. . . . . . . . . . . . . . . . . . . . 177
Fig. 5.8 Zooming in even further to examine the structure of
immunology (Reproduced from Garfield 1998) . . . . . . . . . . . . . . . . . . . . 178
Fig. 5.9 The specialty narrative of leukemia viruses. Specialty
narrative links are labeled by citation-context
categories (Reproduced from Small 1986) .. . . . .. . . . . . . . . . . . . . . . . . . . 179
Fig. 5.10 A generic procedure of co-citation analysis. Dashed
lines indicate visualization options .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 181
Fig. 5.11 The first map of author co-citation analysis, featuring
specialties in information science (1972–1979)
(Reproduced from White and Griffith 1981) .. . .. . . . . . . . . . . . . . . . . . . . 182
Fig. 5.12 A two-dimensional Pathfinder network integrated with
information on term frequencies as the third dimension
(Reproduced from Chen 1998) . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 183
Fig. 5.13 A Pathfinder network of SIGCHI papers based on their
content similarity. The interactive interface allows
users to view the abstract of a paper seamlessly as they
navigate through the network (Reproduced from Chen 1998) . . . . . 184
Fig. 5.14 A Pathfinder network of co-cited authors of the ACM
Hypertext conference series (1989–1998) (Reproduced
from Chen and Carr 1999) .. . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 184
Fig. 5.15 A Pathfinder network of 121 information science
authors based on raw co-citation counts (Reproduced
from White 2003) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 185
Fig. 5.16 A minimum spanning tree solution of the author
co-citation network based on the ACM Hypertext
dataset (Nodes = 367, Links = 366) . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 187
Fig. 5.17 The author co-citation network of the ACM Hypertext
data in a Pathfinder network (Nodes = 367, Links = 398) . . . . . . . . . 188
Fig. 5.18 The procedure of co-citation analysis as described in
Chen and Paul (2001) . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 188
Fig. 5.19 A Pathfinder network showing an author co-citation
structure of 367 authors in hypertext research
(1989–1998). The color of a node indicates its
specialty membership identified by PCA: red for the
most predominant specialty, green the second, and
blue the third (© 1999 IEEE) .. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 189
Fig. 5.20 A landscape view of the hypertext author co-citation
network (1989–1998). The height of each vertical bar
represents periodical citation index for each author (©
1999 IEEE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 190
Fig. 5.21 An annotated historiograph of co-citation research
(Courtesy of Eugene Garfield; the original diagram can
be found at: http://garfield.library.upenn.edu/histcomp/
cocitation small-griffith/graph/2.html) .. . . . . . . . .. . . . . . . . . . . . . . . . . . . . 191
Fig. 5.22 A minimum spanning tree of a network of 1,726
co-cited patents related to cancer research .. . . . .. . . . . . . . . . . . . . . . . . . . 193
Fig. 5.23 Landscapes of patent class 360 for four 5-year periods.
Olympus's patents are shown in blue; Sony in yellow;
Hitachi in green; Philips in magenta; IBM in cyan; and
Seagate in red (Reproduced from Figure 1 of Boyack
et al. 2000) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 194
Fig. 5.24 Map of all patents issued by the US Patent Office in
January 2000. Design patents are shown in magenta;
patents granted to universities in green; and IBM’s
patents in red (Reproduced from Figure 5 of Boyack et
al. 2000) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 195
Fig. 5.25 A visualization of the literature of co-citation analysis . . . . . . . . . . . . 196
Fig. 6.1 Paradigm shift in collagen research (Reproduced from
Small 1977) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 204
Fig. 6.2 The curve of a predominant paradigm . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 206
Fig. 6.3 An artist’s illustration of the impact theory: before
the impact, seconds to impact, moment of impact, the
impact crater, and the impact winter (© Walter Alvarez) . . . . . . . . . . 209
Fig. 6.4 Shoemaker-Levy 9 colliding into Jupiter in 1994.
Eight impact sites are visible. From left to right are
the E/F complex (barely visible on the edge of the
planet), the star-shaped H site, the impact sites for tiny
N, Q1, small Q2, and R, and on the far right limb the
D/G complex. The D/G complex also shows extended
haze at the edge of the planet. The features are rapidly
evolving on timescales of days. The smallest features
in this image are less than 200 km across. This image
is a color composite from three filters at 9,530, 5,550,
and 4,100 Å (Copyright free, image released into the
public domain by NASA) . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 210
Fig. 6.5 Interpretations of the key evidence by competing
paradigms in the KT debate . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 210
Fig. 6.6 A paradigmatic view of the mass extinction debates
(1981–2001) .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 211
Fig. 6.7 The location of the Chicxulub crater . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 212
Fig. 6.8 Chicxulub’s gravity field (left) and its magnetic
anomaly field (right) (© Mark Pilkington of the
Geological Survey of Canada) .. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 213
Fig. 6.9 The periodicity cluster . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 215
Fig. 6.10 A year-by-year animation shows the growing impact
of articles in the context of relevant paradigms. The
top-row snapshots show the citations gained by the
KT impact articles (center), whereas the bottom-row
snapshots highlight the periodicity cluster (left) and
the Permian extinction cluster (right) . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 216
Fig. 6.11 Citation peaks of three clusters of articles indicate
potential paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 217
Fig. 6.12 The search for supermassive black holes between 1991 and
1995. The visualization of the document co-citation
network is based on co-citation data from 1981
through 2000. Three paradigmatic clusters highlight
new evidence (the cluster near to the front) as well as
theoretical origins of the AGN paradigm . . . . . . .. . . . . . . . . . . . . . . . . . . . 221
Fig. 6.13 The visualization of the final period of the AGN case
study (1996–2000). The cluster near to the front has
almost vanished and the cluster to the right has also
reduced considerably. In contrast, citations of articles
in the center of the co-citation network rocketed,
leading by two evidence articles published in Nature:
one is about NGC-4258 and the other is about MCG-6-30-15. . . . . 222
Fig. 6.14 The rises and falls of citation profiles of 221 articles
across three periods of the AGN paradigm . . . . .. . . . . . . . . . . . . . . . . . . . 223
Fig. 7.1 An evolving landscape of research pertinent to BSE
and CJD. The next hot topic may emerge in an area
that is currently not populated . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 228
Fig. 7.2 A Venn diagram showing potential links between
bibliographically unconnected literatures (Figure 1
reprinted from Swanson and Smalheiser (1997)) .. . . . . . . . . . . . . . . . . . 233
Fig. 7.3 A schematic diagram, showing the most promising
pathway linking migraine in the source literature to
magnesium in the target literatures (C to A3) (Courtesy
of http://kiwi.uchicago.edu/) .. . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 234
Fig. 7.4 A schematic flowchart of Swanson’s Procedure II
(Figure 4 reprinted from Swanson and Smalheiser
(1997), available at http://kiwi.uchicago.edu/webwork/
fig4.xbm) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 235
Fig. 7.5 Mainstream domain knowledge is typically high in
both relevance and citation, whereas latent domain
knowledge can be characterized as high relevance and
low citation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 237
Fig. 7.6 The strategy of visualizing latent domain knowledge.
The global context is derived from co-citation
networks of highly cited works. An “exit” landmark
is chosen from the global context to serve as the
seeding article in the process of domain expansion.
The expanded domain consists of articles connecting
to the seeding article by citation chains of no more
than two citation links. Latent domain knowledge is
represented through a citation network of these articles . . . . . . . . . . . 237
Fig. 7.7 An overview of the document co-citation map. Lit-up
articles in the scene are Swanson’s publications. Four
of Swanson’s articles are embedded in the largest
branch – information science, including information
retrieval and citation indexing. A dozen of his articles
are gathered in the green specialty – the second largest
grouping, ranging from scientometrics, neurology,
to artificial intelligence. The third largest branch –
headache and magnesium – only contains one of
Swanson’s articles.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 239
Fig. 7.8 The procedure of visualizing latent domain knowledge .. . . . . . . . . . . 241
Fig. 7.9 An overview of the mainstream domain knowledge.. . . . . . . . . . . . . . . 242
Fig. 7.10 A landscape view of the Pathfinder case. Applications
of Pathfinder networks are found in a broader context
of knowledge management technologies, such as
knowledge acquisition, knowledge discovery, and
artificial intelligence. A majority of Pathfinder network
users are cognitive psychologists .. . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 243
Fig. 7.11 This citation map shows that the most prolific themes
of Pathfinder network applications include measuring
the structure of expertise, eliciting knowledge,
measuring the organization of memory, and comparing
mental models. No threshold is imposed .. . . . . . .. . . . . . . . . . . . . . . . . . . . 245
Fig. 7.12 This branch represents a new paradigm of
incorporating Pathfinder networks into Generalized
Similarity Analysis (GSA), a generic framework for
structuring and visualization, and its applications
especially in strengthening traditional citation analysis .. . . . . . . . . . . 248
Fig. 7.13 Schvaneveldt's "exit" landmark in the landscape of
the thematic visualization .. . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 250
Fig. 7.14 An overview of 379 articles in the mainstream of BSE
and vCJD research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 251
Fig. 7.15 A year-by-year animation shows the growing impact
of research in the connections between BSE and
vCJD. Top-left: 1991–1993; Top-right: 1994–1996;
Bottom-left: 1997–1999; Bottom-right: 2000–2001 . . . . . . . . . . . . . . . . 252
Fig. 7.16 Articles cited more than 50 times during this period
are labeled. Articles labeled 1–3 directly address the
BSE-CJD connection. Article 4 is Prusiner's original
article on prions, which has broad implications for brain
diseases in sheep, cattle, and humans . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 253
Fig. 8.1 An overview of the structural variation model ... . . . . . . . . . . . . . . . . . . . 263
Fig. 8.2 Scenarios that may increase or decrease individual
terms in the modularity metric.. . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 267
Fig. 8.3 The structure of the system before the publication of
the groundbreaking paper by Watts . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 271
Fig. 8.4 The structure of the system after the publication of
Watts 1998 .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 271
Fig. 8.5 The structural variation method is applied to a set of
patents related to cancer research. The star marks the
position of a patent (US6537746). The red lines show
where the boundary-spanning connections were made
by the patent. Interestingly, the impacted clusters are
about recombination . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 274
Fig. 8.6 Major areas of regenerative medicine . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 277
Fig. 8.7 The modularity of the network dropped considerably
in 2007 and even more in 2009, suggesting that some
major structural changes took place in these 2 years in
particular .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 282
Fig. 8.8 Many members of Cluster #7 are found to have citation
bursts, shown as citation rings in red. Chin MH 2009
and Stadtfeld M 2010 at the bottom area of the cluster
represent a theme that differs from other themes of the cluster .. . . 286
Fig. 8.9 A network of the regenerative medicine literature
shows 2,507 co-cited references cited by top 500
publications per year between 2000 and 2011. The
work associated with the two labelled references was
awarded the 2012 Nobel Prize in Medicine .. . . .. . . . . . . . . . . . . . . . . . . . 288
Fig. 8.10 The rate of retraction is increasing in PubMed
(As of 3/29/2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 291
Fig. 8.11 The survival function of retraction. The probability of
surviving retraction for 4 years or more is below 0.2.. . . . . . . . . . . . . . 297
Fig. 8.12 An overview of co-citation contexts of retracted
articles. Each dot is a reference of an article. Red
dots indicate retracted articles. The numbers in front
of labels indicate their citation ranking. Potentially
damaging retracted articles are in the middle of an area
that is otherwise free of red dots . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 299
Fig. 8.13 Red dots are retracted articles. Labeled ones are highly
cited. Clusters are formed by co-citation strengths .. . . . . . . . . . . . . . . . 300
Fig. 8.14 An extensive citation context of a retracted 2003 article
by Nakao et al. The co-citation network contains
27,905 cited articles between 2003 and 2011. The
black dot in the middle of the dense network represents
the Nakao paper. Red dots represent 340 articles that
directly cited the Nakao paper (there are 609 such
articles in the Web of Science). Cyan dots represent
2,130 of the 9,656 articles that are bibliographically
coupled with the direct citers . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 300
Fig. 8.15 69 clusters formed by 706 sentences that cited the
1998 Wakefield paper . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 302
Fig. 8.16 Divergent topics in a topic-transition visualization of
the 1998 Wakefield et al. article . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 302
Fig. 8.17 The UCSD map of science. Each node in the map is
a cluster of journals. The clustering was based on
a combination of bibliographic couplings between
journals and between keywords. Thirteen regions are
manually labeled (Reproduced with permission) .. . . . . . . . . . . . . . . . . . 306
Fig. 8.18 Areas of research leadership for China. Left: A
discipline-level circle map. Right: A paper-level circle
map embedded in a discipline circle map. Areas
of research leadership are located at the average
position of corresponding disciplines or paradigms.
The intensity of the nodes indicates the number of
leadership types found, Relative Publication Share
(RPS), Relative Reference Share (RRS), or
state-of-the-art (SOA) (Reprinted from Klavans and Boyack 2010
with permission).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 307
Fig. 8.19 A discipline-level map of 812 clusters of journals and
proceedings. Each node is a cluster. The size of a
node represents the number of papers in the cluster
(Reprinted from Boyack 2009 with permission) .. . . . . . . . . . . . . . . . . . . 308
Fig. 8.20 The Scopus 2010 global map of 116,000 clusters of
1.7 million articles (Courtesy of Richard Klavans and
Kevin Boyack, reproduced with permission) . . .. . . . . . . . . . . . . . . . . . . . 309
Fig. 8.21 An overlay on the Scopus 2010 map shows papers that
acknowledge NCI grants (Courtesy of Kevin Boyack,
reproduced with permission) . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 310
Fig. 8.22 A global science overlay base map. Nodes represent
Web of Science Categories. Grey links represent
degree of cognitive similarity (Reprinted from Rafols
et al. 2010 with permission) . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 311
Fig. 8.23 An interactive science overlay map of
Glaxo-SmithKline’s publications between
2000 and 2009. The red circles are GSK’s publications
in clinical medicine (as when moving the mouse over the
Clinical Medicine label) (Reprinted from Rafols et al.
2010 with permission, available at http://idr.gatech.
edu/usermapsdetail.php?id=61) . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 312
Fig. 8.24 A similarity map of JCR journals shown in VOSViewer .. . . . . . . . . . 313
Fig. 8.25 The Blondel clusters in the citing journal map (left)
and the cited journal map (right). The overlapping
polygons suggest that the spatial layout and the
membership of clusters still contain a considerable
amount of uncertainty. Metrics calculated based on the
coordinates need to take the uncertainty into account .. . . . . . . . . . . . . 314
Fig. 8.26 Citation arcs from the publications of Drexel’s iSchool
(blue arcs) and Syracuse School of Information
Studies (magenta arcs) reveal where they differ in
terms of both intellectual bases and research frontiers . . . . . . . . . . . . . 315
Fig. 8.27 h-index papers (cyan) and citers to CiteSpace (red) .. . . . . . . . . . . . . . . 315
Fig. 9.1 A screenshot of GeoTime (Reprinted from Eccles et al. 2008) . . . . 322
Fig. 9.2 CiteSpace labels clusters with title terms of articles
that cite corresponding clusters .. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 323
Fig. 9.3 Citations over time are shown as tree rings. Tree rings
in red depict the years in which an accelerated citation rate was
detected (citation burst). Three areas emerged from the
visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 324
Fig. 9.4 A network of 12,691 co-cited references. Each year top
2,000 most cited references were selected to form the
network. The same three-cluster structure is persistent
at various levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 325
Fig. 9.5 The document view in Jigsaw . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 326
Fig. 9.6 The list view of Jigsaw, showing a list of authors, a
list of concepts, and a list of index terms. The input
documents are papers from the InfoVis and VAST conferences . . . 327
Fig. 9.7 A word tree view in Jigsaw . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 328
Fig. 9.8 Tablet in Jigsaw provides a flexible workspace to
organize evidence and information .. . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 328
Fig. 9.9 Carrot’s visualizations of clusters of text documents.
Top right: Aduna cluster map visualization; lower
middle: circles visualization; lower right: Foam Tree
visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 329
Fig. 9.10 Left: The geographic layout of the Western Power
Grid (WECC) with 230 kV or higher voltage. Right:
a GreenGrid layout with additional weights applied
to both nodes (using voltage phase angle) and links
(using impedance) (Reprinted from Wong et al. 2009
with permission).. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 330
Fig. 9.11 A screenshot of ASE (Reprinted from Dunne et al.
2012 with permission) . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 332
Fig. 9.12 An ultimate ability to reduce the vast volume of
scientific knowledge in the past and a stream of new
knowledge to a clear and precise representation of a
conceptual structure .. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . 337
Fig. 9.13 A fitness landscape of scientific inquiries . . . . . .. . . . . . . . . . . . . . . . . . . . 337
Chapter 1
The Dynamics of Scientific Knowledge
Science is what you know, philosophy is what you don’t know.
— Bertrand Russell (1872–1970)
Scientific knowledge changes all the time. Most of the changes are incremental, but
some are revolutionary and fundamental. There are two kinds of contributions to
the body of scientific knowledge: persistent and long-lasting ones versus transient
and fast-moving ones. Once widely known theories and interpretations may be
replaced by new theories and new interpretations. Scientific frontiers consist of
the current understanding of the world and the current set of questions that the
scientific community is addressing. Scientific frontiers are not only where one
would expect to find the cutting-edge knowledge and technology of humankind,
but also where one finds unsolved mysteries, controversies, battles, debates, and revolutions.
For example, the bimonthly newsletter Science Frontiers1 digests reports
of scientific anomalies – observations and facts that do not quite fit into
prevailing scientific theories. This is where the unknown manifests itself in all sorts
of ways.
In this book, we will start with what is known about the structure and dynamics
of scientific knowledge and how information and computational approaches can
help us develop a good understanding of the complex and evolving system. We will
also trace the origin of some of the most fundamental assumptions that underlie
the state of the art in science mapping, interactive visual analytics and quantitative
studies of science. This is not a technical tutorial; instead, our focus is on principles
of visual thinking and the ways that may vividly reveal the dynamics of scientific
frontiers at various levels of abstraction.
1 http://www.science-frontiers.com/
1.1 Scientific Frontiers
The pioneering innovators in the study of invisible colleges were Nick Mullins,
Susan Crawford, and other sociologists of science. In 1972, Diana Crane argued that
scientific knowledge is diffused through invisible colleges (Crane 1972). The prob-
lems of scientific communication can be understood in terms of interaction between
a complex and volatile research front and a stable and much less flexible information
communication system. The research front creates new knowledge; the formal
communication system evaluates it and disseminates it beyond the boundaries of
the research area that produced it. The research front is continually evolving and
updating its own directions. These dynamics make it challenging for anyone to
keep abreast of the current state of a research area solely through scholarly articles
circulated in the formal communication system. Research in information science
and scholarly communication has shown that when scientists experience difficulties
in finding information through formal communication channels, a common reason
is the lack of a broader context of where a particular piece of information belongs
in a relatively unfamiliar area.
Philosophy of science and sociology of science, two long established fields of
studies, provide high-level theories and interpretations of the dynamics of science
and scientific frontiers. In contrast, scientometrics is the quantitative study of
science. Its goal is to identify and make sense of empirical patterns that can shed
light on how science functions. Typically, scientometric studies have relied on
scientific literature, notably Thomson Reuters’ Web of Science, Elsevier’s Scopus,
and Google Scholar, patents, awards made by federal government agencies, and,
more recently, social media sources such as Twitter.
Mapping scientific frontiers aims to externalize the big picture of science. Its
origin can be easily traced back to the pioneering work of Eugene Garfield on
the historiography of citation, Belver Griffith and Henry Small on document co-citation
analysis, and Howard White on author co-citation analysis. Today, researchers have
many more options of science mapping software than just 5 years ago. Many of
the major science mapping software applications are freely accessible. Notable
examples include our own software CiteSpace (2003), the Science of Science Tool
(SCI2) (2009) from Indiana University, VOSViewer from the Netherlands (2010),
and SciMAT (2012) from Spain. If we could pick only one software system that has made the
most substantial contribution to the widespread interest in network visualization, I
would choose Pajek. It was probably the first freely available software system for
visualizing large-scale networks. It has inspired many subsequent efforts towards
the development and maintenance of science mapping software tools. Although a new
generation of systems such as Gephi offers various new features, Pajek has earned
a unique position in giving many researchers the first taste of visualizing a large
network.
Mapping scientific frontiers takes more than presenting an intuitively designed
and spectacularly rendered big picture of science. A key question is how one
can identify information that is not only meaningful, but also actionable. The
maturing field of visual analytics provides a promising direction of pursuit. Visual
analytics can be seen as the second generation of information visualization. The
first-generation information visualization aims to gain insights from data that
may not readily lend itself to an intuitive visual and spatial representation. The
second-generation visual analytics makes it explicit that the goal is to support
evidence-based reasoning and decision making activities.
Due to its very nature, science mapping involves a broad variety of units
of analysis at different levels of granularity. The notions of macro-, meso-, and
microscopic levels can be helpful to clarify these units, although the precise
definition of these levels themselves is subject to a debate. For example, at a
macroscopic level, we are interested in the structure and dynamics of a discipline
and the entirety of a scientific community; we may even want to study how multiple
disciplines interact, for example, in the study of interdisciplinarity. Importantly,
many studies have suggested that interdisciplinary activities may play an essential
role in the development of science. Boundary-spanning activities in general may
indeed hold the key to scientific creativity.
At a lower level of aggregation, the meso scale often refers to a system of groups.
In other words, the unit of analysis at this level is the group. The existence of this
level implies that the macro, societal level is heterogeneously, rather than evenly,
distributed. Science mapping at this level corresponds to the study of
paradigms, including a thematic thread of research that could rise and fall over
time. At the even lower microscopic level, the units of analysis include individual
scientists and particular approaches to specific topics and solutions to specific
problems.
Scientific literature provides a wide range of options for researchers to choose
their units of analysis. For example, subject categories in the Web of Science have
been used to represent sub-disciplines. Cited references have been used to indicate
concept symbols. Word occurrence patterns have been used to model underlying
topics. What is special about the new generation of science mapping efforts is the
realization of the profound role of handling large-scale and heterogeneous sources
of streams of data so that one can explore the complexity of the development of
scientific knowledge from a broad range of perspectives. This realization in turn
highlights the significance of studying how distinct perspectives interact with each
other and how we may improve our understanding of science in terms of hindsight,
insights, and foresights.
The remarkable progress in science mapping in recent years is one of a
series of revivals of what was pioneered in the 1960s and 1970s. The most seminal
works in information science include the contribution of Derek de Solla Price
(1922–1983), namely his Networks of Scientific Papers (Price 1965), Little Science,
Big Science (Price 1963), and Science since Babylon (1961). In Little Science, Big
Science, Price raised a series of questions that subsequently inspired generations
of researchers in what is now known as the science of science: Why should we
not turn the tools of science on science itself? Why not measure and generalize,
make hypotheses, and derive conclusions? He used the metaphor of studying the
behavior of gas in thermodynamics to illustrate how the science of science could
improve our understanding of science. Thermodynamics studies the behavior of
gas under various conditions of temperature and pressure, but the focus is not on
the trajectory of a specific molecule. Rather, the focus is on the entirety of the
structure and dynamics of a complex adaptive system as a whole. Price suggested
that we should apply the same kind of rigorous scientific inquiries and data-driven
investigations to science itself: the volume of the body of scientific knowledge as a
whole, the trajectory of “molecules” over the landscape of science, the way in which
these “molecules” interact with each other, and the political and social properties of
this “gas.”
Today we take “the exponential growth of scientific literature” for granted.
It was Price who pointed out this empirical law. In addition, he identified several
remarkable features and drew a number of powerful conclusions. The empirical law
holds true with high accuracy over long periods of time. The growth is surprisingly
rapid, however it is measured. He estimated, among other things, that the number of
international telephone calls would double in 5 years, the number of scientific
journals would double in 15 years, and the number of universities would double in
20 years (the sketch following the excerpt below converts these doubling times into
annual growth rates). He was convinced that it is so far-reaching that it should become the
fundamental law of any analysis of science. Following his “gas” metaphor, he used
the notion of invisible colleges to describe the way in which “molecules” in science
interact with each other. Here is an excerpt from his Little Science, Big Science on
invisible colleges:
We tend now to communicate person to person instead of paper to paper. In the most
active areas we diffuse knowledge through collaboration. Through select groups we seek
prestige and the recognition of ourselves by our peers as approved and worthy collaborating
colleagues. We publish for the small group, forcing the pace as fast as it will go in a process
that will force it harder yet. Only secondarily, with the inertia born of tradition, do we
publish for the world at large (Price 1963, p. 91).
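Price's doubling times follow directly from the exponential growth law N(t) = N0 * 2^(t/T), where T is the doubling time. The following minimal sketch converts each doubling time into the annual growth rate it implies (the doubling times are Price's estimates quoted above; the function name and layout are illustrative, not Price's own calculations):

    # Convert a doubling time T (in years) into the annual growth rate
    # implied by the exponential growth law N(t) = N0 * 2**(t / T).
    def annual_growth_rate(doubling_time_years: float) -> float:
        return 2 ** (1.0 / doubling_time_years) - 1.0

    # Price's estimated doubling times, as quoted above.
    estimates = {
        "international telephone calls": 5,
        "scientific journals": 15,
        "universities": 20,
    }

    for quantity, years in estimates.items():
        print(f"{quantity}: doubles in {years} years, "
              f"about {annual_growth_rate(years):.1%} per year")

Running the sketch shows that a 5-year doubling time corresponds to roughly 14.9 % growth per year, a 15-year doubling time to about 4.7 %, and a 20-year doubling time to about 3.5 %.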
A popular design metaphor that has been adopted for science mapping is the
notion of an abstract landscape with possible contours to highlight virtual valleys
and peaks. Similar landscape metaphors appeared in many earlier designs of
information visualization. What comes naturally with such metaphors is the notion
of exploration and navigation. Landmarks such as peaks of mountains are used
to attract an explorer's attention. If the shape of the landscape matches the
salient properties of the system that underlies the landscape, then exploring the
system becomes an intuitive and enjoyable navigation through the landmarks that
can be found effortlessly. Many of the earlier information visualization systems
capitalized on the assumption that the higher the probability that an event will occur,
the more important it is for the user to be able to find it easily. In contrast, users are
less motivated to visit valleys, or pay attention to events that tend to be associated
with low probabilities. For example, main-stream systems often emphasize high-
frequency topics and highlight prominent authors as opposed to low-frequency
outliers.
A different but probably equally thought-provoking analog may come from
evolutionary biology. Charles Darwin’s natural selection is now a household term
that describes the profound connection between fitness and survival. The notion
of a fitness landscape provides an intuitive and yet sustainable framework for a
broad range of analytic studies concerning situational awareness, gap analysis,
portfolio analysis, discovery processes, and strategic planning. Traversals on a
fitness landscape characterize an optimization process. The traveler’s ultimate goal
is to move to the point where the fitness reaches a global maximum. The fitness
landscape paradigm has a great potential not only in biology but also in many other
disciplines. It may help us address many common questions such as: where are
we in a disciplinary context? What would be the necessary moves for us to reach
our destination? Is it possible to find a path of consecutive improvements? To what
extent do we need to accept short-term losses in order to maximize the ultimate
gain? Visual analytics provides a promising platform to address these questions.
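To see the difficulty in concrete terms, consider a deliberately toy-sized sketch of greedy hill-climbing on a one-dimensional fitness landscape with two peaks (everything here – the landscape function, the step size, the names – is an illustrative assumption, not a model drawn from the visual analytics literature). A path of consecutive improvements reaches the nearest peak, which need not be the global maximum; escaping it would require accepting short-term losses:

    import math

    # A toy fitness landscape: a local peak near x = 2 and a higher,
    # global peak near x = 8.
    def fitness(x: float) -> float:
        return math.exp(-(x - 2) ** 2) + 2 * math.exp(-0.5 * (x - 8) ** 2)

    # Greedy search: move to a neighboring point only when fitness improves.
    def hill_climb(x: float, step: float = 0.05, max_iters: int = 10000) -> float:
        for _ in range(max_iters):
            best = max((x - step, x + step, x), key=fitness)
            if best == x:  # no improving neighbor: a (possibly local) peak
                return x
            x = best
        return x

    print(hill_climb(1.0))  # stalls near the local peak at x = 2
    print(hill_climb(6.0))  # climbs to the global peak at x = 8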
Manfred Kochen urged every information scientist to read Science since Babylon
because it lays the foundations of possible paradigms in information science (Kochen
1984). Sociologist Robert Merton and information scientist Eugene Garfield re-
garded Networks of Scientific Papers as the most important contribution of Derek
Price to information science; it pioneered the use of citation patterns of the
publications in scientific literature for the study of the contents and perimeters of
research fronts in science. Particularly related to the theme of mapping scientific
frontiers, Price was a pioneer in proposing that citation study can establish
a conceptual map of current scientific literature. Such topography of scientific
literature should indicate the overlap and relative importance of journals, authors,
or individual papers by their positions within the map.
Generations of information scientists as well as scientists in general have been
influenced by works in the philosophy and the history of science, in particular,
by Thomas Kuhn's structure of scientific revolutions (Kuhn 1962), Paul Thagard's
conceptual revolutions (Thagard 1992), and Diana Crane’s invisible colleges (Crane
1972). The notion of tracking scientific paradigms originated in this influence. Two
fruitful strands of efforts are particularly worth noting here. One is the work of
Eugene Garfield and Henry Small at the Institute for Scientific Information (ISI)
in mapping science through citation analysis. The other is the work of Michel
Callon and his colleagues in tracking changes in scientific literature using the famous
co-word analysis. In fact, their co-word analysis is designed for a much wider
scope – scientific inscriptions, which includes technical reports, lecture notes,
grant proposals, and many others as well as publications in scholarly journals and
conference proceedings. More detailed analysis of these examples can be found in
later chapters. The new trend today focuses on the dynamics of scientific frontiers
more specifically. What are the central issues in a prolonged scientific debate? What
constitutes a context in which a prevailing theory evolves? How can we visualize the
process of a paradigm shift? Where are the rises and falls of competing paradigms in
the context of scientific frontiers? What are the most appropriate ways to visualize
scientific frontiers?
At the center of this revived trend of measuring and studying science as a
whole, mapping scientific frontiers is undergoing an unprecedented transformation.
To apply science on science itself, we need to understand the nature of scientific
activities, the philosophy and the sociology of science. Our journey will start with
the so-called visualism in science, which says what contemporary scientists have
been doing in their daily work is, in essence, to visualize, to interpret, and to explain
(Ihde 1998). What is the metaphor that we can use to visualize scientific frontiers?
Our quest for knowledge domain visualization moves from the mapping of terrestrial and
celestial phenomena in the physical world, through the cartography of conceptual maps and
intellectual structures of scientific literature, to static snapshots and longitudinal
maps featuring the dynamics of scientific frontiers.
There are three simplistic models of how scientific knowledge grows. The most
common one is a cumulative progression of new ideas developing from antecedent
ideas in a logical sequence. Hypotheses derived from theory are tested against
empirical evidence and either accepted or rejected. There is no ambiguity in the
evidence and consequently no disagreement among scientists about the extent to
which a hypothesis has been verified. Many discussions of the nature of scientific
method are based on this model of scientific growth.
An alternative model is that the origins of new ideas come not from the most
recent developments but from any previous development whatever in the history of
the field. In this model, there is a kind of random selection across the entire history
of a cultural area. Price (1965) argues that this kind of highly unstructured growth
is characteristic of the humanities.
The first of these models stresses continuous cumulative growth, the second its
absence. Another type of model includes periods of continuous cumulative growth
interspersed with periods of discontinuity. A notable representative is Kuhn's theory
of scientific revolutions. In Kuhn’s terminology, periods of cumulative growth are
normal science. The disruption of such cumulative growth is characterized by crisis
or revolution.
1.1.1 Competing Paradigms
One of the most influential works in the twentieth century is the theory of the
structure of scientific revolutions by Thomas Kuhn (1922–1996) (1962). Before
Kuhn’s structure, philosophy of science had been dominated by what is known as
the logical empirical approach. Logical empiricism uses modern formal logic
to investigate how scientific knowledge could be connected to sense experience.
It emphasizes the logical structure of science rather than its psychological and
historical development.
Kuhn argued that logical empiricism cannot adequately explain the history
of science. He claimed that the growth of scientific knowledge is characterized by
revolutionary changes in scientific theories. According to Kuhn, most of the time
scientists are engaged in a stage of an iterative process – normal science. The stage
of normal science is marked by the dominance of an established framework, or
paradigm. The majority of scientists would work on specific hypotheses within
such paradigms. The foundation of a paradigm largely remains unchallenged until
new discoveries cast more and more doubts, or, anomalies, over the foundation.
As more and more anomalies build up, scientists begin to examine basic assump-
tions that have been taken for granted. This re-examination marks a period of crises.
To resolve such crises, radically new theories with greater explanatory power may
replace the current paradigms that are in trouble. Such replacements are often
view-changing in nature; they are revolutionary and transformative. As the
new paradigm is accepted by the scientific community, science enters another period
of normal science. Scientific revolutions, as Kuhn claimed, are an integral part of
science and science progresses through such revolutionary changes. Although the
most common perception of the paradigm shift theory implies the rarity and severity
of such change, such view-changing events are much more commonly found at
almost all levels of science, from topics, fields of study, to disciplines.
Kuhn characterized the structure of scientific revolutions in terms of the
dynamics of competing scientific paradigms. His theory provides deep insights into
the mechanisms that operate at macroscopic levels and offers ways to explain the
history of science in terms of the tension between radical changes and incremental
extensions. The revolutionary transformation of science from one paradigm to
another – a paradigm shift – is one of the most widely known concepts not only in
scientific communities but also to the general public. The Copernican revolution is
a classic example of a paradigm shift. It marked the change from the geo-centric
to the solar-centric view of our solar system. Another classic example is Einstein’s
general relativity, which took over the authoritative place of Newtonian mechanics
and became the new predominant paradigm in physics.
Stephen Toulmin (1922–2009), a British philosopher of science, suggested a
“Darwinian” model of scientific disciplines: the more disciplines there are in which
a given theory is applicable, the more likely the theory will survive. A similar point
is made by a recent study of the value of ideas in a quite different context (Kornish
and Ulrich 2011), which found that more valuable ideas tend to connect many different
topics.
Although Kuhn’s theory has been broadly received, philosophers criticized it in
several ways. In particular, the notion of incommensurability between competing
paradigms was heavily criticized. Incommensurability refers to the communicative
barrier between different paradigms; it can be taken as a challenge to the possibility
of a rational evaluation of competing paradigms using external standards. If that were
the case, the argument would lead to the irrationality of science.
Margaret Masterman (1970) examined Kuhn’s discussion of the concept of
paradigms and found that Kuhn’s definitions of a paradigm can be separated into
three categories:
1. Metaphysical paradigms, in which the crucial cognitive event is a new way of
seeing, a myth, a metaphysical speculation
2. Sociological paradigms, in which the event is a universally recognized scientific
achievement
3. Artifact or construct paradigms, in which the paradigm supplies a set of tools
or instrumentation, a means for conducting research on a particular problem, a
problem-solving device.
She emphasized that the third category is most suitable to Kuhn’s view of
scientific development. Scientific knowledge grows as a result of the invention of
a puzzle-solving device that can be applied to a set of problems producing what
Kuhn has described as “normal science.”
In this book, we will focus on puzzle-solving examples in this category. For
example, numerous theories have been proposed to explain what caused the
extinction of dinosaurs 65 million years ago; scientists are still debating this
topic. Similarly, scientists remain uncertain about what causes brain diseases in
sheep, cattle, and humans. These topics share some common characteristics:
• interpretations of available evidence are controversial
• conclusive evidence is missing
• the current instruments are limited
Mapping the dynamics of competing paradigms is an integral part of our quest
for mapping scientific frontiers. We will demonstrate some intriguing connections
between Kuhn’s view on paradigm shifts and patterns identified from scholarly
publications.
Information scientists are concerned with patterns of scientific communication
and the intellectual structures of scientific disciplines. Since the 1970s, information
scientists have looked for signs of competing paradigms in the scientific literature,
for example, a rapid change of research focus within a short period of time. In
1974, Henry Small and Belver Griffith were among the first to address issues
concerning identifying and mapping specialties from the structure of scientific
literature by using co-citation patterns as a grouping mechanism (Small and
Griffith 1974). In a longitudinal study of collagen research published in 1977,
Small demonstrated how collagen research underwent rapid changes of focus
at a macroscopic level (Small 1977). He used data from the Science Citation
Index (SCI) to group documents together based on how tightly they were co-
cited in subsequently published articles. Groupings of co-cited documents were
considered a representation of leading specialties, or paradigms. Small used the
multidimensional scaling technique to map highly cited articles each year in clusters
on a two-dimensional plane. An abrupt disappearance of a few key documents from
the leading cluster in one year and a rapidly increased number of documents in the
leading cluster in the following year indicate an important type of specialty change –
a rapid shift in research focus – which is an indicator of "revolutionary" change.
We can draw some useful insights from studies of thematic maps of geographic
information. For example, if people study a geographic map first and read relevant
text later, they can remember more information from the text (Rittschof et al.
1994). Traditionally, a geographic map shows two important types of information:
structural and feature information. Structural information helps us locate individual
landmarks on the map and determine spatial relations among them. Feature
information refers to detail, shape, size, color, and other visual properties used to
depict particular items on a map. When people study a map, they first construct
a mental image of the map’s general spatial framework and add the landmarks
into the image subsequently (Rittschof et al. 1994). The mental image integrates
information about individual landmarks in a single, relatively intact piece, which
allows rapid and easy access to the embedded landmarks. In addition, the greater the
integration of structural and feature information in the image, the more intact the
image is. It is much easier to find landmark information in an intact image. Once
landmark information is found, it can help to retrieve further details. If we visualize
a paradigm as a cluster of highly cited landmark articles and combine citation and
co-citation into the same visualization model, then users are likely to construct an
intact image of a network of top-sliced articles from the chosen subject domain.
Paul Thagard (1992) proposed a computational approach to the study of con-
ceptual revolutions. The primary purpose is to clarify the structural characteristics
of conceptual systems before, during, and after conceptual revolutions. He placed
his own approach between the formal approaches of logical empiricism and Kuhn’s
historical ones.
Tracking the dynamics of competing paradigms requires us to focus on a
paradigm as the unit of analysis. Visualized interrelationships between individual
publications in the literature must be explained in a broader context of a scientific
inquiry. We need to consider how individual publications contribute to the devel-
opment of a scientific debate. We need to consider how we could differentiate
paradigms. In this book, we will pay particular attention to how information
visualization and visual analytics techniques may help us track the development
of competing paradigms.
Gestalt psychology holds that the mind is holistic. We see the entirety of an
object before we attend to its parts, and the whole is greater than the sum of its
parts. In terms of information theory, the way that individual parts form the whole
gives us additional information about the system as a whole. Norwood Russell
Hanson (1924–1967) argued in his Patterns of Discovery (1958) that what we see is
influenced by our existing preconceptions.
Kuhn further developed the view that a gestalt switch is involved in scientific
discovery and explained the nature of a paradigm shift in terms of a gestalt
switch. Kuhn cited an experiment in which psychologists showed participants
ordinary playing cards at brief exposures and demonstrated that our perceptions are
influenced by our expectations. For example, it took much longer for participants
to recognize unanticipated cards such as black hearts or red spades than to recognize
expected ones. Kuhn quoted one comment: "I can't make the suit out, whatever it
is. It didn’t even look like a card that time. I don’t know what color it is now or
whether it’s a spade or heart. I’m not sure I even know what a spade looks like. My
God!”
To Kuhn, such dramatic shifts in perception also explain what scientific com-
munities experience in scientific revolutions. When Johannes Kepler (1571–1630)
abandoned the universe of perfect circles, he must have experienced some similar
holistic change. Empirical evidence is central to Kuhn's view. Before a paradigm
shift can take place, anomalies have to accumulate. But why did anomalies trigger
a Gestalt switch in the minds of Kepler or Einstein but not in others? And how did
others then become convinced to adopt the new paradigms?
1.1.2 Invisible Colleges

How do scientific communities accept new scientific contributions as part of the
scientific knowledge? Diana Crane addresses this issue in her “Invisible Colleges:
Diffusion of Knowledge in Scientific Communities” (Crane 1972). She emphasizes
the role of an invisible college. An invisible college is a small network of highly
productive scientists. They share the same field of study, communicate with one
another and monitor the rapidly changing structure of knowledge in their field.
Crane suggests that such an invisible college is responsible for the growth of
scientific knowledge.
Crane demonstrates that research in basic science tends to follow a similar growth
pattern, starting with slow growth, followed by exponential growth, then linear
growth, and finally a gradual decline. These stages correspond to a series of
changes in the scientific community. The activities of invisible colleges produce a
period of exponential growth in publications and expand the scientific community
by attracting new members.
The way an invisible college functions is rather difficult to grasp. A member of
an invisible college could be connected with a large number of individuals. More
interestingly, it has been observed that members of an invisible college seem to
play a unifying role: many otherwise unconnected researchers outside the invisible
college become connected through it.
Several studies have demonstrated the presence of an invisible college, or a net-
work of core productive scientists linking otherwise isolated groups of researchers
in a research area. For a scientist, one way to maintain an outstanding productivity
is to apply “the same procedure, task, or pieces of equipment over and over,
introducing new variables or slight modifications of old variables” (McGrath and
Altman 1966).
The continuous expansion of the amount of data and information makes it more
and more challenging for a scientist to locate the right information for his or her
research. The scientist is unlikely to have access to all the potentially relevant
information, and it is probably not necessary anyway. One problem, however,
concerns where to devote one's effort. Should we seek information within our
own research field or reach out to a different research field, or even a different
discipline? On the one hand, searching for information within an area that we are
familiar with would be much easier than searching outside the area. We would
already know where the major landmarks are and we would be good at picking
up various clues from the environment efficiently. On the other hand, we probably
won't be able to find much that we do not already know within the same
area where we have spent most of our time. Searching outside our home area would
be more challenging and risky, but there is a better chance to find something that
we do not know. Conceptually, searching in our home area is seen as a local search,
whereas searching in a distant area can be seen as making a long jump. One may
also wonder how long we should stay within the same area of research and when
would be the best time to move on to a new area. One way to make a decision in
such a situation is to estimate whether it would be worthwhile to take the perceived
risk considering the possible reward one may expect. Research in optimal foraging
and evolutionary biology is a good source of inspiration.
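As one illustration of that inspiration, the sketch below encodes the patch-leaving rule of optimal foraging theory (the marginal value theorem) in information-seeking terms: stay in a research area while its instantaneous information gain exceeds the long-run average rate, which must account for the cost of a long jump. The gain function and all the numbers are illustrative assumptions, not an established model of scientists' behavior.

```python
import math

def gain(t, initial_rate=10.0, depletion=0.5):
    """Cumulative information gained after t units of time in one research area.

    Diminishing returns: g(t) = (r0 / d) * (1 - exp(-d * t)).
    """
    return (initial_rate / depletion) * (1 - math.exp(-depletion * t))

def marginal_rate(t, dt=1e-4):
    """Instantaneous rate of gain, approximated by a finite difference."""
    return (gain(t + dt) - gain(t)) / dt

def optimal_leaving_time(jump_cost=2.0, dt=0.01):
    """Leave the area when the marginal gain drops to the overall average rate,
    where the average rate includes the cost of jumping to a distant area."""
    t = dt
    while marginal_rate(t) > gain(t) / (t + jump_cost):
        t += dt
    return t

# The more costly the jump to a new area, the longer it pays to stay put.
print(round(optimal_leaving_time(jump_cost=1.0), 2))
print(round(optimal_leaving_time(jump_cost=4.0), 2))
```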

1.1.3 Conceptual Revolutions

Thomas Kuhn’s theory is sociologically and historically motivated. In contrast, Paul
Thagard’s computational approach to conceptual revolutions focuses on the logic
(Thagard 1992). Conceptual revolutions replace a whole system of concepts and
rules with a new system. Thagard points out that there has been little detailed
explanation of such changes, although historians and philosophers of science
have noted the importance of scientific revolutions. Thagard focused on questions
concerning how exactly a conceptual revolution takes place.
Thagard argued that the ultimate acceptance of a scientific theory essentially
depends on the explanatory coherence of the theory. If a theory with fewer
assumptions can explain more phenomena than an alternative theory, then the
simpler one is considered superior. Thagard demonstrated the idea with examples
such as the conceptual development of plate tectonics in the latest geological
revolution and Darwin's theory of natural selection. A conceptual revolution may
involve structural and non-structural changes. Thagard illustrated a structural
change with the transition from continental drift to modern theories, and a
non-structural change with the example of how the meaning of the concept of
evolution changed through Darwin's On the Origin of Species.
Accounts of scientific change can be roughly categorized as accretion theories
and gestalt theories. In accretion theories, a new conceptual system is developed by
adding new nodes and links. Kuhn criticized accretion theories of scientific growth.
Kuhn’s Gestalt switch is radically different. If the accretion theories are akin to
a biological evolution through a chain of single mutation instances, then Kuhn’s
Gestalt switch is like an evolution with multiple simultaneous mutations. Different
approaches have different implications. For example, accretion theories would have
difficulty explaining why it would be worthwhile for scientists to accept apparent
setbacks. In the metaphor of searching in a problem space, are we following a
greedy search strategy or are we ready to tolerate short-term losses to maximize
our longer-term goal? Accretion theories are more suitable for describing the initial
growth stages than later stages of decline. Gestalt theories are more suitable for
explaining the dynamics of a system of multiple paradigms. In both cases, more
detailed mechanisms are necessary to account for how a new system is constructed
and how it replaces an old system.
Thagard comes up with such mechanisms by asking the question: what makes a
system stand out? He suggests that we should focus on rules, or mechanisms, that
govern how concepts are connected. For example, we should consider the dynamics
of how concepts are connected in terms of the variation of strengths of links over
time. Adding a link between two concepts can be seen as strengthening an existing
but possibly weak link between the two concepts. Removing an existing link can
be seen as the result of a decay of its strength: the link no longer has a strong
enough presence in the system to be taken into account. Figure 1.1 illustrates how
an old system #1 is replaced by a new system #2 in this manner.

Fig. 1.1 Conceptual change: a new conceptual system #2 is replacing an old one #1

Using this framework,
Thagard identified nine steps to make conceptual changes:
1. Adding a new instance, for example that the blob in the distance is a whale.
2. Adding a new weak rule, for example that whales can be found in the Arctic
Ocean.
3. Adding a strong rule that plays a frequent role in problem solving and explana-
tion, for example that whales eat sardines.
4. Adding a new part-relation, also called decomposition.
5. Adding a new kind-relation, for example that a dolphin is a kind of whale.
6. Adding a new concept, for example narwhale.
7. Collapsing part of a kind-hierarchy, abandoning a previous distinction.
8. Reorganizing hierarchies by branch jumping, that is, shifting a concept from one
branch of a hierarchical tree to another.
9. Tree switching, that is, changing the organizing principle of a hierarchical tree.
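The following toy sketch, an assumption for illustration rather than Thagard's actual program, renders the link-strength dynamic described above: a conceptual system as a weighted graph in which reinforced links grow stronger, unused links decay, and links below a threshold drop out of the system, as in Fig. 1.1.

```python
class ConceptSystem:
    """A conceptual system as a weighted graph of links between concepts."""

    def __init__(self, decay=0.9, threshold=0.1):
        self.links = {}          # (concept_a, concept_b) -> strength
        self.decay = decay       # per-step multiplicative decay
        self.threshold = threshold

    def connect(self, a, b, amount=1.0):
        """Adding a link strengthens a possibly weak existing link."""
        key = tuple(sorted((a, b)))
        self.links[key] = self.links.get(key, 0.0) + amount

    def step(self):
        """One time step: strengths decay; weak links drop out of the system."""
        self.links = {k: s * self.decay for k, s in self.links.items()
                      if s * self.decay >= self.threshold}

system = ConceptSystem()
system.connect("continent", "fixed position")   # old system #1
for _ in range(30):                             # never reinforced: decays away
    system.step()
system.connect("continent", "drift")            # new system #2 takes over
print(system.links)  # {('continent', 'drift'): 1.0}
```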
Branch jumping and tree switching are much rarer events and are associated with
conceptual revolutions. Thagard examined seven scientific revolutions:
1. Copernicus’ solar-centric system of the planets replacing the earth-centric theory
of Ptolemy
2. Newtonian mechanics, synthesizing celestial and earth-bound physics, replacing
the cosmological views of Descartes
3. Lavoisier’s oxygen theory replacing the phlogiston theory of Stahl
4. Darwin’s theory of evolution by natural selection replacing the prevailing view
of divine creation of species
5. Einstein’s theory of relativity replacing and absorbing Newtonian physics
6. Quantum theory replacing and absorbing Newtonian physics
7. The geological theory of plate tectonics that established the existence of conti-
nental drift
Thagard's central claim is that the growth of scientific knowledge is best
explained in terms of explanatory coherence. The power of a new paradigm
must be assessed in terms of its strength in explaining phenomena coherently in
comparison with existing paradigms. He demonstrated how the theory of continental
drift gained its strength in terms of its explanatory coherence.
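A highly simplified sketch in the spirit of Thagard's ECHO program (an assumed toy version, not his implementation) conveys the idea: hypotheses and evidence are nodes, "explains" links excite, "contradicts" links inhibit, and activations settle iteratively until the hypothesis that coherently explains more of the evidence dominates.

```python
# Evidence nodes are clamped at 1.0; only the two hypotheses update.
nodes = {"drift": 0.01, "fixed": 0.01, "e_coastlines": 1.0, "e_fossils": 1.0}
explains = [("drift", "e_coastlines"), ("drift", "e_fossils"),
            ("fixed", "e_fossils")]
contradicts = [("drift", "fixed")]

for _ in range(200):
    updated = dict(nodes)
    for h in ("drift", "fixed"):
        # Excitation from every piece of evidence this hypothesis explains.
        net = sum(0.05 * nodes[e] for hyp, e in explains if hyp == h)
        # Inhibition from every hypothesis it contradicts.
        for a, b in contradicts:
            if h == a:
                net -= 0.06 * nodes[b]
            elif h == b:
                net -= 0.06 * nodes[a]
        # Activation decays slightly each step and is clamped to [-1, 1].
        updated[h] = max(-1.0, min(1.0, nodes[h] * 0.95 + net))
    nodes = updated

print(round(nodes["drift"], 2), round(nodes["fixed"], 2))  # drift settles high
```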
A concept system represents part-of and kind-of relations between conceptual
components at various levels. The continental drift theory is a conceptual revolution
that involved structural changes. The German meteorologist and geophysicist Alfred
Lothar Wegener (1880–1930) was the first to give a complete statement of the
continental drift hypothesis. Early geographers making maps of the South Atlantic
Ocean were probably the first to notice the similarity of the coastlines of South
America and Africa. What would be the most convincing reason to explain the
similarity? Is it possible that the two continents used to be adjacent to each other?
Wegener was impressed with the similarity in the coastlines of eastern South
America and western Africa. He speculated that those lands had once been joined
together. It was not until the early twentieth century, however, that Wegener used the
geography of the Atlantic coastlines, along with geologic and paleontological data,
to suggest that all the continents were once connected in the Late Paleozoic era. He
searched for geological and paleontological evidence that could support his theory.
His search in the literature confirmed that there are indeed many closely related
fossil organisms and similar rock strata on widely separated continents, particularly
between the Americas and Africa. Wegener’s continental drift theory won some
support in the following decade, but his explanation of the driving forces behind the
continents’ movement was not convincing.
Wegener first presented his theory in 1912 and published it in full in 1915 in his
major work Die Entstehung der Kontinente und Ozeane (The Origin of Continents
and Oceans). He proposed that there was a single supercontinent, Pangaea, some
286–320 million years ago, and that it later broke up into the continents we see
today. Other scientists had proposed such a supercontinent but had explained the
appearance of isolated continents as the result of the sinking of large portions of
the supercontinent, with the deeply sunken areas becoming today's Atlantic and
Indian Oceans. In contrast, Wegener proposed that Pangaea broke up into pieces and these
pieces moved slowly over long periods of geologic time and that is why they are
now thousands of miles apart. He described this movement as die Verschiebung der
Kontinente, i.e. continental displacement, which is the core of the continental drift
theory.
Fig. 1.2 Computer-generated "best fit" of the continents. There are several versions of this type of fit map, credited to the British geophysicists E.C. Bullard, J.E. Everett, and A.G. Smith
The matching coastlines of continents around the Atlantic Ocean become
strikingly apparent in computer-fitted maps. The computer fit was made at the
1,000-m (500-fathom) submarine depth contour, which provided the best fit of the
coastlines. Such computer fits find the best result by identifying the depth contour
that minimizes both the overlaps and the gaps between the continents (See Fig. 1.2).
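As a toy illustration of that fitting criterion, and not the actual procedure of Bullard and colleagues, the sketch below scores candidate contours by overlap plus gap area; it assumes the shapely library is installed, and the rectangular "coastlines" are made up.

```python
from shapely.geometry import Polygon

def misfit(west: Polygon, east: Polygon) -> float:
    """Score a fit: overlapping area plus the gap left between the outlines."""
    union = west.union(east)
    overlap = west.intersection(east).area
    gap = union.convex_hull.area - union.area   # crude proxy for gaps
    return overlap + gap

# Hypothetical coastline polygons extracted at two candidate depth contours.
contours = {
    200: (Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
          Polygon([(3, 0), (7, 0), (7, 4), (3, 4)])),   # overlaps by 1 unit
    1000: (Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
           Polygon([(4, 0), (8, 0), (8, 4), (4, 4)])),  # fits edge to edge
}
best = min(contours, key=lambda depth: misfit(*contours[depth]))
print(best)  # 1000 -- the contour minimizing overlap and gap
```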
Wegener’s theory was received well by many European geologists. The English
geologist Arthur Holmes (1890–1965) pointed out that the lack of a driving force
was hardly sufficient grounds to ditch the entire concept. Around 1930, Holmes
proposed the power of convection as a mechanism to explain Wegener’s continental
drift. He suggested that currents of heat and thermal expansion in the Earth’s mantle
could make the continents move toward or away from one another and create new
ocean floor and mountain ranges. Wegener died in 1930; Holmes was a little too
late to support Wegener. On the other hand, Holmes was about 30 years too early
to back up his theory with hard evidence. In hindsight, Holmes had come very
close to the modern understanding of Earth's plates and their dynamics.
The difference between Wegener’s theory and the contemporary conceptual
systems is highlighted in Fig. 1.3. Paul Thagard draws our attention to how
Wegener’s continental drift theory (Fig. 1.4) differs from the conceptual structure of
his opponents (Fig. 1.5). Making conceptual structures explicit helps us understand
the central issues concerning how paradigms compete with each other.

Fig. 1.3 Wegener's conceptual system (top) and the contemporary one (bottom)
Continental drift, along with polar wandering and seafloor spreading, is the
consequence of plate movements. Continental drift is the movement of one continent
relative to another continent. Polar wandering is the movement of a continent
relative to the rotational poles or spin axis of the Earth. Seafloor spreading is the
movement of one block of seafloor relative to another block of seafloor. Evidence
for both polar wandering and continental drift comes from matching continental
coastlines, paleoclimatology, paleontology, stratigraphy, structural geology, and
paleomagnetism. The concept of seafloor spreading is supported by evidence of the
age of volcanic islands and the age of the oldest sediments on the seafloor. It is also
supported by discoveries of the magnetism of the seafloor.
It is obviously a remarkable accomplishment to be able to extract and summarize
the conceptual structure of a scientific theory. Representing it with such a high level
of clarity enables us to focus on conceptual differences between conceptual structures
and to pinpoint their merits and potential. On the other hand, the distilling process
clearly demands the highest level of intellectual analysis and reasoning. It requires
the ability to tease out the most critical information from a vast and growing body of
have a long way to go to be able to turn a body of scientific knowledge into this
type of conceptual structures. Examples demonstrated by Thagard provide a good
reference for us to consider and reflect on what opportunities are opened up by
new generations of science mapping and visual analytics tools and what challenges
remain for us to overcome.
Fig. 1.4 The conceptual structure of Wegener's continental drift theory
1.1.4 TRACES

How long does it take for society to fully recognize the value of scientific
breakthroughs or technological innovations? The U.S. Department of Defense
(DoD) commissioned a study in the 1960s to address this question. The study,
Project Hindsight, was set to search for lessons learned from the development of
some of the most revolutionary weapon systems.

Fig. 1.5 The conceptual structure of Wegener's opponents

A preliminary report of Project
Hindsight was published in 1966. A team of scientists and engineers analyzed
retrospectively how 20 important military weapons came along, including the
Polaris and Minuteman missiles, nuclear warheads, the C-141 aircraft, the Mark 46
torpedo, and the M102 Howitzer. The team of experts identified 686 "research or
exploratory development events" that were essential for the development of the
weapons. Only 9 % were regarded as "scientific research" and 0.3 % as basic research. 9 %
of research was conducted in universities. One of the preliminary conclusions of
Project Hindsight was that basic research commonly found in universities didn’t
seem to matter very much in these highly creative developments. In contrast,
projects with specific objectives appeared to be much more fruitful.
Project Hindsight concluded that projects funded with specific defense purposes
were about one order of magnitude more efficient than projects with the same
amount of funding but without specific defense goals. Project Hindsight further
concluded that:
1. The contributions of university research were minimal.
2. Scientists contributed most effectively when their effort was mission oriented.
3. The lag between initial discovery and final application was shortest when the
scientist worked in areas targeted by his sponsor.
Project Hindsight emphasized mission-oriented research, contract research, and
commission-initiated research. Although these conclusions were drawn from the
study of military weapon development, some of these conclusions found their way
to the evaluation of scientific fields such as biomedical research.
In response to the findings of Project Hindsight, the National Science Foundation
(NSF) commissioned a study called TRACES – Technology in Retrospect and Critical
Events in Science. Project Hindsight looked back 20 years, but TRACES traced
the history of five inventions and their origins back as early as the 1850s. The
five inventions are the contraceptive pill, matrix isolation, the video tape recorder,
ferrites, and the electron microscope.
TRACES identified 340 critical research events associated with these inventions
and classified them into three major categories: non-mission research, mission-
oriented research, and development and application. 70 % of the critical events
belonged to non-mission research, i.e. basic research; 20 % were mission oriented,
and 10 % were development and application. Universities were responsible for 70 %
of non-mission and one third of mission oriented research. For most inventions,
75 % of the critical events occurred before the conception of the ultimate inventions.
Critical research events are not evenly distributed over time. Events in the early
stages are separated by longer periods of time than events in later stages.
The video tape recorder, for example, was invented in the mid-1950s. It took almost
100 years to complete the first 75 % of all relevant milestones, i.e. the critical
research events, but only 10 years for the remaining 25 % of the critical
events to converge rapidly. In particular, the innovation was conceived in the final
5 years.
The invention of the video tape recorder involves six areas: control theory,
magnetic and recording materials, magnetic theory, magnetic recording, electronics,
and frequency modulation (Fig. 1.6). The earliest non-mission research event
appeared in magnetic theory. It was Weber’s early ferromagnetic theory in 1852.
Fig. 1.6 Pathways to the invention of the video tape recorder (© Illinois Institute of Technology)
The earliest mission-oriented research appeared in 1898 when Poulsen used steel
wire for the first time for recording. According to TRACES, the technique was
"readily available but had many basic limitations, including twisting and single track
restrictions." Following Poulsen's work, Mix & Genest was able to develop steel
tape with several tracks around the 1900s, but it was limited by a lack of flexibility
and increased weight.
This line of invention continued with homogeneous plastic tape on the Magnetophon
tape recorder, first introduced in 1935 by AEG. A two-layer tape was
developed by the 1940s. The development of reliable wideband tapes was intensive in
the early 1950s. The first commercial video tape recorder appeared in the late 1950s.
The invention of the electron microscope went through similar stages. The first 75 %
of the research was completed before the point of invention and the translational
period from conception to innovation. The invention of the electron microscope relied on
five major areas, namely, cathode ray tube development, electron optics, electron
sources, the wave nature of electrons, and the wave nature of light. Each area may trace
several decades back to the initial non-mission discoveries. For instance, Maxwell's
electromagnetic wave theory of light in 1864, Roentgen's discovery of the emission
of X-ray radiation in 1895, and Schrödinger's foundation of wave mechanics in
1926 all belong to non-mission research that ultimately led to the invention of
the electron microscope. As a TRACES diagram shows, between 1860 and 1900 there
was no connection across these areas of non-mission research. While the invention
of the electron microscope was dominated by many earlier non-mission activities,
the invention of the video tape recorder revealed more diverse interactions among non-
mission research, mission-oriented research, and development activities.
Many insights revealed by TRACES have implications for today's discussions
and policies concerning peer review and transformative research. Perhaps the most
important lesson learned is the role of basic research, or non-mission research.
As shown in the timeline diagrams of TRACES, an ultimate invention, at least
in all the inventions studied by TRACES, emerged as multiple lines of research
converged. Each line of research was often led by years and even decades of non-
mission research, which was then in turn followed by mission-oriented research and
development and application events.
In other words, it is evident that non-mission researchers are unlikely to foresee
how their work will evolve, and that it is even harder for non-mission research in
one subfield to recognize potential connections with critical developments in other
subfields. Taking these factors together, we can start to appreciate the magnitude of
the conceptual gulf that transformative research has to bridge.

1.2 Visual Thinking

Vision is a unique source for thinking. We often talk about hindsight, insight,
foresight, and oversight. Our attention is first drawn to the big picture, the Gestalt,
before we attend to details (McKim 1980). Visual thinking actively operates on
structural information, not only to see what is inside, but also to figure out how the
parts are connected to form the whole.

1.2.1 Gestalt

The history of science and technology is full of discoveries in which visual thinking
played a critical role. Visual thinking from the abstract to the concrete is a powerful
strategy. In abstraction, the thinker can readily restructure or even transform a concept.
Then the resulting abstraction can be represented in a concrete form and tested in
reality. When abstract and concrete ideas are expressed in graphic form, the abstract-
to-concrete thinking strategy becomes visible.
Just as everyone looking at Leonardo da Vinci's Mona Lisa probably sees a
"Mona Lisa" quite different from what others see, individual perceptual ability
can be vital in science: not only does it often distinguish an expert from a
novice, but it can also determine whether one catches a passing chance of discovery. The
English novelist and essayist Aldous Huxley (1894–1963) wrote: “The experienced
microscopist will see certain details on a slide; the novice will fail to see them.
Walking through a wood, a city dweller will be blind to a multitude of things which
the trained naturalist will see without difficulty. At sea, the sailor will detect distant
objects which, for the landsman, are simply not there at all.” A knowledgeable
observer sees more than a less knowledgeable companion because he or she has
a richer stock of memories and expectations to draw upon to make sense of what is
perceived.
Fig. 1.7 Alexander Fleming's penicillin mould, 1935 (© Science Museum, London)

Discoveries in the direct context of seeing are common in the history of
science. When Sir Alexander Fleming (1881–1955) noticed that the colonies of
staphylococci around one particular colony had died, he seized the small window
of opportunity created by the unexpected observation, which led to the discovery of
penicillin. Many bacteriologists would not have thought this particularly remarkable,
for it had long been known that some bacteria interfere with the growth of others.
Figure 1.7 is a photograph of Fleming's penicillin mould.
The German chemist August Kekulé von Stradonitz (1829–1896) made one of the most
important discoveries of organic chemistry, the structure of the benzene ring. Having
pondered the problem for some time, he turned his chair to the fire and fell asleep:
"Again the atoms were gamboling before my eyes … My mental eye … could
now distinguish large structures … all twining and twisting in snake-like motion.
But look! What was that? One of the snakes had seized hold of its own tail, and the
form whirled mockingly before my eyes. As if by a flash of lightning I awoke." The
spontaneous inner image of the snake biting its own tail suggested to Kekulé that
organic compounds, such as benzene, are not open structures but closed rings.
Complex thinking operations often require imagery that is abstract and Gestalt-
like. This is not to say that abstract imagery is more important than concrete imagery;
rather, abstract and concrete imagery are complementary. A flexible visual thinker can
move readily back and forth between the two. Chess, with its 64 squares, requires
complex mental manipulations. Researchers have found that chess masters rarely
see a realistic and detailed memory image of the chessboard. Instead, they com-
monly see a Gestalt-like image made up of strategic groupings. Expert chess players
are able to focus their thinking on higher-level patterns and avoid the distraction of
details that are less relevant to the patterns; they think in relation to abstract sensory
images, not concrete ones (McKim 1980).
Information visualization aims to reveal insights into complex and abstract
information by drawing upon a wide range of perceptual and cognitive abilities of
human beings. Information visualization not only can help us find specific pieces of
information, but also provide a means of recognizing patterns and relationships at
various levels, which in turn can greatly help us prioritize our search strategies.

Fig. 1.8 Minard's map (Courtesy of http://www.napoleonic-literature.com)
Mapping scientific frontiers takes this a step further. The focus is no longer an
isolated body of information. Instead, we are interested in the information conveyed
by holistic patterns at various levels.
The crucial element in visual thinking is a metaphor that can accommodate
the meaning of individual visual-spatial attributes and form a holistic image.
Sometimes the presence of such metaphors is implicit; sometimes the absence of
such metaphors is obvious. As mentioned at the beginning of this chapter, Hermes
is the messenger of the gods and he brings a word from the realm of the wordless.
A message in a bottle is an ancient way of communicating. Human beings have put
all sorts of “messages” in a wide variety of “bottles”, ranging from a bottle in the
ocean to Pioneer’s gold plaque in deep space.

1.2.2 Famous Maps

One picture is worth a thousand words. A classic example is the compelling
storytelling map by Charles Joseph Minard (1781–1870). This famous map depicts
the retreat of Napoleon’s army in 1812. It communicates a variety of information
to the viewer. For example, the size of the French army is shown as the width of
the bands. The army’s location is shown on the two-dimensional map, including the
direction of the movement of the advance (upper band) and retreat (lower band).
The temperature on certain dates during the retreat is shown in association with a
chart below the map (Fig. 1.8).
The size of Napoleon’s army is shown as the width of the band in the map,
starting on the Russian-Polish border with 422,000 soldiers. By the time the army
reached Moscow in September, its size had dropped to 100,000. Eventually
only a small fraction of Napoleon’s army survived. Information visualization is in
general a powerful and effective tool for conveying a complex idea. However, as
shown in the above examples, one may often need to use a number of complementary
visualization methods in order to reveal various relationships.
Edward Tufte presented several in-depth case studies of the role of visual
explanation in making decisions (Tufte 1983, 1990, 1997). In particular, Tufte
demonstrated how visual evidence, if only presented differently, might have saved
the space shuttle Challenger and how John Snow’s map put an end to the 1854
cholera epidemic in London (Tufte 1997). In the Challenger case, the explosion
was due to a leak from a seal component called an O-ring. Pre-launch test data,
however, was presented through an obscure visual representation, and the engineers
failed to convince NASA officials that they should abort the launch. In hindsight,
Tufte redesigned the presentation of the same data and the pattern of O-ring failure
became clear. In another example, Tufte illustrates the role of
visual patterns in resolving a cholera outbreak in London in 1854. John Snow
(1813–1858) is a legendary figure in the history of public health, epidemiology and
anesthesiology. He was able to identify convincing evidence from a spatial pattern
of deaths and narrowed down the cause of the deaths to a specific water pump (See
Fig. 1.9).

Fig. 1.9 Map of Cholera deaths and locations of water pumps (Courtesy of National Geographic)

1.2.3 The Tower of Babel

Many of us are familiar with the story of the Tower of Babel in the Bible.2 Ancient
Mesopotamians believed that mountains were holy places where gods dwelt and
that such mountains were contact points between heaven and earth,
for example, Zeus on Mount Olympus, Baal on Mount Saphon, and Yahweh on
Mount Sinai. But there were no natural mountains on the Mesopotamian plain, so
people built ziggurats instead. The word ziggurat means a “tower with its top in the
heavens.” A ziggurat is a pyramid-shaped structure that typically had a temple at
the top. Remains of ziggurats have been found at the sites of ancient Mesopotamian
cities, including Ur and Babylon.

2 http://www.christiananswers.net/godstory/babel1.html
The story of the Tower of Babel appears in Genesis 11:1–9. The name
Babylon literally means "gate of the gods." The story describes how the people used
brick and lime to construct a tower that would reach up to heaven. According to the story,
the whole earth used to have only one language and a few words. People migrated
from the east and settled on a plain. They said to each other, “Come, let us build
ourselves a city, and a tower with its top in the heavens, and let us make a name
for ourselves, lest we be scattered abroad upon the face of the whole earth.” They
baked bricks and used bitumen as mortar. When the Lord came down to see the city
and the tower, the Lord said, “Behold, they are one people, and they have all one
language; and this is only the beginning of what they will do; and nothing that they
propose to do will now be impossible for them. Come, let us go down, and there
confuse their language, that they may not understand one another’s speech.” So the
Lord scattered them abroad from there all over the earth, and they left off building
the city. Therefore its name was called Babel, because there the Lord confused the
language of all on the earth; and from there the Lord scattered them abroad over the
face of the earth. Archaeologists examined the remains of the city of Babylon and
found a square of earthen embankments some 300 ft on each side, which appears to
be the foundation of the tower. Although the Tower of Babel is gone, a few ziggurats
survived. The largest surviving ziggurat, built in 1250 BC, is found in western Iran.
The Tower of Babel has been a popular topic for artists. Pieter Bruegel
(1525–1569) painted the Tower of Babel in 1563; the painting is now in the
Kunsthistorisches Museum in Vienna (See Fig. 1.10). He painted the tower as an
immense structure occupying almost the entire picture, with microscopic figures,
rendered in perfect detail. The top floors of the tower are in bright red, whereas
the rest of the brickwork has already started to weather. Maurits Cornelis Escher
(1898–1972) was also intrigued by the story. In his painting in 1928, people were
building the tower when they started to experience the confusion and frustration of
the communication breakdown caused by the language barrier (See Fig. 1.11).

Fig. 1.10 The Tower of Babel (1563) by Pieter Bruegel. Kunsthistorisches Museum Wien, Vienna. (Copyright free, image is in the public domain)

1.2.4 Messages to the Deep Space

The moral of the Tower of Babel story in this book is the vital role of our language.
Consider the following examples and examine the basis of our communication that
we have been taking for granted. Space probes Pioneer and Voyager are travelling
into deep space with messages designed to reach some intelligent forms in a few
million years. If aliens do exist and eventually find the messages on the spacecraft,
will they be able to understand? What are the assumptions we make when we
communicate our ideas to others?
Pioneers 10 and 11 both carried small metal plaques identifying their time and
place of origin for whatever intelligent forms might find them in the distant future.
NASA placed a more ambitious message aboard Voyager 1 and 2 – a kind of time
capsule – to communicate a story of our world to extraterrestrials.
Fig. 1.11 The Tower of Babel by Maurits Escher (1928)
Pioneer 10 was launched in 1972. It is now one of the most remote man-
made objects. Communication was lost on January 23, 2003, when it was 80 AU3
from the Sun – about 12 billion kilometers, or 7.5 billion miles, away. Pioneer
10 was headed towards the constellation of Taurus (The Bull). It will take Pioneer
was launched in 1973. It is headed toward the constellation of Aquila (The Eagle),
Northwest of the constellation of Sagittarius. Pioneer 11 may pass near one of the
stars in the constellation in about 4 million years.

3 Astronomical Unit: one AU is the distance between the Earth and the Sun, about 150 million kilometers (93 million miles).
According to “First to Jupiter, Saturn, and Beyond” (Fimmel et al. 1980), a group
of science correspondents from the national press were invited to see the spacecraft
before it was to be shipped to Kennedy Space Center. One of the correspondents,
Eric Burgess, visualized Pioneer 10 as mankind’s first emissary beyond our Solar
System. This spacecraft should carry a special message from mankind, a message
that would tell any finder of the spacecraft, a million or even a billion years from
now, that planet Earth had evolved an intelligent species that could think beyond its
own time and beyond its own Solar System. Burgess and another correspondent,
Richard Hoagland, approached the Director of the Laboratory of Planetary Studies at
Cornell University, Dr. Carl Sagan. A short while earlier, Sagan had been involved
in a conference in the Crimea devoted to the problems of communicating with
extraterrestrial intelligence. Together with Dr. Frank Drake, Director of the National
Astronomy and Ionosphere Center at Cornell University, Sagan designed a type of
message that might be used to communicate with an alien intelligence.
Sagan was enthusiastic about the idea of a message on the Pioneer spacecraft.
He and Drake designed a plaque, and Linda Salzman Sagan prepared the artwork.
They presented the design to NASA, and it was accepted for the spacecraft. The
plaque design was etched into a gold-anodized aluminum plate 15.25 by 22.8 cm
(6 by 9 in.) and 0.127 cm (0.05 in.) thick (See Fig. 1.12).

Fig. 1.12 The gold-plated aluminum plaque on the Pioneer spacecraft, showing the figures of a man and a woman to scale next to a line silhouette of the spacecraft
This plate was attached to the antenna support struts of the spacecraft in a position
where it would be shielded from erosion by interstellar dust. The bracketing bars on
the far right represent the number 8 in binary form (1000), where a unit of one is
defined by the spin-flip radiation transition of a hydrogen atom from electron state
spin up to spin down, which gives a characteristic radio wavelength of 21 cm
(8.3 in.). Therefore, the woman is 8 × 21 cm = 168 cm, or about 5 ft 6 in. tall.
The bottom of the plaque shows schematically the path that Pioneers 10 and 11 took
to escape the solar system – starting at the third planet from the Sun and accelerating
with a gravity assist from Jupiter out of the solar system. Also shown to help identify
the origin of the spacecraft is a radial pattern etched on the plaque that represents
the position of our Sun relative to 14 nearby pulsars (i.e., spinning neutron stars)
and a line directed to the center of our Galaxy. The plaque may be considered
the cosmic equivalent of a message in a bottle cast into the sea. Sometime in the
far distant future, perhaps billions of years from now, Pioneer may pass through
a planetary system of a remote stellar neighbor, one of whose planets may have
evolved intelligent life. If that life possesses the technical ability and curiosity,
it may detect and pick up the spacecraft and inspect it. Then the plaque with its
message from Earth may be found and deciphered.
Pioneer 10 will be out there in interstellar space for billions of years. Due to
the loss of communication, we may never hear from it again unless, one day, it is
picked up by intelligent aliens in deep space.
Voyager 1 and 2 were launched in the summer of 1977. They have become the
third and fourth human-built artifacts to escape our solar system. The two spacecraft
will not make a close approach to another planetary system for at least 40,000 years.
The Voyagers carry sounds and images to portray the diversity of life and culture
on Earth. These materials are recorded on a 12-in. gold-plated copper disk. Carl
Sagan was responsible for selecting the contents of the record for NASA (See
Fig. 1.13). Sagan and his associates assembled 115 images and a variety of natural
sounds, such as those made by surf, wind and thunder, birds, whales, and other
animals. They also included musical selections from different cultures and eras,
spoken greetings from Earth-people in fifty-five languages, and printed messages
from President Carter of the United States of America and United Nations
Secretary-General Waldheim. Each record is encased in a protective aluminum jacket, together with a
cartridge and a needle. Instructions, in symbolic language, explain the origin of the
spacecraft and indicate how the record is to be played. The 115 images are encoded
in analog form. The remainder of the record is audio, designed to be played at
16-2/3 revolutions per minute. It contains the spoken greetings, beginning with Akkadian, which was
spoken in Sumer about 6,000 years ago, and ending with Wu, a modern Chinese
dialect. Following the section on the sounds of Earth, there is an eclectic 90-min
selection of music, including both Eastern and Western classics and a variety of
ethnic music. In Carl Sagan's words, "The spacecraft will be encountered and
the record played only if there are advanced space-faring civilizations in interstellar
space. But the launching of this bottle into the cosmic ocean says something very
hopeful about life on this planet."

Fig. 1.13 Voyagers' message
The disks are like a phonograph record: a cartridge and needle are supplied,
along with some simple diagrams that symbolically represent the spacecraft's
origin and give instructions for playing the disk.
Figure 1.14 shows instructions on Voyager’s plaque. Now see if you would be able
to understand them if you were an alien.

Fig. 1.14 Instructions on Voyager's plaque
The Voyager record is detailed in “Murmurs of Earth” (1978) by Sagan, Drake,
Lomberg et al. This is the story behind the creation of the record, and includes a
full list of everything on the record. Warner New Media reissued "Murmurs of
Earth" in 1992, together with a CD-ROM that replicates the Voyager record. The
CD-ROM was made available for purchase.4

4 http://math.cd-rom-directory.com/cdrom-2.cdprod1/007/419.Murmurs.of.Earth.-.The.Voyager.Interstellar.Record.shtml

1.2.5 “Ceci n’est pas une Pipe”

“Ceci n’est pas une pipe” is a famous statement made by Belgian surrealist René
Magritte (1898–1967) in his 1929 oil painting "The Treachery of Images." The
picture of a pipe, Fig. 1.15, is underlined by the thought-provoking caption in
French – "This is not a pipe."

Fig. 1.15 René Magritte's famous statement
Obviously, the "image" pipe is not a real pipe; it shares none of the physical
properties or functionality of a real pipe. On the other hand, this surrealistic painting
certainly makes us think deeper about the role of our language. The apparent
contradiction between the visual message conveyed by the picture of a pipe and the
statement made in words underlines the nature of language and interrelationships
between what we see, what we think, and what we say. Philosophers study such
questions in the name of hermeneutics. Hermeneutics can be traced back to the
Greeks and to the rise of Greek philosophy. Hermes is the messenger of the gods;
he brings a word from the realm of the wordless; hermeios brings the word from
the Oracle. The root word for hermeneutics is the Greek verb hermeneuein, which
means to interpret.
Don Ihde’s book Expanding Hermeneutics – Visualism in Science (Ihde 1998)
provides a series of examples from the history of science and technology in
an attempt to establish that visualist hermeneutics is essential to science and
technology. According to Ihde, “This hermeneutics, not unlike all forms of writing,
is technologically embedded in the instrumentation of contemporary science, in
particular, in its development of visual machines or imaging technologies.”
Ihde argues that what we see is mediated by enabling devices. We see through,
with, and by means of instruments (Ihde 1998). Science has found ways to enhance,
magnify, and modify its perceptions. From this perspective, Kuhn’s philosophy in
essence emphasizes that science is a way of “seeing.” We will return to Kuhn’s
paradigm theory later with the goal to visualize the development of a paradigm.
Ihde refers to this approach as perceptual hermeneutics. Its key features are
repeatable Gestalts, visualizability, and isomorphism.
Ihde noted that Leonardo da Vinci’s depictions of human anatomy show muscu-
lature, organs, and the like and his depictions of imagined machines in his technical
diaries were indeed in the same style – both exteriors and interiors were visualized.
Ihde also found similar examples from astronomy and medicine, such as Galileo's
telescope and the discovery of X-rays in 1895 by the German physicist Wilhelm Conrad
Röntgen (1845–1923) (See Fig. 1.16). What had been invisible or occluded became
observable. These imaging technologies have similar effects as da Vinci’s exploded
diagram style – they transform non-visual information to visual representations.
Two types of imaging technologies are significant: translation technologies that
transform non-visual dimensions to visual ones, and isomorphic ones. Imaging
technologies increasingly dominate contemporary scientific hermeneutics.
Fig. 1.16 The first X-ray photograph, produced by Röntgen in 1895, showing his wife's hand with a wedding ring
The epistemological advantages of visualization are its repeatable Gestalt
features. The simplest of Gestalt features is the appearance of a figure against a
ground, or the appearance of a foreground figure in a background. Usually, we are
able to single out some features from a background without any problems, although
sometimes it takes a lot more perceptual and cognitive processing before we can be
certain what forms the foreground and what forms the background. Gestalt patterns,
for example, are often connected to the moment of an “Aha!” as we suddenly realize
what the intended relationship between the features and the background is supposed
to be.
Do you see a vase or two people facing each other in Fig. 1.17? It depends on
which one you think is the figure. If you take the white vase as the figure, then the
two faces will recede into the background. The figure-ground switch in this picture
represents a typical Gestalt switch. The same set of pixels can be interpreted as the
parts of totally different patterns at a higher level. Isn't it amazing!

Fig. 1.17 A Gestalt switch between figure and ground. Does the figure show a vase or two faces?
In the “naı̈ve image realism” of visual isomorphism, recognizing objects is
straightforward, even though the observer may have never seen such images before.
The isomorphism, meaning the same shape, makes it easy to connect. In Ihde’s
words: “Röntgen5 had never seen a transparent hand as in the case of his wife’s
ringed fingers, but it was obvious from the first glimpse what was seen.” On the
other hand, there are more and more visual techniques that are moving away from
visual isomorphism. For example, transparent and translucent microorganisms
in "true color" were difficult to see. It was false coloring that turned microscopic
imaging into a standard technique within scientific visual hermeneutics.
Hermeneutics brings a word from the wordless. Information visualization aims
to bring insights into abstract information to the viewer. In particular, information
visualization deals with information that may not readily lend itself to geometric or
spatial representations. The subject of this book is about ways to depict and interpret
a gigantic “pipe” of scientific frontiers with reference to the implications of how
visualized scientific frontiers and real ones are interrelated.
As shown in the history of the continental drift theory, a common feature of a
research front is the presence of constant debates between competing theories over
how the same evidence could be interpreted from different views. These debates at a
disciplinary scale will be used to illustrate the central theme of this book – mapping
scientific frontiers. How can we take snapshots of a “battle ground” in scientific
literature? How can we track the development of competing schools of thought
over time? From a hermeneutic point of view, what are the relationships between
“images” of science and science itself? How do we differentiate the footprints of
science and scientific frontiers? Would René Magritte point to a visualization of a
scientific frontier, and say “This is not a science frontier?”
In the rest of this chapter, we will visit a few more examples and explore
profound connections between language, perception, and cognition. Some examples
illustrate the barrier of languages not only in the sense of natural languages but also
in terms of communicative barriers across scientific and technological disciplines.
Some show the power of visual languages throughout the history of mankind. Some
underline limitations of visual languages. Through these examples, we will be able
to form an overview of the most fundamental issues in grasping the dynamics of the
forefront of science and technology.

5 Wilhelm Röntgen, the discoverer of X-rays, made copies of the X-ray of his wife's hand and sent these to his colleagues across Europe as evidence of his discovery.
1.2.6 Gestalt Psychology

We can only see what we want to see. In other words, our vision is biased and
selective. Magritte's pipe looks so realistic that people feel puzzled when they
read the caption "This is not a pipe." Towards the end of the nineteenth century,
a group of Austrian and German psychologists found that human beings tend to
perceive coherent patterns in visual imagery. Gestalt is a German word which
essentially denotes the tendency to recognize a pattern, i.e. a holistic image, out of
individual parts, even though sometimes the holistic image is illusory.
The study of pattern-seeking behavior belongs to a branch of psychology called
Gestalt psychology. Human perception has a tendency to seek patterns in what
we see, or what we expect to see. A widely known example is the "face" on Mars,
which reminds us how our perceptual system can sometimes deceive us.
Gestalt psychology emphasizes the importance of the organizational processes
of perception, learning, and problem solving. Gestalt psychologists believe that
individuals are predisposed to organize information in particular ways. The basic
ideas of Gestalt
psychology are:
• Perception is often different from reality. This includes optical illusions.
• The whole is more than the sum of its parts. Human experience cannot be
explained unless the overall experience is examined, rather than its individual
parts alone.
• The organism structures and organizes experience. The word Gestalt in German
means structured whole. This means an organism structures experience even
though structure might not necessarily be inherent.
• The organism is predisposed to organize experience in particular ways. For
example, according to the law of proximity, people tend to perceive as a unit
those things that are close together in space. Similarly, according to the law of
similarity, people tend to perceive as a unit those things that are similar to one
another.
• Problem solving involves restructuring and insight. Problem solving involves
mentally combining and recombining the various elements of a problem until
a structure that solves the problem is achieved.
Human beings have a tendency to seek patterns. Gestalt psychology
considers perception an active force. We perceive a holistic image that means more
than the sum of its parts. We first see an overall pattern, then go on to analyze its
details. Personal needs and interests drive the detailed analysis. Like a magnetic
field, perception draws sensory imagery together into holistic patterns. According
to Gestalt theory, perception obeys an innate urge towards simplification by organizing
complex stimuli into simpler groups. Grouping effects include proximity, similarity,
continuity, and line of direction. Gestalt psychology highlights the ambiguity of
humans' pattern-seeing abilities. Figure 1.18 shows a famous ambiguous drawing.
See if you can see the two figures alternately, or even simultaneously.
Fig. 1.18 Is this a young lady or an old woman?
1.2.7 Information Visualization and Visual Analytics

Information visualization is concerned with the design, development, and
application of computer-generated, interactive graphical representations of information.
This often implies that information visualization primarily deals with abstract,
non-spatial data. Transforming such non-spatial data into intuitive and meaningful
graphical representations is therefore of fundamental importance to the field. The
transformation is also a creative process in which designers assign new meanings
to graphical patterns. Like art, information visualization aims to communicate
complex ideas to its audience and to inspire its users to make new connections. Like
science, information visualization must present information and associated patterns
rigorously, accurately, and faithfully (Chen 2010).
There are a number of widely read reviews and surveys of information visualization
(Card 1996; Hearst 1999; Herman et al. 2000; Hollan et al. 1997; Mukherjea
1999). There are several books on information visualization, notably Card et al.
(1999), Chen (1999), Spence (2000), and Ware (2000). Information Visualization,
published by Sage, is a peer-reviewed international journal on the subject. A more
recent overview can be found in Chen (2010).
The goal of information visualization is to reveal patterns, trends, and other new
insights into an information-rich phenomenon. Information visualization particularly
aims to make sense of abstract information. A major challenge in information
visualization is to develop intuitive and meaningful visual representations of non-
spatial and non-numerical information so that users can interactively explore the
same dataset from a variety of perspectives. The mission of information visualization
is well summarized in Card et al. (1999): "Information visualization is the use of
computer-supported, interactive, visual representations of abstract data to amplify
cognition."
A common question concerns the relationship between information visualization and
scientific visualization. A simple answer is that they are distinguished by their
corresponding research communities: they overlap, but largely differ. Some further
questions might clarify the scope of information visualization. First, is
the original data numerical? Graphical depictions of quantitative information are
often seen in the fields of data visualization, statistical graphics, and cartography.
For example, does a plot of daily temperatures of a city over the last 2 years qualify
as information visualization? The answer may depend on another
question: how easy or straightforward is it for someone to produce the plot? As
Michael Friendly and Daniel J. Denis put it, unless you know its history, everything
might seem novel. By the same token, what is complex and novel today may become
trivial in the future. A key point differentiating information visualization from
data visualization and scientific visualization comes down to whether the data is
already in quantitative form, and how easily it can be transformed into one. This is
why researchers emphasize the ability to represent non-visual data in
information visualization.
Second, if the data is not spatial or quantitative in nature, what does it take to
transform it into something that is spatial and visual? This step involves visual design
and the development of computer algorithms. It is this step that clearly distinguishes
information visualization from its nearest neighbors, such as quantitative data
visualization. More formally, this step can be found in an earlier taxonomy of
information visualization, which models the process of information visualization
in terms of data transformation, visualization transformation, and visual mapping
transformation. Data transformation turns raw data into mathematical forms.
Visualization transformation establishes a visual–spatial model of the data. Visual
mapping transformation determines how the visual–spatial model appears to
the user. If, on the other hand, the data is quantitative in nature, researchers and
designers are in a better position to capitalize on this given connection.
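As a rough illustration of these three transformations, and not a description of any particular system, the following Python sketch spells them out as separate functions. All names in it are hypothetical.

```python
# A minimal sketch of the three-stage model described above: data
# transformation -> visualization transformation -> visual mapping
# transformation. Every name here is illustrative, not from a real toolkit.
from dataclasses import dataclass

@dataclass
class VisualMark:
    x: float      # position on the spatial substrate
    y: float
    size: float   # a retinal property encoding a data value

def data_transformation(raw_records):
    """Turn raw data into a mathematical form (here, a simple data table)."""
    return [(r["label"], float(r["value"])) for r in raw_records]

def visualization_transformation(table):
    """Establish a visual-spatial model: assign each row a position index."""
    return [(i, value, label) for i, (label, value) in enumerate(table)]

def visual_mapping_transformation(spatial_model):
    """Determine how the visual-spatial model appears to the user."""
    return [VisualMark(x=i, y=0.0, size=value) for i, value, _ in spatial_model]

raw = [{"label": "A", "value": 3}, {"label": "B", "value": 7}]
table = data_transformation(raw)
marks = visual_mapping_transformation(visualization_transformation(table))
print(marks)
```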
The connection between scientific and artistic aspects of information visual-
ization is discussed in terms of functional information visualization and aesthetic
information visualization. The primary role of functional information visualization
is to communicate a message to the user, whereas the goal of aesthetic information
visualization is to present a subjective impression of a data set by eliciting a visceral
or emotive response from the user.
The holy grail of information visualization is for users to gain insights. In
general, the notion of insight is broadly defined, including unexpected discoveries, a
deepened understanding, a new way of thinking, eureka-like experiences, and other
intellectual breakthroughs.
In the early years of information visualization, it was believed that the ability to view
the entirety of a data set at a glance was important for discovering interesting and
otherwise hidden connections and other patterns. More recently, with the rise of
visual analytics, it has been realized that the actionability of information visualization
is essential; the emphasis shifts to the process of searching for insights rather than
the notion of insight per se.
Researchers have identified a number of stages of the process of information
visualization, namely mapping data to visual form, designing visual structures, and
view transformations. Mapping data to visual form involves the transformations
of data tables, variable types, and metadata. Visual structures can be divided into
spatial substrate, marks, connection and enclosure, retinal properties, and temporal
coding. View transformations concern location probes, viewpoint controls, and
distortion.
The origins of information visualization involve computer graphics, scientific
visualization, information retrieval, hypertext, geographic information systems,
software visualization, multivariate analysis, citation analysis and others such as
social network analysis. A motivation for applying visualization techniques is a need
to abstract and transform a large amount of data to manageable and meaningful
proportions. Analysis of multidimensional data is one of the earliest application
areas of information visualization. For example, Alfred Inselberg demonstrated how
information visualization could turn a multivariate analysis into a 2-dimensional
pattern recognition problem using a visualization scheme called parallel coordinates
(Inselberg 1997).
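For readers who wish to experiment with the technique, a minimal sketch is shown below, using the parallel_coordinates helper that ships with pandas; the data set is invented purely for illustration.

```python
# A minimal sketch of parallel coordinates: each row becomes a polyline
# crossing one vertical axis per variable, so multivariate comparison
# becomes a 2-D pattern-recognition task. The data below is made up.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "height":  [1.2, 0.8, 1.5, 0.9],
    "weight":  [3.1, 2.0, 3.8, 2.2],
    "age":     [4.0, 2.5, 5.0, 3.0],
    "species": ["a", "b", "a", "b"],  # class column used to color the lines
})

parallel_coordinates(df, class_column="species")
plt.show()
```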
Research in visual information retrieval has made considerable contributions to
information visualization. Ben Shneiderman at the University of Maryland proposed
a well-known mantra to characterize how users interact with the visualization of a
large amount of information:
• Overview: see overall patterns, trends
• Zoom: see a smaller subset of the data
• Filter: see a subset based on values
• Details-on-demand: see values of objects when interactively selected
• Relate: see relationships, compare values
• History: keep track of actions and insights
• Extract: mark and capture data
Users would start from an overview of the information space, zoom in on the
part that seems to be of interest, and call for more details. A common design question
is what options are available to attract users' attention most effectively. It is known
that our perception is drawn to motion, probably owing to our
ancestors' survival needs as hunters. However, a dashboard that is full of
blinking lights is probably not informative either. The precise meanings conveyed
by specific colors are strongly influenced by the local culture in which the system
is used. For example, trends colored in green in a financial visualization would
be interpreted positively, whereas contours colored in dark blue in a geographic
information system may indicate areas below sea level.
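A minimal, hypothetical sketch of the first steps of the mantra, expressed as operations on a pandas data frame, may make the workflow concrete; the table and thresholds below are invented.

```python
# Overview -> zoom -> filter -> details-on-demand over a small table.
import pandas as pd

df = pd.DataFrame({
    "year":      [2009, 2010, 2011, 2012],
    "field":     ["physics", "biology", "physics", "biology"],
    "citations": [120, 45, 300, 80],
})

overview = df.describe()                      # Overview: overall patterns
zoomed   = df[df["year"] >= 2011]             # Zoom: a smaller subset
filtered = zoomed[zoomed["citations"] > 100]  # Filter: a subset by value
details  = filtered.iloc[0]                   # Details-on-demand: one object
print(overview, filtered, details, sep="\n\n")
```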
Mapping scientific frontiers can draw valuable insights from many exciting
exemplars of information visualization. We will see in later chapters what
constitutes the paradigmatic structure of hypertext. Geographic configurations
provide the base map of a thematic map. Indeed, thematic maps offer a
fertile metaphor for a class of information visualization known as the information
landscape. Notable examples include ThemeView (Wise et al. 1995) and Bead
(Chalmers 1992).
ManyEyes is a more recent example: a "social kind of data analysis," in the
words of its designers at IBM's Visual Communication Lab. ManyEyes enables
many people to have a taste of what it is like to create their own information
visualizations, a chance they would otherwise not have at all. The public-oriented
design significantly simplifies the entire process of information visualization.
Furthermore, ManyEyes is a community-building environment in which one can
view visualizations made by other users, comment on them, and create one's own.
These reasons alone would be enough to earn ManyEyes a unique position in the
development of information visualization. ManyEyes and Wikipedia share some
interesting characteristics: both tap into social construction, and both demonstrate
the emergent properties of a self-organizing underlying system.
Modeling and visualizing intellectual structures from the scientific literature have
reached a new level in terms of the number of computer applications available,
the number of researchers actively engaged in relevant areas, and the number of
relevant publications. Traditionally, the scientific discipline that has most actively
addressed issues of science mapping and intellectual structure mapping is
information science, which includes two relevant sub-fields: information retrieval
and citation analysis. Both take the widely accessible scientific literature as their
input, yet they concentrate on disjoint sections of a document. Information retrieval
focuses on the bibliographic record of a document, such as the title and keyword
list, and/or its full text, whereas citation analysis focuses on referential links
embedded in the document, or those appended at its end. The ultimate challenge
for information visualization is to invent and adapt powerful visual-spatial
metaphors that can convey the underlying semantics.
Information retrieval has brought many fundamental inspirations and challenges
to the field of information visualization. Our quest aims to demonstrate that
science mapping goes beyond information retrieval, information visualization, and
scientometrics. It is becoming a unique field of study in its own right, with the
potential to be applicable to a wide range of scientific domains. Our focus is on the
growth of scientific knowledge, the key problems to solve, and the central tasks to
support. Instead of focusing on locating specific items in the scientific literature, we
turn to higher levels of granularity: scientific paradigms and their movements across
scientific frontiers.
Visual analytics can be seen as the second generation of information visualiza-
tion. It has transformed not only how we visualize complex and dynamic phenomena
in the new information age, but also how we may optimize analytical reasoning and
make sound decisions with incomplete and uncertain information (Keim et al. 2008).
Today’s widespread recognition of the indispensable value of visual analytics as a
field and the rapid growth of an energetic and interdisciplinary scientific community
would be simply impossible without the remarkable vision and tireless efforts of Jim
Thomas (1946–2010), his colleagues of the National Visualization and Analytics
Center (NVAC) at Pacific Northwest National Laboratory (PNNL), and the growing
community in visual analytics science and technology.
In 2004, Jim Thomas founded NVAC and initiated a new research area: visual
analytics, the science of analytical reasoning facilitated by interactive visual
interfaces (Thomas and Cook 2005; Wong and Thomas 2004).
Visual analytics is a multidisciplinary field. It brings together several scientific and
technical communities from computer science, information visualization, cognitive
and perceptual sciences, interaction design, graphic design, and the social sciences.
It addresses challenges involving analytical reasoning, data representations and
transformations, visual representations and interaction techniques, and techniques to
support production, presentation, and dissemination of the results. Although visual
analytics has some overlapping goals and techniques with information visualization
and scientific visualization, it is especially concerned with sense-making and
reasoning and it is strongly motivated by solving problems and making sound
decisions.
Visual analytics integrates new computational and theory-based tools with
innovative interactive techniques and visual representations based on cognitive,
design, and perceptual principles. This science of analytical reasoning is central
to the analyst’s task of applying human judgments to reach conclusions from a
combination of evidence and assumptions (Thomas and Cook 2005). Today, visual
analytics centers are found in several countries, including Canada, Germany, the
United Kingdom, and the United States, and universities have integrated visual
analytics into their core information science curricula, making the new field a
recognized and promising outgrowth of information visualization and scientific
visualization (Wong 2010).
The key contribution of visual analytics is that it is motivated by the needs of
analytic reasoning and decision making with highly uncertain data. Visual analytics
emphasizes the role of evidence in analytic reasoning and in making informed
decisions. This is precisely what is needed for mapping scientific frontiers, i.e.
evidence-based reasoning. In this second edition of the book, we introduce the latest
developments in visual analytics in relation to supporting analytic tasks pertinent to
mapping scientific frontiers.

1.3 Mapping Scientific Frontiers

This book is written with several groups of readers in mind, for example, researchers
and students in information science, computer science, history of science, philosophy
of science, and sociology of science. The book is also suitable for readers who
are interested in scientometrics, information visualization, and visual analytics, as
well as the science of science policy and research evaluation.

"Three Blind Men and an Elephant" is a widely told folktale in China. The
story probably originated in the Han Dynasty (202 BC–220 AD) (Kou and Kou 1976)
and was later expanded to six blind men in India. As the folktale goes, six
blind men went to find out what an elephant looks like. The first approached
the elephant and felt its body. He claimed: "The elephant is very like a
wall!" The second, feeling the tusk, said, "It is like a spear!" The third took
the elephant's trunk and said, "It is like a snake!" The fourth touched the knee and
shouted, "It is like a tree!" The fifth touched the ear and thought it was like a fan.
The sixth seized the swinging tail and was convinced that the elephant must be
like a rope. They could not agree on what an elephant is really like.
The moral of this folktale applies to our situation: scientists
receive all sorts of partial messages about scientific frontiers. Actor Network Theory
(ANT) was originally proposed as a sociological model of science (Latour 2005;
Callon et al. 1986). According to this model, the work of scientists consists of the
enrolment and juxtaposition of heterogeneous elements – rats, test tubes, colleagues,
journal articles, grants, papers at scientific conferences, and so on – which need
continual management. Scientists simultaneously reconstruct social contexts: labs
rebuild and link the social and natural contexts upon which they act.
Examining inscriptions is a key approach used for ANT. The other is to “follow
the actor,” via interviews and ethnographic research. Inscriptions include journal
articles, conference papers, presentations, grant proposals, and patents. Inscriptions
are the major products of scientific work (Latour 2005; Callon et al. 1986). In
Chap. 3, we will describe co-word analysis, which was originally developed for
analyzing inscriptions. Different genres of inscriptions may send messages to
scientists. On the one hand, messages from each genre of inscriptions form a
snapshot of scientific frontiers. For example, journal publications may provide a
snapshot of the “head” of the elephant; conference proceedings may provide the
“legs”; and textbooks may provide the “trunk”. On the other hand, messages in
different “bottles” must be integrated at a higher level, i.e. the “elephant” level, to
be useful as guidance to scientists and engineers.
Mapping scientific frontiers involves several disciplines, from philosophy of
science and sociology of science to information science, scientometrics, and
information visualization. Each individual discipline has its own research agenda and
practices, its own theories and methods. On the other hand, mapping scientific frontiers
is by its very nature interdisciplinary. One must transcend disciplinary boundaries so
that each contributing approach can fit into the context. Otherwise, the Tower of Babel
is not only a story in the Bible; it could be a valid summary of the fate of each new
generation's efforts to achieve the "holy grail" of standing on the shoulders of giants.

1.3.1 Science Mapping

Science maps depict the spatial relations between research fronts, which are areas
of significant activity. Such maps can also simply be used as a convenient means of
depicting the way research areas are distributed and conveying added meaning to
their relationships.

Even with a database that is completely up-to-date, we are still only able to create
maps that show where research fronts have been. These maps may reveal a fresh
view of where the action is and hint at where it may be going. However, as we
expand the size of the database from 1 year to a decade or more, the map created
through citation analysis provides a historical, indeed historiographical, window on
the field that we are investigating.
From a global viewpoint, these maps show relationships among fields or
disciplines. The labels attached or embedded in the graphics reveal their semantic
connections and may hint at why they are linked to one another. Furthermore, the
maps reveal which realms of science or scholarship are being investigated today and
the individuals, publications, institutions, regions, or nations currently pre-eminent
in these areas.
By using a series of chronologically sequential maps, one can see how knowledge
advances. While maps of current data alone cannot predict where research will
go, they can be useful indicators in the hands of informed analysts. By observing
changes from year to year, trends can be detected. Thus, the maps become
forecasting tools. And since some co-citation maps include core works, even a
novice can instantly identify those articles and books used most often by members
of the “invisible college.”
The creation of maps by co-citation clustering is a largely algorithmic process.
This stands in contrast to the relatively simple but arduous manual method we used
over 30 years ago to create a historical map of DNA research from the time of
Mendel up to the work of Nirenberg and others.
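As a rough indication of why the process is algorithmic, the following sketch counts co-citations from invented reference lists. It is a simplified outline of the counting step only, not the actual procedure used in the studies discussed here.

```python
# Two references are co-cited when they appear together in the same
# reference list; high co-citation counts form the links of a map.
from itertools import combinations
from collections import Counter

papers = [  # each citing paper is the set of references it cites (made up)
    {"Watson1953", "Mendel1866", "Crick1958"},
    {"Watson1953", "Crick1958"},
    {"Mendel1866", "Nirenberg1965", "Watson1953"},
]

cocitations = Counter()
for refs in papers:
    for a, b in combinations(sorted(refs), 2):
        cocitations[(a, b)] += 1

# Clustering this weighted co-citation graph yields candidate specialties.
for pair, count in cocitations.most_common(3):
    print(pair, count)
```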
Samuel Bradford (1878–1948) referred to “a picture of the universe of discourse
as a globe, on which are scattered, in promiscuous confusion, the mutually related,
separate things we see or think about." John Bernal (1901–1971), a prominent
international scientist and X-ray crystallographer, was a pioneer in the social
studies of science, or the "science of science". His book The Social Function of Science
(Bernal 1939) has been regarded as a classic in this field. To Bernal, science is
the very basis of philosophy. There was no sharp distinction between the natural
sciences and the social sciences for Bernal, and the scientific analysis of society was
an enterprise continuous with the scientific analysis of nature. For Bernal, there was
no philosophy, no social theory, and no knowledge independent of science. Science
was the foundation of it all.
Bernal, among others, created by laborious manual methods what we would
today describe as historiographs. However, dynamic longitudinal mapping was
made uniquely possible by the development of the ISI® database. Indeed, it gave
birth to scientometrics and new life to bibliometrics.

1.3.2 Cases of Competing Paradigms

It is not uncommon for a new theory in science to meet resistance. A newborn
theory may grow stronger and become dominant over time. On the other hand, it
might well be killed in its cradle. What are the factors that determine the fate of a
new theory? Is there any conclusive evidence? Are there in fact patterns in the world
of science and technology that can make us wiser? Let us take a look at some of the
widely known and long-lasting debates in the history of science.
Remember, Kuhn's paradigm theory centers on puzzle solving. In this
book, we aim to describe a broad range of theories, methodologies, and examples
that can contribute to our knowledge of how to better capture the dynamics of the
creation of scientific knowledge. We will demonstrate our work in citation-based
approaches to knowledge domain visualization and present in-depth analyses of
several puzzle-solving cases, in particular the debates between competing
theories on the causes of the dinosaurs' extinction, the power sources of active
galactic nuclei, and the connections between mad cow disease and a new variant of
human brain disease.

1.3.2.1 Dinosaurs’ Extinctions

Five mass extinctions have occurred on Earth in the past 500 million years,
including the greatest of all, the Permian-Triassic extinction 248 million years ago,
and the Cretaceous-Tertiary extinction 65 million years ago, which wiped out the
dinosaurs among many other species. The Cretaceous-Tertiary extinction, also
known as the KT extinction, has been the topic of intensive debate over the last
20 years, involving over 80 theories of what caused the mass extinction of the
dinosaurs. Paleontologists, geologists, physicists, astronomers, nuclear chemists,
and many others are all involved. We will use our visualization techniques to reveal
the process of this debate.

1.3.2.2 Super-Massive Black Holes

Einstein's general theory of relativity predicted the existence of black holes in the
universe. By their very nature, we cannot see black holes directly, even if a real one
falls within the scope of our telescopes. Astronomers are puzzled by the gravitational
power emanating from the centers of galaxies. If our theories are correct, the
existence of supermassive black holes is among the few viable explanations.
Astronomers have been collecting evidence with increasingly powerful telescopes.
In this case, we will analyze the impact of such evidence on the acceptance of a
particular paradigm.

1.3.2.3 BSE and vCJD

The 1997 Nobel Prize in physiology or medicine was awarded to Stanley Prusiner,
professor of neurology, virology, and biochemistry, for his discovery of prions, an
abnormal form of a protein responsible for diseases such as scrapie in sheep, Bovine
Spongiform Encephalopathy (BSE) in cattle, also known as mad cow disease,
and Creutzfeldt-Jakob disease (CJD) in humans. While CJD is typically found among
people over 55, vCJD patients have an average age of 27. At the height of the UK's
BSE crisis, the public was concerned about whether it was safe to eat beef products
at all. This concern led to the question of whether eating contaminated food can
cause vCJD.

1.4 The Organization of the Book

This book is written with an interdisciplinary audience in mind: information
scientists who are interested in visualizing the growth of scientific knowledge,
computer scientists who are interested in characterizing the dynamics of scientific
paradigms through the use of visualization and animation techniques, and
philosophers and researchers in social studies of science who are interested in
various case studies and possible explanations based on visual exploration. The
book also provides the foundations for readers who want to start their own quests
into scientific frontiers and deal with invisible colleges and competing paradigms.
Chapter 1, "The Dynamics of Scientific Knowledge", introduces a wide range of
examples to illustrate fundamental issues concerning visual communication and
visual analytic reasoning in general and mapping scientific frontiers in particular.
We emphasize the profound connections between perception and cognition. We use
the metaphor of a message in a bottle to highlight the role of visual representations
in communication as well as in everyday life. We also use the story of the blind
men and the elephant as an analogy for the challenges that science mapping must face.
Several examples in this chapter identify the key requirements for unambiguous
and effective communication based on our perceptual abilities. The power of visual
languages is traced from ancient cave paintings to the messages carried by
the spacecraft Pioneer and Voyager. The messages sent into deep space
also raise the question of what prior knowledge is required for understanding
visualization. Limitations of visual language are explained in terms of Gestalt
psychology. The holistic nature of the Gestalt switch at the macroscopic level of a
paradigm shift, and the mechanisms of replacing a conceptual structure with a new
structure of higher explanatory coherence at a more detailed level, set the
stage for topics that we will elaborate and discuss further in the book.
Chapter 2, "Mapping the Universe", explores the origin of cartography and its
role in mapping phenomena in the physical world, from terrestrial and celestial
maps to biological maps. We highlight the influential role of thematic maps in
subsequent visualizations of more abstract phenomena. The idea of a geographic base
map and a thematic overlay is such a simple yet powerful model that we repeatedly
refer to this method throughout the book. We also emphasize the role of a holistic
metaphor, or an intact image. Stories associated with constellation figures are good
examples of this type of metaphor. The second edition includes new examples of
global science maps and interactive overlays.

Chapter 3, "Mapping Associations", extends the spatial metaphors described
in Chap. 2 to capture the essence of conceptual worlds. On the one hand, we
distinguish the uniqueness of mapping conceptual systems. On the other hand,
it is our intention to consolidate design strategies and visual representations that
can be carried through into the new realm. This chapter introduces some of the
most commonly used methods to generate visual-spatial models of concepts and
their interrelationships. Examples in this chapter demonstrate not only the use
of classic multivariate analysis methods such as multidimensional scaling (MDS)
and principal component analysis (PCA), but also the promising route for further
advances in non-linear multidimensional scaling. We introduce a number of network
modeling and analysis approaches.
Chapter 4, "Trajectories of Search", describes three interrelated aspects of
science mapping: structural modeling, visual-semantic displays, and behavioral
semantics. Structural mapping is concerned with how to extract meaningful
relationships from information resources. Visual-semantic displays focus on the
design of effective channels of communication. Traditionally, structural mapping
and visual-semantic display are regarded as the core of information visualization.
Behavioral semantics emphasizes the meaning of behavioral patterns in helping us
to understand the structure of an information space. It also provides a promising
way to build responsive virtual environments. We expect these enabling techniques
will play an increasingly important role in mapping scientific frontiers.
Chapter 5, “The Structure and Dynamics of Scientific Knowledge”, presents a
historical account of theories and quantitative methods of mapping science. Two
major streams of work, co-word analysis and co-citation analysis, are illustrated
with examples. The influence of information visualization is highlighted.
Chapter 6, "Tracing Competing Paradigms", focuses on the visualization of
competing paradigms by using theories and techniques described in previous
chapters. This chapter demonstrates the process of detecting competing paradigms
through two detailed case studies. One is on the prolonged scientific debates among
geologists and paleontologists on mass extinctions. The other is on the search for
supermassive black holes and the active galactic nuclei paradigm pursued by
astronomers and astrophysicists.
Chapter 7, “Tracking the Latent Domain Knowledge”, demonstrates three more
case studies on the theme of visualizing the dynamics of scientific frontiers. In
contrast to Chap. 6, the case studies in this chapter emphasize the role of citation
networks in revealing less frequently cited works. The goal is to foster further
research in discovering paradigms.
Chapter 8, “Mapping Science”, introduces a structural variation model to
measure the value of newly available information by conceptualizing the devel-
opment of scientific knowledge as a complex adaptive system. This chapter also
includes a case study of identifying emerging trends in regenerative medicine and a
study of retracted articles and their impacts on the literature. Global science maps
and interactive overlays are also introduced in this chapter. A new dual-map overlay
design is proposed to make citations explicit in terms of both source and target
journals of citation links.

Chapter 9, "Visual Analytics", outlines several applications that are designed to
support analytic reasoning and decision-making tasks in general, although some
of them specifically target the understanding of scientific literature. Challenges
identified in the first edition of the book in 2002 are reviewed. New milestones
are set to highlight the challenges ahead.

References

Bernal JD (1939) The social function of science. The Macmillan Co., New York
Callon M, Law J, Rip A (eds) (1986) Mapping the dynamics of Science and technology: sociology
of science in the real world. Macmillan Press, London
Card SK (1996) Visualizing retrieved information: a survey. IEEE Comput Graph Appl 16(2):
63–67
Card S, Mackinlay J, Shneiderman B (eds) (1999) Readings in information visualization: using
vision to think. Morgan Kaufmann, San Francisco
Chalmers M (1992) BEAD: explorations in information visualisation. Paper presented at the
SIGIR’92, Copenhagen, Denmark, June 1992
Chen C (1999) Information visualisation and virtual environments. Springer, London
Chen C (2010) Information visualization. Wiley Interdiscip Rev Comput Stat 2(4):387–403
Crane D (1972) Invisible colleges: diffusion of knowledge in scientific communities. University of
Chicago Press, Chicago
Fimmel RO, Allen JV, Burgess E (1980) Pioneer: first to Jupiter, Saturn, and beyond (U.S. Govern-
ment Printing Office No. NASA SP-446). Scientific and Technical Information Office/NASA,
Washington, DC
Hearst MA (1999) User interfaces and visualization. In: Baeza-Yates R, Ribeiro-Neto B (eds)
Modern information retrieval. Addison-Wesley, Harlow, pp 257–323
Herman I, Melançon G, Marshall MS (2000) Graph visualization and navigation in information
visualization: a survey. IEEE Trans Vis Comput Graph 6(1):24–44
Hollan JD, Bederson BB, Helfman J (1997) Information visualization. In: Helander MG,
Landauer TK, Prabhu P (eds) The handbook of human computer interaction. Elsevier Science,
Amsterdam, pp 33–48
Ihde D (1998) Expanding hermeneutics: visualism in science. Northwestern University Press,
Evanston
Inselberg A (1997) Multidimensional detective. Paper presented at the IEEE InfoVis’97, Phoenix,
AZ, October 1997
Keim D, Mansmann F, Schneidewind J, Thomas J, Ziegler H (2008) Visual analytics: scope and
challenges. Vis Data Min 4404:76–90
Kochen M (1984) Toward a paradigm for information science: the influence of Derek de Solla
Price. J Am Soc Inf Sci Technol 35(3):147–148
Kornish LJ, Ulrich KT (2011) Opportunity spaces in innovation: empirical analysis of large
samples of ideas. Manag Sci 57(1):107–128
Kou L, Kou YH (1976) Chinese folktales. Celestial Arts, Millbrae, pp 83–85
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Latour B (2005) Reassembling the social – an introduction to actor-network-theory. Oxford
University Press, Oxford
Masterman M (1970) The nature of the paradigm. In: Lakatos I, Musgrave A (eds) Criticism and
the growth of knowledge. Cambridge University Press, Cambridge, pp 59–89
McGrath JE, Altman I (1966) Small group research: a synthesis and critique of the field. Holt,
Rinehart & Winston, New York
McKim RH (1980) Experiences in visual thinking, 2nd edn. PWS Publishing Company, Boston
Mukherjea S (1999) Information visualization for hypermedia systems. ACM Comput Surv
31(4):U24–U29
Hanson NR (1958) Patterns of discovery. Cambridge University Press, Cambridge
Price DD (1963) Little science, big science. Columbia University Press, New York
Price DD (1965) Networks of scientific papers. Science 149:510–515
Rittschof KA, Stock WA, Kulhavy RW, Verdi MP, Doran JM (1994) Thematic maps improve
memory for facts and inferences: a test of the stimulus order hypothesis. Contemp Educ Psychol
19:129–142
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small HG, Griffith BC (1974) The structure of scientific literatures I: identifying and graphing
specialties. Sci Stud 4:17–40
Spence B (2000) Information visualization. Addison-Wesley, New York
Thagard P (1992) Conceptual revolutions. Princeton University Press, Princeton
Thomas JJ, Cook K (2005) Illuminating the path: the R&D agenda for visual analytics. IEEE
Computer Society, Los Alamitos
Tufte ER (1983) The visual display of quantitative information. Graphics Press, Cheshire
Tufte ER (1990) Envisioning information. Graphics Press, Cheshire
Tufte ER (1997) Visual explanations. Graphics Press, Cheshire
Ware C (2000) Information visualization: perception for design. Morgan Kaufmann Publishers,
San Francisco
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A, et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Wong PC (2010) The four roads less traveled – a tribute to Jim Thomas (1946–2010). From http://
vgtc.org/JimThomas.html
Wong P, Thomas J (2004) Visual analytics. IEEE Comput Graph Appl 24(5):20–21
Chapter 2
Mapping the Universe

A picture is worth a thousand words.
Chinese Proverb

Powers of Ten is a short documentary film written and directed by Charles Eames
and his wife, Ray Eames, and released in its final form in 1977. Starting from a scene
one meter wide, the film moves ten times farther away every 10 seconds. By the 7th
move, we have already moved far enough to see the entire Earth (Fig. 2.1). In 1998, the
Library of Congress selected the film for preservation in the United States National
Film Registry as "culturally, historically, or aesthetically significant." In
this chapter, we will review principles and techniques that have been developed for
drawing maps at three very different scales: geographical maps, maps of the
universe, and maps of protein sequences and compounds.
This chapter focuses on the variety of organizing models behind maps,
and in particular their role in making visual thinking and visual communication
effective. These models are also known as metaphors. The fundamental value of a
metaphor is its affordance. The central theme in this chapter is the design of thematic
maps that represent phenomena in the physical world, across both terrestrial and
celestial mapping. The key question is: what are the roles of various metaphors in
mapping macrocosmic phenomena and microcosmic ones?

2.1 Cartography

Maps are graphic representations of the cultural and physical environment. Maps
appeared as early as the fifth or sixth century BC. Cartography is the art, science,
and technology of making maps. There are two types of maps: general-purpose
maps and thematic maps. General-purpose maps are also known as reference maps.
Examples of reference maps include topographic maps and atlas maps. These maps
display objects from the geographical environment with emphasis on location, and

the purpose is to show a variety of features of the world or a region, such as
coastlines, lakes, rivers, and roads.

Fig. 2.1 Scenes in the film Powers of Ten (Reprinted from http://www.powersof10.com/film
© 2010 Eames Office)

Historically, the reference map was prevalent until
the middle of the eighteenth century, when knowledge of the world was increasing
sharply and cartographers were preoccupied with making a world map that
would be as comprehensive as possible. Thematic maps, on the other hand, are
more selective: they display the spatial distribution of a particular geographic
phenomenon. Thematic maps are also known as special-purpose, single-topic, or
statistical maps. They emerged as scientists turned their attention to the
spatial attributes of social and scientific data, such as climate, vegetation, geology,
and trade. A thematic map is designed to demonstrate particular features or concepts,
its purpose being to illustrate the structural characteristics of some
particular geographical distribution. A thematic map normally focuses on a single
theme.
Thematic maps came late in the development of cartography. Thematic maps
make it easier for professional geographers, planners, and other scientists and
academicians to view the spatial distribution of phenomena. Thematic maps were
not widely introduced until the early nineteenth century. The last 30 years have been
referred to as the “era of thematic mapping,” and this trend is expected to continue
in the future.
Every thematic map has two important components: a geographic or base map
and a thematic overlay (See Fig. 2.2). A geographic base map provides information
of location to which the thematic overlay can be related.

Fig. 2.2 The procedure of creating a thematic map

Thematic maps must be
well designed and include only necessary information. Simplicity and clarity are
important design features of the thematic overlay.
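A minimal sketch of the base map plus thematic overlay model is shown below, using matplotlib; the coastline polygon and city data are invented.

```python
# A crude base map (coastline) with a thematic overlay of proportional
# symbols (e.g., city populations). All coordinates and values are made up.
import matplotlib.pyplot as plt

coast_x = [0, 4, 5, 3, 1, 0]   # base-map layer: locational context
coast_y = [0, 0, 3, 5, 4, 0]

cities = {"A": (1.5, 1.0, 300), "B": (3.0, 2.5, 120), "C": (2.0, 3.5, 60)}

fig, ax = plt.subplots()
ax.plot(coast_x, coast_y, color="gray")
for name, (x, y, pop) in cities.items():
    ax.scatter(x, y, s=pop, alpha=0.5)   # thematic layer: one variable only
    ax.annotate(name, (x, y))
ax.set_title("Thematic overlay on a base map")
plt.show()
```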
Researchers are still debating the roles of communication and visualization
within modern cartography. David DiBiase's model of visualization in
scientific research includes visual communication in its public-realm portion.
His model suggests that visualization takes place along a continuum, with
exploration and confirmation in the private realm, and synthesis and presentation in
the public realm. The private realm constitutes visual thinking; the public realm
is visual communication. The traditional view of cartographic communication is
incorporated into more complex descriptions of cartography as an important
component.
The distinction between cartographic communication and cartographic visualization
is that the former deals with an optimal map whose purpose is to communicate
a specific message, whereas the latter concerns a message that is unknown and for
which there is no optimal map (Hearnshaw and Unwin 1994). This idea follows much
of the thinking that distinguishes deterministic from probabilistic thinking, which
characterizes much of twentieth-century scientific thinking. The latest
view of visualization in cartography and communication recognizes the importance
of the map user in the communication process, who was often overlooked in the
traditional view.
Cartographers have recognized that map readers differ: they are not simple,
mechanical, unthinking parts of the process, but bring their own experiences and
cognition to the map-reading activity. Map communication is the component
of thematic mapping whose purpose is to present one of many possible results of a
geographical inquiry. Maps are seen as tools for the researcher in finding patterns
and relationships among mapped data, not simply for the communication of ideas
to others. Cartographic communication requires that the cartographer know what
a map reader needs so as to send the right message, although a
cartographer may never be certain that the intended message is conveyed precisely.

Cartography is a process of abstraction, involving selection, classification,
simplification, and symbolization.
specific detail shown on the map. On the other hand, the map reader needs enough
information to be able to understand the map. The most complex of the mapping
abstractions is symbolization. Two major classes of symbols are used for thematic
maps: replicative and abstract. Replicative symbols are designed to look like their
real-world counterparts; they are used only to stand for tangible objects such as
coastlines, trees, houses, and cars. Base-map symbols are replicative in nature,
whereas thematic-overlay symbols may be either replicative or abstract. Abstract
symbols generally take the form of geometric shapes, such as circles, squares, and
triangles. They are traditionally used to represent amounts that vary from place
to place.
Maps and their quantitative symbols are unique mechanisms for the communica-
tion of spatial concepts. Because it is possible to cram a lot of information into one
symbol, the designer often tries for too much of a good thing. Overloaded symbols
are hard to understand; they may send the wrong message or incomprehensible
messages. For example, the proportional circle is the most commonly misused
symbol. It can be sized, segmented, colored, or sectored. It is tempting to include
all of these on one map. Unfortunately, if the map reader cannot see the spatial
distribution clearly and easily, then the map is not communicating. If a thematic
map overloads proportional circles with three or more different data sets, the map
will fail to convey anything useful. A good design guideline is to limit the number of
variables symbolized by proportional point symbols to one, possibly two, but never
three or more.
An isarithmic map, also known as a contour map, is a planimetric graphic repre-
sentation of a three-dimensional volume. Isoline mapping is a system of quantitative
line symbols that attempt to portray the undulating surface of the three-dimensional
volume. Contour is a common example. An isarithmic mapping technique always
implies the existence of a third dimension. The isarithmic technique also requires
that the volume’s surface is continuous in nature, rather than discrete or stepped.
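A minimal sketch of the isarithmic technique, drawing labeled isolines over a synthetic continuous surface with matplotlib, is given below; the surface itself is invented.

```python
# Isolines (contours) portraying a continuous three-dimensional surface.
import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
z = np.exp(-(x**2 + y**2)) + 0.5 * np.exp(-((x - 1.5)**2 + (y - 1.0)**2))

fig, ax = plt.subplots()
cs = ax.contour(x, y, z, levels=10, colors="black")  # the isolines
ax.clabel(cs, inline=True, fontsize=8)               # label each level
ax.set_title("Contour (isarithmic) map of a continuous surface")
plt.show()
```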
Isarithmic mapping has had a long history. Isometric lines showing the depth
of the ocean floor are called isobaths, which were first used in 1584. In 1777, the
isohypse line was proposed by Meusnier as a way of depicting surface features
and was first used in an actual map made by du Carla-Dupain-Triel in 1782. The
isohypse is an isometric line.
The perceptual tendency most important to a viewer, as well as to the cartographer,
is the figure-ground configuration. Our underlying perceptual tendency is to
organize the visual field into two categories: important objects, which form figures,
and less important ones, which form grounds. Gestalt psychologists first introduced
this concept early in the twentieth century. Figures are objects standing out from the
background. Figures are remembered better; grounds are formless and often lost in
perception.
In the three-dimensional world, we see buildings in front of the sky and cars in front
of pavements. Texture and differences in texture can produce figures in perception.

Orientation of the textural elements is more important in figure development than
the positioning of the elements. The cartographic literature also provides evidence
that texture and texture discrimination lead to the emergence of figures.
Adding clear edges to figural objects can produce a strong figure in two-
dimensional visual experience. Conversely, reducing edge definition can weaken
figural dominance. There are many ways to form edges, for example, using contrasts
of brightness, reflection, or texture. If the cartographer overlooks the role of figures
and grounds, the resultant maps are likely to be very confusing.
A significant geographical clue is the differentiation between land and water,
if the mapped area contains both. This distinction has been suggested as the first
important process in thematic map reading. Maps that present confusing land-water
forms deter the efficient and unambiguous communication of ideas. Land-water
differentiation usually aims to cause land areas to be perceived as figures and water
areas as ground. In unusual cases, water areas are the focal point of the map and
would therefore be given graphic treatment to cause them to appear as figures.
Cartographers have developed comprehensive guidelines for using letters in a
map. Here are four golden rules:
1. Legibility
2. Harmony
3. Suitability of reproduction
4. Economy and ease of execution
Good lettering design on the map can be achieved by contrast of capitals and
lowercase. A map that contains only one form or the other is exceptionally dull and
usually indicates a lack of planning. In general, capitals are used to label larger
features such as countries, oceans, and continents, and important items such as
large cities, national capitals, and perhaps mountain ranges. Smaller towns and less
important features may be labeled in lower case with initial capitals.
Careful lettering placement enhances the appearance of the map. There are
several conventions, supported by a few experimental studies. Most professional
cartographers agree that point symbols should be labeled with letters set solid (no
letter spacing). Upper right positioning of a label to a point symbol is usually
recommended.
The visual hierarchy, also known as the organizational hierarchy, is the intellectual
plan for the map and the eventual graphic solution that satisfies the plan. The
cartographer sorts through the components of the map to determine the relative
intellectual importance of each, then seeks a visual solution that will cast each
component in a manner compatible with its position along the intellectual spectrum.
Objects that are important intellectually are rendered so that they are visually
dominant within the map frame (See Fig. 2.3).
The planning of visual hierarchy must suit the purpose of the map. For example,
water is ordinarily placed beneath the land in the order. Fundamental perceptual
organization of the two-dimensional visual field is based on figure and ground.
The figure-ground phenomenon is often considered to be one of the most primitive
forms of perceptual organization.

Fig. 2.3 The visual hierarchy. Objects on the map that are most important intellectually are
rendered with the greatest contrast to their surroundings. Less important elements are placed lower
in the hierarchy by reducing their edge contrasts. The side view in this drawing further illustrates
this hierarchical concept

Objects that stand out against their backgrounds
are referred to as figures in perception, and their formless backgrounds as grounds.
The segregation of the visual field into figures and grounds is a kind of automatic
perceptual mechanism. With careful attention to graphic detail, all the elements can
be organized in the map space so that the emerging figure and ground segregation
produces a totally harmonious design. Later chapters in the book include examples
of how figure-ground perception plays a role in describing scientific paradigms.
Cartographers have developed several techniques to represent the spherical sur-
face of the Earth. These techniques are known as map projections. Map projections
commonly use three types of geometric surfaces: cylinder, cone, and plane. A few
projections, however, cannot be categorized as such, or are combinations of these.
The three classifications are used for a wide variety of projections, including some
that are not geometrically constructed.
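As a concrete, if simplified, example of a cylindrical projection, the following sketch implements the standard spherical Mercator formulas; the radius and sample coordinates are illustrative.

```python
# Spherical Mercator: project longitude/latitude (degrees) onto a plane.
# x = R * lambda, y = R * ln(tan(pi/4 + phi/2)); distortion grows poleward.
import math

def mercator(lon_deg: float, lat_deg: float, R: float = 6371.0):
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    return R * lam, R * math.log(math.tan(math.pi / 4 + phi / 2))

# Approximate coordinates of London: 0.1 degrees W, 51.5 degrees N.
print(mercator(-0.1, 51.5))
```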

2.1.1 Thematic Maps

All thematic maps consist of a base map and a thematic overlay that depicts the
distribution pattern of a specific phenomenon. Different types of phenomena or
data require different mapping techniques. Qualitative and quantitative maps can
be distinguished as follows.
Qualitative maps show a variety of different phenomena across different regions.
For example, an agriculture map of Virginia would show that tobacco is the
dominant commercial product of Southside, beef cattle the dominant commercial
product of the Valley of Virginia, and so forth. Quantitative maps, on the other
hand, focus on a particular phenomenon and display numerical data associated
with the phenomenon. The nature of the phenomena, either continuous or discrete,
determines the best mapping method. For example, spatially continuous phenomena
like rainfall amounts are mapped using isolines; total counts of population may be
mapped using dots or graduated symbols; mean income on a county-by-county basis
would use area symbols.

Fig. 2.4 Four types of relief map: (a) contours, (b) contours with hill shading, (c) layer tints, and
(d) digits (Reprinted from http://www.nottingham.ac.uk/education/maps/relief.html#r5)

2.1.2 Relief Maps and Photographic Cartography

Relief maps are used to represent a three-dimensional surface, such as hills, valleys
and other features of a place. Techniques such as contour lines, shading, and
layer tints are commonly used in relief maps. Reasoning in three dimensions
requires skill; many people find relief features harder to interpret than most other
information on a map. There are more than a dozen distinct methods for showing
relief and so the map designer has a wide choice (See Fig. 2.4).
Information visualization has adapted many techniques from relief maps to
represent abstract structures and volatile phenomena. Notable examples include
self-organizing maps (SOMs) (Lin 1997) and ThemeScape models (Wise et al.
1995). See Chap. 4 for more details.
In Chap. 1, we introduced the view of visualism in science, which emphasizes the
instrumental role of technologies in scientific discovery. Earlier cartography
relied on craftsmen's measuring and drawing skills. Today, photographic cartography
relies on new technologies. For example, the powerful Hubble Space Telescope
(HST) took high-quality photographs of stars and galaxies for celestial mapping.

Fig. 2.5 A Landsat photograph of Britain (left). Central London (right) is shown as the blue area
near to the lower right corner. The Landsat satellite took the photo on May 23rd, 2001 (Reprinted
from http://GloVis.usgs.gov/ImgViewer.jsp?path=201&row=24&pixelSize=1000)

Satellites have played an increasingly significant role in making thematic maps.
For example, the LANDSAT 7 satellite, launched in 1999, carried the Enhanced
Thematic Mapper Plus (ETM+) instrument, an eight-band multispectral
scanning radiometer capable of providing high-resolution image information of
the Earth's surface. It detects spectrally filtered radiation at visible, near-infrared,
short-wave, and thermal infrared frequency bands. Nominal ground
sample distances, or "pixel" sizes, are 49 ft (15 m) in the panchromatic band; 98 ft
(30 m) in the six visible, near- and short-wave infrared bands; and 197 ft (60 m) in the
thermal infrared band. The ETM+ produces approximately 3.8 gigabits of data for
each scene, roughly equivalent to a 430-volume encyclopedia. Figure 2.5
shows a photograph of Britain from LANDSAT and a detailed photograph of Central
London.

2.2 Terrestrial Maps

The Greek astronomer Claudius Ptolemy (c.85–163 AD) generated one of the most
famous world maps in about 150 AD. Unfortunately, none of his maps survived.
Scholars in the Renaissance in the fifteenth century reconstructed Ptolemy’s map
following his instructions (See Fig. 2.6).
Ptolemy's map represented his knowledge of the world. The map was most
detailed around the Mediterranean because he worked in Alexandria. The map
showed only three continents: Europe, Asia, and Africa. The sea was colored in
light brown, the rivers in blue, and the mountains in dark brown. The surrounding
heads represent the major winds.

Fig. 2.6 Ptolemy’s world map, re-constructed based on his work Geography c. 150 (© The British
Library http://www.bl.uk/)

Fig. 2.7 A road map and an aerial photograph of the Westminster Bridge in London

Advances in mineralogy, stratigraphy, and paleontology permitted the publication
of the first geological maps in the early nineteenth century, in which colors
were used to indicate the distribution of rocks and soils. Modern world maps,
based on satellites and remote-sensing technologies, are far more accurate
and informative than the old world maps. Computer technologies now allow users
to make their own maps on the Internet using up-to-date geographic databases.
Cartography has pushed forward the frontiers between the known and the unknown.
Figure 2.7 includes the Westminster Bridge in London on a road map, its aerial
photograph, and a tourist photograph of Big Ben, a landmark of London.
As we know, cartography is a process of abstraction. The best-known example
is the London Underground map. Figure 2.8 shows an earlier version of London
Underground map, in which stations and routes are geographically accurate.

Fig. 2.8 London Underground map conforms to the geographical configuration


Because there is too much information about Central London to fit into the map,
an enlarged section of Central London is provided to show the detail. In contrast,
Fig. 2.9 shows the current version of the London Underground map, whose most
distinctive feature is its simplicity and clarity: underground routes are shown as
straight lines, and geographical accuracy gives way to legibility. The
topology of the underground in Central London is clear, although some information
visualization techniques have been applied specifically to help us read the map more
easily. The map produced by the geographic-independent design is certainly “not to
scale.”

2.3 Celestial Maps

Constellations are the imaginary work of our ancestors. Their real purpose is to
help us locate stars in the sky by dividing it into more manageable regions that serve
as memory aids.
Historians believe that many of the myths associated with the constellations were
invented to help farmers remember them. When farmers saw certain constellations,
they would know it was time to begin planting or reaping, rather like a visual
calendar.

Fig. 2.9 London Underground map does not conform to the geographical configuration

The ancient Babylonians and Egyptians had constellation figures before the
Greeks. In some cases, these may correspond with later Greek constellations; in
other cases, there is no correspondence; and in yet other cases an earlier figure might
be represented in a different part of the sky. The constellation figures of the Northern
Hemisphere are over 2,000 years old.
Peter Whitfield describes the history of celestial cartography in The Topography
of the Sky (Whitfield 1999). One of the hallmarks of ancient astronomy was that
precise observation coexisted with a sense of mystery and transcendence. The
Babylonians, in particular, devised powerful mathematical systems for predicting
the positions of celestial bodies, while still considering those bodies to be “gods” of
the night. The observational practices of these early civilizations were crucial for the development of star
mapping.
Early astronomers grouped stars in patterns to identify and to memorize regions
of the sky. Different cultures perceived different star patterns. By about 2000 BC,
both Egyptians and Babylonians had identified star groups, which typically took the
form of animal or mythic-human figures. Since the purpose was to make the patterns easy for everyone to remember, hardly anything was more suitable than animals or mythic-human figures. The main point was to recognize an area of the sky.
The use of animals and mythic-human figures in constellations raises a deeper
question about the nature of their significance. From cave paintings, to constellation
figures, and to the message plaques on Pioneer and Voyager space probes, what is
the most suitable carrier of our intended message?

2.3.1 The Celestial Sphere Model

The Egyptians and Babylonians did not produce models of the cosmos that could
account for the observed motions of the heavens or reveal the true shape of the
earth. A rational, theoretical approach to these problems began with the Greek
philosophers of the fifth century BC. People believed that the Sun, the Moon, the planets, and the stars were embedded in the surfaces of several concentric spheres centered on the Earth, and that these spheres constantly revolved about the Earth. This spherical model became a cornerstone of Greek astronomy and Greek civilization. The Greeks also developed skills in spherical geometry that
enabled them to measure and map star positions.
We know that the stars are not set in one sphere. But for the purposes of
observation and mapmaking, this model works quite well. The Greek celestial
spherical model enabled astronomers and cartographers to construct globes and armillary spheres to show the stars, the poles, the equator, the ecliptic, and the tropics.
Eudoxus of Cnidus first described many of the constellations that we still use
today. Some constellations came from the Babylonians, such as the Scorpion,
the Lion, and the Bull. On the other hand, Perseus, Andromeda, and Hercules
are Greek mythic figures. These figures marked different regions in the sky. The
earliest representation of the classical constellations is the Farnese Atlas. The Museo
Nazionale in Naples houses a marble statue of the mythological character, Atlas,
who supports the heavens on his shoulders (See Fig. 2.10). Figure 2.11 shows some
constellation figures on the celestial globe. The hands on either side are the hands
of Atlas.
Figures 2.12 and 2.13 are star maps of the 48 classical constellations in the
Northern and Southern Hemisphere, respectively, published in the 1795 edition of
The Constellations of Eratosthenes by Schaubach.
Celestial mapping relies on two fundamental inventions by Greek astronomers:
a spherical model of the heavens and star constellations in the sky. The emblematic figure of ancient Greek astronomy is Ptolemy of Alexandria, who compiled a text in the second century AD that remained fundamental to astronomy until the sixteenth century. Known by its Arabic name, the Almagest (“the Greatest”), the work includes a catalogue identifying 1,022 of the brightest stars with their celestial coordinates, grouped into 48 constellations. Ptolemy compiled this catalogue with the aid of naked-eye
sighting devices, but he was indebted to earlier catalogues such as that of the Greek
astronomer Hipparchus (146–127 BC). While Ptolemy specified how to draw the
stars and constellation figures on a globe, there is nothing in Almagest to suggest
that he made two-dimensional star maps (Whitfield 1999).
In order to draw accurate star maps in two dimensions, astronomers needed a
means of projecting a sphere of the sky onto a flat surface while still preserving
correct star positions. A star chart cannot be simply a picture of what is seen in the
sky because, at any given time of night, only about 40 % of the sky is visible.

Fig. 2.10 Atlas with the celestial sphere on his shoulders. This is the earliest surviving representation of the classical constellations (Courtesy of www.cosmopolis.com)

Fig. 2.11 Most of the 48 classical constellation figures are shown, but not the stars comprising
each constellation. The Farnese Atlas, c. 200 BC, from the National Maritime Museum, London

Fig. 2.12 Constellations in the Northern Hemisphere, from the 1795 edition of The Constellations of Eratosthenes

Ptolemy was familiar with the science of map projection through his work
in terrestrial geography. In Planisphaerium, he described the polar stereographic
projection that is ideal for star charts. This projection divides the heavens into
northern and southern hemispheres and spreads each onto planes centered on the
celestial poles. Celestial latitude is stretched progressively away from the poles
toward the equator, and all the stars in one hemisphere can be positioned correctly.
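To make the projection concrete, here is a minimal sketch in Python of the standard modern form of the polar stereographic mapping, r = 2R tan(θ/2), where θ is a star's angular distance from the celestial pole. The function name and the sample coordinates are our own illustration, not Ptolemy's construction in Planisphaerium.

import math

def polar_stereographic(ra_deg, dec_deg, R=1.0):
    # Angular distance from the north celestial pole (the co-latitude).
    theta = math.radians(90.0 - dec_deg)
    # Stereographic radius: latitude circles stretch progressively
    # toward the equator, as described in the text.
    r = 2.0 * R * math.tan(theta / 2.0)
    ra = math.radians(ra_deg)
    return r * math.cos(ra), r * math.sin(ra)

# Polaris (declination about +89.3 degrees) lands near the chart's center;
# a star on the celestial equator lands on the circle r = 2R.
print(polar_stereographic(37.95, 89.26))
print(polar_stereographic(0.0, 0.0))

Every star north of the equator receives a unique, correctly ordered position, which is why the projection divides the heavens into two hemispheric charts.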
Islamic scholars picked up Ptolemy’s science between the eighth and twelfth
centuries. They described the brightest stars, modeled on Ptolemy’s Almagest, and
illustrated each constellation with charts. They also made beautiful, precise celestial
globes. Islamic astronomers perfected a sophisticated scientific instrument called the
astrolabe, which was an essential tool of astronomers until the seventeenth century.
An astrolabe shows how the sky looks from a specific place at a given time.
Only two kinds of star maps have survived from the centuries of classical
and medieval astronomy – the kind embodied by the astrolabe and the image of the
single constellation. Until the fifteenth century, European scientists and scholars

Fig. 2.13 Constellations in the Southern Hemisphere, from the 1795 edition of The Constellations of Eratosthenes

did not draw charts of the entire northern or southern heavens for purposes of
study and demonstration. From 1440, star maps began to feature the 48 classical
constellations.
The Renaissance in Europe revived the need for a celestial map, as a counterpart
to the terrestrial explorers’ world maps. The fascination with constellations as artistic
topics influenced astronomical imagery for four centuries. Constellations became
the subject of a number of Renaissance paintings (See Fig. 2.14).
During the European Renaissance, the celestial globe was imported from the
Islamic world to Europe. The celestial globe had a significant impact on celestial
cartography. Most star charts drawn during the sixteenth to eighteenth centuries
mapped the constellations in reverse, as shown on a globe. Globes are models
of the celestial sphere with viewers standing on the outside. Some cartographers
followed this convention when they made star maps. However, some chose to show
the constellations as they appeared from earth.

Fig. 2.14 The painting of constellations by an unknown artist in 1575 on the ceiling of the Sala
del Mappamondo of the Palazzo Farnese in Caprarola, Italy. Orion the Hunter and Andromeda are
both located to the right of the painting (Reprinted from Sesti 1991)

The first star atlases, commonly showing the 48 constellations, appeared during the sixteenth century. One of the finest of these was Giovanni Gallucci’s Theatrum Mundi of 1588, in which Gallucci positioned the principal stars within vigorously drawn pictures of the constellations. Ptolemy’s star catalogue remained the source for comprehensive star charts throughout the sixteenth century.
No one else had undertaken a new sky survey. But at the end of the century,
two revolutionary changes occurred: Tycho Brahe completely re-measured all of
Ptolemy’s star positions with unprecedented accuracy, and the Dutch navigator
Pieter Keyser organized the southern stars into twelve new constellations – the first
additions to the topography of the sky for 2,000 years.
These new southern constellations took the form of exotic animals: the Toucan,
the Bird of Paradise, and the Flying Fish, along with a figure of an Indian. The
star groups first appeared on globes by the Dutch mapmaker Willem Blaeu and
in the atlas Uranometria, published in 1603 by Johann Bayer. Bayer used Brahe’s
star catalogue, grading the stars for magnitude. The German-Polish astronomer
Johannes Hevelius added seven more constellations in 1690 in his collection of
charts Uranographia. He grouped the stars between existing constellations into new
constellations.
The arms and insignia of most of the royal houses of Europe were once
used to model new constellations, but they were not accepted by the scientific
world and did not last. John Flamsteed catalogued almost 3,000 stars visible from
the Royal Observatory in Greenwich between 1700 and 1720. The atlas drawn
from Flamsteed’s catalogue, elegantly engraved by the artist James Thornhill, was
published after Flamsteed’s death. As telescopes became more and more powerful,
astronomers began including more and more stars in their catalogues. Eventually,
scientists agreed upon a total of 88 constellations. The last hand-drawn star maps
were made by Friedrich Argelander in 1863, containing a staggering total of 324,189
stars with no decorative constellation figures.

The sky is divided up into 88 areas, known as constellations, which serve as a convenient way of locating the position of stars in the sky. Constellations come in
many different shapes and sizes. Some constellations consist of easily recognizable
patterns of bright stars, such as Orion, while others are faint and difficult to identify.
The tradition of dividing the sky into constellations began thousands of years
ago when ancient peoples assigned certain star patterns the names of their gods,
heroes, and fabled animals. With few exceptions, the star patterns bear very
little resemblance to the people and creatures they are supposed to represent; the
connections are symbolic rather than literal.
The ancient Greeks recognized a total of 48 constellations. Various other
constellations were added at later times. Early cartographers were free to introduce
new constellations of their own invention. In 1930, the International Astronomical
Union, astronomy’s governing body, adopted the list of 88 constellations, and set
their exact boundaries.

2.3.2 Constellations

Constellation maps represent some of the most imaginative organizational metaphors, holding isolated stars in an intact image. Here we highlight the
metaphorical details concerning the constellation figures such as Andromeda and
Orion the Hunter so as to identify the nature of metaphoric representations.
The French astronomer Charles Messier (1730–1817) created a catalog of
nebulae and star clusters. The Messier catalog lists 110 deep sky objects cataloged
by M numbers: M1 through M110. Pictures of Messier objects are accessible from
the web, for example, the Messier picture gallery.1 John Louis Emil Dreyer (1852–
1926) published the New General Catalogue (NGC) in 1888 as an attempt to make
a complete list of all nebulae and star clusters known at the time. In 1895 and 1908,
he published supplements to the NGC, which he called the Index Catalogues (IC).
Nearly all of the bright, large, nearby non-stellar celestial objects have entries in
one of these three catalogues. Astronomers use the catalogue numbers to refer to
these objects, preceded by the catalogue acronyms, NGC and IC. For example, the
Andromeda galaxy is coded M31 in the Messier catalog and NGC 224 in the NGC
catalogue.
The Andromeda galaxy (M-31/NGC-224) is the closest large spiral galaxy to our own Milky Way. John Flamsteed (1646–1719), the first Astronomer Royal, compiled his celestial atlas, Atlas Coelestis, at Greenwich. His catalogue of
about 3,000 stars visible from Greenwich was published 10 years after his death
(1729, 1753). Figure 2.15 shows a Hubble Space Telescope (HST) photograph of
the Andromeda galaxy (left) alongside an Andromeda constellation figure from
Flamsteed’s catalogue (right).

1
http://www.astr.ua.edu/gallery2t.html

Fig. 2.15 Left: M-31 (NGC-224) – the Andromeda Galaxy; Right: The mythic figure Andromeda

Andromeda is a spiral galaxy with twice as many stars as our Milky Way. It is the most distant object visible to the naked eye. According to Greek mythology,
Andromeda was King Cepheus’ daughter and Poseidon was the god of the sea. One
day her mother Cassiopeia boasted that she and Andromeda were more beautiful than the Nereids, the sea nymphs. Poseidon was angry and sent floods to the lands ruled
by Cassiopeia and her husband. King Cepheus found out from an oracle that the
only way to calm down Poseidon was to sacrifice his daughter. Andromeda was
chained to a rock, waiting to be sacrificed to a sea monster, when Perseus arrived
just in time, killed the sea monster, and saved the princess. Not surprisingly,
the Andromeda constellation is next to the Perseus constellation as well as the
Cassiopeia constellation (See Fig. 2.16).
The Orion constellation is one of the most recognizable constellations in the
Northern Hemisphere. Orion the Hunter is accompanied by his faithful dogs, Canis
Major and Canis Minor. They hunt various celestial animals, including Lepus (the
rabbit) and Taurus (the bull). According to Greek mythology, Orion once boasted
that he could kill all wild beasts. The goddess of the earth Gaea wanted to punish
Orion for his arrogance and she sent the scorpion to kill him. The scorpion stung
Orion on the heel. So in the night sky, as Scorpio (the scorpion) rises from the eastern
horizon, Orion sets in the west. However, Asclepius, associated with the constellation Ophiuchus, healed Orion and crushed the scorpion. Orion rises again in the east as Asclepius (Ophiuchus) crushes Scorpio into the earth in the west. To the Greeks, in the sky, Orion waves his club in his right hand and holds a lion’s skin trophy
aloft in his left hand (See Fig. 2.17). There are several other versions of the story.
For example, the scorpion did kill Orion and the gods put them on opposite sides
of the sky so that the scorpion would never hurt Orion again.
The red glow in the middle of Orion’s sword is the Orion Nebula. Hanging down
from Orion’s belt is his sword, made up of three fainter stars. The central “star”
of the sword is the Great Orion Nebula (M-42), one of the regions most studied by
astronomers in the whole sky. Nearby is the Horsehead Nebula (IC-434), which is
a swirl of dark dust in front of a bright nebula. Figure 2.18 is another illustration of
Orion the Hunter.

Fig. 2.16 Perseus and Andromeda constellations in John Flamsteed’s Atlas Coelestis (1729)
(Courtesy of http://mahler.brera.mi.astro.it/)

Fig. 2.17 Taurus and Orion in John Flamsteed’s Atlas Coelestis (1729) (Courtesy of http://mahler.
brera.mi.astro.it/)

Fig. 2.18 Orion the Hunter (Courtesy of http://www.cwrl.utexas.edu/syverson/)

Greek mythology provides a “memory palace” for us to remember the overall layout of stars in groups. The influence of terrestrial cartography on celestial cartography is evident in, for example, the use of twin hemispheres, the polar stereographic projection, and terrestrial and celestial globes. Both terrestrial and celestial maps represent macroscopic phenomena in the world. Similar organizational metaphors have been developed for mapping microscopic phenomena. Before we leave the topic of celestial mapping, let us see how scientists pursue the big picture of the universe.

2.3.3 Mapping the Universe

Why do scientists map the Universe? Stephen Landy gave an informative overview
of the history of mapping the universe (Landy 1999). Astronomers study galaxies. Cosmologists study nature on its very largest scales, on which a galaxy is the basic unit of matter. There are billions of galaxies in the observable universe. These galaxies form clusters three million or more light-years across. Figure 2.19 is an illustration from Scientific American in June 1999, showing the scales in the universe. Modern cosmology has a fundamental assumption about the distribution of matter in the universe – the cosmological principle, which says that the universe is overall
homogeneous. On large scales, the distributions of galactic bodies should approach

Fig. 2.19 Large-scale structures in the Universe (Reprinted from Scientific American, June 1999)

uniformity. But scientists face a paradox: how can the uniformity on the ultimate
scale be reconciled with the clumpy distributions on smaller scales? Mapping the
universe may provide vital clues.
In the late 1970s and early 1980s, cosmologists began to systematically map
galaxies (Gregory and Thompson 1982). Cosmo-cartographers discovered that on
scales of up to 100 million light-years, galaxies are distributed as a fractal with a
dimension of between one and two. The fractal distribution of matter would be a
severe problem for the cosmological principle because a fractal distribution is never
homogeneous and uniform. However, subsequent surveys indicated that on scales of
hundreds of millions of light-years, the fractal nature broke down. The distributions
of galaxies appeared to be random on these scales. The cosmological principle was
saved just before it ran into its next challenge.
Astronomer John Huchra at the Harvard-Smithsonian Center for Astrophysics
(CfA) is well known for his work on mapping the Universe. Between 1985 and 1995,
John Huchra, Margaret Geller and others measured relative distances via redshifts
for about 18,000 bright galaxies in the northern sky to make maps of the distribution
of galaxies around us. The CfA used redshift as the measure of the radial coordinate
in a spherical coordinate system centered on the Milky Way. This initial map was
quite surprising; the distribution of galaxies in space was not random, with galaxies
actually appearing to be distributed on surfaces, almost bubble like, surrounding
large empty regions, or “voids.” Great voids and elongated structures clearly indicate an organized structure of matter on large scales. Any cosmological theory
must explain how these structures evolved from an almost uniform universe.
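The role of redshift as a radial coordinate can be made explicit with the low-redshift approximation of Hubble’s law, d ≈ cz/H0. The sketch below is illustrative only; the value of the Hubble constant is an assumed round number, not the one adopted by the CfA survey.

# Hubble's law at low redshift: distance is roughly proportional to redshift.
C_KM_S = 299792.458   # speed of light, km/s
H0 = 70.0             # assumed Hubble constant, km/s per megaparsec

def distance_mpc(z):
    return C_KM_S * z / H0

# A galaxy receding with redshift z = 0.023 lies roughly 100 megaparsecs away,
# about the distance of the CfA2 Great Wall.
print(round(distance_mpc(0.023)))  # -> 99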
CfA’s redshift survey revealed a “Great Wall” of galaxies 750 million light-years
long, more than 250 million light-years wide and 20 million light-years thick (See
Fig. 2.20). This Great Wall is now called the CfA Great Wall to differentiate it from

Fig. 2.20 The CfA Great Wall – the structure is 500 million light-years across. The Harvard-Smithsonian Center for Astrophysics redshift survey of galaxies in the northern celestial hemisphere of the universe has revealed filaments, bubbles, and, arching across the middle of the sample, the Great Wall

the even bigger Sloan Great Wall discovered a few years later in 2003. The CfA
Great Wall is like a giant quilt of galaxies across the sky (Geller and Huchra 1989).
A random distribution cannot readily explain such a coherent structure. Even larger
mapping and surveying projects were undertaken. Stephen Landy (1999) explained
the Las Campanas Redshift Survey, which took place between 1988 and 1994. It would take lengthy exposure times to photograph the most distant galaxies because they were faint. The Las Campanas survey chose to slice through the universe and concentrated on a very deep and wide but thin slice (See Fig. 2.21).
Astronomers have begun to catalogue 100 million of the brightest galaxies
and 100,000 quasars, which are the exploding hearts of galaxies, using a device
called the two-degree field (2dF) spectrograph. The 2dF Galaxy Redshift Survey is an
international collaboration involving more than 30 scientists from 11 institutions. It
is due to complete in 2006. The survey aims to learn more about the structure of the
Universe, how galaxies are made and how they form into larger structures.
The 2dF instrument is one of the most complex astronomical “cameras”
ever built. It uses 400 optical fibers, all of which can be positioned by an incredibly
accurate robotic arm in about one hour. The 2dF instrument allows astronomers to
observe and analyze 400 objects at once, and on a long clear night, they can log
the positions of more than 2,000 galaxies. It has taken less than 2 years to measure
the distances for 100,000 galaxies. Without the 2dF instrument, this project would
have taken decades. Figure 2.22 shows a virtual scene of flying through a three-
dimensional model of the universe.

Fig. 2.21 Slice through the Universe (Reprinted from Scientific American, June 1999)

Fig. 2.22 Flying through the 3D universe map (Courtesy of http://msowww.anu.edu.au/)

2.3.3.1 Sloan Digital Sky Survey

The Sloan Digital Sky Survey (SDSS) is one of the most ambitious and influential
surveys in the history of astronomy.2 It is designed to collect astronomical data for the study of the origin and evolution of the universe, the mapping of large-scale structures in the universe, and the study of quasars and their evolution. According to the
official site of SDSS, SDSS-I (2000–2005) and SDSS-II (2005–2008) covered more
than a quarter of the sky and created three-dimensional maps containing more than
930,000 galaxies and more than 120,000 quasars. SDSS-III is currently in operation
(2008–2014).
Some of the recent discoveries were only possible with the large amount of
data collected by the SDSS survey. For example, astronomers were able to detect

2
http://www.sdss.org/

cosmic magnification caused by the gravitational effect of dark matter throughout the universe with observations of 13 million galaxies and 200,000 quasars from the
SDSS.
SDSS generates a vast volume of astronomical data of large-scale structures,
galaxies, quasars, and stars. It has made a series of data releases to the public. The
website for publicly-released data (skyserver.sdss.org) receives millions of hits per
month. In parallel, astronomers have used SDSS data in their research and produced
a rapidly growing body of scientific literature.
Mapping the universe is a dream of many generations. The first scientifically accurate map of the universe is the Logarithmic Map of the Universe,3 created by a group of astronomers including J. Richard Gott III, Mario Juric, David Schlegel, and Michael Vogeley. The logarithmic map depicts the entire visible universe in a rectangular shape, with the Earth as the bottom line of the map and the Big Bang as the top. The rectangular map includes SDSS galaxies and quasars as well as astronomical objects that one can see from the Earth, such as the Sun, the Moon, and stars in famous constellations. A computer printout of the map stretches from the floor all the way to the height of an office door. Figure 2.23 shows part of the map. This portion of the map shows astronomical objects beyond 100 megaparsecs (Mpc) from the Earth. The scale in Mpc is shown on the left-hand side of the map. At about 100 Mpc, there is the CfA2 Great Wall. The Coma Cluster is about the same distance from the Earth. The Sloan Great Wall is located at about 120 Mpc. Because the map is on a logarithmically transformed scale, the upper part of the map is compressed at a higher rate than the lower part. SDSS galaxies start from about 100 Mpc; SDSS quasars start from about 2,000 Mpc. Several high redshift SDSS quasars are marked on the right-hand half of the map with ‘+’ signs. Near the top left, just past the 1,000 Mpc mark, a dashed line marks the birth of the first stars after the Big Bang. Above the dashed line is a line for the cosmic microwave background, and right above it, a solid line marks the time of the ‘Big Bang.’
A point on the celestial sphere can be identified by its right ascension and declination. The rectangular map contains the positions of SDSS galaxies and quasars in terms of right ascension and their distances measured from the Earth. With the rectangular map, viewers can easily tell how far away an astronomical object or structure is from us. Figure 2.24 shows a circular map of the universe we generated in 2007 based on the SDSS data. The map was selected for the 3rd iteration of the Places & Spaces exhibition in 2007. In 2008, a modified version of the map was entered into the Science & Engineering Visualization Challenge, organized by the NSF and the journal Science, and received a semifinalist award.
The circular map of the universe depicts astronomical objects and scientific
activities associated with some of the astronomical objects. The radius of the
circular map represents the look-back time, or the approximate time elapsed from
the beginning of the universe. The further away from the Earth an object is on the map,
the closer it was to the beginning of the universe. Figure 2.25 shows the sketch of

3
http://www.astro.princeton.edu/universe/

Fig. 2.23 Part of the rectangular logarithmic map of the universe depicting major astronomical objects beyond 100 Mpc from the Earth (The full map is available at http://www.astro.princeton.edu/universe/all100.gif. Reprinted from Gott et al. 2005)

the design. The scale is logarithmically transformed to compress the vast voids in space into a compact map. The Earth is at the center of the map because the SDSS telescope, located on the Earth, is used to measure the distance of each astronomical object from us. Quasars, for example, which formed in the early stages of the universe, appear near the outer rim of the circular map. Each red dot in the outer belt depicts a quasar found by the SDSS survey.
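A minimal sketch of this layout in Python, under our own simplifying assumptions (the radial limits and the use of right ascension alone are illustrative): the radius encodes the logarithm of an object’s distance from the Earth, and right ascension becomes the angle around the circle. The same logarithmic transform, applied to a single vertical axis, yields the rectangular map described earlier.

import math

def circular_map_position(distance_mpc, ra_deg,
                          r_min_mpc=1e-3, r_max_mpc=14000.0):
    # Normalize log distance into [0, 1]: nearby objects cluster near the
    # center, and the early universe sits near the rim.
    t = ((math.log10(distance_mpc) - math.log10(r_min_mpc)) /
         (math.log10(r_max_mpc) - math.log10(r_min_mpc)))
    angle = math.radians(ra_deg)
    return t * math.cos(angle), t * math.sin(angle)

print(circular_map_position(100.0, 30.0))    # near the radius of the CfA2 Great Wall
print(circular_map_position(2000.0, 200.0))  # an SDSS quasar, closer to the rim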
The map conveys 14 types of information, including various astronomical objects
such as high redshift quasars found by SDSS, extrasolar planets, stars, and space
probes. In addition to astronomical objects found by the SDSS survey, the circular
map of the universe contains the positions of several other types of objects such
as galaxies found by the CfA2 survey, galaxies on the Messier catalog, and the
brightest stars in the sky. Some of the objects are associated with information about
when they were discovered and the time periods in which articles about these objects
attracted bursts of citations. Figure 2.26 shows the types of objects on the map, the subtotal of each type of object, and examples.

Fig. 2.24 A map of the universe based on the SDSS survey data and relevant literature data from the Web of Science. The map depicts 618,223 astronomical objects, mostly identified by the SDSS survey, including 4 space probes (A high resolution version of the map can be found at http://cluster.cis.drexel.edu/~cchen/projects/sdss/images/2007/poster.jpg)

Fig. 2.25 The design of the circular map of the universe

Figure 2.27 shows the center of the circular map of the universe. The Earth is at the center of the depicted universe – we are of course aware of what the Copernican revolution was all about. The logarithmic scale shown on the map, along the northeast direction, gives us a rough idea of how far away an object is from the

Fig. 2.26 The types of objects shown in the circular map of the universe

Fig. 2.27 The center of the circular map of the universe

Earth. For example, artificial satellites orbit 10,000–100,000 km above the Earth. The distance between the Sun and the Earth is one astronomical unit
(AU). Space probe Pioneer 10 was about 100 AU away from the Earth at the time
of the SDSS survey. Sirius, the brightest star in the sky, is only a few parsecs

Fig. 2.28 Major discoveries in the west region of the map. The 2003 Sloan Great Wall is much
further away from us than the 1989 CfA2 Great Wall

(pc) away from us; according to the Hipparcos astrometry satellite, it is 2.6 pc, or 8.48 light-years, away. About 100 pc away, there are over 8,000 objects identified by the Hipparcos satellite.
Major discoveries in astronomy are also marked on the map, for example, the
discovery of Neptune in 1846 and the discovery of the first quasar 3C 273 in 1963,
the CfA2 Great Wall in 1989, and the Sloan Great Wall in 2003 (See Fig. 2.28).
As of 2012, the Sloan Great Wall is still the largest known cosmic structure in the
universe. It was discovered by J. Richard Gott III, Mario Juric, and their colleagues
in 2003 based on the SDSS data. The Sloan Great Wall is gigantic. It is 1.38 billion
light years in length. This is approximately 1/60 of the diameter of the observable
universe. It is about one billion light-years from the Earth. The Sloan Great Wall
is 2.74 times longer than the CfA2 Great Wall of galaxies discovered in 1989 by
Margaret Geller and John Huchra.
J. Richard Gott III and Mario Juric generously shared with us the data and the
source code they used to generate their rectangular-shaped logarithmic map of the
universe, which is the first scientifically accurate map of the universe, and our
circular map is the second. One of the lessons we learned is the invaluable role of a shared computer programming language in facilitating interdisciplinary research: it provided a firm common ground for astrophysicists and information scientists.
It is easy to tell from Fig. 2.28 that the scope of the SDSS survey reaches much closer to the beginning of the universe than the scope of the CfA2 survey, marked by the galaxies in yellow. The red dots are high redshift quasars found by the SDSS survey. The blue dots are galaxies found by the SDSS. The yellow dots are galaxies found by the 1989 CfA2 survey. The SDSS survey goes deeper than what the Hubble Deep Field (HDF) had reached in 1995. The HDF is an image of a small region obtained by assembling 342 separate exposures taken by Hubble over 10 consecutive days, from December 18 to December 28, 1995; as I am writing on December 27, 2012, that is almost exactly 17 years ago. Because the HDF image
reveals some of the youngest and most distant galaxies ever known, it has become a
landmark image in the study of the early universe.

Fig. 2.29 The Hubble Ultra Deep Field (HUDF) is featured on the map of the universe

Figure 2.29 shows the Northeast quadrant of the map. The Hubble Ultra Deep
Field (HUDF), located near the upper left corner of the image, reached even deeper
than the 1995 HDF. In other words, the HUDF reveals an even earlier stage of
the universe. It looks back approximately 13 billion years, which is about 400–
800 million years after the Big Bang. Its marker is approaching the 10 gigaparsec (Gpc) mark on the distance scale. One gigaparsec (Gpc) is 3.0857 × 10²⁵ m, or 3.26
billion light-years. The HUDF’s record was recently updated by the eXtreme Deep
Field (XDF), released on September 25, 2012. The XDF reveals galaxies formed
only 450 million years after the Big Bang.
In addition to the depiction of astronomical objects such as galaxies and quasars,
the circular map of the universe also presents information about which astronomical
objects have attracted the attention of astronomers in terms of citation bursts.
We will explain the concept of citation burst in detail in later chapters of the
book. Simply speaking, a citation burst of a scientific publication measures the
acceleration of the citations it has received. A strong citation burst is a sign that
the article has generated a significant level of interest in the scientific community,
in this case, astronomers. Figure 2.30 shows that a citation burst was found with the object QSO J1030+0524 between 2003 and 2004. This object, as it turns out, was the most distant quasar known when it was discovered. Astronomers measure the redshift of an object with a metric z, which is the change in the wavelength of the object divided by the rest wavelength of the light. The quasar was discovered to have a z of 6.28 at the time, which was very high. The next quasar labeled below QSO J1030+0524 is QSO J1044-0125, which has a citation burst between 2000 and 2004. It is a high redshift quasar as well (z = 5.73). The third labeled quasar, QSO J1048+4637, also has a high redshift (z = 6.23).
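The definition of z turns into a one-line computation. In the hypothetical check below we use the Lyman-alpha line, with a rest wavelength of about 121.6 nm; the specific observed wavelength is chosen to reproduce z = 6.28 and is illustrative, not a measured value from the discovery paper.

def redshift(observed_nm, rest_nm):
    # z = (observed wavelength - rest wavelength) / rest wavelength
    return (observed_nm - rest_nm) / rest_nm

# Lyman-alpha light emitted at ~121.6 nm and observed at ~885 nm has been
# stretched from the ultraviolet into the near infrared:
print(round(redshift(885.3, 121.6), 2))  # -> 6.28, the redshift of QSO J1030+0524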

Fig. 2.30 SDSS quasars associated with citation bursts

Fig. 2.31 A network of co-cited publications based on the SDSS survey. The arrow points to an
article published in 2003 on a survey of high redshift quasars in SDSS II. A citation burst was
detected for the article

Figure 2.31 shows the literature resulting from the SDSS survey. Each dot represents a published article. The size of its tree-ring indicates the citations received by the corresponding article. The yellow arrow points to an article by Fan
et al. in 2003 on a survey of high redshift quasars in SDSS II, the second stage of
the SDSS project. The article was found to have a burst of citations, indicating the
attention it attracted from the scientific community. In later chapters in this book,
we will discuss this type of literature visualization in more detail.
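To give a feel for the idea ahead of the formal treatment, here is a deliberately simplified sketch: flag the years in which an article’s citation count jumps well above its own running average. The thresholding rule and the citation history below are our assumptions for illustration, not the burst detection algorithm discussed later in the book.

def burst_years(counts_by_year, ratio=2.0):
    # Flag a year as a burst if its citation count exceeds `ratio` times
    # the average of all earlier years.
    bursts, seen = [], []
    for year in sorted(counts_by_year):
        if seen and counts_by_year[year] > ratio * (sum(seen) / len(seen)):
            bursts.append(year)
        seen.append(counts_by_year[year])
    return bursts

# A hypothetical citation history of an article:
citations = {2003: 5, 2004: 7, 2005: 6, 2006: 30, 2007: 41, 2008: 12}
print(burst_years(citations))  # -> [2006, 2007]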
The SDSS example has practical implications for science mapping. First, astronomy provides a natural framework to organize and display the large number of astronomical objects. The distance between a high redshift quasar and the Earth
is meaningful. It can be precisely explained in scientific terms. The mission of the
scientific frontier in this context is to understand the early universe. The attention
of the research frontier is clearly directed to the high redshift quasars because they
were formed soon after the Big Bang. An intuitive indicator of the progression of the
frontier is the look-back time, i.e. how closely objects formed after the Big Bang can
be observed. The structure of the universe in this case provides an intuitive reference

to represent where the current scientific frontier is and where its next move might
be. Just imagine for a moment what it would be like if we did not have such an organizing structure to work with.
Second, the structure of the universe provides an intellectual playground for
astronomers. It is clear that astronomers, as expected, do not devote their attention
evenly across the universe. Once we have developed a good understanding of our
local environment, the search is extended to other parts of the universe, literally far
and wide. The organizing metaphor in astronomy coincides with the universe. The
isomorphic relation raises a new question: is there a situation in which the nice and
intuitive structure may limit our creativity? Are there theories that have proven valuable in one part of the universe and would be potentially valuable if they were applied elsewhere in the universe? The visualization of the relevant literature
shows a different structure. In other words, the physical world and the conceptual
world have different structures. Things are connected not simply because they are
in proximity. Likewise, things separated by a vast space of void in the universe may
be close to each other in the conceptual world.
It seems likely to be common rather than exceptional that we will deal with multiple perspectives of the same phenomena, and each perspective may lead to a
unique picture. What do we need to do to reconcile multiple perspectives? Do we
need to reconcile at all? What can we gain from having multiple views and what do
we have to lose?

2.4 Biological Maps

The most representative microscopic phenomena are found in the biological world. Astronomers use powerful telescopes to probe stars that are too far away for our naked eyes to see. Biologists use sophisticated microscopes to detect structures that are too small to be visible to our naked eyes. A good example is the services provided
at the website string-db.org, where one can search and navigate through some of the
most comprehensive information about proteins, including evidence and literature
and many other types of information.

2.4.1 DNA Double Helix

The history of deoxyribonucleic acid (DNA) research began with the Swiss biologist
Friedrich Miescher. In 1868 he carried out the first chemical studies on the nuclei
of cells. Miescher detected a substance that he called nuclein and showed that
nuclein consisted of an acidic portion, which included what we know today as DNA, among other components. Later he found a similar substance in the heads of salmon sperm
cells. Although he separated the nucleic acid fraction and studied its properties,

the covalent structure of DNA did not become known with certainty until the late
1940s. Miescher suspected that nuclein or nucleic acid might play a key role in cell
inheritance, but others ruled out such a possibility. It was not until 1943 that the first
direct evidence emerged for DNA as the bearer of genetic information. In that year,
Oswald Avery, Colin MacLeod, and Maclyn McCarty, working at the Rockefeller Institute, provided evidence that DNA is the carrier of genetic information in all living cells. In the early 1950s biologists still did not know what the DNA molecule
looked like or how the parts of it were arranged.
At King’s College London, the English physicist Maurice Wilkins, together with the English scientist Rosalind Franklin, spent most of 1951 using an X-ray method of photography to work out the structural shape and nitrogenous base arrangements of DNA. Rosalind Franklin was an expert in using X-ray
crystallography to study imperfectly crystalline matter, such as coal. She discovered
the two forms of DNA. The easily photographed A form was dried, while the B
form was wet. While much harder to photograph, her pictures of the B form showed
a helix. Since the water would be attracted to the phosphates in the backbone, and
the DNA was easily hydrated and dehydrated, she guessed that the backbone of the
DNA was on the outside and the bases were therefore on the inside. This was a
major step forward in the search for the structure of DNA.
In May 1952, Franklin got her first good photograph of the B form of DNA,
showing a double helix. This was another major breakthrough, but Franklin missed
it and continued working on the A form. James Watson and Francis Crick started
their work together on DNA in 1951 at Cambridge University. By the end of
1952, Watson approached Maurice Wilkins, who gave him one of Franklin’s x-ray
photographs. Watson started to build a new model of DNA revealing DNA’s
structure as a double helix, or spiral. In 1962, together with Maurice Wilkins, Watson and Crick were awarded a Nobel Prize for their discovery of the structure of DNA.
Figure 2.32 shows the original structure of DNA’s double helix.
Despite proof that DNA carries genetic information from one generation to the
next, the structure of DNA and the mechanism by which genetic information is
passed on to the next generation remained the single greatest unanswered question
in biology until 1953. It was in that year that James Watson, an American geneticist,
and Francis Crick, an English physicist, working at the University of Cambridge
in England proposed a double helical structure for DNA (Watson and Crick
1953). This was a key discovery for molecular biology and modern biotechnology.
Using information derived from a number of other scientists working on various
aspects of the chemistry and structure of DNA, Watson and Crick were able to
assemble the information like pieces of a jigsaw puzzle to produce their model
of the structure of DNA. Watson gave a personal account of the discovery (Watson 1991).

Fig. 2.32 The original structure of DNA’s double helix (Reprinted from Watson 1968)

2.4.2 Acupuncture Maps

Acupuncture began with the original Chinese medical text, the Yellow Emperor’s
Classic of Internal Medicine (475 BC). In this text, all six Yang Meridians were
said to be directly connected to the Auricle, whereas the six Yin meridians were
indirectly connected to the ear. These ancient Chinese Ear Points were arranged as
a scattered array of points on the ear. Figure 2.33 is an ear acupuncture point map.
What is the best organizing metaphor?
The auricle of the ear is a complete miniature of the human body. There
are over 200 specific acupuncture points. In auriculotherapy, the auricle of the
external ear is utilized to alleviate pain, dysfunction and disease as represented and
manifest throughout the body. All vertebrae, sympathetic and parasympathetic nerves, spinal nerves, visceral organs, the central nervous system, all anatomical sites, and many functional points are represented on the ear. While originally based upon the ancient Chinese practice of acupuncture, the somatotopic correspondence of specific parts of the body to specific parts of the ear was first developed by Paul Nogier, a French doctor of medicine, in the late 1950s.
According to Nogier, the auricle mirrors the internal organs and auricular points can
be mapped to an inverted projection of an embryo.
Nogier developed a somatotopic map of the ear based upon the inverted fetus
concept. His work was first presented in France and then published by a German

Fig. 2.33 Ear acupuncture point map. What is the best organizing metaphor? (Courtesy of http://www.auriculotherapy-intl.com/)

acupuncture society and then finally translated into Chinese. In 1958, a massive
study was initiated in China to verify the clinical value of his inverted-embryo
model. In 1980, a study at UCLA by Richard Kroening and Terry Oleson examined the accuracy of auricular diagnosis. A statistically significant level of 75 % accuracy was achieved in diagnosing musculoskeletal pain problems in 40
pain patients. Figure 2.34 is a map showing musculoskeletal points.
Auricular therapy has numerous applications. Much work has been done to establish the relationship between the auricle and the body as a whole, the location and distribution of auricular points, and the function and specificity of the auricular points, in addition to verifying Nogier’s theory.

Fig. 2.34 Musculoskeletal points (©1996 Terry Oleson, UCLA School of Medicine. http://www.americanwholehealth.com/images/earms.gif)

2.4.3 Genomic Maps

Due to the publicity of the Human Genome Project, genomic maps, gene expression visualization, and bioinformatics have become buzzwords in the mass media. Traditionally, the common practice has been to analyze expression data in a single dimension. Single-dimensional analysis places genes in a total ordering, limiting the ability to see important relationships.
Kim et al. (2001) visualized the C. elegans expression data in three dimensions.
Groups of related genes in this three-dimensional approach appear as mountains,
and the entire transcriptome appears as a mountain range. Distances in this synthetic
geography are related to gene similarity, and mountain heights are related to the
density of observed genes in a similar location. Expression visualization allows us
to hypothesize potential gene-gene relationships that can be experimentally tested.
To find out which genes are co-expressed, Kim et al. first assembled a gene
expression matrix in which each row represents a different gene (17,817 genes) and
each column corresponds to a different microarray experiment (553 experiments).
The matrix contains the relative expression level for each gene in each experiment
(expressed as log2 of the normalized Cy3/Cy5 ratios). They calculated the Pearson

Fig. 2.35 Caenorhabditis elegans gene expression terrain map created by VxInsight, showing
three-dimensional representation of 44 gene mountains derived from 553 microarray hybridiza-
tions and consisting of 17,661 genes (representing 98.6 % of the genes present on the DNA
microarrays) (Reprinted from Kim et al. 2001)

correlation coefficient between every pair of genes. For each gene, the similarities between it and the 20 genes with the strongest positive correlations were used to assign that gene to an x-y coordinate in a two-dimensional scatter plot with the use of force-directed placement. Each gene is thus placed near other genes that are similar in gene expression. Figure 2.35 shows a terrain map of Caenorhabditis elegans gene
expressions.
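The pipeline just described can be sketched in a few lines of Python. The matrix size, the spring layout, and the neighbor selection below are our simplifications for illustration; Kim et al. worked with the full 17,817 × 553 matrix and rendered the result in VxInsight.

import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 25))   # stand-in for the gene-by-experiment matrix

corr = np.corrcoef(expr)            # Pearson correlation between every pair of genes
np.fill_diagonal(corr, -np.inf)     # exclude each gene's self-correlation

k = 20                              # the 20 strongest positive correlations per gene
G = nx.Graph()
for i in range(expr.shape[0]):
    for j in np.argsort(corr[i])[-k:]:
        if corr[i, j] > 0:
            G.add_edge(i, int(j), weight=float(corr[i, j]))

# Force-directed placement assigns each gene an x-y coordinate; mountain
# heights then come from the local density of genes at each location.
pos = nx.spring_layout(G, weight="weight", seed=0)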

2.4.4 A Map of Influenza Virus Protein Sequences

In May 2009, as H1N1 was rapidly spreading across many countries, there was a
rich body of knowledge about influenza pandemics in the literature. Figure 2.36
shows a similarity map of 114,996 influenza virus protein sequences. Each dot is
an individual influenza virus protein sequence. Two sequences are connected if they
are similar in terms of protein structure. Structural similarity is one way to organize
protein sequences. There could be other ways, for example, based on similarities
of biological properties. Once again, multiple perspectives can be applicable. The
question is what would be the best combination of information provided by various
views to solve problems at hand.

Fig. 2.36 114,996 influenza virus protein sequences (Reprinted from Pellegrino and Chen 2011)

In later chapters, we will propose a generic computational approach that can be used to identify the best paths to accomplish our goals in the framework of a
complex adaptive system. In particular, the dynamics of scientific frontiers can be
characterized as a special case of an exploratory search problem.
In summary, starting with the basic principles of cartography for visual communication – simplicity and clarity – we have elaborated the role of organizing metaphors with examples from terrestrial maps, celestial maps, and biological maps in order to highlight the most fundamental needs for effective visual communication. In conclusion, a metaphor for grouping abstract concepts should be assessed against a number of similar criteria. For example, a metaphor must afford an intact image. Narratives such as Greek mythology are useful for connecting individual components together. A metaphor must represent themes in a way that can be understood by viewers with the minimum amount of specialized knowledge. Sometimes cartographers can never know their viewers, as in the design of Pioneer’s plaque; cartographers can only assume the least amount of prior knowledge required to understand a thematic map. Selecting an appropriate metaphor that can be understood by a wide variety of viewers is probably the most challenging task in the entire process of cartographic design, especially as we move from a concrete and tangible world to an abstract and fluid world in the next chapter. Finally, we have demonstrated the prevalence of multiple perspectives that one may encounter when dealing with real-world complex systems. Furthermore, the differences between multiple perspectives may not be reducible. A key message is that we may well consider how to take advantage of the presence of multiple perspectives rather than look for ways to avoid them.

References

Fan XH, Strauss MA, Schneider DP, Becker RH, White RL, Haiman Z, Gregg M, Pentericci L,
Grebel EK, Narayanan VK, Loh YS, Richards GT, Gunn JE, Lupton RH, Knapp GR, Ivezic
Z, Brandt WN, Collinge M, Hao L, Harbeck D, Prada F, Schaye J, Strateva I, Zakamska N,
Anderson S, Brinkmann J, Bahcall NA, Lamb DQ, Okamura S, Szalay A, York DG (2003) A
survey of z > 5.7 quasars in the Sloan Digital Sky Survey. II. Discovery of three additional
quasars at z > 6. Astron J 125(4):1649–1659. doi:10.1086/368246
Geller MJ, Huchra JP (1989) Mapping the universe. Science 246:897
Gregory SA, Thompson LA (1982) Superclusters and voids in the distributions of galaxies. Sci
Am 246(3):106–114
Hearnshaw HM, Unwin DJ (1994) Visualization in geographical information systems. Wiley, New York
Kim S, Lund J, Kiraly M, Duke K, Jiang M, Stuart J et al (2001) A gene expression map for
Caenorhabditis elegans. Science 293:2087–2092
Landy SD (1999) Mapping the universe. Sci Am 280(6):38–45
Lin X (1997) Map displays for information retrieval. J Am Soc Inf Sci 48(1):40–54
Pellegrino DA, Chen C (2011) Data repository mapping for influenza protein sequence analysis.
Paper presented at the 2011 Visualization and Data Analysis (VDA)
Gott JR III, Juric M, Schlegel D, Hoyle F, Vogeley M, Tegmark M, Bahcall N, Brinkmann J (2005) A map of the universe. Astrophys J 624:463–484
Sesti GM (1991) The glorious constellations: history and mythology. Harry N. Abrams, Inc.,
New York
Watson J (1991) The double helix: a personal account of the discovery of the structure of DNA.
Mass Market Paperback
Watson JD (1968) The double helix. Atheneum, New York
Watson JD, Crick FHC (1953) A structure for deoxyribose nucleic acid. Nature 171:737–738
Whitfield P (1999) The topography of the sky: celestial maps gave order to the universe. Mercator’s World, Eugene
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Chapter 3
Mapping Associations

The eyes are not responsible when the mind does the seeing.
Publilius Syrus (circa 85–43 BC)

In Chap. 2, we have introduced a series of examples of how cartography selects and depicts terrestrial, celestial, and human biological features of physical phenomena.
Geographic and oceanographic maps help us find our way on land and sea. Star
maps help us explore the universe. In this chapter, we turn our attention inward and
explore the design of mind maps, maps that represent our thought, our experience,
and our knowledge. In traditional cartography, a thematic map always has a base
map and a thematic overlay. For many physical phenomena, a geographic map is
probably the best base map we may ever have: intuitive, solid, and real. Now we
want to produce a map of the mind. For this category of phenomena, a geographic connection may no longer be valid. We cannot take a geographic base map for granted. What metaphor do we use to hold together something as fluid as our thought? What are the design principles for constructing a metaphoric base map that
can adequately represent what is by its nature invisible, intangible, and intractable?

3.1 The Role of Association

In this chapter, we focus on the most basic requirements for representing abstract, dynamic, and often elusive abstractions of a structure with no inherent connections between its content and a concrete, tangible form. We are particularly looking
for possible extensions and adaptations of cartographic techniques, for example,
terrain relief maps, landscape views, and constellation maps. But terrain maps
and constellation maps now acquire new meanings and transcend the boundary
of geometry, topology, and their appearance. The semantics of geometric features,
topological patterns, and temporal rhythms now need to be conveyed effectively


Fig. 3.1 Liberation by Escher. Rigid triangles are transforming into more lively figures (© Worldofescher.com)

and accurately. Visual attributes of geometric and topologic configurations must transform the invisible and intangible knowledge into something visible, tangible,
concrete, and meaningful; just like Escher’s Liberation (Fig. 3.1).
To understand the fundamental requirements of the challenge, we explore a
wide variety of examples in the realm of mind mapping, such as concept maps, semantic maps, mind maps, and knowledge maps. By examining these examples, we
aim to identify the most fundamental elements in the practice of mind mapping. We
first introduce some basic concepts, followed by more detailed examples of concept
mapping and illustrate the essence of creating and interpreting conceptual maps with
examples of co-word analysis.

3.1.1 As We May Think

In Chap. 2, we have seen the power of constellations in celestial cartography in holding otherwise isolated stars in an easy-to-remember image. Constellations are
good examples of how one can make association easily with the help of a metaphor,
a framework, or an image.
Making associations is an important part of our thinking. We make connections
all the time. Vannevar Bush (1890–1974) proposed a device called Memex to mimic
the way we think when we organize and search for information (Bush 1945). The
concept of association is central. We learn new concepts by associating them with

familiar ones. In Memex, the idea is to make such connections accessible to other
people. Connections made in this way are called trails. Bush referred to people who make such trails as trailblazers. Trailblazers are builders of an ever-growing
information space. Memex itself has never materialized, but it has a gigantic nearest
kin – the World-Wide Web.
We know that the Web relies on hypertext reference links to pull millions of
documents together. In fact, studies of small-world networks have found the Web
has many features of a small-world network. We will return to small-world networks
in later chapters, but here an interesting thing to know is that the Web has a diameter
of about 16, which means that given an arbitrary pair of documents on the Web, we
can reach one from the other by following a chain of, on the average, 16 hyperlinks.
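The flavor of this claim is easy to probe at toy scale with a scale-free random graph, a common stand-in for Web-like link structures. The model, sizes, and seed below are our own choices for illustration, not measurements of the actual Web.

import networkx as nx

# A Barabasi-Albert graph mimics the Web's heavy-tailed link distribution.
G = nx.barabasi_albert_graph(n=2000, m=3, seed=42)

# Despite 2,000 nodes, a typical pair is only a few hops apart; the average
# path length grows roughly logarithmically with the number of pages.
print(nx.average_shortest_path_length(G))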
A central issue for the Web is how to make sure that users can find their way in
this gigantic structure. The predecessor of the Web is a group of hyper-referencing-
enabled information systems – hypertext systems. Research in hypertext in the late 1980s was marked by a number of classic hypertext systems such as Apple’s HyperCard and NoteCards from Xerox PARC. Navigation has been a central
research issue for hypertext over the last two decades. For example, Canter and his
colleagues distinguished five types of search in hyperspace (Canter et al. 1985):
• Scanning: covering a large area without depth.
• Browsing: following a path until a goal is achieved.
• Searching: striving to find an explicit goal.
• Exploring: finding out the extent of the information given.
• Wandering: purposeless and unstructured globetrotting.
An overview map is a commonly used solution to the notorious lost-in-
hyperspace problem first identified by Jeff Conklin (1987). A wide variety of
techniques have been developed over the last two decades for generating overview
maps automatically. The sheer size of the Web poses a tremendous challenge. Many
algorithms developed prior to the Web need to be scaled up before they can handle
the Web. New strategies have been developed to avoid brute-force approaches.

3.1.2 The Origin of Cognitive Maps

The origin of cognitive maps can be traced back to Edward Tolman’s famous study
published in 1948 on the behavior of rats in a maze1 (Tolman 1948). He studied the
behavior of those rats that managed to find the food placed in a maze and realized
that his rats had obviously managed to remember the layout of the maze. Prior to
Tolman’s study, it was thought that rats in a maze were only learning to make left or right turns at particular turning points. Tolman called this internalized layout a
cognitive map. He further proposed that rats and other organisms develop cognitive
maps of their environments.

1
http://psychclassics.yorku.ca/Tolman/Maps/maps.htm

Humans’ mental image of a city’s layout is another example of cognitive maps. Different people may develop different cognitive maps, even when they live in the
same city. Many researchers study sketches of people’s internal cognitive maps. For
example, geographers often ask people to sketch a map of an area with directions
to a landmark or other location, or ask them to name as many places as possible
in a short period of time. In this way, one can estimate the strength of association
between two places in someone’s cognitive map.
Geographic maps and urban design have provided a rich source of design
metaphors for cognitive structures. In the 1960s, Kevin Lynch (1960) tried to restore
the social and symbolic function of the street and other public spaces and to make
modern cities “legible.” In his The Image of the Environment, Lynch stressed that
we need to know where we are within a city and we need to have a workable image
of each part of the city. In particular, the concept of legibility depends on people’s
mental maps.
Legibility refers to whether the layout of a place is easy to understand. Lynch
identified five elements of legibility: paths, edges, districts, nodes and landmarks.
Paths are familiar routes along which people move. A city has a network of
major routes and a network of minor routes in the neighborhood.
Districts are areas with perceived internal homogeneity. They are medium-to-
large sections of the city. They share some common identifying character.
Edges are paths that separate districts.
Landmarks are visually prominent points of reference in a city, for example,
Big Ben in London and the Eiffel Tower in Paris. Landmarks help people orient
themselves in a city.
Nodes are centers of attraction in a city, such as Trafalgar Square in London
and the Forbidden City in Beijing. Where a landmark is a distinct visual object, a
node is a distinct hub of activity.
In The Image of the City, Lynch described his studies in three American cities.
For example, Manhattan is organized by a grid structure. Travelers who are aware
of this organization can use this information to guide their journey. For urban
designers, Lynch’s innovative use of graphic notation links abstract ideas of urban
structure with the human perceptual experience.
Rob Ingram and Steve Benford at Nottingham University in Britain incorporated
legibility features into the design of information visualization systems (Ingram
and Benford 1995). Here is an outline of their approach. First, represent the
interrelationships among a set of documents as a graph of nodes and links. Then
apply Zahn's Minimum Spanning Tree algorithm (Zahn 1971) to the graph to obtain
a minimal spanning tree. Traverse the minimal spanning tree and remove links
that are significantly longer than others nearby. As a result, the minimal spanning
tree is split into several sub-trees. Each sub-tree forms a district in the information
space.
Ingram and Benford also included landmarks in their displays. A landmark is
added wherever there are three mutually adjacent districts in the information space.
The three centroids of these adjacent districts define a triangle. Landmarks are

placed at the center of such triangles. Edges are drawn to show the boundaries of
large districts. Features such as signposts, history and backtracking mechanisms
were also considered in their city image metaphor, but they were not fully
implemented.
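To make the outline above concrete, here is a minimal sketch in Python of the district-forming step, assuming SciPy is available; the function name mst_districts and the global cut-off factor are illustrative stand-ins for Ingram and Benford's comparison of an edge with others nearby.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_districts(dist, factor=2.0):
    """Split a document set into 'districts' by cutting long MST edges.

    dist: (n, n) symmetric matrix of positive inter-document distances.
    factor: a hypothetical global threshold -- an edge is cut when it is
    `factor` times the mean MST edge length.
    """
    mst = minimum_spanning_tree(dist).toarray()   # zero entries mean 'no edge'
    edge_lengths = mst[mst > 0]
    mst[mst > factor * edge_lengths.mean()] = 0   # remove unusually long edges
    n_districts, labels = connected_components(mst + mst.T, directed=False)
    return n_districts, labels                    # labels[i]: district of doc i

Each connected component that survives the cut corresponds to one district in the information space.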
Legibility of a city helps people travel in the city. The more spatial knowledge
we have of a city, the easier it is to find our way in it. Thorndyke and Hayes-Roth
distinguished three levels of such spatial knowledge (Thorndyke and Hayes-Roth
1982) as landmark knowledge, procedural knowledge, and survey knowledge.
Landmark knowledge is the most basic awareness of specific locations in a city
or a way-finding environment. If all we know about London is Big Ben and
Trafalgar Square, then our ability to navigate through London would be rather
limited. Procedural knowledge, also known as route knowledge, allows a traveler to
follow a particular route between a source and a destination. Procedural knowledge
connects isolated landmark knowledge into larger, more complex structures. Now
we should know at least one route leading from Big Ben to Trafalgar Square.
At the level of survey knowledge, we have fully connected topological information
about a city. Survey knowledge is essential in performing way-finding tasks. A good
example of survey knowledge is the Knowledge of London examination that
everyone applying for a taxi license must pass. Transport for London
says to each applicant:
You must have a thorough knowledge of London, including the location of
streets, squares, clubs, hospitals, hotels, theatres, government and public buildings,
railway stations, police stations, courts, diplomatic buildings, important places of
worship, cemeteries, crematoria, parks and open spaces, sports and leisure centers,
places of learning, restaurants and historic buildings; in fact everything you need to
know to be able to take passengers to their destinations by the most direct routes.
You may be licensed either for the whole of London or for one or more
of the 16 suburban sectors. The “All London” license requires you to have
a detailed knowledge of the 25,000 streets within a six-mile radius of Charing Cross
with a more general knowledge of the major arterial routes throughout the rest of
London. If you wish to work as a taxi driver in central London or at Heathrow
Airport you need an “All London” license.
We will briefly introduce the famous traveling salesman problem (TSP) in
Chap. 4. The salesman needs to figure out a tour of a number of cities such that he
visits each city exactly once while keeping the overall distance of the tour minimal.
If the salesman is in London, it looks as if his best bet is to take a taxi. Figure 3.2 shows
the coverage of London taxi drivers’ survey knowledge.
The soundest survey knowledge is acquired directly from first-hand navigation
experience in an environment – London’s taxi drivers have certainly demonstrated
their first-hand navigation experience in London. Alternatively, we can develop our
survey knowledge by reading maps. However, survey knowledge acquired in this
way tends to be orientation-specific, which means that the navigator may need
to rotate the mental representation of the space to match the environment. This
concern led Marvin Levine to explore how this phenomenon should be taken into

Fig. 3.2 The scope of the Knowledge of London, within which London taxi drivers are supposed
to know the most direct route by heart, that is, without resorting to the A–Z street map

account by map designers (Levine et al. 1982, 1984). Levine stressed that maps
should be congruent with the environment so that we can quickly locate our current
position and orientation on the map and in the environment. Levine laid down three
principles for map design:
The two-point theorem – a map reader must be able to relate two points on the map
to their corresponding two points in the environment.
The alignment principle – the map should be aligned with the terrain. A line between
any two points in space should be parallel to the line between those two points
on the map.
The forward-up principle – the upward direction on a map must always show what
is in front of the viewer.
Researchers have adapted many real-world way-finding strategies for way-
finding tasks in virtual environments. For example, Rudolph Darken and others
provide an informative summary in their article on way-finding behavior in virtual
environments (Darken et al. 1998).

3.1.3 Information Visualization

Information visualization emerged as a field of study in the 1990s. There has
been widespread interest across research institutions and the commercial
market. Applications of information visualization range from dynamic maps of the
stock market to the latest visualization-empowered patent analysis laboratories. It is
one of the most active research areas that can bring technical advances into a new
generation of science mapping.
The goal of information visualization is to reveal invisible patterns in abstract
data. Information visualization is meant to bring new insights to people, not merely
pretty pictures. The greatest challenge is to capture something abstract and invisible with
something concrete, tangible, and visually meaningful. The design of an effective
information visualization system is more of an art than a science. Two fundamental
components of information visualization are structuring and displaying.

3.2 Identifying Structures

The purpose of structural modeling is to characterize underlying relationships
and structures. Commonly used structural models are lists, trees, and networks.
These structures are often used to describe complex phenomena. Ben Shneiderman
proposed a task-by-data-type taxonomy to organize information visualization
techniques (Shneiderman 1996).
Networks represent a wide spectrum of phenomena in the conceptual world as
well as the real world. For example, the Web is a network of web pages connected
by hypertext reference links. Scientific literature forms another network of articles
published in journals and conference proceedings. Articles are connected through
bibliographic citations. A set of images can be regarded as a network based on
visual attributes such as color, texture, layout, and shape. In content-based image
retrieval (CBIR), the emphasis is on the ability of feature extraction algorithms to
measure the similarity between two images based on a given type of feature.

3.2.1 Topic Models

In information retrieval, it is common to deal with a set of documents, or a
collection, and to study how the collection responds to specific queries. The
similarity between a query and a document, indeed, a document and another
document, can be determined by an information retrieval model, for example, the
vector space model, the latent semantic indexing model, or the probabilistic model.
These models typically derive term-document and document-document matrices,
which are in turn equivalent to network representations. The vector space model

has a term-independence assumption, which says that the occurrences of one term can
be regarded as independent of the occurrences of another term. In practice, this is
often not the case.
When dealing with text documents, a commonly encountered problem is known
as the vocabulary mismatch problem. In essence, people may choose different
vocabulary to describe the same thing.
There are two aspects to the problem. First, there is a tremendous diversity in the
words people use to describe the same object or concept; this is called synonymy.
Users in different contexts, or with different needs, knowledge or linguistic habits
will describe the same information using different terms. For example, it has been
demonstrated that any two people choose the same main keyword for a single, well-
known object less than 20 % of the time on average. Indeed, this variability is much
greater than commonly believed and this places strict, low limits on the expected
performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct
meaning. In different contexts or when used by different people the same word
takes on varying referential significance (e.g., “bank” in river bank versus “bank” in
a savings bank). Thus the use of a term in a search query does not necessarily mean
that a text object containing or labeled by the same term is of interest. Because
human word use is characterized by extensive synonymy and polysemy, straight-
forward term-matching schemes have serious shortcomings – relevant materials
will be missed because different people describe the same topic using different
words and, because the same word can have different meanings, irrelevant material
will be retrieved. The basic problem is that people want to access information
based on meaning, but the words they select do not adequately express intended
meaning. Previous attempts to improve standard word searching and overcome the
diversity in human word usage have involved: restricting the allowable vocabulary
and training intermediaries to generate indexing and search keys; hand-crafting
thesauri to provide synonyms; or constructing explicit models of the relevant domain
knowledge. Not only are these methods expert-labor intensive, but they are also
often not very successful.
Latent Semantic Indexing (LSI) is designed to overcome the vocabulary mis-
match problem faced by information retrieval systems (Deerwester et al. 1990;
Dumais 1995). Online services of LSI are available, for example, at
http://lsa.colorado.edu/. Individual words in natural language provide unreliable evidence about
the conceptual topic or meaning of a document. LSI assumes the existence of
some underlying semantic structure in the data that is partially obscured by the
randomness of word choice in a retrieval process, and that the latent semantic
structure can be more accurately estimated with statistical techniques.
In LSI, a semantic space is constructed based on a large matrix of term-document
association observations. LSI uses a mathematical technique called Singular Value
Decomposition (SVD). One can approximate the original, usually very large, term
by document matrix by a truncated SVD matrix. A proper truncation can remove
noise data from the original data as well as improve the recall and precision of
information retrieval.
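As a hedged illustration of the SVD truncation described above, the following Python sketch reduces a term-document matrix to a rank-k semantic space; the function names and the choice of k are assumptions for illustration, not the original LSI implementation.

import numpy as np

def lsi_space(term_doc, k):
    """Truncate the SVD of a term-document matrix to rank k (the core of LSI)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def document_similarity(s, vt):
    """Cosine similarities between documents in the reduced semantic space."""
    docs = (s[:, None] * vt).T          # each row is a document's k-dim vector
    unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return unit @ unit.T                # documents can match with no shared terms

Because similarity is computed in the truncated space rather than on raw term overlap, two documents can be judged similar even when they share no words.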

Perhaps the most compelling claim for LSI is that it allows an information
retrieval system to retrieve documents that share no words with the query
(Deerwester et al. 1990; Dumais 1995). Another potentially appealing feature is
that the underlying semantic space lends itself to geometric representations. For
example, one can project the semantic space into a Euclidean space for a 2D or 3D
visualization. On the other hand, large complex semantic spaces in practice may not
always fit comfortably into low-dimensional spaces.

3.2.2 Pathfinder Network Scaling

Pathfinder network scaling is a method originally developed by cognitive psy-
chologists for structural modeling (Schvaneveldt et al. 1989). It relies on a
triangle inequality condition to select the most salient relations from proximity data.
Pathfinder networks (PFNETs) have the same set of vertices as the original graph.
The number of edges in a Pathfinder network, on the other hand, can be greatly
reduced.
The notion of semantic similarity has been a long-standing theme in methods for
characterizing semantic structures, including Multidimensional Scaling (Kruskal 1977),
Pathfinder (Schvaneveldt et al. 1989), and Latent Semantic Indexing (Deerwester
et al. 1990). The triangle inequality is an important property of a Euclidean space:
the distance between two points is less than or equal to the length of a path
connecting the two points via a third point. The triangle inequality is one of the
key concepts in Pathfinder network scaling, which selects the most important links
into the final network representation.
In Pathfinder, the triangle inequality is applied not only between a direct link
and an alternative path through one other point, but also between a direct link
and all possible routes connecting a given pair of points. The maximum length of
such routes is N − 1 links. In terms of the metaphor of a traveling salesman, he
may choose to visit all the other cities before the final destination if this roundabout
plan makes more sense to him than traveling to the destination directly. Semantically,
if we can assign meaning to such a route, then the direct link becomes largely
redundant, and there is no need to consider it in later analysis. This is the central
idea of Pathfinder network scaling.
Pathfinder network scaling relies on a criterion known as the triangle inequality
condition to select the most salient relations from proximity data. Results of
Pathfinder network scaling are called Pathfinder networks (PFNETs), consisting
of all the vertices from the original graph. The number of edges in a Pathfinder
network, however, is determined by the intrinsic structure of semantics. On the one
hand, a Pathfinder network with the least number of edges is identical to a minimum
spanning tree. On the other hand, additional edges in a Pathfinder network indicate
salient relationships that might have been missed from a minimum spanning tree
solution.
94 3 Mapping Associations

The topology of a PFNET is determined by two parameters q and r, and the
corresponding network is denoted as PFNET(r, q). The q-parameter controls the
scope within which the triangle inequality condition is imposed. The r-parameter
refers to the Minkowski metric used for computing the distance of a path. The
weight of a path P with k links, W(P), is determined by the weights w1, w2, …, wk
of each individual link as follows:
$$W(P) = \left( \sum_{i=1}^{k} w_i^r \right)^{1/r}$$

The Minkowski distance (geodesic) depends on the value of the r-metric. For
r = 1, the path weight is the sum of the link weights along the path; for r = 2, the
path weight is computed as a Euclidean distance; and for r = ∞, the path weight is
the same as the maximum weight associated with any link along the path.
$$W(P) = \left( \sum_{i=1}^{k} w_i^r \right)^{1/r} = \begin{cases} \displaystyle\sum_{i=1}^{k} w_i & r = 1 \\[1ex] \displaystyle\left( \sum_{i=1}^{k} w_i^2 \right)^{1/2} & r = 2 \\[1ex] \displaystyle\max_i w_i & r = \infty \end{cases}$$

The q-parameter specifies that triangle inequalities must be satisfied for paths
with k ≤ q links:

$$w_{n_1 n_k} \le \left( \sum_{i=1}^{k-1} w_{n_i n_{i+1}}^r \right)^{1/r} \quad \forall k \le q$$

When a PFNET satisfies the following three conditions, the distance of a path is
the same as the weight of the path:
1. The distance from a document to itself is zero.
2. The proximity matrix for the documents is symmetric; thus the distance is
independent of direction.
3. The triangle inequality is satisfied for all paths with up to q links.
If q is set to the total number of nodes less one, then the triangle inequality is
universally satisfied over the entire network. Increasing the value of parameter r or
q can reduce the number of links in a network. The geodesic distance between two
nodes in a network is the length of the minimum-cost path connecting the nodes. A
minimum-cost network (MCN), PFNET(r = ∞, q = n − 1), has the least number of

Fig. 3.3 Nodes a and c are connected by two paths. If r = 1, Path 2 is longer than Path 1,
violating the triangle inequality; so it needs to be removed

links. Figure 3.3 illustrates how a link is removed if it violates the triangle inequality.
See (Chen 1999a, b; Chen and Paul 2001; Schvaneveldt et al. 1989) for further
details.
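The following Python sketch computes the minimum-cost variant PFNET(r, q = n − 1) discussed above; it is a minimal illustration based on the published definitions, not the reference implementation, and uses a Floyd–Warshall-style relaxation.

import numpy as np

def pfnet(weights, r=np.inf):
    """PFNET(r, q = n - 1): keep a link only if no indirect path is cheaper.

    weights: symmetric (n, n) matrix of link distances; np.inf marks
    absent links. Returns a boolean matrix of surviving links.
    """
    n = weights.shape[0]
    dist = weights.astype(float).copy()
    for k in range(n):                  # minimal r-metric path weights
        if np.isinf(r):
            alt = np.maximum(dist[:, k:k + 1], dist[k:k + 1, :])
        else:
            alt = (dist[:, k:k + 1] ** r + dist[k:k + 1, :] ** r) ** (1.0 / r)
        dist = np.minimum(dist, alt)
    keep = np.isfinite(weights) & (weights <= dist)   # triangle inequality holds
    np.fill_diagonal(keep, False)
    return keep

With r = ∞, a path's weight is the maximum link weight along it, so the surviving network contains the minimum spanning tree, in line with the discussion above.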
The spatial layout of a Pathfinder network is determined by a force-directed
graph-drawing algorithm (Kamada and Kawai 1989). Because of its simplicity and
intuitive appeal, force-directed graph drawing has become increasingly popular in
information visualization.
Typical applications of Pathfinder networks include modeling a network of
concepts based on similarity ratings given by human experts, constructing proce-
dural and protocol analysis models of complex activities such as air-traffic control,
and comparing learners’ Pathfinder networks at various stages of their learning
(Schvaneveldt 1990).
Pathfinder networks display links between objects explicitly. Structural patterns
are therefore easy to detect perceptually. In addition, Pathfinder network scaling is an
effective link-reduction mechanism, which prevents a network from being cluttered
by too many links. Figure 3.4 shows a Pathfinder network of 20 cities in the US.
The colors of nodes indicate the partition of the network based on the degree of
each node: white nodes have a degree of 3, blue nodes 2, and green nodes 1. The
size of each node indicates the centrality of the node. In this case, the Pathfinder
network turns out to be the unique minimum spanning tree. Figure 3.5 shows the
partition of the Pathfinder network by the degree of each node. The larger the size
of a node, the closer it is to the center.

3.2.3 Measuring the Similarity Between Images

Now we use an example from content-based image retrieval (CBIR) to illustrate
the use of Pathfinder and GSA (Chen et al. 2000). GSA is generic: not only is it
suitable for text documents, but it can also handle other types of entities in a similar
way. In the following example, we demonstrate how to derive a structure of images.
In addition, the structure of images provides additional insights into the quality of
similarity measures and the characteristics of different feature extraction algorithms.

Fig. 3.4 A Pathfinder network of the 20-city proximity data

Fig. 3.5 A Pathfinder network of a group of related concepts

If two images have the same size in terms of pixels, we can compare the two
images pixel by pixel. If we have 100 images of size 64 × 64
pixels, the structure of these images can be represented as a so-called manifold in a

high-dimensional space. To be precise, the dimensionality of the space is the number
of pixels in an image: 64 × 64 = 4,096. The MDS and PCA techniques introduced
later in this chapter can be applied to such sets of images.
The key issue in content-based image retrieval (CBIR) is how to match two
images according to computationally extracted features. Typically, the content of
an image can be characterized by a variety of visual properties known as features.
It is common to compare images by color, texture, and shape, although these entail
different levels of computational complexity. Color histograms are much easier to
compute than a shape-oriented feature extraction.
Computational approaches, on the other hand, typically rely on feature-extraction
and pattern-recognition algorithms to match two images. Feature-extraction algo-
rithms commonly match images according to the following attributes, also known
as query classes:
• Color
• Texture
• Shape
• Spatial constraints
Swain and Ballard (1991) matched images based solely on their color. The
distribution of color was represented by color histograms, and formed the images’
feature vectors. The similarity between a pair of images was then calculated using
a similarity measure between their histograms called the normalized histogram
intersection. This approach became very popular due to its robustness, computa-
tional simplicity, and low storage requirements. A common extension to color-based
feature extraction is to add textural information. There are many texture analysis
methods available, and these can be applied either to perform segmentation of the
image, or to extract texture properties from segmented regions or the whole image.
In a similar vein to color-based feature extraction, He and Wang (1990) used a
histogram of texture, called the texture spectrum. Other types of features include
layout and shape.
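A minimal sketch of Swain and Ballard's normalized histogram intersection, assuming the two color histograms use the same binning; the function name is an illustrative choice.

import numpy as np

def histogram_intersection(image_hist, model_hist):
    """Normalized histogram intersection (Swain and Ballard 1991).

    Sums the bin-wise minima of the two color histograms and
    normalizes by the model histogram's total count.
    """
    return np.minimum(image_hist, model_hist).sum() / model_hist.sum()

A value of 1 means the image's color distribution fully covers the model's; the measure's robustness and very low computational cost account for its popularity.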
In the following example, we visualized a set of 279 visualization images.
The majority of these images are synthetic graphics generated by computer or
screenshots of information visualization systems. The size, resolution, and color
depth of these images vary. Images were grouped together by a human user in order
to provide a point of reference for the subsequent automatically generated models.
We asked the user to group these images according to their overall visual similarity,
but no specific guidelines were given on how such similarity should be judged.
Similarity measures between these images were computed by the QBIC system
(Flickner et al. 1995). The three networks correspond to similarities by color, layout,
and texture. We expected that images with similar structures and appearances should
be grouped together in Pathfinder networks.
Figure 3.6 is a screenshot of the visualization. The Pathfinder network was
derived from similarities determined by color histograms. The layout of the
visualization is visually appealing. Several clusters of images have homogeneous
colors. The largest image cluster includes images typically with line-drawing-like

Fig. 3.6 Visualization of 279 images by color histogram

diagrams and visualization displays. Figures 3.7 and 3.8 show the screenshots of
two visualization models of the InfoViz image database by layout and by texture,
respectively. Both layout and texture similarities were computed by the QBIC
system.
The overall structure of the layout-based visualization is different from the color-
based visualization shown in Fig. 3.6. This is expected due to the self-organizing
nature of the spring-embedder model. On the other hand, visualizations based
on the two schemes share some local structures. Several clusters appear in both
visualizations. The spring embedder algorithm tends to work well with networks of
fewer than a few hundred nodes.
Unlike the layout version, the texture-based visualization has a completely
different visual appearance from the color-based visualization. In part, this is
because the color histogram and color-layout schemes share some commonality in
the way they deal with color.
Now we compare the Pathfinder networks generated by different features
extracted from images. The number of links in each network and the number
of links in common are used as the basis for network comparisons. The degree
of similarity between two networks is determined by the likelihood of observing
the number of common links given the total number of links in the networks involved.
Alternatively, one may consider using the INDSCAL method outlined later in this

Fig. 3.7 Visualization of 279 images by layout

chapter to explore the differences between structures detected by different feature
extraction techniques.
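The expected numbers of common links and point probabilities reported in Tables 3.1, 3.2, and 3.3 below can be reproduced with a simple hypergeometric model; the following Python sketch is an illustrative reconstruction, assuming links are drawn at random from all n(n − 1)/2 possible undirected links.

from math import comb

def expected_common_links(n_nodes, m1, m2):
    """Expected number of links shared by two networks with m1 and m2 links."""
    total = n_nodes * (n_nodes - 1) // 2     # all possible undirected links
    return m1 * m2 / total

def point_probability(n_nodes, m1, m2, k):
    """Hypergeometric probability of observing exactly k common links."""
    total = n_nodes * (n_nodes - 1) // 2
    return comb(m1, k) * comb(total - m1, m2 - k) / comb(total, m2)

# For the color/layout comparison: 279 nodes, 271 and 319 links.
print(expected_common_links(279, 271, 319))   # about 2.23, as in Table 3.1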
Color- and layout-based visualization schemes turned out to have significantly
similar structures (Table 3.1). The magnitude of structural similarity is 0.182. This
suggests that these two visualizations reveal some salient characteristics of the
image database.
Pathfinder networks of images by color and by texture are completely different.
They share only two common links (Table 3.2). This confirms our visual inspection
of the networks. The network similarity is 0.004.
Layout- and texture-based visualizations are also very different (See Table 3.3).
They share only one common link. The network similarity is 0.002. The color-based
visualization has the least number of links (271). The layout-based version has the
largest number of links (319).

Fig. 3.8 Visualizations of 279 images by texture

Table 3.1 Comparison between color- and layout-based visualizations

    Number of images          279
    Links in PF by color      271
    Links in PF by layout     319
    Common links              91
    Expected common links     2.23
    Point probability         0.00
    Information               406.94

Table 3.2 Comparison between color- and texture-based visualizations

    Number of images          279
    Links in PF by color      271
    Links in PF by texture    284
    Common links              2
    Expected common links     1.98
    Point probability         0.27
    Information               0.76

Table 3.3 Comparison between layout- and texture-based visualizations

    Number of images          279
    Links in PF by layout     319
    Links in PF by texture    284
    Common links              1
    Expected common links     2.34
    Point probability         0.23
    Information               0.14

3.2.4 Visualizing Abstract Structures

Information visualization has a long history of using terrain models and relief maps
to represent abstract structures. Information visualization based on word frequencies
and distribution patterns has been a distinct research branch, originating especially
from information retrieval applications.

3.2.4.1 ThemeView

The changing patterns at the lexical level have been used to detect topical themes.
Some intriguing visualization technologies have been developed over the past few
years (Hetzler et al. 1998).
The most widely known example in this category is ThemeView, developed at
Pacific Northwest National Laboratory (Wise et al. 1995). James Wise described
an ecological approach to text visualization and how his team used the relief map as a
model of a thematic space (Wise 1999). ThemeView enables the user to establish
connections easily between the construction and the final visualization. Figure 3.9
is a screenshot of PNNL’s ThemeView, showing word frequency distributions as
peaks and valleys in a virtual landscape.

3.2.4.2 VxInsight

Sandia National Laboratories developed a visualization system called VxInsight to
model clustered information in the form of a virtual landscape. It adapts the popular
landscape model to visualize underlying data. In particular, researchers at Sandia
National Laboratory used VxInsight to visualize cluster structures derived from
Science Citation Index (SCI). VxInsight allows the user to move back and forth in
the virtual landscape. Figure 3.10 shows a virtual landscape produced by an earlier
version of VxInsight.
VxInsight was applied to the analysis of patents (Boyack et al. 2000). Thematic
terms and patenting companies are cross-referenced in landscapes over several periods
of time by labeling key thematic terms and color-coding different companies. Figure 3.11

Fig. 3.9 Valleys and peaks in ThemeView (© PNNL)

Fig. 3.10 A virtual landscape in VxInsight



Fig. 3.11 A virtual landscape of patent class 360 for a period between 1980 and 1984 in
VxInsight. Companies’ names are color-coded: Seagate-red, Hitachi-green, Olympus-blue, Sony-
yellow, IBM-cyan, and Philips-magenta (Courtesy of Kevin Boyack)

shows a virtual landscape of patent class 360 for a period of 4 years between 1980
and 1984. Further issues concerning patent analysis and visualization are discussed
in Chap. 5.

3.2.4.3 Self-Organized Feature Maps

Another popular metaphor for information visualization organizes information into
adjacent regions on a flat map. Self-organized feature maps (SOMs) (Kohonen
1989) have been used in information retrieval. “ET-Map” is a multi-level category
SOM map of the information space of over 100,000 entertainment-related Web
pages listed by Yahoo!. Hsinchun Chen and his colleagues developed the map at
the University of Arizona, USA (Chen et al. 1998).
Andre Skupin takes self-organizing map techniques further and provides the
look and feel of a common geographic map, except that underneath the familiar
cartographic surface is an abstract space instead of land. Figure 3.12 shows an
example of his work. The base map is constructed based on over 22,000 abstracts
submitted to the Annual Meeting of the Association of American Geographers

Fig. 3.12 A SOM-derived base map of the literature of geography (Reprinted from Skupin 2009)

(AAG) between 1993 and 2002. Each abstract was first represented as a document
in a 2,586-dimensional vector space. Then a two-dimensional model of the document
space was generated using SOM. Finally, the SOM configuration was visualized in
standard GIS software.

Fig. 3.13 The process of visualizing citation impact in the context of co-citation networks (© 2001
IEEE)

3.2.4.4 Constructing a Virtual World of Scientific Literature

Now we give a brief introduction to the use of these techniques for mapping. A more
detailed analysis from a co-citation point of view appears in the next chapter. Figure 3.13
illustrates the process of structuring and visualizing citation impact in the context
of co-citation networks. Indeed, the process is very generic, applicable to a wide
spectrum of phenomena.
First, select authors who have received citations above a threshold. Intellectual
groupings of these authors represent snapshots of the underlying knowledge do-
main. Co-citation frequencies between these authors are computed from a citation
database, such as ISI's SCI and SSCI. Author co-citation analysis (ACA) uses a matrix
of co-citation frequencies to compute a correlation matrix of Pearson correlation
coefficients. According to White and McCain (1998), such correlation coefficients
best capture the citation
profile of an author.

Pearson correlation coefficients can be calculated as follows, where X and Y are
data points in an N-dimensional space, and X_mean and Y_mean are the mean of X
and the mean of Y, respectively.

$$X = (x_1, x_2, \ldots, x_N); \quad Y = (y_1, y_2, \ldots, y_N)$$

The standard deviation of X, σ_x, and that of Y, σ_y, are defined as follows:

$$\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - X_{\mathrm{mean}})^2}{N-1}}; \quad \sigma_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - Y_{\mathrm{mean}})^2}{N-1}}$$

Finally, the standardized scores z_x and z_y are used to calculate the correlation
coefficient r_xy, which in turn forms the correlation matrix.

$$z_x = \frac{X - X_{\mathrm{mean}}}{\sigma_x}, \quad z_y = \frac{Y - Y_{\mathrm{mean}}}{\sigma_y}$$

$$r_{xy} = \frac{\sum_{i=1}^{N} z_x z_y}{N-1}$$
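Assuming the co-citation frequencies are held in a matrix, the same computation can be sketched in a few lines of NumPy; the function name is an illustrative choice, and numpy.corrcoef applied to the rows yields the same result.

import numpy as np

def cocitation_correlation(cocite):
    """Pearson correlation matrix of author co-citation profiles.

    cocite: (n, n) co-citation frequency matrix; row i is author i's profile.
    """
    x = cocite.astype(float)
    mean = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, ddof=1, keepdims=True)    # N - 1 in the denominator
    z = (x - mean) / sd                          # standardized scores
    return (z @ z.T) / (x.shape[1] - 1)          # r_xy for every pair of authors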
Second, apply Pathfinder network scaling to the network defined by the correla-
tion matrix. Factor analysis is a standard practice in ACA. However, in traditional
ACA, MDS and factor analysis rarely appear in the same graphical representations.
In order to make knowledge visualizations clear and easy to interpret, we overlay
the intellectual groupings identified by factor analysis and the interconnectivity
structure modeled by the Pathfinder network scaling. Authors with similar colors
essentially belong to the same specialty and they should appear as a closely
connected group in the Pathfinder network. Therefore, one can expect to see the
two perspectives converge in the visualization. This is the third step.
Finally, display the citation impact of each author on top of the intellectual
groupings. The magnitude of the impact is represented by the height of a citation bar,
which in turn consists of a stack of color-coded annual citation sections. Figure 3.14
illustrates the construction of a three-dimensional knowledge landscape.
Figure 3.15 shows virtual landscape views of three different subject domains, the
upper middle one for computer graphics and applications (Chen and Paul 2001),
the lower left for hypertext (Chen 1999b; Chen and Carr 1999a, b), and the lower
right one for virtual reality. In the computer graphics example, we visualized author
co-citation patterns found in the journal IEEE Computer Graphics and Applications
(CG&A) between 1982 and 1999. The CG&A citation data include articles written
by 1,820 authors and co-authors. These authors cited a total of 10,292 unique
articles, written by 5,312 authors (first author only). Among them, 353 authors
who have received more than five citations in CG&A entered into author co-citation
analysis. Intellectual groupings of these 353 authors provide the basis for visualizing

Fig. 3.14 The design of ParadigmView (© 2001 IEEE)

the knowledge domain of computer graphics, although this is a snapshot from a
limited viewpoint – the literature of computer graphics is certainly much wider than
the scope of CG&A. The original author co-citation network contains as many as
28,638 links, which is 46 % of all possible links, not including self-citations.
This many links would clutter the visualization, so we applied Pathfinder network
scaling to reduce the number of links. The number of links in the resulting
Pathfinder network is 355.
We used a three-dimensional virtual landscape to represent author co-citation
structures. The most influential scientists in the knowledge domain tend to appear
near the center of the intellectual structure. In contrast, researchers who have
unique expertise are likely to appear in peripheral areas. The virtual landscape also
allows users to access further details regarding a particular author in the intellectual
structure, for example, a list of most cited work of the author, abstracts and even
full content of his/her articles. In the next chapter, we introduce animations to the
citation profiles so that the dynamics of the citation tendency of relevant specialties
over two decades can be captured and replayed within seconds.

3.2.5 Visualizing Trends and Patterns of Evolution

The map of the universe conveys two types of information simultaneously – both
spatial and temporal. While spatial information specifies the distance between
galaxies and how astronomical objects are related to each other in the universe,
temporal information provides an equivalent interpretation of the spatial property in

Fig. 3.15 Examples of virtual landscape views (© 2001 IEEE)

terms of time: a high-redshift quasar that is far away from us may have formed more
than ten billion years ago in the early universe. The notion of a timeline is widely
adopted in many visualizations of abstract information. Most notably, the evolution
of events can be organized along a timeline. Although a timeline design usually tends
to push spatial patterns into the background or even remove them completely, there
are visual designs that aim to preserve and convey both spatial and temporal patterns in the
same display. A visual analytic system called GeoTime, for instance, successfully
accommodates spatial and temporal patterns. We will discuss visual analytics in
detail in the next few chapters.

3.2.5.1 Thematic Streams

A particularly influential design of timeline visualization is ThemeRiver.
The visualization is based on a simple and intuitive metaphor of a time

Fig. 3.16 Streams of topics in Fidel Castro’s speeches and other documents (Reprinted from
Havre et al. 2000)

river, in which topics of interest flow along a dimension of time, usually placed
horizontally and pointing from the left to the right. The timeline provides an
organizing framework so that a wide variety of information can be organized
according to its state at a particular point of time. Figure 3.16 shows a ThemeRiver
visualization of topics found in a collection of Fidel Castro’s speeches, interviews,
articles, and other texts. The visualization represents the variations of topics from the
beginning of 1960 through the middle of 1961; the famous Cuban missile crisis
took place shortly after this period. The topics are represented by the
frequencies of relevant terms appearing in each month. Major events are annotated
at the top, with dashed lines drawn vertically at the times of the events. For example,
the timeline visualization shows that Cuba and the Soviet Union resumed diplomatic
relations in May 1960 and that Castro confiscated American refineries around the end
of June 1960. Castro mentioned the Soviet Union 49 times in September. The
continuity of a topic is shown
as a continuous stream of varying width across time.
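A ThemeRiver-like streamgraph can be sketched with matplotlib's stackplot and its "wiggle" baseline; the monthly counts and topic labels below are entirely hypothetical placeholders, not the Castro data.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical monthly term frequencies for three topics, Jan 1960 - Jun 1961.
months = np.arange(18)
rng = np.random.default_rng(0)
topics = rng.poisson(lam=[[20], [10], [5]], size=(3, 18))

fig, ax = plt.subplots()
ax.stackplot(months, topics, baseline="wiggle",   # stacked, river-like layers
             labels=["topic A", "topic B", "topic C"])
ax.set_xlabel("month")                            # time flows left to right
ax.legend(loc="upper left")
plt.show()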
The ThemeRiver-style timeline visualization is suitable for a wide range of
applications. The New York Times, for example, featured an interactive visualization of
popular movies in terms of their box office revenue.2 The streams of movies tend
to be short lived, which is understandable because our attention span to a particular
movie won’t last forever. The appearance of streams makes them more like the
peaks of mountains. Perhaps just to be more consistent with the timeline metaphor,
we should consider them as the tips of icebergs floating from the past to the future.

2. http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html

Fig. 3.17 The evolution of topics is visualized in TextFlow (Reprinted from Cui et al. 2011)

The height of a layer of a movie, or the width of the stream, indicates the weekly box
office revenue of the movie. Its color indicates its level in terms of the total domestic
gross up to Feb. 21, 2008, the cut-off date of the data. According to the visualization,
the most popular movies in 2007 include Transformers, Harry Potter and the Order
of the Phoenix, I Am Legend, and National Treasure: Book of Secrets. The color of
National Treasure indicates that its total domestic gross is one level below I Am
Legend.
The idea of using the variation of a thematic stream to reveal topic evolution
is appealing because it is intuitive and easy to understand. On the other hand,
computationally identifying a topic from unstructured text remains a challenge.
Identifying the evolution of a topic is an even more challenging subject. Research
in topic modeling has advanced considerably over the past decade. A good example
of integrating topic modeling and information visualization to track the evolution
of topics over time is TextFlow (Fig. 3.17).
TextFlow follows the thematic river metaphor but extends it with features that
are more suitable for analyzing the evolution of topics. TextFlow adopts the use
of the width of a stream to indicate its strength. It addresses some of the most
common dynamics between streams, for example, how a stream splits into multiple
streams and how a few streams merge into a single stream. These patterns of
dynamics are of particular interest in analyzing the development of a subject
domain. Another remarkable aspect of TextFlow is how it presents a complex issue
in a seemingly simple display.

3.2.5.2 Alluvial Maps

My favorite design of timeline visualization is the alluvial map by Martin
Rosvall, who maintains a website where one can generate alluvial maps for one's
own network data.3 To generate an alluvial map, the alluvial generator
needs a series of networks as input. Each network corresponds to the

3. http://www.mapequation.org/alluvialgenerator/index.html

Fig. 3.18 Alluvial map of scientific change (Reprinted from Rosvall and Bergstrom 2010)

structure of an evolving network at a specific point of time. Each network is divided
into a number of clusters. The corresponding clusters in adjacent networks form a
sequence of views of how the same clusters evolve over time. The splits and merges
of thematic patterns can be visualized as multiple streams flowing smoothly over
time (see Fig. 3.18).
Given a sequence of network models, one can generate alluvial maps of a diverse
range of data sources. CiteSpace provides a function for batch exporting a series
of networks in the Pajek .net format. The exported .net files can be loaded into the
alluvial map generator (see Fig. 3.19). Figure 3.20 shows an alluvial map generated
from networks of co-occurring terms in publications related to research in
regenerative medicine. The 300 most frequently occurring terms in each year are
used to construct the network for that year. The iPSCs stream, identified as the most
recent thread, corresponds to the research that was awarded the 2012 Nobel Prize in
Medicine. We will describe what we found in a scientometric study of regenerative
medicine shortly.
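Pajek's .net format itself is simple; the following Python sketch writes one network per file in that format. The function name is an illustrative choice, and CiteSpace's own exporter is the authoritative route.

def write_pajek(path, labels, edges):
    """Write an undirected weighted network in Pajek .net format.

    labels: node labels; edges: (i, j, weight) triples with 1-based indices.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"*Vertices {len(labels)}\n")
        for k, label in enumerate(labels, start=1):
            f.write(f'{k} "{label}"\n')
        f.write("*Edges\n")
        for i, j, w in edges:
            f.write(f"{i} {j} {w}\n")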
Figure 3.21, for example, shows an alluvial map of popular tweet topics as
Hurricane Sandy moved along the east coast of the US. Each ribbon-like stream
represents a topic identified in re-tweeted tweets of each hour. For example,
the purple ribbon of Sandy tweets runs through the entire window of analysis.
Similarly, the green ribbon, labeled East Coast, runs through the entire night
as Hurricane Sandy moved from the south to the north along the East Coast.
Figure 3.22 shows a more complex process – a lead compound optimization. Each
stream depicts a thread of the search in chemical space.

3.3 Dimensionality Reduction

Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) are
classic and widely used techniques for dimensionality reduction. They are simple
to implement, efficiently computable, and guaranteed to discover the true structure

Fig. 3.19 Load a network in .net format to the alluvial map generator

Fig. 3.20 An alluvia map generated based on networks of co-occurring terms in publications
related to regenerative medicine. Top 300 most frequently occurred terms are chosen each year

of data lying on or near a linear subspace of the high-dimensional input space. Robert
McCallum discussed the relations between factor analysis and multidimensional
scaling (McCallum 1974). Joseph Kruskal discussed the relationship between MDS
and clustering (Kruskal 1977). Subsequent advances have overcome some of the
major limitations of traditional techniques in handling data, especially when their
true structures correspond to a non-linear subspace of the high-dimensional input
space. A good general book on correspondence analysis is (Greenacre 1993).

Fig. 3.21 An alluvial map of popular tweet topics identified as Hurricane Sandy approaching

Fig. 3.22 An alluvial map of co-occurring patterns of chemical compound fragments

3.3.1 Geometry of Similarity

The strength of association is often measured in terms of proximity, similarity,
or relatedness. Multidimensional scaling (MDS) is a commonly used method to
reduce the dimensionality of such data and to depict data points in a two- or three-
dimensional spatial configuration. The principal assumption of MDS is that the
similarity data can be transformed to inter-point distances in a metric space by a
linear or monotonically decreasing function. The stronger the similarity between two
data points in the source data, the closer the corresponding points are in the metric
space. The key concern is how well such mapping preserves the structure in the
original data. The goodness of fit is most commonly measured by a stress value.
While symmetric similarity measures are common, similarity measures can
be asymmetric in nature. Amos Tversky (1977) questioned both the metric and
dimensional assumptions of similarity data. He proposed a feature-matching model
in an attempt to establish that the common features tend to increase the perceived
similarity of two concepts, and distinct features tend to diminish perceived simi-
larity. Furthermore, Tversky’s model claims that our judgments of similarity are
asymmetric. Common features tend to have more influence than distinct
features on the way we gauge similarity conceptually.
Carol Krumhansl (1978) proposed a distance-density model in response to the
objections to geometric models raised by Tversky. She suggested that the similarity
between objects is a function not only of inter-point distance in a metric space but
also the spatial density of points in the surrounding configuration. In short, the
density of the metric space reduces the strength of the perceived similarity. Two
points in a relatively dense region of a stimulus space would appear to have a smaller
similarity measure than two points of equal inter-point distance but located in a less
dense region of the space.

Krumhansl analyzed the implications of the feature-matching model. She con-
cluded that when we judge the similarity between two entities, we actively seek
what they have in common. She suggested that the distance-density model is more
accurate than the feature-matching model when it comes to accounting for variations
in similarity data.
In later chapters, we will explain the role of Pathfinder network scaling in
preserving salient structures with explicit links. This is related to Gestalt psychol-
ogy, which identifies pattern-inviting features such as proximity, similarity, and
continuity. However, the original Gestalt psychology overlooks the role of explicit
linkage in helping us recognize a pattern more easily. Pathfinder networks provide
representations that can enhance pattern recognition.

3.3.2 Multidimensional Scaling

PCA finds a low-dimensional embedding of the data points that best preserves their
variance as measured in the high-dimensional input space. Classic MDS finds an
embedding that preserves the pairwise point distances (Steyvers 2000). PCA and
MDS are equivalent if Euclidean distances are used.
Let us illustrate what MDS does with some real-world examples, including
distances between cities and similarities between concepts. In general, there are two
types of MDS: Metric and Non-metric MDS. Metric MDS assumes that the input
data is either ratio or interval data, while the non-metric model requires simply that
the data be in the form of ranks.
A metric space is defined by three basic axioms, which are assumed by a
geometric model:
1. Metric minimality: for the distance function d and any point x, d(x, x) = 0.
2. Metric symmetry: for any data points x and y, d(x, y) = d(y, x).
3. Metric triangle inequality: for any data points x, y, and z, d(x, y) + d(y, z) ≥ d(x, z).
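A quick numerical check of these three axioms on a distance matrix can be sketched in Python; the tolerance is an illustrative choice.

import numpy as np

def is_metric(d, tol=1e-9):
    """Check metric minimality, symmetry, and the triangle inequality on d."""
    minimality = np.allclose(np.diag(d), 0.0)
    symmetry = np.allclose(d, d.T)
    # d[i, j] + d[j, k] >= d[i, k] for all triples (i, j, k)
    triangle = bool(np.all(d[:, :, None] + d[None, :, :] >= d[:, None, :] - tol))
    return minimality and symmetry and triangle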
Multidimensional scaling (MDS) is a standard statistical method used on mul-
tivariate data (see Fig. 3.23). In MDS, N objects are represented as d-dimensional
vectors with all pairwise similarities or dissimilarities (distances) defined between
the N objects. The goal is to find a new representation for the N objects as k-
dimensional vectors, where k < d, such that the inter-point proximities nearly match
the original similarities or dissimilarities. Stress is the most common measure of
how well a particular configuration reproduces the observed distance matrix.
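One common form of the stress measure, Kruskal's Stress-1, can be sketched as follows; conventions for the normalizing denominator vary across MDS programs, so this is one illustrative choice.

import numpy as np

def stress1(d_data, d_config):
    """Kruskal's Stress-1 between observed distances and an MDS configuration."""
    return np.sqrt(np.sum((d_data - d_config) ** 2) / np.sum(d_config ** 2))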
Given a matrix of distances between a number of major cities from the back of
a road atlas or an airline flight chart, we can use these distances as the input data
to derive an MDS solution. Figure 3.23 shows the procedure of generating an MDS
map. When the results are mapped in two dimensions, the configuration should look
very close to a conventional map, except that you might need to rotate the MDS

Fig. 3.23 The simplest procedure of generating an MDS map

Fig. 3.24 A geographic map showing 20 cities in the US (Copyright © 1998–2012 USA-
Tourist.com, LLC http://www.usatourist.com/english/tips/distances.html)

map so that the north–south and east–west dimensions conform to convention. To
reproduce the geographic layout completely, one may need a sufficient number of
distance entries.
In the following example, we take the distances between 20 cities in the USA as
the input for MDS. Figure 3.24 is a geographic map of the United States, showing
20 cities. When we compare the resultant MDS map with the usual geographic
map, it is easy to understand the mechanisms behind various properties of
MDS mapping.

Fig. 3.25 An MDS configuration according to the mileage chart for 20 cities in the US

Fig. 3.26 The mirror image of the original MDS configuration, showing an overall match to the
geographic map, although Orlando, Miami should be placed further down to the South

We input the city distance data to MDS, in this case using ALSCAL in SPSS,
and use Euclidean distance for the model. Figure 3.25 shows the resultant MDS
configuration, which is like a mirror image of the usual geographic map, with New
York on the right instead of the left. We can legitimately rotate and flip an MDS
configuration to suit our conventions. If we take the mirror image of the MDS
configuration, the result is indeed very close to the US map (Fig. 3.26).
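The mileage-chart example can be reproduced with classical (Torgerson) MDS, a minimal sketch of which follows; this is not the ALSCAL algorithm used above, and the embedding it returns is determined only up to rotation and reflection, which is exactly why the mirror-image flip is legitimate.

import numpy as np

def classical_mds(d, k=2):
    """Embed an (n, n) distance matrix in k dimensions (Torgerson MDS)."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    b = -0.5 * j @ (d ** 2) @ j               # double-centered squared distances
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:k]          # k largest eigenvalues
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))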
Now let us look at a more abstract example, in which each data point represents
a car and the distance between two different cars is measured by a number of per-
formance indicators. This example is based on a widely available multidimensional
data set, the CRCARS data set, prepared by David Donoho and Ernesto Ramos
(1982).

Fig. 3.27 The procedure of generating an MST-enhanced MDS map of the CRCARS data.
Nodes are placed by MDS and MST determines explicit links

The CRCARS data set includes 406 cases of cars. Each case consists of
information from 8 variables: miles per gallon (MPG), the number of cylinders,
engine displacement in cubic inches, horsepower, vehicle weight in pounds, 0–
60 mph acceleration time in seconds, the last two digits of the year of model, and
the origin of car, i.e. USA as 1, European 2, and Japanese 3. For example, a record
of a BMW 2002 shows that it was made in Europe in 1970, with a 26 mile per gallon
fuel consumption, 4 cylinders, 0–60 mph acceleration in 12.5 s, and so on. The
procedure of combining MDS and MST is shown in Fig. 3.27 (Basalaj 2001). The
resultant MDS configuration of 406 cars in the CRCARS data set is reproduced in
Fig. 3.28.
Figure 3.29 is a procedural diagram of a journal co-citation study (Morris and
McCain 1998). More examples of co-citation analysis are provided in Chap. 5. This
example illustrates the use of MDS to map more abstract relationships.
It is also a good example showing that clustering and MDS may result in different
groupings. When this happens, analysts need to investigate further and identify the
nature of discrepancies. Figure 3.30 shows the cluster solution. Each data point is a
journal. Note that the journal “Comput Biol Med” belongs to cluster BIOMEDICAL
COMPUTING, whereas the journal “Int J Clin Monit Comput” belongs to cluster
COMPUTING IN BIOMEDICAL ENGINEERING. In Fig. 3.31, the results of
clustering are superimposed on top of the MDS configuration. Now see how close
the two journals are located. This example indicates that one should be aware of the
limitations of applying clustering algorithms directly on MDS configurations.
In this example, both MDS and clustering took input directly from the similarity
matrix. This approach has some advantages. For example, by comparing MDS and
clustering, we might identify patterns that could be overlooked by either method
alone. We will also present an example in which MDS and clustering are done
sequentially. In that case, we need to bear in mind that we are relying on
MDS alone because the subsequent clustering does not bring additional information
into the process.

Fig. 3.28 An MDS configuration of the 406 cars in the CRCARS data, including an MST overlay.
The edge connecting a pair of cars is coded in grayscale to indicate the strength of similarity: the
darker, the stronger the similarity. The MST structure provides a reference framework for assessing
the accuracy of the MDS configuration (Courtesy of http://www.pavis.org/)

Fig. 3.29 The procedure of journal co-citation analysis described in Morris and McCain (1998)

Kruskal and Wish (1978) suggested that a two-dimensional MDS configuration
is far more useful as a base map than a three-dimensional one. In MDS, the overall
fitness between the similarity data and a spatial configuration is measured by a
stress value. In general, the lower the stress, the better the fit is. However, the stress
value is not the only criterion. A pragmatic rule is to look at the overall clarity and
simplicity of the map and then decide whether the layout is good enough at the
present stress level. The computational cost of reducing the stress value tends to

Fig. 3.30 Cluster solution for SCI co-citation data (Reproduced from Morris and McCain (1998).
Note that “Comput Biol Med” and “Int J Clin Monit Comput” belong to different clusters)

increase exponentially as the stress value decreases. After all, if the original data
is of high-dimension in nature, it is not always possible to find a perfect fit in a
lower-dimensional space. For example, it is almost certain that we have to settle on
a higher stress value when mapping N statements on a general topic than mapping
the distances of N cities. Furthermore, if the distances among cities were measured
by something of higher dimension in nature, such as the perceived quality of life,
it would be equally unlikely for MDS to maintain the same level of goodness of
fit. Indeed, Trochim (1993) reported that the average stress value across 33 concept
map projects was 0.285 with a range from 0.155 to 0.352. After all, the goal of MDS
mapping is not merely to minimize the stress value; rather, we want to produce a
meaningful and informative map that can reveal hidden structures in the original
data.

3.3.3 INDSCAL Analysis

INDSCAL was developed by John Carroll and J. Chang of Bell Telephone Laborato-
ries in the 1970s to explain the relationship between subjects’ differential cognition
of a set of stimuli, or objects. For N subjects and p objects, INDSCAL takes a set
of N matrices as its input. Each matrix is a symmetric p  p matrix of similarity

Fig. 3.31 SCI multidimensional scaling display with cluster boundaries (Reproduced from Morris
and McCain (1998). Note the distance between “Comput Biol Med” and “Int J Clin Monit Comput”
to the left of this MDS configuration)

measures between the p objects. The model explains differences between subjects’
cognition by a variant of the distance model. The p objects are represented as points
in a space known as a master space, a shared space, or a group space. The subjects
perceive this space differently because individuals assign a different salience or
weight to each dimension of the space. The INDSCAL model assumes that subjects
are systematically distorting the group space and it seeks to reconstruct both the
individual private, distorted spaces and the aggregate “group” space. Similarity
measures can be derived from aggregated groups as well as from individuals’
ratings. For example, in judging the differences between two houses an architect
might primarily concentrate on style and structure, whereas a buyer might be more
concerned with the difference in price.
Carroll and Chang illustrated INDSCAL with an example of analyzing how
people perceive the distances between six different areas of a city. They asked three
subjects to estimate the distance between each of the pairs of areas. Each subject
estimated a total of 15 such pairs, (6 × 5)/2 = 15.
The INDSCAL model interprets individual differences in terms of subjects
applying individual sets of weights to the dimensions of a common "group" or
"master" space. The main output of an INDSCAL analysis is a group space in
which the stimuli, or objects, are depicted as points. In this example, six areas of
a city appear as points in the group space. The configuration of objects in this
group space is in effect a compromise between different individuals' configurations.
Therefore the configuration may not be identical to the configuration of any
particular individual.
INDSCAL also generates a subject space that represents each individual as a
point. Recall that INDSCAL assumes a systematic distortion by each individual;
the position of an individual in the subject space reflects the "weights" the
individual assigns to each dimension, just as a home buyer might give more
weight to the price dimension. Unlike factor analysis and multidimensional
scaling, INDSCAL produces a unique orientation of the dimensions of the group
space. It is not legitimate to rotate the axes of a group space to a more meaningful
orientation. Furthermore, each point in the subject space should be interpreted as a
vector drawn from the origin. The length of this vector is roughly interpretable as
the proportion of the variance in the subject’s data accounted for by the INDSCAL
solution. All subjects whose weights are in the same ratio will have vectors oriented
in the same direction. The appropriate measure for comparing subjects’ weights is
the angle of separation between their vectors.
In Helm’s study (1964), the observations of subject with normal color sight
mapped as a circle corresponding to the color wheel, with the orthogonal axes of the
two-dimensional map anchored by red and green and by blue and yellow, whereas
color-blind subjects’ observations mapped as ellipses – they did not consider the
red–green (or blue-yellow) information as strongly when making color-matching
decisions. Figure 3.32 shows two red-green color-deficient subjects’ individual
differences scaling results.
Figures 3.33 and 3.34 show SCI and SSCI weighted INDSCAL displays,
respectively (Morris and McCain 1998). Contributors to SCI indexed journals and
those to SSCI indexed journals have different preferences and different levels
of granularity. If journals are widely spread along one dimension, it implies that
the corresponding subject fields have knowledge sophisticated enough for scientists
to make finer distinctions. If journals are concentrated within a relatively small
range of a dimension, it suggests that the corresponding knowledge domains are
differentiated to a lesser extent.

3.3.4 Linear Approximation – Isomap

Scientists in many fields face the problem of simplifying high-dimensional data
by finding low-dimensional structure in it. MDS aims to map a given set of
high-dimensional data points into a low-dimensional space. The Isomap algorithm
(Tenenbaum et al. 2000) and the locally linear embedding (LLE) algorithm (Roweis
and Saul 2000) provide demonstrated improvements in dimensionality reduction.
Both were featured in the December 2000 issue of Science.

Fig. 3.32 Individual differences scaling results of two red-green color-deficient subjects. The Y axis is not as fully extended as for subjects with normal color vision

Fig. 3.33 SCI weighted individual differences scaling display (Reproduced from Morris and
McCain 1998)

PCA and MDS have been routinely used to reduce the dimensionality of linear
data. Euclidean distances provide reliable measures of a linear structure in a
high-dimensional space. The problem is that when we deal with a non-linear
structure, Euclidean distances may not be able to detect the true structure. The

Fig. 3.34 SSCI weighted individual differences scaling display (Reproduced from Morris and
McCain 1998)

difference between Euclidean distances and geodesic distances is explained in the
following example. For a passenger on a particular line of the London Underground,
the geodesic distance between two stations is measured along the rail tracks, which
form a curved, or wiggled, one-dimensional structure. The geodesic distance is the
distance the train has to cover. For a passenger in a hot-air balloon, on the other hand,
the distance between the two stations could be measured along a straight line connecting
the two stations. The straight-line distance is the Euclidean distance, which is often
shorter than the geodesic distance. In classic PCA and MDS, there is no built-in
mechanism to distinguish geodesic distances and Euclidean distances. Manifold
scaling algorithms, also known as non-linear MDS, are designed to address this
problem. Because they are more generic than standard PCA and MDS, and given
the popularity of PCA and MDS, manifold scaling algorithms have a potentially
broad critical mass of users.
The basic idea is linear approximation. When we look at the railway track
immediately underneath our feet, the rails appear as straight lines. If we
look far ahead, however, the track may bend smoothly in the distance. An important step in linear

Fig. 3.35 The Swiss-roll data set, illustrating how Isomap exploits geodesic paths for nonlinear
dimensionality reduction. Straight lines in the embedding (the blue line in part a) now represent
simpler and cleaner approximations to the true geodesic paths than do the corresponding graph
paths (the red line in part b) (Reproduced from Tenenbaum et al. (2000) Fig. 3. http://www.
sciencemag.org/cgi/content/full/290/5500/2319/F3)

approximation is to transform a non-linear data set into many smaller, locally linear
pieces and then reconstruct a global solution from the local solutions. Both
algorithms explained below were tested on a Swiss-roll-like non-linear data structure
of 20,000 data points.
The Isomap algorithm extracts meaningful dimensions by measuring the distance
between data points along the surface (Tenenbaum et al. 2000). Isomap works best
for shapes that can be flattened out, like cylinders or Swiss rolls. Isomap measures
the distance between any two points on the shape, then uses these geodesic distances
in combination with the classic MDS algorithm in order to make a low dimensional
representation of that data. Figure 3.35 demonstrates how Isomap unfolds data
shaped like a Swiss roll.
In the Isomap algorithm, the local quantities computed are the distances between
neighboring data points. For each pair of non-neighboring data points, Isomap finds
the shortest path through the data set connecting them, subject to the constraint
that the path must hop from neighbor to neighbor. The length of this path is
an approximation to the distance between its end points, as measured within the
underlying manifold. Finally, the classical method of MDS is used to find a set
of low-dimensional points with similar pairwise distances. The Isomap algorithm
worked well on several test data sets, notably face images with three degrees of freedom
(up-down pose, left-right pose, and lighting direction; Fig. 3.36) and hand images
with wrist rotation and finger extension as two degrees of freedom (Fig. 3.37). In
other words, the true dimension of the face image data is 3 and that of the hand data
is 2. The residual variance of Isomap drops faster than that of PCA and MDS, which means
that PCA and MDS tend to overestimate the dimensionality, in contrast to Isomap
(Tenenbaum et al. 2000).
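Since Isomap is now part of scikit-learn, the Swiss-roll experiment can be replayed in a few lines. The following sketch is a smaller-scale stand-in for the original 20,000-point test, and the neighborhood size is an arbitrary choice rather than a value from the paper.

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap

    # A 2,000-point Swiss roll embedded in three dimensions
    X, t = make_swiss_roll(n_samples=2000, random_state=0)

    # Build the neighborhood graph, compute graph-based geodesic distances,
    # then apply classical MDS to those distances.
    iso = Isomap(n_neighbors=10, n_components=2)
    Y = iso.fit_transform(X)
    print(Y.shape)  # (2000, 2): the roll unrolled onto a plane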

3.3.5 Locally Linear Embedding

The Locally Linear Embedding (LLE) algorithm uses linear approximation to model
a non-linear manifold (Roweis and Saul 2000). It is like using a lot of small pieces

Fig. 3.36 Face images varying in pose and illumination (Fig. 1A) (Reprinted from Tenenbaum
et al. 2000)

of two-dimensional planes to patch up a three-dimensional sphere. Cartographers
use similar techniques when they transform the spherical surface of the earth into a
flat map, where the mapping must preserve the local relationships between places.
The LLE algorithm divides a set of high-dimensional data into small patches
that each can be easily flattened. These flattened small patches are reassembled in a
lower dimensional space, but the relative positions of data points within each patch
are preserved as much as possible.
LLE computes the best approximation of each data point by a weighted linear
combination of its neighbors. Then the algorithm finds a set of low-dimensional
points, each of which can be linearly approximated by its neighbors with the
same coefficients that were determined from the high-dimensional data points. Both
Isomap and LLE produce impressive results on some benchmark artificial data sets,
as well as on “real world” data sets. Importantly, they succeed in learning nonlinear
manifolds, in contrast to algorithms such as PCA, which has no built-in mechanism
to detect geodesic distances along a non-linear structure in a high-dimensional
space. Figure 3.38 shows how LLE unfolds the Swiss roll data.
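A corresponding sketch with scikit-learn's locally linear embedding follows; again the sample size and neighborhood size are arbitrary stand-ins rather than the settings of Roweis and Saul.

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, t = make_swiss_roll(n_samples=2000, random_state=0)

    # Fit local linear weights in 3-D, then find 2-D points that are
    # reconstructed by their neighbors with the same weights.
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    Y = lle.fit_transform(X)
    print(Y.shape, lle.reconstruction_error_)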
Both Isomap and LLE algorithms introduce some distortions of the data,
especially for more complicated shapes that include curves. The different approaches
may prove to be better or worse for different types of data. Isomap, based on

Fig. 3.37 Isomap (K = 6) applied to 2,000 images of a hand in different configurations (Reproduced from Supplemental Figure 1 of Tenenbaum et al. (2000) http://isomap.stanford.edu/handfig.html)

Fig. 3.38 The color-coding illustrates the neighborhood-preserving mapping discovered by LLE
(Reprinted from Roweis and Saul 2000)

estimating and preserving global geometry, may distort the local structure of the
data. LLE, based only on local geometry, may distort the global structure.
Given the role of classic PCA and MDS in mapping concepts, the interest in
manifold scaling algorithms is likely to increase in the near future. It is largely
unknown whether typical data structures from concept mapping and the co-citation
structures to be explained in Chap. 5 are essentially linear or non-linear. Another
issue is the scale-up question. Both algorithms handled the 20,000-point Swiss-roll
data well. It is a promising direction to investigate the potential of applying such
algorithms to concept mapping and science mapping data.

3.4 Concept Mapping

Concept maps provide a visual representation of knowledge structures and argument
forms. In many disciplines various forms of concept map are used as formal
knowledge representation systems, for example, semantic networks in artificial
intelligence, bond graphs in mechanical and electrical engineering, Petri nets in
communications, and category graphs in mathematics. Here we describe an example
from William Trochim of Cornell University (Trochim 1989; Trochim et al. 1994;
Trochim and Linton 1986).

3.4.1 Card Sorting

Card sorting is one of the earliest methods used for concept mapping. Earlier works
on card sorting include George Miller’s “A psychological method to investigate
verbal concepts” (Miller 1969) and Anthony Biglan’s “The characteristics of subject
matter in different academic areas” (Biglan 1973).
We illustrate the process of concept mapping with the following example drawn
from William Trochim and his colleagues at Cornell University; see, for example,
Trochim (1989). They follow a process similar to the one we saw in Chap. 2 for creating
a thematic map: a base map is superimposed with a thematic overlay (see Fig. 3.39).
In particular, the process utilizes MDS and clustering algorithms.
The process started with a brainstorming session, in which individual participants
were asked to sort a large set of N statements on a chosen topic into piles.
They were to put statements into the same pile if they thought they were similar.
The results of each individual participant's sorting were represented as an N × N
similarity matrix. If a participant put statement i and statement j into the same pile,
the value of eij in the matrix was set to 1; if they were not in the same pile, the value
was set to 0. The matrices of all the participants were then aggregated into a matrix
(Eij). The value of Eij is therefore the number of participants who put statement
i and statement j into the same pile. Because a statement is always sorted into the
same pile as itself, each diagonal entry of the aggregated matrix equals the number of participants.
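The aggregation step is simple enough to spell out in code. A minimal sketch, with hypothetical piles from three participants sorting five statements:

    import numpy as np

    N = 5  # number of statements
    # Hypothetical pile sorts: each participant partitions statements 0..4
    piles_by_participant = [
        [{0, 1}, {2, 3, 4}],
        [{0, 1, 2}, {3, 4}],
        [{0}, {1, 2}, {3, 4}],
    ]

    E = np.zeros((N, N), dtype=int)
    for piles in piles_by_participant:
        for pile in piles:
            for i in pile:
                for j in pile:
                    E[i, j] += 1  # e_ij = 1 for this participant, summed into E_ij

    print(E)  # each diagonal entry equals the number of participants (3)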
The structure of the similarity matrix was depicted through a two-dimensional
non-metric MDS configuration, which was followed by a hierarchical cluster
analysis of the MDS coordinates to divide the spatial configuration into district-
like groups. Finally, participants were led through a structured interpretation session
designed to help them understand the maps and label them in a meaningful way.

Fig. 3.39 The procedure used for concept mapping

When participants sorted statements into piles, they also rated each statement
on one or more variables. Most typically, each statement was rated for its relative
importance on a 5-point scale, from 1 for unimportant through 5 for extremely
important. The results of such rating were subsequently used as a thematic overlay
on top of the base map (See Fig. 3.40).

3.4.2 Clustering

There are two broad types of approaches to hierarchical cluster analysis: agglomerative
and divisive. In the agglomerative approach, the procedure starts with each point as its
own branch end-point and decides which two points to merge first. In each step, the
algorithm determines which two points and/or clusters to combine next. Thus, the
procedure agglomerates the points together until they are all in one cluster. Divisive
hierarchical cluster analysis works in the opposite manner, beginning with all points
together and subsequently dividing them into groups until each point is its own
group. Ward's method is an agglomerative approach.
Three methods of analysis are closely related to MDS. These are principal
component analysis (PCA), correspondence analysis (CA) and cluster analysis.
Principal component analysis (PCA) is performed on a matrix A of N entities
observed on p variables. The aim is to search for new variables, called principal
components, which are linear combinations of the original variables
and can account for most of the variation in the original variables. When
the distances among the entities are Euclidean, the coordinates contained in the MDS configuration X represent

Fig. 3.40 An MDS-configured base map of topical statements and ratings of importance shown
as stacked bars

the principal coordinates, which would be obtained when doing PCA on A. This
approach is called principal coordinates analysis, or classical scaling. A more
detailed account of this correspondence can be found in Everitt and Rabe-Hesketh
(1997).
Correspondence analysis is classically used on a two-way contingency table
in order to visualize the relations between the row and column categories. The
unfolding models do the same: subjects (row-categories) and objects (column-
categories) are visualized in a way that the order of the distances between a
subject-point and the object-points reflects the preference ranking of the subject.
The measure of “proximity” used in CA is the Chi-square distance between the
profiles. A short description of CA and its relation to MDS can be found in Borg
and Groenen (1997).
Cluster analysis models are equally applicable to proximity data including two-
way (asymmetric) square and rectangular data as well as three-way two-mode data.
The main difference from the MDS models is that most models for cluster analysis
lead to a hierarchical structure, in which path distances, under a number of
restrictions, approximate the dissimilarities.

Fig. 3.41 Hierarchical cluster analysis divided MDS coordinates into nine clusters

Celestial cartography divides the sky into 88 constellations to help us explore
stars and galaxies. Cities are divided into legible districts for easy navigation.
Similarly, we often divide a concept map into meaningful regions. Concept mapping
uses sorting results and MDS to produce the basic point map as the base map. Just
as in geographic mapping, there are times when we want more detail and other
times we want less. The point map generated by MDS is a fairly detailed map. One
way to generate a map at a higher level than a point map is to produce a cluster
map, in which data points are into clusters, by using clustering procedures such as
Hierarchical Cluster Analysis. The input to the cluster analysis is the point map,
specifically the coordinates for all of the points on the MDS map. Using the MDS
configuration as input to the cluster analysis forces the cluster analysis to partition
the MDS configuration into non-overlapping clusters in two-dimensional space. We
will come across other examples involving the concept of partition later in this
chapter.
In concept mapping, hierarchical cluster analysis is usually conducted using
Ward’s algorithm (Everitt 1980). Ward’s algorithm is especially appropriate for the
type of distance data that comes from the MDS analysis. The hierarchical cluster
analysis takes the point map and constructs a hierarchy, or a “tree”. At the root of
the tree, there is only one cluster – all points belong to the same trunk, whereas
the leaves of the tree have as many clusters as the total number of data points.
Anywhere in between, a cluster may contain a number of data points. Figure 3.41
shows a cluster map of the MDS configuration. The clusters were derived from the MDS
coordinates instead of the original data.
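With SciPy, this step amounts to a Ward linkage on the MDS coordinates followed by a cut of the resulting tree. A minimal sketch, using random points as a stand-in for a two-dimensional point map:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    coords = rng.random((80, 2))  # stand-in for two-dimensional MDS coordinates

    Z = linkage(coords, method="ward")  # agglomerative tree, Ward's criterion
    labels = fcluster(Z, t=9, criterion="maxclust")  # cut into nine clusters
    print(sorted(set(labels)))  # cluster labels 1 through 9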

Just as in geographic mapping, the cartographer makes decisions about scale and
detail depending on the intended uses of the map. There is no hard and fast rule for
determining the best number of clusters.
In Trochim’s concept mapping, rating data was used to provide the third-
dimension on a 2-dimensional map, a vertical overlay that depicts the height of
various regions. In a cluster map, the layers of a cluster depicted the average
importance rating of all statements within the cluster.
Meaningful text labels are essential for identifying the nature of point groupings
and clusters simply and clearly. Automatically generating meaningful labels is still
a challenge. The most straightforward way to generate labels is to ask people to
do it; if individuals give different labels, simply choose the label that makes the most
sense.

3.5 Network Models

Graph theory is a branch of mathematics that studies graphs and networks. A graph
consists of vertices and edges. A network consists of nodes and links. Many impor-
tant phenomena can be formulated as a graph problem, such as telecommunication
networks, club membership networks, integrated electric circuits, and scientific
networks. Social networks, for example, are graphs in which vertices represent
people and edges represent interrelationships between people. Acquaintanceship
graphs, co-author graphs, and collaboration graphs are examples of social networks.
To a mathematician, they are essentially the same thing. In graph theory, the focus
is on the connectivity of a graph – the topology, rather than the geometry. One of
the earliest graph-theoretical studies dates back to 1736, when Leonhard Euler
(1707–1783) published his paper on the solution of the Königsberg bridge problem.
Another classical problem in graph theory is the famous Traveling Salesman
Problem (TSP). In the twentieth century, graph theory became more statistical
and algorithmic, partly because we are now dealing with some very large graphs
such as the Web, telephone call graphs, and collaboration graphs. In this section,
two types of graphs are of particular interest to us: random graphs and small-world
networks.

3.5.1 Small-World Networks

The phrase "six degrees of separation" describes the phenomenon of a small
world in which any two randomly chosen people can discover a link through a chain of six
acquaintances. Ithiel de Sola Pool (1917–1984) pioneered the study of contact
networks, a line of work that became known as "the small world" phenomenon
(Kochen 1989). There was even a movie called "Six Degrees of Separation."

Fig. 3.42 A structural hole between groups a, b and c (Reprinted from Burt 2002)

In the 1950s and 1960s Anatol Rapoport studied social networks as random
graphs (Rapoport and Horvath 1961). He showed that if the placement of edges was
not completely random, it could produce a graph with a lower overall connectivity
and a larger diameter. Sociologist Mark Granovetter (1973) argued that it is through
casual acquaintances, or weak ties, that we obtain new information, rather than
through strong ties, or close personal friends. The weak ties across different groups
are crucial in helping communities mobilize quickly and organize for common
goals easily. In this vein, Ronald Burt (1992) extended the strength-of-weak-ties
argument, arguing that it was not so much the strength or weakness of a tie that
determined its information potential, but rather whether there was a structural hole
in someone's social network. A structural hole can be seen as a gap between clusters
that is spanned by a person with strong between-cluster connections but weak within-cluster
connections. Figure 3.42 illustrates two persons' connections in a social network
(Burt 2002). While both Robert and James have six strong ties and one weak tie,
Robert is in a more informed position than James because much of the information
reaching James would be redundant. Robert, on the other hand, is a bridge to clusters A and C.
Therefore, the number of connections in a social network is important, but the value
of each connection depends on how important it is for maintaining the connectivity
of a social network.
The degree of separation between two people is defined as the minimum length
of a chain of acquaintances connecting them. In a graph, this is the length of the
shortest path between the corresponding vertices; the longest of these shortest paths
is the diameter of the graph. You may have heard that everyone on Earth is separated from anyone else
by no more than six degrees of separation. Normally, the social world we know is
confined to a group of our immediate acquaintances; most of them know each other.

Our average number of acquaintances is very much less than the size of the global
population. So the claim that any two people in the world are just six degrees apart does
seem mysterious.
Stanley Milgram conducted a study in 1967 to test the small-world phenomenon
(Milgram 1967). He asked volunteers in Nebraska and Kansas to deliver packets
addressed to a person in Boston through people they know and who might get
it closer to its intended recipient. Milgram kept track of the letters and the
demographic characteristics of their handlers. He found a median chain length of
about six (5.5, to be precise). However, two-thirds of the packets were never delivered
at all, and the reported path length of 5.5 links was a typical value, not a maximum.
Over the past few years, there has been a surge of revived interest in this topic
among mathematicians, statisticians, physicists, and psychologists (Watts 1999; Watts
and Strogatz 1998). Brian Hayes wrote a two-part feature in American Scientist to
introduce some of the latest studies of far-reaching implications of the small-world
phenomenon (Hayes 2000a, b).

3.5.2 The Erdös-Renyi Theory

Random graphs are among the most intensively studied graphs. The Hungarian
mathematician Paul Erdös (1913–1996) and his colleague Alfred Renyi found that a
random graph has an important property: when the number of edges exceeds
half the number of vertices, a "giant component" emerges suddenly, so that most of
the vertices become connected in a single piece. This is known as the Erdös-Renyi
theory.
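The threshold is easy to observe empirically. A minimal sketch with networkx (a tool of convenience here, not one used in the original work) generates random graphs on 10,000 vertices with edge counts on either side of the halfway mark:

    import networkx as nx

    n = 10_000
    for m in (int(0.4 * n), int(0.5 * n), int(0.6 * n), n):
        G = nx.gnm_random_graph(n, m, seed=0)
        giant = max(nx.connected_components(G), key=len)
        # Above m = n/2 the largest component quickly dominates the graph
        print(f"{m} edges: largest component spans {len(giant)} vertices")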
Given that many huge graphs in the real world are not random graphs, it is
particularly interesting to ask whether such giant components exist in these graphs. For
example, a giant component in a citation network would indicate the mainstream
literature of a particular discipline. A giant component in the cyber-graph of the
World-Wide Web would identify the core users of the Web and the core customers
of e-commerce. James Abello of the AT&T Shannon Laboratories in New Jersey
studied the evolution of call graphs, in which the vertices are telephone numbers,
and the edges are calls made from one number to another (Abello et al. 1999).
Within 20 days, the graph grew to a gigantic network of 290 million vertices and 4
billion edges. This is simply too big to analyze with current computing resources.
Abello and his colleagues analyzed a one-day call graph, containing 53,767,087
vertices and 170 million edges. Among 3.7 million components, most of them
tiny, they found one giant connected component that connects 44,989,297 vertices
together, which is more than 80 % of the total number of vertices. The gigantic
component has a diameter of 20, which implies that any telephone number in the
component can be linked to any other through a chain of no more than 20 calls. The
emergence of a giant component is characteristic of Erdös-Renyi random graphs,
but the pattern of connections in the call graph is certainly not random.

A clique is a fully connected graph, also known as a complete graph, in which
every vertex is linked to every other vertex directly. Abello et al. found more than
14,000 cliques of 30 vertices in their call graph. Each clique represented a distinct
group of 30 individuals in which everyone talked with everyone else at least once
on the phone during that day. Within such cliques, the degree of separation is one.
The Web is by far the largest real-world graph. More and more researchers are
turning their attention to the structure and evolution of the Web. Physicist Albert-
László Barabási and his colleagues at the University of Notre Dame in Indiana, USA,
studied the topology of the Web and found a striking feature: web pages are separated
by an average of 19 degrees (Albert et al. 1999; Barabási et al. 2000). They counted
hyperlinks between 260,000 unique sites on the Web and found that the distribution
of links followed a power law (Barabási et al. 2000). The power law implies that web
pages with just a few links are most common, but pages with hundreds of links may
still exist even though they are rare. The age of a site did not seem to have much
to do with its number of links; all sites were not created equal. In fact, the more
links a web page has, the more new links it will attract. The rich get richer. Here
we go again. Furthermore, web pages with a large number of links are important
in forming a gigantic component of the Web and reducing the degree of separation
between web pages. Two special kinds of link-rich web pages were studied by Jon
Kleinberg of Cornell University, Prabhakar Raghavan and Sridhar Rajagopalan of
the IBM Almaden Research Center: hubs and authorities (Kleinberg 1998). Hubs
have a large number of outgoing links. Authorities have many incoming links.

3.5.3 Erdös Numbers

Paul Erdös (1913–1996) was a prolific Hungarian mathematician. Figure 3.24 is
a photograph of Erdös. He has been regarded as the most brilliant mind in graph
theory. He published over one thousand articles. When he died of a heart attack
in 1996, the New York Times wrote:
Concentrating fully on mathematics, he traveled from meeting to meeting, carrying a half-
empty suitcase and staying with mathematicians wherever he went. His colleagues took
care of him, lending him money, feeding him, buying him clothes and even doing his taxes.
In return, he showered them with ideas and challenges – with problems to be solved and
brilliant ways of attacking them.

The Erdös number of a mathematician is in fact defined as the degree of
separation between Erdös and the mathematician in a collaboration graph. If a
mathematician has published a joint article with Erdös, then his or her Erdös number
is one. The Erdös number of someone who did not write with Erdös directly, but
wrote with someone with an Erdös number of one, would be two, and so on. It was
thought that this collaboration graph should have a well-connected component with
Erdös at the center and linking to almost all active scientists.
While mathematicians have their Erdös numbers, Hollywood actors and actresses
can have their Bacon numbers. The “Hollywood graph” is a collaboration graph
that represents movie stars as vertices and edges connecting them if they ever
starred in a movie together. A version of the Hollywood graph in 2001 represents
355,848 actors and actresses from 170,479 movies. In this graph, the focus is on
the centrality of Hollywood actor Kevin Bacon in the film industry. This Hollywood
graph has gained widespread publicity, partly because researchers have found a way
to replicate some key characteristics of this graph (Watts and Strogatz 1998). Brett
Tjaden and Glenn Wasson of the University of Virginia maintain The Oracle of
Bacon, a website that calculates Bacon numbers.
Small-world networks are defined by three properties: sparseness, clustering, and
small diameter (Watts 1999). Sparseness means that the graph has relatively few
edges. Clustering means that edges are not uniformly distributed among vertices;
instead there tend to be clumps in the graph. Small diameter means that the longest
shortest path across the graph is small.
In 1998, Duncan Watts and Steven Strogatz of Cornell University found these
properties in the Hollywood graph and several other huge graphs (Watts and
Strogatz 1998). Watts and Strogatz used a rewiring strategy to produce a graph
somewhere between a random graph and a regular graph. The rewiring process
started with a regular lattice and then rewired some of the edges according to a
probability p, ranging from 0 to 1. If p is equal to 0, then everything remains
unchanged. If p is equal to 1, then every edge is re-arranged randomly and the lattice
becomes a random graph. They calculated the shortest path length L averaged
over all pairs of vertices and found that L dropped dramatically when just a few of
the edges were rewired. Watts and Strogatz also measured the degree of clustering
in their hybrid graphs using a clustering coefficient C. They found the clustering
coefficient C remained high until the rewiring probability was rather large. The
Hollywood graph demonstrated a good match to their model.
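The rewiring experiment can be reproduced with networkx's built-in Watts-Strogatz generator. A minimal sketch on a smaller lattice than theirs prints the average path length L and the clustering coefficient C as the rewiring probability p grows:

    import networkx as nx

    for p in (0.0, 0.01, 0.1, 1.0):
        # Ring lattice of 1,000 vertices, each joined to its 10 nearest
        # neighbors, with edges rewired at random with probability p.
        G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=p, seed=0)
        L = nx.average_shortest_path_length(G)
        C = nx.average_clustering(G)
        print(f"p={p}: L={L:.2f} C={C:.3f}")  # L collapses long before C does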

3.5.4 Semantic Networks

Semantic networks are useful tools for representing semantic knowledge and for
building inference systems. Historically, semantic networks refer to the classic network
theory of Collins and Quillian (1969), in which concepts are represented as hierarchies
of interconnected nodes with nodes linked to certain attributes. It is important to
understand the organization of large-scale semantic networks. By applying graph-
theoretic analyses, the large-scale structure of semantic networks can be specified
by distributions over a few variables, such as the length of the shortest path between
two words and the number of connections per word. Researchers have shown that
the large-scale organization of semantic networks reveals a small-world structure
that is very similar to the structure of several other real-life networks such as the
neural network of the worm C. elegans, the collaboration network of film actors
and the WWW. We have seen examples of Erdos numbers and Bacon numbers. We
return to C. elegans in later chapters for an example of gene expression visualization
of C. elegans.

Mark Steyvers and Josh Tenenbaum analyzed three types of semantic networks:
associative networks, WordNet, and Roget’s thesaurus (Steyvers and Tenenbaum
2001). They found that these semantic networks demonstrate typical features
of a small-world structure: sparseness, short average path lengths between words, and
strong local clustering. In these semantic networks it was also found that the
distributions of the number of connections follow power laws, suggesting a hub
structure similar to the WWW. They built a network model that acquires new
concepts over time and integrates them into the existing network. If new concepts
grow from well-connected concepts and their neighbors in the network, this network
model demonstrates the small-world characteristics of semantic networks and the
power-law distributions in the number of connections. An interesting prediction
of their model is that concepts that are learned early in the network acquire more
connections over time than concepts learned late.
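Their growth process is a close cousin of preferential attachment. The sketch below uses networkx's Barabási-Albert generator, a related but not identical model, to reproduce the age effect: nodes added early end up with far more connections than nodes added late.

    import networkx as nx

    # Nodes are added in index order, so low indices are "learned early"
    G = nx.barabasi_albert_graph(n=5000, m=3, seed=0)
    early = [d for _, d in G.degree(range(100))]
    late = [d for _, d in G.degree(range(4900, 5000))]
    print(sum(early) / len(early), sum(late) / len(late))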
For an example of a shortest pathway running through major scientific disciplines
instead of concepts, see Henry Small’s work on charting the pathways in scientific
literature (Small 2000), although he did not study these pathways as a small-
world phenomenon. In Chap. 5, we will introduce another trailblazing example
from Small’s work on specialty narratives (Small 1986). The small-world model
of semantic networks predicts that the earlier a concept is learned in the network,
the more connections it will get. This is hardly surprising. Sociologist Robert
Merton's Matthew Effect, or the rich get richer, leads us to think about the characteristics
of scientific networks. After all, the small-world phenomenon originated in
society. Practical implications of these small-world studies perhaps lie in how
one can find strong local clusters and build shortest paths to connect to these clusters.
These findings may also influence the way we see citation networks.

3.5.5 Network Visualization

3.5.5.1 Pajek

In Slovene, the word pajek means spider. The computer program Pajek is designed
for the analysis of large networks of several thousand vertices (Batagelj and Mrvar
1998). It is freely available for noncommercial use.4 Conventionally, a network with
more than hundreds of vertices can be regarded as large. There are even larger
networks: the Web, with its billions of web pages, forms a super-large network.
Réka Albert, Hawoong Jeong, and Albert-László Barabási analyzed the error and
attack tolerance of complex networks. The tool they used was Pajek. They illustrated

4
http://vlado.fmf.uni-lj.si/pub/networks/pajek/

the difference between an exponential and a scale-free network by visualizing a
network of 130 nodes and 215 links with the Pajek program (Albert et al. 2000).
An exponential network is homogeneous in terms of the way links are distributed
between nodes: most nodes have about the same number of links. A scale-free
network, on the other hand, is inhomogeneous: a few nodes have far more links than
their "fair share," while the remaining nodes may have as few as one or two links.
It is these link-rich nodes that keep the entire network in one piece. Pajek's network
analysis functions allowed the researchers to demonstrate the crucial difference
visually. The five "richest" nodes are colored in red and their first neighbors are in
green. In the exponential network, the five most connected nodes reach only 27 %
of the nodes. In contrast, in the scale-free network, more than 60 % are reached.
The topology of a scale-free network provides an interesting point of reference
for us. Many visualizations of intellectual networks in subsequent chapters of this
book indeed closely resemble the topology of a scale-free network,
although in many cases we achieve this by extracting a scale-free network from
a much larger network, which could be exponential.
If every vertex in a subset of vertices is connected to at least k other vertices
from the same subset, the subset is called a k-core. If every vertex in a
subset is connected to every other vertex in the subset, such a subset of vertices
is called a clique.
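Both structures can be extracted directly with networkx; a minimal sketch on a small random graph, standing in for a real call or citation graph:

    import networkx as nx

    G = nx.gnp_random_graph(200, 0.05, seed=0)

    core = nx.k_core(G, k=3)  # subgraph in which every vertex has >= 3 neighbors
    print(core.number_of_nodes())

    biggest = max(nx.find_cliques(G), key=len)  # maximal cliques, take the largest
    print(len(biggest))  # everyone in this subset is linked to everyone else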

3.5.5.2 Gephi

Gephi is probably the most popular network visualization software currently
available. Building on the rich resources of the graph drawing and information
visualization communities, Gephi offers an extensible and user-friendly platform
for analyzing and visualizing large-scale networks. It is flexible and supports popular
network formats such as GraphML and Pajek's .net format. In some areas, Gephi
has become competitive even with the most mature and widely used software from
earlier generations, such as Pajek. It can gracefully handle large networks.
Figure 3.43 is an example based on a layout produced by Gephi and
rendered by CiteSpace. It shows an extensive network of 18,811 references shaped
by the citation behavior of 4,000 publications per year from 2000 to 2011 in
relation to regenerative medicine. The colors indicate the time of publication. Early
publications are in darker colors, whereas more recent ones are in yellow and orange.
Labels on the map highlight the names of authors of the most highly cited
references. The area that corresponds to the iPSCs cluster is located at the upper left
corner of the network in orange, where the names of Takahashi and Yu are labeled.
Networks visualized at this level may provide a good starting point for making sense of
the dynamics of the evolving field. On the other hand, the devil is in the details:
differentiating topics, hypotheses, and findings at the document level is essential to the study
of an evolving scientific field.

Fig. 3.43 A visualization of a co-citation network associated with research in regenerative medicine. The colors indicate the time of publication

3.5.5.3 Large Graph Layout (LGL)

The layout of our map of influenza virus protein sequences was generated by LGL,
which first determines the layout of a large graph from a minimum spanning tree of the
graph. LGL is one of the computer programs openly available for visualizing large
graphs. It is written in C and the source code is available.5 It has mostly been used
in biomedical studies. More details are available on the web, but the project is no
longer actively maintained; to compile it, one has to download some legacy libraries
such as boost 1.33.1.6
LGL has been used to generate some of the most intriguing maps of the Internet
in 2003–2005. Examples can be found at http://www.opte.org/maps/.

3.6 Summary

In summary, in this chapter we have explored some of the most popular ideas and
techniques for mapping the mind. A good strategy for working with abstract data is
to apply the same technique to some concrete data, or data that we are familiar with.

5
http://lgl.sourceforge.net/
6
http://sourceforge.net/projects/lgl/forums/forum/584294/topic/3507979

Such exercises will help us to understand the characteristics of various algorithms
and improve our ability to grasp the message conveyed by visualizations. In the
next chapter, Chap. 4, we introduce a broader range of information visualization
principles and techniques. We explain how they can be applied for mapping
scientific frontiers in later chapters.

References

Abello J, Pardalos PM, Resende MGC (1999) On maximum clique problems in very large graphs.
In: Abello J, Vitter J (eds) External memory algorithms. American Mathematical Society,
Providence, pp 119–130
Albert R, Jeong H, Barabási A-L (1999) Diameter of the World Wide Web. Nature 401:130–131
Albert R, Jeong H, Barabási A-L (2000) Attack and error tolerance in complex networks. Nature
406:378–382
Barabási A-L, Albert R, Jeong H, Bianconi G (2000) Power-law distribution of the World Wide
Web. Science 287:2115a
Basalaj W (2001) Proximity visualization of abstract data. Retrieved November 5, 2001, from
http://www.pavis.org/essay/index.html
Batagelj V, Mrvar A (1998) Pajek: a program for large network analysis. Connections 21(2):47–57
Biglan A (1973) The characteristics of subject matter in different academic areas. J Appl Psychol
57:195–203
Borg I, Groenen P (1997) Modern multidimensional scaling. Springer, New York
Boyack KW, Wylie BN, Davidson GS, Johnson DK (2000) Analysis of patent databases using
Vxinsight (No. SAND2000-2266C). Sandia National Laboratories, Albuquerque
Burt RS (1992) Structural holes: the social structure of competition. Harvard University Press,
Cambridge, MA
Burt RS (2002) The social capital of structural holes. In: Guillen NF et al (eds) New directions in
economic sociology. Russell Sage Foundation, New York
Bush V (1945) As we may think. Atl Mon 176(1):101–108
Canter D, Rivers R, Storrs G (1985) Characterizing user navigation through complex data
structures. Behav Info Technol 4(2):93–102
Chen C (1999a) Information visualisation and virtual environments. Springer, London
Chen C (1999b) Visualising semantic spaces and author co-citation networks in digital libraries.
Info Process Manag 35(2):401–420
Chen C, Carr L (1999a) Trailblazing the literature of hypertext: author co-citation analysis (1989–
1998). Paper presented at the 10th ACM conference on hypertext (Hypertext’99), Darmstadt,
February 1999
Chen C, Carr L (1999b) Visualizing the evolution of a subject domain: a case study. Paper presented
at the IEEE visualization’99, San Francisco, 24–29 October 1999
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. IEE Comput
34(3):65–71
Chen H, Houston AL, Sewell RR, Schatz BR (1998) Internet browsing and searching: user
evaluations of category map and concept space techniques. J Am Soc Inf Sci 49(7):582–608
Chen C, Gagaudakis G, Rosin P (2000) Content-based image visualisation. Paper presented at
the IEEE international conference on information visualisation (IV 2000), London, 19–21 July
2000
Collins AM, Quillian MR (1969) Retrieval time from semantic memory. J Verbal Learn Verbal
Behav 8:240–248
Conklin J (1987) Hypertext: an introduction and survey. Computer 20(9):17–41

Cui W, Liu S, Tan L, Shi C, Song Y, Gao Z et al (2011) TextFlow: towards better understanding of
evolving topics in text. IEEE Trans Vis Comput Graph 17(12):2412–2421
Darken RP, Allard T, Achille LB (1998) Spatial orientation and wayfinding in large-scale virtual
spaces: an introduction. Presence 7(2):101–107
Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent
semantic analysis. J Am Soc Info Sci 41(6):391–407
Donoho D, Ramos E (1982) PRIMDATA: data sets for use with PRIM-H. Retrieved November 5,
2001, from http://lib.stat.cmu.edu/data-expo/1983.html
Dumais ST (1995) Using LSI for information filtering: TREC-3 experiments. In Harman D (ed)
The 3rd text REtrieval conference (TREC3), National Institute of Standards and Technology
Special Publication, pp 219–230
Everitt BS, Rabe-Hesketh S (1997) The analysis of proximity data. Arnold, London
Everitt B (1980) Cluster analysis. Halsted Press, New York
Flickner M, Sawhney H, Niblack W, Sahley J, Huang Q, Dom B et al (1995) Query by image and
video content: the QBIC system. IEEE Comput 28(9):23–32
Granovetter M (1973) The strength of weak ties. Am J Sociol 78:1360–1380
Greenacre MJ (1993) Correspondence analysis in practice. Academic, San Diego
Havre S, Hetzler B, Nowell L (2000) ThemeRiver: visualizing theme change over time. In:
Proceedings of IEEE symposium on information visualization, Salt Lake City, 9–10 October
2000, pp 115–123
Hayes B (2000a) Graph theory in practice: part I. Am Sci 88(1):9–13
Hayes B (2000b) Graph theory in practice: part II. Am Sci 88(2):104–109
He DC, Wang L (1990) Texture unit, texture spectrum, and texture analysis. IEEE Trans Geosci
Remote Sens 28(4):509–512
Helm CE (1964) Multidimensional ratio scaling analysis of perceived color relations. J Opt Soc
Am 54:256–262
Hetzler B, Whitney P, Martucci L, Thomas J (1998) Multi-faceted insight through interoperable vi-
sual information analysis paradigms. Paper presented at the IEEE information visualization’98,
Los Alamitos, 19–20 October 1998
Ingram R, Benford S (1995). Legibility enhancement for information visualisation. Paper presented
at the 6th annual IEEE computer society conference on visualization, Atlanta, October 1995
Kamada T, Kawai S (1989) An algorithm for drawing general undirected graphs. Info Process Lett
31(1):7–15
Kleinberg J (1998) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632
Kochen M (ed) (1989) The small world: a volume of recent research advances commemorating
Ithiel de Sola Pool, Stanley Milgram, Theodore Newcomb. Ablex Publishing Corporations,
Norwood
Kohonen T (1989) Self-organization and associate memory, 3rd edn. Springer, New York
Krumhansl CL (1978) Concerning the applicability of geometric models to similar data: the
interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
Kruskal JB (1977) The relationship between multidimensional scaling and clustering. In: van
Ryzin J (ed) Classification and clustering. Academic, New York, pp 17–44
Kruskal JB, Wish M (1978) Multidimensional scaling, Sage university paper series on quantitative
applications in the social sciences. SAGE Publications, Beverly Hills
Levine M, Jankovic IN, Palij M (1982) Principles of spatial problem solving. J Exp Psychol Gen
111(2):157–175
Levine M, Marchon I, Hanley G (1984) The placement and misplacement of You-Are-Here maps.
Environ Behav 16(2):139–157
Lynch K (1960) The image of the city. The MIT Press, Cambridge, MA
McCallum RC (1974) Relations between factor analysis and multidimensional scaling. Psychol
Bull 81(8):505–516
Milgram S (1967) The small world problem. Psychol Today 2:60–67
Miller GA (1969) A psychological method to investigate verbal concepts. J Math Psychol
6:169–191

Morris TA, McCain K (1998) The structure of medical informatics journal literature. J Am Med
Inform Assoc 5(5):448–466
Rapoport A, Horvath WJ (1961) A study of a large sociogram. Behav Sci 6(4):279–291
Rosvall M, Bergstrom CT (2010) Mapping change in large networks. PLoS One 5(1):e8694
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding.
Science 290(5500):2323–2326. doi:10.1126/science.290.5500.2323.
http://www.sciencemag.org/content/290/5500/2323
Schvaneveldt RW (ed) (1990) Pathfinder associative networks: studies in knowledge organization.
Ablex Publishing Corporations, Norwood
Schvaneveldt RW, Durso FT, Dearholt DW (1989) Network structures in proximity data. In: Bower
G (ed) The psychology of learning and motivation, 24. Academic, New York, pp 249–284
Shneiderman B (1996) The eyes have it: a task by data type taxonomy for information visualization.
Paper presented at the IEEE workshop on visual language, Boulder, 3–6 September 1996
Skupin A (2009) Discrete and continuous conceptualizations of science: implications for knowl-
edge domain visualization. J Informetr 3(3):233–245
Small H (1986) The synthesis of specialty narratives from co-citation clusters. J Am Soc Inf Sci
37(3):97–110
Small H (2000) Charting pathways through science: exploring Garfield’s vision of a unified index
to science web of knowledge – a Festschrift in Honor of Eugene Garfield. Information Today
Inc., New York, pp 449–473
Steyvers M (2000) Multidimensional scaling encyclopedia of cognitive science. Macmillan
Reference Ltd., London
Steyvers M, Tenenbaum J (2001) Small worlds in semantic networks. Retrieved December 2001,
from http://www-psych.stanford.edu/ msteyver/small worlds.htm
Swain M, Ballard H (1991) Color indexing. Int J Comput Vis 7:11–32
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500):2319–2323
Thorndyke P, Hayes-Roth B (1982) Differences in spatial knowledge acquired from maps and
navigation. Cogn Psychol 14:560–589
Tolman EC (1948) Cognitive maps in rats and men. Psychol Rev 55:189–208
Trochim W (1989) Concept mapping: soft science or hard art? Eval Program Plann 12:87–110
Trochim W (1993) The reliability of concept mapping. In: Annual conference of the American
Evaluation Association, Dallas
Trochim W, Linton R (1986) Conceptualization for evaluation and planning. Eval Program Plann
9:289–308
Trochim W, Cook J, Setze R (1994) Using concept mapping to develop a conceptual framework
of staff’s views of a supported employment program for persons with severe mental illness.
Consult Clin Psychol 62(4):766–775
Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352
Watts DJ (1999) Small worlds: the dynamics of networks between order and randomness. Princeton
University Press, Princeton
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature
393(6684):440–442
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Wise JA (1999) The ecological approach to text visualization. J Am Soc Inf Sci 50(13):1224–1233
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995). Visualizing the non-
visual: spatial analysis and interaction with information from text documents. Paper presented
at the IEEE symposium on information visualization’95, Atlanta, Georgia, 30–31 October 1995
Zahn CT (1971) Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE
Trans Comput C-20:68–86
Chapter 4
Trajectories of Search

Science is what you know, philosophy is what you don’t know.


Bertrand Russell (1872–1970)

In Chap. 3, we introduced basic principles of cartography for mapping abstract
structures that commonly result from our thinking, ranging from concept mapping
based on card sorting and co-word maps derived from word co-occurrence analysis, to
generic structures represented as networks, especially the interesting properties of a
class of gigantic graphs known as small-world networks. We also described typical
dimensionality reduction techniques, from classic multidimensional scaling
to the latest advances in non-linear multidimensional scaling.

4.1 Footprints in Information Space

Information visualization can be seen as a process of two stages: construction and
use. Now we focus on use, and on how to gather information from usage and
feed it back into construction so that the virtual environment becomes responsive.
Following like-minded people is a strategy widely used by many of us.
Trailblazing is an important concept in Memex, a global and persistent device
envisaged by Bush (1945) for storing and retrieving information. In Memex, users
are also builders by adding trails of their own into the information space. Such
trails provide valuable navigational cues for other users to find their way through
the enriched information space. The central idea of trailblazing is to preserve such
valuable information and make use of it as an integral part of the information space.
This vision of Bush has inspired several examples of visualizing trails and
intellectual pathways. The notion of intellectual pathways has been explored
in trailblazing scientific literatures (Chen 1999b; Chen and Carr 1999; Small
1986, 1999). Researchers have estimated the degree of relatedness between two


documents according to the likelihood that users would visit one document from
another via hyperlinks (Pirolli et al. 1996). In the following examples, we first
introduce a travel planning problem in the real world and then discuss real-world
navigation strategies in a virtual world.

4.1.1 Traveling Salesman Problem

The traveling salesman problem (TSP) is a classic example in algorithms. Given a
finite number of cities along with the cost of travel between each pair of them, the
salesman must find the cheapest way of visiting all the cities and returning to the
starting point. The TSP belongs to a class of hard problems: no known algorithm
solves it in polynomial time. The total number of cities in a solved instance is therefore
a hallmark of the strength of a TSP solution, and the size
of solved TSP examples has been steadily increasing. The largest number of cities in
a TSP solution is 15,112. Alexander Schrijver gave a good survey of this topic in his
paper "On the history of combinatorial optimization (till 1960)" (Schrijver 2001).
In an 1832 manual for the successful traveling salesman, the problem was
formulated without using mathematics. The manual suggested five tours through
Germany. Martin Groetschel published a 120-German-city TSP solution in 1977
(Groetschel 1977). The largest TSP solution to date is a traveling salesman problem through
15,112 cities in Germany, exceeding the 13,509-city tour through the United States
solved in 1998. The computation was carried out on a network of 110 processors
located at Rice University and at Princeton University. The optimal tour is equivalent
to a trip of approximately 66,000 km through Germany. It was proved to be the
optimal solution in April 2001. Figure 4.1 shows three famous traveling salesman
tours in Germany. Note that the projections of the map and the city data do not quite
match.
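Exact solutions at that scale require heavyweight machinery, but the flavor of the problem is easy to convey. The sketch below applies the greedy nearest-neighbor heuristic to a handful of hypothetical city coordinates; it returns a legal tour quickly, with no guarantee of optimality.

    import math

    cities = [(0, 0), (3, 1), (1, 4), (5, 5), (2, 2), (6, 1)]  # hypothetical

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    tour = [0]  # start at the first city
    unvisited = set(range(1, len(cities)))
    while unvisited:
        nearest = min(unvisited, key=lambda c: dist(cities[tour[-1]], cities[c]))
        tour.append(nearest)
        unvisited.remove(nearest)

    # Total tour length, closing the loop back to the starting point
    length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
                 for i in range(len(tour)))
    print(tour, round(length, 2))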
There are three reasons why I include examples of the traveling salesman
problem in a book about mapping knowledge structures in scientific frontiers. First,
the traveling salesman problem represents one of the most common tasks we do
with a map as a journey planner. In the abstract world, an equivalent question would
be “what is the shortest path that will link up all the necessary publications to help
me understand a research subject matter?" In fact, Henry Small at ISI extracted
pathways that offer precisely that: a virtual tour for a scientist through the intellectual
cities of scientific literature (Small 2000). A topic in subsequent discussion will focus
on users’ search patterns in a visual-spatial environment on computer. The second
reason is that in addition to geographic locations and thematic overlays in thematic
maps, there is another dimension worth noting: actions and events. The structure of
knowledge acquires more meaning in the context of such activities. The third reason
is related to the concept of trailblazing, which leads to the following examples.
Transitions from real-world navigation to virtual-world navigation are made by
studying how people navigate in virtual environments that replicate some common
navigation cues found in the real world. Darken and Sibert (1996) noted that

Fig. 4.1 Three Traveling Salesman tours in German cities: the 45-city Alten Commis-Voyageur
tour (green), the Groetschel’s 120-city tour (blue), and by far the latest 15,112-city tour (red)
(Courtesy of http://www.math.princeton.edu/)

survey knowledge acquired from a map tends to be orientation-specific. In contrast,
prolonged exposure to navigating an environment directly is more likely to result in
survey knowledge that is orientation-independent. The virtual reality-based visual
navigation therefore is likely to increase the opportunities for users to get familiar
with the underlying information structure.
Darken and Sibert (1996) found in their study that users were often disoriented
in virtual worlds without any landmarks, paths or cues. Simply adding cues like
borders, boundaries and gridlines significantly improved navigation performance.
An organizational metaphor, with landmarks and navigational cues, was of utmost
importance in successfully navigating these virtual worlds.

Fig. 4.2 Knowledge garden

Fig. 4.3 A scene in StarWalker when two users exploring the semantically organized virtual space

4.1.2 Searching in Virtual Worlds

In Knowledge Garden (Crossley et al. 1999), a knowledge management system
developed at BT Laboratory in the UK, documents are visualized as plants in a
garden (See Fig. 4.2). Although users’ trails are not directly visible, when a branch
starts to move back and forth, it means that someone else is reading that document.
Figure 4.3 is a screenshot of StarWalker when two users were exploring the
semantically organized virtual space (Chen 1999a). Figure 4.4 shows more users
are gathering in the scene.
Figure 4.5 is a map of a website produced by see POWER showing the layout
of an entire web site, with the colored contours representing the number of hits an

Fig. 4.4 More users gathering in the scene

Fig. 4.5 A site map produced by see POWER. The colored contours represent the hit rate of a
web page. The home page is the node in the center (Courtesy of http://www.compudigm.com/)

individual page has received. The home page is the node in the center, and the lines
linked to it represent navigation paths. Navigation issues can be quickly identified,
as can the effects of content changes.

4.1.3 Information Foraging

Searching for information is in many ways like the way human beings and animals
hunt for food. Research in biological evolution and optimal foraging identifies some
of the most profound factors that may influence our course of action. Whenever
possible, we prefer to minimize the energy we consume in searching for
information. We may also consider reducing other forms of cost. The bottom line
is that we want to maximize our returns while giving away the minimum amount of
resources. The perceived risks and expected gains affect where we search and
how long we keep searching in the same area. A theory adapted from anthropology,
the optimal information foraging theory (Pirolli and Card 1995), can explain
this type of behavior.
Sandstrom (1999) analyzed scholars' information-searching behavior as if they
were hunting for food, based on the optimal foraging theory developed in anthropol-
ogy. She focused on author co-citation relationships as a means of tracing scholars in
their information seeking. Sandstrom derived a novelty-redundancy continuum on
which information foragers gauge the costs and benefits of their course of search.
She found three types of center-periphery zones in scholars' mind maps: one's
home zone, the core groupings of others, and the remaining clusters of scholars.
Sandstrom’s study showed that scholars’ searching and handling mechanisms
varied by zone, and the optimal foraging theory does explain the variations.
For example, regular reading, browsing, or relatively solitary information seeking
activities often yielded resources belonging mostly to the peripheral zones of
scholars’ information environments. Peripheral resources tended to be first-time
references and previously unfamiliar to citing authors, whereas core resources
emerged from routine monitoring of key sources and the cited authors are very
familiar with such resources.
Sandstrom’s work draws our attention from the strongest and most salient
intellectual links in traditional author co-citation analysis to the weak bibliographic
connections and less salient intellectual links. Weak links that could lead to the
establishment of an overlooked connection between two specialties are particularly
significant for information foragers and scholars.
In order to understand users' navigation strategies in information foraging,
the profitability of a given document can be defined according to this cost-benefit
principle. For example, one can estimate the profitability of a specific area of an
information space as the proportion of relevant documents in the area divided by the
time it would take to read all the documents within it. In their study of the
Scatter/Gather system, Pirolli and Card found that even a much simplified model
of information foraging shows how users' search strategies can be influenced. For

Fig. 4.6 Modeling trails of information foragers in thematic spaces

example, users are likely to search widely in an information space if the query is
simple, and more narrowly if the query is harder (Pirolli and Card 1995). According
to the profitability principle, harder queries entail higher costs to resolve, and the
profitability of each document is relatively low. In general, users must decide
whether or not to pursue a given document in the course of navigation based on the
likely profitability of the document.
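To make the cost-benefit calculation concrete, here is a minimal sketch, not taken
from the original study, of the profitability estimate just described; the
reading-time constant is an assumption for illustration.

```python
# Illustrative sketch of the profitability estimate described above:
# the proportion of relevant documents in an area divided by the time
# needed to read every document in that area. The 120-seconds-per-document
# figure is an assumed constant, not a value from the study.
def profitability(n_relevant, n_total, seconds_per_doc=120.0):
    if n_total == 0:
        return 0.0
    proportion_relevant = n_relevant / n_total
    reading_time = n_total * seconds_per_doc
    return proportion_relevant / reading_time

# A dense, small cluster is more profitable than a sparse, large one,
# so a forager should prefer to search it first.
assert profitability(8, 10) > profitability(2, 50)
```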
In order to study sequential patterns in users' trails, we decided to visualize the
documents visited by users in sequence. One would expect the trail of a successful
information forager to lead to the target area, where the forager then spends a
considerable amount of time. The success of one user may provide insightful
information that helps another user overcome the weakest-link problem.

4.1.4 Modeling a Foraging Process

We introduce a theoretical framework to accommodate the optimal information
foraging theory along with modeling and visualization techniques. Figure 4.6 shows
the structure of this framework. First, the framework includes the design of

spatial-semantic interfaces. A significant number of overview maps are designed
in this way. Second, it contains task models of visual navigation. We refer to
Shneiderman's information visualization taxonomy (Shneiderman 1998). In the
following example, four thematic topics were chosen to form the basis of an exper-
imental design. Information visualization techniques such as Pathfinder networks
and minimum spanning trees (MST) were used in the experiment. Hidden Markov
Models (HMMs) can be derived from users' navigation sequences recorded in each
session. Each user's performance is measured in terms of recall and precision, as
traditionally done in information retrieval. Additional data, in particular the course
of navigation, were also collected. The navigation trails of the most successful users
were fed into the HMM modeling process. Finally, synthesized user trails
are generated from the HMMs and animated within the corresponding thematic
spaces.
In order to study users’ information foraging behavior, we constructed four
thematic spaces based on news articles from Los Angeles Times retrieved from
the Text Retrieval Conference (TREC) test data. Each thematic space contains the
top 200 news articles retrieved through a single keyword query to the document
collection. The four keywords used were alcohol, endangered, game, and storm.
Corresponding spaces were named accordingly by these keywords.
For each thematic space, we generated a document-to-document similarity
matrix using Latent Semantic Indexing (LSI) (Chen and Czerwinski 1998; Deer-
wester et al. 1990). Several types of spatial-semantic interfaces were produced,
including Pathfinder networks (PF) and minimum spanning trees (MST). In an MST-
based visualization, N−1 explicit links connect all the documents together. Users
can see these links on their screen. In a PF-based visualization, additional explicit
links are allowed as long as the triangle inequality condition is satisfied. In our
examples, the PF version tends to have at most a handful of extra links in
comparison to the MST version. Detailed descriptions of the use of these techniques
for information visualization can be found in Chen (1999a).
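As a rough illustration of how such an MST-based interface can be derived, the
sketch below computes a minimum spanning tree from a document similarity matrix
with SciPy; the random matrix merely stands in for an LSI-derived similarity
matrix, and the similarity-to-distance conversion is an assumption.

```python
# A minimal sketch of deriving an MST-based spatial-semantic structure
# from a document-document similarity matrix (e.g., produced by LSI).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
sim = rng.random((200, 200))          # stand-in for an LSI similarity matrix
sim = (sim + sim.T) / 2               # make it symmetric
np.fill_diagonal(sim, 1.0)

dist = 1.0 - sim                      # turn similarities into distances
mst = minimum_spanning_tree(dist)     # sparse matrix holding N-1 edges
edges = np.transpose(mst.nonzero())   # the explicit links shown to users
print(len(edges))                     # 199 links for 200 documents
```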
Sixteen users, nine male and seven female students from a British university,
participated in the experiment. They performed a series of tasks in each thematic
space through spatial-semantic interfaces. Usage data were logged to a computer
file in each session, including the occurrence of each event, its time stamp, and
the target document on which the event took place.
The design of the tasks follows Shneiderman's mantra: overview, zoom, filter,
details on demand, which highlights users' cognitive needs at various strategic
stages of visual information retrieval. At the top level, Task A asks users to
locate and mark documents relevant to a search topic. For example, in the Alcohol
space, users were asked to locate and mark any documents that mention an incident
of drink driving. Twenty to twenty-five documents had been judged relevant by
experts for the TREC conferences. Task B is more specific than Task A, and so on.
We expected users to narrow the scope of their search from Task A
through Task D, and that this would be evident in their navigation trails.

We introduce an integrated approach to the study of visual navigation strategies
based on a combination of the optimal information foraging theory and Hidden
Markov Models (HMMs). This approach visualizes users' navigation trails through
an information space with reference to an indicator of the profitability of each
document. The information space is organized by a spatial-semantic mapping
so that similar documents tend to appear near each other. Explicit links highlight
strongly similar documents. The profitability of a document therefore relies on the
semantics of the immediate neighborhood in which the document resides.
If we know that an area contains one document relevant to the query, then it is
more likely that its nearby neighboring documents are also relevant. In
this way, we can translate the optimal information foraging theory into observable
attributes associated with users' visual navigation strategies. Next, we present a
conceptual framework that accommodates the optimal information foraging theory,
Hidden Markov Models, spatial-semantic interfaces, and a taxonomy of visual
navigation. Then, we describe each component of the framework. The overall
approach is illustrated through an example in which visual navigation data were
drawn from an information retrieval experiment. Finally, implications of this
approach for understanding users' navigation strategies are discussed.

4.1.4.1 Hidden Markov Models

Hidden Markov Models (HMMs) are widely used in signal processing and speech
recognition. If we conceptualize users’ navigation as a sequence of observable
actions, such as clicking on a node or marking a node, we would expect that
behavioral patterns of navigation are likely to be governed by a latent cognitive
process, which is opaque to observers. For example, cognitive processes behind the
scene may include estimating the profitability of a document cluster and assessing
the relevance of a particular document. HMMs provide a potentially useful tool
to model such dual-process sequences. Given a sequence of observed actions, one
may want to know the dynamics of the underlying process. Given a model of an
underlying process, one may like to see what sequence is most likely to be observed.
Thus an HMM-based approach provides a suitable way to study users’ navigation
strategies as an information foraging process.
Hidden Markov Models are defined in terms of states and observations. States
are not observable, whereas observations are observable and are probabilistic
functions of states. A stochastic process governs state transitions, which means that
at each step the process of change is controlled by probabilities. Observations form
a stochastic process as well. An HMM can be defined as follows:

$N$ denotes the number of hidden states
$Q$ denotes the set of states, $Q = \{1, 2, \ldots, N\}$
$M$ denotes the number of symbols, or observations
$V$ denotes the set of symbols, $V = \{1, 2, \ldots, M\}$
$A$ denotes the state-transition probability matrix:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ \cdots & \cdots & a_{ij} & \cdots \\ \cdots & \cdots & \cdots & \cdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix}$$

where $a_{ij} = P(q_t = j \mid q_{t-1} = i)$, $1 \le i, j \le N$.

$B$ denotes the observation probability distribution: $b_j(k) = P(o_t = k \mid q_t = j)$, $1 \le k \le M$
$\pi$ denotes the initial state distribution: $\pi_i = P(q_1 = i)$, $1 \le i \le N$
$\lambda$ denotes the entire HMM model, $\lambda = (A, B, \pi)$

An HMM is completely defined by $\lambda = (A, B, \pi)$, which are known as the parameters
of the model. HMMs are typically used in the following scenarios:

1. Given an observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model $\lambda = (A, B, \pi)$, efficiently compute $P(O \mid \lambda)$. Given two models $\lambda_1$ and $\lambda_2$, this can be used to choose the better one.
2. Given an observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model $\lambda$, find the optimal state sequence $q = (q_1, q_2, \ldots, q_T)$.
3. Given $O = (o_1, o_2, \ldots, o_T)$, estimate the model parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$.

The well-known Viterbi algorithm has been widely used to find the most
likely path through a given HMM for each sequence, although for small state spaces
it is possible to work out the answer using a brute-force approach.
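For readers who want to see the algorithm in action, here is a self-contained
sketch of Viterbi decoding in Python; the two-state toy parameters are assumptions
for illustration, not values estimated from the experiment.

```python
# A self-contained sketch of the Viterbi algorithm for a small HMM,
# using toy parameters (A, B, pi) rather than values from the study.
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for the observed symbols."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # back-pointers
    with np.errstate(divide="ignore"):
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[i, j]: come from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # trace the back-pointers
        states.append(int(psi[t, states[-1]]))
    return states[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # state transitions
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # observation probabilities
pi = np.array([0.5, 0.5])                # initial distribution
print(viterbi(A, B, pi, obs=[0, 0, 1, 1, 0]))
```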
In order to apply HMMs to users' navigation sequences observed in each
thematic space, we derived the transition matrix and observation probabilities as
follows. The state space is defined by all the documents in a thematic space. Each
document defines a unique state:

$$d_1 \to S_1,\quad d_2 \to S_2,\quad \ldots,\quad d_N \to S_N$$

A user's trail is defined by a sequence of profitability estimates of documents
perceived by the user in the course of visual navigation. Since this is not directly
observable, we modeled such sequences as a stochastic process. Thus each trail
corresponds to a state transition sequence $S = \{S_{i_1}, S_{i_2}, S_{i_3}, \ldots\}$. The state transition
probability matrix is derived from the sequence of documents visited by a user in
his or her session, as the number of transitions from document $d_i$ to document $d_j$
divided by the number of visits to $d_i$:

$$a_{ij} = \frac{\#(d_i \to d_j)}{\#(d_i)}$$
The observation probabilities reflect the underlying stochastic process – the
perceived profitability of a sequence of documents. Three observation symbols
are defined: $o_k$, $k = 1, 2, 3$. $o_1$ denotes the user's mouse cursor moving over a
document, $o_2$ denotes the user clicking on the document, and $o_3$ denotes the user
marking the document as relevant. A sequence of observed symbols could be
$O = \{1, 1, 1, 2, 1, 1, 2, 3, \ldots\}$. The observation probabilities are also estimated
from the log files:

$$b_{ik} = \frac{o_k(d_i)}{\#(d_i)}$$

where $o_k(d_i)$ is the number of times symbol $o_k$ was observed at document $d_i$.

We chose the profitability function over the document space as the state
space and the three types of events as the observation symbols because the
sequence of activities such as node-over, node-click, and node-mark is a stochastic
process. This observable process is a function of a latent stochastic process: the
user's ongoing estimation of the profitability of documents in the thematic space.
Which document the user will move to next is largely opaque to observers.
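The frequency-ratio estimates above can be computed directly from the event logs.
The sketch below is illustrative: the toy trail, the event coding, and the helper
name are assumptions, not the project's actual code.

```python
# Estimate the transition matrix A and observation probabilities B from a
# logged session, following the frequency ratios a_ij = #(d_i -> d_j)/#(d_i)
# and b_ik = o_k(d_i)/#(d_i) defined above.
import numpy as np

def estimate_hmm(trail, events, n_docs, n_symbols=3):
    visits = np.zeros(n_docs)
    A = np.zeros((n_docs, n_docs))
    B = np.zeros((n_docs, n_symbols))
    for doc in trail:
        visits[doc] += 1
    for src, dst in zip(trail, trail[1:]):
        A[src, dst] += 1                  # count of d_i -> d_j transitions
    for doc, symbol in zip(trail, events):
        B[doc, symbol - 1] += 1           # count of symbol k at document d_i
    denom = np.maximum(visits, 1)[:, None]
    return A / denom, B / denom           # frequency-ratio estimates

trail = [57, 105, 80, 20, 80, 64]         # documents visited, in order
events = [1, 1, 2, 3, 1, 3]               # 1 = node over, 2 = click, 3 = mark
A, B = estimate_hmm(trail, events, n_docs=200)
```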
We constructed HMMs based on the actual trails recorded from sessions of the
experiment. HMMs are both descriptive and normative: not only can one describe
what happened in information foraging sessions, but one can also predict what
might happen in similar situations. HMMs provide insights into how users would
behave when exposed to the same type of structural and navigational cues in
the same thematic space.
We defined the basic problems as follows. The first basic question states that,
given an observation sequence $O = (o_1, o_2, \ldots, o_T)$, which is a sequence of
information foraging actions of a user, and a model $\lambda = (A, B, \pi)$, efficiently
compute $P(O \mid \lambda)$. Given two models $\lambda_1$ and $\lambda_2$, this can be used to choose the
better one. We first derived an HMM from the log files of two users: one had the
best performance score but no node-click events; the other had all types of events.
This model is denoted as $\lambda_{\text{log}}$. Given an observation sequence, it is possible to
estimate the model parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$, denoted
as $\lambda_{\text{seq}}$. The navigation sequence of the most successful user provided the input
to the modeling process.
The second basic question states that, given an observation sequence
$O = (o_1, o_2, \ldots, o_T)$ and a model $\lambda$, find the optimal state sequence
$q = (q_1, q_2, \ldots, q_T)$. In this case, we submitted users' navigation sequences
to the model $\lambda_{\text{log}}$ and animated the optimal state sequences within the thematic
space. In this way, we can compare the prevalent navigation strategies. Such
animation can also provide additional navigational cues to other users.
Finally, the third basic question states that, given an observation sequence
$O = (o_1, o_2, \ldots, o_T)$, estimate the model parameters $\lambda = (A, B, \pi)$ that
maximize $P(O \mid \lambda)$. We focused on the most successful user in searching a given
thematic space. If a user is clicking and marking documents frequently, it is likely
that the user has found a highly profitable set of documents.

4.1.4.2 Visualizing Trails of Foraging

Figure 4.7 is an annotated screenshot of the graphical interface design, which
explains how users' navigation sequences are animated. Documents in red are not
relevant to the search tasks. The course of navigation appears as dotted yellow
links. Relevance judgments made by experts are provided in the TREC test data.
Documents relevant to the original search are marked with a bright yellow dot in the

Fig. 4.7 Legend for the visualization of foraging trails

center. If the user marks a document as relevant in a search session, the document
is colored blue. When the user visits a document, a dark circle is drawn around it.
The time spent on a document is denoted by a green belt that grows until the user
leaves the document. If the user comes back to a previously visited document, a
new layer of dark circle is drawn and an additional layer of green belt starts to
grow. One can choose to carry the discs grown in one task over into the next task,
where a red disc indicates how long the user spent on the document in the
previous task.
We expect to observe the following patterns concerning users' navigation
strategies:
Spatial-semantic models may reduce the time spent on examining a cluster of
documents if the spatial-semantic mapping preserves the latent semantic structure.
Spatial-semantic models may mislead information foragers into over-estimating the
profitability of a cluster of documents if the quality of clustering is low.
Once users locate a relevant document in a spatial-semantic model, they tend to
switch to local search.
If we use the radius of a disc to denote the time spent on a document, the
majority of large discs should fall in the target areas of the thematic spaces. Discs
of subsequent tasks are likely to be embedded in discs of preceding tasks.

4.1.5 Trajectories of Users

Because of the superior performance results with MST-based interfaces, we restrict
our discussion to navigation strategies associated with the use of the MST version
of the ALCOHOL thematic space. Figure 4.8 shows an overview map of the
ALCOHOL space. Documents relevant to Task A are marked with bright yellow
dots in the center. All the relevant documents are clustered in the branch located at
the lower right-hand corner of the map, with the exception of documents 63

Fig. 4.8 Relevant documents for Task A in the ALCOHOL space (MST)

Fig. 4.9 Overview first: user jbr’s trails in searching the alcohol space (Task A)

and 21. Another special node in the map is node 57. Three of the four
users we studied chose this node as the starting point for their navigation.
Each trajectory map shows the course of visual navigation of a particular user.
Figure 4.9 shows the navigation trail for Task A in the alcohol space of user jbr,
who performed best in this group. Task A corresponds to the initial overview task in
Shneiderman's taxonomy. Users must locate clusters of relevant documents in the
map. Subsequent tasks are increasingly focused.
As shown in the trajectory map, user jbr started from node 57 and moved
downwards along the branch. Then the trajectory jumped to node 105 and followed
the long spine of the graph. Finally, the user reached the area where the relevant
documents are located. We found an interesting trajectory pattern: once the user

locates a relevant document, he tends to explore documents in the immediate
neighboring area, just as we expected. The frequency of long-range jumps across
the space decreased as the user became familiar with the structure of the space. The
trajectory eventually settled into fine-grained local search within the area where
the majority of relevant documents are placed, and it never moved away from that
area again, which was also what we expected.
In the trajectory replay, the time spent on a document is animated as the radius
of a green disc growing outward from where the document is located. This design
allows us to find out whether the majority of large green discs appear in areas with a
high density of relevant documents, and whether areas with a low density of relevant
documents attract only sporadic, passing navigation trails.
We found that users were able to mark certain documents extremely fast. For
example, user jbr apparently spent almost no time determining the relevance of
documents 80, 20, and 64 and marked them in blue. It seems that once users have
identified two relevant documents, they tend to identify relevant documents in
between very quickly. Explicit links in the visualization play a crucial role in
guiding the course of users' navigation. Not only do users follow these links in their
navigation, but they also make their relevance judgments based on the cues provided
by these visible links. In other words, users rely on these explicit links to a
considerable extent when they assess the profitability of a document.
Trajectory maps are designed so that an outline of the trajectory from the previous
task can be preserved and carried over to the next task. If a user spends a long time at
a document in Task A, the accumulative trajectory map starts with this information.
We expected to see users gradually narrow down the scope of their active search
areas. In addition, as users become increasingly familiar with the structure and
content of the underlying thematic space, there should be no need for them to
revisit areas of low profitability.
Figure 4.10 shows the "Zoom in" stage of the search. The search trail never went
back to the area already covered in the immediately preceding "Overview first"
stage. The next stage, "Details on demand," is shown in Fig. 4.11.
Figure 4.12 shows the trajectories of the same user, jbr, across the four tasks. These
maps reveal that the user spent longer and longer in areas with relevant documents.
In the last trajectory map, for Task D, the user began to forage for information in
new areas.
Trajectories of individual users have revealed many insightful findings. The next
step is to extract behavioral patterns from the group of users as a whole. From a
social navigation point of view, one has to understand not only the characteristics of
individual users' trajectories in a spatial-semantic space, but also the commonality
across individuals' behavioral patterns.
Hidden Markov Models allow us to describe and predict the sequential behavior
characteristics of users foraging for information in thematic spaces. We categorize
users' information foraging actions into three types of action events:
Node over,
Node click, and
Node mark.

Fig. 4.10 Zoom in…

Fig. 4.11 Details on demand

When the user moves the mouse over a document in the thematic space, the
title flashes up on the screen. When the user clicks on the document, the content
of the document becomes available. When the user has decided that the current
document is relevant to the task, he or she can mark the document.
First, we use two users' trails as the training set to build the first Hidden Markov
model, $\lambda_{\text{state}}$. We chose users jbr and nol because one marked the most documents
and the other clicked the greatest number of times.
The third parameter of a Hidden Markov model is the initial distribution, denoted
as $\pi$. Intuitively, this is the likelihood that users will start their information foraging
with a given document.

Fig. 4.12 Overview first, zoom in, filter, details on demand. Accumulative trajectory maps of
user jbr in four consecutive task sessions. Activated areas in each session reflect the changes of
scope (clockwise: Task A to Task D)

Table 4.1 The state sequence generated by the HMM for user jbr. Relevant documents are
in bold type
67 57 120 199 65 61 61 61 73 73 73 87 170 134 105 170 142 172 156 112 192 77 47 138
128 114 186 30 13 13 18 114 135 50 161 50 43 50 66 50 50 66 161 66 66 169 66 66 169
169 123 123 83 149 169 169 123 123 149 149 83 11 138 159 121 123 149 149 100 100 91
91 83 83 119 83 83 119 119 83 41 162 162 82 50 82 82 82 82 161 122 31 43 135 81 161 43
43 135 81 81 135 14 135 135 14 14 20 20 80 80 189 189 152 56 189 189 64 64 158

In addition to the above approach, one can derive an HMM by using the Baum-
Welch algorithm on a given sequence of observed actions. We used user jbr's
action sequence as the input and generated an HMM.
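As a point of reference, a comparable model can be fitted today with off-the-shelf
tools. The sketch below uses the third-party hmmlearn library (its CategoricalHMM
class, available in recent versions) as a stand-in for the original implementation;
the toy action sequence and the number of hidden states are assumptions.

```python
# A hedged sketch using hmmlearn rather than the study's own code.
# Symbols 0/1/2 stand for the node-over, node-click, and node-mark events;
# four hidden states is an arbitrary choice for illustration.
import numpy as np
from hmmlearn import hmm

actions = np.array([0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2]).reshape(-1, 1)
model = hmm.CategoricalHMM(n_components=4, n_iter=100, random_state=0)
model.fit(actions)               # Baum-Welch (EM) parameter re-estimation
hidden = model.predict(actions)  # Viterbi decoding of the hidden states
```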
Using the Hidden Markov model derived from user jbr's and user nol's actual
sequences, we can verify the internal structure of the model using the well-known
Viterbi algorithm. Given a Hidden Markov model $\lambda$ and a sequence of observed
symbols, the Viterbi algorithm generates a sequence of states. One can examine
this state sequence and compare it with the original sequence of events logged
from the user.
Table 4.1 shows the state sequence generated by the Viterbi algorithm based on
the HMM $\lambda_{\text{state}}$; the algorithm returns the sequence of states that is most likely to
emit the observed symbols, i.e. the information foraging sequence. Relevant
documents in the state sequence are highlighted in bold. This sequence is of course
identical to the original sequence recorded in the session.
Based on the HMM $\lambda_{\text{state}}$, we used user jbr's observed information foraging
action sequence as the input and applied the Viterbi algorithm to generate the
optimal state transition path. Figure 4.13 shows the path of the sequence generated
by the

Fig. 4.13 Synthesized trails. The trajectory of the optimal path over the original path of user jbr

Viterbi algorithm. The path started from the left-hand side of the thematic space,
traced the horizontal spine across the map, and reached the target area. The path
finished in the target area with several extended visits to relevant documents there.
The optimal path is drawn on top of the original trail of the same user. Showing the
two versions of the trails on the same thematic map makes clear where the
discrepancies are and where the conformance is. Since this is a novel way to
represent paths in a Hidden Markov model, many characteristics are yet to be fully
investigated. Even so, the synthesized path appears promising: it moves straight to
the target area, and some of the wandering in the original trail has been filtered out.
For social navigation, the optimal path is likely to provide an enhanced profile for
this group of users.
Our study of behavioral semantics focused on the alcohol space in the MST-
based interface. The thematic space was exposed to users for the first time in
Task A. Apart from the structural model, no navigation cues were readily available
to users. Users must first locate areas in the thematic space where they can find
documents relevant to the task. The optimal information foraging theory provides
an appropriate description of this type of process.
We have made the assumption that this is an information foraging process, and
that it is also a stochastic process, because much of the judgment and decision-
making in users' exploration and foraging of relevant documents is implicit and
difficult to externalize. The introduction of Hidden Markov models allows us to
build descriptive and normative models so that we can characterize the sequential
behavior of users in the context of information foraging. The visual inspection
of information foraging trails is encouraging. Animated trails and optimal paths
generated by HMMs have revealed many insights into how users dealt
with the tasks and what the prevailing characteristics and patterns are. Replaying
and animating HMM paths over actual trails allows us to compare transition
patterns in the same context.

In this example, we have focused on Task A, which is by nature a global
information foraging task spanning the entire thematic space. Users switched to
local search for subsequent tasks. We have touched upon the shrinking-scope
tendency in this study, but studies of the full range of tasks with reference to
Shneiderman's task-data type taxonomy should lead to deeper insights into how
users interact with visual-spatial interfaces.
As far as the resultant HMMs are concerned, a clearer understanding and
interpretation of the various characteristics manifested by paths selected by HMMs
is certainly desirable. We have only analyzed a small portion of the data generated
by our experiment. Among the twelve combinations of visual-spatial interfaces and
underlying thematic spaces, we have studied only one pair: Alcohol in MST.
In addition to animations of trails and HMM paths, one can use ghost avatars to
traverse the thematic space along with real users. Ghost avatars can travel along
HMM-generated paths as well as actual trails, which will in turn inspire other users
and draw their attention to profitable areas for information foraging.

4.2 Summary

In conclusion, many of our expectations have been confirmed by the visualization
and animation of the trails of information foragers in thematic spaces. The task we
have studied is global information foraging in nature. The initial integration of
optimal information foraging and Hidden Markov Models is promising, especially
with the facilities to animate user trails within the thematic spaces.
Visualizing an information foraging process has led to valuable insights into how
users explore and navigate through thematic spaces. The only visible navigation
cues for users in these spaces are structures resulting from a spatial-semantic
mapping. Labeling in its own right is a challenging issue: how to generate the
most meaningful labels and summarize unstructured documents. Users did indeed
raise the issue of labeling local areas in the thematic space. However,
because the aim of this study was to investigate information foraging behavior, we
decided not to label document clusters for users in the experiment.
The combination of the optimal information foraging theory and Hidden Markov
models plays an essential part in the study of users' navigation strategies. Future
studies could pursue several routes. One can repeat the study with a larger sample
of users and classify users according to their cognitive abilities or other criteria.
Then one can compare HMMs across different user classes and make connections
between users' information foraging behavior and their individual differences.
Future studies should expand the scope of tasks to cover a fuller range of
information foraging activities. Visual-spatial interfaces should be carefully
designed for future studies so that fundamental issues can be addressed.

This approach offers a methodology that combines technologies of information
visualization and user behavioral modeling. Not only can a user's navigation path
be vividly replayed on the computer screen, but so can a virtual path derived from a
group of users who share certain characteristics.
This chapter outlines a spectrum of techniques. Some of them have been widely
used in science mapping, while others, such as the behavioral semantics of trails,
are less common. The main point of this chapter is to outline a broader context in
which further studies of behavioral semantics can be carried out with reference to
science mapping.

References

Bush V (1945) As we may think. Atl Mon 176(1):101–108
Chen C (1999a) Information visualisation and virtual environments. Springer, London
Chen C (1999b) Visualising semantic spaces and author co-citation networks in digital libraries.
Inf Process Manag 35(2):401–420
Chen C, Czerwinski M (1998). From latent semantics to spatial hypertext: an integrated approach.
Paper presented at the 9th ACM conference on hypertext and hypermedia (Hypertext’98),
Pittsburgh, PA, June 1998
Chen C, Carr L (1999) Trailblazing the literature of hypertext: author co-citation analysis
(1989–1998). Paper presented at the 10th ACM conference on hypertext (Hypertext’99),
Darmstadt, Germany, February 1999
Crossley M, Davies J, McGrath A, Rejman-Greene M (1999) The knowledge garden. BT Technol
J 17(1):76–84
Darken RP, Sibert JL (1996) Wayfinding strategies and behaviors in large virtual worlds. Paper
presented at the CHI’96, Vancouver, BC
Deerwester S, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent
semantic analysis. J Am Soc Inf Sci 41(6):391–407
Groetschel M (1977) Polyedrische Charakterisierungen kombinatorischer Optimierungsprobleme.
Mathematical systems in economics, 36. Hain, Meisenheim am Glan
Pirolli P, Card SK (1995) Information foraging in information access environments. Paper
presented at the CHI’95, Denver, CO
Pirolli P, Pitkow J, Rao R (1996) Silk from a sow’s ear: extracting usable structures from the web.
Paper presented at the CHI’96, Vancouver, BC
Sandstrom PE (1999) Scholars as subsistence foragers. Bull Am Soc Inf Sci 25(3):17–20
Schrijver A (2001) On the history of combinatorial optimization (till 1960). Retrieved November
6, 2001, from http://www.cwi.nl/lex/files/histco.ps
Shneiderman B (1998) Codex, memex, genex: the pursuit of transformational technologies. Int J
Hum Comput Interact 10(2):87–106
Small H (1986) The synthesis of specialty narratives from co-citation clusters. J Am Soc Inf Sci
37(3):97–110
Small H (1999) On the shoulders of giants. Bull Am Soc Inf Sci 25(2):23–25
Small H (2000) Charting pathways through science: exploring Garfield’s vision of a unified index
to science web of knowledge – a Festschrift in honor of Eugene Garfield. Information Today
Inc., New York, pp 449–473
Chapter 5
The Structure and Dynamics of Scientific
Knowledge

If I have seen further it is by standing on the shoulders of Giants.
Isaac Newton (1642–1727)

In a letter to Robert Hooke in 1675, Isaac Newton made his most famous statement:
“If I have seen further it is by standing on the shoulders of Giants.” This statement
is now often quoted to symbolize scientific progress. Robert Merton examined the
origin of this metaphor in his On the Shoulders of Giants (Merton 1965). The
shoulders-of-giants metaphor can be traced to the French philosopher Bernard of
Chartres, who said that we are like dwarfs on the shoulders of giants, so that we can
see more than they, and things at a greater distance, not by virtue of any sharpness
of sight on our part, or any physical distinction, but because we are carried high and
raised up by their giant size.
In a presentation at the Conference on The History and Heritage of Science
Information Systems at Pittsburgh in 1998, Eugene Garfield used “On the Shoulders
of Giants” as the title of his tributes to an array of people who had made tremendous
contributions to citation indexing and science mapping, including Robert King
Merton, Derek John de Solla Price (1922–1983), Manfred Kochen (1928–1989),
Henry Small, and many others (Garfield 1998). In 1999, Henry Small used On the
Shoulders of Giants as the title of his ASIS Award speech (Small 1999). He explained
that if a citation can be seen as standing on the shoulder of a giant, then co-
citation is straddling the shoulders of two giants, a pyramid of straddled giants
is a specialty, and a pathway through science is playing leapfrog from one giant
to another. Henry Small particularly mentioned Belver Griffith (1931–1999) and
Derek Price as the giants who shared the vision of mapping science with co-citation.
Griffith introduced the idea of using multidimensional scaling to create a spatial
representation of documents. According to Small, the work of Derek Price in
modeling of the research front (Price 1965) had a major impact on his thinking.
The goal of this chapter is to introduce some landmark works of giants in
quantitative studies of science, especially groundbreaking theories, techniques, and


applications of science mapping. Henry Small praised the profound impact of
Thomas Kuhn on visualizing the entire body of scientific knowledge. He suggested
that if Kuhn's paradigms are snapshots of the structure of science at specific
points in time, examining a sequence of such snapshots might reveal the growth
of science. Kuhn (1970) speculated that citation linkages might hold the key to
solving this problem.
In this chapter, we start with general descriptions of science in action as reflected
through indicators such as productivity and authority. We follow the development of
a number of key methods for science mapping over the last few decades, including
co-word analysis and co-citation analysis. These theories and methods have been
an invaluable source of inspiration for generations of researchers across a variety of
disciplines. And we are standing on the shoulders of giants.

5.1 Matthew Effect

What is the nature of scholarly publishing? Will Internet-led electronic
publishing fundamentally change it? Michael Koenig and Toni Harrell (1995)
addressed this issue by using Derek Price's urn model of Lotka's law.
In 1926, Alfred Lotka (1880–1949) found that the frequency distribution of
authors' productivity in chemistry and physics followed a straight line with a slope
of 2:1 on a log-log scale (Lotka 1926). In other words, the number of authors who
published N papers falls off roughly in proportion to 1/N²; there are, for instance,
about one quarter as many authors with 2N papers as authors with N papers. This
is now known as Lotka's law.
Derek Price illustrated the nature of scholarship with the following urn model
(Price 1976). To play the game, we need a bag, or an urn, and two types of balls,
labeled "S" for success and "F" for failure. The player's performance in the game
is expected to track the performance of a scholar. The scholar must publish one
paper to start the game. Whenever he draws an "F", the game is over. There are two
balls at the beginning of the game: one "S" and one "F", so the odds are 50–50 on
the first draw. If he draws an "S", this ball plus another "S" ball are put back into the
bag and the scholar can make another draw. The odds improve with each round of
success. This game replicates almost exactly the distribution that Lotka derived
from observation.
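The urn game is easy to simulate. The following sketch is an illustration rather
than anything from Price's paper; with one "F" ball always in the urn, the
probability of a career of exactly n papers works out to 1/(n(n+1)), an
inverse-square-like distribution in line with Lotka's observation.

```python
# Simulate Price's urn game: one "S" and one "F" ball to start; each
# successful draw returns the "S" plus one more "S", so the odds of
# another paper improve with every success.
import random
from collections import Counter

def career_length(rng):
    papers, s_balls = 1, 1                # one paper published to enter
    while rng.random() < s_balls / (s_balls + 1):  # one "F" ball remains
        papers += 1
        s_balls += 1
    return papers

rng = random.Random(42)
counts = Counter(career_length(rng) for _ in range(100_000))
for n in range(1, 6):
    print(n, counts[n])                   # close to 100_000 / (n * (n + 1))
```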
Price’s urn model accurately and vividly characterizes the nature of scholarship.
A scholar is indeed playing a game: Publications and citations are how scholars
score in the game (Koenig and Harrell 1995). To stay in the game, scholars must
play it successfully. Each publication makes it easier for the scholar to score again.
Success breeds success. Electronic publishing on the Internet has the potential to
increase the odds in the urn because it has the potential to speed up the process.
Can online accessibility boost the citations of an article? Steven Lawrence and his
colleagues found a strong correlation between the number of citations of an article
and the likelihood that the article is online (Lawrence 2001). They analyzed 119,924
conference articles in computer science and related disciplines, obtained from

DBLP (http://dblp.uni-trier.de), an online computer science bibliography. Citation


counts and online availability were estimated using ResearchIndex. Their conclusion
was that online articles were likely to acquire more citations.
Robert King Merton was an American sociologist who revolutionized sociology
and mass communication research. He was a pioneer in the sociology and history of
science. He drew our attention to the "Matthew Effect" in scientific communities
(Merton 1968). He adopted the term from St. Matthew's Gospel in the Bible: "For
unto everyone that hath shall be given, and he shall have abundance; but for him
that hath not shall be taken away even that which he hath." (Bible, Matthew 13:12,
25:29). The "Matthew Effect" sums up the phenomenon that the rich get richer and
the poor get poorer. In the context of science, "the richness" refers to the reputation
and prominence of an established scientist; in contrast, "the poor" includes scientists
who have not reached this level. Established scientists tend to receive more than
their fair share of credit at the expense of those who are not famous. Here is how
Merton described the Matthew effect in scientific reward systems:
You usually notice the name that you’re familiar with. Even if it’s last, it will be the one that
sticks. In some cases, all the names are unfamiliar to you, and they’re virtually anonymous.
But what you note is the acknowledgement at the end of the paper to the senior person for
his ‘advice and encouragement.’ So you will say: ‘This came out of Greene’s lab, or so and
so’s lab.’ You remember that, rather than the long list of authors.
Social and political forces may limit the recognition of a scientist. Merton
described the "41st chair" phenomenon in the French Academy, which allows a
maximum of only 40 members. Many talented individuals were denied membership
of the Academy simply because of this restriction.
Merton’s other contribution to sociology of science is the concept of scientific
obliteration. He first described the idea in On the Shoulders of Giants (Merton
1965):
Natural enough, most of us tend to attribute a striking idea or formulation to the author who
first introduced us to it. But often, that author has simply adopted or revived a formulation
which he (and others versed in the same tradition) knows to have been created by another.
The transmitters may be so familiar with its origins that they mistakenly assume these to
be well known. Preferring not to insult their readers’ knowledgeability, they do not cite the
original source or even refer to it. And so it turns out that the altogether innocent transmitter
becomes identified as the originator of the idea when his merit lies only in having kept it
alive, or in having brought it back to life after it had long lain dormant or perhaps in having
put it to new and instructive use.
Obliteration happens in a scientific reward system when researchers no longer
feel it necessary to cite something everyone has already taken for granted. Take
Archimedes' constant π, for example. Archimedes discovered the ratio between
the circumference and diameter of a circle: π. As Archimedes' constant becomes
increasingly familiar even to schoolchildren, scientists cite Archimedes'
primordial paper less and less, until finally there is no need to cite it at all, which
means his original paper has been obliterated. This is regarded as one of the
highest compliments the community of scientists can pay to a scientist: a
contribution so basic, so vital, and so well known that every scientist can simply
take it for granted (Garfield 1975).

Two more examples of obliteration are worth mentioning. One is the notion of
the "exponential growth of scientific literature". Derek Price formulated the law
of exponential growth of scientific literature in 1950 in his paper to the 6th
International Congress for the History of Science at Amsterdam. Before long,
scientists from different disciplines obliterated it and took the exponential growth
for granted. The notion of a "paradigm shift" is another example. Phrases such as
"new paradigms" and "a paradigm shift" frequently appear in scientific literature
without direct citations to Thomas Kuhn's seminal book The Structure of Scientific
Revolutions (Kuhn 1962).
In information science, a hallmark of "obliteration" is the annual Award of Merit
from the American Society for Information Science and Technology (ASIS&T). The
Award of Merit is the highest honor of ASIS&T, given to individuals who have made
an outstanding contribution to the field of information science. Henry Small of ISI
was the recipient of the 1999 award for his work in co-citation analysis. We will
include some examples of his work in this chapter. Don Swanson, professor
emeritus at the University of Chicago, was the recipient of the 2000 award for his
renowned work on undiscovered public knowledge.
In Science since Babylon, Derek Price (1961) used the term invisible college to
emphasize the role of informal networks of scientists in scientific communication.
The term was originally used in seventeenth-century London to refer to an
informal club of artisans and practitioners that preceded the formal organization of
the Royal Society. Diana Crane (1972) regarded such informal scholarly communica-
tion networks as the "lifeblood of scientific progress for both the physical and the
social sciences."
Science mapping has been a long-standing pursuit of revealing the dynamics of an
invisible college and the evolution of intellectual structures. Derek Price has been
regarded as the leader of the field of the Science of Science, a precursor of
the social studies of science and the field of scientometrics. Scientometrics is the
quantitative study of scientific communication.
In science mapping, we must consider a wide variety of fundamental concepts
that distinguish the level of granularity of each individual study. Such concepts
are known as units of analysis. Examples of abstract units include ideas, concepts,
themes, and paradigms. These concepts are represented and conveyed through
words, terms, documents, and collections by individual authors, groups of authors,
specialties, and scientific communities. The following examples in this chapter
illustrate association relationships between several types of units of analysis, such as
word co-occurrences in text, document co-occurrences in bibliographies (document
co-citation), author co-occurrences in bibliographies (author co-citation), and patent
co-occurrences in patent publications (patent co-citation).
Science mapping reveals structures hidden in scientific literature. The definition
of association determines the nature of the structure to be extracted, to be visualized,
and to be eventually interpreted. Co-word analysis (Callon et al. 1986) and
co-citation analysis (Small 1973) are among the most fundamental techniques
for science mapping. Small (1988) described the two as follows: “if co-word
links are viewed as translations between problems, co-citation links have been

viewed as statements relating concepts.” They are the technical foundations of the
contemporary quantitative studies of science. Each offers a unique perspective on
the structure of scientific frontiers. Researchers have found that a combination of
co-word and co-citation analysis could lead to a clearer picture of the cognitive
content of publications (Braam et al. 1991a, b).

5.2 Maps of Words

The tradition of deriving higher-level structures from word-occurrence patterns in
text originated in the co-word analysis method developed in the 1980s (Callon et al.
1983, 1986). Co-word analysis is a well-established camp within scientometrics,
a field of quantitative studies of science concerned with indicators and metrics of
the dynamics of science and technology at large. The outcome of co-word analysis
is typically depicted as a network of concepts.

5.2.1 Co-Word Maps

The history of co-word analysis has some interesting philosophical and sociological
implications for what we will see in later chapters. First, one of the key arguments
of the proponents of co-word analysis is that scientific knowledge is not merely pro-
duced within "specialist communities" which independently define their research
problems and clearly delimit the cognitive and methodological resources to be
used in their solution. The attention given to "specialist communities" is due to
the influence of the work of Thomas Kuhn, particularly his Postscript
to the second edition of The Structure of Scientific Revolutions. There are some
well-known examples of this approach, notably the invisible college of Diana
Crane (1972). Specialty areas are often identified by an analysis of citations in
scientific literature (Garfield et al. 1978). Co-citation analysis was developed
in this context (Small 1977; Small and Greenlee 1980). A general criticism of
the sociology of specialist communities was made by Knorr-Cetina (1999). Edge
(1979) gave critical comments on delimiting specialty areas by citations. In 1981,
issue 11(1) of Social Studies of Science was devoted to the analysis of scientific
controversies. We will return to Kuhn's theory when we explain its role in
visualizing scientific frontiers in later chapters of the book.
In 1976, Henry Small raised the question of social-cognitive structures in science
and underlined the difficulties of using experts to help identify them, because
experts are biased. Co-word analysis was developed to provide an "objective"
approach that does not rely on domain experts.
The term leximappe was used to refer to this type of concept map. More
specific types of such maps are inclusion maps and proximity maps. Subsequent
developments in relation to co-word analysis have incorporated artificial neural

network techniques such as self-organizing maps to depict patterns and trends
derived from text; see Lin (1997) and Noyons and van Raan (1998) for examples.
The pioneering software for concept mapping is Leximappe, developed in the
1980s. It organizes a network of concepts based on associations determined by the
co-word method. In the 1980s, it was Leximappe that turned co-word analysis into
an instrumental tool for social scientists, who carried out numerous studies rooted
in the famous actor-network theory (ANT).
Key concepts in Leximappe include poles and their positions in concept maps. The
position of a pole is determined by centrality and density: centrality implies the
capacity for structuring; density reflects the internal coherence of the pole.
Leximappe is used to create structured graphic representations of concept net-
works. In such networks, vertices represent concepts, and the strength of the
connection between two vertices reflects the strength of their co-occurrence. In the
early days, an important step was to tag each word in the text as a noun, a verb, or
an adjective. Algorithms used in information visualization systems such as
ThemeScape (Wise et al. 1995) have demonstrated promising capabilities for
filtering out nouns from the source text.

5.2.2 Inclusion Index and Inclusion Maps

Inclusion maps and proximity maps are two types of concept maps resulting from
co-word analysis. Co-word analysis measures the degrees of inclusion and proximity
between keywords in scientific documents and automatically draws maps of
scientific areas as inclusion maps and proximity maps, respectively.
Metrics for co-word analysis have been extensively studied. Given a corpus of
N documents, each document is indexed by a set of unique terms that can occur in
multiple documents. If two terms, $t_i$ and $t_j$, appear together in a single document,
it counts as a co-occurrence. Let $c_k$ be the number of occurrences of term $t_k$ in
the corpus and $c_{ij}$ be the number of co-occurrences of terms $t_i$ and $t_j$, which is the
number of documents indexed by both terms. The inclusion index $I_{ij}$ is essentially a
conditional probability. Given the occurrence of one term, it measures the likelihood
of finding the other term in documents of the corpus:

$$I_{ij} = \frac{c_{ij}}{\min(c_i, c_j)}$$

For example, Robert Stevenson’s Treasure Island has a total of 34 chapters.


Among them the word map occurred in 5 chapters, cmap D 5, and the word treasure
occurred 20 chapters, ctreasure D 20. The two terms co-occur in 4 chapters, thus
cmap, treasure D 4. Imap, treasure D 4/5 D 0.8. In this way, we can construct an inclusion
matrix of terms based on their co-occurrence. This matrix defines a network. An
interesting step described in the original version of co-word analysis is to remove
certain types of links from this network.
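A few lines of code suffice to compute the inclusion index; the sketch below
reproduces the Treasure Island figures with made-up chapter numbers (only the
counts 5, 20, and the overlap of 4 match the example above).

```python
# Inclusion index I_ij = c_ij / min(c_i, c_j), computed from the sets of
# chapters in which each term occurs. The chapter ids are invented;
# only the counts (5, 20, overlap 4) match the example in the text.
def inclusion_index(chapters_i, chapters_j):
    c_ij = len(chapters_i & chapters_j)
    return c_ij / min(len(chapters_i), len(chapters_j))

map_chapters = {1, 2, 3, 4, 5}                           # c_map = 5
treasure_chapters = set(range(2, 22))                    # c_treasure = 20
print(inclusion_index(map_chapters, treasure_chapters))  # 4 / 5 = 0.8
```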

Fig. 5.1 An inclusion map of research in mass extinction based on index terms of articles on mass
extinction published in 1990. The size of a node is proportional to the total number of occurrences
of the word. Links that violate the first-order triangle inequality are removed (ε = 0.75)

The original co-word analysis prunes a concept graph using a triangle inequality
rule on conditional probabilities. Suppose we have a total of N words in the analysis.
For $1 \le i, j, k \le N$, let $\omega_{ij}$, $\omega_{ik}$, and $\omega_{kj}$ represent the weights of links in the
network, where $\omega_{ij}$ is defined as $1 - I_{ij}$. Given a pre-defined small threshold
$\varepsilon$, if there exists an index k such that

$$\omega_{ij} > \omega_{ik} \cdot \omega_{kj} + \varepsilon$$

then we should remove the link between $t_i$ and $t_j$. Because $\omega_{ik} \cdot \omega_{kj}$
defines the weight of a path from term $t_i$ to $t_j$ through $t_k$, this operation means
that if we can find a shorter path from $t_i$ to $t_j$ than the direct link, then we choose
the shorter one. In other words, if a link violates the triangle inequality, it must be
invalid and should therefore be removed. By raising or lowering the threshold
$\varepsilon$, we can decrease or increase the number of valid links in the network. This
algorithm is simple to implement. In co-word analysis, we usually only compare a
one-step path with a two-step path. However, when the size of the network
increases, this simple algorithm tends to admit too many links and the resultant
co-word map tends to lose its clarity. In the next chapter, we will introduce
Pathfinder network scaling as a generic form of the triangle inequality condition,
which enables us to compare much longer paths connecting two points and to
detect more subtle association patterns in the data.
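The pruning rule itself takes only a few lines. Below is a minimal sketch of the
first-order version, assuming a symmetric matrix I of inclusion indices; it is a
naive O(N³) implementation for illustration.

```python
# Prune links that violate the first-order triangle inequality:
# with weights w_ij = 1 - I_ij, drop the direct link i-j whenever some
# intermediate k gives w_ij > w_ik * w_kj + eps.
import numpy as np

def prune_links(I, eps=0.75):
    n = I.shape[0]
    w = 1.0 - I
    keep = ~np.eye(n, dtype=bool)             # no self-links
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            for k in range(n):
                if k != i and k != j and w[i, j] > w[i, k] * w[k, j] + eps:
                    keep[i, j] = False        # a shorter two-step path exists
                    break
    return keep
```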
Figure 5.1 shows a co-word map based on the inclusion index. The co-word
analysis was conducted on the index terms of articles published in 1990, retrieved
from a search in the Web of Science with the query "mass extinction". The meaning
of this particular co-word map should become clear when you complete Chap. 7,
which contains a

detailed account of the background and key issues in the study of mass extinction.
The main reason we skip the explanation here is its involvement with theories and
examples of competing paradigms, a unique characteristic of a scientific frontier.

5.2.3 The Ontogeny of RISC

Steve Steinberg (1994) addressed several questions regarding the use of a quanti-
tative approach to identify paradigm shifts in the real world. He chose to examine
Reduced Instruction Set Computing (RISC). The idea behind RISC was that a pro-
cessor with only a minimal set of simple instructions could outperform a processor
that included instructions for complex high-level tasks. In part, RISC marked a
clear shift in computer architecture and had reached some degree of consensus.
Steinberg searched for quantitative techniques that could help his investigation.
Eventually he found that the co-word analysis technique could produce a map of
the field, a visualization of its mechanisms, and a battle chart of the debate. He
wrote (Steinberg 1994): "If I could see the dynamics of a technical debate, I thought,
perhaps I could understand them."
He collected all abstracts with the keyword RISC for the years 1980–1993 from
the INSPEC database, filtered out the 200 most common English words, and ranked
the remaining words by frequency. The top 300 most frequently occurring words
were given to three RISC experts, who chose the words central to the field. Finally,
the words chosen by the experts were aggregated by synonym into 45 keyword
clusters. The inclusion index was used to construct a similarity matrix, which was
then mapped by multidimensional scaling (MDS) with ALSCAL. The font size of a
keyword was proportional to the word's frequency, and strongly linked keywords
were connected by straight lines.
Figure 5.2 shows the co-word map for the period 1980–1985. The first papers
to explicitly examine and define RISC appeared within this period. Because the
design philosophy of RISC was so opposed to the traditional computer architecture
paradigm, every paper in this period was written to defend and justify RISC. The
map shows two main clusters. One is on the left, surrounding keywords such
as register, memory, simple, and pipeline; these are the architectural terms that
uniquely define RISC. The other cluster is on the right, centered on keywords such
as language and CISC; these are the words that identify the debate between the RISC
and CISC camps. Language is the most frequent keyword on the map. According
to Steinberg, the term language most clearly captures the key to the debate between
RISC and CISC. While CISC proponents believed that a processor's instruction
set should correspond closely to high-level languages such as FORTRAN and
COBOL, RISC proponents argued that simple instructions were better than high-
level instructions. This debate is shown in the co-word map by the connections
between language, CISC, compiler, and programming.
To illustrate the paradigm shift, we also include the co-word map of another
period: 1986–1987 (Fig. 5.3). During this period, Sun introduced the first

Fig. 5.2 The co-word map of the period 1980–1985 for the debate on RISC

Fig. 5.3 The co-word map of another period, 1986–1987, for the debate on RISC

commercially important RISC microprocessor, the SPARC, in 1986. RISC had
transformed from papers into a tangible product, backed by investors. The bi-polar
co-word map of the previous period is now dominated by the RISC cluster. The
technology of RISC implementation, namely VLSI, has become larger and more
central.
On the one hand, the reconfiguration of the co-word map from bi-polar to lop-
sided indicates that the high-level language argument had been settled. On the other
hand, the map provides few clues of how this transformation took place. The lack
of interpretable indicators at detailed levels is not uncommon with co-word maps
and indeed with other types of bibliometric maps as well. In order to interpret
a visualized structure, one has to resort to some substantial levels of domain
knowledge, or at least one has to read some qualitative summaries of the subject.
In fact, it is advisable to consult a good review article to double-check the validity of interpretations of the map. In this example, Steinberg himself was an expert on the topic of RISC and he incorporated his domain knowledge into the interpretation of co-word maps generated from abstracts. Researchers in quantitative studies of science have also recommended a multiple-method strategy – approaching the same phenomenon with several different methods – so that one can compare and contrast results from different perspectives and piece together a big picture. If mapping the
paradigm shift with one single technique is like the blind men approaching the
elephant, combining different techniques may lead to a more accurate model of
the elephant. Next, we turn to co-citation analysis, which is another major approach
that has been used for science mapping.

5.3 Co-Citation Analysis

Citation analysis takes into account one of the most crucial indicators of scholar-
ship – citations. Citation analysis has a unique position in the history of science
mapping because several widely used analytical methods have been developed to
extract citation patterns from scientific literature and these citation patterns can
provide insightful knowledge of an invisible college. Traditionally, both the philosophy of science and the sociology of knowledge have had a strong impact on citation analysis. Opponents of citation analysis criticize its reliance on the idea of invisible colleges and scientific communities, arguing that the way science operates goes far beyond the scope of citation practices (Callon et al. 1986). However, this issue
cannot be simply settled by theoretical arguments. Longitudinal studies and large-
scale domain analysis can provide insightful answers, but they tend to be very time-
consuming and resource demanding. In practice, researchers have been exploring
frameworks that can accommodate both co-word analysis and co-citation analysis
(Braam et al. 1991a, b). These efforts may provide additional insights into the philo-
sophical and sociological debates. Document co-citation analysis (DCA) and author
co-citation analysis (ACA) represent the two most prolific mainstream approaches
to co-citation analysis. Here we first introduce DCA and then explain ACA.

5.3.1 Document Co-Citation Analysis

Citation indexing provides a device for researchers to track the history of advances
in science and technology. One can trace a network of citations to find out the history
and evolution of a chain of articles on a particular topic. The goal of citation analysis
is to make the structure of such a network more recognizable and more accessible.
Fig. 5.4 A document co-citation network of publications in Data and Knowledge Engineering

Traditional citation analysis is typically biased towards journal publications due to the convenience of available citation data. Expanding the sources to other scientific
inscriptions, such as books, proceedings, grant proposals, patents, preprints, and
digital resources on the Internet, has begun to attract the attention of researchers
and practitioners. In 2002, when I wrote the first edition of this book, we anticipated a sharp increase, over the following 3–5 years, in patent analysis and in studies utilizing Web-based citation indexing techniques because of the growing interest and
commercial investments in supporting patent analysis with knowledge discovery
and visualization techniques. Today, major resources for citation analysis in-
clude Thomson Reuters’ Web of Science, Elsevier’s Scopus, and Google Scholar.
Figure 5.4 shows a visualization of a document co-citation network of publications
in Data and Knowledge Engineering. The color-coded clusters indicate the foci of the community at various stages of the research. The most cited paper is Peter Chen's paper that introduced the entity-relationship modeling method.

5.3.1.1 Specialties

In information science, the term specialty refers to the perceived grouping of scientists who specialize in the same or closely related topics of research.
Theories of how specialties evolve and change started to emerge in the 1970s
(Small and Griffith 1974). Researchers began to focus on the structure of scientific
literatures in order to identify and visualize specialties, although they did not use
the term “visualization” at that time.
Fig. 5.5 Citation analysis detected a vital missing citation from Mazur's paper in 1962 to Rydon's paper in 1952

Recently, science-mapping techniques have begun to reveal structures of
scientific fields in several promising visualization metaphors, including networks,
landscapes, and galaxies. The ability to trace scientific and technological
breakthroughs from these science maps is particularly important. The key questions
are: what are these maps telling us, and how do we make use of such maps at both
strategic and tactical levels?
Today’s most widely used citation index databases such as SCI and SSCI were
conceived in the 1950s, notably in Garfield's pioneering paper published in
Science (Garfield 1955). In the 1960s, several pioneering science mapping studies
began to emerge. For example, Garfield, Sher, and Torpie created the historical map
of research in DNA (Garfield et al. 1964). Sher and Garfield demonstrated the power
of citation analysis in their study of Nobel Prize winners’ citation profiles (Sher
and Garfield 1966). Figure 5.5 shows how citation analysis spotted a vital missing
citation to earlier work (Garfield 1996).
In the 1970s, information scientists began to focus on methods that could reveal patterns and trends reflected through the scientific literature. Henry Small demonstrated
the power of SCI-Map in mapping the structure of research in AIDS (Small 1994).
Once the user specified an author, a paper, or a key word as the seed, SCI-Map
could create a map of related papers by adding strongly co-cited papers to the map.
The creation of a map involved a series of iterations of clustering. The layout was
generated by a method called geometric triangulation, which is different from the
MDS approach used in Small’s earlier work and in similar studies in the US and
Europe (Garfield 1996).
Henry Small and Belver Griffith initiated co-citation analysis for identifying and
mapping specialties from the structure of scientific literature (Small and Griffith
1974). Articles A and B have a co-citation count of k if there are k articles that each cite both A and B. A high co-citation count implies a strong intellectual tie between the two articles.
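The counting itself is simple enough to sketch; the reference lists below are invented for illustration.

```python
# Counting document co-citations: every citing paper contributes one
# co-citation to each pair of items in its reference list.
from collections import Counter
from itertools import combinations

references = {              # citing paper -> cited papers (toy data)
    "P1": ["A", "B", "C"],
    "P2": ["A", "B"],
    "P3": ["B", "C"],
}

cocitations = Counter()
for cited in references.values():
    for a, b in combinations(sorted(set(cited)), 2):
        cocitations[(a, b)] += 1

print(cocitations[("A", "B")])   # -> 2: P1 and P2 each cite both A and B
```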
In a longitudinal study of collagen research, Henry Small tracked the movement
of specialties in collagen research using a cluster-based approach (Small 1977).
He emphasized the fundamental role of systematic and consistent methodological
frameworks. He used the frequency of co-citation to measure the strength of the
association between articles on the topic. He marked clusters of highly cited articles
in MDS maps with contour lines so that he could track rapid shifts in research focus from one year to the next as articles moved in and out of key cluster contours, using such movements as an indicator of "revolutionary" changes.
In the 1980s, the Institute for Scientific Information (ISI) published the Atlas of
Science in Biochemistry and Molecular Biology, which identified more than 100
distinct clusters of articles, known as research front specialties, and provided a
distinct snapshot of scientific networks. The Atlas was constructed based on co-
citation relationships between publications in the field over a period of 1 year. In
1989, Garfield and Small explained how software like SCI-Map could help users
navigate the scientific literature and visualize the changing frontiers of science based
on citation relationships (Garfield and Small 1989). Henry Small described in detail
his citation mapping approach to visualizing science. Figure 5.6 shows a global map
of science for 1996 produced by co-citation mapping. The map highlighted major
connections among disciplines such as economics, neuroscience, biomedicine,
chemistry, and physics. The size of a circle was proportional to the volume of a particular scientific literature; for example, the large biomedicine circle in the center of the map indicates the huge number of biomedical publications in journals. Computer science, shown as a relatively small circle in the map, linked to imaging and economics. The small volume of computer science reflects the fact that journal publications are merely a small proportion of the entire computer science literature, which also spans conference proceedings, technical reports, and preprints. One
can also zoom into the global map of science and examine local structures (See
Figs. 5.7 and 5.8).
MDS maps and clustering algorithms are typically used in co-citation analysis
to represent co-citation structures. There is an increasing interest in using graph-
drawing techniques to depict the results of co-citation analysis, including minimum
spanning trees (MST) and Pathfinder networks (PF). The increased use of the
metaphor of an information landscape is another trend, in which the entire structure
can be rendered as a mountain terrain or a relief map.
Fig. 5.6 A global map of science based on document co-citation patterns in 1996, showing
a linked structure of nested clusters of documents in various disciplines and research areas
(Reproduced from Garfield 1998)

5.3.1.2 Narratives of Specialties

Creating a science map is the first step towards exploring and understanding
scientific frontiers. Science maps should guide us from one topic or specialty to
related topics or specialties. Once we have a global map in our hands, the next
logical step is to find out how we can make a journey from one place to another
based on the information provided by the map. Small introduced the concept of
passage through science. Passages are chains of articles in scientific literature.
Chains running across the literature of different disciplines are likely to carry a
method established in one discipline into another. Such chains are vehicles for
cross-disciplinary fertilization. Traditionally, a cross-disciplinary journey would
require scientists to make a variety of connections, translations, and adaptations.
Small demonstrated his powerful algorithms by blazing a magnificent trail of
more than 300 articles across the literatures of different scientific disciplines.
This trailblazing mechanism brought Bush's (1945) concept of information trailblazing to life.
Fig. 5.7 Zooming in to reveal a detailed structure of biomedicine (Reproduced from Garfield 1998)

Henry Small described what he called the synthesis of specialty narratives from
co-citation clusters (Small 1986). This paper won the JASIS best-paper award in
1986. Small first chose a citation frequency threshold to select the most cited
documents in SCI. The second step was to determine the frequency of co-citation
between all pairs of cited documents above the threshold. Co-citation counts
were normalized by Salton’s cosine formula. Documents were clustered using the
single-link clustering method, which was believed to be more suitable than the
complete-link clustering algorithm because the number of co-citation links can be
as many as tens of thousands. Single-link clusters tend to form a mixture of densely
and weakly linked regions in contrast to more densely packed and narrowly focused
complete-link clusters. MDS was used to configure the layout of a global map.
Further, Small investigated how to blaze trails in the knowledge space represented
by the global map. He called this type of trail the specialty narrative.
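These steps translate into a short computation. The sketch below uses toy citation counts; Salton's cosine is cos(A, B) = cocit(A, B) / sqrt(cit(A) x cit(B)), and SciPy's single-link clustering stands in for the implementation Small used.

```python
# Normalize co-citation counts by Salton's cosine formula and group the
# documents with single-link clustering. All counts are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

citations = np.array([50, 40, 30, 20])       # citation counts per document
cocit = np.array([[ 0, 25,  5,  1],          # symmetric co-citation counts
                  [25,  0,  4,  2],
                  [ 5,  4,  0, 12],
                  [ 1,  2, 12,  0]])

cosine = cocit / np.sqrt(np.outer(citations, citations))
dist = 1.0 - cosine
np.fill_diagonal(dist, 0.0)

# Single-link hierarchical clustering on the condensed distance matrix.
labels = fcluster(linkage(squareform(dist), method="single"),
                  t=0.8, criterion="distance")
print(labels)    # e.g. [1 1 2 2]: two co-citation clusters
```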
Small addressed how to transform a co-citation network into a flow of ideas. The
goal for specialty narrative construction is to find a path through such networks so as
to track the trajectory of scientists who had encountered these ideas. Recall that the
traveling salesman problem (TSP) requires the salesman to visit each city exactly
once along a route optimized against a given criterion. We are in a similar situation
with the specialty narrative construction, or more precisely, the re-construction of
Fig. 5.8 Zooming in even further to examine the structure of immunology (Reproduced from
Garfield 1998)

narrative trails when we retrace the possible sequence of thought by following trails
of co-citation links. TSP is a hard problem to solve. Luckily, there are some very
efficient algorithms to traverse a network, namely breadth-first search (BFS) and depth-first search (DFS), each of which yields a spanning tree of the network. Small
considered several possible heuristics for the traversal in his study. For example,
when we survey the literature, we tend to start with some old articles so as to form
a historical context. A reasonable approach is to start from the oldest article in
the co-citation network. In this example, DFS was used to generate a spanning tree, and the longest path through the tree was chosen as the main sequence of the specialty narrative (see Fig. 5.9).
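The traversal heuristic can be sketched as follows, with invented articles; networkx's minimum spanning tree stands in for the DFS-generated tree of the original study, and the longest path is found with the standard double-sweep trick for trees.

```python
# Build a spanning tree of a (toy) co-citation network and extract the
# longest path through the tree as the main narrative sequence.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("1954a", "1960b", 0.3), ("1960b", "1968c", 0.2),
    ("1968c", "1975d", 0.4), ("1960b", "1971e", 0.5),
])
mst = nx.minimum_spanning_tree(G, weight="weight")

# In a tree, the farthest node from any start (here the oldest article) is
# an endpoint of the longest path; a second sweep recovers that path.
oldest = "1954a"
far = max(nx.single_source_shortest_path_length(mst, oldest).items(),
          key=lambda kv: kv[1])[0]
paths = nx.single_source_shortest_path(mst, far)
main_sequence = max(paths.values(), key=len)
print(main_sequence)    # e.g. ['1975d', '1968c', '1960b', '1954a']
```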
The context of citing provides first-hand information on the nature of citation.
A specialty narrative is only meaningful and tangible if sufficient contextual
information of citation is attached to the narrative. The citation context of a given
article consists of sentences that explicitly cite the article. Such sentences may come
from different citing articles. Different authors may cite the same article for different
reasons. On the other hand, researchers may cite several articles within one sentence.
Small took all these circumstances into account in his study. In the foreseeable
future, we will still have to rely on human intervention to make such selections,
as opposed to automated algorithmic devices. Nevertheless, NEC’s ResearchIndex
has shown some promising signs of how much we might benefit from citation
contexts automatically extracted from documents on the Web. In his 1986 specialty
narrative study, Small had to examine passages from citing papers, code them, and key them in before running a program to compute the occurrence frequencies. This specialty narrative study was rigorously planned, carefully carried out, and thoroughly explained. Henry Small's JASIS award-winning paper contains many inspiring ideas and technical solutions that predated the boom of information visualization in the 1990s. Over the last 15 years, this paper has been a source of inspiration for citation analysis; we expect it will also influence information visualization and knowledge visualization in a fundamental way.

Fig. 5.9 The specialty narrative of leukemia viruses. Specialty narrative links are labeled by citation-context categories (Reproduced from Small 1986)
Robert Braam, Henk Moed and Anthony van Raan investigated whether co-
citation analysis indeed provided a useful tool for mapping subject-matter spe-
cialties of scientific research (Braam et al. 1991a, b). Most interestingly, the
cross-examination method they used was co-word analysis. Their work clarified a
number of issues concerning co-citation analysis.
The cluster of co-cited documents is considered to represent the knowledge
base of a specialty (Small 1977). In a review of bibliometric indicators, Jean King (1987) summed up the objections against co-citation analysis: loss of relevant papers, inclusion of non-relevant papers, overrepresentation of theoretical papers, time lag, and subjectivity in threshold setting. There were even more skeptical claims that co-citation clusters were mainly artifacts of the applied technique, with no further identifiable significance. Braam and his co-workers addressed several issues in their
investigation in response to such concerns. For example, does a co-citation cluster
identify a specialty? They used concepts such as "cognitive coherence" within
clusters and “cognitive differences” between clusters. Their results suggested that
co-citation analysis indeed showed research specialties, although one specialty may
be fragmented across several different clusters. They concluded that co-citation
clusters were certainly not artifacts of an applied technique. On the other hand, their
study suggested that co-citation clusters did not represent the entire body of publi-
cations that comprised a specialty. Therefore, they concurred with the recommendation
of Mullins et al. (1988) that it would be necessary to analyze different structural
aspects of publications so as to generate significant results in science mapping.

5.3.2 Author Co-Citation Analysis

The 1980s saw the beginning of what turned out to be a second fruitful line of development in the use of citations to map science – author co-citation analysis (ACA). Howard White and Belver Griffith introduced ACA in 1981 as a way to map intellectual structures (White and Griffith 1981). The units of analysis in ACA are authors and their intellectual relationships as reflected through the scientific literature.
The author-centered perspective of ACA led to a new approach to the discovery
of knowledge structures in parallel to approaches used by document-centered co-
citation analysis (DCA).

5.3.2.1 Intellectual Structures

An author co-citation network offers a useful alternative starting point for co-citation
analysis, especially when we encounter a complex document co-citation network,
and vice versa. Katherine McCain (1990) gave a comprehensive technical review of
mapping authors in intellectual spaces. ACA reached a significant turning point in
1998 when White and McCain (1998) applied ACA to information science in their
thorough study of the field. Since then ACA has flourished and has been adopted
by researchers across a number of disciplines beyond the field of citation analysis
itself. Their paper won the best JASIS paper award. With both ACA and DCA at hand, we find ourselves in a position to compare and contrast the messages conveyed through different co-citation networks of the same topic, as if we had two pairs of glasses.
Typically, the first step is to identify the scope and the focus of ACA. The raw
data are either analyzed directly or, more commonly, converted into a correlation
matrix of co-citation. Presentations often combine MDS with cluster analysis or
PCA. Groupings are often produced by hierarchical cluster analysis. Figure 5.10
illustrates a generic procedure of a standard co-citation procedure. For example,
node placement can be done with MDS; clustering can be done with the single-
or complete-link clustering; PCA might replace clustering. In practice, some
researchers choose to work on raw co-citation data directly, whereas others prefer
to work on correlation matrices. To our knowledge, there is no direct comparison between the two routes in terms of the quality of clustering, although it would be useful to know the strengths and weaknesses of each route. Partitioning can divide a global view into more manageable regions and make the map easier to understand. Finally, additional information such as citation counts and co-citation strengths can be rendered in the map to convey the message clearly.

Fig. 5.10 A generic procedure of co-citation analysis. Dashed lines indicate visualization options
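As an illustration of this generic pipeline (not of any one published study), the sketch below converts a toy raw co-citation matrix into Pearson correlations, lays the authors out with MDS, and partitions them with complete-link clustering; working on the raw counts instead would simply skip the correlation step.

```python
# Generic co-citation procedure: raw counts -> Pearson correlations ->
# MDS layout plus hierarchical clustering. All numbers are illustrative.
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

cocit = np.array([[ 0, 30,  2,  1],
                  [30,  0,  3,  2],
                  [ 2,  3,  0, 20],
                  [ 1,  2, 20,  0]], dtype=float)

r = np.corrcoef(cocit)                # correlate authors' co-citation profiles
dist = 1.0 - r                        # similarity -> dissimilarity
np.fill_diagonal(dist, 0.0)

xy = MDS(n_components=2, dissimilarity="precomputed",
         random_state=0).fit_transform(dist)
groups = fcluster(linkage(squareform(dist, checks=False), method="complete"),
                  t=2, criterion="maxclust")
print(xy.round(2), groups)            # layout coordinates and, e.g., [1 1 2 2]
```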
In their pioneering 1981 study, White and Griffith created the first ever author
co-citation map of information science from the Social Sciences Citation Index®
(SSCI) for 1972–1979. Their map showed five main clusters of authors within
the field of information science (See Fig. 5.11). Each cluster corresponded to a
specialty:
1. Scientific communication,
2. Bibliometrics,
3. Generalists,
4. Document Analysis/Retrieval evaluation/Systems, and
5. Precursors.
In this first author co-citation map of information science, scientific communica-
tion is on the left and information retrieval on the right. Over the last 20 years,
researchers have created several co-citation maps of the field of information
science – their home discipline. Later maps have shared some characteristics of
this structure.
The author co-citation map produced by White and Griffith (1981) depicted
information science over a 5-year span (1972–1979). In 1998, 17 years later,
White and McCain (1998) generated a new map of information science based on
a considerably expanded 23-year span (1972–1995). They first selected authors
who had been highly cited in 12 key journals of information science. Co-citations
of 120 selected authors between 1972 and 1995 were extracted from SSCI. They
Fig. 5.11 The first map of author co-citation analysis, featuring specialties in information science
(1972–1979) (Reproduced from White and Griffith 1981)

generated maps of the top 100 authors in the field. Major specialties in the field were identified using factor analysis. The resultant map showed that the field of information science consisted of two major specialties with little overlap in terms of their memberships, namely experimental retrieval and scientific communication. Citation analysis belongs to the same camp as scientific communication. One of the remarkable findings was that the new map preserved the basic two-camp structure of the 1981 map, with scientific communication and information retrieval at opposite ends of the map.
White and McCain demonstrated that authors might simultaneously belong to
several specialties. Instead of clustering authors into mutually exclusive specialties,
they used PCA to accommodate the multiple-specialty membership for each author.
First, the raw co-citation counts were transformed into Pearson’s correlation
coefficients as a measure of similarity between pairs of authors (White and McCain
1998). They generated an MDS-based author co-citation map of 100 authors in
information science for the period of 1972–1995. It is clear from the map that
information science was made up of two major camps: the experimental retrieval camp on the right and the citation analysis camp on the left. The experimental retrieval camp includes names such as Vannevar Bush (1890–1974), Gerard Salton (1927–1995), and Don Swanson, whereas the citation camp includes Derek de Solla Price (1922–1983), Eugene Garfield, Henry Small, and Howard White. Thomas Kuhn
(1922–1996) appears at about the coordinates of (1.3, 0.8).
Fig. 5.12 A two-dimensional Pathfinder network integrated with information on term frequencies
as the third dimension (Reproduced from Chen 1998)

In 1997, I started to explore Pathfinder network scaling as a vehicle to visualize complex networks (Chen 1997b, 1998); see Figs. 5.12, 5.13, and 5.14.
Pathfinder network scaling filters out excessive links in a network while maintaining
the salient structure of the network, more precisely, by preserving links that satisfy
the triangle inequality throughout the network. In 1999, I published a study
of author co-citation networks using Pathfinder network scaling techniques and
demonstrated the advantages of using Pathfinder over multidimensional scaling
because Pathfinder networks display connections explicitly and preserve salient
structure while pruning excessive links (Chen 1999).
In 2003, Howard White revisited the same dataset used in their 1998 author
co-citation analysis and applied the Pathfinder network techniques to represent
co-cited authors. He concluded that Pathfinder networks provide considerable
advantages over MDS maps because Pathfinder networks make the connections
explicit. Figure 5.15 shows a Pathfinder network of 121 information science authors based
on raw co-citation counts. Garfield, Lancaster, and Salton are the most prominent
authors in the Pathfinder network; each is surrounded by a large number of co-cited
authors.
White and McCain (1998) discussed some issues concerning detecting paradigm
shifts. They compared author co-citation networks over three consecutive periods
using INDSCAL. White and McCain’s work is a significant step towards under-
standing how we may grasp the dynamics of a scientific community and track the
development of a discipline.
Fig. 5.13 A Pathfinder network of SIGCHI papers based on their content similarity. The interac-
tive interface allows users to view the abstract of a paper seamlessly as they navigate through the
network (Reproduced from Chen 1998)

Fig. 5.14 A Pathfinder network of co-cited authors of the ACM Hypertext conference series
(1989–1998) (Reproduced from Chen and Carr 1999)
Fig. 5.15 A Pathfinder network of 121 information science authors based on raw co-citation
counts (Reproduced from White 2003)

5.3.2.2 Generalized Similarity Analysis

Generalized Similarity Analysis (GSA) is a generic framework for structuring and
visualizing distributed hypermedia resources (Chen 1997a, 1998). See Chap. 4 for
a detailed discussion. GSA uses Pathfinder networks to achieve an improved clarity
of a generic network. John Leggett of Texas A&M, a keynote speaker at the ACM Hypertext '98 conference in Pittsburgh, talked about "camps" in hypertext research and "runners" between these invisible camps: who they were and where they were now. Inspired by White and McCain's author co-citation maps
and John Leggett’s thought-provoking keynote speech, we were able to pull things
together by applying GSA to ACA. Leslie Carr at the University of Southampton
provided me with the citation data for the ACM Hypertext conference series.
We presented a Pathfinder-powered visualization of the co-citation networks of hypertext research at the ACM Hypertext '99 conference in Darmstadt, Germany, in 1999. Since then, we have developed a systematic and consistent framework
for ACA and document co-citation analysis (DCA) to accommodate Pathfinder
networks side-by-side with traditional dimensionality reduction techniques such
as MDS and PCA, and to work with information visualization techniques such
as animation, color mapping, and three-dimensional landscaping. By 2001, we
consolidated the methodology into a four-step procedure for domain visualization
(Chen and Paul 2001). Having created global thematic landscapes of a subject domain, we turned our focus to the question of the functionality of such visualizations and maps. It became clear that a more focused perspective is the key to a more fruitful use of such visualizations. This is the reason we will turn to Thomas Kuhn's puzzle-solving paradigms and focus on scenarios of competing paradigms in scientific frontiers in the next chapter. Henry Small's specialty narrative also provides an excellent example of how domain visualization can guide us to greater access to the core knowledge in scientific frontiers.

Table 5.1 Comparisons of networks by the number of links, where K is the number of unique edges in the graph

G = (Vertices, Edges)    #Vertices    #Edges (K)    Example: N = 367
MDS                      N            0             0
MST                      N            N − 1         366
PF                       N            3N            398
Full matrix              N            N(N − 1)/2    61,175

5.3.2.3 MDS, MST, and Pathfinder

Multidimensional scaling (MDS) maps are among the most widely used maps for depicting intellectual groupings. MDS-based maps are consistent with Gestalt principles – our perceived groupings are largely determined by proximity, similarity, and continuity. MDS is designed to optimize the match between pairwise proximities in high-dimensional data and the corresponding distances in a low-dimensional configuration. In principle, MDS should place similar objects next to each other in a two- or three-dimensional map and keep dissimilar ones farther apart.
MDS is easily accessible in most statistical packages such as SPSS, SAS, and
Matlab. However, MDS provides no explicit grouping information. We have to judge
proximity patterns carefully in order to identify the underlying structure. Proximity-
based pattern recognition is not easy and sometimes can be misleading. For
example, one-dimensional MDS may not necessarily preserve a linear relationship.
A two-dimensional MDS configuration may not be consistent with the results of
hierarchical clustering algorithms – two points next to each other in an MDS
configuration may belong to different clusters. Finally, three-dimensional MDS may
become so visually complex that it is hard to make sense of it without rotating
the model in a 3D space and studying it from different angles. Because of these
limitations, researchers often choose to superimpose additional information over
an MDS configuration so as to clarify groupings of data points, for example, by
drawing explicit boundaries of point clusters in an MDS map. Most weaknesses of
MDS boil down to the lack of local details. If we treat an MDS configuration as a graph, we can easily compare the number of links in various network solutions with those in an MDS configuration (see Table 5.1).
Figure 5.16 shows a minimum spanning tree (MST) of an author co-citation
network of 367 prominent authors in the field of hypertext. The original author
co-citation network consisted of 61,175 links among these authors. A fully
connected symmetric matrix of this size would have a maximum of 66,978 links, excluding self-citations. In other words, the co-citation patterns amounted to about 91 % of the maximum possible connectivity. The MST solution selected the 366 strongest links, producing a much-simplified picture of the patterns.

Fig. 5.16 A minimum spanning tree solution of the author co-citation network based on the ACM Hypertext dataset (Nodes = 367, Links = 366)
MST provides explicit links to display a more detailed picture of the underlying
network. If the network contains equally weighted edges, there can be multiple MSTs and one must arbitrarily choose one of them. However, an arbitrarily chosen MST destroys the semantic
integrity of the original network because the selection of an MST is not based on
semantic judgments. Pathfinder network scaling resolves this problem by preserving
the semantic integrity of the original network. When geodesic distances are used, a
Pathfinder network is the set union of all possible MSTs. Pathfinder selects links by
ensuring that selected links do not violate the triangle inequality condition.
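The selection rule can be stated algorithmically. The sketch below implements the q = N − 1, r = infinity variant of Pathfinder on a toy distance matrix: a Floyd–Warshall-style pass computes, for every pair of nodes, the smallest possible "longest leg" over all connecting paths, and a direct link survives only if no alternative path beats it.

```python
# Pathfinder scaling, q = N - 1, r = infinity, on a toy distance matrix:
# keep a link only if its direct distance does not exceed the minimax
# distance between its endpoints (i.e. it violates no triangle inequality).
import numpy as np

d = np.array([[0.0, 1.0, 2.5],
              [1.0, 0.0, 1.2],
              [2.5, 1.2, 0.0]])   # pairwise distances (smaller = stronger tie)
n = len(d)

# Floyd-Warshall variant: minimax[i, j] = smallest achievable maximum leg
# over all paths from i to j.
minimax = d.copy()
for k in range(n):
    for i in range(n):
        for j in range(n):
            via_k = max(minimax[i, k], minimax[k, j])
            if via_k < minimax[i, j]:
                minimax[i, j] = via_k

pf_edges = [(i, j) for i in range(n) for j in range(i + 1, n)
            if d[i, j] <= minimax[i, j]]
print(pf_edges)   # -> [(0, 1), (1, 2)]: the 0-2 link (2.5) is pruned, since
                  #    the path 0-1-2 has a longest leg of only 1.2
```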
Figure 5.17 is a Pathfinder network solution of the same author co-citation matrix. Red circles mark the extra links in comparison to an MST solution. A total of 398 links were included in the network – 32 more links than in its MST counterpart. An MST would reject these extra links because they form cycles, but using cycle formation alone as a link-selection criterion may overlook potentially important links.
In order to incorporate multiple aspects of author co-citation networks, we
emphasize the significance of the following aspects of ACA (See Fig. 5.18):
• Represent an author co-citation network as a Pathfinder network;
• Determine specialty memberships directly from the co-citation matrix using
PCA;
• Depict citation counts as segmented bars, corresponding to citation counts over
several consecutive years.
Fig. 5.17 The author co-citation network of the ACM Hypertext data in a Pathfinder network (Nodes = 367, Links = 398)

Fig. 5.18 The procedure of co-citation analysis as described in Chen and Paul (2001)

The results from the three sources, namely, Pathfinder network scaling, PCA, and
annual citation counts, are triangulated to provide the maximum clarity. Figure 5.19
shows an author co-citation map produced by this method. This is an author
co-citation map of 367 authors in hypertext (1989–1998). PCA identified 39 factors,
which corresponded to 39 specialties in the field of hypertext. Authors were colored by their factor loadings on the three largest specialties: the strongest specialty was colored in red, and the next two strongest were in green and blue, respectively. The strongest specialty branches out from the top of the ring structure, whereas the second strongest specialty concentrates around the lower left-hand corner of the ring.

Fig. 5.19 A Pathfinder network showing an author co-citation structure of 367 authors in hypertext research (1989–1998). The color of a node indicates its specialty membership identified by PCA: red for the most predominant specialty, green the second, and blue the third (© 1999 IEEE)
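The coloring step can be sketched as follows, under randomly generated co-citation counts; PCA component scores stand in here for the factor loadings used in the original study.

```python
# Color authors by their presence in the three largest components of the
# co-citation matrix: scores on components 1-3 become R, G, B intensities.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cocit = rng.poisson(3.0, size=(12, 12)).astype(float)
cocit = (cocit + cocit.T) / 2          # symmetrize the toy counts

scores = PCA(n_components=3).fit_transform(cocit)
# Rescale each component's scores to [0, 1] to use them as RGB channels.
rgb = (scores - scores.min(axis=0)) / np.ptp(scores, axis=0)
print(rgb[0])    # one RGB triple per author
```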
The colored PCA overlay allows us to compare structural positions of authors
and their presence in the three major specialties. Partitioning the network by
color provides a unique and informative alternative to traditional non-overlapping
partitions based on clustering and other mutually exclusive partition schemes. Less
restricted partition schemes are most appropriate when we deal with invisible
colleges – identifying the specialty membership of a scientist is rarely clear-cut. After all,
giants in scientific frontiers may well appear simultaneously in several specialties.
Figure 5.20 shows a landscape view of the same author co-citation network enhanced by the citation history of each author. The most cited authors became landmarks in the scene. The shape of the invisible college associated with this field of study began to emerge.
We explored two types of animations: animations that display the distributions of
specialties and animations that display the growth of citation bars in the landscape.
We chose to keep the underlying co-citation network constant, which serves as a
base map, and let the citation profiles grow. In effect, we have a growing thematic
overlay within a static reference framework.
Applying Pathfinder network scaling to co-citation networks not only enriched the applications of Pathfinder networks, but also led to deeper insights into the nature of Pathfinder network scaling and into how to interpret the patterns that emerge from such representations. Now we can systematically explain the meaning of a co-citation network. For example, documents or authors in the center of a relatively fully connected area tend to be more generic and generally applicable, whereas those located in peripheral areas of the Pathfinder network tend to represent more specific topics.

Fig. 5.20 A landscape view of the hypertext author co-citation network (1989–1998). The height of each vertical bar represents the periodical citation index for each author (© 1999 IEEE)

5.4 HistCite

HistCite is a widely known example of the algorithmic historiography advocated by Eugene
Garfield for decades. HistCite is designed to depict citation connections between
scientific articles over time. It takes bibliographic records from the Web of Science
and generates a variety of tables and historiographic diagrams. In HistCite, the
number of citations of a reference in the entire Web of Science is called Global
Citation Score (GCS), whereas the number of citations of a reference made by
a given set of bibliographic records, also known as a collection, is called Local
Citation Score (LCS). Garfield has maintained a series of analyses using HistCite
on the web.1
In a historiography, published articles are organized according to the time of
their publication. Articles published in the earliest years are placed on the top of
the diagram, whereas more recent articles appear lower in the diagram. If article

1 http://garfield.library.upenn.edu/histcomp/
A cites article B, then the connection is depicted by a directed line from A to B. HistCite depicts how often an article has been cited by making the size of a node proportional to its number of citations.

Fig. 5.21 An annotated historiograph of co-citation research (Courtesy of Eugene Garfield; the original diagram can be found at: http://garfield.library.upenn.edu/histcomp/cocitation_small-griffith/graph/2.html)
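The underlying data structure is easy to sketch with invented records: a historiograph is a directed graph ordered by publication year, and a node's LCS is simply the number of citations it receives from within the collection.

```python
# A HistCite-style historiograph as a directed graph: nodes carry years,
# edges point from citing to cited, and node size would follow the LCS.
import networkx as nx

papers = {"Small1973": 1973, "Small1974": 1974, "White1981": 1981}
cites = [("Small1974", "Small1973"), ("White1981", "Small1973"),
         ("White1981", "Small1974")]

G = nx.DiGraph()
for p, year in papers.items():
    G.add_node(p, year=year)
G.add_edges_from(cites)                 # A cites B: directed line from A to B

lcs = {p: G.in_degree(p) for p in G}    # citations received in the collection
for p in sorted(G, key=lambda p: G.nodes[p]["year"]):
    print(f"{G.nodes[p]['year']}  {p:10s}  LCS = {lcs[p]}")
```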
The historiograph in Fig. 5.21 illustrates what information could be conveyed
by such diagrams. The diagram is a co-citation historiograph generated by Eugene
Garfield on HistCite’s website.2 According to Garfield, the dataset, or collection,
consists of papers that either have the words csocitation or co-citation in the titles or
cite one of three articles identified below:

2 http://garfield.library.upenn.edu/histcomp/cocitation_small-griffith/graph/2.html
• Small H., 1973, JASIS, 24(4), 265–269.
• Small H., 1974, Science Studies, 4(1), 17–40.
• Griffith BC, 1974, Science Studies, 4(4), 339–365.
To make it easier to read, I manually annotated the diagram by labeling nodes
with the lead author and a short phrase. The diagram makes it clear that the
co-citation research was pioneered by Henry Small in an article published in the
Journal of the American Society for Information Science (JASIS) in 1973. His
article cited 20 articles, including Garfield's 1955 paper in Science, which laid down the foundation of citation analysis, and Kessler's 1963 paper, which introduced the concept of bibliographic coupling.
In 1974, Small and Griffith further consolidated the conceptual foundation of
co-citation analysis. Garfield’s article on citation classics also appeared in the same
year. Three years later, Small deepened the co-citation methodology further with a
longitudinal cocitation study of collagen research. In the meantime, Moravcsik and
Murugesan studied function and quality of citations. Gilbert examined the role of
citations in persuasion. Small crystallized the notion of cited references as concept
symbols. The first major review article appeared in Library Trends in 1981, written
by Linda Smith.
Author cocitation analysis (ACA) was first introduced in 1981 by Howard White
and Belver Griffith. Generally speaking, a co-citation network of authors tends to
be denser than a co-citation network of references, or document cocitation analysis
(DCA), as proposed by Small. ACA can be seen as an aggregated form of cocitation
analysis because different articles by the same author would be aggregated to the
name of the author in ACA. On the one hand, such aggregation may simplify
the overall complexity of the structure of a specialty. On the other hand, such
aggregation may also lose the necessary information to differentiate the works by
the same author. Considering that it is quite common for a scholar to change his or her research interests from time to time, one may argue that it would be more informative to keep the distinct works of the same author separate instead of lumping them all together. The co-word analysis method was introduced by Callon
et al. in 1983. In 1985, Brooks investigated what motivated citers, while Small
further advanced the method for cocitation analysis with specific focus on the role
of cocitations as a clustering mechanism.
In 1987, Swanson’s work merged, leading to research in literature-based dis-
covery. His 1987 paper was followed by two more papers in 1990 and 1997,
respectively. In the meantime, Howard White and Katherine McCain reviewed the
state of the art of biometrics with their special focus on authors as opposed to other
units of analysis. In 1998, White and McCain presented a comprehensive ACA of
information science and mapped the results in multidimensional scaling (MDS). In
1999, we introduced Pathfinder network scaling to the analysis of author cocitation
networks. In 2006, the publication of CiteSpace II marked a streamlined analytic
platform for cocitation studies. In 2008, the most recent addition to the landscape
of cocitation literature was Martin Rosvall’s work that models information flows in
networks in terms of random walks.
5.5 Patent Co-Citations

Patent analysis has a long history in information science, but recently there has been a surge of interest from the commercial sector. Numerous newly formed companies are aiming specifically at the patent analysis market. Apart from historical driving forces such as monitoring knowledge and technology transfer and staying competitive, the rising commercial interest in patent analysis is partly due to publicly accessible patent databases, notably the huge number of patent applications and grants from the United States Patent and Trademark Office (USPTO). The public
can search patents and trademarks at USPTO’s website http://www.uspto.gov/
and download bibliographic data from ftp://ftp.uspto.gov/pub/patdata/. Figure 5.22
shows a visualization of a network of 1,726 co-cited patents.
The availability of the abundant patent data, the increasingly widespread aware-
ness of information visualization, and the maturity of search engines on the Web
are among the most influential factors behind the emerging trend of patent analysis.

Fig. 5.22 A minimum spanning tree of a network of 1,726 co-cited patents related to cancer
research
Fig. 5.23 Landscapes of patent class 360 for four 5-year periods. Olympus’s patents are shown
in blue; Sony in green; Hitachi in green; Philips in magenta; IBM in cyan; and Seagate in red
(Reproduced from Figure 1 of Boyack et al. 2000)

Many patent search interfaces allow users to search by specific sections in patent
databases, for example by claims. Statistical analysis and intuitive visualization functions are by far the most commonly seen selling points in patent analysis sales portfolios. The term visualization has become so fashionable in the patent analysis industry that from time to time we come across visualization software tools
that turn out to be little more than standard displays of statistics.
A particularly interesting example is from Sandia National Laboratories. Kevin
Boyack and his colleagues (2000) used their landscape-like visualization tool
VxInsight to analyze the patent bibliographic files from USPTO in order to answer
a number of questions. For example, where are competitors placing their efforts?
Who is citing our patents, and what types of things have they developed? Are there
emerging competitors or collaborators working in related areas? The analysis was
based on 15,782 patents retrieved from a specific primary classification class from
the US Patent database. The primary classification class is class 360 on Dynamic
Magnetic Information Storage or Retrieval. A similarity measure was calculated
using the direct and co-citation link types of Small (1997). Direct citations were
given a weighting five times that of each co-citation link. These patents were
clustered and displayed in a landscape view (See Figs. 5.23 and 5.24).
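The reported weighting scheme is simple enough to state directly; the helper below is an illustrative rendering of it, not Sandia's code.

```python
# Combined patent similarity as reported by Boyack et al. (2000): direct
# citation links are weighted five times as heavily as co-citation links.
def patent_similarity(direct_links: int, cocitation_links: int) -> float:
    return 5.0 * direct_links + 1.0 * cocitation_links

print(patent_similarity(direct_links=1, cocitation_links=3))   # -> 8.0
```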
Fig. 5.24 Map of all patents issued by the US Patent Office in January 2000. Design patents are
shown in magenta; patents granted to universities in green; and IBM’s patents in red (Reproduced
from Figure 5 of Boyack et al. 2000)

5.6 Summary

In this chapter, we have introduced factors that influence the perceived impact of scientific works, such as the Matthew Effect. We focused on two mainstream approaches
to science mapping, namely co-word analysis and co-citation analysis. Within
co-citation analysis, we distinguished document co-citation analysis and author co-
citation analysis. Key techniques used in and developed along with these approaches
were described, although our focus was on the fundamental requirements and
strategies rather than detailed implementations. More fundamental issues were
identified, that is, where should we go next from the global map of a field of
study from 60,000 ft above the ground? The central theme of this chapter is on the
shoulders of giants, which implies that the knowledge of the structure of scientific
frontiers in the immediate past holds the key to a fruitful exploration of human
being’s intellectual assets. Henry Small’s specialty narrative provided an excellent
example to mark the transition from admiring a global map to a more detailed
knowledge acquisition process. We conclude this chapter with a visualization of the
literature of co-citation analysis. The visualization in Fig. 5.25 shows a network of
co-cited references from articles that cited either Henry Small or Belver Griffith, the
two pioneers of the co-citation research. The visualization is generated by CiteSpace
based on citations made between 1973 and 2011. The age of an area of concentration
Fig. 5.25 A visualization of the literature of co-citation analysis

is indicated by the colors of its co-citation links: earlier works are in colder colors, i.e. blue, whereas more recent works are in warmer colors, i.e. orange. The upper half of the network formed first, whereas the lower left area is the youngest.
The network is divided into clusters of co-cited references based on how tightly
they were coupled. Each cluster is automatically labeled with words from the titles
of articles that are responsible for the formation of the cluster. For example, clusters
such as #86 scientific specialty, #76 co-citation indicator, and #67 author co-citation
structure are found in the region dominated by blue. The few clusters in
the middle of the map connect the upper and lower parts, including #21 cocitation
map and #26 information science. Clusters in the lower left areas are relatively new,
including #37 interdisciplinarity and #56 visualization. Technical advances in the
past 10 years have made such visual analytics more accessible than before.
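As a rough illustration of how such labels can be derived, the sketch below ranks the title words of each cluster by TF-IDF against the other clusters; the titles are invented, and this simplified scheme only stands in for CiteSpace's actual term-ranking options.

```python
# Label each cluster with the title term that best distinguishes it from
# the other clusters, using TF-IDF over pooled (invented) titles.
from sklearn.feature_extraction.text import TfidfVectorizer

cluster_docs = [
    "author co-citation structure; co-citation indicator study; "
    "author co-citation networks",                                 # cluster 1
    "interdisciplinarity and visualization; measuring interdisciplinarity; "
    "visualizing interdisciplinary knowledge domains",             # cluster 2
]

vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(cluster_docs).toarray()
terms = vec.get_feature_names_out()
for i, row in enumerate(tfidf, start=1):
    print(f"cluster #{i}: {terms[row.argmax()]}")  # e.g. 'citation', 'interdisciplinarity'
```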
Researchers began to realize that to capture the dynamics of science in action,
science mapping needs to bring in different perspectives and metaphors. Loet
Leydesdorff of the University of Amsterdam argued that evolutionary perspectives are more appropriate for mapping science than the historical perspective commonly taken
by citation analysts (Leydesdorff and Wouters 2000). Leydesdorff suggested that
the metaphor of geometrical mappings of multidimensional spaces is gradually
being superseded by evolutionary metaphors. Animations, movies, and simulations
are replacing snapshots. Science is no longer perceived as a solid body of unified
knowledge in a single cognitive dimension. Instead, science may be better repre-
sented as a network in a multi-dimensional space that develops not only within the
boundaries of this space, but also by co-evolutionary processes creating dimensions
to this space. Now it is time to zoom closer to the map and find trails that can lead
us to the discovery of what happened in some of the most severe and long-lasting
puzzle-solving cases in modern science. In the next chapter, we will focus on the
role of Kuhn’s paradigm shift theory in mapping scientific frontiers.

References

Boyack KW, Wylie BN, Davidson GS, Johnson DK (2000) Analysis of patent databases using VxInsight (No. SAND2000-2266C). Sandia National Laboratories, Albuquerque
Braam RR, Moed HF, Raan AFJv (1991a) Mapping of science by combined co-citation and word
analysis II: dynamical aspects. J Am Soc Inf Sci 42(4):252–266
Braam RR, Moed HF, Raan AFJv (1991b) Mapping of science by combined co-citation and word
analysis. I: structural aspects. J Am Soc Inf Sci 42(4):233–251
Bush V (1945) As we may think. Atl Mon 176(1):101–108
Callon M, Courtial JP, Turner WA, Bauin S (1983) From translations to problematic networks – an
introduction to co-word analysis. Soc Sci Inf Sur Les Sci Soc 22(2):191–235
Callon M, Law J, Rip A (eds) (1986) Mapping the dynamics of science and technology: sociology
of science in the real world. Macmillan Press, London
Chen C (1997a) Structuring and visualising the WWW with generalised similarity analysis. Paper
presented at the 8th ACM conference on hypertext (Hypertext’97), Southampton, UK, April
1997
Chen C (1997b) Tracking latent domain structures: an integration of pathfinder and latent semantic
analysis. AI Soc 11(1–2):48–62
Chen C (1998) Generalised similarity analysis and pathfinder network scaling. Interact Comput
10(2):107–128
Chen C (1999) Visualising semantic spaces and author co-citation networks in digital libraries. Inf
Process Manag 35(2):401–420
Chen C, Carr L (1999) Trailblazing the literature of hypertext: author co-citation analysis (1989–
1998). Paper presented at the 10th ACM conference on hypertext (Hypertext’99), Darmstadt,
Germany, February 1999
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Crane D (1972) Invisible colleges: diffusion of knowledge in scientific communities. University of
Chicago Press, Chicago
Edge D (1979) Quantitative measures of communication in science: a critical overview. Hist Sci
17:102–134
Garfield E (1955) Citation indexes for science: a new dimension in documentation through association of ideas. Science 122:108–111
Garfield E (1975) The “Obliteration Phenomenon” in science and the advantage of being
obliterated! Curr Content 51(52):5–7
Garfield E (1996) When to cite. Libr Q 66(4):449–458
Garfield E (1998) On the shoulders of giants. Paper presented at the conference on the history and
heritage of science information systems, Pittsburgh, PA, October 24 1998
Garfield E, Small H (1989) Identifying the changing frontiers of science. Paper presented at the conference on innovation: at the crossroads between science & technology
Garfield E, Sher IH, Torpie RJ (1964) The use of citation data in writing the history of science.
Institute for Scientific Information, Philadelphia
Garfield E, Malin MV, Small H (1978) Citation data as science indicators. In: Elkana Y (ed) Toward
a metric of science. Wiley, New York
King J (1987) A review of bibliometric and other science indicators and their role in research
evaluation. J Inf Sci 13(5):261–276
Knorr-Cetina KD (1999) Epistemic cultures: how the sciences make knowledge. Harvard
University Press, Cambridge, MA
Koenig M, Harrell T (1995) Lotka’s law, price’s urn, and electronic publishing. J Am Soc Inf Sci
46(5):386–388
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Kuhn TS (1970) The structure of scientific revolutions, 2nd edn. University of Chicago Press,
Chicago
Lawrence S (2001) Online or invisible? Nature 411(6837):521
Leydesdorff L, Wouters P (2000) Between texts and contexts: advances in theories of citation.
Retrieved June 26 2000, from http://www.chem.uva.nl/sts/loet/citation/rejoin.htm
Lin X (1997) Map displays for information retrieval. J Am Soc Inf Sci 48(1):40–54
Lotka AJ (1926) The frequency distribution of scientific productivity. J Wash Acad Sci 16:317–323
McCain KW (1990) Mapping authors in intellectual space: a technical overview. J Am Soc Inf Sci
41(6):433–443
Merton RK (1965) On the shoulders of giants: a Shandean postscript. University of Chicago Press,
Chicago
Merton RK (1968) The Matthew effect in science. Science 159(3810):56–63
Mullins N, Snizek W, Oehler K (1988) The structural analysis of a scientific paper. In: Raan AFJv
(ed) Handbook of quantitative studies of science & technology. Elsevier Science Publishers,
Amsterdam, pp 85–101
Noyons ECM, van Raan AFJ (1998) Monitoring scientific developments from a dynamic perspec-
tive: self-organized structuring to map neural network research. J Am Soc Inf Sci 49(1):68–81
Price D (1961) Science since Babylon. Yale University Press, New Haven
Price D (1965) Networks of scientific papers. Science 149:510–515
Price D (1976) A general theory of bibliometric and other cumulative advantage processes. J Am
Soc Inf Sci 27:292–306
Sher I, Garfield E (1966) New tools for improving and evaluating the effectiveness of research.
Paper presented at the research program effectiveness, Washington, DC, 27–29 1965
Small H (1973) Co-citation in scientific literature: a new measure of the relationship between
publications. J Am Soc Inf Sci 24:265–269
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small H (1986) The synthesis of specialty narratives from co-citation clusters. J Am Soc Inf Sci
37(3):97–110
Small HS (1988) Book review of Callon et al. Scientometrics 14(1–2):165–168
Small H (1994) A SCI-MAP case study: building a map of AIDS research. Scientometrics
30(1):229–241
Small H (1997) Update on science mapping: creating large document spaces. Scientometrics
38(2):275–293
Small H (1999) On the shoulders of giants. Bull Am Soc Inf Sci 25(2):23–25
Small H, Greenlee E (1980) Citation context analysis and the structure of paradigms. J Doc
36:183–196
Small HG, Griffith BC (1974) The structure of scientific literatures I: identifying and graphing
specialties. Sci Stud 4:17–40
Steinberg SG (1994) The ontogeny of RISC. Intertek 3(5):1–10
White HD (2003) Pathfinder networks and author cocitation analysis: a remapping of paradigmatic
information scientists. J Am Soc Inf Sci Tech 54(5):423–434
White HD, Griffith BC (1981) Author co-citation: a literature measure of intellectual structure.
J Am Soc Inf Sci 32:163–172
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Wise JA, Thomas JJ, Pennock K, Lantrip D, Pottier M, Schur A et al (1995) Visualizing the
non-visual: spatial analysis and interaction with information from text documents. Paper
presented at the IEEE symposium on information visualization’95, Atlanta, Georgia, USA,
30–31 October 1995
Chapter 6
Tracing Competing Paradigms

Paradigms are exemplary scientific achievements.
Thomas Kuhn (1922–1996)

Bibliometrics can show sociological tendencies in knowledge development, but
the interpretation of these tendencies must be based on broader knowledge in
the sociology and philosophy of science. From the point of view of domain
analysis, bibliometrics is only a means to an end and it must be based on a more
comprehensive methodology that addresses the contextual issues at the level of an
entire domain (Hjorland and Albrechtsen 1995).
In this chapter we explain how information visualization can draw upon the
philosophical framework of paradigm shifts and enable scientists to track the devel-
opment of competing paradigms. We include two case studies to illustrate the use of
co-citation analysis and domain visualization techniques: one is on the topic of mass
extinctions in geology and the other is on the search for supermassive black holes
in cosmology. We focus on the identification and the development of a scientific
paradigm, or a sustained cluster of documents or a group of scientists concerning a
specific subject. Furthermore, we intend to provide a historical account of the key issues under debate, so that the reader can appreciate the value of the visualizations in more detail.

6.1 Domain Analysis in Information Science

Hjorland has been a key figure in promoting domain analysis in information science
(Hjorland 1997; Hjorland and Albrechtsen 1995). The unit of domain analysis is
a specialty, a discipline, or a subject matter. In contrast to existing approaches to
domain analysis, Hjorland emphasized the essential role of a social perspective
instead of the more conventional psychological perspective.


Table 6.1 Differences between cognitivism and the domain-specific viewpoint (Hjorland and Albrechtsen 1995)

Cognitivism: Priority is given to the understanding of isolated user needs and intrapsychological analysis. Intermediating between producers and users emphasizes psychological understanding.
The domain-specific view: Priority is given to the understanding of user needs from a social perspective and the functions of information systems in trades or disciplines.

Cognitivism: Focus on the single user. Typically looks at the disciplinary context as a part of the cognitive structure of an individual – if at all.
The domain-specific view: Focus on either one knowledge domain or the comparative study of different knowledge domains. Looks at the single user in the context of the discipline.

Cognitivism: Mainly inspired by artificial intelligence and cognitive psychology.
The domain-specific view: Mainly inspired by knowledge about the information structures in domains, by the sociology of knowledge and the theory of knowledge.

Cognitivism: The psychological theory emphasizes the role of cognitive strategies in performance.
The domain-specific view: The psychological theory emphasizes the interaction among aptitudes, strategies, and knowledge in cognitive performance.

Cognitivism: Central concepts are individual knowledge structures, individual information processing, short- and long-term memory, categorical versus situational classification.
The domain-specific view: Central concepts are scientific and professional communication, documents (including bibliographies), disciplines, subjects, information structures, paradigms, etc.

Cognitivism: Methodology characterized by an individualistic approach.
The domain-specific view: Methodology characterized by a collectivistic approach.

Cognitivism: Methodological individualism has some connection to a general individualistic view, but the difference between cognitivism and the domain-specific view is not a different political perception of the role of information systems, but a different theoretical and methodological approach to the study and optimization of information systems.
The domain-specific view: Methodological collectivism has some connection to a general collectivistic view, but the difference between cognitivism and the domain-specific view is not a different political perception of the role of information systems, but a different theoretical and methodological approach to the study and optimization of information systems.

Cognitivism: Best examples of applications: user interfaces (the outer side of information systems).
The domain-specific view: Best examples of applications: subject representation/classification (the inner side of information systems).

Cognitivism: Implicit theory of knowledge: mainly rationalistic/positivistic, with tendencies toward hermeneutics.
The domain-specific view: Theory of knowledge: scientific realism/forms of social constructivism, with tendencies towards hermeneutics.

Cognitivism: Implicit ontological position: subjective idealism.
The domain-specific view: Ontological position: realism.

Hjorland called his approach the activity-theoretical approach. The traditional approaches focus on individuals as single users of information in terms of their cognitive structures and strategies. The activity-theoretical approach, on the other hand, emphasizes a holistic view of information retrieval issues in a much broader context, so that the needs of a user should always be interpreted in the context of the discipline (See Table 6.1). In this sense, information retrieval is not an isolated
activity; rather, it is part of an ongoing process. The relevance of a retrieved item is linked directly to the substance of a subject matter. This view is in line with the goal of mapping scientific frontiers, that is, to provide a meaningful context in which scientists can explore the body of knowledge as a whole, as opposed to dealing with fragmented pieces of knowledge. Domain visualization underlines the development of a research theme in scientific literature.
Patrick Wilson, the recipient of the 2001 ASIS Award of Merit, regarded the communication problem as one of communication among specialties rather than individuals (Wilson 1993). The main way in which information from outside affects a specialty is by being recognized by the group as being impersonally, objectively relevant. It is the group as a whole that has to be persuaded that the information has an appropriate logical or evidential status.
Tefko Saracevic suggested that the subject knowledge view of relevance is fundamental to all other views of relevance, because subject knowledge is fundamental to the communication of knowledge (Saracevic 1975). The subject literature view of relevance can be built around considerations of the structure of subject literatures. The subject knowledge view of relevance stresses the nature, structure, and extent of the subject knowledge on a topic given by a question. Subject knowledge and subject literature are not the same, but they are obviously related.
The influence from philosophy of science includes Kuhn’s paradigm shift
theory and Thagard’s conceptual revolution theory. In order to track the growth of
knowledge, we build on Bush’s notions of associations and trailblazing information
spaces. A key step in our approach is to find concrete and quantitative measures
of the strength of association between intellectual and conceptual entities. Citation
indexing is another cornerstone of our work. The general acceptance of a theory or
a new piece of evidence associated with a paradigm is one of the most informative
indicators of how well a paradigm is conceived and perceived by peer scientists.
Citation indexing is now a well-established method, which provides this type of
information. Furthermore, citations to publications from subsequently published
works allow analysts to trace the origin of a particular publication and the impact of
an article on a topic.
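To make the notion of association strength concrete, the following minimal sketch (in Python) computes raw and cosine-normalized co-citation counts from a handful of reference lists. The document labels and counts are invented for illustration and do not come from the studies discussed here.

    from itertools import combinations
    from collections import Counter
    import math

    # Each citing paper is represented by the set of references it cites
    # (hypothetical sample data).
    citing_papers = [
        {"Alvarez1980", "Smit1980", "Hildebrand1991"},
        {"Alvarez1980", "Smit1980"},
        {"Alvarez1980", "Raup1984"},
    ]

    citations = Counter()    # how often each reference is cited
    cocitations = Counter()  # how often each pair is cited together

    for refs in citing_papers:
        citations.update(refs)
        for a, b in combinations(sorted(refs), 2):
            cocitations[(a, b)] += 1

    # Cosine-normalized co-citation strength: c(a, b) / sqrt(c(a) * c(b)).
    for (a, b), c_ab in sorted(cocitations.items()):
        strength = c_ab / math.sqrt(citations[a] * citations[b])
        print(f"{a} -- {b}: raw={c_ab}, cosine={strength:.2f}")

Normalization of this kind keeps a pair of moderately cited but tightly coupled documents from being drowned out by the sheer citation volume of landmark papers.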

6.2 A Longitudinal Study of Collagen Research

In a longitudinal study of citation patterns, Henry Small traced the development of research on collagen. Highly cited articles were grouped into clusters by co-citation strengths. Each cluster was represented by a number of contour lines showing the number of times its articles were cited in a particular year. Figure 6.1 shows such annual snapshots.
By examining the patterns of articles moving across inner and outer circles over the years, Small identified characteristics that can be regarded as signs of paradigm shifts. However, not only did the thematic layer change each year, but so did the base map, i.e. the membership of each cluster. The trace of
a particular article was visible so long as it remained within the scope of these diagrams. Once an article moved out of our sight, there would be no way to follow it any further. The chase would be over. A wider field of view would provide more contextual information so that we could follow the trajectory of a rising paradigm as well as a falling one.

Fig. 6.1 Paradigm shift in collagen research (Reproduced from Small 1977)
Researchers have found that thematic maps of geographic information can help to
improve memory for facts and inferences (1994). If people study a geographic map
first and read relevant text later, they can remember more information from the text.
If we visualize the intellectual structure of a knowledge domain, such knowledge
maps may help researchers in a similar way.
Traditionally, a geographic map shows two important types of information: structural and feature information. Structural information helps us to locate individual landmarks on the map and determine spatial relations among them. Feature information refers to the detail, shape, size, color, and other visual properties used to depict particular items on a map. One can distinguish landmarks from one another based on feature information without relying on the structural relations among these landmarks. When people study a map, they first construct a mental image
of the map's general spatial framework and subsequently add the landmarks into the image (1994). Once a mental image is in place, it becomes a powerful tool for retrieving information. The mental image integrates information about individual landmarks into a single, relatively intact piece, which allows rapid and easy access to the embedded landmarks. In addition, the greater the integration of structural and feature information in the image, the more intact the image is. The more intact the image, the more easily landmark information can be located to help retrieve further details.
These findings about thematic maps provide useful design guidelines for information visualization. In a previous study of visualizing a knowledge domain's intellectual structure (Chen and Paul 2001), we developed a four-step procedure to construct a landscape of a knowledge domain based on citation and co-citation data. Our method extracts structural information from a variety of association measures, such as co-citation, co-word, or co-descriptor. The structural information is represented as a Pathfinder network, which essentially consists of the shortest paths connecting the network components. The feature information in our visualization mainly corresponds to citation impact and specialty memberships. The citation impact of an article is depicted by the height of its citation bar. The color of each year's citation bar indicates the recentness of citations. Identifying a landmark in such a knowledge landscape becomes a simple task: a tall citation bar with a large number of segments in bright colors is likely to be a landmark article in the given knowledge domain. In our approach, the membership of a specialty, sometimes also known as a sub-domain or a theme, is colored according to the results of factor analysis.
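For readers who want to see the scaling step spelled out, here is a compact sketch of PFNET(r = infinity, q = n − 1), the Pathfinder variant commonly used for co-citation maps. It keeps a link only if no alternative path offers a smaller maximum link weight; the distance matrix below is a hypothetical set of dissimilarities (e.g., 1 minus normalized co-citation), not data from our studies.

    def pathfinder_inf(dist):
        """PFNET(r=infinity, q=n-1): keep an edge only if no alternative
        path between its endpoints has a smaller maximum link weight
        (a Floyd-Warshall-style minimax computation)."""
        n = len(dist)
        d = [row[:] for row in dist]  # minimax distances, start at direct weights
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    via_k = max(d[i][k], d[k][j])
                    if via_k < d[i][j]:
                        d[i][j] = via_k
        # An edge survives if its direct weight equals the minimax distance.
        return [(i, j, dist[i][j])
                for i in range(n) for j in range(i + 1, n)
                if dist[i][j] <= d[i][j]]

    # Hypothetical dissimilarity matrix for four documents.
    w = [[0, 1, 4, 6],
         [1, 0, 2, 5],
         [4, 2, 0, 3],
         [6, 5, 3, 0]]
    print(pathfinder_inf(w))  # -> [(0, 1, 1), (1, 2, 2), (2, 3, 3)]

With these parameters the surviving links form the union of all minimum spanning trees, which is why the resulting maps are so sparse and legible.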
In the following two case studies, we intend to highlight structural and feature information associated with debates between competing paradigms. We also want to highlight the movement of a paradigm in terms of the movement of landmark articles in the global structure. We focus on matching structural and feature information to what we know about the scientific debates involved. A comprehensive validation with domain experts is a separate topic in its own right.
Kuhn's notion of a scientific paradigm indeed provides a framework for us to match visual-spatial patterns to the movement of an underlying paradigm. If there exists a predominant paradigm within a scientific discipline, citation patterns should reflect this phenomenon, allowing for the usual delay in publication cycles. A predominant paradigm should acquire the most citations, at least over a certain period of time. Citation peaks are likely to become visible in a landscape view. Two competing paradigms would show as twin peaks in the landscape. Furthermore, such clusters should be located towards the center of the domain structure. During a period of normal science, the overall landscape would demonstrate continuous increases in the citations of such clusters. However, if the particular scientific discipline is in crisis, one or more clusters outside the predominant one will rapidly appear on the horizon of the virtual landscape. The phenomenon of a paradigm shift takes place at the moment when the citations of the new clusters of articles overtake those of the original clusters: the peak of the old paradigm drops, while the valley of a new paradigm rises. Figure 6.2 illustrates the relationship between
a predominant paradigm and its citation profile. A paradigm normally shows as a cluster of documents instead of a single, isolated spike. Documents that survived a paradigm shift might well become obliterated.

Fig. 6.2 The curve of a predominant paradigm
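As a toy illustration of the crossover just described, assuming annual citation totals have already been aggregated per cluster, a paradigm-shift candidate year can be read off as the first year the challenger overtakes the incumbent. The numbers below are invented.

    # Annual citation totals for two competing clusters (hypothetical data).
    years = list(range(1981, 1991))
    old_paradigm = [40, 42, 45, 44, 41, 38, 30, 22, 15, 10]
    new_paradigm = [2, 3, 5, 9, 15, 24, 33, 40, 48, 55]

    # Candidate paradigm-shift year: the first year the challenger's
    # citations exceed the incumbent's.
    shift = next((y for y, o, n in zip(years, old_paradigm, new_paradigm)
                  if n > o), None)
    print("Crossover year:", shift)  # -> 1987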

6.3 The Mass Extinction Debates

Five mass extinctions have occurred in the past 570 million years on Earth. Geologists divided this vast time span into eras and periods on the geological timescale (See Table 6.2). The Permian-Triassic extinction 248 million years ago was the greatest of all the mass extinctions. However, the Cretaceous-Tertiary extinction, which wiped out the dinosaurs 65 million years ago within a short period of time along with many other species, has been the most mysterious and hotly debated topic over the last two decades.

6.3.1 The KT Boundary Event

The dinosaurs' extinction occurred at the end of the Mesozoic. Many other organisms became extinct or were greatly reduced in abundance and diversity. Among these were the flying reptiles, sea reptiles, and ichthyosaurs, the last disappearing slightly before the Cretaceous-Tertiary boundary – known as the K-T boundary. Strangely, turtles, crocodilians, lizards, and snakes were not affected, or were affected only slightly. Whatever factor or factors caused it, there was a major, worldwide biotic change at about the end of the Cretaceous. But the extinction of the dinosaurs is by far the best-known change and has been a puzzle to paleontologists, geologists, and biologists for two centuries. Many theories have been offered over the years to explain dinosaur extinction, but few have received serious consideration. Proposed causes have included everything from disease, heat waves and resulting sterility, freezing cold spells, and the rise of egg-eating mammals, to X rays from a supernova exploding nearby. Since the early 1980s, attention has focused on the impact theory proposed by the American geologist Walter Alvarez, his father, the Nobel Prize-winning physicist Luis Alvarez, and their colleagues.

Table 6.2 Timeline of major extinctions

Era         Period          Million years ago   Major extinction
Paleozoic   Cambrian        543
            Ordovician      510
            Silurian        438
            Devonian        408
            Mississippian   360
            Pennsylvanian   320
            Permian         286                 Permian extinction
Mesozoic    Triassic        248
            Jurassic        208
            Cretaceous      144                 K-T extinction
Cenozoic    Tertiary        66
            Quaternary      2
There have been over 80 theories of what caused the extinction of the dinosaurs; the controversy is also known as the KT debate. Paleontologists, geologists, physicists, astronomers, nuclear chemists, and many others have all been involved in this debate (Alvarez 1997). Throughout the 1980s the KT debate was largely between the impact camp and the volcanism camp. The impact camp argued that the KT extinction was due to the impact of a gigantic asteroid or comet, suggesting a catastrophic nature of the KT extinction. The volcanism camp, on the other hand, insisted that the mass extinction was due to massive volcanism over a much longer period of time, implying a gradual nature of the KT event. The impact camp had evidence for the impact of an asteroid or a comet, such as the anomalous iridium, spherules, and shocked quartz in the KT boundary layer, whereas the volcanism camp had the Deccan Traps, the remains of a huge volcanic outpouring in India 65 million years ago.
The first thoroughly documented account of the asteroid theory of dinosaur extinction, by the original proponents, can be found in Luis W. Alvarez et al., "Extraterrestrial Cause for the Cretaceous-Tertiary Extinction: Experimental Results and Theoretical Interpretation," Science, 208(4448):1095–1108 (June 6, 1980), a highly technical paper. Popular reviews of the general issue include Dale A. Russell, "The Mass Extinctions of the Late Mesozoic," Scientific American, 246(1):58–65 (January 1982); Steven M. Stanley, "Mass Extinctions in the Ocean," Scientific American, 250(6):64–72 (June 1984); and Rick Gore, "Extinctions," National Geographic, 175(6):662–699 (June 1989).

6.3.1.1 Catastrophism

In their 1980 Science article (Alvarez et al. 1980), Alvarez and his colleagues, a
team of a physicist, a geologist, and two nuclear chemists, proposed an impact
theory to explain what happened in the Cretaceous-Tertiary extinction. In
contrast to the widely held view at the time, especially by paleontologists, the impact
theory suggests that the extinction happened within a much shorter period of time
and that it was caused by an asteroid or a comet.
In the 1970s, Walter Alvarez found a layer of iridium sediment in rocks at the
Cretaceous-Tertiary (K-T) boundary at Gubbio, Italy. Similar discoveries were made
subsequently in Denmark and elsewhere, both in rocks on land and in core samples
drilled from ocean floors. Iridium normally is a rare substance in rocks of the Earth’s
crust (about 0.3 parts per billion). At Gubbio, the iridium concentration was found to be more than 20 times greater than the normal level (6.3 parts per billion), and it was even greater at other sites.
There are only two places where one can find such a high concentration of iridium: the earth's mantle and extraterrestrial objects such as meteors and comets. Scientists could not find other layers of iridium like this above or below the KT boundary. This layer of iridium provided the crucial evidence for the impact theory. However, the impact theory has triggered some of the most intense debates between gradualism and catastrophism. The high iridium concentration did not necessarily rule out a terrestrial source.

6.3.1.2 Gradualism

Gradualists believed that mass extinctions occurred gradually instead of catastrophically. The volcanism camp is the leading representative of gradualism. The volcanism camp had a different explanation of where the iridium layer in the KT boundary came from. They argued that this iridium layer may be the result of a massive volcanic eruption. The Deccan Traps in India were dated to 65 million years ago, which coincided with the KT extinction; the Siberian Traps were dated to 248 million years ago, which coincided with another mass extinction – the Permian-Triassic mass extinction, in which as many as 95 % of species on Earth were wiped out. The huge amount of lava produced by such volcanic eruptions would cause intense climatic and oceanic change worldwide.
Another line of research has been focusing on the periodicity of mass extinctions
based on an observation that in the past there was a major extinction about every 26
million years. The periodicity hypothesis challenged both the impact camp and the volcanism camp to extend the explanatory power of their theories to cover not only the KT extinction but also other mass extinctions, such as the Permian-Triassic event and other major extinctions. Some researchers in the impact camp were indeed searching for theories and evidence that could explain why the Earth could be hit by asteroids or comets every 26 million years.

Fig. 6.3 An artist's illustration of the impact theory: before the impact, seconds to impact, moment of impact, the impact crater, and the impact winter (© Walter Alvarez)
A watershed for the KT impact debate came in 1991, when the Chicxulub crater was identified as the impact site on the Yucatan Peninsula in Mexico (Hildebrand et al. 1991). The Signor-Lipps effect was another milestone for the impact theory. Phil Signor and Jere Lipps demonstrated in 1982 that even for a truly abrupt extinction, the poor fossil record would make it look like a gradual extinction (Signor and Lipps 1982). This work in effect weakened the gradualists' argument.
In 1994, proponents of the impact theory were particularly excited to witness the spectacular scene of the comet Shoemaker-Levy 9 colliding with Jupiter, because events of this type could happen to the Earth and might have happened to the dinosaurs 65 million years ago. The comet's impacts on Jupiter's atmosphere were spectacular and breathtaking. Figure 6.3 shows an artist's impression of the KT impact. Figure 6.4 shows the impact of Shoemaker-Levy 9 on Jupiter in 1994.
In the controversy between the gradualist and catastrophist explanations of the dinosaurs' extinction, one phenomenon might not exclude the other. It was the explanation of the highly concentrated layer of iridium that distinguished the two competing paradigms (See Fig. 6.5).

6.3.2 Mass Extinctions

In this example, we use our approach to visualizing a knowledge domain's intellectual structure based on co-citation patterns (Chen and Paul 2001). We apply this approach to document co-citation analysis. Our aim is to visualize the growth of competing paradigms and to establish the context of that growth.
The source documents were located by searching the Web of Science using the query "mass extinction" within a 20-year citing window between 1981 and 2001. We produced a paradigmatic visualization based on co-citation structures embedded in this set of documents. Figure 6.6 shows four paradigmatic clusters.

Fig. 6.4 Shoemaker-Levy 9 colliding into Jupiter in 1994. Eight impact sites are visible. From left
to right are the E/F complex (barely visible on the edge of the planet), the star-shaped H site, the
impact sites for tiny N, Q1, small Q2, and R, and on the far right limb the D/G complex. The D/G
complex also shows extended haze at the edge of the planet. The features are rapidly evolving on
timescales of days. The smallest features in this image are less than 200 km across. This image is
a color composite from three filters at 9,530, 5,550, and 4,100 Å (Copyright free, image released
into the public domain by NASA)

Fig. 6.5 Interpretations of the key evidence by competing paradigms in the KT debate

Fig. 6.6 A paradigmatic view of the mass extinction debates (1981–2001)

Each is colored by factor loadings obtained from PCA. The KT Impact cluster is in red, implying its predominance in the field. The green color for Periodicity and Gradualism indicates their secondary position in the field. Of course, this classification is based purely on co-citation groupings. Similarly, the blue Permian Extinction zone marks its relative importance in mass extinction research.
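The coloring step itself can be sketched as follows: run a principal component analysis over the co-citation profiles and assign each document to a factor by its loading. The sketch below uses NumPy on an invented six-document co-citation matrix; it illustrates the idea rather than reproducing the actual analysis behind Fig. 6.6.

    import numpy as np

    # Hypothetical symmetric co-citation matrix for six documents
    # (two specialties of three documents each).
    C = np.array([
        [5, 4, 1, 0, 0, 0],
        [4, 6, 2, 0, 1, 0],
        [1, 2, 5, 0, 0, 1],
        [0, 0, 0, 4, 3, 2],
        [0, 1, 0, 3, 5, 3],
        [0, 0, 1, 2, 3, 4],
    ], dtype=float)

    # Principal components of the correlation matrix of co-citation profiles.
    R = np.corrcoef(C)
    eigvals, eigvecs = np.linalg.eigh(R)           # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # strongest components first
    loadings = eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))

    # In this toy example the dominant component contrasts the two
    # specialties, so the sign of the loading separates them; a full
    # analysis would rotate the factors (e.g., varimax) before assigning
    # and coloring memberships.
    for doc in range(len(C)):
        color = "red" if loadings[doc, 0] > 0 else "green"
        print(f"document {doc}: loading {loadings[doc, 0]:+.2f} -> {color}")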

6.3.2.1 The KT Impact Paradigm

This is the most predominant specialty of mass extinction research revealed by the citation landscape. The most highly cited article in the entire network was the one by Luis Alvarez, Walter Alvarez, Frank Asaro, and Helen Michel, published in Science in 1980 (Alvarez et al. 1980). It was this article that laid down the foundation of the impact paradigm.
Alvarez and his colleagues argued that an asteroid hit the earth and the impact
was the direct cause of the KT extinction, and that the discovery of the abnormally
concentrated layer of iridium provided crucial evidence. This is the essence of the
KT impact paradigm. Such layers of iridium were found in deep-sea limestone
exposed in several places, including Italy, Denmark, and New Zealand. The
excessive amount of iridium, found at precisely the time of the Cretaceous-Tertiary
extinctions, ranged from 20 to 160 times higher than the background level.
If the impact theory is correct, then there should be a crater left on the earth.
They estimated that the size of the impact asteroid was about 6 miles (10 km) in
diameter, so the size of the crater must be between 90 and 120 miles (150–200 km)
in diameter. In 1980, scientists only discovered three craters with a diameter of 60
miles (100 km) or more: Sudbury, Vrdefort, and Popigay. The first two were dated
to Precambrian age, which would be too old for the KT impact; the Popigay Crater
in Siberia, dated at only 28.8 million years old, would be too young. Alvarez and his colleagues suggested that there was a 2/3 probability that the impact site was in the ocean. If that were the case, we would not be able to find the crater, because the ocean floor of that age had long gone. Nevertheless, searching for the impact crater had become a crucial line of research. A breakthrough came in 1991 when Alan Hildebrand linked the Chicxulub crater to the KT impact.

Fig. 6.7 The location of the Chicxulub crater
The Chicxulub crater is a 110-mile (180-km) structure, completely buried under the Yucatan Peninsula in Mexico (See Fig. 6.7). In the 1950s, the gravity anomaly of the Chicxulub crater attracted the Mexican national oil company (PEMEX), which was searching for oil fields, but the crater kept a low profile in the mass extinction research community until Alan Hildebrand's discovery.
Hildebrand’s 1991 paper is one of the most highly cited articles in the KT impact
cluster (Hildebrand et al. 1991). Figure 6.8 shows the gravity field and magnetic
field of the Chicxulub crater.
Since the impact theory was conceived, its catastrophism point of view has
received strong resistance, especially from paleontologists who held a gradualism
viewpoint. The impact theory, its interpretations of evidence, and the validity of
evidence have all been under scrutiny.
In Walter Alvarez's book, Gerta Keller was regarded as the number one opponent of the impact theory (Alvarez 1997). Several of Keller's papers appeared in the KT impact cluster, including her 1993 paper, in which she challenged the available evidence of impact-generated tsunami deposits.
The presence of articles from a leading opponent to the impact theory right
in the center of this cluster has led to new insights into visualizing competing
paradigms. Co-citations not only brought supportive articles together into the same
cluster, but also ones that challenged the paradigm. This would be a desirable feature because scientists can access a balanced collection of articles from different perspectives of a debate. Indeed, articles strongly supporting the impact theory, such as Hildebrand's 1991 paper on the Chicxulub crater, and articles questioning the conclusiveness of the available evidence, such as Keller's 1993 paper (Keller 1993), were found in the same cluster. After all, when we debate a topic, we are likely to cite the arguments from both sides.

Fig. 6.8 Chicxulub's gravity field (left) and its magnetic anomaly field (right) (© Mark Pilkington of the Geological Survey of Canada)
The KT impact cluster also included an article labeled as Signor. This is the article by Signor and Lipps on what was later known as the Signor-Lipps effect. The Signor-Lipps effect says that if few fossils were preserved, an abrupt extinction can look like a gradual extinction. Because whether the KT event was a gradual extinction or a catastrophic one is crucial to the debate, the high citation profile of Signor and Lipps' article indicates its significance in this debate.
Table 6.3 shows the most representative articles of the KT impact cluster in terms of their factor loadings. Walter Alvarez in his book (1997) spoke highly of Smit's contribution to the impact theory: Alvarez found the iridium anomaly in Italy, whereas Smit confirmed the iridium anomaly in Spain. Smit's 1980 article in Nature, which topped the list, is located immediately next to the 1980 Science paper by Alvarez et al.; the two articles are connected via a strong Pathfinder network link. The table also includes Glen's 1994 book Mass Extinction Debates.
Articles from the gradualism camp are located between the KT Impact cluster and the Periodicity cluster. Landmark articles in this cluster include ones from Charles Officer, a key opponent of the impact theory. The article by another anti-impact researcher, Dewey McLean, is also in this cluster, but below the 50-citation landmark threshold. McLean proposed that prolonged volcanic eruptions from the Deccan Traps in India were the cause of the KT mass extinction.
Piet Hut’s 1987 Nature article on comet showers, with co-authors such as Alvarez
and Keller, marked a transition from the KT impact paradigm to the periodicity
hypothesis. This article was seeking an explanation of the periodicity of mass
extinctions within the impact paradigm.

Table 6.3 Landmark articles in the top three specialties of mass extinctions (Citations ≥ 50)
Factor loadings Name Year Source Volume Page
KT impact
0.964 Smit J 1980 Nature 285 198
0.918 Hildebrand AR 1991 Geology 19 867
0.917 Keller G 1993 Geology 21 776
0.887 Glen W 1994 Mass Extinction Debates
0.879 Sharpton VL 1992 Nature 359 819
0.877 Alvarez LW 1980 Science 208 1,095
Periodicity
0.898 Patterson C 1987 Nature 330 248
0.873 Raup DM 1986 Science 231 833
0.859 Raup DM 1984 P Natl Acad Sci-Biol 81 801
0.720 Jablonski D 1986 Dynamics Extinction 183
0.679 Benton MJ 1985 Nature 316 811
0.629 Davis M 1984 Nature 308 715
0.608 Jablonski D 1986 Science 231 129
Permian extinction
0.812 Magaritz M 1989 Geology 17 337
0.444 Renne PR 1995 Science 269 1,413
0.436 Stanley SM 1994 Science 266 1,340
0.426 Erwin DH 1994 Nature 367 231
0.425 Wignall PB 1996 Science 272 1,155

6.3.2.2 The Periodicity of Mass Extinctions

The second largest area in the visualization landscape highlights the theme of
the periodicity of mass extinctions. The periodicity frame in Fig. 6.9 shows two
predominant landmarks, both from David Raup and John Sepkoski. The one on the
left is their 1984 article published in the Proceedings of the National Academy of
Sciences of the United States of America – Biological Sciences, entitled Periodicity
of extinctions in the geologic past. They showed a graph of incidences of extinction
of marine families through time, in which peaks coincided with the time of most
major extinction events, and suggested that mass extinctions occurred every 26
million years. The one on the right is their 1982 article in Science, entitled Mass
extinctions in the marine fossil record.
Catastrophism was one of the major beneficiaries of the periodicity paradigm, because only astronomical forces are known to be capable of producing such a precise periodic cycle. There were also hypotheses that attempted to incorporate various terrestrial extinction-making events such as volcanism, global climatic change, and glaciations. There was even a theory that each impact triggered a volcanic plume, but supporting evidence was rather limited. A few landmark articles in the periodicity frame addressed the causes of the periodicity of mass extinctions using the impact paradigm, with the hypothesis that asteroids or comets strike the earth catastrophically every 26 million years.

Fig. 6.9 The periodicity cluster

The initial reaction from the impact camp was that the periodicity hypothesis completely conflicted with the impact theory. What could possibly make asteroids hit the earth at such a pace? The impact paradigm subsequently came up with a hypothesis that an invisible death star would make it possible, but the hypothesis was still essentially theoretical. Landmark articles labeled as Alvarez and Davis in the visualization address such extensions of the impact paradigm.
Since the periodicity hypothesis required a theory that could explain not just one but several mass extinctions, both gradualism and catastrophism considered extending their theories beyond the KT boundary. Patterson and Smith's 1987 Nature article (Patterson and Smith 1987) questioned whether the periodicity really existed. Its high factor loading (0.898) reflected the uniqueness of the work. The landmark article by Davis et al. in Nature has a factor loading of 0.629.

6.3.2.3 The Permian-Triassic Mass Extinction

The third cluster features articles from Erwin, Wignall, and Knoll. Erwin is the leading scientist on the Permian mass extinction, the greatest of all five major mass extinctions. The Permian-Triassic (PT) mass extinction was much more severe than the KT extinction. Because it happened 248 million years ago, it is extremely hard to find evidence in general, and for an impact theory in particular.

Fig. 6.10 A year-by-year animation shows the growing impact of articles in the context of relevant
paradigms. The top-row snapshots show the citations gained by the KT impact articles (center),
whereas the bottom-row snapshots highlight the periodicity cluster (left) and the Permian extinction
cluster (right)

In the KT impact debate, the impact theory eventually emerged as an increasingly predominant paradigm, as opposed to the more traditional gradualist views held by many paleontologists. The study of the PT mass extinction convinced scientists from the impact camp that they should take volcanism more seriously. At the time of the KT boundary, there was a big outpouring of volcanic lava from the Deccan Traps. At the time of the PT boundary, there was the eruption of the largest volcanoes ever – the Siberian Traps.
The 1996 article in Science by Knoll et al. (1996) suggested that the overturn
of anoxic deep oceans during the Late Permian introduced high concentrations of
carbon dioxide into surficial environments. Wignall’s 1996 Science article (Wignall
and Twitchett 1996) was on a similar topic, suggesting anoxic oceans might have
caused the Permian extinction.
Just below the 30-citation threshold in the visualization of the PT cluster was the 1995 Science article by Paul Renne and his colleagues (1995). They argued that the Siberian plume changed the environment and climate, which in turn led to the mass extinction. It was believed that 2–3 million cubic kilometers of Siberian flood volcanism erupted in less than a million years. Erwin's 1994 article in Nature (Erwin 1994) is among the most highly cited articles in the Permian cluster. He listed causes such as intense climatic, tectonic, and environmental change.
Figure 6.10 shows a few frames from a year-by-year animation of the growing
impact of articles in different paradigms. The citation skyline indicates that the
volcanism paradigm is one of the pioneering ones in the study of mass extinctions
and that the KT impact paradigm rapidly became the most prevalent paradigm more
recently. The animated growth of citation counts allows us to identify the role of a
particular landmark article in a broad context of the mass extinction debate. The co-
citation network provides a powerful context for us to understand the implications
of rises and falls of paradigms. In Fig. 6.11, we outline the citation profiles of the
three major clusters.

Fig. 6.11 Citation peaks of three clusters of articles indicate potential paradigms

To verify the major paradigms identified in our visualization and animation, we located a book written by Walter Alvarez, one of the leading figures of the impact paradigm. Alvarez described the origin of the impact paradigm, its development, and how the advances of the paradigm were driven by the search for crucial evidence (Alvarez 1997). We compared what our visualization showed with what was described in the book and indeed found a substantial level of consistency between the two, especially regarding the KT impact paradigm.
Henry Small in his longitudinal study of collagen research included a
questionnaire-based validation process (Small 1977). He sent questionnaires to
researchers in the field and asked them to describe major rapid changes of focus
in the subject domain. We are currently collecting comments in the form of
questionnaires to evaluate the groupings generated from co-citation patterns. We
are asking domain experts to identify their “nearest neighbors” in terms of research
specialties. The initial feedback revealed some insights into perceived specialties.
We will report the results in the near future.
The study of asteroids in mass extinctions has raised the question of how often such an impact can happen to the earth. According to NASA's estimates, about 80–90 % of asteroids approaching the earth are not under any surveillance, and some of them are potentially catastrophic if the earth is in their trajectories. More telescopes should turn to the sky and join the search. The topic of our next case study is not the search for asteroids, but something of a much wider impact at the galactic level – the search for supermassive black holes.

6.4 Supermassive Black Holes

A large number of galaxies have extremely bright galactic centers. These luminous
nuclei of galaxies are known as quasars. Astronomers and cosmologists have long
suspected that black holes are the source of this power. The concept of a black hole is derived from Einstein's general relativity. Recent evidence indicated the existence of supermassive black holes at the centers of most galaxies (Richstone et al. 1998).
In the mass extinction case, searching for conclusive evidence had forged some of the most significant developments for each competing paradigm. Because those extinction events happened at least tens of millions of years ago, it is a real challenge to establish what really happened. In our second case study, astronomers faced a similar challenge. Black holes are by definition invisible. Searching for evidence that can support theories about the formation of galaxies and the universe has been a central line of research concerning supermassive black holes. We apply the same visualization method to the dynamics of citation patterns associated with this topic. BBC2 broadcast a 50-min TV program on supermassive black holes in 2000. The transcripts are available on the Internet.1

6.4.1 The Active Galactic Nuclei Paradigm

In astronomy, active galactic nuclei (AGN) refers to several extraordinary phenomena, including quasars, Seyfert galaxies, and radio galaxies. In 1943, Carl Seyfert published a catalogue of strange galaxies that have bright objects at their centers and peculiar spectra. Seyfert galaxies have very bright nuclei with strong emission lines of hydrogen and other common elements, showing velocities of hundreds or thousands of kilometers per second.
The fundamental question that concerns astronomers is: what is powering these AGN? A number of theories have been proposed, including starbursts, giant pulsars, and supermassive black holes. In 1971, Martin Rees and Donald Lynden-Bell were among the first to propose that there must be a supermassive black hole hiding in the galactic center. A supermassive black hole typically weighs between 10⁶ and 10⁹ times the mass of the sun in our solar system. This paradigm for what powers high-energy active galactic nuclei is now known as the AGN paradigm (Ho and Kormendy 2000). It is well established through observations and theoretical arguments.
The AGN paradigm has offered the simplest and most consistent explanation so far. On the other hand, new evidence around the corner may overturn this paradigm completely, as Kuhn's theory would predict. According to Kormendy and Richstone (1995), among others, Terlevich, Filippenko, and Heckman made some of the strongest arguments against the AGN paradigm. By 2000, as highlighted in Kormendy and Ho (2000), the AGN paradigm still had an outstanding problem: there was no dynamical evidence that black holes exist. Searching for conclusive evidence has become a Holy Grail for the AGN paradigm (Ho and Kormendy 2000).

1 http://www.bbc.co.uk/science/horizon/massivebholes.shtml
Kormendy and Richstone (1995) staged the search for black holes in three parts:
1. Look for dynamical evidence for central dark masses with high mass-to-light
ratios. A massive dark object is necessary but not sufficient evidence.
2. Narrow down the plausible explanations among the identified massive dark objects.
3. Derive the mass function and frequency of incidence of black holes in various
types of galaxies.
According to the 1995 review, the status of the search was near the end of stage one (Kormendy and Richstone 1995). Progress in the black hole search comes from improvements in analysis as well as in observations. In 1995, M31, M32, and NGC 3115 were regarded as strong black hole cases (Kormendy and Richstone 1995). By 2000, the most compelling case for a black hole in any galaxy was our Milky Way (Ho and Kormendy 2000). Richstone, Kormendy, and a dozen other astronomers have worked on surveying supermassive black holes. They called themselves the "Nuker team."
In 1997, the Nuker team announced the discovery of three black holes in three normal galaxies. They suggested that nearly all galaxies may have supermassive black holes that once powered quasars but are now dormant. Their conclusion was based on a survey of 27 nearby galaxies carried out by NASA's Hubble Space Telescope (HST) and ground-based telescopes in Hawaii.
Although this picture of active galaxies powered by supermassive black holes
is attractive, skeptics tend to point out that such a concentration of mass can be
explained without the concept of black holes. For example, they suggested that the
mass concentration in M87 could be a cluster of a billion or so dim stars such as
neutron stars or white dwarfs, instead of a supermassive black hole.
Skeptics in this case are in the minority with their attacks on the AGN paradigm.
Even so, the enthusiasts are expected to provide far stronger evidence than they have
managed to date. So what would constitute the definitive evidence for the existence
of a black hole?

6.4.2 The Development of the AGN Paradigm

We apply the same visualization method to reveal the dynamics of citation patterns
associated with the AGN paradigm over the last two decades. We intend to identify
some patterns of how the paradigm has been evolving.
Collecting citation data was straightforward in this case. Since a substantial body
of the astronomy and astrophysics literature is routinely covered by journal publica-
tions, the bibliographic data from Web of Science (WoS) provide a good basis for the
visualization of this particular topic. Citation data were drawn with a complex query on black holes and galaxies (See Table 6.4). The search retrieved 1,416

Table 6.4 Search query used to locate articles for co-citation analysis on black holes (Source: Web of Science)

Topic           (Blackhole* or black hole*) and galax*
Database        SCI Expanded
Language        English
Document type   Article
Time span       1981–2000

* is a wildcard in the search query. For example, both Blackhole and Blackholes would be relevant.

articles in English from the SCI Expanded database dated between 1981 and 2000. All matched the query in at least one of the fields: title, abstract, and keywords. Altogether these articles cited 58,315 publications, written by 58,148 authors.
We conducted both author co-citation analysis (ACA) and document co-citation
analysis (DCA) in order to detect the dynamics of prevailing paradigms. We chose
30 citations as the entry threshold for ACA and 20 citations for DCA. Ultimately,
373 authors and 221 publications were identified. We then generated three models of
the periods: 1981–1990, 1991–1995, and 1996–2000. The co-citation networks were
based on the entire range of citation data (1981–2000). The citation landscape in each period reflects how often each article was cited within a particular sampling window. In this book, we describe only the results of the document co-citation analysis for this case study.
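The selection and time-slicing steps can be sketched in a few lines, assuming the raw data have been flattened into (citing year, cited reference) pairs. The record structure, the sample tuples, and the tiny threshold below are hypothetical stand-ins for the Web of Science data and the 20-citation threshold described above.

    from collections import Counter

    # Hypothetical flattened records: (citing_year, cited_reference_id).
    citation_events = [
        (1984, "LyndenBell1969"), (1993, "Dressler1988"),
        (1997, "Miyoshi1995"), (1999, "Miyoshi1995"),
        # ... one tuple per cited reference per citing article
    ]

    # The study used 20 for DCA; 2 lets the toy data survive the cut.
    ENTRY_THRESHOLD = 2
    periods = [(1981, 1990), (1991, 1995), (1996, 2000)]

    totals = Counter(ref for _, ref in citation_events)
    core = {ref for ref, n in totals.items() if n >= ENTRY_THRESHOLD}

    # Citation counts of the core documents within each sampling window.
    for start, end in periods:
        window = Counter(ref for year, ref in citation_events
                         if start <= year <= end and ref in core)
        print(f"{start}-{end}:", dict(window))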
In document co-citation analysis (DCA), we visualized a co-citation network of
221 top-cited publications. We particularly examined citation profiles in the context
of co-citation structure. Articles with more than 20 citations were automatically
labeled on semi-transparent panels in the scene. These panels always face the viewer. The landscape of the 1981–1990 period is shown as a flat plane – this
landscape obviously pre-dated the existence of the majority of the 221 publications.
The visualization landscape of the period 1991–1995 shows an interesting pattern – three distinct clusters are clearly visible in peripheral areas of the co-citation network. M-31 has been regarded as one of the strongest supportive cases for the AGN paradigm. Alan Dressler and John Kormendy are known for their work within the AGN paradigm. One of the clusters included articles from both of them regarding the evidence for supermassive black holes in M-31. Another cluster is more theoretically oriented, including articles from Martin Rees, who was a pioneer of the theory that giant black holes may provide the power at quasars' energetic centers. In addition, Martin Rees's nearest neighbor in the document co-citation network is Lynden-Bell's article. Lynden-Bell provided the most convincing argument for the AGN paradigm and showed that nuclear reactions alone would have no way to power quasars. The cluster at the far end includes Shakura's article on black holes in binary systems, whereas the large area in the center of the co-citation network remains unpopulated within this period. A useful feature of a Pathfinder network is that the most cited articles tend to locate
in the central area. Once these highly cited articles arrive, they dominate the overall citation profile of the entire co-citation network (See Fig. 6.12).

Fig. 6.12 Supermassive black holes search between 1991 and 1995. The visualization of the document co-citation network is based on co-citation data from 1981 through 2000. Three paradigmatic clusters highlight new evidence (the cluster near to the front) as well as theoretical origins of the AGN paradigm
Citations in the central area remain very quiet, partly because some of the documents located there were either newly published or not yet published. However, the visualization of the third period, 1996–2000, clearly shows dramatic drops in the overall citation profiles of once citation-prosperous clusters in the peripheral areas. Two of the three distinct clusters have hardly been cited. In contrast, citations at the center of the network now become predominant (See Fig. 6.13). Pathfinder-based citation and co-citation visualizations are able to outline the movement of the AGN paradigm in terms of which articles researchers cite during a particular period of time.
The AGN paradigm is prevailing, but conclusive evidence is still missing. Some
astronomers have suggested alternative explanations. For example, could the mass
concentration in M87 be due to a cluster of a billion or so dim stars such as neutron
stars or white dwarfs, instead of supermassive black holes? Opponents of the AGN
paradigm such as Terlevich and colleagues have made strong arguments in their
articles. Some of these articles are located in a remote area towards the far end of
the co-citation network. In order to study how alternative theories had competed
with the AGN paradigm directly, it is necessary to re-focus the visualization so that the AGN paradigm and its competitors are both within the scope of the initial citation data. The current AGN visualization is the first step to help us understand the fundamental works in this paradigm, because we used the terms black holes and galaxies explicitly in data sampling. In the mass extinction case, gradualism and catastrophism debated for more than a decade, from the time the impact theory was first conceived to the identification of the Chicxulub crater. In the supermassive black hole case, the AGN paradigm is so strong that its counterparts were likely to be under-represented in the initial visualizations. This observation highlights an issue concerning the use of such tools. The user may want to start with a simple visualization, learn more about a set of related topics, and gradually expand the coverage of the visualization.

Fig. 6.13 The visualization of the final period of the AGN case study (1996–2000). The cluster near to the front has almost vanished and the cluster to the right has also reduced considerably. In contrast, citations of articles in the center of the co-citation network rocketed, led by two evidence articles published in Nature: one is about NGC-4258 and the other is about MCG-6-30-15
In Fig. 6.13, the visualization of the latest period (1996–2000), the predominant positions of two 1988 evidence articles in the front cluster have been replaced by two 1995 evidence articles. Makoto Miyoshi's team at the National Astronomical Observatory in Japan found evidence supporting the AGN paradigm based on their study of the nearby galaxy NGC-4258. They used a network of radio telescopes
called the Very Long Baseline Array, stretching from Hawaii to Puerto Rico. A few highly cited articles in this period are located in the center of the co-citation network, including a review article and a demographic article on supermassive black holes. According to the three-stage agenda for the study of supermassive black holes (Kormendy and Richstone 1995), a demographic article would correspond to the third stage. The 1998 article by Magorrian and his collaborators is located between the 1995 agenda article in the center and Rees's article to the right.

Fig. 6.14 The rises and falls of citation profiles of 221 articles across three periods of the AGN paradigm
It is clear from Fig. 6.14 that the peaks of citation have moved from one period
to another. There was no paradigm in the first period (1981–1990). In other words,
the core literature on this topic is no more than 10 years old. Three strands of
articles appeared in the second period, suggesting the first generation of theories
and evidence. The fall of two groups of citations in the third period and the rise of
a new landmark article in the center of the co-citation network indicate significant
changes in the field. The visualization of such changes in scientific literature may
provide new insights into scientific frontiers.

6.5 Summary

In this chapter, we have included two case studies. Our visualizations have shown
the potential of the citation-based approach to knowledge discovery and to tracking
scientific paradigms. We do not expect that such visualizations would replace review
articles and surveys carefully made by domain experts. Instead, such visualizations,
if done properly, may lead to a more sensible literature search methodology than
the current fashionable but somewhat piecemeal retrieval-oriented approaches. By
taking into account values perceived by those who have domain expertise, our
generic approach has shown the potential of such visualizations as an alternative
“camera” to take snapshots of scientific frontiers.
We have drawn a great deal of valuable background information from Kormendy and Richstone's article Inward Bound (Kormendy and Richstone 1995). It was this article that dominated the visualization landscape of the latest period. Kuhn later suggested that specialization was more common: instead of killing off a traditional rival line of research immediately, a new branch of research may run in parallel.
The search for supermassive black holes is rapidly advancing. The media are full of news on the latest discoveries. In fact, the latest news announced at the winter 2001 American Astronomical Society meeting suggested that HST and the Chandra X-ray Observatory had found evidence for an event horizon on Cygnus X-1, the first object identified as a black hole candidate. Scientific visualism is increasingly finding its way into modern science.
There are several possible research avenues to further develop this generic
approach to visualizing competing paradigms, for example:
1. Apply this approach to classic paradigm shifts identified by Kuhn and others.
2. Refine the philosophical and sociological foundations of this approach.
3. Combine citation analysis with other modeling and analysis techniques, such as automatic citation context indexing and latent semantic indexing (LSI), so as to provide a more balanced view of scientific frontiers.
4. Extend the scope of applications to a wider range of disciplines.
5. Track the development of the two case studies in the future with follow-up
studies.
6. Track the development of scientific frontiers. Work closely with domain experts
to evaluate and improve science mapping.
In the next chapter, we continue to explore issues concerning mapping scientific
frontiers with special focus on the discovery of latent domain knowledge. How do
scientists detect new and significant developments in knowledge? What does it take for a visualization metaphor to capture and predict the growth of knowledge? How do we match the visualized intellectual structure to what scientists have in their minds?

References

Alvarez W (1997) T. rex and the crater of doom. Vintage Books, New York
Alvarez LW, Alvarez W, Asaro F, Michel HV (1980) Extraterrestrial cause for the Cretaceous-
Tertiary extinction. Science 208(4448):1095–1108
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Erwin DH (1994) The Permo-Triassic extinction. Nature 367:231–236
Hildebrand AR, Penfield GT, Kring DA, Pilkington M, Carmargo ZA, Jacobsen SB et al (1991)
Chicxulub crater: a possible Cretaceous-Tertiary boundary impact crater on the Yucatan
Peninsula, Mexico. Geology 19(9):867–871
Hjorland B (1997) Information seeking and subject representation: an activity-theoretical approach
to information science. Greenwood Press, Westport
Hjorland B, Albrechtsen H (1995) Toward a new horizon in information science: domain analysis.
J Am Soc Inf Sci 46(6):400–425
Ho LC, Kormendy J (2000) Supermassive black holes in active galactic nuclei. In Murdin P
(ed) Encyclopedia of astronomy and astrophysics. Institute of Physics Publishing, Bristol.
http://eaa.crcpress.com/default.asp?action=summary&articleId=2365
Keller G (1993) Is there evidence for Cretaceous-Tertiary boundary age deep-water deposits in the
Caribbean and Gulf of Mexico. Geology 21(9):776–780
Knoll AH, Bambach RK, Canfield DE, Grotzinger JP (1996) Comparative earth history and Late
Permian mass extinction. Science 273(5274):452–457
Kormendy J, Ho LC (2000) Supermassive black holes in inactive galaxies. In: Encyclopedia of
astronomy and astrophysics. Institute of Physics Publishing, Bristol
Kormendy J, Richstone D (1995) Inward bound: the search for supermassive black-holes in galactic
nuclei. Annu Rev Astron Astrophys 33:581–624
Patterson C, Smith AB (1987) Is the periodicity of extinctions a taxonomic artifact? Nature
330(6145):248–251
Renne P, Zhang Z, Richards MA, Black MT, Basu A (1995) Synchrony and causal relations be-
tween Permian-Triassic boundary crises and Siberian flood volcanism. Science 269:1413–1416
Richstone D, Ajhar EA, Bender R, Bower G, Dressler A, Faber SM et al (1998) Supermassive
black holes and the evolution of galaxies. Nature 395(6701):A14–A19
Saracevic T (1975) Relevance: a review of and a framework for the thinking on the notion in
information science. J Am Soc Inf Sci 26:321–343
Signor PW, Lipps JH (1982) Sampling bias, gradual extinction patterns, and catastrophes in the
fossil record. Geol Soc Am Spec Pap 190:291–296
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Small H (1994) A SCI-MAP case study: building a map of AIDS research. Scientometrics
30(1):229–241
Wignall PB, Twitchett RJ (1996) Oceanic anoxia and the end Permian mass extinction. Science
272:1155–1158
Wilson P (1993) Communication efficiency in research and development. J Am Soc Inf Sci
44:376–382
Chapter 7
Tracking Latent Domain Knowledge

Knowledge is power.
Francis Bacon (1561–1626)

Conventional citation analysis typically focuses on distinctive members of a specialty – the cream of the crop. Landscape visualizations naturally emphasize the peaks rather than the valleys. Obviously, such practices remind us of either the Matthew Effect or the winner-takes-all phenomenon. However, scientific frontiers are constantly changing. We cannot simply ignore the "root" of the crop or the valleys of an intellectual landscape. Today's valleys may become tomorrow's peaks (Fig. 7.1).
In this chapter, we will focus on latent domain knowledge and techniques that may reveal it. Knowledge discovery and data mining commonly rely on finding salient patterns of association in a vast amount of data. Traditional citation analysis of scientific literature draws insights from strong citation patterns. Latent domain knowledge, in contrast to mainstream domain knowledge, often consists of highly relevant but relatively infrequently cited scientific works. Visualizing latent domain knowledge presents a significant challenge to knowledge discovery and quantitative studies of science. We will explore a citation-based knowledge visualization procedure and develop an approach that not only captures knowledge structures from prominent and highly cited works, but also traces latent domain knowledge through low-frequency citation chains, as sketched after the case list below. This chapter consists of three cases:
1. Swanson’s undiscovered public knowledge;
2. A survey of cross-disciplinary applications of Pathfinder networks; and
3. An investigation of the current status of scientific inquiry into a possible link between BSE, also known as mad cow disease, and vCJD, a type of brain disease in humans.


Fig. 7.1 An evolving landscape of research pertinent to BSE and CJD. The next hot topic may
emerge in an area that is currently not populated

7.1 Mainstream and Latent Streams

There may be many reasons why a particular line of research may fall outside the
body of the mainstream domain knowledge and become latent to a knowledge
domain. In a cross-disciplinary research program, researchers face an entirely
unfamiliar scientific discipline. Tracking the latest development into a different
discipline can be rather challenging. One example of such problems is the cross-
disciplinary use of Pathfinder networks, a structural and procedural modeling
method developed by cognitive psychologists in the 1980s (Schvaneveldt 1990;
Schvaneveldt et al. 1989). Pathfinder is a generic tool that has been adapted by
several fields of study, including some quite different adaptations from its original
cognitive applications. For example, we have adapted Pathfinder network scaling
as an integral component of our generic structuring and visualization framework
(Chen 1999a, b; Chen and Paul 2001). It is a challenging task to track down how
applications of Pathfinder networks have evolved over the past two decades across
a number of apparently unconnected disciplines.
Another type of latent domain knowledge can be explained in terms of
scientific paradigms. Thomas Kuhn (1962) described the development of science as
interleaved phases of normal science and scientific revolutions. A period of normal
science is typically marked by the dominance of an established framework. The
foundations of such frameworks largely remain unchallenged until new discoveries
begin to cast doubts over fundamental issues – science falls into a period of crises.
To resolve such crises, radically new theories are introduced. New theories with
greater explanatory power replace the troubled ones in a revolutionary manner.
Science regains another period of normal science.
Kuhn suggested that a paradigm shift in science should lead to a corresponding
change of citation patterns in the scientific literature; therefore the study of such
patterns may provide indicators of the development of a scientific paradigm. Indeed,
a number of researchers have pursued this line of research since the 1970s. For example,
Henry Small studied the movement of highly cited publications on the topic of
collagen as a means of tracking major paradigm shifts in this particular field
(Small 1977). White and McCain used INDSCAL to depict changes in author
co-citation maps over consecutive periods (White and McCain 1998a). We have
started to investigate how information visualization can help us characterize the
dynamics of scientific paradigms (Chen et al. 2001, 2002). In particular, our focus
is on contemporary puzzle-solving topics in science and medicine: What caused
dinosaurs' mass extinction? Are Bovine Spongiform Encephalopathy (BSE) and
the new variant Creutzfeldt-Jakob Disease (vCJD) connected? What powers active
galactic nuclei – supermassive black holes or something else?
In this chapter, we introduce an approach to visualizing latent domain knowledge.
We demonstrate how one can accommodate latent domain knowledge and the main-
stream domain knowledge within the same visualization framework. We include two
case studies: Pathfinder network applications and theories of Bovine Spongiform
Encephalopathy (BSE), commonly known as mad cow disease. The rest of the
chapter is organized as follows. First, we outline existing work, including citation
analysis, knowledge discovery, and examples. We then extend our domain visual-
ization approach to visualize latent domain knowledge. We apply this approach to
two cases in which visualizing latent domain knowledge is involved: (1) tracing
applications of Pathfinder networks and (2) connecting a controversial theory of
BSE, mad cow disease, to the mainstream intellectual structure of research in BSE.

7.2 Knowledge Discovery

The advances of information visualization have revived interest in a number of
challenging issues concerning knowledge tracking. Here we contrast two strands of
research: the citation-based paradigm of knowledge discovery and the undiscovered
public knowledge approach. The key prerequisite for the citation-based paradigm is
a target scientific literature that is rich in citations, whereas the undiscovered public
knowledge approach deals with exactly the opposite situation, in which citation links
are missing or exceedingly rare. A synergy of the two would lead to a more powerful
tool to facilitate knowledge discovery and knowledge management in general.

Knowledge tracking and technology monitoring tools have become an
increasingly important part of knowledge management. The rapid advances of
information visualization in the past few years have highlighted its great potential
in knowledge discovery and data mining (Chen 2002; Chen and Paul 2001).
In Chap. 6, we have studied a few examples of competing paradigms with
reference to Thomas Kuhn’s theory on the structure of scientific revolutions
(Kuhn 1962). According to Kuhn’s theory, most of the time scientists are engaged
in normal science, which is dominated by an established framework. The
foundations of such frameworks largely remain unchallenged until new discoveries
begin to cast doubts over fundamental issues – science falls into a period of crises.
To resolve such crises, radically new theories with greater explanatory power are
introduced. New theories replace the ones in trouble in a revolutionary manner.
Science regains another period of normal science. Scientific revolutions are an
integral part of science and such revolutionary changes advance science. We have
investigated the potential role of information visualization in revealing the dynamics
of scientific paradigms, such as scientific debates over dinosaurs' mass extinction
and supermassive black holes (See Chap. 6).

7.2.1 Undiscovered Public Knowledge

In Chap. 5, we mentioned that Donald Swanson was the recipient of the 2000 Award
of Merit from ASIS&T for his work in undiscovered public knowledge. In his
Award of Merit acceptance speech, Swanson (2001) stressed the enormous and
fast-growing gap between the entire body of recorded knowledge and the limited
human capacity to make sense of it. He also pointed out knowledge fragmentation as
a consequence of inadequate cross-specialty communication: in response to the
information explosion, specialties are increasingly divided into more and more
narrowly focused subspecialties.
Swanson has been pursuing his paradigm since 1986 when he began to realize
that there were two sizeable but bibliographically unrelated biomedical literatures:
one on the circulatory effects of dietary fish oil and the other on the peripheral
circulatory disorder – Raynaud’s disease. Prior to Swanson’s research, no medical
researcher had noticed this connection, and the indexing of these two literatures was
unlikely to facilitate the discovery of any such connections.
Swanson’s paradigm focuses on the possibility that information in one specialty
might be of value in another without anyone becoming aware of the fact. Specialized
literatures that do not intercommunicate by citing one another may nonetheless have
many implicit textual interconnections based on meaning. The number of latent,
unintended or implicit, connections within the literature of science may greatly
exceed the number of explicit connections.
Swanson and Smalheiser (1997) defined noninteractive literatures as two lit-
eratures that have not been connected by a significant citation tie. In other
words, scientists in both camps have not recognized the existence of a meaningful

connection between the two literatures. A key step in Swanson's methodology is
the identification of the two premises A → B and B → C. In a large knowledge
domain, identifying two such premises is like searching for needles in a haystack.
Knowledge visualization aims to capture the structure of a knowledge domain and
increase the chance of finding something useful. Before we turn to issues faced by
domain visualization, let us take a look at Swanson’s approach to the discovery of
neglected knowledge.
Swanson’s paradigm explores connections in the biological world that can be
represented in the following generic form. If we know two premises that can be
expressed as connections of some form A ! B and B ! C, then the question is
whether A ! C holds. In the biological world, this may not be the case. One must
establish the transitivity explicitly. If A ! C does make sense, it will be worth
considering as a hypothesis to be tested by domain experts. Swanson suggests once
information scientists identify such hypotheses they should pass the question to
domain experts, who will handle it accordingly. In particular, in his Award of Merit
acceptance speech, Swanson gave his advice to information scientists (Swanson
2001):
First, information scientists should aim to produce new hypotheses or sugges-
tions – not discoveries. It is the job of lab scientists to test such hypotheses. Real
discoveries should come out of the lab, not the literature.
Second, when information scientists write for publication, subject content should
be limited to reporting factually, or simply quoting, selected passages from scholarly
and reputable literatures of the subject domain. Information scientists’ aim is to
highlight possible implicit links in subject literatures. It is a judgment call by
scientists with subject expertise to decide whether the links are plausible and
persuasive enough to merit testing.
After the successful detective work of identifying a link between fish oil and
Raynaud's syndrome, which was later verified by medical researchers, Swanson was
able to continue his quest and find a few more examples falling into the same pattern,
especially through his collaboration, since 1994, with the neurologist Neil Smalheiser.
By 1998, the number of cases had increased to seven. Arrowsmith is their web-based
software for discovering such links (Swanson 1999). See more details at its homepage:
http://kiwi.uchicago.edu.
Swanson describes three aspects of the context and nature of knowledge frag-
mentation (Swanson 2001):
• There is an enormous and constantly growing gap between the entire body of
recorded knowledge and the limited human capacity to make sense of it.
• Inadequate cross-specialty communication causes knowledge fragmentation. In
response to the information explosion, specialties are increasingly divided into
more and more narrowly focused subspecialties.
• One specialty might not be aware of potentially valuable information in another
specialty. Two specialized literatures may be isolated in terms of explicit citation
links, but they may have implicit, latent connections at the text level.

Table 7.1 Seven discoveries of undiscovered public knowledge, all published in the biomedical
literature

Year  Publication                       A (Potential cause factor)              C (Disease)
1986  Swanson (1986a)                   Fish oil                                Raynaud's syndrome
1988  Swanson (1988)                    Magnesium                               Migraine
1990  Swanson (1990)                    Somatomedin C                           Arginine
1994  Smalheiser and Swanson (1994)     Magnesium deficiency                    Neurologic disease
1996  Smalheiser and Swanson (1996a)    Indomethacin                            Alzheimer's disease
1996  Smalheiser and Swanson (1996b)    Estrogen                                Alzheimer's disease
1998  Smalheiser and Swanson (1998)     Calcium-independent phospholipase A2    Schizophrenia

Swanson has been pursuing his paradigm since 1986 when he found two sizeable
biomedical literatures: one on the circulatory effects of dietary fish oil and the
other on the peripheral circulatory disorder, Raynaud's disease. Swanson noticed
that these two literatures were not bibliographically related: no one from one camp
cited works in the other (Swanson 1986a, b). Meanwhile, he was pondering
a question that apparently no one had asked before: Is there a connection between
dietary fish oil and Raynaud's disease?
Prior to Swanson’s research, no medical researcher had noticed this connection,
and the indexing of these two literatures was unlikely to facilitate the discovery of
any such connections. Swanson’s approach can be represented in a generic form.
Given two premises that A causes B (A → B) and that B causes C (B → C), the
question to ask is whether A causes C (A → C). If the answer is positive, the causal
relation has the transitive property. In the biological world, such transitive properties
may not always hold. Therefore scientists must explicitly establish such tran-
sitivity relationships. Swanson suggests that once information scientists identify such
possibilities, they should recommend them to domain experts for validation (Swanson 2001).
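This A → B → C pattern translates directly into a small piece of code. The following Python sketch is our own illustration, not Swanson's implementation: it assumes the premises have already been mined from two literatures as pairs of terms, and it merely proposes candidates.

def propose_hypotheses(ab_links, bc_links):
    # ab_links and bc_links are sets of (x, y) pairs standing for the
    # premises A -> B and B -> C. A candidate A -> C hypothesis is one
    # supported by at least one shared intermediate B term; candidates
    # are returned for domain experts to vet, not asserted as discoveries.
    candidates = {}
    for a, b in ab_links:
        for b2, c in bc_links:
            if b == b2 and a != c:
                candidates.setdefault((a, c), set()).add(b)
    return candidates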
Swanson’s approach focuses on the discovery of such hypotheses from the vast
amount of implicit, or latent, connections. Swanson and Smalheiser (1997) defined
the concept of non-interactive literatures. If two literatures have never been cited
together at a notable level, they are non-interactive – scientists have not considered
both literatures together. In the past 15 years, Swanson identified several missing
links of the same pattern, notably migraine and magnesium (Swanson 1988),
and arginine and somatomedin C (Swanson 1990). Since 1994, the collaboration
between the neurologist Neil Smalheiser and Swanson has led to a few more such cases
(Smalheiser and Swanson 1994, 1996a, b). Table 7.1 is a summary of various case
studies. They also made their software Arrowsmith available on the Internet
(Swanson 1999).
Swanson’s approach relies on the identification of the two premises A ! B and
B ! C. In a large knowledge domain, it is crucial for analysts to have sufficient
domain knowledge. Otherwise, to find two such premises is like searching for nee-
dles in a haystack. Knowledge domain visualization (KDViz) can narrow down the
search space and increase the chance of finding a fruitful line of scientific inquiry.

Fig. 7.2 A Venn diagram showing potential links between bibliographically unconnected
literatures (Figure 1 reprinted from Swanson and Smalheiser (1997))

In parallel, Swanson also published his work in the literature of library and
information science, notably (Swanson 1986a, b, 1987, 1988, 1990). The
Venn diagram in Fig. 7.2, adapted from Swanson and Smalheiser (1997), shows sets
of articles, or bodies of literature: the target literature A and the source literature C.
Set A and set C have no articles in common, but they are linked through intermediate
literatures B1, B2, B3, and B4. Undiscovered links between A and C may be
found through these intermediate literatures: there may exist an intermediate
literature Bi such that a particular transitive relation can be established based on
A → Bi and Bi → C.
Figure 7.3 shows a schematic diagram of title-word pathways from a source
literature on the right (C terms), through intermediate title-words (B terms), to title
words of promising target literatures on the left (A terms) (Swanson and Smalheiser
1997). A ranking algorithm ranks discovered A-terms. The more B-pathways an A
term has, the higher it ranks. Term A3, magnesium, is the highest ranked title word.
It has a total of 7 pathways from B-terms. In this way, a pathway from migraine to
magnesium appears to be most promising.
Swanson called this algorithm Procedure I. He also developed what he
called Procedure II, in which titles from literatures A and C are downloaded first
in order to find words and phrases that the two literatures have in common. Common
words and phrases are selected to form the so-called B-list. An output display
is then produced to help the human user compare A-titles and C-titles against
B-terms.
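The gist of the two procedures can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions (whole-word tokenization and a caller-supplied stop list), not Swanson's actual code; ranking an A-term by its number of B-pathways, as in Procedure I, then amounts to counting the shared terms that support it.

def b_list(a_titles, c_titles, stop_words):
    # Collect title words from each literature, intersect them to form
    # the B-list, and report how many titles on each side contain each
    # B-term, mirroring the BC and AB counts shown in Table 7.2.
    def vocab(titles):
        return {w for t in titles for w in t.lower().split()
                if w not in stop_words}
    shared = vocab(a_titles) & vocab(c_titles)
    return {b: (sum(b in t.lower().split() for t in c_titles),
                sum(b in t.lower().split() for t in a_titles))
            for b in shared}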
Figure 7.4 shows B-terms selected by Swanson’s Procedure II for magnesium
and migraine, and for fish-oil and Raynaud’s disease. The two numbers in front of

Fig. 7.3 A schematic diagram, showing the most promising pathway linking migraine in the
source literature to magnesium in the target literatures (C to A3) (Courtesy of http://kiwi.uchicago.edu/)

B-terms are the number of articles within the BC and AB intersections, respectively.
The asterisks mark entries identified in the original studies (Swanson 1986a, 1988).
Table 7.2 lists B-term entries selected by Procedure II.

7.2.2 Visualizing Latent Domain Knowledge

We distinguish mainstream domain knowledge and latent domain knowledge along
two dimensions: relevance and citation. Scientific documents in the literature can be
classified into four categories according to their relevance to the subject domain
and the citations they receive from the scientific literature: mainstream domain
knowledge, which typically consists of documents of high relevance (HR) and high
citations (HC); latent domain knowledge, which typically consists of documents
of high relevance (HR) but low citations (LC); and two categories of documents of
low relevance. Traditional knowledge discovery, such as citation analysis and
domain visualization, focuses on the mainstream domain knowledge (HR + HC).
The focus of latent domain knowledge discovery and visualization is on the category
of HR and LC. We will introduce an approach that can extend the coverage of
knowledge domain visualization from mainstream to latent domain knowledge
(See Fig. 7.5).
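A toy function makes this two-dimensional classification concrete; the relevance and citation thresholds here are hypothetical placeholders rather than values used in our studies.

def classify(relevance, citations, rel_cut=0.5, cite_cut=10):
    # Four categories along the relevance and citation dimensions.
    if relevance >= rel_cut:
        return ("mainstream (HR + HC)" if citations >= cite_cut
                else "latent (HR + LC)")
    return "low relevance" + (" (HC)" if citations >= cite_cut else " (LC)")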

Fig. 7.4 A schematic flowchart of Swanson’s Procedure II (Figure 4 reprinted from Swanson and
Smalheiser (1997), available at http://kiwi.uchicago.edu/webwork/fig4.xbm)

In our earlier work, we developed a four-step procedure for visualizing main-
stream domain knowledge (Chen and Paul 2001). In particular, the procedure
includes the following four steps (a sketch of the network scaling step follows the list):
1. Select highly relevant and highly cited documents from a citation database;
2. Derive citation networks based on the selected population of documents and
simplify citation networks using Pathfinder network scaling;
3. Partition the resultant Pathfinder network according to specialties identified
through Principal Component Analysis;
4. Superimpose the citation history of a document or author over the citation
network.
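The Pathfinder scaling in step 2 keeps a link only if no indirect path offers a lower cost. The following Python sketch of PFNET(r, q = n−1) is a minimal illustration, assuming a symmetric distance matrix; it is not the optimized implementation used in our system.

import numpy as np

def pathfinder_network(dist, r=np.inf):
    # Relax all path costs under the Minkowski r-metric (r = inf reduces
    # to the maximum edge weight along a path), then keep an edge only
    # if no indirect path between its endpoints is cheaper.
    n = dist.shape[0]
    d = dist.astype(float).copy()
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if r == np.inf:
                    via = max(d[i, k], d[k, j])
                else:
                    via = (d[i, k] ** r + d[k, j] ** r) ** (1.0 / r)
                if via < d[i, j]:
                    d[i, j] = via
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i, j] <= d[i, j] + 1e-12]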
Our solution to visualizing latent domain knowledge is built upon this four-step
procedure. Instead of simply applying the procedure to highly relevant and highly
cited documents, we incorporate it into a recursive process particularly
suitable for detecting patterns in highly relevant but sparsely cited documents.
Figure 7.6 illustrates the overall strategy of our approach. This approach has three

Table 7.2 B-term entries selected by Procedure II for magnesium and migraine (left column),
and for fish-oil and Raynaud's disease (right column)

Migraine-magnesium B-list (selected)    Raynaud-fishoil B-list
BC AB B-term                            BC AB B-term
5 3 amine 1 1 angina
3 2 anticonvulsant 2 2 arthritis
5 2 calcium antagonist * 2 5 blood pressure
10 2 calcium channel * 10 5 blood viscosity *
4 1 calcium entry * 6 7 calcium
5 3 catecholamine 12 1 capillary
5 8 diabetes 2 1 collagen
3 3 dopamine 4 2 deformability *
14 2 epilepsy * 1 5 diabetic
5 6 epileptic * 3 1 fibrinolytic
11 11 hemodynamic 1 1 hemolytic uremic syndrome
14 13 histamine 9 2 hypertension
11 3 ht * 1 4 hypertensive
15 4 hydroxytryptamine * 1 1 iga
3 11 hypertension 3 3 infarction
3 2 hypoxia * 1 3 inhibition platelet *
6 3 immunoglobulin 1 5 ischemic
3 7 inflammatory * 8 2 lupus
2 0 ischaemia 1 1 mediterranean
12 3 ischemia 2 1 pgi2
6 1 ischemic 2 13 platelet aggregation *
9 8 muscle contraction 3 14 platelet function *
5 4 olfactory 1 1 polymorphonuclear
14 5 oral contraceptive 10 9 prostacyclin
10 3 paroxysmal 10 25 prostaglandin *
14 5 platelet aggregation * 2 1 prostaglandin i2
4 2 progesterone 1 1 reactivity *
14 4 prolactin 1 1 serotonin *
10 3 prolapse 1 2 thrombotic
12 5 prostaglandin * 6 11 thromboxane *
8 3 reactivity * 1 2 thyroid
16 7 relaxation
10 7 reserpine
8 14 seizure *
11 5 serotonin *
4 4 spasm *
5 2 spreading depression *
7 5 stress *
6 7 tryptophan
4 5 vasospasm *
6 4 verapamil *
Source: Figure 6 in Swanson and Smalheiser (1997)
The asterisks mark entries identified in the original studies (Swanson
1986a, 1988)

Fig. 7.5 Mainstream domain knowledge is typically high in both relevance and citation, whereas
latent domain knowledge can be characterized as high relevance and low citation

Fig. 7.6 The strategy of visualizing latent domain knowledge. The global context is derived from
co-citation networks of highly cited works. An “exit” landmark is chosen from the global context
to serve as the seeding article in the process of domain expansion. The expanded domain consists
of articles connecting to the seeding article by citation chains of no more than two citation links.
Latent domain knowledge is represented through a citation network of these articles

sub-processes. The purpose of the first process is to establish a global context
for subsequent analysis and visualization. Indeed, in this process we apply our
four-step procedure to the mainstream domain knowledge and generate a citation
landscape. The second process is domain expansion, which means that we expand

our field of view from mainstream domain knowledge to latent domain knowledge.
A key component in this domain expansion process is the selection of a so-called
“exit” landmark from the citation landscape. This “exit” landmark will play a
pivotal role in tracking latent knowledge by “pulling” highly relevant but relatively
rarely cited documents into the scene. The “exit” landmark is selected based on
both structural and topical characteristics. Structurally important documents in
the citation landscape include branching points, from which one can reach more
documents along citation paths preserved by the network. Topically important
documents are the ones that are closely related to the subject in question. Ideally,
a good “exit” landmark should be a classic work in a field of study and it can link
to a cluster of closely related documents by citation. We will explain in more detail
through case studies how we choose “exit” landmarks. Once an “exit” landmark is
chosen from the citation landscape, the four-step procedure can be applied again to
all the documents within a citation chain of up to two citation links. The resultant
citation network represents the latent domain knowledge. Finally, we embed this
local structure back into the global context by providing a reference from the “exit”
landmark in the global context to the latent knowledge structure.
In this chapter, we describe how we applied this approach to three case studies,
namely, Swanson’s work, cross-domain applications of Pathfinder network scaling
techniques, and the perceived connection between BSE and vCJD in contemporary
literature. We use the Web of Science, a Web-based interface to citation databases
compiled by the Institute for Scientific Information (ISI). We start with a search in
the Web of Science using some broad search terms in order to generate a global
context for subsequent visualization. For example, in the Pathfinder case, we chose
to use search terms such as knowledge discovery, knowledge acquisition, knowledge
modeling, and Pathfinder. Once the global context is visualized, it is straightforward
to identify an “exit” landmark. In the Pathfinder case, a classic citation of Pathfinder
networks is chosen as an “exit” landmark. This “exit” landmark article serves as the
seed in a citation search within the Web of Science. The citing space of the seeding
article s contains articles that either cite the seeding article directly or cite an article
that in turn cites the article.
C_One-Step(s) = { c | c → s }
C_Two-Step(s) = { c | ∃ c′ : c → c′ ∧ c′ → s }
CitingSpace_Theme(s) = C_One-Step(s) ∪ C_Two-Step(s)
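These definitions translate directly into code. A minimal sketch, assuming the citation data is available as a mapping from each article to the set of articles it cites (a hypothetical structure; the Web of Science itself is queried interactively), is:

def citing_space(cites, s):
    # One-step: articles that cite the seeding article s directly.
    one_step = {c for c, refs in cites.items() if s in refs}
    # Two-step: articles that cite some article that in turn cites s.
    two_step = {c for c, refs in cites.items() if refs & one_step}
    return one_step | two_step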

Such citing spaces may contain articles beyond the boundary of the mainstream
domain knowledge. One can repeatedly apply this method by identifying another
“exit” landmark. Articles connected to the landmark by two-step citation chains are
gathered to represent latent domain knowledge. By using different ways to select
citing articles, we can visualize latent knowledge structures with reference to highly
established and frequently cited knowledge structures. In the following two case
studies, we apply the same spiral methodology to illustrate our approach.
7.3 Swanson’s Impact 239

7.3 Swanson’s Impact

The following example is based on citation records retrieved from the Web of
Science as of 17th April 2001. First, a search was conducted across all databases
between 1981 and 2001, the entire coverage available to the version we accessed.
This search aimed to locate as many of Swanson's articles as possible within these
citation databases. The AUTHOR field for the search was “Swanson DR” and
the ADDRESS field “Chicago”. This search returned 30 records. These 30 articles
served as a seeding set. In the second step, we expanded this initial set of articles
by including articles that have cited at least one article in the seeding set. All the
citations from the expanded set of articles form the population for the subsequent
document co-citation analysis. We applied a threshold of 65 to select top-slice
articles from this all-citation set. A total of 246 articles met this criterion and were
analyzed to form a series of document co-citation maps as snapshots of the impact
of Swanson's work.
Figure 7.7 shows an overview of the document co-citation map. The entire
network is divided into three focused areas, which are colored by factor loadings.

Fig. 7.7 An overview of the document co-citation map. Lit-up articles in the scene are Swanson's
publications. Four of Swanson's articles are embedded in the largest branch – information science,
including information retrieval and citation indexing. A dozen of his articles are gathered in the
green specialty – the second largest grouping, ranging from scientometrics and neurology to artificial
intelligence. The third largest branch – headache and magnesium – contains only one of Swanson's
articles

The largest area, in red, is information science, including information retrieval
and citation indexing. The second largest one, in green, includes scientometrics,
neurology, and artificial intelligence. The third largest area, in blue, contains articles
on headache and magnesium. Swanson’s articles are highlighted with stronger
brightness in the scene. A dozen of his articles are located in the second area.
A handful of his articles also appear in the first area. The strongest impact
of Swanson’s work, purely according to this map, appears to be in the areas of
artificial intelligence and neurology.
Additional insights into the impact of Swanson's 15-year quest emerge
when we study a 3-dimensional visualization, in which the most highly cited articles
are displayed in the context of the underlying co-citation network. The most highly cited
article in the entire landscape is Swanson's 1988 article in Perspectives in Biology
and Medicine, which identified 11 neglected connections between migraine and
magnesium. This article is located almost right on the boundary between the clinical
medicine literature and the literature of artificial intelligence and neurology. This
unique position and the fact that it has the highest citation count in this data set imply
that this article is the gateway between the two disciplinary literatures. Not only has
Swanson established missing links between concepts in the literature of medical
sciences, he has also made a strong connection between information science and
medical sciences.

7.4 Pathfinder Networks’ Impact

In our earlier research, we incorporated Pathfinder networks into our Generalized
Similarity Analysis (GSA) framework (Chen 1998a, b, 1999b; Chen and Paul
2001; Chen et al. 2001, 2002). Traditionally, a typical application of Pathfinder
networks relies on proximity data judged manually. The number of nodes in a
typical Pathfinder network ranges from 30 to 50, although Pathfinder networks
of 2,000 nodes were reported on one occasion in the 1980s (Schvaneveldt et al.
1989). We introduced a variety of computer-generated proximity measures along
with GSA, including document-document similarity computed based on information
retrieval models, state transition probabilities derived from Web navigation, and
co-citations of authors as well as documents (Chen 1999b) (See Fig. 7.8). These
proximity data have extended the scope of Pathfinder networks to a much wider
variety of phenomena beyond the amount of proximity data one can measure by
hand. This extension has not only enriched the topological properties of Pathfinder
networks but also led to valuable insights into the meaning of Pathfinder networks.
The Pathfinder case study is motivated by the question: How does this extension
fit into the general picture of Pathfinder network applications with reference to
traditional Pathfinder applications?
7.4 Pathfinder Networks’ Impact 241

Fig. 7.8 The procedure of visualizing latent domain knowledge

7.4.1 Mainstream Domain Knowledge

In the Pathfinder case, we focus on cross-domain applications of Pathfinder
networks, especially those non-mainstream applications of Pathfinder networks that
might be overshadowed by mainstream citation peaks.

The global context of the Pathfinder case shown in Fig. 7.9 contains clusters of
articles on knowledge discovery, knowledge acquisition, classification and machine
learning, artificial intelligence, expert systems, and domain knowledge modeling.
Pathfinder related articles are located in the far side of the landscape view, near
the area labels of cognitive psychology and expert systems (See Fig. 7.10). This
indicates that applications of Pathfinder networks are closely related to these two broad
categories. In order to pursue latent knowledge structures associated with Pathfinder
networks, Schvaneveldt's 1985 article was chosen as the first “exit” landmark
because it is located at a point connecting the Pathfinder “peninsula” to other areas
in the landscape.
Table 7.3 lists further details concerning the structure of the global context as
derived from factor analysis. Up to 20 leading articles in each of the three largest

Fig. 7.9 An overview of the mainstream domain knowledge

factors, or specialties, are listed. In essence, factor one corresponds to research in
Pathfinder networks. Factor two corresponds to classic artificial intelligence. Factor
three corresponds to expert systems and decision support systems. The higher a
factor loading, the more typical an article is as a representative member of the
specialty. On the other hand, if an article has a wide impact, then its loadings on
individual factors may not be exceedingly high.
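Loadings of the kind listed in Table 7.3 can be obtained, for example, from a principal component analysis of the correlation matrix of co-citation profiles. The sketch below is our simplification; the exact factor-analysis settings behind the table may differ.

import numpy as np

def pca_loadings(cocite, n_factors=3):
    # Correlate the documents' co-citation profiles, take the leading
    # eigenvectors, and scale them by the square roots of their
    # eigenvalues to obtain factor loadings.
    corr = np.corrcoef(cocite)
    vals, vecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_factors]
    return vecs[:, top] * np.sqrt(vals[top])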

7.4.2 Latent Domain Knowledge

Figure 7.11 shows the latent knowledge structure derived from the citing space of
the “exit” landmark article. This structure is not overshadowed by high citations
7.4 Pathfinder Networks’ Impact 243

Fig. 7.10 A landscape view of the Pathfinder case. Applications of Pathfinder networks are found
in a broader context of knowledge management technologies, such as knowledge acquisition,
knowledge discovery, and artificial intelligence. A majority of Pathfinder network users are
cognitive psychologists

of classic artificial intelligence articles, but it maintains a connecting point with
the global context through the “exit” landmark, which is the highest citation bar
the global context through the “exit” landmark, which is the highest citation bar
half way down in the branch pointing to the lower right corner. This detailed local
structure shows more articles related to the use of Pathfinder.
Similarly, Table 7.4 shows leading articles in this latent knowledge structure. The
classification is more detailed than the one in the global context.
Figure 7.12 shows an extended branch from the main Pathfinder network. This
branch represents a new area of applying Pathfinder networks. In fact, this is the area
in which Pathfinder networks have been adapted for citation-based visualizations.
Table 7.5 reveals that articles in this branch all have negative loadings
on factor one and are virtually absent from the remaining factors. This is interesting
because, on the one hand, the first specialty provides a dimension that can account for
both the traditional applications of Pathfinder networks and the new branch of
applications; on the other hand, since documents in the new branch are so consistently
classified by factor loading, they can be treated as a sub-specialty.
Figure 7.13 shows a simple search function which lights up all the articles by
Schvaneveldt, a central figure in the development of Pathfinder network scaling. The
position of each lit article and the direction of the hosting branch provide insight
into the nature of the article and the branch.

Table 7.3 Leading articles in the three largest specialties ranked by the strength of factor loading.
Specialties: F1 Pathfinder networks; F2 Artificial intelligence; F3 Expert systems

Publication F1 F2 F3
Elstein AS, 1978, Med Problem Solving 0.872
Card SK, 1983, Psychol Human Comput 0.872
Johnsonlaird PN, 1983, Mental Models 0.858
Nisbett RE, 1977, Psychol Rev, v84, p231 0.855
Glaser R, 1988, Nature Expertise, PR15 0.850
Gammack JG, 1985, Res Dev Expert Syste, p105 0.841
Chi MTH, 1981, Cognitive Sci, v5, p121 0.841
Cooke NM, 1986, P IEEE, v74, p1422 0.836
Cooke NM, 1987, Int J Man Mach Stud, v26, p533 0.830
Anderson JR, 1982, Psychol Rev, v89, p369 0.814
Anderson JR, 1987, Psychol Rev, v94, p192 0.813
Mckeithen KB, 1981, Cognitive Psychol, v13, p307 0.811
Chi MTH, 1989, Cognitive Sci, v13, p145 0.810
Anderson JR, 1983, Architecture Cogniti 0.807
Cordingley ES, 1989, Knowledge Elicitatio, p89 0.804
Cooke NJ, 1994, Int J Hum-comput St, v41, p801 0.798
Hoffman RR, 1987, Ai Mag, v8, p53 0.797 0.528
Chase WG, 1973, Cognitive Psychol, v4, p55 0.794
Klein GA, 1989, IEEE T Syst Man Cyb, v19, p462 0.792 0.508
Schvaneveldt RW, 1985, Int J Man Mach Stud, v23, p699 0.789 0.532
Marcus S, 1988, Automating Knowledge 0.951
Musen MA, 1987, Int J Man Mach Stud, v26, p105 0.949
Bennett JS, 1985, J Automated Reasonin, v1, p49 0.947
Clancey WJ, 1989, Mach Learn, v4, p285 0.942
Newell A, 1982, Artif Intell, v18, p87 0.942
Musen MA, 1989, Knowl Acquis, v1, p73 0.941
Clancey WJ, 1985, Artif Intell, v27, p289 0.940
Ford KM, 1993, Int J Intell Syst, v8, p9 0.933
Kahn G, 1985, 9TH P Int Joint C Ar, p581 0.933
Musen MA, 1989, Automated Generation 0.930
Neches R, 1991, Ai Mag, v12, p36 0.929
Marcus S, 1989, Artif Intell, v39, p1 0.926
Chandrasekaran B, 1986, IEEE Expert, v1, p23 0.925
Lenat DB, 1990, Building Large Knowl 0.923
Chandrasekaran B, 1983, Ai Mag, v4, p9 0.921
Davis R, 1982, Knowledge Based Syst 0.920
Davis R, 1979, Artif Intell, v12, p121 0.918
Gruber TR, 1987, Int J Man Mach Stud, v26, p143 0.914
Shadbolt N, 1990, Current Trends Knowl, p313 0.912
Dekleer J, 1984, Artif Intell, v24, p7 0.910
Holland JH, 1986, Induction Processes 0.771
Oleary DE, 1987, Decision Sci, v18, p468 0.713
Waterman DA, 1986, Guide Expert Systems 0.526 0.712
(continued)
7.4 Pathfinder Networks’ Impact 245

Table 7.3 (continued)

Publication F1 F2 F3
Michalski RS, 1980, Int J Man Mach Stud, v12, p63 0.593 0.674
Olson JR, 1987, Expert Syst, v4, p152 −0.668 0.672
Miller GA, 1956, Psychol Rev, v63, p81 0.671
Hart A, 1986, Knowledge Acquisitio −0.640 0.664
Prerau DS, 1990, Dev Managing Expert 0.657
Messier WF, 1988, Manage Sci, v34, p1403 −0.611 0.635
Quinlan JR, 1979, Expert Systems Micro −0.644 0.631
Jackson P, 1990, Intro Expert Systems −0.530 0.627
Johnson PE, 1983, J Med Philos, v8, p77 −0.510 0.612
Boose JH, 1986, Expertise Transfer E −0.578 0.601
Rumelhart DE, 1986, Parallel Distributed −0.575 0.599
Harmon P, 1985, Expert Systems 0.546 0.597
Kim J, 1988, Decision Support Sys, v4, p269 −0.654 0.591
Shaw MLG, 1987, Knowl Acquis, p109 −0.580 0.585
Quinlan JR, 1979, Expert Systems Micro, p168 0.585
Saaty TL, 1980, Anal Hierarchy Proce 0.508 0.580
Michalski R, 1980, Int J Pol Anal Inf S, v4, p125 −0.664 0.571
Absolute values less than 0.500 are suppressed from the table
Factors F1, F2, and F3 define three specialties
The “exit” landmark belongs to the first specialty

Fig. 7.11 This citation map shows that the most prolific themes of Pathfinder network applications
include measuring the structure of expertise, eliciting knowledge, measuring the organization of
memory, and comparing mental models. No threshold is imposed

Table 7.4 Leading articles in the three most prominent specialties ranked by the strength of factor
loading. Specialties: F1 Pathfinder, cognitive psychology; F2 Educational psychology; F3 Knowledge
acquisition

Publication F1 F2 F3
Schvaneveldt RW, 1985, Int J Man Mach 0.916
Stud, v23, p699
Anderson JR, 1983, Architecture Cogniti 0.906
Reitman JS, 1980, Cognitive Psychol, 0.874
v12, p554
Friendly ML, 1977, Cognitive Psychol, 0.861
v9, p188
Mckeithen KB, 1981, Cognitive Psychol, 0.848
v13, p307
Ericsson KA, 1984, Protocol Anal 0.845
Cooke NM, 1987, Int J Man Mach Stud, 0.837
v26, p533
Chi MTH, 1981, Cognitive Sci, v5, p121 0.825
Kruskal JB, 1977, Statistical Methods 0.822
Cooke NM, 1986, P IEEE, v74, p1422 0.822
Hayesroth F, 1983, Building Expert Syst 0.807
Murphy GL, 1984, J Exp Psychol Learn, 0.806
v10, p144
Roskehoestrand RJ, 1986, Ergonomics, 0.803
v29, p1301
Anderson JR, 1982, Psychol Rev, v89, 0.801
p369
Cooke NJ, 1988, Int J Man Mach Stud, 0.800 0.514
v29, p407
Tversky A, 1977, Psychol Rev, v84, p327 0.798
Kelly GA, 1955, Psycol Personal Con 0.790
Butler KA, 1986, Artificial Intellige 0.789
Collins AM, 1969, J Verb Learn Verb Be, 0.784
v8, p240
Schvaneveldt RW, 1985, MCCS859 New 0.777
Mex Stat
Goldsmith TE, 1991, J Educ Psychol, 0.840
v83, p88
Gonzalvo P, 1994, J Educ Psychol, v86, 0.789
p601
Acton WH, 1994, J Educ Psychol, v86, 0.777
p303
Gomez RL, 1996, J Educ Psychol, v88, 0.754
p572
Johnson PJ, 1994, J Educ Psychol, v86, 0.747
p617
Novak JD, 1990, J Res Sci Teach, v27, 0.747
p937
Novak JD, 1984, Learning Learn 0.744
(continued)
7.4 Pathfinder Networks’ Impact 247

Table 7.4 (continued)

Publication F1 F2 F3
Schvaneveldt RW, 1989, Psychol Learn Motiv, p249 −0.744
Fenker RM, 1975, Instr Sci, v4, p33 −0.737
Schvaneveldt RW, 1988, Comput Math Appl, v15, p337 −0.734
Schvaneveldt RW, 1990, Pathfinder Ass Netwo 0.601 −0.726
Wilson JM, 1994, J Res Sci Teach, v31, p1133 −0.734
Arabie P, 1993, Contemp Psychol, v38, p66 −0.720
Preece PFW, 1976, J Educ Psychol, v68, p1 −0.716
Rosch E, 1975, J Expt Psychol Gener, v104, p192 −0.711
Gomez RL, 1996, J Hlth Psychol, v1, p107 −0.710
Gomez RL, 1994, J Exp Psychol Learn, v20, p396 −0.710
Craik KJW, 1943, Nature Explanation −0.706
Canas JJ, 1994, Int J Hum-Comput St, v40, p795 −0.698
Schvaneveldt RW, 1989, Psychol Learn Motiv, v24, p249 −0.696 0.501
Shaw MLG, 1989, Knowl Acquis, v1, p341 0.623
Kitto CM, 1989, Int J Man Mach Stud, v31, p149 0.618
Kitto CM, 1987, P Westex 87 W C Exp, p96 0.571
Sanderson PM, 1994, Human Computer Inter, v9, p251 0.566
Cooke NJ, 1996, Hum-comput Interact, v11, p29 0.560
Cooke NJ, 1992, Int J Man Mach Stud, v37, p721 −0.551 0.517
Walsh JP, 1988, Organ Behav Hum Dec, v42, p194 0.511
Rowe AL, 1996, J Exp Psychol-appl, v2, p31 0.503
Wielinga BJ, 1992, Knowl Acquis, v4, p5 0.503
Absolute values less than 0.500 are suppressed from the table
At least one above-threshold factor loading is required for inclusion in the listing
The first member of the first specialty is the “exit” landmark chosen for domain expansion

Fig. 7.12 This branch represents a new paradigm of incorporating Pathfinder networks into
Generalized Similarity Analysis (GSA), a generic framework for structuring and visualization, and
its applications especially in strengthening traditional citation analysis

7.5 BSE and vCJD

Stanley Prusiner, professor of neurology, virology, and biochemistry at the
University of California San Francisco, published an article in Science (Prusiner
1982) in which he first introduced the concept of prions – infectious proteins.
Prusiner, who won the 1997 Nobel Prize for his discovery of prions, suggested that
an abnormal form of a protein is responsible for diseases such as scrapie in sheep,
Bovine Spongiform Encephalopathy (BSE) in cattle – mad cow disease – and
Creutzfeldt-Jakob disease (CJD) in humans. These diseases are known as
Transmissible Spongiform Encephalopathies (TSE).

7.5.1 Mainstream Domain Knowledge

BSE was first found in 1986 in England. A sponge-like malformation was found
in the brain tissue from affected cattle. It was identified as a new prion disease,
a new TSE disease. The BSE epidemic in Britain reached its peak in 1992 and
has since steadily declined. CJD was first discovered in the 1920s by two German

Table 7.5 Strong negative factor loadings on factor one, suggesting a unique specialty. In these
articles Pathfinder networks are used, but not in any way similar to a typical publication in the
Pathfinder specialty

Publication F1
McCain KW, 1995, J Am Soc Inform Sci, v46, p306 −0.619
Bush V, 1945, Atlantic Monthly, v176, p101 −0.631
Kamada T, 1989, Inform Process Lett, v31, p7 −0.651
Chen CM, 1996, Hum-Comput Interact, v11, p125 −0.652
Conklin J, 1987, IEEE Comput, v20, p17 −0.657
Braam RR, 1991, J Am Soc Inform Sci, v42, p233 −0.661
Marshall C, 1994, P Echt 94 Ed Sept, p13 −0.661
Dillon A, 1996, Int J Hum-Comput St, v45, p619 −0.664
Green SJ, 1998, P 7 Int World Wid We −0.664
Benyon D, 1997, P Hum Comp Int Inter, p39 −0.664
Campagnoni FR, 1989, Acm T Inform Syst, v7, p271 −0.666
McCain KW, 1990, J Am Soc Inform Sci, v41, p433 −0.667
White HD, 1981, J Am Soc Inform Sci, v32, p163 −0.668
Hemmje M, 1994, P 17 Ann Int Acm Sig, p249 −0.670
White HD, 1997, Annu Rev Inform Sci, v32, p99 −0.672
Small H, 1973, J Am Soc Inform Sci, v24, p265 −0.673
Chen C, 1997, New Rev Hypermedia M, v3, p67 −0.675
Vicente KJ, 1988, Int J Man Mach Stud, v29, p647 −0.680
Deerwester S, 1990, J Am Soc Inform Sci, v41, p391 −0.680
Small H, 1999, J Am Soc Inform Sci, v50, p799 −0.682
Chen C, 1998, P 9 Acm C Hyp Hyp Hy, p77 −0.684
Chalmers M, 1992, P 15 Ann Int Acm Sig, p330 −0.688
Chen CM, 1998, J Visual Lang Comput, v9, p267 −0.693
Chen CM, 1998, Interact Comput, v10, p107 −0.695
Salton G, 1983, Intro Modern Informa −0.697
White HD, 1998, J Am Soc Inform Sci, v49, p327 −0.724
Small H, 1997, Scientometrics, v38, p275 −0.724
Hetzler B, 1998, P 5 Int Isko C Struc −0.724
Small H, 1994, Scientometrics, v30, p229 −0.724
Chen HC, 1998, J Am Soc Inform Sci, v49, p582 −0.727
Fox KL, 1999, J Am Soc Inform Sci, v50, p616 −0.736
Chen CM, 1999, Inform Process Manag, v35, p401 −0.743

neurologists. In humans, prion-based diseases include CJD, Kuru (transmitted by
cannibalism), Gerstmann-Sträussler-Scheinker Disease (GSS), and Fatal Familial
Insomnia (FFI). CJD, the principal form of the human Transmissible Spongiform
Encephalopathy (TSE) diseases, is an illness usually found in people aged over 55;
it has no known cause, and patients die about 6 months after diagnosis.
New variant CJD (vCJD) is a previously unrecognized variant of CJD discovered by the
National CJD Surveillance Unit in Edinburgh. vCJD is characterized clinically by
a progressive neuropsychiatric disorder. Neuropathology shows marked spongiform
change throughout the brain. The media reported a growing concern in the general
public that BSE may have passed from cattle to humans.

Fig. 7.13 Schvaneveldtl’s “exit” landmark in the landscape of the thematic visualization

While no definitive link between prion disease in cattle and vCJD in humans
has been proven, the conditions are so similar that most scientists are convinced that
infection by a BSE prion leads to vCJD in humans. The emergence of vCJD came
after the biggest ever epidemic of BSE in cattle. The fact that the epidemic was in
the UK and that most vCJD victims lived in Britain added to the evidence of a link.
A brief timeline of relevant events is shown in Table 7.6.
The British government assured the public that beef was safe, but in 1996 it
announced a possible link between BSE and vCJD. The central question in
this case study is what the scientific literature tells us about the possible link between
BSE and vCJD.
First, we generated a mainstream-driven thematic landscape of the topic of BSE
and CJD by searching the Web of Science with the term “BSE or CJD” (See
Fig. 7.14). The strongest specialty Prion Protein is colored in red; the BSE specialty
is in green; and the CJD specialty is in blue. GSS is in purple next to the prion
protein specialty. In particular, the very light color of the vCJD specialty indicates
that this is an area where other specialties overlap.

Table 7.6 A brief timeline of the BSE crisis in the UK

Year Events
1960s British scientists Tikvah Alper and J. S. Griffith proposed that an infectious
agent lacking nucleic acid could cause scrapie.
1982 American neurologist Stanley Prusiner published his theory of the prion – a new
kind of disease-causing agent.
1986 First BSE case diagnosed in the UK.
1988 The feed ban.
1992 The number of confirmed infections in cattle peaked.
1996 New variant CJD (nvCJD) was identified in the UK.
1996 In March U.K. announced a possible link between BSE, or “mad cow”
disease, which was primarily found in the U.K., and Creutzfeldt-Jakob
disease (CJD), a rare but fatal condition in humans.
1996 The European Commission imposed a ban on exports of British beef and beef
products.
1997 Stanley Prusiner won the Nobel Prize for his discovery of prions.
1999 The European Commission’s ban on exports of British beef and beef
products was lifted.

Fig. 7.14 An overview of 379 articles in the mainstream of BSE and vCJD research

In the Prion specialty, Prusiner's 1982 article in Science and Gajdusek's 1966
article in Nature are located next to each other. Gajdusek received the 1976 Nobel
Prize for his work on kuru, a prion-related brain disease. The Prion specialty also
includes radiation biologist Tikvah Alper's 1967 article in Nature. Alper studied
scrapie in sheep and found that brain tissue remained infectious even after she
subjected it to radiation that would destroy any DNA or RNA. In 1967, J. S. Griffith
of Bedford College, London, suggested in an article published in Nature that an
infectious agent that lacked nucleic acid could cause disease. Griffith suggested in
a separate paper that perhaps a protein, which would usually prefer one folding
pattern, could somehow misfold and then catalyze other proteins to do so. Such an

Fig. 7.15 A year-by-year animation shows the growing impact of research in the connections
between BSE and vCJD. Top-left: 1991–1993; Top-right: 1994–1996; Bottom-left: 1997–1999;
Bottom-right: 2000–2001

idea seemed to threaten the very foundations of molecular biology, which held that
nucleic acids were the only way to transmit information from one generation to the
next.
Fifteen years later, in 1982, Prusiner followed up this idea of self-replication
proposed in the 1960s and described the “proteinaceous infectious particles” as the
cause of scrapie in sheep and hamsters. He suggested that scrapie and a collection of
other wasting brain diseases, some inherited, some infectious, and some sporadic,
were all due to a common process: a misfolded protein that propagates and kills
brain cells.
Prusiner and his colleagues reported in Science in 1982 that they had found an
unusual protein in the brains of scrapie-infected hamsters that did not seem to be
present in healthy animals. Their article, entitled “Novel proteinaceous infectious
particles cause scrapie,” had been cited 941 times by March 2001. A year later,
they identified the protein and called it prion protein (PrP). Prusiner led a series of
experiments, demonstrating that PrP actually is present in healthy animals, but in a
different form from the one found in diseased brains. The studies also showed that
mice lacking PrP are resistant to prion diseases. Taken together, the results have
convinced many scientists that the protein is indeed the agent behind CJD, scrapie,
mad cow disease, and others. Figure 7.15 shows four frames from an animation
sequence of the year-by-year citation growth. Figure 7.16 shows the following four
most cited articles over the period of 1995–2000.
• Will, R. G., Ironside, J. W., Zeidler, M., Cousens, S. N., Estibeiro, K.,
Alperovitch, A., Poser, S., Pocchiari, M., Hofman, A., & Smith, P. G. (1996).
A new variant of Creutzfeldt-Jakob disease in the UK. Lancet, 347, 921–925.

Fig. 7.16 Articles cited more than 50 times during this period are labeled. Articles labeled 1–3
directly address the BSE-CJD connection. Article 4 is Prusiner's original article on prions,
which has broad implications for brain diseases in sheep, cattle, and humans

• Collinge, J., Sidle, K., Meads, J., Ironside, J., & Hill, A. (1996). Molecular
analysis of prion strain variation and the aetiology of ‘new variant’ CJD. Nature,
383, 685–691.
• Bruce, M. E., Will, R. G., Ironside, J. W., McConnell, I., Drummond, D., Suttie,
A., McCardle, L., Chree, A., Hope, J., Birkett, C., Cousens, S., Fraser, H., &
Bostock, C. J. (1997). Transmissions to mice indicate that ‘new variant’ CJD is
caused by the BSE agent. Nature, 389(6650), 498–501.
• Prusiner, S. B. (1982). Novel Proteinaceous Infectious Particles Cause Scrapie.
Science, 216(4542), 136–144.
Research by Moira Bruce at the Neuropathogenesis Unit in Edinburgh has
confirmed that sheep can produce a range of prion particles, but finding the one
that causes BSE has so far eluded researchers. There is no evidence that people
can catch BSE directly from eating sheep, but most research has focused on cattle, so
the possibility cannot be ruled out. Such a discovery would also devastate consumer
confidence.
According to Bruce et al. “Twenty cases of a clinically and pathologically
atypical form of Creutzfeldt-Jakob disease (CJD), referred to as ‘new variant’ CJD
(vCJD), have been recognized in unusually young people in the United Kingdom,
and a further case has been reported in France. This has raised serious concerns that
BSE may have spread to humans, putatively by dietary exposure.”

7.5.2 The Manganese-Copper Hypothesis

The mainstream view on BSE has focused on the food chain: cows got BSE by
eating feed made from sheep infected with scrapie, and, similarly, humans get vCJD
by eating BSE-infected beef. However, Mark Purdey, a British organic dairy farmer,
believed that unbalanced manganese and copper levels in the brain are the real cause of
BSE and vCJD (Stourton 2001). He studied the environment in areas where
spongiform diseases had been found, such as Colorado in the United States, Iceland, Italy,
and Slovakia. He found a high level of manganese and low levels of copper in all of
them.
Purdey’s research on the manganese-copper hypothesis shows the sign of latent
domain knowledge. He has published in scientific journals, but they are not highly
cited by other researchers. We need to find a gateway from which we can expand
the global landscape of mainstream research in BSE and vCJD and place Purdey’s
research into the big picture of this issue. Recall that we need to find an “exit”
landmark in the global landscape to conduct the domain expansion, but none of
Purdey’s publications was featured in the scene. To solve this problem, we need to
find someone who is active in the manganese-copper paradigm and also included in
the mainstream visualization view.
David R. Brown, a biochemist at Cambridge University, is among the scientists
who did cite Purdey's publications. Brown is a good candidate for an “exit”
landmark. On the one hand, Brown is interested in the role of the manganese-copper
balance in prion diseases (Brown et al. 2000) and he cited Purdey's articles. On
the other hand, he is interested in Prusiner's prion theory and has published about 50
articles on prion diseases. Indeed, two of his articles are featured in the mainstream
visualization view of the case study. We chose his 1997 article published in
Experimental Neurology as the “exit” landmark. Because of the relatively low
citations of Purdey's articles, conventional citation analysis is unlikely to take
them into account. The predominant articles in this cluster all address the possible link
between BSE and vCJD. This observation suggests how Purdey's articles might fit
into the mainstream domain knowledge.
The moral of this story is that citation networks can pull in articles that would
be excluded by conventional citation analysis, so that researchers can explore the
development of a knowledge domain across a wider variety of works. This approach
provides a promising tool for finding weak connections in scientific literature that
would otherwise be overshadowed by those belonging to the cream of the crop.
We have demonstrated that our approach can be successfully applied to find
connections that would otherwise be obscured. The BSE case study has shown that
Purdey's theory feeds into the mainstream research on BSE and CJD through
Brown and his group.

7.6 Summary

Typical citation-based domain visualization approaches have focused on citation
frequencies of high-profile research in a knowledge domain. Consequently, the resul-
tant visualizations are strongly biased towards highly cited works. Although highly
cited works constitute the core knowledge of a domain, their presence inevitably
outshines latent domain knowledge if we measure both with the
same yardstick. The use of two-step citation chains allows us to glean latent
domain knowledge and maintain the global picture of where such latent domain
knowledge fits.
In order to track the development of scientific paradigms, it is necessary to take
into account latent as well as mainstream domain knowledge. By incorporating
an information visualization procedure originally developed for visualizing main-
stream domain knowledge into a recursive process, it is possible for us to visualize
not only highly relevant and highly cited documents, but also highly relevant but
infrequently cited documents.
A natural extension of this research is to explore ways to combine
approaches based on citation patterns with those based on word co-occurrence patterns
so as to pinpoint significant mismatches between citation strength and word
co-occurrence patterns. There are other potentially useful ways to uncover latent
domain knowledge. Many techniques developed in scientometrics for quantitative
studies of science can be used to generate structural representations of domain
knowledge. By comparing and contrasting differences across a variety of structural
representations, one can expect to spot missing links and potentially noteworthy
connections. For example, a co-word analysis may reveal a strong link between
intellectually related works; if such links are absent or weak in citation
networks, then it could be important for scientists to know whether they might have
overlooked something potentially significant.
On the one hand, visualizing domain knowledge in general is a revival of a
long-established quest for quantitative studies of scientific discoveries and scientific
paradigms, especially due to the advances in enabling techniques such as digital
libraries and information visualization. On the other hand, visualizing domain
knowledge should set its own research agenda in the new era of science and
technology so as to provide valuable devices for scientists, philosophers of science,
sociologists of knowledge, librarians, government agencies, and others to grasp
crucial developments in science and technology.
In this chapter, we have examined the role of citation chains in visualizing
latent domain knowledge. The new visualization approach can not only capture the
intellectual structure of highly cited works but also make it possible to uncover
connections between latent domain knowledge and the body of mainstream domain
knowledge. The two case studies have shown that this approach has potential as
a new way of supporting knowledge tracking and knowledge management.
References

Brown DR, Hafiz F, Glasssmith LL, Wong BS, Jones IM, Clive C et al (2000) Consequences of
manganese replacement of copper for prion protein function and proteinase resistance. EMBO
J 19(6):1180–1186
Chen C (1998a) Bridging the gap: the use of pathfinder networks in visual navigation. J Vis Lang
Comput 9(3):267–286
Chen C (1998b) Generalised similarity analysis and pathfinder network scaling. Interact Comput
10(2):107–128
Chen C (1999a) Information visualisation and virtual environments. Springer, London
Chen C (1999b) Visualising semantic spaces and author co-citation networks in digital libraries.
Inf Process Manag 35(2):401–420
Chen C (2002) Visualization of knowledge structures. In: Chang SK (ed) Handbook of software
engineering and knowledge engineering, vol 2. World Scientific Publishing Co, River Edge,
p 700
Chen C, Paul RJ (2001) Visualizing a knowledge domain’s intellectual structure. Computer
34(3):65–71
Chen C, Paul RJ, O’Keefe B (2001) Fitting the jigsaw of citation: information visualization in
domain analysis. J Am Soc Inf Sci 52(4):315–330
Chen C, Cribbin T, Macredie R, Morar S (2002) Visualizing and tracking the growth of competing
paradigms: two case studies. J Am Soc Inf Sci Technol 53(8):678–689
Kuhn TS (1962) The structure of scientific revolutions. University of Chicago Press, Chicago
Prusiner SB (1982) Novel proteinaceous infectious particles cause scrapie. Science
216(4542):136–144
Schvaneveldt RW (ed) (1990) Pathfinder associative networks: studies in knowledge organization.
Ablex Publishing Corporation, Norwood
Schvaneveldt RW, Durso FT, Dearholt DW (1989) Network structures in proximity data. In: Bower
G (ed) The psychology of learning and motivation, 24. Academic Press, New York, pp 249–284
Smalheiser NR, Swanson DR (1994) Assessing a gap in the biomedical literature – magnesium-
deficiency and neurologic disease. Neurosci Res Commun 15(1):1–9
Smalheiser NR, Swanson DR (1996a) Indomethacin and Alzheimer’s disease. Neurology 46:583
Smalheiser NR, Swanson DR (1996b) Linking estrogen to Alzheimer’s disease: an informatics
approach. Neurology 47:809–810
Smalheiser NR, Swanson DR (1998) Calcium-independent phospholipase A2 and schizophrenia.
Arch Gen Psychiatry 55:752–753
Small H (1977) A co-citation model of a scientific specialty: a longitudinal study of collagen
research. Soc Stud Sci 7:139–166
Stourton E (Writer) (2001) Mad cows and an Englishman [TV]. In L. Telling (Producer). London:
BBC2
Swanson DR (1986a) Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect
Biol Med 30(1):7–18
Swanson DR (1986b) Undiscovered public knowledge. Libr Q 56(2):103–118
Swanson DR (1987) Two medical literatures that are logically but not bibliographically connected.
J Am Soc Inf Sci 38:228–233
Swanson DR (1988) Migraine and magnesium: eleven neglected connections. Perspect Biol Med
31:526–557
Swanson DR (1990) Somatomedin C and arginine: implicit connections between mutually-isolated
literatures. Perspect Biol Med 33:157–186
Swanson DR (1999) Computer-assisted search for novel implicit connections in text databases.
Abstracts of Papers of the American Chemical Society, 217, 010-CINF
Swanson DR (2001) On the fragmentation of knowledge, the connection explosion, and assembling
other people’s ideas. Bull Am Soc Inf Sci Technol 27(3):12–14
Swanson DR, Smalheiser NR (1997) An interactive system for finding complementary literatures:
a stimulus to scientific discovery. Artif Intell 91(2):183–203
White HD, McCain KW (1998) Visualizing a discipline: an author co-citation analysis of
information science, 1972–1995. J Am Soc Inf Sci 49(4):327–356
Chapter 8
Mapping Science

A critical part of scientific activity is to discern how a new idea is related to what
we know and to what may become possible. As new scientific publications arrive
at a rate that rapidly outpaces our capacity for reading, analyzing, and
synthesizing scientific knowledge, we need to augment ourselves with information
that can guide us through the rapidly growing intellectual space effectively. In this
chapter, we address some fundamental issues concerning what information
may serve as early signs of potentially valuable ideas. In particular, we are interested
in information that is routinely available and derivable upon the publication of a
scientific paper, without assuming the availability of additional information such as
its usage and citations.

8.1 System Perturbation and Structural Variation

Many phenomena in the world share the essential properties of a complex adaptive
system (CAS). Complex adaptive systems are a special type of complex system.
The study of CAS focuses on complex, emergent, and macroscopic properties of
the system. John H. Holland defines a CAS as a system that has a large number of
components that interact, adapt, or learn. These components are often called agents.
The most important properties of a CAS concern a large population of autonomous
agents, non-linear and dynamic interactions between agents, open and blurred
boundaries, a constant flow of energy to maintain the system's organization, and
self-organizing mechanisms such as feedback.
In this chapter, we introduce a conceptualization of science as a complex adaptive
system and propose a theory that may have the potential of identifying early signs
of transformative ideas in science. We will demonstrate how the CAS perspective
can be used to detect information that triggers transformative and holistic changes
to the system.


8.1.1 Early Signs

Detecting early signs of potentially valuable ideas has theoretical and practical
implications. For instance, peer reviews of new manuscripts and new grant proposals
are under a growing pressure of accountability for safeguarding the integrity of
scientific knowledge and optimizing the allocation of limited resources (Chubin
1994; Chubin and Hackett 1990; Häyrynen 2007; Hettich and Pazzani 2006).
Long-term strategic science and technology policies require visionary thinking and
evidence-based foresights into the future (Cuhls 2001; Martin 2010; Miles 2010). In
foresight exercises on identifying future technology, experts' opinions were found
to be overly optimistic in hindsight (Tichy 2004). The increasing specialization
in today’s scientific community makes it unrealistic to expect an expert to have a
comprehensive body of knowledge concerning multiple key aspects of a subject
matter, especially in interdisciplinary research areas.
The value, or perceived value, of an idea can be quantified in many ways. For
example, the value of a good idea can be measured by the number of lives
it has saved, the number of jobs it has created, or the amount of revenue it
has generated. In the intellectual world, the value of a good idea can be measured
by the number of other ideas it has inspired or the amount of attention it has
drawn. In this chapter, we are concerned with identifying patterns and properties of
information that can tell us something about the potential value of ideas expressed
and embodied in scientific publications. The citation count of a scientific publication is
the number of times other scientific publications have referenced it.
Using citations to guide the search for relevant scientific ideas by way of association,
known as citation indexing, was pioneered by Eugene Garfield in the 1950s
(Garfield 1955). There is a general consensus that citation behavior can be motivated
by both scientific and non-scientific reasons (Bornmann and Daniel 2006). Citation
counts have been used as an indicator of intellectual impact on subsequent research.
There have been debates over the nature of citations and whether positive, negative,
and self-citations should all be treated equally. Nevertheless, even a negative citation
makes it clear that the referenced work cannot be simply ignored.
Researchers have searched for other clues that may inform us about the potential
impact of a newly published scientific paper, especially clues that can be readily
extracted from routinely available information at the time of publication instead
of waiting for download and citation patterns to build up over time. Factors such
as the track record of the authors, the prestige of the authors' institutions, and the
prestige of the journal in which an article is published are among the most promising
ones that can provide some assurance of the quality of the article (Boyack
et al. 2005; Hirsch 2007; Kostoff 2007; van Dalen and Kenkens 2005; Walters
2006). The common assumption central to approaches in this category is that great
researchers tend to deliver great work consistently and, in a similar vein, that an
article published in a high-impact journal is likely to be of high quality itself.
On the one hand, these approaches avoid the reliance on data that may not be readily
available upon the publication of an article and thus free analysts from constraints
due to the lack of download and citation data. On the other hand, the sources of
information used in these approaches are only indirectly related to the new ideas
reported in scientific publications. By analogy, we extend credit to an individual
based on his or her credit history instead of assessing the risk of the current
transaction directly. With such approaches, we cannot tell precisely where the
novelty of an idea comes from, nor whether similar ideas have been proposed
in the past.
Many studies have addressed factors that could explain or even predict future
citations of a scientific publication (Aksnes 2003; Hirsch 2007; Levitt and Thelwall
2008; Persson 2010). For example, is a paper’s citation count last year a good
predictor for new citations this year? Are the download times a good predictor
of citations? Is it true that the more references a paper cites, the more citations
it will receive later on? Similarly, the potential role of prestige, or the Matthew
Effect coined by Robert Merton, has been commonly investigated, ranging from
the prestige of authors to the prestige of journals in which articles are published
(Dewett and Denisi 2004). However, many of these factors are loosely and indirectly
coupled with the conceptual and semantic nature of the underlying subject matter of
concern. We refer to them as extrinsic factors. In contrast, intrinsic factors have direct
and profound connections with the intellectual content and structure. One example
of an intrinsic factor concerns the structural variation of a field of study.
A notable example is the work by Swanson on linking previously disjoint bodies
of knowledge, such as the connection between fish oil and Raynaud's syndrome
(Swanson 1986a).
Researchers have made various attempts to characterize future citations and
identify emerging core articles (Shibata et al. 2007; Walters 2006). Shibata et al.,
for example, studied citation networks in two subject areas, Gallium Nitride and
Complex Networks, and found that while past citations are a good predictor of near-
future citations, betweenness centrality is correlated with citations in the longer
term.
Upham et al. (2010) studied the role of cohesive intellectual communities –
schools of thought – in promoting and constraining knowledge creation. They
analyzed publications on management and concluded that it is significantly
beneficial for new knowledge to be part of a school of thought, and that the most
influential position within a school of thought is at its semi-periphery. In
particular, boundary-spanning research positioned at the semi-periphery of a school
would attract attention from other schools of thought and receive the most citations
overall. Their study used a zero-inflated negative binomial regression (ZINB).
Negative binomial regression models have been used to predict the expected mean
patent citations (Fleming and Bromiley 2000). Hsieh (2011) studied inventions as
combinations of technological features. In particular, the closeness of features
plays an interesting role: neither overly related nor loosely related features are
good candidates for new inventions. Useful inventions arise from rightly positioned
features, where the cost of synthesis is minimized.
Takeda and Kajikawa (2010) reported three stages of clustering in citation
networks. In the first stage, core clusters are formed, followed by the formation
of peripheral clusters and the continuous growth of the core clusters. Finally,
the core clusters’ growth becomes predominant again. Buter et al. (2011) studied
the emergence of an interdisciplinary research area from fields that did not show
interdisciplinary connections before. They used journal subject categories as a proxy
for fields and citations as a measure of interdisciplinary connection.
Lahiri et al. addressed how structural changes of a network may influence the
spread of information over the network (Lahiri et al. 2008). Although they did not
study bibliographic networks per se, their study indicates that predictions about
how information spreads over a network are sensitive to structural changes in the
network. This observation underlines the importance of taking structural change into
account when developing metrics based on topological properties of networks.
Leydesdorff (2001) raised questions (p. 146) that are closely related to what we
are addressing: “How does the new text link up to the literature, and what is its
impact on the network of previously existing relations?” He took a quite different
approach and analyzed word occurrences in scientific papers from an information-
theoretic perspective. In his approach, the publication of a paper is perceived as an
event that may lead to the reduction of uncertainty involved in the current state of
knowledge. He devised diagrams that depict pathways of how a particular paper
improves the efficiency of communication. Although the information-theoretic
approach and our structural variation approach currently operate on different units of
analysis with distinct theoretical underpinnings, both share a fundamental concern
with the changes that newly published scientific papers introduce to the existing
body of knowledge.
As shown above, many studies in the literature have addressed factors that may
influence citations. The value of our work is the introduction of the structural
variation paradigm, along with computational metrics that can be integrated into
interactive exploration systems to better understand precisely the impact of individual
links made by a new article.

8.1.2 A Structural Variation Model

There is a recurring theme in a diverse body of work on creativity: a major
form of creative work is to bridge previously disjoint bodies of knowledge. Notable
studies include the work of Ronald S. Burt in sociology (Burt 2004), Donald
Swanson in information science (Swanson 1986a), and conceptual blending as a
theoretical framework for exploring human information integration (Fauconnier and
Turner 1998). We have been developing an explanatory and computational theory
of transformative discovery based on criteria derived from structural and temporal
properties (Chen 2011; Chen et al. 2009).
In the history of science, there are many examples of how new theories revolu-
tionized the contemporary knowledge structure. For example, the 2005 Nobel Prize
in medicine was awarded for the discovery of Helicobacter pylori, a bacterium that
was not believed to be able to survive in the human gastric system (Chen et al. 2009).
Fig. 8.1 An overview of the structural variation model

In literature-based discovery, Swanson discovered a previously unnoticed linkage
between fish oil and Raynaud's syndrome (Swanson 1986a). In terrorism research,
before the September 11 terrorist attacks it was widely believed that only those
who directly witness a traumatic scene or directly experience a trauma are at risk
of post-traumatic stress disorder (PTSD); later research showed, however, that
people may develop PTSD symptoms even by simply watching the coverage of
a traumatic scene on TV (Chen 2006). In drug discovery, one of the major challenges
is to find, effectively, new compound structures in the vast chemical space that satisfy
an array of constraints (Lipinski and Hopkins 2004). In mapping scientific frontiers
(Chen 2003) and in studies of the science of science (Price 1965), it would be
particularly valuable if scientists, funding agencies, and policy makers had tools
to assist them in assessing the novelty of ideas in terms of their conceptual distance
from the contemporary domain knowledge. In these and many more scenarios, a
common challenge in coping with a constantly changing environment is to estimate
the extent to which the structure of a network should be updated in response to newly
available information (Fig. 8.1).
The basic assumption in the structural variation approach is that the extent
of a departure from the current intellectual structure is a necessary condition
for a potentially transformative idea in science. In other words, a potentially
transformative idea needs to bring changes to the existing structure of knowledge
in the first place. In order to measure the degree of structural variation introduced
by a scientific article, the intellectual structure at a particular moment of time needs
to be represented in such a way that structural changes can be computationally
detected and manually verifiable. Bibliographic networks can be computationally
derived from scientific publications. Research in scientometrics and citation analysis
routinely uses citation and co-citation networks as a proxy of the underlying
intellectual structure. Here we will focus on using several types of co-citation and
co-occurrence networks as the representation of a baseline network.
A network represents how a set of entities are connected. Entities are represented
as nodes, or vertices, in the network. Their connections are represented as links, or
edges. Relevant entities in our context include several types of information that can
be computationally extracted from a scientific article, such as references cited by the
article, authors and their affiliations, the journal in which the article is published, and
keywords in the article. We will limit our discussions to networks that are formed
with a single type of entities, although networks of multiple types of entities are
worth considering once we establish a basic understanding of structural variations
in networks of a single type of entities.
Once the type of entities is chosen, the nature of the interconnectivity between
entities is to be specified to form a network. Networks of co-occurring entities
represent a wide variety of types of connectivity. A network of co-occurring words
represents how words are related in terms of whether and how often they appear
in the vicinity of each other. Co-citation networks of entities such as references,
authors, and journals can be seen as a special case of co-occurring networks. For
example, co-citation networks of references are networks of references that appear
together in the bodies of scientific papers – these references are co-cited.
Networks of co-cited references represent more specific information than net-
works of co-cited authors because references of different articles by the same author
would be lumped together in a network of co-cited authors. Similarly, networks
of co-cited references are more specific than networks of co-cited journals. We
refer to such differences in specificity as the granularity of networks. Measurements
of structural variation need to take the granularity factor into account because it is
reasonable to expect that networks at different levels of granularity will lead to
different measures of structural variation.
Another decision to be made about a baseline network is a sampling issue. Taking
a particular year as a vantage point from which to look at the past, how far back
should we go in the construction of a baseline network that would adequately
represent the underlying intellectual structure? Does the network become more
accurate if we go back further into the past? Would it be more efficient to limit it
to the most recent years that matter the most? Given articles published in a particular
year Y, the baseline network represents the intellectual structure using information
from articles published up to year Y−1. Two types of baseline networks are
investigated here: ones using a moving window of a fixed size [Y−k, Y−1] and ones
using the entire history [Y0, Y−1], where Y0 is the earliest year of publication for
records in the given dataset.
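To make this concrete, the following is a minimal sketch of constructing such a baseline network in Python with the networkx library. It assumes that each bibliographic record carries a publication year and a list of cited references; the function and field names are illustrative rather than CiteSpace's actual implementation.

import itertools
import networkx as nx

def build_baseline(records, year, window=None):
    # Use records published in [year - window, year - 1], or the entire
    # history up to year - 1 when no window is given.
    start = year - window if window else min(r['year'] for r in records)
    G = nx.Graph()
    for rec in records:
        if start <= rec['year'] <= year - 1:
            refs = sorted(set(rec['references']))
            # Each pair of references cited together by one article
            # contributes one unit of co-citation weight to their link.
            for u, v in itertools.combinations(refs, 2):
                if G.has_edge(u, v):
                    G[u][v]['weight'] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G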

8.1.3 Structural Variation Metrics

We expect that the degree of structural variation introduced by a new article can
offer prospective information because of the boundary-spanning mechanism. If an
article introduces novel links that span the boundaries of different topics, then we
expect this to signify its potential for taking the intellectual structure in a new
direction.
Given a baseline network, structural variations can be measured based on
information provided by a particular article. We will introduce three metrics of
structural variation. Each metric quantifies the degree of change in the baseline
network introduced by information provided by an article. No usage data is involved
in the measurement. The three metrics are modularity change rate, inter-cluster
linkage, and centrality divergence. The definitions of the first two metrics depend
on a partition of the baseline network, but the third one does not. A partition
of a network decomposes the network into non-overlapping groups of nodes. For
example, clustering algorithms such as spectral clustering can be used to partition a
network.
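As an illustration of the partitioning step, the sketch below clusters a baseline network with spectral clustering via scikit-learn. Treating the weighted adjacency matrix as a precomputed affinity matrix is one reasonable choice among several, and the number of clusters k is an arbitrary illustrative value.

import networkx as nx
from sklearn.cluster import SpectralClustering

def partition_network(G, k=12):
    nodes = list(G.nodes())
    # The weighted adjacency matrix serves as the affinity matrix.
    A = nx.to_numpy_array(G, nodelist=nodes, weight='weight')
    labels = SpectralClustering(n_clusters=k,
                                affinity='precomputed').fit_predict(A)
    return dict(zip(nodes, labels))  # maps each node to its cluster id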
The theoretical underpinning of the structural variation approach is that scientific
discoveries, at least a subset of them, can be explained in terms of boundary-spanning,
brokerage, and synthesis mechanisms in an intellectual space (Chen et al. 2009).
This conceptualization generalizes the principle of literature-based discovery
pioneered by Swanson (1986a, b), which assumes that connections between previously
disparate bodies of knowledge are potentially valuable. In Swanson's well-known
ABC model, the relationships AB and BC are known in the literature. The potential
relationship AC then becomes a candidate for further scientific investigation
(Weeber 2003). Our conceptualization is more generic in several ways. First,
in the ABC model, the AC relation changes an indirect connection to a direct
connection, whereas our structural variation model makes no assumption about
any prior relations at all. Second, in the ABC model, the scope of consideration is
limited to relationships involving three entities. In contrast, our structural variation
model takes a wider context into consideration and addresses the novelty of a
connection that links groups of entities as well as connections linking individual
entities. Because of the broadened scope of consideration, it becomes possible to
search for candidate connections more effectively. In other words, given a set of
entities, the size of the search space of potential connections can be substantially
reduced if additional constraints are applicable for the selection of candidate
connections. For example, the structural hole theory developed in social network
analysis emphasizes the special potential of nodes that are strategically positioned
to form brokerage, or boundary spanning, links and create good ideas (Burt 2004;
Chen et al. 2009).
8.1.3.1 Modularity Change Rate (MCR)

Given a partition of a network, i.e. a configuration of clusters, the modularity of
the network measures the degree of interconnectivity among the groups of nodes
identified by the partition. If different clusters are loosely connected, then the
overall modularity is high. In contrast, if clusters are interwoven, then the
modularity is low. We follow Newman's algorithm (Newman 2006) to calculate
the modularity with reference to a cluster configuration generated by spectral
clustering (Chen et al. 2010; von Luxburg 2006). Suppose the network G is
partitioned by a partition C into k clusters such that G = c1 + c2 + ... + ck. Q(G, C)
is defined as follows, where m is the total number of edges in the network G, n is
the number of nodes in G, and δ(ci, cj) is the Kronecker delta, which is 1 if nodes
ni and nj belong to the same cluster and 0 otherwise. deg(ni) is the degree of node
ni. The range of Q(G, C) is between −1 and 1.
Q(G, C) = \frac{1}{2m} \sum_{i,j=0}^{n} \delta(c_i, c_j) \left[ A_{ij} - \frac{\deg(n_i)\,\deg(n_j)}{2m} \right]

The modularity of a network is a measure of the overall structure of the network;
its range is between −1 and 1. The Modularity Change Rate (MCR) of a scientific
paper measures the relative structural change due to the information from the
published paper with reference to a baseline network. For each article a and a
baseline network G_baseline, we define the MCR as follows:

\mathrm{MCR}(a) = \frac{Q(G_{baseline}, C) - Q(G_{baseline} \oplus G_a, C)}{Q(G_{baseline}, C)} \times 100

where G_{baseline} \oplus G_a is the baseline network updated with information from
the article a. For example, suppose reference nodes ni and nj are not connected in a
baseline network of co-cited references but are co-cited by article a; a new link
between ni and nj will then be added to the baseline network. In this way, the article
changes the structure of the baseline network.
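The following sketch computes Q(G, C) and the Modularity Change Rate directly from the definitions above; it is written for clarity rather than efficiency, and the helper names are illustrative.

def modularity(G, partition):
    # Q(G, C) as defined above; `partition` maps nodes to cluster ids.
    m = G.number_of_edges()
    if m == 0:
        return 0.0
    nodes = [v for v in G.nodes() if v in partition]  # ignore unclustered nodes
    q = 0.0
    for i in nodes:
        for j in nodes:
            if partition[i] == partition[j]:
                a_ij = 1.0 if G.has_edge(i, j) else 0.0
                q += a_ij - G.degree(i) * G.degree(j) / (2.0 * m)
    return q / (2.0 * m)

def modularity_change_rate(G_baseline, partition, article_links):
    # MCR(a): relative change of Q after adding article a's links,
    # evaluated against the same partition C of the baseline network.
    G_updated = G_baseline.copy()
    G_updated.add_edges_from(article_links)
    q0 = modularity(G_baseline, partition)
    q1 = modularity(G_updated, partition)
    return (q0 - q1) / q0 * 100.0 if q0 else 0.0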
Intuitively, adding a new link anywhere in a network should not increase the
modularity of the network; it should either reduce it or leave it intact. However,
the change of modularity is not monotonic, contrary to what we initially expected.
In fact, it depends on where the new link is added and how the network is structured.
Adding a link may reduce the contribution to the modularity from some clusters,
but it may increase the contribution from other clusters in the network. Thus, the
overall modularity change is not monotonic.
Without loss of generality, assume that an article adds one link at a time
to a given baseline network. If the new link connects two distinct clusters, then
it has no effect on the corresponding term in the updated modularity because, by
definition, δij = 0 and the corresponding term becomes 0. Such a link is illustrated
Fig. 8.2 Scenarios that may increase or decrease individual terms in the modularity metric

by the dashed link e5,10 in the top diagram in Fig. 8.2. The new link eij will increase
the degree of nodes i and j by one, i.e. deg(i) will become deg(i) + 1. The total
number of edges m will increase to m + 1. A simple calculation at the bottom
of Fig. 8.2 shows that terms in the modularity formula involving blue links will
decrease from their previous values. However, if the network has clusters such as
CA with no changes in node degrees, then the corresponding values of the terms
for the lines in red will increase from their previous values as the denominator
increases from 2m to 2(m + 1). In summary, the updated modularity may increase
as well as decrease, depending on the structure of the network and where the new
link is added. With this particular definition of modularity, between-cluster links are
always associated with a zero-valued term in the overall modularity formula due
to the Kronecker delta. What we see in the change of modularity is a combination
of results from several scenarios that are indirectly affected by the newly added link.
We will introduce our next metric to reflect the changes in terms of between-cluster
links directly.
8.1.3.2 Cluster Linkage (CL)

The Cluster Linkage (CL) metric measures the overall structural change introduced
by an article a in terms of new connections added between clusters. Its definition
assumes a partition of the network. We introduce a function of edges λ(ci, cj),
which is the opposite of the δij used in the modularity definition. The value of λij
is 1 for an edge across distinct clusters ci and cj, and 0 for edges within a cluster.
λij allows us to concentrate on between-cluster links and ignore within-cluster
links, which is the opposite of how the modularity metric is defined. The Linkage
metric is the sum of all the weights of between-cluster links eij divided by K, the
total number of clusters in the network. Linking a node to itself is not allowed,
i.e. we assume eii = 0 for all nodes. Using link weights makes the metric sensitive
to links that strengthen existing connections between clusters, in addition to novel
links that make unprecedented connections between clusters.
It is possible to take into account the sizes of the clusters that a link connects, so
that connections between larger clusters become more prominent in the measurement.
For example, one option is to multiply each eij by
\sqrt{size(c_i) \cdot size(c_j)} / \max_k size(c_k). Here we define the metric
without such modifications for the sake of simplicity. Suppose C is a partition
of G; the Linkage metric is defined as follows:

\mathrm{Linkage}(G, C) = \frac{\sum_{i \neq j} \lambda_{ij} e_{ij}}{K}, \qquad
\lambda_{ij} = \begin{cases} 0, & n_i \in c_j \\ 1, & n_i \notin c_j \end{cases}

The Cluster Linkage of an article a is defined as the difference in Linkage before
and after the between-cluster links added by the article:

\mathrm{CL}(a) = \Delta\mathrm{Linkage}(a) = \mathrm{Linkage}(G_{baseline} \oplus G_a, C) - \mathrm{Linkage}(G_{baseline}, C)

Linkage(G_{baseline} \oplus G_a, C) is always greater than or equal to
Linkage(G_{baseline}, C). Thus, CL is non-negative.
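A corresponding sketch of the Cluster Linkage metric follows directly from this definition, using the simple form without the optional cluster-size factor; the names are again illustrative.

def linkage(G, partition):
    # Sum of the weights of between-cluster links, divided by K.
    k = len(set(partition.values()))
    total = 0.0
    for u, v, data in G.edges(data=True):
        # Nodes not covered by the partition are treated as external,
        # so their links count as between-cluster links.
        if partition.get(u) != partition.get(v):
            total += data.get('weight', 1.0)
    return total / k

def cluster_linkage(G_baseline, partition, article_links):
    # CL(a) = Linkage(G + Ga, C) - Linkage(G, C); always non-negative.
    G_updated = G_baseline.copy()
    G_updated.add_edges_from(article_links)
    return linkage(G_updated, partition) - linkage(G_baseline, partition)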

8.1.3.3 Centrality Divergence (CKL )

The Centrality Divergence metric measures the structural variation caused by an
article a in terms of the divergence of the distribution of the betweenness centrality
CB(vi) of nodes vi in the baseline network. This definition does not involve any
partition of the network. If n is the total number of nodes, the degree of structural
change CKL(G, a) can be defined in terms of the Kullback-Leibler (K-L) divergence:
8.1 System Perturbation and Structural Variation 269

C_{KL}(G_{baseline}, a) = \sum_{i=0}^{n} p_i \log\left(\frac{p_i}{q_i}\right), \qquad
p_i = C_B(v_i, G_{baseline}), \quad q_i = C_B(v_i, G_{updated})

For nodes where p_i = 0 or q_i = 0, we reset the value to a small number, 10^{-6},
to avoid log(0).
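The Centrality Divergence can be sketched in a few lines using networkx's betweenness centrality, replacing zero probabilities with 10^{-6} as described above; the function name is illustrative.

import math
import networkx as nx

def centrality_divergence(G_baseline, article_links, eps=1e-6):
    G_updated = G_baseline.copy()
    G_updated.add_edges_from(article_links)
    p = nx.betweenness_centrality(G_baseline)
    q = nx.betweenness_centrality(G_updated)
    c_kl = 0.0
    for v in G_baseline.nodes():
        pi = p.get(v, 0.0) or eps  # reset zeros to 1e-6 to avoid log(0)
        qi = q.get(v, 0.0) or eps
        c_kl += pi * math.log(pi / qi)
    return c_kl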

8.1.4 Statistical Models

We constructed negative binomial (NB) and zero-inflated negative binomial (ZINB)
models to validate the role of structural variation in predicting future citation counts
of scientific publications. The negative binomial distribution is generated by a
sequence of independent Bernoulli trials. Each trial is either a 'success' with
probability p or a 'failure' with probability (1 − p). The terminology of success and
failure in this context does not necessarily represent any practical preference. The
random number of successes X before encountering a predefined number of
failures r has a negative binomial distribution:

X \sim \mathrm{NB}(r, p)

One can adapt this definition to describe a wide variety of count events. Citation
counts are a type of count event with over-dispersion, i.e. the variance is
greater than the mean. NB models are commonly used in the literature to study this
type of count event. Two dispersion parameters are used in the literature,
θ and α, where θ·α = 1.
Zero-inflated count models are commonly used to account for excessive zero
counts (Hilbe 2011; Lambert 1992). Zero-inflated models include two sources of
zero citations: the point mass at zero, I_{\{0\}}(y), and a count component with a count
distribution f_{count}, such as the negative binomial or Poisson (Zeileis et al. 2011).
The probability of observing a zero count is inflated with probability π = f_{zero}(zero
citations):

f_{zero\text{-}inflated}(citations) = \pi \cdot I_{\{0\}}(citations) + (1 - \pi) \cdot f_{count}(citations)

ZINB models are increasingly used in the literature to model excessive occur-
rences of zero citations (Fleming and Bromiley 2000; Upham et al. 2010). The report
of a ZINB model consists of two parts: the count model and the zero-inflated model.
One way to test whether a ZINB model is superior to a corresponding NB model is
known as the Vuong test. The Vuong test is designed to test the null hypothesis that
the two models are indistinguishable. Akaike’s Information Criterion (AIC) is also
commonly used to evaluate the goodness of a model. Models with lower AIC scores
are regarded as better models.
We illustrate the model using global citation counts of scientific publications
recorded in the Web of Science. NB models are defined as follows, using log as the
link function:

Global citations ~ Coauthors + Modularity Change Rate + Cluster Linkage +
Centrality Divergence + References + Pages
Global citations is the dependent variable. Coauthors is a factor with three levels:
1, 2, and 3, where level 3 is assigned to articles with three or more coauthors.
Coauthors is an indirect indicator of the extent to which an article synthesizes ideas
from different areas of expertise represented by each coauthor.
Three structural variation metrics are included as covariates in the generalized
linear models, namely Modularity Change Rate (MCR), Cluster Linkage (CL), and
Centrality Divergence (CKL). According to our theory of creativity, groundbreaking
ideas are expected to cause strong structural variations. If global citation counts
provide a reasonable proxy of the recognition of intellectual contributions in a
scientific community, we would expect at least some of the structural variation
metrics to have statistically significant main effects on global citations.
The number of cited references and the number of pages are commonly reported
in the literature as good predictors of citations. In order to compare the effects of
structural variation with these commonly reported extrinsic properties of scientific
publications, References and Pages are included in the models. Our theory offers
a simple explanation of why the more references a paper cites, the more citations it
appears to receive. Due to the boundary-spanning synthesis mechanism, an article
needs to explain the multiple parts it draws on and how they can be innovatively
connected. This process results in citing more references than an article that covers
a narrower range of topics would cite. Review papers by their nature belong to this
category.
It is known that articles published earlier tend to have more citations than articles
published later. The exposure time of an article is therefore included in the NB
models as a logarithmically transformed year of publication.
An intuitive way to interpret the coefficients in NB models is to use the incidence
rate ratios (IRRs) estimated by the models. For example, if Coauthors has an IRR of
1.5, it means that as the number of coauthors increases by one, the expected global
citation count increases by a factor of 1.5, holding the other variables in the model
constant. In our models, we will particularly examine the statistically significant
IRRs of the structural variation metrics.
Zero-inflated negative binomial (ZINB) models use the same set of variables.
The count model of the ZINB is identical to the NB model described above. The
zero-inflated model of the ZINB uses the same set of variables to predict the
excessive zeros. We found little in the literature about good predictors of zeros in a
comparable context. We choose to include all six variables in the zero-inflated
model to provide a broader view of the zero-generating process. ZINBs are defined
as follows:

Global citations ~ Coauthors + Modularity Change Rate + Cluster Linkage +
Centrality Divergence + References + Pages
Zero citations ~ Coauthors + Modularity Change Rate + Cluster Linkage +
Centrality Divergence + References + Pages
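As a rough indication of how such models can be fitted in practice, the sketch below uses the Python statsmodels package. The input file, its column names, and the use of a log-transformed publication year as the offset are assumptions about how a dataset might be organized, not a description of the software actually used for the analyses reported here.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# Hypothetical input: one row per article with the model's variables.
df = pd.read_csv('articles.csv')
X = sm.add_constant(df[['coauthors', 'mcr', 'cl', 'ckl', 'refs', 'pages']])
y = df['citations']
offset = np.log2(df['year'])  # exposure term, as in Table 8.1

nb = sm.NegativeBinomial(y, X, offset=offset).fit()
print(np.exp(nb.params))  # exponentiated slope coefficients are the IRRs
print(nb.aic)             # lower AIC indicates a better model

# ZINB: the same covariates predict both the counts and the excess zeros.
zinb = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X, inflation='logit')
print(zinb.fit(maxiter=200).summary())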
Fig. 8.3 The structure of the system before the publication of the groundbreaking paper by Watts and Strogatz (1998)

Fig. 8.4 The structure of the system after the publication of Watts and Strogatz (1998)

8.1.5 Complex Network Analysis (1996–2004)

Figures 8.3 and 8.4 illustrate how the system adapts to the publication of the
groundbreaking paper by Watts and Strogatz (1998). The network was derived from
5,135 articles published on small-world networks between 1990 and 2010. The
network of 205 references and 1,164 co-citation links is divided into 12 clusters
with a modularity of 0.6537 and a mean silhouette of 0.811. The red lines are made
by the top 15 articles as measured by the centrality variation rate. Only the labels of
major clusters are shown in the figure. Dashed lines in red are novel connections made
by Watts and Strogatz (1998) at the time of its publication. The article has the
highest Cluster Linkage and CKL scores, 5.43 and 1.14, respectively. The
figure offers a visual confirmation that the article was indeed making boundary-
spanning connections. Recall that the data set was constructed by expanding the
seed article along forward citation links. These boundary-spanning links provide
empirical evidence that the groundbreaking paper was connecting two groups of
clusters. The emergence of Cluster #8 (complex network) was a consequence of this
impact.
Table 8.1 summarizes the results of five NB regression models with different
types of networks. They have an average dispersion parameter θ of 0.5270, which
is equivalent to an α of 1.8975. Coauthors has an average IRR of 1.3278,
References an average IRR of 1.0126, and Pages an average IRR of 0.9714.
The effects of these three variables are consistent and stable across the five types of
networks. In contrast, the effects of the structural variation metrics are less stable.
On the other hand, structural variations appear to have a stronger impact on global
citations than more commonly studied measures such as Coauthors and References.
For example, CL has an IRR of 3.160 in networks of co-cited references and an IRR
of 1.33 × 10^8 in networks of noun phrases. IRRs that are greater than 1.0 predict
an increase in global citations.
We have found statistical evidence of the boundary-spanning mechanism. An
article that introduces novel connections between clusters of co-cited references is
likely to become highly cited subsequently. In addition, we have found that the
IRRs of Cluster Linkage are more than twice those of Coauthors and References.
This finding provides a more fundamental explanation of why the number of
references cited by an article appears to be a good predictor of its future citations,
as found in many previous studies. As a result, the structural variation paradigm
clarifies why a number of extrinsic features appear to be associated with high
citations.
A distinct characteristic of the structural variation approach is the focus on the
potential connection between the degree of structural variation introduced by an
article and its future impact. The analytic and modeling procedure demonstrated
here is expected to serve as an exemplar for subsequent studies along this line of
research. More importantly, the focus on the underlying mechanisms of scientific
activity is expected to provide additional insights and practical guidance for
scientists, sociologists, historians, and philosophers of scientific knowledge.
There are many new challenges and opportunities ahead. For example, how
common is the boundary-spanning mechanism in scientific discoveries overall?
What are the other major mechanisms and how do they interact with the boundary-
spanning mechanism? There are other potentially valuable techniques that we have
not utilized in the present study, including topic modeling, citation context analysis,
survival analysis, and burst detection. In short, much work remains to be done, and
this is an encouraging start.
Figure 8.5 shows the structural variation approach applied to the study
of the potential of patents. The patent US6537746 is ranked high on the structural
variation scale. Its position is marked by a star. The areas where the patent made
Table 8.1 Negative binomial regression models (NBs) of Complex Network Analysis (1996–2004) at five different levels of granularity of units of analysis
Data source: Complex Network Analysis (1996–2004), top 100 records per time slice, 2-year sliding window

Unit of analysis            Reference      Keyword        Noun phrase         Author         Journal
Relation                    Co-citation    Co-occurrence  Co-occurrence       Co-citation    Co-citation
Offset (exposure)           log2(Year)     log2(Year)     log2(Year)          log2(Year)     log2(Year)
Number of citing articles   3,515          3,072          3,254               3,271          3,271

Global citations: Incidence Rate Ratios (IRRs) in NB models (IRR, p)
Coauthors                   1.306 0.000    1.298 0.000    1.326 0.000         1.359 0.000    1.350 0.000
Modularity change rate      1.083 0.025    1.038 0.086    1.047 0.305         1.055 0.276    1.060 0.180
Weighted cluster linkage    3.160 0.000    0.205 0.095    1.33 × 10^8 0.000   2.879 0.000    1.204 0.049
Centrality divergence       0.343 0.184    3.679 0.023    1.534 0.665         23.400 0.000   7.620 0.000
Number of references        1.013 0.000    1.013 0.000    1.013 0.000         1.012 0.000    1.012 0.000
Number of pages             0.970 0.000    0.971 0.000    0.971 0.000         0.973 0.000    0.972 0.000
Dispersion parameter (θ)    0.5284         0.5258         0.5150              0.5282         0.5375
−2 × log-likelihood         31,771         28,331         29,491              29,506         29,613
Akaike's Information
Criterion (AIC)             31,787         28,347         29,508              29,522         29,629

References involve the least amount of ambiguity with the finest granularity, whereas the other four
types of units introduce ambiguity at various levels
Models constructed with units of higher ambiguity are slightly improved in terms of Akaike's
Information Criterion (AIC)
Fig. 8.5 The structural variation method is applied to a set of patents related to cancer research.
The star marks the position of a patent (US6537746). The red lines show where the boundary-
spanning connections were made by the patent. Interestingly, the impacted clusters are about
recombination

boundary-spanning links are clusters #88 and #83, both labeled as recombination.
The map shows that multiple streams of innovation have moved away from the
course of older streams.
We conclude that structural variation is an essential aspect of the development of
scientific knowledge and it has the potential to reveal the underlying mechanisms
of the growth of scientific knowledge. The focus on the underlying mechanisms
of knowledge creation is the key to the predictive potential of the structural
variation approach. The theory-driven explanatory and computational approach sets
an extensible framework for detecting and tracking potentially creative ideas and
gaining insights into challenges and opportunities in light of the collective wisdom.

8.2 Regenerative Medicine

The Nobel Prize in Physiology or Medicine 2012 was announced on October 8,
2012. The award was shared by Sir John B. Gurdon and Shinya Yamanaka for
the discovery that mature cells can be reprogrammed to become pluripotent. The
potential of a cell to differentiate into different cell types is known as the potency
of the cell. Simply speaking, a differentiation process refers to how a cell divides
into new cells. Cells in the next generation, in general, become more specialized
than their parent generation. Cells with the broadest range of potential can produce
all kinds of cells in an organism; this potential is called totipotency. The next
level of potency is called pluripotency, from the Latin plurimus, meaning very many.
A pluripotent cell can differentiate into more specialized cells. In contrast,
a unipotent cell can differentiate into only one cell type.
Prior to the work of Gurdon and Yamanaka, it was generally believed that the
path of cell differentiation is irreversible in that the potency of a cell becomes more
and more limited in generations of differentiated cells. Induced pluripotent stem
cells (iPS cells) result from a reprogramming of the natural differentiation process.
Starting with a non-pluripotent cell, human intervention can reverse the process so
that the cell regains a more generic potency.
John B. Gurdon discovered in 1962 that the DNA of a mature cell may still have
all the information needed to develop all cells in a frog. He modified an egg cell of
a frog by replacing its immature nucleus with the nucleus from a mature intestinal
cell. The modified egg cell developed into a normal tadpole. His work demonstrated
that the specialization of cells is reversible. Shinya Yamanaka’s discovery was made
more than 40 years later. He found out how mature cells in mice could be artificially
reprogrammed to become induced pluripotent stem cells.

8.2.1 A Scientometric Review

On August 25, 2011, more than a year before the 2012 Nobel Prize was
announced, I received an email from Emma Pettengale, the editor of the
peer-reviewed journal Expert Opinion on Biological Therapy (EOBT). The journal
provides expert reviews of recent research on emerging biotherapeutic drugs and
technologies. She asked if I would be interested in preparing a review of emerging
trends in regenerative medicine using CiteSpace, and she would give me 3 months
to complete the review.
EOBT is a reputable journal with an impact factor of 3.505 according to the
Journal Citation Report (JCR) compiled by Thomson Reuters in 2011. Emma's
invitation was an unusual one. The journal is a forum for experts to express their
opinions on emerging trends, but I am not a specialist in regenerative medicine at
all. Although CiteSpace has been used in a variety of retrospective case studies,
including terrorism, mass extinctions, string theory, and complex network analysis,
we were able to find independent reviews for most of those case studies to
cross-validate our results, or to contact domain experts to verify specific patterns.
The invitation was both challenging and stimulating. We would be able to analyze
emerging trends in a rapidly advancing field with CiteSpace. Most importantly, we
wanted to find out whether we could limit our source of information exclusively to
patterns identified by CiteSpace itself.
Regenerative medicine is a rapidly growing and fast-moving interdisciplinary
field of study, involving stem cell research, tissue engineering, biomaterials, wound
healing, and patient-specific drug discovery (Glotzbach et al. 2011; Polak 2010;
Polykandriotis et al. 2010). The potential of reprogramming patients' own cells
for biological therapy, tissue repair, and regeneration is critical to regenerative
medicine. It has been widely expected that regenerative medicine will revolutionize
medicine and clinical practice far beyond what is currently possible. Mesenchymal
stem cells (MSCs), for example, may differentiate into bone cells, fat cells, and
cartilage cells. Skin cells can be reprogrammed into induced pluripotent stem cells
(iPSCs). The rapid advance of the research has also challenged many previous
assumptions and expectations. Although iPSCs resemble embryonic stem cells in
many ways, comparative studies have found potentially profound differences (Chin
et al. 2009; Feng et al. 2010; Stadtfeld et al. 2010).
The body of the relevant literature grows rapidly. The Web of Science has 4,295
records between 2000 and 2011 based on a topic search of the term "regenerative
medicine" in titles, abstracts, or indexing terms. If we include records that are
relevant to regenerative medicine but do not use the term "regenerative medicine"
explicitly, the number could be as much as ten times higher. Stem cell research plays
a substantial role in regenerative medicine. There are over two million publications
on stem cells on Google Scholar, and 167,353 publications specifically indexed
as related to stem cell research in the Web of Science. Keeping abreast of this fast-
moving body of literature is critical, not only because new discoveries emerge from
a diverse range of areas but also because new findings may fundamentally alter the
collective knowledge as a whole (Chen 2012).
In fact, a recent citation network analysis (Shibata et al. 2011) identified future
core articles on regenerative medicine based on their positions in a citation network
derived from 17,824 articles published before the end of 2008. In this review, we
demonstrate a scientometric approach and use CiteSpace to delineate the structure
and dynamics of regenerative medicine research. CiteSpace is specifically
designed to facilitate the detection of emerging trends and abrupt changes in the
scientific literature. Our study is unique in several ways. First, our dataset contains
relevant articles published between 2000 and 2011; we expect that it will reveal
more recent trends that emerged within the last 3 years. Second, we use a citation
index-based expansion to construct our dataset, which is more robust than defining
a rapidly growing field with a list of pre-defined keywords. Third, emerging trends
are identified based on indicators computed by CiteSpace, without domain experts'
intervention or prior working knowledge of the topic. This approach makes the
analysis repeatable with new data and verifiable by different analysts.
CiteSpace is used to generate and analyze networks of co-cited references based
on bibliographic records retrieved from the Web of Science. An initial topic search
for “regenerative medicine” resulted in 4,295 records published between 2000 and
2011. After filtering out less representative record types such as proceedings papers
and notes, the dataset was reduced to 3,875 original research articles and review
articles.
Fig. 8.6 Major areas of regenerative medicine

The 3,875 records do not include relevant publications in which the term "regenerative
medicine" does not explicitly appear in the titles, abstracts, or index terms. We
expanded the dataset by citation indexing. If an article cites at least one of the 3,875
records, then the article is included in the expanded dataset, based on the assumption
that citing a regenerative medicine article makes the citing article relevant
to the topic. The citation index-based expansion resulted in 35,963 records,
consisting of 28,252 (78.6 %) original articles and 7,711 (21.4 %) review articles. The
range of the expanded set remains 2000–2011; thus the analysis focuses on
the development of regenerative medicine over the last decade. The 35,963-article
dataset is used in the subsequent analysis. Incorrect citation variants of two
highly visible references, a 1998 landmark article by Thomson et al. (1998) and a
1999 article by Pittenger (Pittenger et al. 1999), were corrected prior to the analysis.

8.2.2 The Structure and Dynamics

Figure 8.6 shows a visualization of the literature relevant to regenerative medicine.
This visualization provides an overview of major milestones in history. The
concentrations of colors indicate the chronological order of the development. For
example, cluster #12 mesenchymal stem cell was one of the earlier focuses of the
research,
Table 8.2 Major clusters of co-cited references

Cluster ID  Size  Silhouette  Label (TFIDF)                Label (LLR)                     Label (MI)               Ave. year
9           97    0.791       Evolving concept             Mesenchymal stem cell           Cardiac progenitor cell  1999
17          71    0.929       Somatic control              Drosophila spermatogenesis      Drosophila               1994
6           67    0.980       Mcf-7 cell                   Intestinal-type gastric cancer  Change                   2001
12          62    0.891       Midkine                      Human embryonic stem cell       Dna                      2002
5           53    0.952       Grid2ip gene                 Silico                          Gastric cancer           2002
19          42    0.119       Bevacizumab                  Combination                     Cartilage                2004
7           40    0.960       Monogenic disease treatment  Induced pluripotent stem cell   Clinic                   2008
15          25    0.930       Tumorigenic melanoma cell    Cancer stem cell                Cancer prevention        2003

Clusters are referred to by the labels selected by LLR

followed by #20 human embryonic stem cell, and then by the latest and
current #32 induced pluripotent stem cell. The patches of red rings in #32 indicate
that this area is rapidly expanding, as suggested by citation bursts.
Table 8.2 lists eight major clusters by their size, i.e. the number of members in
each cluster. Clusters with few members tend to be less representative than larger
clusters because small clusters are likely to be formed by the citing behavior of
a small number of publications. The quality of a cluster is also reflected in its
silhouette score, which is an indicator of its homogeneity or consistency.
Silhouette values of homogeneous clusters tend to be close to 1. Most of the clusters
are highly homogeneous, except Cluster #19, which has a low silhouette score of
0.119. Each cluster is labeled by noun phrases from the titles of the articles citing
the cluster (Chen et al. 2010).
The average year of publication of a cluster indicates its recentness. For example,
Cluster #9 on mesenchymal stem cell (MSCs) has an average year of 1999. The most
recently formed cluster, Cluster #7 on induced pluripotent stem cell (iPSCs), has an
average year of 2008.
Cluster #7 contains numerous nodes with the red rings of citation bursts. The
visualized network also shows burst terms found in the titles and abstracts of
articles citing the major clusters. For example, the terms stem-cell-renewal and
germ-line-stem-cells are not only used when articles cite references in Cluster #17
drosophila spermatogenesis, but also underwent a period of rapid increase in use.
Similarly, the term induced-pluripotent-stem-cells is a burst term associated with
Cluster #7, which is consistently labeled as induced pluripotent stem cell by a
different selection mechanism, the log-likelihood ratio test (LLR). We will focus
on Cluster #7 in order to identify emerging trends in regenerative medicine.
Cluster #7 is the most recently formed cluster. We selected the ten most cited
references in this cluster and ten citing articles (see Table 8.3).
Table 8.3 Cited references and citing articles of Cluster #7 on iPSCs

Cluster #7 induced pluripotent stem cell

Cited references (citations; author, year, source):
1,841  Takahashi K (2006) Cell, v126, p663
1,583  Takahashi K (2007) Cell, v131, p861
1,273  Yu JY (2007) Science, v318, p1917
762    Okita K (2007) Nature, v448, p313
640    Wernig M (2007) Nature, v448, p318
615    Park IH (2008) Nature, v451, p141
501    Nakagawa M (2008) Nat Biotechnol, v26, p101
445    Okita K (2008) Science, v322, p949
391    Maherali N (2007) Cell Stem Cell, v1, p55
348    Stadtfeld M (2008) Science, v322, p945

Citing articles (coverage %; author, year, title):
95  Stadtfeld, Matthias (2010) induced pluripotency: history, mechanisms, and applications
80  Kiskinis, Evangelos (2010) progress toward the clinical application of patient-specific pluripotent stem cells
77  Masip, Manuel (2010) reprogramming with defined factors: from induced pluripotency to induced transdifferentiation
77  Sommer, Cesar A. (2010) experimental approaches for the generation of induced pluripotent stem cells
73  Lowry, William E. (2010) roadblocks en route to the clinical application of induced pluripotent stem cells
73  Archacka, Karolina (2010) induced pluripotent stem cells – hopes, fears and visions
73  Yoshida, Yoshinori (2010) recent stem cell advances: induced pluripotent stem cells for disease modeling and stem cell-based regeneration
73  Rashid, S. Tamir (2010) induced pluripotent stem cells – alchemist's tale or clinical reality?
68  Kun, Gabriel (2010) gene therapy, gene targeting and induced pluripotent stem cells: applications in monogenic disease treatment
65  Robbins, Reiesha D. (2010) inducible pluripotent stem cells: not quite ready for prime time?
Table 8.4 Most cited references

Citation counts  References                                          Cluster #
2,486            Pittenger MF, 1999, Science, v284, p143             9
2,223            Thomson JA, 1998, Science, v282, p1145              12
2,102            Reya T, 2001, Nature, v414, p105 [Review]           15
1,841            Takahashi K, 2006, Cell, v126, p663                 7
1,583            Takahashi K, 2007, Cell, v131, p861                 7
1,273            Yu JY, 2007, Science, v318, p1917                   7
1,145            Jain RK, 2005, Science, v307, p58                   19
1,061            Jiang YH, 2002, Nature, v418, p41                   9
1,030            Evans MJ, 1981, Nature, v292, p154                  12
945              Al-Hajj M, 2003, P Natl Acad Sci USA, v100, p3983   15

The most cited article in this cluster, Takahashi 2006 (Takahashi and Yamanaka
2006), demonstrated that pluripotent stem cells can be generated directly from
mouse somatic cells by introducing only a few defined factors, as opposed to
transferring nuclear contents into oocytes, or egg cells. Their work is a major
milestone. The second most cited reference (Takahashi et al. 2007), from the same
group of researchers, further advanced the state of the art by demonstrating that
differentiated human somatic cells can be reprogrammed into pluripotent stem cells
using the same factors identified in their previous work. As it turned out, the work
represented by these two highly ranked papers was awarded the 2012 Nobel Prize in
Medicine.
Cluster #7 consists of 40 co-cited references. The ten selected citing articles were
all published in 2010, and each cites 65–95 % of these references. The article with the
highest citation coverage, 95 %, is by Stadtfeld et al. Unlike works that
aim to refine and improve the ways to produce iPSCs, their primary concern was
whether iPSCs are equivalent, molecularly and functionally, to blastocyst-derived
embryonic stem cells. The Stadtfeld article itself belongs to the cluster. Other citing
articles also question some of the fundamental assumptions or call for more
research before further clinical development in regenerative medicine.
The most cited articles are usually regarded as landmarks due to their
groundbreaking contributions (See Table 8.4). Cluster #7 has three articles among the
top ten landmark articles, and each of Clusters #9, #12, and #15 has two. The most cited
article in our dataset is Pittenger MF (1999) with 2,486 citations, followed by Thomson
JA (1998) with 2,223 citations. The third is a review article by Reya T (2001).
The articles in the 4th–6th positions are all from Cluster #7, namely Takahashi K (2006),
Takahashi K (2007), and Yu JY (2007). These three are also the most recent articles
on the list, suggesting that they have inspired intense interest in induced pluripotent
stem cells.
A citation burst has two attributes: the intensity of the burst and the duration of
the burst status. Table 8.5 lists the references with the strongest citation bursts across
the entire dataset during the period of 2000–2011. The first four articles with strong
citation bursts are from Cluster #7 on iPSCs. Interestingly, one 2009 article (again
Table 8.5 References with the strongest citation bursts

Citation bursts  References  Cluster #
124.73  Takahashi K, 2006, Cell, v126, p663  7
121.36  Takahashi K, 2007, Cell, v131, p861  7
81.37  Yu JY, 2007, Science, v318, p1917  7
71.24  Okita K, 2008, Science, v322, p949  7
66.23  Meissner A, 2008, Nature, v454, p766  13
63.12  Vierbuchen T, 2010, Nature, v463, p1035  8
62.54  Zhou HY, 2009, Cell Stem Cell, v4, p381  7

Table 8.6 Structurally and temporally significant references

Sigma  Burst  Centrality  Citations  References  Cluster #
377340.46  124.73  0.11  1,841  Takahashi K, 2006, Cell, v126, p663  7
29079.18  37.38  0.32  202  Bjornson CRR, 1999, Science, v283, p534  9
195.15  121.36  0.04  1,583  Takahashi K, 2007, Cell, v131, p861  7
58.91  81.37  0.05  1,273  Yu JY, 2007, Science, v318, p1917  7
15.97  19.53  0.15  130  Kiger AA, 2000, Nature, v407, p750  17

in Cluster #7) and one 2010 article (in Cluster #8, a small cluster) were detected to
have considerable citation bursts. The leader of the group that authored
the top two references was awarded the 2012 Nobel Prize in Medicine.
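Burst detection in CiteSpace is based on Kleinberg's (2002) algorithm. The following is a minimal two-state sketch of the underlying idea rather than CiteSpace's actual implementation; the parameters s (rate inflation) and gamma (cost of entering the burst state) are illustrative defaults, and the burst intensity corresponds to the likelihood gain of the burst state over the baseline.

    import math

    def detect_bursts(r, d, s=2.0, gamma=5.0):
        """Two-state burst detection in the spirit of Kleinberg (2002).
        r[t]: citations to the reference in year t; d[t]: all citations
        observed in year t. Returns one state per year (1 = bursting)."""
        n = len(r)
        p0 = sum(r) / sum(d)        # baseline citation rate
        p1 = min(0.999, s * p0)     # elevated rate of the burst state
        def cost(t, p):             # negative log-likelihood of year t
            return -(r[t] * math.log(p) + (d[t] - r[t]) * math.log(1 - p))
        INF = float("inf")
        best = [[INF, INF] for _ in range(n)]  # best[t][state]
        back = [[0, 0] for _ in range(n)]      # backpointers
        best[0] = [cost(0, p0), cost(0, p1) + gamma]
        for t in range(1, n):
            for q in (0, 1):
                p = p1 if q else p0
                for prev in (0, 1):
                    trans = gamma if (prev, q) == (0, 1) else 0.0
                    c = best[t - 1][prev] + trans + cost(t, p)
                    if c < best[t][q]:
                        best[t][q], back[t][q] = c, prev
        states = [0] * n
        states[-1] = int(best[-1][1] < best[-1][0])
        for t in range(n - 1, 0, -1):
            states[t - 1] = back[t][states[t]]
        return states

    # A citation spike in the middle years shows up as burst states:
    print(detect_bursts([1, 2, 2, 30, 40, 5], [100, 110, 120, 130, 140, 150]))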
The Sigma metric measures both the structural centrality and the citation burstness of a
cited reference. A reference that is strong on both measures has a higher Sigma
value than a reference that is strong on only one of the two.
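In CiteSpace, Sigma is defined as (centrality + 1) raised to the power of the burst strength (Chen et al. 2010). A quick sanity check against Table 8.6 (small discrepancies arise because the table's centrality values are rounded):

    def sigma(centrality, burst):
        # Sigma = (betweenness centrality + 1) ** burst strength
        return (centrality + 1.0) ** burst

    # Takahashi 2006: centrality 0.11, burst 124.73 -> on the order of
    # 10**5, consistent with the reported Sigma of 377,340.46.
    print(sigma(0.11, 124.73))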
As shown in Table 8.6, the pioneering iPSCs article by Takahashi (2006) has
the highest Sigma of 377,340.46, which means that it is structurally essential and,
judging by its strong citation burst, highly inspirational. The second highest work by this
measure is a 1999 article in Science by Bjornson et al. (1999). They reported an
experiment in which neural stem cells were found to have a wider differentiation
potential than previously thought, as they evidently produced a variety of blood
cell types.

8.2.3 System-Level Indicators

The modularity of a network measures the degree to which the nodes of the network
can be divided into groups such that nodes are more tightly connected within the
same group than between different groups. The collective intellectual
structure of the knowledge of a scientific field can be represented as associated
networks of co-cited references. Such networks evolve over time: newly published
articles may introduce profound structural variation or have little or no impact on
the structure.
Fig. 8.7 The modularity of the network dropped considerably in 2007 and even more in 2009,
suggesting that some major structural changes took place in these 2 years in particular

Figure 8.7 shows the change in the modularity of the networks over time. Each network
is constructed based on a 2-year sliding window. The number of publications per
year increased considerably. Notably, the modularity dipped in 2007 and
bounced back to its previous level before dropping even further in 2009. Based
on this observation, it is plausible that groundbreaking works appeared in 2007 and
2009. We will therefore specifically investigate potential emerging trends in these
2 years.
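A minimal sketch of this sliding-window procedure with networkx, assuming a dictionary that maps each year to that year's co-citation edges; the greedy community detection used here is illustrative and not necessarily CiteSpace's exact method:

    import networkx as nx
    from networkx.algorithms.community import (greedy_modularity_communities,
                                               modularity)

    def modularity_by_window(edges_by_year, start, end, width=2):
        """Modularity of the co-citation network in each sliding window.
        edges_by_year: {year: [(ref_a, ref_b), ...]}."""
        result = {}
        for y in range(start, end - width + 2):
            G = nx.Graph()
            for yy in range(y, y + width):
                G.add_edges_from(edges_by_year.get(yy, []))
            if G.number_of_edges() == 0:
                continue
            communities = greedy_modularity_communities(G)
            result[y] = modularity(G, communities)
        return result

    # A sharp drop in result[y], as observed in 2007 and 2009, signals
    # that new publications blurred the boundaries between clusters.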
Which publications in 2007 would explain the significant decrease in the
modularity of the network formed from publications prior to 2007? If a 2007
publication has a subsequent citation burst, then we expect that this publication
played an important role in changing the overall intellectual structure. Eleven
publications in 2007 were found to have subsequent citation bursts (Table 8.7).
Notably, Takahashi 2007 and Yu 2007 top the list. Both represent pioneering
investigations of reprogramming human body cells into iPSCs, and both have ongoing
citation bursts that began in 2009. Other articles on the list address the pluripotency
of stem cells in relation to human cancers, including colon cancer and pancreatic cancer.
Two review articles on regenerative medicine and tissue repair were published in 2007
and have had citation bursts since 2010. These observations suggest that the modularity
change in 2007 indicates an emerging trend in research on human induced
pluripotent stem cells. The trend is current and active, as shown by the
number of citation bursts associated with publications in 2007 alone.
If the modularity change in 2007 indicates an emerging trend in human iPSCs
research, what caused the even more profound modularity change in 2009?
Table 8.7 Articles published in 2007 with subsequent citation bursts in descending order of local citation counts

References  Local citations  Title  Burst  Duration (range 2000–2011)
Takahashi et al. (2007)  1,583  Induction of pluripotent stem cells from adult human fibroblasts by defined factors  121.36  2009–2011
Yu et al. (2007)  1,273  Induced pluripotent stem cell lines derived from human somatic cells  81.37  2009–2011
Wernig et al. (2007)  640  In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state  26.70  2008–2009
O'Brien et al. (2007)  438  A human colon cancer cell capable of initiating tumour growth in immunodeficient mice  18.13  2008–2009
Ricci-Vitiani et al. (2007)  427  Identification and expansion of human colon-cancer-initiating cells  8.83  2008–2009
Li et al. (2007)  299  Identification of pancreatic cancer stem cells  9.78  2008–2008
Mikkelsen et al. (2007)  283  Genome-wide maps of chromatin state in pluripotent and lineage-committed cells  19.59  2010–2011
Laflamme et al. (2007)  265  Cardiomyocytes derived from human embryonic stem cells in pro-survival factors enhance function of infarcted rat hearts  16.48  2010–2011
Gimble et al. (2007) [R]  247  Adipose-derived stem cells for regenerative medicine  25.19  2010–2011
Phinney and Prockop (2007) [R]  229  Concise review: mesenchymal stem/multipotent stromal cells: the state of transdifferentiation and modes of tissue repair—current views  16.52  2010–2011
Khang et al. (2007) [In Korean]  90  Recent and future directions of stem cells for the application of regenerative medicine  35.25  2008–2009

The cluster responsible for the 2009 modularity change is Cluster #7, induced
pluripotent stem cell (iPSC). On the one hand, the cluster contains Takahashi 2006
and Takahashi 2007, which pioneered the human iPSCs trend. On the other hand,
the cluster contains many recent publications: the average year of publication of the
articles in this cluster is 2008. We therefore examine the members of this cluster
closely, focusing especially on 2009 publications.
The impact of Takahashi 2006 and Takahashi 2007 is so profound that their
citation rings would overshadow all other members of Cluster #7. After excluding
their overshadowing citation rings from the display, it becomes apparent that this
cluster is full of articles with citation bursts, which are shown as red citation rings.
We labeled the ones published in 2009, along with two 2008 articles and one 2010
article (Fig. 8.2 and Table 8.8).
The pioneering reprogramming methods introduced by Takahashi 2006 and
Takahashi 2007 modify adult cells to obtain properties similar to embryonic stem
cells, using the cancer-causing oncogene c-Myc as one of the defined factors and a
virus to deliver the genes into target cells. It was later shown that c-Myc is not
needed (Nakagawa et al. 2008). The use of viruses as the delivery vehicle raised
safety concerns about clinical applications in regenerative medicine because viral
integration into the target cells' genome might activate or inactivate critical host genes.
The search for virus-free techniques motivated a series of studies, led by an
article (Okita et al. 2008) that appeared on October 9, 2008.
What many of these 2009 articles have in common appears to be a focus on
improving earlier techniques for reprogramming human somatic cells to regain a
pluripotent state. It was realized that the original method used to induce pluripotent
stem cells has a number of possible drawbacks associated with the use of viral
reprogramming factors. Several subsequent studies investigated alternative ways
to induce pluripotent stem cells with lower risks or improved certainty. These
articles were published within a short period of time. For instance, Woltjen 2009
demonstrated a virus-independent simplification of induced pluripotent stem cell
production. On March 26, 2009, Yu et al.'s article demonstrated that human somatic
cells can be reprogrammed without genomic integration or the continued presence
of exogenous reprogramming factors. On April 23, 2009, Zhou et al.'s article
demonstrated how to avoid exogenous genetic modifications altogether by
delivering recombinant cell-penetrating reprogramming proteins directly into target
cells. Soldner 2009 reported a method that does not use viral reprogramming factors.
Kaji et al. reported a virus-free pluripotency induction method. On May 28, 2009, Kim
et al.'s article introduced a method of direct delivery of reprogramming proteins.
Vierbuchen 2010 is one of the few most recent articles found to have
citation bursts. The majority of the 2009 articles with citation bursts focused
on reprogramming human somatic cells to an undifferentiated state. In contrast,
Vierbuchen 2010 expanded the scope of reprogramming by demonstrating the
possibility of converting fibroblasts directly into functional neurons (Fig. 8.8).
Table 8.8 Articles published in 2009 with citation bursts

References  Local citations  Title  Burst  Duration (range 2000–2011)
Woltjen et al. (2009)  320  piggyBac transposition reprograms fibroblasts to induced pluripotent stem cells  52.65  2009–2011
Yu et al. (2009)  300  Human induced pluripotent stem cells free of vector and transgene sequences  59.97  2010–2011
Zhou et al. (2009)  293  Generation of induced pluripotent stem cells using recombinant proteins  62.54  2010–2011
Soldner et al. (2009)  288  Parkinson's disease patient-derived induced pluripotent stem cells free of viral reprogramming factors  53.94  2010–2011
Kaji et al. (2009)  284  Virus-free induction of pluripotency and subsequent excision of reprogramming factors  46.71  2009–2011
Kim et al. (2009a, b)  235  Generation of human induced pluripotent stem cells by direct delivery of reprogramming proteins  56.03  2010–2011
Ebert et al. (2009)  211  Induced pluripotent stem cells from a spinal muscular atrophy patient  41.91  2010–2011
Kim et al. (2009b)  194  Oct4-induced pluripotency in adult neural stem cells  31.87  2009–2011
Vierbuchen et al. (2010)  193  Direct conversion of fibroblasts to functional neurons by defined factors  63.12  2010–2011
Lister et al. (2009)  161  Human DNA methylomes at base resolution show widespread epigenomic differences  51.93  2010–2011
Chin et al. (2009)  158  Induced pluripotent stem cells and embryonic stem cells are distinguished by gene expression signatures  45.39  2010–2011
Discher et al. (2009)  149  Growth factors, matrices, and forces combine and control stem cells  43.14  2010–2011
Hong et al. (2009)  138  Suppression of induced pluripotent stem cell generation by the p53–p21 pathway  43.71  2010–2011
Slaughter et al. (2009)  97  Hydrogels in regenerative medicine  31.68  2010–2011

Fig. 8.8 Many members of Cluster #7 are found to have citation bursts, shown as red citation
rings. Chin MH 2009 and Stadtfeld M 2010, at the bottom of the cluster, represent a theme that
differs from the cluster's other themes

8.2.4 Emerging Trends

Two articles of particular interest appear at the lower end of Cluster #7: Chin et al.
(2009) and Stadtfeld et al. (2010). Chin et al.'s article has 158 citations within
the dataset, with a citation burst detected since 2010. Chin et al.
questioned whether induced pluripotent stem cells (iPSCs) are indistinguishable
from embryonic stem cells (ESCs). Their investigation suggested that iPSCs should
be considered a unique subtype of pluripotent cell.
The co-citation network analysis identified several articles that cite the work
by Chin et al. In order to establish whether Chin et al. represents the beginning of
a new emerging trend, we inspected these citing articles, listed in Table 8.9. Stadtfeld
2010 is itself the most cited of these citing articles, with 134 citations. Like Chin
et al., Stadtfeld 2010 addresses the question of whether iPSCs are molecularly and
functionally equivalent to blastocyst-derived embryonic stem cells. Their work
identified the role of the Dlk1-Dio3 gene cluster in association with the level of induced
pluripotency. In other words, these studies focus on the mechanisms that govern induced
pluripotency, which can be seen as a trend distinct from the earlier trend of
improving reprogramming techniques. Table 8.9 includes two review articles cited
by Stadtfeld 2010.
Table 8.9 Articles that cite Chin et al.’s 2009 article (Chin et al. 2009) and their citation counts
as of November 2011
Article Citations Title
Stadtfeld et al. (2010) 134 Aberrant silencing of imprinted genes on chromosome
12qF1 in mouse induced pluripotent stem cells
Boland et al. (2009) 109 Adult mice generated from induced pluripotent stem cells
Feng et al. (2010) 72 Hemangioblastic derivatives from human induced
pluripotent stem cells exhibit limited expansion and
early senescence
Kiskinis and Eggan 59 Progress toward the clinical application of patient-specific
(2010) [R] pluripotent stem cells
Laurent et al. (2011) 48 Dynamic changes in the copy number of pluripotency and
cell proliferation genes in human ESCs and iPSCs
during reprogramming and time in culture
Bock et al. (2011) 31 Reference maps of human ES and iPS cell variation
enable high-throughput characterization of pluripotent
cell lines
Zhao et al. (2011) 22 Immunogenicity of induced pluripotent stem cells
Boulting et al. (2011) 17 A functionally characterized test set of human induced
pluripotent stem cells
Young (2011) [R]a 16 Control of the embryonic stem cell state
Ben-David and Benvenisty 11 The tumorigenicity of human embryonic and induced
(2011) [R]a pluripotent stem cells
[R] Review articles
(a) Cited by Stadtfeld et al. (2010)

The new emerging trend is concerned with the equivalence of iPSCs and their
human embryonic stem cell counterparts in terms of their short- and long-term
functions. This trend has critical implications for the therapeutic potential of
iPSCs. In addition to the works by Chin et al. and Stadtfeld et al., an article
published on August 2, 2009 by Boland et al. (2009) reported an investigation of
mice derived entirely from iPSCs. Another article (Feng et al. 2010), which appeared
on February 12, 2010, investigated abnormalities such as limited expansion and early
senescence found in human iPSCs. The Stadtfeld 2010 article (Stadtfeld et al. 2010)
discussed earlier appeared on May 13, 2010.
Some of the more recent articles citing Chin et al. focused on providing
resources for more stringent evaluative and comparative studies of iPSCs. On
January 7, 2011, an article (Laurent et al. 2011) reported a study of genomic
stability and abnormalities in pluripotent stem cells and called for frequent genomic
monitoring to assure phenotypic stability and clinical safety. On February 4, 2011,
Bock et al. (2011) published genome-wide reference maps of DNA methylation
and gene expression for 20 previously derived human ES cell lines and 12 human iPS
cell lines. In a more recent article (Boulting et al. 2011), published on February 11,
2011, Boulting et al. established a robust resource consisting of 16 iPSC lines and
a stringent test of differentiation capacity.
iPSCs are characterized by their self-renewal and their versatile ability to differentiate
into a wide variety of cell types. These properties are invaluable for regenerative
medicine. However, the same properties also make iPSCs tumorigenic, or cancer
prone. In a review article published in April 2011, Ben-David and Benvenisty (Ben-
David and Benvenisty 2011) reviewed the tumorigenicity of human embryonic stem
cells and iPSCs. Zhao et al. challenged a generally held assumption concerning the
immunogenicity of iPSCs in an article (Zhao et al. 2011) published on May 13, 2011.
The immunogenicity of iPSCs has clinical implications for therapeutically valuable
cells derived from patient-specific iPSCs.
In summary, a series of more recent articles have re-examined several fundamental
assumptions and properties of iPSCs, with deeper consideration of the clinical and
therapeutic implications for regenerative medicine (Patterson et al. 2012) (Fig. 8.9).

Fig. 8.9 A network of the regenerative medicine literature shows 2,507 co-cited references cited
by the top 500 publications per year between 2000 and 2011. The work associated with the two
labelled references was awarded the 2012 Nobel Prize in Medicine

8.2.5 Lessons Learned

The analysis of the literature of regenerative medicine, together with a citation-based
expansion, has outlined the evolutionary trajectory of the collective knowledge over
the last decade and highlighted the areas of active pursuit. The emerging trends and
patterns identified in the analysis are based on computational properties selected by
CiteSpace, which is designed to facilitate sense-making tasks concerning scientific
frontiers based on the relevant domain literature.
Regenerative medicine is a fascinating and fast-moving subject. As
information scientists, we have demonstrated a scientometric approach to tracking
the advance of the collective knowledge of a dynamic scientific community. The
approach taps into what domain experts have published in the literature and shows
how information and computational techniques can help us discern patterns and trends
at various levels of abstraction, namely cited references and clusters of co-cited
references.
Based on the analysis of structural and temporal patterns of citations and co-
citations, we have identified two major emerging trends. The first started in
2007 with pioneering works on human induced pluripotent stem cells (iPSCs),
including subsequently refined and alternative reprogramming techniques. The
second started in 2009 with an increasingly broad range of examinations
and re-examinations of previously unchallenged assumptions with clinical and
therapeutic implications for regenerative medicine, including the tumorigenicity and
immunogenicity of iPSCs. It is worth noting that these observations are based solely
on scientometric patterns revealed by CiteSpace, without prior working experience
in the field of regenerative medicine.
The referential expansion of the original topic search on regenerative medicine
revealed a much wider spectrum of intellectual dynamics. The visual analysis of
the broader domain outlines the major milestones throughout the extensive period
of 2000–2011. Several indicators and observations converge on the critical and
active role of Cluster #7 on iPSCs. By tracing interrelationships along citation links
and citation bursts, visual analytic techniques of scientometrics can guide
our attention to some of the most vibrant and rapidly advancing research fronts
and identify the strategic significance of the various challenges addressed by highly
specialized technical articles. The number of review articles on relevant topics is
increasing rapidly, which is another sign that the knowledge of regenerative medicine
has been advancing quickly. We expect that visual analytic tools such as those utilized
in this review will play a more active role as a supplement to traditional review and
survey articles. Visual analytic tools can be valuable for finding critical developments
in the vast amount of newly published studies.
The key findings of regenerative medicine and related research over the last
decade show that regenerative medicine has become more and more feasible
in many areas and that it will ultimately revolutionize clinical and healthcare
practice and many aspects of our society. On the other hand, the challenges
ahead are enormous. The biggest challenge is probably related to the fact that
the human being is a complex system, in which a local perturbation may lead to
unpredictable consequences in other parts of the system, which in turn may affect
the entire system. The state of the art in science and medicine has a long way to
go to handle such complex systems in a holistic way. Suppressing or activating a
seemingly isolated factor may have unforeseen consequences.
The two major trends identified in this review have distinct research agendas as
well as different perspectives and assumptions. In our opinion, the independence
of such trends at a strategic level is desirable at the initial stages of emerging
trends, so as to maximize a knowledge gain that is unlikely to be achieved by a
single line of research alone. In the long run, more trends are expected to emerge,
probably from the least expected perspectives, and existing trends may be
accommodated by new levels of integration. We expect that safety and uncertainty
will remain the central concerns of regenerative medicine.

8.3 Retraction

The reproducibility of the results in a scientific article is a major cornerstone
of science. If fellow scientists follow the procedure described in a scientific
publication, will they be able to reproduce the results of the original
publication? If not, why not? A published scientific article is subject to the
scrutiny of fellow scientists, the authors' own institutions, and everyone who may
be concerned, including patients, physicians, and regulatory bodies.
The retraction of a scientific article is a formal action taken to purge the
article from the scientific literature on the grounds that the article in question is not
trustworthy and is therefore disqualified from being part of the intellectual basis of
scientific knowledge. Retraction is a self-correction mechanism of the scientific
community.
Scientific articles can be retracted for a variety of reasons, ranging from self-
plagiarism and editorial errors to scientific misconduct, which may include the
fabrication and falsification of data and results. The consequences of these diverse
types of mistakes differ, and some are easier to detect than others. For example, clinical
studies contaminated by fabricated data or results may directly risk the safety
of patients, whereas publishing a set of valid results simultaneously in multiple
journals is unethical but less likely to harm patients directly. On the
one hand, some retracted articles may remain controversial even after their
retraction. For example, the Lancet partially retracted a 1998 paper (Wakefield et al.
1998) that suggested a possible link between the combined vaccine against
measles, mumps, and rubella and autism; the full retraction of the Lancet
article did not come until 2010. On the other hand, the influence of other retracted
articles may come to an end more abruptly after their retraction, for example, the
fabricated stem cell clone by Woo-Suk Hwang (Kakuk 2009).
The rate of retraction from the scientific literature appears to be increasing. For
example, retractions in MEDLINE were found to have increased sharply since 1980,
with reasons for retraction including errors or non-reproducible findings (40 %),
research misconduct (28 %), redundant publication (17 %), and unstated or unclear
reasons (5 %) (Wager and Williams 2011). We verified the increase of retractions in
PubMed on 3/29/2012. As shown in Fig. 8.10, the total number of annual publications
in PubMed increased from slightly more than 543,000 articles in 2001 to more
than 984,000 articles in 2011. The increase is remarkably steady, at about 45,000
new articles per year. The rate of retracted articles is calculated as the number of
eventually retracted articles published in a year divided by the total number of
articles published in the same year in PubMed. The rate of retraction is the number
of retraction notices issued each year divided by the total number of publications in
PubMed in the same year. The retraction rate in 2001 was 0.00005; it has doubled
three times since then, in 2003, 2006, and 2011, reaching 0.00046 in 2011.
Figure 8.10 shows that the number of retracted articles per year peaked in 2006. The
blue line is the retraction rate, which is growing fast; the red line is the actual number
of retracted articles. Although fewer recent articles have so far been retracted than the
2006 peak number, we expect that this is in part due to a delay in recognizing potential
flaws in newly published articles. We will quantify the extent of such delays later in a
survival analysis.

Fig. 8.10 The rate of retraction is increasing in PubMed (As of 3/29/2012)
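The two rates defined above follow directly from yearly counts. As a worked example using the figures quoted in this section (round numbers, for illustration only):

    # Implied number of retraction notices from the quoted rates:
    print(round(0.00005 * 543_000))  # ~27 notices in 2001
    print(round(0.00046 * 984_000))  # ~453 notices in 2011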
On the one hand, the increasing awareness of mistakes in scientific studies
(Naik 2011), especially owing to the publicity of high-profile retraction and
fraud cases (Kakuk 2009; Service 2002), has led to a growing body of studies
of retractions. On the other hand, the study of retracted articles, the potential
long-run risk that these articles pose to the scientific literature, and the actions
that could be taken to reduce such risks remains relatively underrepresented, given the
urgency, possible consequences, and policy implications of the issue. We will
address some common questions concerning retracted articles. In particular, we
introduce a visual analytic framework and a set of tools that can be used to facilitate
situation awareness tasks at macroscopic and microscopic levels.
At the macroscopic level, we will focus on questions concerning retracted
articles in the broader context of the rest of the scientific literature. Given a retracted
article, which areas of the scientific literature are affected? Where are the articles
that directly cited the retracted article? Where are the articles that may be related
to the retracted article indirectly?
Table 8.10 The number of retractions found in major sources of scientific publications (As of 3/29/2012)

Sources  Items  Document type  Search criteria
PubMed (a)  2,073  Retracted article  "Retracted publication" [pt]
PubMed (a)  2,187  Retraction notice  "Retraction of publication" [pt]
Web of Science (1980–present)  1,775  Retracted article  Title contains "(Retracted article.)"
Web of Science (1980–present)  1,734  Retraction notice  Title contains "(Retraction of vol)"
Google Scholar  219  Retracted article  Allintitle: "retracted article"
Elsevier Content Syndication (CONSYN) (full text)  659  Retracted article  Title: Retracted article
(a) http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=DetailsSearch&Term=%22retracted+publication%22%5Bpublication+type%5D

At the microscopic level, we will focus on questions concerning post-retraction
citations to a retracted article. Are citations prior to retraction distinguishable from
post-retraction citations, quantitatively and qualitatively?
PubMed is the largest publicly available resource of the scientific literature,
with the most extensive coverage of scientific publications in medicine and related
disciplines. Each PubMed record has an attribute called Publication Type [pt]. The
retraction of an article is officially announced in a retraction notice, whose publication
type is "Retraction of Publication." The retracted article's
publication type is updated to "Retracted Publication." PubMed provides a list of
special queries, including one for "retracted publication."1
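The same special query can also be issued programmatically against NCBI's public E-utilities interface. A minimal sketch follows; the count returned today will naturally differ from the 2012 snapshot reported here:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Count PubMed records whose Publication Type is "Retracted Publication".
    params = urlencode({
        "db": "pubmed",
        "term": '"retracted publication"[pt]',
        "retmode": "json",
    })
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
    with urlopen(url) as resp:
        print(json.load(resp)["esearchresult"]["count"])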
The Web of Science, compiled by Thomson Reuters, has a field called Document
Type, whose values include Article, Review, Correction, and several others. The
Document Type of Correction2 is used for retractions as well as for corrections of
other kinds, such as additions and errata. The title of a retraction notice consists of
the title of the retracted article and the phrase "(Retraction of)," so that the title is
self-sufficient for identifying the retracted article. The title of a retracted article in
the Web of Science is also amended with a phrase marking the fact that it is retracted.
For example, the Wakefield paper is shown with the phrase "(Retracted article. See
vol 375, pg 445, 2010)."
In Google Scholar, retracted articles are identified by the prefix
"RETRACTED ARTICLE" in their titles. In an advanced Scholar search, one can
limit the search to all records with this phrase in the title.
Table 8.10 summarizes the number of retractions found in major sources of
scientific publications as of 3/29/2012. The PubMed search covers all available
years, whereas the Web of Science search is limited by the coverage of our
institutional subscription (1980–present).

1 http://www.ncbi.nlm.nih.gov/PubMed?term=retracted+publication+[pt]
2 Correction: correction of errors found in articles that were previously published and which have been made known after that article was published; includes additions, errata, and retractions. http://images.webofknowledge.com/WOKRS51B6/help/WOS/hs document type.html
8.3.1 Studies of Retraction

A retraction sends a strong signal to the scientific community that the retracted
article is not trustworthy and should effectively be purged from the literature. Studies
of retraction are often limited to formally retracted articles, although it is a common
belief that many more articles should have been retracted (Steen 2011). It has also
been noted that retraction should be reserved for scientific misconduct, whereas
correction is the more appropriate term for withdrawing articles with technical errors
(Sox and Rennie 2006). We outline some representative studies of retraction
below in terms of how they addressed several common questions.
Time to retraction – How long does it take, on average, for a scientific publication
to be retracted? Does the time to retraction differ between senior and junior
researchers?
Post-retraction citations – Does the retraction of an article influence how the article
is cited, quantitatively and qualitatively? How soon can one detect a decrease
in citations after retraction?
Cause of concern – How was an eventually retracted article noticed in the first place?
Are there any early signs that one can watch for to safeguard the integrity of
scientific publications?
Reasons for retraction – What are the most common reasons for retraction?
How are these common causes distributed? Should they be treated equally or
differently as far as retraction is concerned?
Deliberate or accidental – Do scientists simply make honest mistakes in good faith,
or do some of them deliberately engage in misconduct?
Table 8.11 outlines some of the most representative and commonly studied
aspects of retraction, including the corresponding references of individual studies.
Several studies found that, on average, it took about 2 years to retract a scientific
publication, and even longer for articles for which senior researchers were
responsible. Time to retraction was specifically studied in a survival
analysis (Trikalinos et al. 2008). Based on retractions made in top-cited high-
impact journals, the median survival time of eventually retracted articles was found
to be 28 months. In addition, it took much longer to retract articles authored by
senior researchers, i.e. professors, lab directors, or researchers with more than
5 years of publication records, than by junior ones.
Post-retraction citations were studied at different time points after retraction,
ranging from the next calendar year and 1 year after retraction to 3 years after
retraction. In general, citation counts tend to decrease after a retraction, but there
are outliers: some citing authors appeared to be unaware of a retraction even 23
years later.
Irreproducibility and unusually high levels of productivity are among the most
common causes of initial concern. For example, Jan Hendrik Schön fabricated
17 papers in 2 years in Science and Nature; at his peak, he produced a new paper
every 8 days (Steen 2011). Irreproducibility can be further explained in terms
of an array of specific reasons, including types of errors and deliberate
Table 8.11 Major aspects of retraction

Attributes of retraction  Findings and references
Time to retraction (months)  28 months (mean) (Budd et al. 1998); fraudulent – 28.41 months (mean), erroneous – 22.72 months (mean) (Steen 2011); 28 months (median), senior researchers implicated – 79 months, junior researchers implicated – 22 months (Trikalinos et al. 2008); case study (Korpela 2010)
Post-retraction citations (lag time)  1 year after retraction (Budd et al. 1998); 3 years after (Neale et al. 2007); next calendar year (Pfeifer and Snodgrass 1990)
Cause of concern  Irreproducibility, unusually high level of productivity (Budd et al. 1998; Steen 2011)
Reasons for retraction  Scientific misconduct, irreproducibility, errors (Wager and Williams 2011)
Types of errors  Errors in method, data or sample; duplicated publication; text plagiarism (Budd et al. 1998)
Types of misconduct  Identified or presumed; fraud, fabrication, falsification, data plagiarism (Budd et al. 1998; Neale et al. 2007; Steen 2011)
Deliberate or accidental  A higher rate of repeat offenders found in fraudulent papers than in erroneous papers (Steen 2011)
Sources of the literature  PubMed/MEDLINE (Budd et al. 1998; Neale et al. 2007; Steen 2011)

misconduct. It has been argued that, pragmatically speaking, fabricating data and
results is perceived to be much more harmful than plagiarizing a description or an
expression. For example, some researchers distinguish data plagiarism from text
plagiarism and treat data plagiarism as scientific misconduct (Steen 2011).
A sign that may differentiate deliberate fraudulent behavior from a good-faith
mistake is whether it happens repeatedly with the same researcher. A higher rate
of repeat offenders was indeed found in fraudulent papers than in erroneous papers
(Steen 2011).
Studies of retraction have focused almost exclusively on the literature of medicine,
where the stakes are high in terms of patient safety. PubMed and the Web
of Science are the major resources used in these studies. Analysts typically
searched for retracted articles and analyzed the content of retraction notices as
well as other types of information. Most of these studies appear to rely on
labor-intensive procedures with limited or no support for visual analytic tasks.
Several potentially important questions have not been adequately addressed
because of such constraints.

8.3.1.1 k-Degree Post-retraction Citation Paths

An article may cite a retracted article without being aware of the retraction.
Such citing articles may compromise the integrity of the scientific literature. Studies
of retraction have so far focused essentially on first-degree citing articles, i.e. articles
that directly cited a retracted article. Citation counts, and whether it is evident that the
citers were aware of the status of the retracted articles, are the most commonly studied
topics.
Given a published article $a_{t_0}$, retracted or not, a citation path between a
subsequently published article $a_{t_k}$ and the original article can be defined in terms
of pairwise citation relations as follows: $a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$, where $\leftarrow$ denotes
a direct citation (the article on the right directly cites the article on the left), $t_i < t_j$
if $i < j$, and the length of each segment of the path is minimized. In other words,
$a_{t_i} \leftarrow a_{t_{i+1}}$ means that $a_{t_{i+1}}$ has no direct citation to any
of the articles on the path prior to $a_{t_i}$. The length of a citation path is the number
of direct citation links included in the path. Existing studies of citations to retracted
articles are essentially limited to citation paths that contain one step only. Longer
citation paths originating from a retracted article have not been studied. It is clear
that the retraction of the first article is equivalent to the removal of the first article
from a potentially still growing path such as $a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$, because newly
published articles may unknowingly cite the last article $a_{t_k}$ without questioning the
validity of the potentially risky path. By k-degree post-retraction citation analysis,
we mean the study of such paths formed by $k$ pairwise direct citation links, as in
$a_{t_0} \leftarrow a_{t_1} \leftarrow \cdots \leftarrow a_{t_k}$.
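Given a directed citation graph, such minimal-length paths can be enumerated by breadth-first search. The following is a sketch with networkx, where an edge (x, y) means that y directly cites x; the article identifiers are illustrative. A shortest path cannot contain a citation shortcut, which matches the minimality condition in the definition above.

    import networkx as nx

    # Edge (x, y): article y directly cites article x, so paths run
    # forward in time from the retracted article.
    G = nx.DiGraph()
    G.add_edges_from([
        ("retracted", "a1"), ("a1", "a2"), ("a2", "a3"),  # a 3-step chain
        ("retracted", "b1"),
    ])

    paths = nx.single_source_shortest_path(G, "retracted")
    for target, path in paths.items():
        k = len(path) - 1
        if k >= 2:  # articles that inherit the risk only indirectly
            print(k, " -> ".join(path))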

8.3.1.2 Citation Networks Involving Retracted Articles

In recent years, tremendous advances have been made in scientometrics
(Boyack and Klavans 2010; Leydesdorff 2001; Shibata et al. 2007; Upham et al.
2010), science mapping (Chen 2006; Cobo et al. 2011; Small 1999; van Eck
and Waltman 2010), and visual analytics (Pirolli 2007; Thomas and Cook 2005).
Existing studies of citations to retracted articles have not yet incorporated these
relatively new and more powerful techniques. Conversely, researchers who have
access to the new generation of analytic tools have not applied them to the analysis
of citation networks involving retracted articles.

8.3.1.3 Citation Context

It is important to find out how much a citing article's authors know about the
current status of a retracted article when they refer to it. Previous
studies have shown that this is not always clear from the text. A retracted article may
have been cited by hundreds of subsequently published articles, and manually
examining individual citation instances is time consuming and cognitively demanding.
It is even more challenging for analysts to synthesize emergent patterns from
individual citation instances and discern changes in how a retracted article
has been cited over an extensive period of time, because retracted articles are known
to be cited continuously long after the retraction.
Table 8.12 Survival analysis of time to retraction (years)

        Estimate  Std. error  95 % confidence interval
Mean (a)  2.578   0.066       2.448–2.707
Median    2.000   0.052       1.898–2.102
(a) Estimation is limited to the largest survival time if it is censored

The provision of full-text articles makes it possible to study the context
of citations to a retracted article with computational tools. It also makes it
possible to study higher-level patterns of citations and how they change over time
with reference to retraction events.
We address these three questions and demonstrate how visual analytic methods
and tools can be developed and applied to the study of citation networks and citation
contexts involving retracted articles. There are many other issues that are important
to study, but we have decided to focus on the relatively fundamental ones.

8.3.2 Time to Retraction

In the Web of Science, the title of a retracted article includes the suffix "Retracted
article." As of 3/30/2012, there were 1,775 records of retracted articles. The distribution
of these 1,775 retracted articles since 1980 shows that retractions appear to have
peaked in 2007, with 254 retracted articles recorded in the Web of Science alone.
On the other hand, it may still be too soon to rule out the possibility of further
retrospective retractions.
It is relatively straightforward to calculate the average time between the publication
of an article and its retraction. The year of retraction is commonly retrievable
from the amended title of the article. For example, if the title of an article published
in 2010 is followed by a clause of the form (Retracted article. See vol. 194, pg. 447,
2011), then we know that the article was retracted in 2011. We loaded the data into
CiteSpace's built-in relational database and used the substring function in SQL to
extract the year of retraction from the title, counting backwards from the end of the
title, i.e. substring(title, 5, 4). We found that the mean time to retraction is 2.57 years,
or about 30 months, based on the retraction times of 1,721 retracted articles, excluding
54 records with no retraction date. The median time to retraction is 2 years, i.e.
24 months (See Table 8.12).
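Outside a relational database, the same extraction is a one-line regular expression; the sample title below is illustrative:

    import re

    title = ("Combination treatment ... "
             "(Retracted article. See vol. 374, pg. 1226, 2009)")

    # The retraction year is the four-digit number just before the final
    # parenthesis -- the same information substring(title, 5, 4) reads
    # by counting from the end of the title.
    match = re.search(r"(\d{4})\)\s*$", title)
    if match:
        print(match.group(1))  # 2009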
Figure 8.11 shows a plot of the survival function of retraction. The probability
of surviving retraction declines rapidly over the first few years after publication;
in other words, the majority of retractions take place within the first few years.
For an eventually retracted article that is 4 years old, the probability of survival
is below 0.2.
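The curve in Fig. 8.11 is a standard Kaplan–Meier estimate, which can be reproduced with the lifelines package. A minimal sketch with illustrative durations (years from publication to retraction; a 0 flag marks a censored, i.e. not yet retracted, record):

    from lifelines import KaplanMeierFitter

    durations = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8]   # illustrative data
    observed  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # 1 = retracted, 0 = censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.median_survival_time_)  # cf. the 2-year median in Table 8.12
    print(kmf.survival_function_)     # probability of surviving retraction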
Fig. 8.11 The survival function of retraction. The probability of surviving retraction for 4 years
or more is below 0.2

8.3.3 Retracted Articles in Context

Table 8.13 lists the ten most highly cited retracted articles in the Web of Science.
The 1998 Lancet paper by Wakefield et al. has the most citations, with 740; the
least cited of the ten has 366 citations. Three papers on the list were published in
Science and two in Lancet. In the remainder of this section, we will primarily focus
on these high-profile retractions in terms of their citation contexts at both macroscopic
and microscopic levels.
We are interested in depicting the context of retracted articles in a co-citation
network of a broadly defined, relevant set of scientific publications. First, we
retrieved 29,756 articles that cited 1,584 retracted articles in the Web of Science.
We then used CiteSpace to generate a co-citation network based on the collective
citation behavior of these 29,756 articles between 1998 and 2011. The top 50 %
most cited references were included in the formation of the co-citation network, with
an upper limit of 3,000 references per year. The resultant network contains 7,217
references and 155,391 co-citation links. A visualization of the co-citation network
was generated
Table 8.13 The ten most highly cited retracted articles


Citations Lead author Publication—retraction Title (retraction notice) Journal
740 Wakefield AJ 1998–2010 Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and Lancet
pervasive developmental disorder in children (See vol 375,
pg 445, 2010)
727 Reyes M 2001–2009 Purification and ex vivo expansion of postnatal human marrow Blood
mesodermal progenitor cells (See vol. 113, pg. 2370, 2009)
659 Fukuhara A 2005–2007 Visfatin: A protein secreted by visceral fat that mimics the Science
effects of insulin (See vol 318, pg 565, 2007)
618 Nakao N 2003–2009 Combination treatment of angiotensin-II receptor blocker and Lancet
angiotensin-converting-enzyme inhibitor in non-diabetic
renal disease (COOPERATE): a randomised controlled trial
(See vol. 374, pg. 1226, 2009)
512 Chang G 2001–2006 Structure of MsbA from E-coli: A homolog of the multidrug Science
resistance ATP binding cassette (ABC) transporters (See
vol 314, pg 1875, 2006)
492 Kugler A 2000–2003 Regression of human metastatic renal cell carcinoma after Nature Medicine
vaccination with tumor cell-dendritic cell hybrids (See
vol. 9, p. 1221, 2003)
433 Rubio D 2005–2010 Spontaneous human adult stem cell transformation (See Cancer Research
vol. 70, pg. 6682, 2010)
391 Gowen LC 1998–2003 BRCA1 required for transcription-coupled repair of oxidative Science
DNA damage (See vol 300, pg 1657, June 13 2003)
375 Hwang WS 2004–2006 Evidence of a pluripotent human embryonic stem cell line Science
derived from a cloned blastocyst (See vol 311, pg 335,
2006)
366 Makarova TL 2001–2006 Magnetic carbon (See vol 440, pg 707, 2006) Nature

Fig. 8.12 An overview of the co-citation contexts of retracted articles. Each dot is a cited
reference. Red dots indicate retracted articles. The numbers in front of labels indicate citation
ranking. Potentially damaging retracted articles sit in the middle of areas that are otherwise free
of red dots

and overlaid with the top ten most cited retracted articles as well as other highly
cited articles without retractions (See Fig. 8.12). Each dot in the visualization
represents an article cited by the set of 29,756 citing articles; the dots in red
are retracted articles. Lines between dots are co-citation links. The color of a co-
citation link encodes the earliest time a co-citation between the two articles was made:
the earliest times are in blue, and more recent times are in yellow and orange. The size
of a dot, or disc, is proportional to the citation count of the corresponding cited article.
The top ten most cited retracted articles are labeled in the visualization. Retracted
articles are potentially more damaging if they are located in the middle of an area of
densely co-cited articles; in contrast, isolated red dots are relatively less damaging.
This type of visualization is valuable for highlighting how deeply a retracted article is
embedded in the scientific literature.
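The backbone of such a map is a co-citation count for every pair of references. A minimal sketch of the counting step, assuming each citing article is represented by its reference list (the references shown are illustrative):

    from collections import Counter
    from itertools import combinations

    citing_articles = [
        ["Wakefield 1998", "Taylor 1999", "Madsen 2002"],
        ["Wakefield 1998", "Taylor 1999"],
    ]

    # Each citing article contributes one co-citation to every unordered
    # pair of references in its reference list.
    cocitations = Counter()
    for refs in citing_articles:
        for pair in combinations(sorted(set(refs)), 2):
            cocitations[pair] += 1

    for pair, weight in cocitations.most_common():
        print(weight, pair)   # weighted edges of the co-citation network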
Figure 8.13 shows a close-up view of the visualization in Fig. 8.12. The
retracted article by Nakao N et al. on the left, for example, has a sizable red disc,
indicating its numerous citations. Its position on a densely connected island of other
articles indicates its relevance to a significant topic. Hwang WS (slightly to the right)
and Potti A, at the lower right corner of the image, have similar citation-context
profiles. More profound impacts are likely to be found in the interconnected citation
contexts of multiple retracted articles.
Figure 8.14 shows an extensive representation of the citation context of the
retracted 2003 article by Nakao et al. First, 609 articles that cited the Nakao paper
were identified in the Web of Science. Next, 9,656 articles were retrieved because
Fig. 8.13 Red dots are retracted articles. Labeled ones are highly cited. Clusters are formed by
co-citation strengths

Fig. 8.14 An extensive citation context of a retracted 2003 article by Nakao et al. The co-citation
network contains 27,905 cited articles between 2003 and 2011. The black dot in the middle of
the dense network represents the Nakao paper. Red dots represent 340 articles that directly cited
the Nakao paper (there are 609 such articles in the Web of Science). Cyan dots represent 2,130 of
the 9,656 articles that bibliographically coupled with the direct citers

they have at least one reference in common with the 609 direct citing articles. The top
6,000 most cited references per year between 2003 and 2011 were chosen to form
a co-citation network of 27,905 references and 2,162,018 co-citation links. The
retracted Nakao paper is shown as the black dot in the middle of the map. The red
dots are 340 of the 609 direct citers available in the Web of Science. The cyan
dots share common references with the direct citers, though not necessarily with the
retracted article. The labels mark the most cited articles in this topic area, none of
which are retracted articles themselves.

8.3.4 Autism and Vaccine

The most cited of all the retracted articles in the Web of Science is the 1998 Lancet
article by Wakefield et al., for which a citation burst of 0.05 was detected. The article
was partially retracted in 2004 and fully retracted in 2010. The Lancet's retraction
notice in February 2010 noted that several elements of the 1998 paper were incorrect,
contrary to the findings of an earlier investigation, and that the paper made false
claims of an "approval" by the local ethics committee.
In order to find out what exactly was said when researchers cited this controversial
article, we studied citation sentences, i.e. the sentences that contain references
to the Wakefield paper. A set of full-text articles was obtained from Elsevier's
Content Syndication (ConSyn), which contains 3,359 scholarly journal titles
and 6,643 non-serial titles. Since the Wakefield paper is concerned with a claimed
causal relation between the combined MMR vaccine and autism, we searched ConSyn
for full-text journal articles on autism and vaccine and found 1,250 relevant
articles. The Wakefield paper was cited by 156 of these 1,250 full-text articles,
and a total of 706 citation sentences were found in the 156 citing articles. We used
the Lingo clustering method provided by Carrot2, an open source framework for
building search clustering engines,3 to cluster these citation sentences into 69 clusters.
Figure 8.15 is a visualization of the 69 clusters formed by the 706 sentences that
cited the 1998 Lancet paper. The visualization is known as a FoamTree in Carrot2
(see Chap. 9 for more details on Carrot2). The clusters with the largest areas represent
the most prominent clusters of phrases used when researchers cited the 1998 paper. For
example, inflammatory bowel disease, mumps and rubella, and association between
MMR vaccine and autism are the central topics of the citations. These topics indeed
characterize the role of the retracted Lancet paper, although in this study we did
not differentiate positive from negative citations. Identifying the orientation of an
instance of citation from its citation context, for example, the citing sentence and
its surrounding sentences, is a very challenging task even for an intelligent reader,
because the position of the argument often becomes clear only when a broader context
is taken into account, in many cases only after reading the entire paragraph.
In addition to aggregating citation sentences into clusters at a higher level of
abstraction, we developed a timeline visualization that depicts year-by-year flows
of topics, helping analysts discern changes associated with citations to the retracted
article.
3 http://project.carrot2.org/
Fig. 8.15 69 clusters formed by 706 sentences that cited the 1998 Wakefield paper

Fig. 8.16 Divergent topics in a topic-transition visualization of the 1998 Wakefield et al. article

The topic-flow visualization was constructed as follows. First, we grouped the
citation sentences by publication year. The citation sentences from each year were
clustered into topics. Similarities between topics in adjacent years were computed
in terms of the overlapping topic terms between them. Topic flows connect topics in
adjacent years whose similarity meets a user-defined threshold (See Fig. 8.16).
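A minimal sketch of the linking step, assuming each year's topics are given as sets of terms; the topic names, term sets, and threshold are illustrative:

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    topics_by_year = {
        1998: {"T1": {"mmr", "vaccine", "autism"}, "T2": {"bowel", "disease"}},
        1999: {"T3": {"mmr", "vaccine", "rubella"}},
    }

    threshold = 0.25
    for year in sorted(topics_by_year)[:-1]:
        for name, terms in topics_by_year[year].items():
            for nxt, nxt_terms in topics_by_year.get(year + 1, {}).items():
                s = jaccard(terms, nxt_terms)
                if s >= threshold:  # draw a flow between the two topics
                    print(year, name, "->", year + 1, nxt, round(s, 2))

Counting the incoming flows of each topic then indicates whether it is convergent (several predecessors), steady (one), or newly emerging (none), which is the characterization used below.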
Each topic in the flow map can be characterized as convergent, divergent, or
steady. A convergent topic in a particular year is defined
Table 8.14 Specific sentences that cite the eventually retracted 1998 Lancet paper by Wakefield
et al.
Year of citation Ref Sentence
1998 1 The report by Andrew Wakefield and colleagues confirms the clinical
observations of several paediatricians, including myself, who have
noted an association between the onset of the autistic spectrum and
the development of disturbed bowel habit
1998 1 Looking at the ages of the children in Wakefield’s study, it seems that
most of them would have been at an age when they could well have
been vaccinated with the vaccine that has since been withdrawn
1998 1 We are concerned about the potential loss of confidence in the mumps,
measles, and rubella (MMR) vaccine after publication of Andrew
Wakefield and colleagues’ report (Feb 28, p 637), in which these
workers postulate adverse effects of measles-containing vaccines
1998 1 We were surprised and concerned that the Lancet published the paper
by Andrew Wakefield and colleagues in which they alluded to an
association between MMR vaccine and a nonspecific syndrome,
yet provided no sound scientific evidence
2001 34 In 1998, Wakefield et al. [34] have published a second paper including
two ideas: that autism may be linked to a form of inflammatory
bowel disease and that this new syndrome is associated with
measles–mumps–rubella (MMR) immunization
2007 5 Vaccine scares in recent years have linked MMR vaccination with
autism and a variety of bowel conditions, and this has had an
adverse impact on MMR uptake [5]
2007 5 When comparing MMR uptake rates before (1994–1997) and after
(1999–2000) the 1998 Wakefield et al. article [5] it is seen that
prior to 1998 Asian children had the highest uptake
2010 2 This addresses a concern raised by a now-retracted article by
Wakefield et al. and adds to the body of evidence that has failed to
show a relationship between measles vaccination and autism (1, 2)

in terms of the number of related topics in the previous year: the convergent topic
sums up elements from multiple previously separate topics. In 1999, the topic of
Rubella MMR Vaccination is highlighted by an explicit label because it is associated
with several distinct topics in 1998. In 2004, the year the Lancet partially retracted
the Wakefield paper, the prominent convergent topic was Developmental Disorders;
the visualization shows that numerous distinct topics in 2003 converged into this
topic in 2004. We expect that this type of topic-flow visualization can enable new
ways of analyzing and studying the dynamics of topic transitions in specific citations
to a particular article.
Table 8.14 lists examples of sentences that cited the 1998 Lancet paper by
Wakefield et al. As early as 1998, for example, researchers were concerned about the
lack of sound scientific evidence to support the claimed association between the MMR
vaccine and inflammatory bowel disease. The adverse impact on MMR uptake is
also evident in these citation sentences. Many more analytic tasks may become
feasible with this type of text- and pattern-driven analysis at multiple levels of
granularity.
Using visualization and science mapping techniques, we have demonstrated
that many high-profile retracted articles belong to vibrant lines of research. Such
complex attachments make it even more challenging to restore the validity of
the scientific literature in a timely manner. We introduced a set of novel
and intuitive tools to facilitate the analysis and exploration of the influence of
a retracted article in terms of how it is specifically cited in the scientific
literature. We have demonstrated that topic-transition visualizations derived from
citation sentences can bridge the cognitive and conceptual gap between macroscopic
patterns and microscopic individual instances. The topic flow of citation sentences is
characterized in terms of convergent and divergent topics, which serve as conceptual
touchstones for analysts to discern the dynamics of topic transitions associated with
the perceived role of a retracted article.

8.3.5 Summary

The perceived risk introduced by retracted articles alone is only the tip of the iceberg. Many high-profile retracted articles are deeply interwoven with the scientific literature, and in many cases they are embedded in fast-moving, significant lines of research. It is essential to raise awareness that much of the potential damage introduced by a retracted article is hidden and likely to grow quietly for a long time after the retraction via indirect citations. The original awareness of the invalidity of a retracted article may be lost in subsequent citations. New tools and services are needed so that researchers and analysts can easily verify the status of a citation genealogy and ensure that the current status of the origin of the genealogy is clearly understood. Such tools should become part of the workflow of journal editors and publishers.
From a visual analytic point of view, it is essential to bring in more techniques and tools that can support analytic and sense-making tasks over dynamic and unstructured information, and that allow analysts and researchers to move back and forth freely across multiple levels of analytic and decision-making tasks. The ability to blaze a trail of evidence and arguments through an evolving space of knowledge is a critical step for the creation of scientific knowledge and for maintaining a trustworthy documentation of collective intelligence.

8.4 Global Science Maps and Overlays

Science mapping has made remarkable advances in the past decade, and powerful techniques have become increasingly accessible to researchers and analysts. In this chapter, we present some of the most representative efforts towards generating maps of science. At the highest level, the goal is to identify how scientific disciplines are interrelated: for example, how medicine and physics are connected, what topics

are shared by chemistry and geology, and how federal funding is distributed across the landscape of disciplines. Drawing a boundary line for a discipline is challenging; drawing a boundary line for a constantly evolving discipline is even more so. We will highlight some recent examples of how researchers deal with such challenges.

8.4.1 Mapping Scientific Disciplines

Derek de Solla Price was probably the first to anticipate that the Science Citation Index (SCI) might contain the information needed to reveal the structure of science. Price suggested that the appropriate units of analysis would be journals, and that aggregating journals by journal-to-journal citations would reveal the disciplinary structure of science. An estimate mentioned in (Leydesdorff and Rafols 2009) sheds light on the density of a science map at the journal level. Among the 6,164 unique journals in the 2006 SCI, there were only 1,201,562 pairs of journal citation relations out of the 37,994,896 possible connections. In other words, the density of the global science structure is 3.16 %.4 How stable is such a structure at the journal level? How volatile is the structure of science at the document level or at a topic level? Where are the activities concentrated or distributed with reference to a discipline, an institution, or an individual?
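The density estimate above is easy to reproduce. A minimal Python sketch, assuming a directed graph in which self-citation arcs are counted, so that n^2 connections are possible among n journals:

    def directed_density(n_nodes, n_edges):
        # Density of a directed journal citation graph (self-loops allowed):
        # observed arcs divided by the n^2 possible arcs.
        return n_edges / (n_nodes ** 2)

    # 1,201,562 observed journal citation relations among 6,164 journals
    print(f"{directed_density(6164, 1201562):.2%}")   # 3.16%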
A widely seen global map of science is the UCSD map, depicting 554 clusters of journals and how they are interconnected as sub-disciplines of science (See Fig. 8.17). The history of the UCSD map is described in (Borner et al. 2012).
The map was first created by Richard Klavans and Kevin Boyack in 2007 for the
University of California San Diego (UCSD). The source data for the map was
a combination of Thomson Reuters Web of Science (2001–2004) and Elsevier’s
Scopus (2001–2005). Similarities between journals were computed in 18 different
ways to form matrices of journal-journal connections. These matrices were then
combined to form a single network of 554 sub-disciplines in terms of clusters of
journals. The layout of the map was generated using the 3D Fruchterman-Reingold
layout function in Pajek. The spherical map was then unfolded to a 2D map on a
flat surface with a Mercator projection. Each cluster was manually labeled based on
journal titles in the cluster. The 2D version of the map was further simplified to a
1D circular map – the circle map. The 13 labeled regions were ordered using factor
analysis. The circle map is used in Elsevier’s SciVal Spotlight.
The goal of the UCSD map was to provide a base map for research evaluation.
With 554 clusters, it provides more categories than the subject categories of the Web
of Science. While the original goal was for research evaluation, the map is being
used as a base map to superimpose overlays of additional information in systems
such as Sci2 and VIVO.5 Soon after the creation of the UCSD map, Richard Klavans and Kevin Boyack came to the conclusion that research evaluation requires maps with clusters at the article level rather than at the journal level.

4 Assume this is a directed graph of 6,164 journals.
5 http://ivl.cns.iu.edu/km/pres/2012-borner-portfolio-analysis-nih.pdf

Fig. 8.17 The UCSD map of science. Each node in the map is a cluster of journals. The clustering
was based on a combination of bibliographic couplings between journals and between keywords.
Thirteen regions are manually labeled (Reproduced with permission)

The UCSD map was generated for UCSD to show its research strengths and competencies. Although the discipline-level map characterizes the global structure of the scientific literature, much more detail is necessary to quantify research strengths at UCSD. A similar procedure was therefore applied to generate an article-level map as opposed to a journal-level map. Clusters of articles were calculated based on co-citations. In addition to the discipline-level circle map, the paper-level clustering provides much more detailed classification information. In contrast to the 554 journal clusters, the paper-level clustering of co-cited references identified over 84,000 clusters, which are called paradigms (Fig. 8.18).
In a 2009 Scientometrics paper (Boyack 2009), Boyack described how a discipline-level map can be used to identify potential collaborations. He collected 1.35 million papers from 7,506 journals and 1,206 conference proceedings. These papers contain 29.23 million references. Similarities between papers were calculated in terms of bibliographic coupling and then aggregated to obtain similarities between journals. For each journal, the top 15 most similar journals in terms of bibliographic coupling were retained for generating the final map.
The map layout step served two purposes: one was to optimize the arrangement of the journals so that the distance between journals on the map is proportional to their dissimilarity; the other was to group individual journals into clusters based on the distance generated by the layout process.

Fig. 8.18 Areas of research leadership for China. Left: A discipline-level circle map. Right: A paper-level circle map embedded in a discipline circle map. Areas of research leadership are located at the average position of corresponding disciplines or paradigms. The intensity of the nodes indicates the number of leadership types found: Relative Publication Share (RPS), Relative Reference Share (RRS), or state-of-the-art (SOA) (Reprinted from Klavans and Boyack 2010 with permission)

The map layout was made using the VxOrd algorithm, which ignores long-range links in its layout process. The proximity of nodes in the resultant graph layout was used to identify clusters using a modified single-linkage clustering algorithm. In single linkage, the distance between two clusters is computed as the distance between the two closest elements of the two clusters. The resultant map contains 812 clusters of journals and conference proceedings (See Fig. 8.19). The map was used as a base map for a variety of overlays; in particular, the presence of an institution can be depicted with this map. A cluster drawn as a clear circle contains journal papers only, whereas a cluster drawn as a shaded circle contains proceedings papers. As the map shows, the majority of proceedings papers are located between computer science (CS) and Physics, while disciplines such as Virology are almost entirely dominated by journal papers.
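A minimal sketch of the single-linkage distance just described, assuming that cluster members are given as 2D layout coordinates (the coordinates below are illustrative):

    import math

    def single_linkage_distance(cluster_a, cluster_b):
        # Distance between two clusters under single linkage: the distance
        # between their two closest members.
        return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

    a = [(0.0, 0.0), (1.0, 0.0)]
    b = [(1.5, 0.0), (5.0, 5.0)]
    print(single_linkage_distance(a, b))   # 0.5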
More recently, Klavans and Boyack created a new global map of science based on Scopus 2010 data. The new Scopus 2010 map is a paper-level map, representing 116,000 clusters of 1.7 million papers (See Fig. 8.20). The Scopus 2010 map is a hybrid in that the clusters were generated from citations while the layout was based on text similarity. The similarities between clusters were calculated from the words in the titles and abstracts of the papers in each cluster, using the Okapi BM25 text similarity measure. The clustering step did not use a hybrid similarity based on both text and citations simultaneously. For each cluster, the 5–15 clusters with the strongest connections were retained. Labels of clusters were manually added.
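As an aside, the Okapi BM25 measure itself is easy to state in code. The following is a minimal sketch with the common default parameters k1 = 1.2 and b = 0.75; these defaults, and the toy corpus, are assumptions for illustration, not the settings used for the Scopus 2010 map:

    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
        # Okapi BM25 score of one document (a list of terms) for a query,
        # given the corpus (a list of documents) for IDF and length norms.
        n = len(corpus)
        avgdl = sum(len(d) for d in corpus) / n
        tf = Counter(doc_terms)
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]
            norm = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
            score += idf * f * (k1 + 1) / norm
        return score

    corpus = [["stem", "cell", "reprogramming"],
              ["citation", "network", "analysis"],
              ["stem", "cell", "network"]]
    print(bm25_score(["stem", "cell"], corpus[0], corpus))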

Fig. 8.19 A discipline-level map of 812 clusters of journals and proceedings. Each node is a
cluster. The size of a node represents the number of papers in the cluster (Reprinted from Boyack
2009 with permission)

Just as with the geographic base maps and thematic overlays described earlier in the book, global maps of scientific disciplines provide a convenient base map on which to depict additional thematic features. Figure 8.21 shows an example of adding a thematic overlay to the Scopus 2010 base map. The overlay superimposes a layer of orange dots on clusters in the Scopus 2010 map; the orange dots mark the papers that acknowledged the support of grants from the National Cancer Institute (NCI). The overlay provides an intuitive overview of the scope of NCI grants in the context of research areas.

8.4.2 Interdisciplinarity and Interactive Overlays

In parallel to the efforts we introduced earlier, researchers have been developing another promising approach to generating global science maps and using them to facilitate the analysis of issues concerning interrelated disciplines and the interdisciplinarity of a research program.

Fig. 8.20 The Scopus 2010 global map of 116,000 clusters of 1.7 million articles (Courtesy of
Richard Klavans and Kevin Boyack, reproduced with permission)

Ismael Rafols, a researcher at the Science Policy Research Unit (SPRU) at the University of Sussex in England, Alan Porter, a professor at the Technology Policy and Assessment Center of the Georgia Institute of Technology in the U.S.A., and Loet Leydesdorff, a professor in the Amsterdam School of Communication Research (ASCoR) at the University of Amsterdam, The Netherlands, have been studying interdisciplinary research, especially topics that pose profound societal challenges, such as climate change and the diabetes pandemic. Addressing such societal challenges requires communication and the incorporation of different bodies of knowledge, both from disparate parts of academia and from social stakeholders. Interdisciplinary research involves a great deal of cognitive diversity. How can we measure and convey such cognitive diversity to researchers and evaluators in individual disciplines? Rafols, Porter, and Leydesdorff developed what they call the science overlay mapping method to study a number of issues concerning interdisciplinary research (Rafols et al. 2010).
Figure 8.22 shows a global science overlay base map. Each node represents a Web of Science Category. Loet Leydesdorff provides a set of tools that one can use to generate an overlay on the base map.

Fig. 8.21 An overlay on the Scopus 2010 map shows papers that acknowledge NCI grants
(Courtesy of Kevin Boyack, reproduced with permission)

One of the earlier papers on science overlay maps, published in February 2009 (Leydesdorff and Rafols 2009), was featured as a fast breaking paper by Thomson Reuters' ScienceWatch in December 2009.6 Fast breaking papers are publications that have the largest percentage increase in citations in their field from one bimonthly update to the next.
The overlay method has two steps: (1) creating a global map of science as the base map, and (2) superimposing a specific set of publications, for example, from a given institution or topic. Along with the method, the researchers have made a set of tools available so that anyone can generate his or her own science overlay maps. The toolkit is freely available.7
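In principle, step (2) reduces to counting a document set's records per category and resizing the corresponding nodes of the base map. A minimal sketch, assuming the base map is given as category-to-coordinate pairs; the names and coordinates are illustrative, and the actual toolkit uses its own file formats:

    from collections import Counter
    import matplotlib.pyplot as plt

    def draw_overlay(base_map, doc_categories):
        # base_map: Web of Science Category -> (x, y) position.
        # doc_categories: one category per record in the downloaded set.
        counts = Counter(doc_categories)
        xs = [xy[0] for xy in base_map.values()]
        ys = [xy[1] for xy in base_map.values()]
        plt.scatter(xs, ys, s=10, color="lightgrey")   # the base map
        hits = [c for c in counts if c in base_map]
        plt.scatter([base_map[c][0] for c in hits],
                    [base_map[c][1] for c in hits],
                    s=[40 * counts[c] for c in hits],
                    alpha=0.6)                         # the overlay
        plt.show()

    base_map = {"Clinical Med": (0.1, 0.8), "Physics": (0.9, 0.3),
                "Computer Sci": (0.7, 0.1)}
    draw_overlay(base_map, ["Physics", "Physics", "Computer Sci"])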

6 http://archive.sciencewatch.com/dr/fbp/2009/09decfbp/09decfbpLeydET/
7 http://www.leydesdorff.net/overlaytoolkit

Fig. 8.22 A global science overlay base map. Nodes represent Web of Science Categories; grey links represent the degree of cognitive similarity. The labeled regions include Agri Sci, Ecol Sci, Geosciences, Infectious Diseases, Environ Sci & Tech, Clinical Med, Mech Eng, Chemistry, Materials Sci, Biomed Sci, Psychological Sci, Physics, Health & Social Issues, Clinical Psychology, Computer Sci, Math Methods, Social Studies, Business & MGT, and Econ Polit & Geography (Reprinted from Rafols et al. 2010 with permission)

A collection of interactive science overlay maps is maintained on a web site.8 These interactive maps allow us to explore how disciplines are related and how individual publications from an organization are distributed across the landscape. Figure 8.23 is a screenshot of one of the interactive maps, in which the mouse-over feature highlights GSK's publications associated with the discipline of clinical medicine as circled red dots.
Initially, the science overlay map was based only on the Science Citation Index (SCI); the Social Science Citation Index (SSCI) was incorporated in later versions. In spite of well-known inaccuracies in the assignment of articles to the Web of Science Categories, Rafols and Leydesdorff have shown in a series of publications that the overall structure is quite robust to changes in classifications, to the degree of aggregation (using journals rather than subject categories), and over the time period studied so far (2006–2010).
In the overlay step, an overlay map superimposes the areas of activity of a given source of publications, for example an organization or a team, as seen from its publication and referencing practices, on top of the global science base map. One can use any document set downloaded from the Web of Science as

8 http://idr.gatech.edu/maps.php

Fig. 8.23 An interactive science overlay map of Glaxo-SmithKline's publications between 2000 and 2009. The red circles are GSK's publications in clinical medicine (shown when the mouse hovers over the Clinical Medicine label) (Reprinted from Rafols et al. 2010 with permission, available at http://idr.gatech.edu/usermapsdetail.php?id=61)

an overlay. The strength of this overlay approach is that one can easily distinguish an institution whose references spread over multiple disciplinary regions from an institution with a much more focused disciplinary profile.
The flexibility of the science overlay maps has been demonstrated in studies of the interdisciplinarity of fields over time (Porter and Rafols 2009), in comparisons of departments, universities, and the R&D bases of large corporations (Rafols et al. 2010), and in tracing the diffusion of research topics across science (Leydesdorff and Rafols 2011). Figure 8.24 shows a more recent base map generated by Loet Leydesdorff in VOSviewer.

8.4.3 Dual-Map Overlays

Many citation maps are designed to show either the sources or the targets of citations in a single display, but not both. The primary reason is that a representation with a mix of citing and cited articles may considerably increase the complexity of its structure and dynamics. There does not seem to be a clear gain in combining them in a single view. Although it is conceivable that a combined structure may be desirable in situations such as a heated debate, researchers are in general more concerned with differentiating various arguments before considering how to combine them.

Fig. 8.24 A similarity map of JCR journals shown in VOSviewer

The Butterfly, designed by Jock Mackinlay and his colleagues at Xerox, shows both ends in the same view, but its focus is at the individual paper level rather than at a macroscopic level of thousands of journals (Mackinlay et al. 1995). Eugene Garfield's HistCite depicts direct citations in the literature. However, as the number of citations increases, the network tends to become cluttered, a common problem for network representations.
We introduce a dual-map overlay design that depicts both the citing and the cited overlay maps in the same view. The dual-map overlay has several advantages over a single overlay map. First, it represents a citation instance completely: one can see at a glance where it originates and where it points. Second, it makes it easy to compare the citation patterns of distinct groups of authors, for example, authors from different organizations, or authors from the same organization at different points in time. Third, it opens up research questions that can be addressed in new ways of analysis. For example, it becomes possible to study interdisciplinarity on both the source and the target side, and to track the movements of scientific frontiers in terms of their footprints on both base maps.
The construction of a dual-map base shares the initial steps of a single overlay map but differs in the later steps. Once the coordinates are available for both the citing and the cited matrices of journals, a dual-map overlay can be constructed. Cluster information is not strictly necessary, but additional functions become possible when it is available. In the rest of the description, we assume that at least one set of clusters is available for each matrix.

Fig. 8.25 The Blondel clusters in the citing journal map (left) and the cited journal map (right).
The overlapping polygons suggest that the spatial layout and the membership of clusters still
contain a considerable amount of uncertainty. Metrics calculated based on the coordinates need
to take the uncertainty into account

In this example, clusters are obtained by applying the Blondel clustering algorithm. Figure 8.25 is a screenshot of the dual-map display, containing a base map of citing journals (left) and a base map of cited journals (right).
For each journal in the citing network, its cluster membership is stored along with its coordinates. The coordinates may be obtained from a network visualization program such as VOSviewer, Gephi, or Pajek. Members of each cluster are painted in the map with the same color.
A number of overlays can be added to the dual-map base. Each overlay requires a set of bibliographic records that contain citation information, such as records retrieved from the Web of Science. The smallest set may contain a single article; there is no limit to the size of the largest set. With journal overlay maps, each citation instance is represented by an arc from its source journal in the citing base map to its target journal in the cited base map. Arcs from the same set are displayed in the same user-chosen color so that citation patterns from distinct sets can be distinguished by their unique colors.
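A minimal sketch of this arc-drawing scheme, with illustrative coordinates and journal names standing in for the two base maps:

    import matplotlib.pyplot as plt
    from matplotlib.patches import FancyArrowPatch

    def draw_citation_arcs(citing_map, cited_map, citations, color):
        # One arc per citation instance, from the source journal in the
        # citing base map (left) to the target journal in the cited base
        # map (right); all arcs of one record set share one color.
        ax = plt.gca()
        for src, dst in citations:
            ax.add_patch(FancyArrowPatch(citing_map[src], cited_map[dst],
                                         connectionstyle="arc3,rad=0.2",
                                         color=color, alpha=0.4, lw=0.8))
        ax.set_xlim(-1, 11)
        ax.set_ylim(-1, 6)

    citing = {"JASIST": (1, 2), "Scientometrics": (1, 4)}   # left map
    cited = {"Science": (9, 3), "Nature": (9, 5)}           # right map
    draw_citation_arcs(citing, cited,
                       [("JASIST", "Science"), ("Scientometrics", "Nature")],
                       color="blue")
    plt.show()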
Figure 8.26 shows a dual-map display of citations found in the publications of two iSchools between 2003 and 2012. The citation arcs made by the iSchool at Drexel University are colored in blue, whereas the arcs made by the School of Information Studies at Syracuse are in magenta. At a glance, the blue arcs on the upper part of the map identify areas in which Drexel researchers published but Syracuse researchers published little. The dual-map overlay shows that Drexel researchers not only published in areas corresponding to mathematics and systems journals, but their publications in journals in other areas are also influenced by journals related to systems, computing, and mathematics. The overlapping arcs in the lower half of the map indicate that the two institutions share their core journals in terms of where they publish.

Fig. 8.26 Citation arcs from the publications of Drexel’s iSchool (blue arcs) and Syracuse School
of Information Studies (magenta arcs) reveal where they differ in terms of both intellectual bases
and research frontiers

Fig. 8.27 h-index papers (cyan) and citers to CiteSpace (red)

As one more example, Fig. 8.27 shows a comparison between two sets of records: a set of papers on the h-index (cyan, mostly appearing in the upper half) and a set of papers citing the 2006 JASIST paper on CiteSpace II, whose arcs mostly originate from the lower right part of the base map of citing journals. The image shows that research on the h-index is widespread: it appears especially in physics journals (Blondel cluster #5) and cites journals in similar categories. In contrast, papers citing CiteSpace II are concentrated in a few journals, but they cite journals in a wide range of clusters.

In summary, global science maps provide base maps that enable interactive overlays. Dual-map overlays display the citing and cited journals in the same view, which makes it easier to compare the citation behavior of different groups in terms of their source journals and target journals.

References

Aksnes DW (2003) Characteristics of highly cited papers. Res Eval 12(3):159–170


Ben-David U, Benvenisty N (2011) The tumorigenicity of human embryonic and induced
pluripotent stem cells. Nat Rev Cancer 11(4):268–277. doi:10.1038/nrc3034
Bjornson CRR, Rietze RL, Reynolds BA, Magli MC, Vescovi AL (1999) Turning brain into blood:
a hematopoietic fate adopted by adult neural stem cells in vivo. Science 283(5401):534–537
Bock C, Kiskinis E, Verstappen G, Gu H, Boulting G, Smith ZD et al (2011) Reference maps of
human ES and iPS cell variation enable high-throughput characterization of pluripotent cell
lines. Cell 144(3):439–452
Boland MJ, Hazen JL, Nazor KL, Rodriguez AR, Gifford W, Martin G et al (2009) Adult mice gen-
erated from induced pluripotent stem cells. Nature 461(7260):91–94. doi:10.1038/nature08310
Borner K, Klavans R, Patek M, Zoss AM, Biberstine JR, Light RP et al (2012) Design and update of a classification system: the UCSD map of science. PLoS One 7(7):e39464
Bornmann L, Daniel H-D (2006) What do citation counts measure? A review of studies on citing
behavior. J Doc 64(1):45–80
Boulting GL, Kiskinis E, Croft GF, Amoroso MW, Oakley DH, Wainger BJ et al (2011) A
functionally characterized test set of human induced pluripotent stem cells. Nat Biotechnol
29(3):279–286. doi:10.1038/nbt.1783
Boyack KW (2009) Using detailed maps of science to identify potential collaborations. Sciento-
metrics 79(1):27–44
Boyack KW, Klavans R (2010) Co-citation analysis, bibliographic coupling, and direct citation:
which citation approach represents the research front most accurately? J Am Soc Info Sci
Technol 61(12):2389–2404
Boyack KW, Klavans R, Ingwersen P, Larsen B (2005) Predicting the importance of current papers. Paper presented at the proceedings of the 10th international conference of the International Society for Scientometrics and Informetrics. Retrieved from https://cfwebprod.sandia.gov/cfdocs/CCIM/docs/kwb_rk_ISSI05b.pdf
Budd JM, Sievert M, Schultz TR (1998) Phenomena of retraction: reasons for retraction and
citations to the publications. JAMA 280:296–297
Burt RS (2004) Structural holes and good ideas. Am J Sociol 110(2):349–399
Buter R, Noyons E, Van Raan A (2011) Searching for converging research using field to field
citations. Scientometrics 86(2):325–338
Chen C (2003) Mapping scientific frontiers: the quest for knowledge visualization. Springer,
London
Chen C (2006) CiteSpace II: detecting and visualizing emerging trends and transient patterns in
scientific literature. J Am Soc Info Sci Technol 57(3):359–377
Chen C (2011) Turning points: the nature of creativity. Springer, New York
Chen C (2012) Predictive effects of structural variation on citation counts. J Am Soc Info Sci
Technol 63(3):431–449
Chen C, Chen Y, Horowitz M, Hou H, Liu Z, Pellegrino D (2009) Towards an explanatory and
computational theory of scientific discovery. J Informetr 3(3):191–209
Chen C, Ibekwe-SanJuan F, Hou J (2010) The structure and dynamics of co-citation clusters: a
multiple-perspective co-citation analysis. J Am Soc Info Sci Technol 61(7):1386–1409

Chin MH, Mason MJ, Xie W, Volinia S, Singer M, Peterson C et al (2009) Induced pluripotent stem
cells and embryonic stem cells are distinguished by gene expression signatures. Cell Stem Cell
5(1):111–123
Chubin DE (1994) Grants peer-review in theory and practice. Eval Rev 18(1):20–30
Chubin DE, Hackett EJ (1990) Peerless science: peer review and U.S. science policy. State University of New York Press, Albany
Cobo MJ, Lopez-Herrera AG, Herrera-Viedma E, Herrera F (2011) Science mapping software tools: review, analysis, and cooperative study among tools. J Am Soc Info Sci Technol 62(7):1382–1402
Cuhls K (2001) Foresight with Delphi surveys in Japan. Technol Anal Strateg Manag 13(4):555–569
Dewett T, Denisi AS (2004) Exploring scholarly reputation: it's more than just productivity. Scientometrics 60(2):249–272
Discher DE, Mooney DJ, Zandstra PW (2009) Growth factors, matrices, and forces combine and
control stem cells. Science 324(5935):1673–1677
Ebert AD, Yu J, Rose FF, Mattis VB, Lorson CL, Thomson JA et al (2009) Induced pluripotent
stem cells from a spinal muscular atrophy patient. Nature 457(7227):277–280. doi:10.1038/
nature07677
Fauconnier G, Turner M (1998) Conceptual integration networks. Cognit Sci 22(2):133–187
Feng Q, Lu S-J, Klimanskaya I, Gomes I, Kim D, Chung Y et al (2010) Hemangioblastic
derivatives from human induced pluripotent stem cells exhibit limited expansion and early
senescence. Stem Cells 28(4):704–712
Fleming L, Bromiley P (2000) A variable risk propensity model of technological risk taking. Paper
presented at the applied statistics workshop. Retrieved from http://courses.gov.harvard.edu/
gov3009/fall00/fleming.pdf
Garfield E (1955) Citation indexes for science: a new dimension in documentation through
association of ideas. Science 122(3159):108–111
Gimble JM, Katz AJ, Bunnell BA (2007) Adipose-derived stem cells for regenerative medicine.
Circ Res 100(9):1249–1260
Glotzbach JP, Wong VW, Gurtner GC, Longaker MT (2011) Regenerative medicine. Curr Probl
Surg 48(3):148–212
Häyrynen M (2007) Breakthrough research: funding for high-risk research at the Academy of
Finland. The Academy of Finland, Helsinki
Hettich S, Pazzani MJ (2006) Mining for proposal reviewers: lessons learned at the National
Science Foundation. Paper presented at the KDD’06
Hilbe JM (2011) Negative binomial regression, 2nd edn. Cambridge University Press, Cambridge
Hirsch JE (2007) Does the h index have predictive power? Proc Natl Acad Sci
104(49):19193–19198
Hong H, Takahashi K, Ichisaka T, Aoi T, Kanagawa O, Nakagawa M et al (2009) Suppression of in-
duced pluripotent stem cell generation by the p53–p21 pathway. Nature 460(7259):1132–1135.
doi:10.1038/nature08235
Hsieh C (2011) Explicitly searching for useful inventions: dynamic relatedness and the costs of
connecting versus synthesizing. Scientometrics 86(2):381–404
Kaji K, Norrby K, Paca A, Mileikovsky M, Mohseni P, Woltjen K (2009) Virus-free induction of
pluripotency and subsequent excision of reprogramming factors. Nature 458(7239):771–775.
doi:10.1038/nature07864
Kakuk P (2009) The legacy of the Hwang case: research misconduct in biosciences. Sci Eng Ethics
15:545–562
Khang G, Kim SH, Kim MS, Rhee JM, Lee HB (2007) Recent and future directions of stem cells
for the application of regenerative medicine. Tissue Eng Regen Med 4(4):441–470
Kim D, Kim C-H, Moon J-I, Chung Y-G, Chang M-Y, Han B-S et al (2009a) Generation of human
induced pluripotent stem cells by direct delivery of reprogramming proteins. Cell Stem Cell
4(6):472–476

Kim JB, Sebastiano V, Wu G, Araúzo-Bravo MJ, Sasse P, Gentile L et al (2009b) Oct4-induced pluripotency in adult neural stem cells. Cell 136(3):411–419
Kiskinis E, Eggan K (2010) Progress toward the clinical application of patient-specific pluripotent
stem cells. J Clin Invest 120(1):51–59
Klavans R, Boyack KW (2010) Toward an objective, reliable and accurate method for measuring
research leadership. Scientometrics 82:539–553
Korpela KM (2010) How long does it take for scientific literature to purge itself of fraudulent
material? The Breuning case revisited. Curr Med Res Opin 26:843–847
Kostoff R (2007) The difference between highly and poorly cited medical articles in the journal
Lancet. Scientometrics 72:513–520
Laflamme MA, Chen KY, Naumova AV, Muskheli V, Fugate JA, Dupras SK et al (2007)
Cardiomyocytes derived from human embryonic stem cells in pro-survival factors enhance
function of infarcted rat hearts. Nat Biotechnol 25(9):1015–1024. doi:10.1038/nbt1327
Lahiri M, Maiya AS, Sulo R, Habiba, Berger-Wolf TY (2008) The impact of structural changes on predictions of diffusion in networks. Paper presented at the 2008 IEEE international conference on data mining workshops (ICDMW'08). Retrieved from http://compbio.cs.uic.edu/mayank/papers/LahiriMaiyaSuloHabibaBergerWolf_ImpactOfStructuralChanges08.pdf
Lambert D (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34:1–14
Laurent LC, Ulitsky I, Slavin I, Tran H, Schork A, Morey R et al (2011) Dynamic changes in the
copy number of pluripotency and cell proliferation genes in human ESCs and iPSCs during
reprogramming and time in culture. Cell Stem Cell 8(1):106–118
Levitt J, Thelwall M (2008) Patterns of annual citation of highly cited articles and the prediction
of their citation ranking: a comparison across subjects. Scientometrics 77(1):41–60
Leydesdorff L (2001) The challenge of scientometrics: the development, measurement, and self-
organization of scientific communications. Universal-Publishers, Boca Raton
Leydesdorff L, Rafols I (2009) A global map of science based on the ISI subject categories. J Am
Soc Info Sci Technol 60(2):348–362
Leydesdorff L, Rafols I (2011) Local emergence and global diffusion of research technologies: an
exploration of patterns of network formation. J Am Soc Info Sci Technol 62(5):846–860
Li C, Heidt DG, Dalerba P, Burant CF, Zhang L, Adsay V et al (2007) Identification of pancreatic
cancer stem cells. Cancer Res 67(3):1030–1037
Lipinski C, Hopkins A (2004) Navigating chemical space for biology and medicine. Nature 432(7019):855–861
Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J et al (2009)
Human DNA methylomes at base resolution show widespread epigenomic differences. Nature
462(7271):315–322. doi:10.1038/nature08514
Mackinlay JD, Rao R, Card SK (1995) An organic user interface for searching citation links. Paper
presented at the SIGCHI’95
Martin BR (2010) The origins of the concept of ‘foresight’ in science and technology: an insider’s
perspective. Technol Forecast Soc Change 77(9):1438–1447
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G et al (2007)
Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature
448(7153):553–560. doi:10.1038/nature06008
Miles I (2010) The development of technology foresight: a review. Technol Forecast Soc Change
77(9):1448–1456
Naik G (2011) Mistakes in scientific studies surge. Wall Street J. Retrieved March 16 2012, from
http://online.wsj.com/article/SB10001424052702303627104576411850666582080.html
Nakagawa M, Koyanagi M, Tanabe K, Takahashi K, Ichisaka T, Aoi T et al (2008) Generation of
induced pluripotent stem cells without Myc from mouse and human fibroblasts. Nat Biotechnol
26(1):101–106. doi:10.1038/nbt1374
Neale AV, Northrup J, Dailey R, Marks E, Abrams J (2007) Correction and use of biomedical
literature affected by scientific misconduct. Sci Eng Ethics 13:5–24

Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA
103(23):8577–8582
O’Brien CA, Pollett A, Gallinger S, Dick JE (2007) A human colon cancer cell capable of
initiating tumour growth in immunodeficient mice. Nature 445(7123):106–110.
doi:10.1038/nature05372
Okita K, Nakagawa M, Hyenjong H, Ichisaka T, Yamanaka S (2008) Generation of mouse induced
pluripotent stem cells without viral vectors. Science 322(5903):949–953
Patterson M, Chan DN, Ha I, Case D, Cui Y, Handel BV et al (2012) Defining the nature of human
pluripotent stem cell progeny. Cell Res 22(1):178–193
Persson O (2010) Are highly cited papers more international? Scientometrics 83(2):397–401
Pfeifer MP, Snodgrass GL (1990) The continued use of retracted, invalid scientific literature. J Am
Med Assoc 263:1420–1423
Phinney DG, Prockop DJ (2007) Concise review: mesenchymal stem/multipotent stromal cells:
the state of transdifferentiation and modes of tissue repair—current views. Stem Cells
25(11):2896–2902
Pirolli P (2007) Information foraging theory: adaptive interaction with information. Oxford
University Press, Oxford
Pittenger MF, Mackay AM, Beck SC, Jaiswal RK, Douglas R, Mosca JD et al (1999) Multilineage
potential of adult human mesenchymal stem cells. Science 284(5411):143–147
Polak DJ (2010) Regenerative medicine. Opportunities and challenges: a brief overview. J R Soc
Interface 7:S777–S781
Polykandriotis E, Popescu LM, Horch RE (2010) Regenerative medicine: then and now – an update
of recent history into future possibilities. J Cell Mol Med 14(10):2350–2358
Porter AL, Rafols I (2009) Is science becoming more interdisciplinary? Measuring and mapping
six research fields over time. Scientometrics 81(3):719–745
Price DD (1965) Networks of scientific papers. Science 149:510–515
Rafols I, Porter AL, Leydesdorff L (2010) Science overlay maps: a new tool for research policy
and library management. J Am Soc Info Sci Technol 61(9):1871–1887
Ricci-Vitiani L, Lombardi DG, Pilozzi E, Biffoni M, Todaro M, Peschle C et al (2007) Iden-
tification and expansion of human colon-cancer-initiating cells. Nature 445(7123):111–115.
doi:10.1038/nature05384
Service RF (2002) Bell Labs fires star physicist found guilty of forging data. Science 298:30–31
Shibata N, Kajikawa Y, Matsushima K (2007) Topological analysis of citation networks to discover
the future core articles. J Am Soc Info Sci Technol 58(6):872–882
Shibata N, Kajikawa Y, Takeda Y, Sakata I, Matsushima K (2011) Detecting emerging research
fronts in regenerative medicine by the citation network analysis of scientific publications.
Technol Forecast Soc Change 78:274–282
Slaughter BV, Khurshid SS, Fisher OZ, Khademhosseini A, Peppas NA (2009) Hydrogels in
regenerative medicine. Adv Mater 21(32–33):3307–3329
Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269
Small H (1999) Visualizing science by citation mapping. J Am Soc Inf Sci 50(9):799–813
Soldner F, Hockemeyer D, Beard C, Gao Q, Bell GW, Cook EG et al (2009) Parkinson’s
disease patient-derived induced pluripotent stem cells free of viral reprogramming factors. Cell
136(5):964–977
Sox HC, Rennie D (2006) Research misconduct, retraction, and cleansing the medical literature: lessons from the Poehlman case. Ann Intern Med 144:609–613
Stadtfeld M, Apostolou E, Akutsu H, Fukuda A, Follett P, Natesan S et al (2010) Aberrant silencing
of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells. Nature
465(7295):175–181. doi:10.1038/nature09017
Steen RG (2011) Retractions in the scientific literature: do authors deliberately commit research
fraud? J Med Ethics 37:113–117
Swanson DR (1986a) Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect
Biol Med 30:7–18
Swanson DR (1986b) Undiscovered public knowledge. Libr Q 56(2):103–118

Takahashi K, Yamanaka S (2006) Induction of pluripotent stem cells from mouse embryonic and
adult fibroblast cultures by defined factors. Cell 126(4):663–676
Takahashi K, Tanabe K, Ohnuki M, Narita M, Ichisaka T, Tomoda K et al (2007) Induction of
pluripotent stem cells from adult human fibroblasts by defined factors. Cell 131(5):861–872
Takeda Y, Kajikawa Y (2010) Tracking modularity in citation networks. Scientometrics 83(3):783
Thomas J, Cook K (2005) Illuminating the path: the research and development agenda for visual analytics. IEEE CS Press, Los Alamitos
Thomson JA, Itskovitz-Eldor J, Shapiro SS, Waknitz MA, Swiergiel JJ, Marshall VS et al (1998)
Embryonic stem cell lines derived from human blastocysts. Science 282(5391):1145–1147
Tichy G (2004) The over-optimism among experts in assessment and foresight. Technol Forecast Soc Change 71(4):341–363
Trikalinos NA, Evangelou E, Ioannidis JPA (2008) Falsified papers in high-impact journals were
slow to retract and indistinguishable from nonfraudulent papers. J Clin Epidemiol 61:464–470
Upham SP, Rosenkopf L, Ungar LH (2010) Positioning knowledge: schools of thought and new
knowledge creation. Scientometrics 83:555–581
van Dalen HP, Henkens K (2005) Signals in science: on the importance of signaling in gaining attention in science. Scientometrics 64(2):209–233
van Eck NJ, Waltman L (2010) Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84(2):523–538
Vierbuchen T, Ostermeier A, Pang ZP, Kokubu Y, Südhof TC, Wernig M (2010) Direct conver-
sion of fibroblasts to functional neurons by defined factors. Nature 463(7284):1035–1041.
doi:10.1038/nature08797
von Luxburg U (2006) A tutorial on spectral clustering. From http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Luxburg07_tutorial_4488%5b0%5d.pdf
Wager E, Williams P (2011) Why and how do journals retract articles? An analysis of Medline
retractions 1988–2008. J Med Ethics 37:567–570
Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M et al (1998) Ileal-lymphoid-
nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children
(Retracted article. See vol 375, pg 445, 2010). Lancet 351(9103):637–641
Walters GD (2006) Predicting subsequent citations to articles published in twelve crime-
psychology journals: author impact versus journal impact. Scientometrics 69(3):499–510
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature
393(6684):440–442
Weeber M (2003) Advances in literature-based discovery. J Am Soc Info Sci Technol
54(10):913–925
Wernig M, Meissner A, Foreman R, Brambrink T, Ku M, Hochedlinger K et al (2007) In vitro
reprogramming of fibroblasts into a pluripotent ES-cell-like state. Nature 448(7151):318–324.
doi:10.1038/nature05944
Woltjen K, Michael IP, Mohseni P, Desai R, Mileikovsky M, Hamalainen R et al (2009)
piggyBac transposition reprograms fibroblasts to induced pluripotent stem cells. Nature
458(7239):766–770. doi:10.1038/nature07863
Young RA (2011) Control of the embryonic stem cell state. Cell 144(6):940–954
Yu J, Vodyanik MA, Smuga-Otto K, Antosiewicz-Bourget J, Frane JL, Tian S et al (2007) Induced
pluripotent stem cell lines derived from human somatic cells. Science 318(5858):1917–1920
Yu J, Hu K, Smuga-Otto K, Tian S, Stewart R, Slukvin II et al (2009) Human induced pluripotent
stem cells free of vector and transgene sequences. Science 324(5928):797–801
Zeileis A, Kleiber C, Jackman S (2011) Regression models for count data in R. From http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
Zhao T, Zhang Z-N, Rong Z, Xu Y (2011) Immunogenicity of induced pluripotent stem cells.
Nature 474(7350):212–215. doi:10.1038/nature10135
Zhou H, Wu S, Joo JY, Zhu S, Han DW, Lin T et al (2009) Generation of induced pluripotent stem
cells using recombinant proteins. Cell Stem Cell 4(5):381–384
Chapter 9
Visual Analytics

Visual analytics is regarded as the second generation of computer-supported visual thinking, after information visualization. The widely known mission of information visualization is to obtain insights by laying out information in front of us. Gestalt psychology has played an implicit but significant role, because we expect to benefit from emergent patterns and properties that only make sense when we look at the relevant information as a whole. No wonder Shneiderman's mantra for visual information seeking starts with an overview.
Visual analytics focuses on analytic reasoning and decision making. Although sense making is one part of the analytic process, the outcome of visual thinking is no longer insight alone; decision making has to be taken into account as well. The emphasis on making decisions with incomplete information of potentially high uncertainty is fundamental to visual analytics. Evidence becomes an integral part of the decision system, in terms of its quality, provenance, and credibility, and the implications of updated evidence. Visual analytics sets insights in context and drives the process of visual thinking towards a realistic resolution of a complex situation. In this chapter, we describe a few systems in the broadly defined field of visual analytics, highlighting how each system is designed to facilitate the reasoning and decision-making process. Figure 9.1 shows a screenshot of GeoTime, a visual analytic system for investigating events in both spatial and temporal dimensions.

9.1 CiteSpace

CiteSpace is a Java application for visualizing and analyzing emerging trends and patterns in the scientific literature. The design of CiteSpace is motivated by two ambitious goals. One is to provide a computational alternative to supplement traditional systematic reviews and surveys of a body of scientific literature. The other is to provide an analytic tool with which one can study the structure and dynamics of scientific paradigms in the sense defined by Thomas Kuhn.

Fig. 9.1 A screenshot of GeoTime (Reprinted from Eccles et al. 2008)

The primary source of input for CiteSpace is a body of scientific literature, namely bibliographic records from the Web of Science or full-text versions of publications.
The general assumption is that the study of such input data will allow us to address two fundamental questions that systematic reviews and surveys intend to address:
1. What is the persistent core of the literature?
2. What are the transient trends that have appeared or are emerging in the literature?
The persistent core of a body of literature corresponds to the intellectual base of a field of study; the transient trends correspond to scientific frontiers. Researchers have come to see scientific knowledge as the constant movement of scientific frontiers: the state of the art today may or may not survive into the future, and only time can tell whether an exciting new theory will earn its position in the history of science.
We use co-citations of references as the basic organizing mechanism. In other words, we construct a global structure from local details. Individual scientists and domain experts provide their input as they publish their work in the literature. As they cite previously published works, they leave footprints that carry information about their preferences, intents, criticisms, and interpretations. In this way, citations provide a valuable source of information with which to identify and measure the value of a scientific idea, a discovery, or a theory.
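Co-citation counting itself is simple. A minimal sketch, assuming each bibliographic record has been reduced to its list of cited references (the identifiers are illustrative):

    from collections import Counter
    from itertools import combinations

    def cocitation_counts(records):
        # Two references are co-cited whenever they appear together in
        # the reference list of the same citing paper.
        counts = Counter()
        for refs in records:
            for pair in combinations(sorted(set(refs)), 2):
                counts[pair] += 1
        return counts

    records = [["Garfield1955", "Price1965", "Small1973"],
               ["Price1965", "Small1973"],
               ["Garfield1955", "Small1973"]]
    print(cocitation_counts(records)[("Price1965", "Small1973")])   # 2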

Fig. 9.2 CiteSpace labels clusters with title terms of articles that cite corresponding clusters

CiteSpace supports a series of functions that transform bibliographic data into interactive visualizations of networks. Users can choose a window of analysis. CiteSpace divides the entire window of analysis into a sequence of consecutive time intervals, called time slices. Citation behaviors observed within each time slice are used to construct a network model. Networks over adjacent time slices are merged to form a network over a longer period of time.
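A minimal sketch of the time-slicing step, assuming each bibliographic record carries a year field; the slice length and years are illustrative:

    def time_slices(records, start, end, slice_length):
        # Divide records into consecutive time slices; a network model is
        # then built per slice and adjacent networks are merged.
        slices = []
        for lo in range(start, end + 1, slice_length):
            hi = min(lo + slice_length - 1, end)
            slices.append([r for r in records if lo <= r["year"] <= hi])
        return slices

    records = [{"year": 1998}, {"year": 2001}, {"year": 2004}, {"year": 2005}]
    print([len(s) for s in time_slices(records, 1998, 2005, 2)])   # [1, 1, 0, 2]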
Synthesized networks can be divided into clusters of co-cited references. Each cluster contains a set of references. The formation of a cluster results from the citation behavior of a group of scientists who are concerned with the same set of research problems. A cluster can therefore be seen as the footprint of an invisible college. As the invisible college changes its research focus, its footprints move across the landscape of scientific knowledge, and the cluster evolves accordingly. For example, it may continue to grow in size, branch out into several smaller clusters, or join other clusters. It may even be phased out as the invisible college drifts away from an old line of research altogether.
CiteSpace provides three algorithms to label a cluster: the traditional tf*idf, log-likelihood ratio, and mutual information (Fig. 9.2). Label terms are selected from the titles, keywords, or abstracts of articles that specifically cite members of the cluster. If the members of a cluster represent the footprints of an invisible college or a paradigm, the labels reflect what the invisible college and the paradigm are currently concerned with, which may or may not be consistent with the direction of the cluster.

Fig. 9.3 Citations over time are shown as tree rings. Tree rings in red depict the years an
accelerated citation rate was detected (citation burst). Three areas emerged from the visualization

These clusters represent the intellectual base of a paradigm, whereas the citing articles associated with a cluster represent the research fronts. The same intellectual base may sustain more than one research front.
CiteSpace identifies noteworthy patterns in terms of structural and temporal properties. Structural properties include the betweenness centrality of a cited reference, at both the individual article level and the aggregated level of clusters. Temporal properties include citation bursts, which capture the acceleration of citations within a short period of time. It has been shown that these indicators capture the research focuses of the underlying scientific community (Chen 2012; Chen et al. 2010; Small 1973).
CiteSpace characterizes emerging trends and patterns of change in such networks with a variety of visual attributes. The size of a node indicates how many citations the associated reference received. Each node is depicted with a series of citation tree rings across the series of time slices. The structural properties of a node are displayed as a purple ring, whose thickness indicates the degree of its betweenness centrality, a measure associated with the transformative potential of a scientific contribution. Such nodes tend to bridge different stages of the development of a scientific field. Citation rings in red indicate the time slices in which citation bursts, or abrupt increases of citations, are detected. Citation bursts provide a useful means of tracing the development of research focus. Figure 9.3 shows an example of the distribution of topic areas with strong citation bursts in research on terrorism.
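CiteSpace detects citation bursts with Kleinberg's burst-detection algorithm. As a rough illustration of the idea only, the following simplified surrogate flags years whose citation counts sharply exceed the running baseline; the ratio and minimum-count thresholds are assumptions:

    def simple_bursts(counts, ratio=2.0, min_count=5):
        # counts: chronologically ordered (year, citations) pairs.
        # A year "bursts" if its count is at least `ratio` times the mean
        # of all preceding years and reaches `min_count`.
        bursts = []
        for i, (year, c) in enumerate(counts):
            if i == 0:
                continue
            baseline = sum(c0 for _, c0 in counts[:i]) / i
            if c >= ratio * max(baseline, 1) and c >= min_count:
                bursts.append(year)
        return bursts

    history = [(2001, 2), (2002, 3), (2003, 2), (2004, 12), (2005, 15)]
    print(simple_bursts(history))   # [2004, 2005]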

Fig. 9.4 A network of 12,691 co-cited references. Each year the top 2,000 most cited references were selected to form the network. The same three-cluster structure is persistent at various levels

CiteSpace provides system-level indicators to measure the quality of a cluster in terms of its silhouette score, an indicator of its homogeneity or consistency. Silhouette values of homogeneous clusters tend to be close to one. The change of the modularity of a network over time can be used to measure the structure of the system and its stability. In the regenerative medicine example, we demonstrated that the Nobel Prize winning discoveries caused a substantial amount of system perturbation.
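The silhouette value can be sketched directly from its definition. For one member, a is its mean distance to the other members of its own cluster and b its mean distance to the nearest other cluster; the coordinates below are illustrative:

    import math

    def silhouette(point, own_cluster, other_clusters):
        # s = (b - a) / max(a, b); values near 1 indicate a homogeneous,
        # well-separated cluster.
        a = sum(math.dist(point, p) for p in own_cluster if p != point) \
            / max(len(own_cluster) - 1, 1)
        b = min(sum(math.dist(point, p) for p in c) / len(c)
                for c in other_clusters)
        return (b - a) / max(a, b)

    own = [(0, 0), (0, 1), (1, 0)]
    others = [[(5, 5), (6, 5)]]
    print(round(silhouette((0, 0), own, others), 2))   # 0.87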
Figure 9.4 shows a network of 12,691 co-cited references based on the citation behavior of the top 2,000 papers per year on topics relevant to terrorism. The network is clearly clustered. The majority of the attention was attracted to areas that have demonstrated exponential growth as measured by their citation bursts.

9.2 Jigsaw

Jigsaw was developed at Georgia Tech, in a team led by John Stasko, who has been active in software visualization, information visualization, and visual analytics. Jigsaw integrates a variety of views for the study of a collection of text. The software is available at http://www.cc.gatech.edu/gvu/ii/jigsaw. Prospective users are highly recommended to start with the tutorial videos.1

1 http://www.cc.gatech.edu/gvu/ii/jigsaw/tutorial/

Fig. 9.5 The document view in Jigsaw

Jigsaw is designed to extract entities and relationships from unstructured text and to provide users with a broad range of interactive views in which to explore the identified entities and how they are related. For example, given a set of scientific publications, Jigsaw will identify entities such as authors, concepts, and keywords.
Jigsaw provides several forms of representation of the underlying data. Figure 9.5 shows the interface of its document view. The top of the view shows a tag cloud display, in which the size of a term reflects its frequency. We can easily see that the data is mainly about machine learning and networks. The lower part of the view is split into two panels. The panel on the left shows a list of the documents the user has browsed; the entry of the current document is highlighted in yellow. Its content is displayed in the window on the right. Inside the content window, a brief summary is displayed at the top, followed by the text of the document. Entities identified in the text are highlighted. For example, comparative analysis and uncertain graphs are highlighted in the title of the 2011 VAST paper. The authors of the paper are identified along with other entities such as concepts and the source of the document.
Figure 9.6 shows the List View of Jigsaw. The List View can display multiple lists of entities simultaneously and highlight how selected entities in one list are related to entities in other lists. The example in Fig. 9.6 displays three lists of entities: a list of concepts, a list of authors, and a list of index terms. The source documents are the papers of the InfoVis and VAST conference proceedings. The authors selected in the middle are coauthors of my publications in InfoVis and VAST.

Fig. 9.6 The list view of Jigsaw, showing a list of authors, a list of concepts, and a list of index
terms. The input documents are papers from the InfoVis and VAST conferences

The highlighted concepts on the left include network, text, animation, usability, and matrix. The index terms highlighted on the right are much more detailed, including citation analysis, astronomical surveys, PFNET, and SDSS. The flexibility to browse entities and relations across multiple lists is very convenient; it supports a very common task in exploring a dataset.
Other views in Jigsaw include a Circular Graph View, a Calendar View, a Document Cluster View, a Document Grid View, and a Word Tree View. Jigsaw also provides functions to compute the sentiment of a document and display the result in the Document Grid View: Jigsaw uses lists of "positive" and "negative" words, counts their occurrences in each document, and renders positive documents in blue and negative documents in red. Figure 9.7 shows a Word Tree View in Jigsaw.
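A minimal sketch of this word-counting scheme; the word lists are illustrative assumptions, not Jigsaw's actual lexicons:

    POSITIVE = {"good", "effective", "robust", "novel"}     # illustrative
    NEGATIVE = {"poor", "failure", "cluttered", "invalid"}  # illustrative

    def sentiment(document):
        # Count positive and negative word occurrences; a positive total
        # maps to a blue document in the Document Grid View, a negative
        # total to a red one.
        words = document.lower().split()
        return (sum(w in POSITIVE for w in words)
                - sum(w in NEGATIVE for w in words))

    print(sentiment("A novel and effective layout"))    # 2, i.e. positive
    print(sentiment("The network becomes cluttered"))   # -1, i.e. negative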
Perhaps the function that is most useful for an analyst is the Tablet View in Jigsaw (See Fig. 9.8). The Tablet functions like a sandbox in which the analyst can organize various views of the same set of documents side by side. The Tablet also allows the analyst to create a timeline and place evidence and other information along it.

Fig. 9.7 A word tree view in Jigsaw

Fig. 9.8 Tablet in Jigsaw provides a flexible workspace to organize evidence and information

9.3 Carrot

Carrot is a document clustering workbench. It is freely available.2 It can handle text data from a broad range of input sources, including the Internet, Google, and customized collections in XML. Carrot provides very powerful clustering functions and visualizes the clustering results in a treemap-like visualization called a Foam Tree visualization, along with a few other views (See Fig. 9.9). The most prominent areas in these visualizations correspond to the major clusters.

9.4 Power Grid Analysis

Monitoring the nation's electrical power grids is a labor-intensive operation that requires constant human attention to assess real-time telemetered data. The prevailing map-based or electric-circuit-based graphics display technology adopted by the power grid industry is mostly designed for data presentation purposes. The technology relies heavily on the operators to make real-time decisions based on their experience. Simple human negligence errors could potentially bring down an entire power grid in minutes and cause major disturbance to the community.

Fig. 9.9 Carrot’s visualizations of clusters of text documents. Top right: Aduna cluster map
visualization; lower middle: circles visualization; lower right: Foam Tree visualization

2 http://project.carrot2.org/

Fig. 9.10 Left: The geographic layout of the Western Power Grid (WECC) with 230 kV or higher
voltage. Right: a GreenGrid layout with additional weights applied to both nodes (using voltage
phase angle) and links (using impedance) (Reprinted from Wong et al. 2009 with permission)

One example of how visual analytics could potentially alleviate some of the real-time situation awareness challenges facing power grid operators is the GreenGrid system developed at Pacific Northwest National Laboratory (PNNL). GreenGrid extends the traditional force-directed graph layout technique by integrating the physics of the electrical circuit and the geography of the physical power grid resources into one discourse analytics tool for continuous power grid monitoring, measurement, and mitigation (Wong et al. 2009). Figure 9.10 (left) shows a traditional power grid visualization of the Western Power Grid (a.k.a. the Western Electricity Coordinating Council, or WECC) drawn on a geographic map. By modeling the telemetered data using the attractive and repulsive forces of the WECC, as in Fig. 9.10 (right), operators could conduct contingency analysis as soon as they spot unusual stretching of certain power grid links. On August 10, 1996, during the last blackout in western North America, the Western Power Grid was decoupled into four isolated islands (Alberta, Northern, Northern California, and Southern). A customized visual analytics tool such as GreenGrid in Fig. 9.10 (right) could show early signs of the decoupling, which would allow operators to assess the situation and enforce mitigation at the earliest possible time.
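As a rough illustration of the underlying idea, letting quantities measured on the links, rather than geography, determine the layout, the following sketch applies a generic weighted force-directed layout in networkx; it is not the GreenGrid implementation, and the node names and weights are illustrative:

    import networkx as nx

    # Edge weights (standing in for electrical quantities such as
    # impedance) pull strongly coupled nodes together, so an unusually
    # "stretched" link becomes visible in the layout.
    G = nx.Graph()
    G.add_edge("Alberta", "Northern", weight=5.0)
    G.add_edge("Northern", "Northern California", weight=1.0)
    G.add_edge("Northern California", "Southern", weight=3.0)

    pos = nx.spring_layout(G, weight="weight", seed=42)
    for node, (x, y) in pos.items():
        print(f"{node}: ({x:.2f}, {y:.2f})")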
While the GreenGrid technology seems effective, the multi-faceted power grid analytics challenge will require an array of additional visual analytics technologies to address fully. Among them is the multivariate network analytics problem, which could potentially be alleviated by the GreenCurve technology (Wong et al. 2012). Both the GreenGrid and GreenCurve visual analytics technologies could be modified to address other critical infrastructure network problems, from telecommunication to energy to transportation network grids, that require real-time human monitoring and response.

For planar network graphs that can be anchored to geospatial coordinates, removing the geographic information allows additional information to be modeled into a more abstract visualization, potentially improving the quality of the visual analytic discourse between users and their data. In the dual-map overlays we discussed in Chap. 8, a similar design strategy is used to connect information in two distinct views so as to facilitate the study of their interrelations.

9.5 Action Science Explorer (iOpener)

The Action Science Explorer (ASE) is a new tool developed at the University of
Maryland (Dunne et al. 2012). It is designed to present the scientific literature
for a field using many different modalities: lists of articles, their full texts,
automatic text summaries, and visualizations of the structure of the citation network.
Action Science Explorer integrates a variety of functions in order to support rapid
understanding of scientific literature. Users can analyze the network of citations
between papers, identify key papers and research clusters, automatically summarize
them, dig into the full text of articles to extract context, make annotations, write
reviews, and finally export their findings in a variety of document authoring formats.
Action Science Explorer is partially an integration of two existing tools – the
SocialAction network analysis tool3 and the JabRef reference manager.4 SocialAc-
tion provides network analysis capabilities including force-directed citation network
visualization, ranking and filtering papers by statistical measures, and automatic
cluster detection. JabRef supplies features for managing references, including
searching using simple regular expressions, automatic and manual grouping of
papers, DOI and URL links, PDF full text with annotations, abstracts, user-generated
reviews and text annotations, and many ways of exporting. It integrates with
Microsoft Word, OpenOffice.org, and LaTeX/BibTeX, so that citations to newly
discovered articles can be added quickly when writing survey papers.
These tools are linked together to form multiple coordinated views of the data.
Clicking on a node in the citation network selects it and its corresponding paper
in the reference manager, displaying its abstract, review, and other data associated
with it. Moreover, when clusters of nodes are selected, their papers are floated to
the top of the reference manager. When any node or cluster is selected, the In-
Cite Text window displays the text of all incoming citations to the paper(s), i.e.
the whole sentences from the citing papers that include the citation to the selected
paper(s). These are displayed in a hyperlinked list that allows the user to select
any one of them to show their surrounding context in the Out-Cite Text window.
This window shows the full text of the paper citing one of the selected papers, with

3 SocialAction network analysis tool.
4 JabRef reference manager.

Fig. 9.11 A screenshot of ASE (Reprinted from Dunne et al. 2012 with permission)

highlighting showing the selected citation sentence as well as any other sentences
that include hyperlinked citations to other papers. The last view is the summary
window, which can contain various multi-document summaries of a selected cluster.
Using automatic summarization techniques, we can summarize all of the incoming
citations to papers within that cluster, hopefully providing key insights into that
research community.
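The in-cite extraction itself can be approximated simply. The sketch below, which
assumes citations appear as bracketed numeric markers, returns the whole sentences
of a citing paper that reference a selected paper; ASE's actual pipeline is more
sophisticated, and the sample text here is invented.

    # A naive in-cite context extractor (illustrative only).
    import re

    def incite_sentences(full_text, marker):
        """Return whole sentences from a citing paper that contain marker."""
        sentences = re.split(r"(?<=[.!?])\s+", full_text)  # naive splitting
        return [s for s in sentences if marker in s]

    citing_paper = ("Co-citation analysis was introduced by Small [12]. "
                    "It has since become a standard tool. "
                    "Later work extended it to author co-citation [12][15].")
    for s in incite_sentences(citing_paper, "[12]"):
        print(s)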
According to the ASE website,5 the tool is currently available only to the
developers' collaborators rather than to the general public. Figure 9.11 shows a
screenshot of ASE.

9.6 Revisiting the Ten Challenges Identified in 2002

In 2002, when I wrote the first edition of this book, I identified the following top
ten challenges for the subject, along with predictions for the near future. Now, in
2012, what has changed, and what new challenges have emerged since?

5 http://www.cs.umd.edu/hcil/ase/

Challenge 1 Domain-specific versus domain-independent. This issue is concerned
with how much domain knowledge will be required to carry out an analysis.
This challenge remains an issue. Research in automatic ontology construction
has made considerable progress, and we hope that, by providing a representation of
the domain structure, future systems can better support the needs of users both
inside and outside a domain.
Challenge 2 Quality versus Timeliness. The quality comes from the collective
views expressed by domain experts in their scholarly publications. The timeliness
issue arises from the reality that, by the time an article appears in print, science
has most likely moved on. Nevertheless, the history of scientific debates can
provide valuable insights. If the analysis can be done frequently, such
visualizations can provide useful milestones for scientists to project the trajectory
of a paradigm. This issue also relates to the source of input, ranging from the
traditional scientific literature, through gray literature such as technical reports and
preprints, to communications between scientists of an invisible college.
The timeliness issue is relaxed by the use of social media. Studies have found
that how often an article is tweeted on Twitter soon after its publication may be a
good indicator of its subsequent citations in the scholarly literature. On the other
hand, social media's impact is more likely to be transient than persistent, because
it takes much more than a single manuscript of detailed experiments to convince
skeptical readers, let alone a one-line tweet to change people's opinions. The
real value of social media in this context is its ability to draw our attention quickly
to potentially interesting work, and that is a very good starting point.
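Such an indicator claim amounts to a rank correlation between early tweet counts
and later citation counts. The sketch below, with invented counts, uses scipy's
Spearman test; actual studies also control for journal, field, and publication date.

    # Rank correlation between early tweets and later citations (toy data).
    from scipy.stats import spearmanr

    early_tweets = [0, 2, 15, 4, 40, 1, 7, 22]   # tweets in the first week
    later_cites  = [1, 3, 20, 2, 55, 0, 9, 18]   # citations after two years

    rho, p = spearmanr(early_tweets, later_cites)
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")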
Challenge 3 Interdisciplinary nature: To understand and interpret what is going on
in science, we must consider the practice of closely related disciplines, particularly,
history of science, philosophy of science, sociology of science as well as the scien-
tific domain itself. This challenging issue requires an interdisciplinary approach to
ensure that we are aware of the latest developments in these disciplines and integrate
theoretical and methodological components properly. Getting a meaningful and
coherent big picture is a relevant issue.
We have a better understanding of the nature of interdisciplinarity. The diver-
sity that comes with interdisciplinarity is essential to the advance of scientific
knowledge. In Turning Points (Chen 2011), we have demonstrated that a common
mechanism of scientific creativity is to connect seemingly contradictory ideas.
Interdisciplinarity is the norm of how science works rather than its exception. We
have introduced the structural variation model in Chap. 8 to demonstrate a new way
of studying scientific discoveries and identifying the criteria of an environment in
which they may emerge. Even if we can capture only a small number of scientific
discoveries in this way, the theoretical and practical implications would be too
strong to ignore.
Thus addressing this challenge is a promising direction.

Challenge 4 Validation. This is an integral part of the process. It is crucial
to develop a comprehensive understanding of the strengths and weaknesses of
applying this type of approach to a wider range of case studies. Maintaining
a focused visualization and exploration process may prove informative both for
the development of the information visualization field and for the particular
scientific debates studied.
The validation challenge remains. In addition to searching for a method devoted
purely to validation, we may start to consider the potential of being able to
continuously monitor an optimization process. As many real-world problems can
be modeled as complex adaptive systems, it is important to be able to measure the
potential value of newly available information.
Challenge 5 Design metaphor. This issue is fundamental to future development.
Where do we seek appropriate and meaningful metaphors? How do we ensure that
a chosen metaphor is not misleading or ambiguous?
Researchers continue to search for design metaphors that can provide a
framework for analytic reasoning and decision making. One example is the notion
of a fitness landscape, which originated in evolutionary biology and was later
adopted in strategic management. A good metaphor has a broad range of
applicability.
Challenge 6 Coverage. The need for expanding citation indexing databases from
journals to other forms of scientific output, such as conference proceedings, patent
databases, e-prints, technical reports, and books.
The expansion has been taking place. Thomson Reuters has expanded its
indexing services to books and data. Adding data citations is a strategic move: it
has the potential to provide an essential piece for e-science and e-social science.
Google Scholar can search across scientific papers and patents seamlessly. It has
also become common for research evaluation programs to take into account
information from multiple sources, such as publications, research grants, and
patents.
Challenge 7 Scale up. Although it appears to be an algorithmic problem, it involves
other issues as well, such as design metaphors and validation.
The recent widespread interest in Big Data has highlighted both the demands
and the challenges. Large volumes of data arrive at unprecedented speed from
many aspects of our environment and our society. Cloud computing, Hadoop,
and cheaper, faster, larger data storage have all contributed to vastly improved
computing power. On the other hand, the road to scalability is likely to be a long
one. The speed at which data arrives will probably always outpace the speed at
which our ever-increasing computing power can deliver analytic results. Although
we can shift as many slow algorithms as possible to powerful computing facilities,
new applications will emerge that demand an even higher level of computing
power. What question will slow down IBM's Watson?

Challenge 8 Automatic labeling. The ability to generate informative and accurate
labels boils down to classification and categorization skills. Is it possible to pass
such skills on to algorithms?
The challenge is to choose labels that will make the most sense to the intended
audience. Studies have found that human beings tend to choose broader label
terms than algorithms that are configured to differentiate groups of co-cited
references. A promising strategy is to make use of domain-specific knowledge and
adapt to the knowledge level of the audience.
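As a minimal illustration of algorithmic labeling, the sketch below treats each
cluster's member titles as one document and proposes its highest TF-IDF terms as
candidate labels; the cluster contents are invented, and tools such as CiteSpace
also offer alternatives such as log-likelihood ratio tests for the same purpose.

    # TF-IDF candidate labels for clusters (invented cluster contents).
    from sklearn.feature_extraction.text import TfidfVectorizer

    clusters = {
        "c1": "stem cell reprogramming induced pluripotent stem cells",
        "c2": "power grid monitoring situation awareness grid operators",
    }
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(clusters.values())
    terms = vec.get_feature_names_out()

    for name, row in zip(clusters, X.toarray()):
        top = sorted(zip(row, terms), reverse=True)[:3]
        print(name, [t for score, t in top if score > 0])

Note how the highest-scoring terms tend to be narrow, discriminating words, which
is exactly the tendency that makes algorithmic labels feel narrower than
human-chosen ones.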
Challenge 9 Individual differences. One user's daydream could be another's
nightmare. The same visual-spatial configuration may send different messages to
different individuals. Personalization is a related issue.
The situation is similar to that of Challenge 8. The challenge boils down to how
to maintain effective communication between the technology and human
information and reasoning needs. Effectively incorporating and accessing
background knowledge is a long-standing challenge in artificial intelligence.
Challenge 10 Ethical constraints. Moving from information-oriented search tools
to knowledge-oriented ones shifts the focus from documents to scientists and
scientific networks. The knowledge of invisible colleges has been privileged;
sometimes it is the knowledge that distinguishes an expert from a newcomer.
Rethink the famous quotation from Francis Bacon (1561–1626): "Knowledge is
power." What are the ethical issues we need to take into account?
Much of a competitive edge results from the asymmetric possession of
knowledge. The techniques for making atomic bombs and for cloning human
beings are just two examples of the decisions that society as a whole has to make
to ensure that humanity has a healthy future.
In terms of technical capabilities, I forecasted the following developments in
2002. Which ones have been achieved, and which are still out of reach?
For the next 3–5 years, between 2002 and 2005, several routes of research and
development are likely to emerge or become established. In particular, the need for
expanding the current coverage of citation databases to include conference
proceedings and patent databases can trigger further research in automatic citation
indexing and large-scale text analysis. The need for timelier disciplinary snapshots
should also drive much research onto this route. Automatic extraction of citation
context will become increasingly popular. Software agents will begin to emerge for
summarizing multiple citation contexts – an important step in resolving the
bottleneck of streamlining quantitative and qualitative approaches to science
mapping. The recent surge of interest in small-world networks is likely to continue.
One can expect to see more specific studies of scientific networks as small-world
networks, including Web-based resource analysis, traditional citation databases,
and patent databases. Research in small-world networks is likely to draw much
attention to network analysis tools that can handle large-scale scientific networks.
Cross-section comparisons should increase.

Between 2005 and 2010, knowledge-oriented search and exploration tools
will become widely available. Users' major search tasks will probably switch from
data-oriented search to comprehension and interpretation tasks. Intelligent software
agents will begin to mature for citation context extraction and summarization.
Genomic maps will play more substantial roles in linking scientific data and
scientific literature. A synergy of data mining in genomic map data and scientific
literature will attract increasing interest.
Beyond 2010, mapping scientific frontiers should reach a point where science
maps can start to make forecasts and simulations. Powerful simulations will allow
scientists to see the potential impact of a new technology. Further than that, we will
have to wait and see.
Many techniques have matured over the last 10 years, including automatic
summarization of multiple documents, automatic construction of ontologies, and
recommendation of relevant references. Research has begun to touch on the issues
of predictive analysis and how to deal with unanticipated situations. To what extent is scientific
advance predictable? What can we learn from the past so that we will be able to
better recognize early signs of something potentially significant?
I envisage the following two milestones ahead for mapping scientific frontiers.
First, recall the clarity of the conceptual structures demonstrated by Paul Thagard,
which we saw in Chap. 1. Here are the requirements: at any point in time, the first
part of the input is the entire body of knowledge ever conceived by human beings,
and the second part is a newly proposed idea; the future system will be able
to tell us very quickly to what extent the new idea has been addressed in the
past and, if it has, what areas of our knowledge will be affected. This process is in
essence what scientists go through many times in their research. The key question
is how much of the retrieval, sense making, differentiation, and other analytic tasks
can be performed with considerably more external help.
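At a toy scale, the first half of this milestone resembles a similarity search of a
new idea against an indexed corpus. The sketch below, with an invented
two-document "body of knowledge", scores how much of a new idea is already
covered and which areas it would touch; a real system would need far richer
semantics than TF-IDF.

    # Score a new idea against an indexed corpus (invented corpus).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "co-citation analysis measures the relationship between documents",
        "force-directed layouts reveal structure in citation networks",
    ]
    new_idea = "measuring the relationship between documents through shared citation patterns"

    vec = TfidfVectorizer(stop_words="english").fit(corpus + [new_idea])
    sims = cosine_similarity(vec.transform([new_idea]), vec.transform(corpus))[0]
    for text, s in zip(corpus, sims):
        print(f"{s:.2f}  {text}")  # high scores mark areas already covered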
Figure 9.12 illustrates how the publication of an article by Galea et al. in 2002
altered the holistic system of our knowledge of post-traumatic stress disorder. The
Galea article is six pages long and cites 32 references. On the one hand, it requires a
substantial amount of domain knowledge to understand its validity and significance.
On the other hand, co-citation patterns indicate its special position in the landscape
of the domain knowledge. The diagrams show how the key contribution of the
work can be summarized at a conceptual level so that sense-making tasks become
much easier and more efficient.
Fig. 9.12 An ultimate ability to reduce the vast volume of scientific knowledge in the past and a
stream of new knowledge to a clear and precise representation of a conceptual structure

Fig. 9.13 A fitness landscape of scientific inquiries

The second milestone may build on the first to a great extent. It is to externalize
all the activities associated with scientific inquiries in a form that can greatly
integrate and inform scientists of their current situations and of the paths that may
lead to their goals. Figure 9.13 shows an illustrative sketch of a fitness landscape
of scientific inquiries. Each point of the landscape indicates the fitness value of
the corresponding point on the base of the landscape. Many scientific inquiries can
be conceptualized as an exploration of such a landscape. In some areas scientists
find consistent information; in other areas they may expect to find contradictions.
Some areas may be well defined, whereas other areas may involve uncertainties.
The fitness landscape will provide a macroscopic organizing structure. The
movement of a scientific frontier on such fitness landscapes can be shown with
scientific accuracy.
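As an illustration only, the following sketch models an inquiry as stochastic hill
climbing on an invented two-dimensional fitness landscape; the landscape function
and the step rule have no empirical basis and simply make the exploration
metaphor concrete.

    # Stochastic hill climbing on a synthetic fitness landscape (toy model).
    import random

    def fitness(x, y):
        # Two "peaks" of promising problem areas on an otherwise flat plane.
        peak1 = max(0.0, 1.0 - ((x - 2) ** 2 + (y - 2) ** 2))
        peak2 = max(0.0, 0.6 - ((x + 1) ** 2 + y ** 2))
        return peak1 + peak2

    random.seed(1)
    pos = (0.0, 0.0)  # the inquiry starts in unexplored territory
    for _ in range(500):
        step = (pos[0] + random.uniform(-0.3, 0.3),
                pos[1] + random.uniform(-0.3, 0.3))
        if fitness(*step) >= fitness(*pos):  # move uphill (or sideways)
            pos = step

    print(f"settled near ({pos[0]:.2f}, {pos[1]:.2f}), fitness {fitness(*pos):.2f}")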
To conclude the book, the quest for knowledge visualization underlines the
importance of understanding the dynamics of science. The science of science still
has a long way to go. The role of visual thinking and reasoning in science is clear.
We draw inspiration from what we see. Advances in our ability to obtain a wide
variety of visual images let us reach what was once impossible. We are able to
see much farther away with modern telescopes. Our mind does a large part of the
work of scientific reasoning. One day, our mind will be further augmented by
information and computational tools that can extend our vision beyond its own
limits.

9.7 The Future

We started our journey in mapping scientific frontiers from cartography on the
land, in the sky, and in the mind, in an attempt to clarify the essentials of visual
communication, especially the metaphors that can make a big picture simple and
communication, especially the metaphors that can make a big picture simple and
useful. We then moved on to explore ways that might enable us to catch a glimpse
of scientific frontiers. Guided by philosophical theories of science, we focused on
the trajectory or trails of competing paradigms through the scientific literature. We
emphasized the historical role of quantitative studies of science and of methods such
as co-word analysis and co-citation analysis, and the potential that might be realized by
the use of a variety of information visualization techniques. Finally, we examined a
series of case studies in which scientific debates were a common feature.
Mapping scientific frontiers needs a combined effort from a diverse range
of underlying disciplines, such as philosophy of science, sociology of science,
scientometrics, domain analysis, information visualization, knowledge discovery,
and data mining. By taking our readers through such a wide-ranging journey we
envisage that the book can stimulate and forge some joint lines of research and
a coordinated research agenda so that researchers in different disciplines can work
better together. In addition, the book intends to raise awareness of available tools
and promising technologies that scientists can adapt and use in their daily scientific
activities.
Throughout this book, we have emphasized the need for comprehensive
support for knowledge management at a strategic and thematic level, as opposed
to support for information seeking at the lexical level. We have distinguished
relevance judgments made by lexical match from those made by explicit references
to the existing body of knowledge. Citation analysis is a quantitative approach that
can bring us qualitative insights into scientific frontiers. In this sense, every scientist
is taking part in a social construction of knowledge, and we need to account for how
intellectual contributions have been assessed and perceived by others. The examples
in this book are not given as the best answer to each question; instead, they are
meant to provide concrete and tangible exemplars to inspire better ones.

References

Chen C (2011) Turning points: the nature of creativity. Springer, New York
Chen C (2012) Predictive effects of structural variation on citation counts. J Am Soc Inf Sci
Technol 63(3):431–449
Chen C, Ibekwe-SanJuan F, Hou J (2010) The structure and dynamics of co-citation clusters: a
multiple-perspective co-citation analysis. J Am Soc Inf Sci Technol 61(7):1386–1409
Dunne C, Shneiderman B, Gove R, Klavans J, Dorr B (2012) Rapid understanding of scientific
paper collections: integrating statistics, text analytics, and visualization. J Am Soc Inf Sci
Technol 63(12):2351–2369
Eccles R, Kapler T, Harper R, Wright W (2008) Stories in GeoTime. Inf Vis 7(1):3–17
Small H (1973) Co-citation in scientific literature: a new measure of the relationship between two
documents. J Am Soc Inf Sci 24:265–269
Wong PC, Schneider K, Mackey P, Foote H Jr, Chin G, Guttromson R et al (2009) A novel
visualization technique for electric power grid analytics. IEEE Trans Vis Comput Graph
15(3):410–423
Wong PC, Foote H, Mackey P Jr, Chin G, Huang Z, Thomas J (2012) A space-filling visualization
technique for multivariate small-world graphs. IEEE Trans Vis Comput Graph 18(5):797–809
Index

A
Action Science Explorer (ASE), 331–332
Actor-network theory (ANT), 40, 168
Acupuncture map, 79–81
AGN paradigm, 218–223
Alluvial map, 110–113
Anomalies, 1, 6, 7, 9, 213
Author co-citation analysis (ACA), 2, 105, 106, 148, 172, 180–190, 192, 220

B
Baseline network, 264–266, 268
Base map, 38, 48, 50, 52, 85, 103, 104, 118, 127–130, 189, 203, 305, 307–316
Between-cluster link, 267, 268
Biological map, 43, 77–83
BSE and vCJD, 42–43, 238, 248–254

C
Carrot, 301, 329
Cartography, 6, 36, 43, 47–55, 57, 61, 66, 83, 85, 86, 130, 338
Case study, 44, 217, 218, 220, 222, 240, 250, 254, 294
Catastrophism, 208, 212, 214, 215, 222
Celestial map, 47, 53, 56–77, 83
Centrality divergence, 265, 268–270, 273
CfA2 Great Wall, 70, 74
Challenges, 7, 15, 23, 38, 39, 43, 45, 67, 70, 86, 87, 91, 110, 131, 134, 209, 212, 213, 218, 227, 230, 263, 272, 274, 276, 279, 288, 289, 305, 309, 330, 332–338
Citation analysis, 2, 5, 37, 38, 41, 44, 117, 148, 164, 166, 167, 172–190, 192, 195, 196, 201, 209, 220, 224, 227, 234, 239, 248, 254, 264, 295, 338
CiteSpace, 2, 11, 137, 192, 195, 275, 276, 289, 296, 297, 315, 321–325
Cluster analysis, 128, 129, 180
Cluster linkage (CL), 268, 270, 272, 273
Co-citation analysis, 2, 44, 117, 148, 164, 166, 167, 172–190, 192, 195, 196, 201, 209, 220, 239
Co-citation clusters, 41, 177, 179, 180
Co-citation networks, 105, 107, 138, 173, 177, 178, 180, 183, 185–190, 192, 220–223, 237, 240, 264, 286, 297, 300, 338
Cognitive map, 87–90
Collagen research, 8, 175, 192, 203–206, 217
Competing paradigms, 5–9, 41–44, 170, 186, 201–224, 230, 338
Complex network analysis, 271–275
Concept mapping, 127–131, 143
Conceptual revolutions, 5, 9, 11–16, 203
Constellations, 26, 43, 56–66, 70, 85, 86, 130
Continental drifts, 11, 13–16
Co-word map, 143, 167–172

D
Data representation, 39
Data transformation, 36
Dimensionality reduction, 111–127, 143, 185
Document co-citation analysis (DCA), 172–180, 185, 192, 195, 209, 220, 239
Domain analysis, 172, 201–203, 338
Dual-map overlays, 44, 312–316

E
Early signs, 259–262, 293, 330, 336
Emergent properties, 38
Evidence, 3, 6, 8, 9, 13–15, 23, 33, 39, 42, 51, 66, 77, 78, 92, 203, 207–215, 217–224, 250, 253, 260, 272, 298, 303, 304, 321, 327, 328
Explanation coherence, 11, 13

F
The first X-ray, 32

G
Galaxies, 28, 42, 53, 63, 64, 66–71, 74, 75, 107, 130, 174, 218–220, 222
Genomic maps, 81–82, 336
GeoTime, 108, 321, 322
Gephi, 2, 137–138, 314
Gestalt psychology, 9, 34–45, 114
Gestalt switch, 9, 11, 32, 33, 43
Global map of science, 175, 176, 305, 310
Global science maps and overlays, 304–316
Goodness of model, 269
Gradualism, 208–209, 211–213, 215, 222
Graphical representations of information, 35
Graphs, 88, 93, 95, 124, 127, 131–135, 137, 138, 143, 155, 169, 175, 186, 214, 307, 326, 327, 330, 331
Great wall of galaxies, 74

H
Hermeneutics, 31, 33, 202
Hidden Markov models (HMMs), 150–153, 156, 158–160, 169
HistCite, 190–192, 313
History of science, 5, 7, 20, 31, 39, 42, 166, 172, 262, 322, 333

I
Impact theory, 207–209, 211–213, 215, 216, 222
Inclusion index, 168–170
Inclusion maps, 167–170
INDSCAL, 98, 119–121, 183, 229
Induced pluripotent stem cell (iPSC), 111, 137, 275, 276, 278–289
Influenza virus protein sequences, 82–83, 138
Information foraging, 148–151, 153, 156–160
Information science, 3, 5, 38–40, 166, 173, 180–183, 185, 192, 193, 196, 201–203, 233, 239, 240, 262
Information visualization, 3, 4, 21, 23, 33, 35–39, 44, 53, 88, 91, 95, 97, 101, 103, 110, 139, 143, 150, 161, 168, 179, 185, 193, 201, 229, 230, 255, 321, 325, 334, 338
Invisible college, 2, 4, 5, 10–11, 41, 43, 166, 167, 172, 189, 323, 333, 335
iPSC. See Induced pluripotent stem cell (iPSC)
Isomap, 121–126

J
Jigsaw, 78, 325–328
John Snow's map of cholera deaths, 23, 24

K
Knowledge diffusion, 10
Knowledge discovery, 173, 224, 227, 229–238, 241, 243, 338
Knowledge garden, 146

L
Landmark article, 9, 205, 213–216, 223, 238, 242, 277, 280
Large graph layout (LGL), 138
Latent domain knowledge, 44, 224, 227–255
Latent semantic indexing (LSI), 91–93, 150, 224
LGL. See Large graph layout (LGL)
Literature-based discovery, 263, 265
Locally linear embedding (LLE), 121, 124–127
LSI. See Latent semantic indexing (LSI)

M
Main-stream domain knowledge, 228, 234, 237, 238, 241–242, 248–255
Map of the universe, 70–75, 107
Mass extinctions, 42, 44, 169, 170, 201, 206–218, 222, 229, 230, 275
Matthew effect, 164–167, 195, 227
Memex, 86, 87, 143
Minimum spanning tree (MST), 88, 93, 95, 138, 150, 178, 186, 187, 193
Modularity change rate (MCR), 265–267, 270, 273
Modularity of a network, 266, 281, 325
Multidimensional scaling (MDS), 8, 44, 93, 97, 106, 111–120, 122–124, 126–130, 143, 163, 170, 175, 177, 180, 182, 183, 185–190, 192

N
Napoleon's retreat, 22, 23
Narratives of specialties, 176–180
Negative binomial models, 270
Non-mission research, 18–20
Novelty, 148, 261, 263, 265

P
Pajek, 2, 111, 136–137, 305, 314
Partitions, 95, 130, 181, 189, 235, 265, 266, 268
Patent co-citations, 166, 193–195
Pathfinder networks, 93–99, 106, 107, 114, 150, 169, 175, 183–185, 187–190, 192, 205, 213, 220, 227–229, 235, 238, 240–249
Pathfinder network scaling, 93–95, 106, 113, 169, 183, 187–189, 192, 228, 235, 238, 243
Philosophy of science, 2, 6, 201, 203, 333, 338
Pioneer spacecraft, 27, 28
Power of Ten, 47, 48
Predictive analysis, 336
Principal component analysis (PCA), 44, 97, 111, 114, 122–126, 128, 129, 180, 182, 185, 187–189, 211
Profitability, 148, 149, 151–154, 156
Project Hindsight, 17, 18

R
Retraction, 290–304

S
Scale-free networks, 136–137
Science mapping, 1–4, 15, 38, 40–41, 43, 44, 76, 91, 127, 161, 163, 164, 166, 172, 174, 180, 195, 196, 224, 295, 304, 335
Scientific debates, 5, 9, 44, 205, 230, 334, 338
Scientific frontiers, 1–20, 22, 33, 37–44, 76, 77, 83, 139, 144, 167, 170, 176, 186, 189, 197, 203, 223, 224, 227, 263, 289, 313, 322, 336–338
Scientific inscriptions, 5
Scientific literature, 2–6, 8, 38, 45, 70, 91, 105, 143, 144, 166, 167, 172, 174, 175, 180, 203, 223, 227, 229, 234, 250, 254, 276, 290–292, 294, 299, 304, 306, 321, 322, 331, 333, 336, 338
Scientific paradigms, 5, 7, 38, 43, 52, 201, 205, 224, 228–230, 255, 322
Scientific revolutions, 5–7, 9, 11, 12, 167, 229, 230
Scopus 2010 global map, 309
SDSS. See Sloan digital sky survey (SDSS)
SDSS Great Wall, 70, 74
Self-organized map (SOM), 53, 103, 104, 168
Shneiderman's mantra, 150, 321
Singular value decomposition (SVD), 92
Sloan digital sky survey (SDSS), 69–77, 327
Small-world networks, 87, 131–133, 135, 143, 271, 335
Social networks, 37, 131, 132, 265
Sociology of science, 2, 5, 39, 40, 165, 333, 338
SOM. See Self-organized map (SOM)
Spatial-semantic mapping, 151, 154
Structural hole, 132, 265
Structural variation, 44, 259–274, 281
Structure and dynamics of scientific knowledge, 44, 163–197
Supermassive black holes, 44, 217–224, 230
Survey knowledge, 89, 145
SVD. See Singular value decomposition (SVD)
Swanson's impact, 239–240
System perturbation, 259–274

T
TextFlow, 110
Thematic maps, 8, 38, 43, 47–49, 52–54, 204, 205
Thematic overlay, 43, 48–50, 52, 85, 127, 128, 144, 308
ThemeRiver, 108, 109
ThemeView, 38, 101, 102
Topic evolution, 110
Topic variations, 109, 110
Tower of Babel, 23–26
TRACES, 16–20
Trajectories of search, 44, 143–161
Transformative ideas, 259, 264
Traveling salesman, 89, 131, 144–146, 177
Triangular inequality, 93, 94, 150, 183

U
UCSD map, 305, 306
Undiscovered public knowledge, 166, 227, 230–234

V
Visual analytics, 1, 3, 5, 9, 15, 35–39, 43, 45, 108, 196, 289, 291, 294–296, 304, 321–338
Visualism, 6, 31, 53, 224
Visual navigation, 150–152, 155
Visual thinking, 20–39, 47, 321, 338
Voyager's message, 25, 28, 29, 43, 57
VxInsight, 82, 101–103, 194

Z
Zero-inflated negative binomial models (ZINB), 261, 269, 270
