
HARNESSING

THE POWER OF
GOOGLE
What Every Researcher Should Know

Christopher C. Brown

Copyright © 2017 Christopher C. Brown


All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise,
except for the inclusion of brief quotations in a review, without prior permission in writing from the
publisher.
Library of Congress Cataloging-in-Publication Data
Names: Brown, Christopher C., 1953– author.
Title: Harnessing the power of Google : what every researcher should know / Christopher C. Brown.
Description: Santa Barbara, California : Libraries Unlimited, an imprint of ABC-CLIO, LLC, [2017] |
Includes index.
Identifiers: LCCN 2017007461 (print) | LCCN 2017021554 (ebook) | ISBN 9781440857133 (ebook) |
ISBN 9781440857126 (acid-free paper)
Subjects: LCSH: Google. | Internet searching. | Internet research. | Libraries and the Internet. | Internet in
library reference services.
Classification: LCC ZA4234.G64 (ebook) | LCC ZA4234.G64 B76 2017 (print) | DDC 025.04252—dc23
LC record available at https://lccn.loc.gov/2017007461
ISBN: 978-1-4408-5712-6
EISBN: 978-1-4408-5713-3
21  20  19  18  17         1  2  3  4  5
This book is also available as an eBook.
Libraries Unlimited
An Imprint of ABC-CLIO, LLC
ABC-CLIO, LLC
130 Cremona Drive, P.O. Box 1911
Santa Barbara, California 93116-1911
www.abc-clio.com
This book is printed on acid-free paper
Manufactured in the United States of America
Figures displaying Google Web site materials are © 2015 Google, Inc., used with permission. Google and
the Google logo are registered trademarks of Google, Inc.

Contents

Illustrations
Introduction
Chapter 1: Searching Generally
History of Searching
Tension between Controlled Vocabulary and Full Textuality
Subject Headings vs. Subject Descriptors
Full-Text Searching: A Different Way of Thinking
Reference
Chapter 2: How Google Works
Crawling and Indexing
Ranking of Results
Beyond Google’s Reach
General Google Search Tips
Indirect Searching—Hidden Internet Content
Reference
Chapter 3: Searching Google Web
Basic Search Techniques
Power Search Techniques
Keyword Formulation
Evaluating Web Content
Cached Content
Reference
Chapter 4: Power Searching for Primary Sources Using Google Web
Site-Specific Google Searching
Site Searching on the International Level
Site Searching on the Foreign Government Level
Site Searching on the U.S. National Level
Site Searching on the State Government Level
Site Searching on the Local Level (Counties and Communities)
Site Searching for Commercial Content
Site Searching for Nonprofit Content
Reference
Chapter 5: Google Scholar and Scholarly Content
Depth of Searching in Google Scholar
What Content Does Google Scholar Retrieve?
Books—From Google Books
Evaluating Google Scholar Content
Right Margin Links
Left Sidebar
Citation-Specific Links
Effective Searching in Scholar
Google Scholar Metrics
References
Chapter 6: Google Books
Library Project
Google Books Partner Program
Google Book Views
Searching and Navigating Google Books
The “Fulfillment” Part
How Complete Is Google Books?
Government Documents in Google Books
Magazines in Google Books
Google Books and HathiTrust
Google Books and the Internet Archive
Fingers and Hands in Google Books
Citing Google Books
References
Chapter 7: Google as a Complement to Library Tools
What’s Wrong with Academic Libraries
The Three Googles
What’s Right with Academic Libraries
The “Flattening” of Information Sources
The “Holy Grail” of Search Results in the Google Age
Reference
Chapter 8: Academic Research Hacks
No Country Redirection (NCR) Searching
Google Books Ngram Viewer
Image Searching
Google Translate
Legal Cases
Searching Patents
References
Chapter 9: Case Studies in Academic Research
Case Study 1: Resources on Human Trafficking
Case Study 2: Funding for Religious Schools in Tanzania and South Africa
Case Study 3: Paris Tango House Fashion from the 1920s
Case Study 4: Data on Geothermal Heat Pumps
Chapter 10: Searching for Statistics
Step One: Determine Who Cares about the Statistics You Are Seeking
Step Two: Search within the Internet Domain of the Entities Likely to Issue
Statistics
Keyword Searching and Statistics
Statistics Case Study 1: Immigrant Statistics for the United States
Statistics Case Study 2: Subnational Data for Uganda
Conclusion
Index

Illustrations

FIGURES
Figure 1.1. Key Events in Recent Searching History
Figure 1.2. Contrast in Thinking When Searching by Controlled Vocabulary vs.
Searching Full Text
Figure 2.1. Robots Exclusion Example
Figure 2.2. Example of Numerical Data Not Exposed to Google Searching
Figure 2.3. Indirect Searching to Uncover Hidden Internet Content
Figure 3.1. Basic Structure of a URL
Figure 5.1. Google Scholar Content Ingest Model
Figure 5.2. Examples of Metadata in Google Scholar Records
Figure 5.3. How to Test for Google Scholar Search Depth
Figure 5.4. Testing Google Scholar Search Depth
Figure 5.5. Testing Full-Text Searching Depth in Google Scholar
Figure 5.6. Reason for Failure Revealed
Figure 5.7. Setting Google Scholar for a Specific Library
Figure 5.8. Google Scholar Features
Figure 5.9. Date Sort and Searching Only within Abstracts of Articles
Figure 5.10. Scholar’s Option to Search Only in Citations
Figure 5.11. Google Scholar Features underneath Citations
Figure 5.12. Clustering of Google Scholar Records
Figure 5.13. Citations in Google Scholar
Figure 5.14. Google Scholar Citation Failure for Book Chapters
Figure 5.15. “My Library” Feature in Google Scholar
Figure 5.16. Keyword Search Brainstorming Strategy
Figure 6.1. Google Books Content Sources
Figure 6.2. Limited Preview in Google Books
Figure 6.3. Search Term Hits with Google Books
Figure 6.4. U.S. Government Publication with Snippet View in Google Books
Figure 6.5. HathiTrust Augments Google Books for Government Publications
Full-Text Access
Figure 6.6. Contrast between Google Books and HathiTrust for Out-of-
Copyright Publication
Figure 7.1. Broadcast Searching Model of Late 1990s to Early 2000s
Figure 7.2. Discovery Tools as True Federated Search Tools
Figure 7.3. The Three Googles Model
Figure 8.1. Google and Google Scholar as Seen in Japan
Figure 8.2. Google Scholar Search Results from Japan
Figure 8.3. Accessing English-Language Google Scholar from Japan with NCR
Fix
Figure 8.4. Accessing Google Web from Hungary with NCR Fix
Figure 8.5. Using the Chrome NCR Redirect Add-in
Figure 8.6. Using the NCR Redirect Add-in to Search a Different Local Version
Figure 8.7. Viewing UK News from the United States with Chrome NCR Add-in
Figure 8.8. Google Books Ngram Viewer Showing Popularity of Various Foods
Figure 8.9. Example of Google Images Used to Identify a Place Photo
Figure 9.1. Google Books Highlights Page Hits

TABLES
Table 1.1 Contrast between E-books and E-journals
Table 2.1 Examples of Robot Exclusion from Popular Web Sites
Table 3.1 Commonly Used Power-Searching Strategies
Table 3.2 Evaluation Criteria
Table 4.1 TLDs from Selected Countries
Table 4.2 Variations in National Government Secondary Domains
Table 4.3 Examples of TLDs Exploited for Commercial Purposes
Table 4.4 Domains for U.S. States and the District of Columbia
Table 5.1 Evaluation Criteria for Google Scholar
Table 7.1 The Information Access Anomaly: Assumption of 400 Words per Page
(WritersServices 2001)
Table 7.2 Distinction among Resource Types
Table 8.1 Interesting U.S. Patents
Table 8.2 World Patent Office Content in Google Patents (as of December 16,
2016)
Table 9.1 Finding Web Domains from Entities that Care about the Topic
Table 10.1 Entities Likely to Issue Statistics about Fisheries
Table 10.2 Brainstorming Keywords for Statistical Questions
Table 10.3 Summary of Domain Searches

Introduction

Why is a reference librarian writing about using Google for academic research?
Don’t professors tell students not to use Google in their research? Isn’t Google a
threat to librarianship, and won’t it eventually replace the need for librarians?
This book will suggest that Google is extremely valuable in the academic
research process, but users need to understand what is being searched, how to
constrain searches to academically relevant resources, how to evaluate what is
found on the Web, and how to cite what is found.
It’s not uncommon for new university students to think they already know
how to search. Their first paper comes due, and what do they do? They resort to
using Google. They think they know it all—or at least where everything can be
found. But when they get that first paper back with an unsatisfactory grade and
comments like, “you need to cite peer-reviewed articles,” “don’t rely on
Google,” and “you need reliable sources to support your arguments,” many of
them show up for research consultations and to meet with reference (research)
librarians. It takes a librarian to really show them how to search.
I’ll let you in on a little secret: all reference librarians, academic or otherwise,
use search engines, especially Google. The extent to which it is used varies, of
course, but Google can be the single best starting point for navigation down the
right path. When someone has a question, they don’t know the answer. This
seems obvious, but it is profoundly important. If someone doesn’t know
something, they may not even know how to visualize what the answer looks like
or what path to pursue. Trying to navigate in a fog is nearly impossible. But
Google is there to correct our misspellings, suggest new pathways, and clear the
fog.
This book is not intended to cover every feature of Google. We intentionally
gloss over Google Earth, most of the Google widgets, Google personalization
features, linkages to Google+ and other Google properties, and even some of the
search capabilities that have little to do with academic research. This is
intentional. Many books already do that. All you have to do is search Google
Web like this: how to search google, or Google Books like this: how to search
google. This book is focused on assisting students, researchers, teachers,
professors, and librarians in finding primary and secondary sources using
Google.

1
Searching Generally

The history of language, writing, and indexing is a fascinating one. When


thinking of how we access the vast libraries of information that exist in the
world, we must remember that before the writing of literate cultures there were
oral cultures. These cultures had ways of remembering or indexing in the mind
as well. Long gone are the days when scholars would memorize long texts and
be able to search their memories for ideas and arguments. Since Gutenberg’s
printing press and the printing revolution, publications have proliferated and
various finding aids, including bibliographies, printed indexes and catalogs, card
catalogs, and, more recently, online catalogs and indexes, have been created to
provide intellectual access to print publications. To fully appreciate the
technology available to us today, we need a bit of historical perspective.

HISTORY OF SEARCHING
The blossoming of magazine and journal publishing soon necessitated a way
to discover all this content. Thus modern indexing was born. A glance at a
technology timeline will help give us a historical perspective (see Figure 1.1).
The mid-1800s saw the beginning of periodical indexing with pioneers like
William Frederick Poole and H. W. Wilson. Poole’s Index to Periodical
Literature, published in the mid-1800s through various editions and
supplements, economized space with abbreviations and small print and was
tedious and challenging to use, but it worked. Wilson, whose work endures to
this day, also saved space with abbreviations, but incorporated a technology that
was being developed in his day: subject headings along the lines of Library of
Congress subject headings.
Some of our older readers will remember libraries with card catalogs—those
handsome wooden cabinets with tiny drawers to accommodate cards with
information about the books or other materials owned by the library. The most
common scheme for library card searching was the dictionary catalog approach:
cards arranged alphabetically for authors, cards for subjects, and cards for titles.
But it gets more complex than just these three simple categories. When there are
multiple authors or editors, another card set needs to be created. Every subject
assigned generates more card sets. Title cards would account not only for the
main title, but also for additional titles such as series titles, varying forms of the
title, translated title, etc.

FIGURE 1.1.   Key Events in Recent Searching History.

There was no keyword searching in the physical library catalog days. Access,
at least for English language materials, was “left-anchored.” That is, searchers
had to start from the left to look up an author, subject, or title. If an author’s
name was Christopher C. Brown, the name was inverted “phonebook” style to
Brown, Christopher C. Subjects were governed with controlled vocabularies
such as the Library of Congress subject headings. Titles had special
considerations as well. Users had to omit initial articles (for English, omitting a,
an, or the from the beginning of titles). This greatly inhibited access to
materials, but because that was the state of the art at the time, users didn't know
what the future held.
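The contrast between left-anchored lookup and keyword searching can be sketched in a few lines of Python; the catalog entries here are invented for illustration:

```python
# A tiny "card catalog": entries filed left-anchored, phonebook style.
catalog = [
    "Brown, Christopher C. Harnessing the power of Google",
    "Poole, William Frederick. Index to periodical literature",
]

def left_anchored(prefix):
    """Card-catalog style: match only from the start of the filed entry."""
    return [c for c in catalog if c.lower().startswith(prefix.lower())]

def keyword(term):
    """Online-catalog style: match the term anywhere in the entry."""
    return [c for c in catalog if term.lower() in c.lower()]

# Left-anchored, a searcher must know to look under the inverted form:
print(left_anchored("christopher"))          # [] -- nothing filed under first name
print(left_anchored("brown, christopher"))   # finds the entry
print(keyword("christopher"))                # keyword search finds it either way
```

The point of the sketch is only that left-anchored access fails unless the searcher supplies the exact filing form, while keyword access does not.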
The late 1970s and 1980s saw the development of online catalogs. Because
libraries were worried that the public would not accept the new technology,
online catalog records were made to look like printed catalog cards. But there
was one major advancement of technology that was transformative: the ability to
search the online catalog record by keyword. In other words, users wouldn’t
need to think about inverting an author’s name. If all that was known of a title
was several words within the title and perhaps an author’s first name, the book
could likely be found. Users unaware of the proper formation of subject
headings could nevertheless locate materials simply by searching using
keywords. This technology was a huge forward leap and should be fully
appreciated.
The quest for magazines or scholarly journal content experienced a
transformation similar to books in library contexts. When indexes no longer had
to be printed out, economy of space was less important. Abstracts of articles
could be incorporated into the entry for each article. Keyword searching likewise
transformed access to scholarly journal literature.

TABLE 1.1.   Contrast between E-books and E-journals.


E-books: Entire books online, but very often with DRM restrictions.
E-journals: Single articles available through publishers and aggregators, but no DRM.

E-books: Restricted printing and downloading.
E-journals: No restrictions on printing and downloading.

E-books: Sometimes limits on "simultaneous users," making use for course-related materials challenging.
E-journals: No simultaneous user restrictions.

E-books: Often require establishing a login with the vendor.
E-journals: Vendors generally do not know the identity of users.

E-books: Often require downloading helper software with associated barriers.
E-journals: Usually require only Adobe Acrobat Reader, which is almost universally installed.

But the technology didn’t stop there. Full texts of e-books and articles began
to appear in the late 1990s and early 2000s. Early e-books were not fun to use.
Digital rights management (DRM) systems made access onerous. In order to sell
their new e-book model to print publishers, e-book vendors tried to replicate the
print user experience with their digital books, making the argument that the
online model completely replicated the print model: one user at a time per
e-book. But users didn't understand it that way: they wanted unlimited access to
online content, not “one simultaneous user.” Other vendors applied DRM with
helper applications such as Adobe Digital Editions. Downloading of auxiliary
software places additional barriers before the user, as evidenced by the many
assistance calls placed to reference desks. These and more e-book barriers persist
to this day.
Where e-books failed, e-journals succeeded (see Table 1.1). Books often had
DRM protections, but journals didn’t need this, because it was the part
(individual articles) and not the whole (entire books) that were being exposed.
The only barriers users faced with e-journal content were whether their library
subscribed and the initial authentication process. Once users were
authenticated, they could easily save entire e-journal articles, print them out, and
read them either online or as a printout.
Google entered the world of full text of both e-books and e-journals with
Google Books and Google Scholar, respectively. But more about those models
later. Suffice it to say that these two additional Google initiatives transformed the
way students think about content. From the initial search to finally accessing the
full text, the discovery and fulfillment process was forever changed.

TENSION BETWEEN CONTROLLED VOCABULARY AND


FULL TEXTUALITY
The “holy grail” of searching is to find “all and only” the relevant materials.
How one goes about finding this “all and only” has been gradually shifting.
Traditionally librarians have emphasized established search techniques when
doing library instruction. These include Boolean search terms (AND, OR, NOT),
nesting with parentheses, and proximity operators (these vary by database, but
may look like w/10, NEAR5, or something similar). In addition, instruction
librarians spend much time teaching the mastery of searching by controlled
vocabularies. Controlled vocabularies are agreed-upon vocabulary sets,
established by subject experts using thesaurus construction standards (NISO
2010), for use within a specific discipline. For example, Psychological Abstracts
(and its online analog PsycInfo) is controlled by the Thesaurus of
Psychological Index Terms; Sociological Abstracts by the Thesaurus of
Sociological Indexing Terms; and ERIC, the U.S. Department of Education’s
Education Resources Information Center index and database, has as its thesaurus
the Thesaurus of ERIC Descriptors. In science and engineering there was the
INSPEC Thesaurus and the IEEE Thesaurus, among many others. The idea
behind subject-specific thesauri was to capture the discipline-specific nuances of
terminology and to apply it consistently within that discipline. It should be
noted, however, that even between disciplines as close as education, psychology,
and sociology there were sometimes great differences in assigned terminology
and thus in the resulting application of those terms to indexed items.
For example, for the notion of e-mail, the ERIC descriptor is “Electronic
Mail,” psychology uses “Computer Mediated Communication,” and sociology
uses “Telecommunications” often combined with “Interpersonal
Communication.” Yet each of these official descriptors lags behind the culturally
accepted term “e-mail.” Indexing, in all of its structures and standards, is far
from a perfect art and is often variously applied, depending on the indexer.
However, Google’s searching power has become so popular with users—and
with librarians—that advanced searching techniques and reliance on thesauri have
decreased in recent years. Do we still need thesauri? Yes we do, especially in the
fields of medicine and law. Do we still need Boolean operators and other
connectors? Yes, because certain databases require them to produce predictable
search results. Indexing and abstracting products (or as librarians refer to them,
A&I products) are easily searched with Boolean operators (AND, OR, NOT).
These operators work well because what was being searched was only the words
contained in the index record: authors, title, subject, and perhaps an abstract—a
relatively small set of words. But when searching full-text books, the AND and
OR operators all of a sudden lose much of their power. Suppose you are
searching for hummingbirds in Colorado. Searching hummingbirds AND
Colorado in an abstract record (metadata only) will easily uncover relevant
materials, because the span of words searched is constrained to capture the true
“aboutness” of the word being indexed. But when using the same Boolean
operators in a database capable of searching across the full texts of millions of
books, too many irrelevant materials are going to be retrieved.
“Colorado” may appear on page 3 of a book, and “hummingbird” may appear on
page 250. But is this book really about hummingbirds in the context of Colorado
given that the two terms are so far removed from each other? To effectively
perform a search across such large collections of full texts we need one of two
things: 1) proximity operators that can constrain the closeness of one term to
another, thus increasing the chances of a greater degree of relevancy, or 2) a
sophisticated relevance ranking technology that is able to place at the top of the
results list materials where two or more search terms are in close proximity to
one another, where weighting is given to structural considerations like title,
subdivisions, assigned descriptions, etc., and a sophisticated underlying system
of synonymy is employed. The former system of proximity operators is what
library vendors typically provide in their products as described deep within their
help modules; the latter system is what search engines like Google typically do
for us. Do users understand the differences between how to frame a search in
Google versus how to put together a search in a typical library database? Do
users know how to work proximity operators when searching a library-
subscribed full-text database? The answers to these questions are likely “no.”
This helps us understand the popularity of Google Web, Google Scholar, and
Google Books and helps explain why researchers often don’t start their research
with library resources.
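The difference between a bare Boolean AND and a proximity constraint can be illustrated with a short Python sketch; the sample text and the ten-word threshold are arbitrary choices for illustration, not anything a particular vendor implements:

```python
import re

def boolean_and(text, *terms):
    """Metadata-style AND: true if every term appears anywhere in the text."""
    words = re.findall(r"\w+", text.lower())
    return all(term.lower() in words for term in terms)

def near(text, term_a, term_b, max_gap=10):
    """Proximity constraint, roughly what operators like w/10 or NEAR5
    promise: the two terms within max_gap words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

# "Colorado" near the start, "hummingbird" hundreds of words later:
book = "colorado appears early " + "filler " * 300 + "then a hummingbird appears"
print(boolean_and(book, "hummingbird", "colorado"))  # True: AND matches anyway
print(near(book, "hummingbird", "colorado"))         # False: terms too far apart
```

The AND test succeeds even though the terms are hundreds of words apart; the proximity test rejects the match, which is exactly the relevance judgment the paragraph above describes.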
The problem with Google is that Google isn’t telling us exactly how the
results are retrieved and how the relevancy is ordered. We simply see many more
resources for the same search, and the ordering of results is mysteriously up to
Google. I recommend that users do a “both-and” approach. Search both Google
and traditional library databases. When I work with students in reference
consultations, I very often take them to Google Scholar first, because they can
accumulate a lot of relevant resources very quickly. Then I tell them to play
“clean up” in the subject-specific library databases. This kind of joint strategy is
especially important for doctoral students who are responsible for reading
everything (not just some things) about their topics.
There are two disciplines where exactness in searching is absolutely essential:
medicine and law. We don’t want our doctors to miss crucial content about their
field. Nor do we want attorneys with expensive “billable hours” to overlook
materials that could help our legal standing as they represent our interests. In
these disciplines there is heavy reliance on the application of controlled
vocabularies. The National Library of Medicine has developed their Medical
Subject Headings (MeSH) to capture disease names, medical procedures, and
other medical terms. Staff at the National Library of Medicine with subject
expertise carefully assign MeSH headings for precision in retrieval of results.
In the legal realm, the proprietary West Key Number System is a long-
standing authority for indexing precedent-setting cases, law reviews, legal
encyclopedias, and other materials. These legal subjects are assigned by
attorney-editors who are specialists in various areas of law. It is important for
legal researchers to be able to find “all and only” relevant legal cases, as well as
their disposition (whether a law has been overturned or still stands).
In the realms of medicine and law it seems that strict adherence to controlled
vocabularies is here to stay for generations to come. However, in many other
disciplines we are seeing a paradigm shift. Users are tending not to take the time
to look up subjects or descriptors and instead simply search the full text with
tools like Google Web, Google Scholar, and Google Books.
Several observations should be made about controlled vocabularies and their
use by various database vendors. The first observation is that there are great
challenges when trying to decide whether to use a controlled vocabulary or not,
and then which controlled vocabulary should be used. As we saw with the three
social science disciplines of psychology, education, and sociology, there are
sometimes great differences in nomenclature within these disciplines. What
about contexts that are more universal? Library catalogs, for example, are
collections of books and other materials across all disciplines: arts and
humanities, social sciences, and science and engineering. Usually one controlled
vocabulary is used in academic libraries to capture the “aboutness” of these
works: the Library of Congress Subject Headings. But in using a generalized
nomenclature set, subject-specific nuances are not captured. The important point
in this observation is that controlled vocabularies are many, they vary in scope
and applicability, and they are not always applied in every context.
The second observation we need to make is that many databases, through
“smoke and mirrors,” create the impression that they are using a principled
thesaurus and applying it consistently throughout their product, but in fact this is
not the case. This is not meant to criticize them, for they are doing the best they
can with what they have to work with. But we need to be aware of what is really
happening in aggregated databases like those produced by vendors like EBSCO,
ProQuest, and Gale.
Unlike databases like PubMed and Westlaw, which get their records from a
single stream that they control, aggregator content comes from many sources,
some of which they have more control over than others. To make it appear that
they have a semblance of vocabulary control, subjects are first captured out of
individual index records, then are “back-generated” into a master index and
“normalized” (brought into a uniform style) to the extent possible. Rather than
individual index records being carefully examined (an impossible idea when you
consider the scale of records vendors deal with), clusterings of like subjects are
mashed together, and outliers are dealt with on an ad hoc basis. All three of these
vendors have done a great job of cleaning up records that only a few years ago
were a mess. My point here is that controlled vocabulary sometimes is superb,
especially in fields of medicine and law, but many times it is overplayed,
because millions of records are massaged by computer and only nominally
overseen by human eyes.

SUBJECT HEADINGS VS. SUBJECT DESCRIPTORS


A century ago books were carefully cataloged and a small number of subjects
were assigned per work. But for every subject assigned, another set of catalog
records had to be typed out and filed into those handsome card catalog trays. For
this reason the infamous "rule of three" was often applied, stipulating that no
more than three subjects should be assigned to a book. And the subjects were
precoordinated headings. By precoordinated I mean that they had one or more
semantic notions per subject heading.
We can see this in any library catalog. Searching for the subject cats, we see
results like these:
Cats (one semantic notion)
Cats—Aging (two semantic notions)
Cats—Anatomy (two semantic notions)
Cats—Anatomy—Atlases (three semantic notions)
Cats—Anatomy—Juvenile Literature (three semantic notions)
Cats as Laboratory Animals—Law and Legislation—United States (three semantic notions)
Cats—Fractures—Treatment—Handbooks, manuals, etc. (four semantic notions)

In order to keep the subjects to three or fewer, it is necessary for catalogers to


precoordinate the terms—that is, to combine notions together on a single line. In
some cases six or more notions can be precoordinated. Here is an example of six
semantic notions in a precoordinated Library of Congress subject heading:
United States—Armed Forces—Yugoslavia—Pay, allowances, etc.—Taxation—
Law and legislation. Yet even with all this effort to capture the “aboutness” in as
few lines as possible, it still fails to capture more granular subjects that are
dealt with within books.
In contrast to the idea of subject headings are subject descriptors. Subject
descriptors have one and only one semantic notion. In a descriptor world, the
subjects are not precoordinated by a cataloger or an indexer. In the subject
descriptor world the coordination must be done by the searcher. Thus, it is a
postcoordination method. This means that if one were trying to search for books
under the Library of Congress Subject Heading Cats—Fractures—Treatment—
Handbooks, manuals, etc., searching would now need to be done with the
Boolean AND operator like this: cats AND fractures AND treatment AND
(handbooks OR manuals). The advantage of descriptors is that any number of
descriptors can be assigned. In a database world catalog cards need not be
printed, journal indexes don’t need to occupy reference shelves, and subject
descriptors can proliferate in the online environment with no physical
consequences.
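A postcoordinated descriptor search can be sketched as set operations in Python; the records and their descriptor sets are invented for illustration:

```python
# Each record carries any number of single-notion descriptors.
records = {
    "Feline fracture care": {"cats", "fractures", "treatment", "handbooks"},
    "Cat anatomy atlas":    {"cats", "anatomy", "atlases"},
}

def search(recs, required, any_of):
    """Postcoordination: the searcher combines descriptors at query time,
    required terms with AND, alternatives with OR."""
    return [title for title, descriptors in recs.items()
            if required <= descriptors
            and (not any_of or any_of & descriptors)]

# cats AND fractures AND treatment AND (handbooks OR manuals)
hits = search(records, {"cats", "fractures", "treatment"}, {"handbooks", "manuals"})
print(hits)  # ['Feline fracture care']
```

The coordination that a cataloger once baked into a single precoordinated heading is here performed by the searcher's query, which is the essence of the postcoordination method.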
In early print indexes, like those produced by the indexing pioneer H.W.
Wilson, precoordinated subject headings were used because Wilson’s indexes
were developed around the same time the Library of Congress Subject Headings
(LCSH) were being developed, and precoordination offered an economy: fewer
access points needed to be created. As
Wilson products migrated to online databases, the practice of assigning subject
headings rather than subject descriptors continued. After EBSCO purchased the
Wilson products and incorporated them into their mega-indexing tools like the
Academic Search and Business Search products, we see a mixture of subject
headings (one or more semantic notions) and subject descriptors (only one
semantic notion) in the same environment. This can be clearly seen in the online
thesaurus EBSCO provides for Academic Search Complete:
cats—vaccination (two semantic notions)
cats—virus diseases (two semantic notions)
cats as carriers of disease (one semantic notion)
cats as laboratory animals (one semantic notion)
cats in advertising (one semantic notion)
cats in art (one semantic notion)
cats in art—exhibitions (two semantic notions)
cats in literature (one semantic notion)
cats in motion pictures (one semantic notion)

The ProQuest and Gale products, because they don’t have to deal with an
environment with subject headings thrown in with subject descriptors, have
back-generated thesauri that show only subject descriptors with only a single
semantic notion in each one.

FULL-TEXT SEARCHING: A DIFFERENT WAY OF


THINKING
Could we find anything in a world without controlled vocabularies? How do
we think about information if it is not first categorized and normalized for us?

FIGURE 1.2.   Contrast in Thinking When Searching by Controlled Vocabulary vs. Searching Full
Text.

It needs to be clearly stated that strategies for framing a subject descriptor


search differ greatly from a full text search using Google. Under a subject
descriptor method, searchers must think broadly about the topic. Consulting the
thesaurus for the discipline, if there is one, is essential to understanding the
normalized accepted nomenclature for the discipline. Searching should not be
done by idiosyncratic terminologies used by individual authors, but by
regularized terms accepted by searchers within the discipline.
On the other hand, full-text searching using Google is done at the other end of
the spectrum. Searchers must place themselves, as much as possible, in the mind
of the author, seeking idiosyncratic turns of phrase and using a generous
amount of synonyms. If one is searching for texts written in the 1920s, for
example, terminology for racial groups and local customs will differ greatly.
This is the difference between thinking like a “full text” and thinking like a
cataloger or an indexer (Figure 1.2).
Library databases increasingly allow for full-text searching across content to
which they provide full text. This proximity searching is quite powerful because
users can control relevance by stipulating how far keywords may be from each
other: for example, Term A within 20 words or 5 words of Term B, both terms
within the same sentence, or both terms within the same paragraph. Whereas the
Boolean operators AND, OR, and NOT are nearly universal among online databases,
proximity operators are not, and each vendor’s product needs to be thoroughly
studied in order to have confidence in proximity searching. In the world of
Google, the library search operators do not work (not that users had really
learned them anyway). Google’s relevance ranking is supposedly smarter than any
librarian. I’m not so certain of that, but that’s the reality we have in
Google’s world.
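The proximity idea described above can be sketched in a few lines of code. This is a minimal Python illustration of "Term A within N words of Term B"; it is my own toy sketch, not any vendor's actual implementation, and the sample sentence is invented:

```python
import re

def within_n_words(text, term_a, term_b, n):
    """True if term_a and term_b occur within n words of each other."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

sentence = "The study measured anxiety in patients diagnosed with diabetes."
print(within_n_words(sentence, "anxiety", "diabetes", 5))  # True  (5 words apart)
print(within_n_words(sentence, "anxiety", "diabetes", 3))  # False
```

A real database applies this kind of test across millions of documents, which is why each vendor's proximity syntax is worth learning.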

REFERENCE
National Information Standards Organization. 2010. “ANSI/NISO Z39.19 - Guidelines for the Construction,
Format, and Management of Monolingual Controlled Vocabularies.” Available at:
http://www.niso.org/standards/resources/Z39-19.html. Accessed December 10, 2016.

2
How Google Works

Search engines are one of the marvels of our time. They provide responses from
billions of pages of Web content in a fraction of a second. But how do they
actually work?

CRAWLING AND INDEXING


Perhaps the best place to start is to state how they do not work. Web search
engines do not perform live searches of the Internet. You would not want them to
do that. It would simply take too much time. Instead, search engines create an
index of Web content and curate it over time.
This is very much analogous to what happens in local online library catalogs.
When you search for a book in a catalog of a research library, a live database
query is not what is happening. Rather, a previously created index of the books is
what is searched. Imagine someone wanting a book and approaching the
reference desk of a library. The librarian takes the title of the book and,
starting from the first shelf, reads spine titles until the requested title is
found. This would take days and is not practical. This is analogous to performing a live
query (as opposed to querying an index) of a library catalog or of the entire
Internet. For this reason libraries are arranged by classification systems, and
searching is done against the indexing of call numbers, authors, titles, and
subjects in library online catalogs.
Likewise, search engines gather Web data by “crawling” Web
content over time. How often sites are crawled by search engines depends on
how important they are. For example, I have a personal Web site provided by my
university that I have maintained for nearly 20 years. I use it to post research
projects, guides for government information specialists, and even an indexing
project I did for the United Nations 15 years ago. In all, I have over 5,000
personal “pages,” and I update a small subset of these pages maybe once per
month. Obviously Google doesn’t need to crawl my site very often.
The U.S. Bureau of Labor Statistics is an official government statistical
agency that issues economic data and press releases about the economic
conditions of the United States on a regular basis. It’s likely that Google crawls
this site daily. News sites like CNN, Fox News, NBC, ABC, CBS, and the like
try to best each other with late-breaking news updates. Search engines are likely
to crawl these sites extremely frequently, perhaps many times per hour.
After the crawling and ingesting of content are complete, some sense needs to
be made of the information. It’s at this point that search engines build up an
index of the content. Web pages have at least some kind of metadata (or
background information) in the underpinnings of the Web page source code. At
minimum there is the URL itself and the date of the Web page capture. There
may also be metadata tags containing a title, responsible parties (authors),
creation date, and language. As we will discuss later, the metadata provided by
Web pages varies greatly, creating problems for consistent information retrieval.
Search engines add this metadata, along with every word found on the Web page,
to their giant index. Various compression algorithms can be applied to the
index to make it smaller and faster to search. This is analogous to a “back of
the book” index, which contains reference points (page numbers) that let
readers more quickly locate content within the book.
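To make the "back of the book" analogy concrete, here is a toy inverted index in Python. It is vastly simplified (the page texts are invented), but it shows why a prebuilt index can answer a query without rescanning any page:

```python
pages = {
    "page1": "cats in art and literature",
    "page2": "vaccination schedules for cats",
    "page3": "history of art exhibitions",
}

# Build the inverted index: each word points to the set of pages containing it.
index = {}
for url, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(url)

# Answer a query by intersecting posting lists instead of rereading pages.
def search(*terms):
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("cats", "art"))  # {'page1'}
```

Google's real index adds positions, metadata, and compression on top of this basic structure, but the lookup-by-word principle is the same.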

RANKING OF RESULTS
Google’s indexing of the Web alone is not sufficient. With billions of Web
pages, some sense must be made of the ranking of results in response to a
user query. This relevance ranking is proprietary to each search engine and is, in
fact, the feature that distinguishes the good search engines from the great search
engines. Google is constantly tweaking its relevance ranking algorithms, but
over the years Google has proven itself to have superior relevance ranking.
Library vendors, either online catalog vendors or online database vendors,
have offered their versions of relevance ranking. In general these companies are
forthcoming with how they rank the search results within their result sets. But
Google does not tell us how they do it. For one thing, this is how they position
themselves within the competitive search engine market, and for another thing,
they are continually modifying the way their relevance ranking works. So don’t
expect Google to tell us why certain results appear before others within search
results.
BEYOND GOOGLE’S REACH
Not all Web information is findable with Google. Often called the “hidden
Internet,” the “deep Web,” or the “invisible Web,” much content, perhaps far
more than what Google is able to crawl, is completely invisible to Google.
Some have estimated that this hidden content is 500 times what is findable in
Google (Bergman 2001). There are many reasons for this, and we need to
discuss each of them. Failure to understand why Google does not find everything
will only perpetuate the myth that everything and anything can be found with
Google.

Google Is Polite
Google is not always wanted in all parts of the world. Perhaps you have seen
news stories about China blocking all access to Google, favoring its own
powerful search tool, Baidu. Lawsuits, both domestic and foreign, occasionally
mandate that Google take down certain Web content because it is offensive,
illegal, or disputed as part of legal actions. These kinds of actions have a big
impact on Chinese students who return home after studying in other countries, or
on researchers visiting China, but otherwise will have little impact on
researchers in the United States.
Google will not crawl where it is not wanted. Google pays attention to robot
exclusion protocols. A robot is another term for a Web crawler. One of the oldest
of these protocols, still in existence today, is the robots.txt files posted on the
root domains of many major Web sites. This file tells search engines where they
should not crawl. Try this: go to your favorite Web site, and after the forward
slash from the root directory, add robots.txt. For my university, the University of
Denver, the root domain is: www.du.edu. Adding robots.txt to the root Web
address gives us this URL: http://www.du.edu/robots.txt. Here you will see
portions of the university Web site that Google need not bother to crawl, either
because it is a waste of resources or because it doesn’t add anything to the
discoverability of information the university wants discovered. Here is what
shows up on that page (Figure 2.1).
Not all Web sites employ this old technology. Many have newer tools to
exclude Google and other search engines. But as a fun experiment, try to see
how many robot exclusion files you can discover. Here are just a few examples
(Table 2.1).
The point is that Google stays out where it is not wanted, which accounts for
at least a small part of why not all information is available to Google.
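Python's standard library can parse these files directly. A small sketch using `urllib.robotparser`; the rules and the example.edu domain below are invented for illustration, not taken from any real site's robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch them from the
# site root, e.g. http://www.example.edu/robots.txt, before crawling.
rules = """User-agent: *
Disallow: /private/
Disallow: /search/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A polite crawler checks before fetching each URL.
print(rp.can_fetch("Googlebot", "http://www.example.edu/private/notes.html"))  # False
print(rp.can_fetch("Googlebot", "http://www.example.edu/news/today.html"))     # True
```

This is exactly the courtesy check Google performs: anything under a `Disallow` path for its user agent is simply skipped.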
Technology Exclusion
Google is not able to go where the technology does not permit. Many
databases are closed to search engines because of the way they work. Other
databases contain information that, even if they could be crawled, would be
meaningless. For example, the U.S. Naval Observatory has a database
(http://aa.usno.navy.mil/data/docs/RS_OneYear.php) that gives sunrise/sunset
data for each day of the year for every place in every state. An example can be
seen in Figure 2.2.

FIGURE 2.1.   Robots Exclusion Example.

TABLE 2.1.   Examples of Robot Exclusion from Popular Web Sites.

http://www.coca-cola.com/robots.txt      https://www.whitehouse.gov/robots.txt

https://www.parliament.uk/robots.txt      http://www.nigeria.gov.ng/robots.txt


https://www.cornell.edu/robots.txt      https://colorado.gov/robots.txt

http://www.ny.gov/robots.txt      https://www.theguardian.com/robots.txt

http://www.toyota.com/robots.txt      http://www.toyota.co.jp/robots.txt

http://www.funcionpublica.gob.mx/robots.txt      https://www.interpol.int/robots.txt

Even if Google were able to index this database, what’s the point? It’s all
numerical data. Google does have its own way of serving up sunrise and sunset
data, through nicely integrated widgets, but not via Google searches directly.
Password and Firewall Exclusion
Google doesn’t have access rights to content that is proprietary. We can easily
illustrate this with library database content you may already be familiar with.
Academic libraries subscribe to many wonderful resources such as Access World
News (local newspaper content from around the world), Alternative Press Index
(indexing of alternative press articles), Archives Unbound (digitized archival
collections of primary sources from various archives), Berg Fashion Library
(texts and images of fashion history), Legislative Histories (a ProQuest database
with exhaustive legislative histories going back to the first Congress),
MarketResearch.com Academic (expensive research reports for use in business
research), United States Congressional Serial Set (digitized copies of
congressional reports and documents from the early 1800s to present), and Web
of Science (cited references across all science, social science, and humanities
disciplines). Because of licensing restrictions that require password access,
Google is not allowed to crawl these resources. Academic libraries typically
subscribe to many hundreds of such databases. It’s true that some of the content
can be found in Google by other means. But generally these expensive interfaces
provide enhanced access that makes the subscription well worth the investment.

FIGURE 2.2.   Example of Numerical Data Not Exposed to Google Searching.

Many businesses have their entire corporate policies, customer databases,
price lists, and transactions available online, but behind a firewall. A virtual
private network (VPN) or other secure connection would need to be invoked to
get to this information. This information is often highly secretive and would ruin
many businesses if full access were suddenly enabled. Don’t expect Google to
give you access to this information, although I usually recommend that business
students try to see what is available through the indexable Web. You never know
what may have been leaked!

Disappeared Content
Sometimes content is removed from the Web for any number of reasons: it has
become obsolete and has been superseded, it is outdated, the funding for the Web
site has stopped, a legal action forced the content to be removed, there are
temporary server problems, and hundreds of other reasons. This is the reason
that most citation styles require researchers to include the access date in their
bibliographic citations. The claim is being made that, at least on the date stated,
the content was viewed. No claim can be made for any other date.

GENERAL GOOGLE SEARCH TIPS


Top-down Searching—“Aboutness”
When we search most library online databases, we are encouraged to
brainstorm for broad, general keywords that define the topic. This is because the
information that is searched has historically been limited to the key indexing
points: authors, subjects, and titles. Books would be described by perhaps three subject
headings. Scholarly articles could be found additionally by searching keywords
within the abstract. This strategy works well for scholarly research content, since
most often a work’s title captures the general “aboutness” of the book or article.
Suppose you are trying to find scholarly articles about anxiety disorders
among people diagnosed with diabetes. A top-down strategy might frame a
search this way: psychological disorders AND diabetes; or anxiety disorders
AND diabetes; or depression AND diabetes mellitus.

Bottom-up Searching—Full-Text Content


What is not possible with library online catalog-style searching is possible
with Google. Because Google searches the full text of Web pages, more than just
the title and abstract are retrievable. This means that searchers can employ a
deliberately different strategy. Rather than just searching for the “aboutness” of
desired content, try to envision the answer and search accordingly. Let’s
illustrate this by performing two different styles of search, top-down and
bottom-up, on the same topic. A bottom-up strategy for the diabetes topic
mentioned earlier might be framed like this: depression diabetes
“quantitative study” hospitalization.
As an example of bottom-up searching, for this book I needed to find an
authoritative list of Google Books Library Partners. I first searched Google Web
(“google books” library partners) using the top-down strategy and found partial
answers. I did the same using Google Scholar. I had difficulty in getting a
complete list in this case by searching the top-down approach. I came much
closer to the answer by framing a search that had some of the results I know to
be present in the desired result set, without knowing all of the set. I searched like
this: google books library partners michigan illinois california cornell harvard.

INDIRECT SEARCHING—HIDDEN INTERNET CONTENT


We can refer to all the content contained in Google as “the indexable Web.”
Not all Web content is crawlable or discoverable by Google, as we have already
mentioned. To illustrate this let’s take as an example a local library catalog.
In 1964 a master’s thesis was done at the University of Denver titled Malcolm
Glenn Wyer, Western Librarian: A Study in Leadership and Innovation. A
Google search for this title uncovers about seven results, but none of them are
from the University of Denver library catalog. A simple search of the library
online catalog did turn up this title—proof that this catalog record was not
exposed to Google and is part of the hidden Internet. Actually, we wouldn’t want
Google populated with results from individual library catalogs. What a mess that
would create! With over 3,000 academic libraries in the United States alone, as
well as over 9,000 public libraries, imagine what would happen if each catalog
record were indexed in Google. We would see tens of thousands of catalog
records in Google result sets when searching for a popular book title.
There are ways of forcing Google to index records from within library
catalogs. Search engine optimization initiatives can tell Google how to crawl
library catalog databases, for example, by providing Google with a range of
searchable control numbers used in individual records and thus indexing the
individual titles within a local library. Although a few libraries appear to have
accomplished this, the results are mixed. The purpose of this illustration is
simply to show that there is much content hidden within databases accessible to
libraries.
Another example from a research context that illustrates the hidden Internet
situation is statistical databases that are numerically based. Chapter 10 deals with
the idiosyncrasies of searching for statistics using Google, but I will introduce
the problem here. The Food and Agriculture Organization of the United Nations
(FAO) has their FAOSTAT database (http://faostat3.fao.org/home/E). Data can
be searched or browsed within the database by country, by topic, or by rankings.
However, the many data elements within this database are not exposed to Google
because it would be ridiculous to do so. Data can be packaged in many ways,
and the search engine is capable of putting together search results in response to
specific queries. But exposing raw numerical tables with no relationship to the
questions asked makes no sense, and likely is not even possible.
There are many other types of databases, both within and outside of libraries,
that are impervious to Google’s crawls. Thus we need a search strategy to
account for this. I call this indirect searching. Rather than searching for the
“aboutness” of the topic as we mentioned earlier, or rather than searching for
content within a work, also mentioned earlier, this strategy involves searching
for the “container,” the type of database that would contain the material we are
looking for. Figure 2.3 illustrates this indirect search strategy.

FIGURE 2.3.   Indirect Searching to Uncover Hidden Internet Content.

Suppose you were looking for export data between Japan and Argentina,
specifically the most recent available statistics for cars exported from Japan into
Argentina. You could perform a Google search for automobiles export Japan
Argentina statistics. Although this search has some promising results, it doesn’t
really give the precise data you need. The top-down strategy didn’t work, and a
bottom-up style strategy won’t work, since we cannot envision what the
statistical answers would be in this case. An indirect search strategy would
enable us to find hidden Internet content. We could frame a search like this:
foreign trade database by country by commodity. This search turns up the UN
Comtrade Database, which indeed contains the answers we need.
REFERENCE
Bergman, Michael K. 2001. “White Paper: The Deep Web: Surfacing Hidden Value.” Journal of Electronic
Publishing 7, no. 1.

3
Searching Google Web

When I started teaching Internet Reference in 1999, Google was not the search
engine I favored. In 1999 the World Wide Web was just six years old, and the
popular search engines were AltaVista and, soon after, AllTheWeb (known as
Fast.com). Google existed at that time, but I paid little attention to it. But within
a few years I became a convert: Google had figured out the relevance ranking
magic and was continually developing new ideas through Google Labs. It was
evident that Google was developing an interest in searching and discovery
beyond what other search engines were interested in accomplishing. Labs was
retired in 2011, but while it existed, it demonstrated the excitement and
determination Google had in developing new ideas.
As time went by, Google-like searching became so ingrained in the culture that
library search engines had to amend their search capabilities to keep up with user
expectations. For example, the phrase contained within quotes, a search engine
staple, is now a standard search feature in many, if not most, library-subscribed
databases.

BASIC SEARCH TECHNIQUES


As computers became more sophisticated, the formulation of search strategies
evolved as well. In the early days of library searching, users were taught the
Boolean operators: AND, OR, and NOT. Along with that, they were taught how
to frame and use the best keywords to retrieve quality results. Now natural
language searching changes things. Google is continually doing research to
improve natural language searches (Google 2016). By natural language
searching we mean just typing your question into the search engine box as you
would ask it at the reference desk: “How many people live in Colorado?”; “What is
the current consumer price index?”; or “How can I find contemporary reactions
to Lincoln’s Gettysburg Address?” These natural language searches actually
work quite well in Google Web.
Although I don’t recommend natural language searching for academic
content, it can work some of the time. Most of the time it’s best to search by
noun forms and to stay away from other parts of speech like verbs, adverbs,
adjectives, and prepositions. Some of the questions we pose in the academic
world are replete with complexity—things like causation, correlation,
implications, and other relationships. Perhaps future semantically based search
engines will be able to sort out these complexities, but for the time being, it’s
best to stick with noun forms.

POWER SEARCH TECHNIQUES


For many users, power searching means going to the Google Advanced
Search page. Google sometimes changes the way to get to this page, so the best
way to get there is to search google advanced search. I find this page a bit
painful to use. For this reason I think it best just to learn basic power search
techniques and use them directly in searching. Here is a brief overview of the
most useful power-searching techniques. These will be dealt with in greater
detail with examples later.

Phrase Searching
Phrase searching is accomplished by enclosing your search term within
quotes. When we say “quotes,” we mean what is sometimes referred to as
“double quotation marks.” As you read this book you see smart, or curly,
quotation marks used within the text. However, these must not be used in
Google. Copying and pasting items containing curly quotes from Microsoft
Word into a Google search box will sometimes give undesirable results.
Although placement of a phrase in quotes is very powerful as a constraining
mechanism, it should not be overused. Only enclose a phrase in quotes if it is
really a “frozen phrase” in the language in which you are searching. “United
States” is a frozen phrase, as we never say “the States that are United.” However,
“first amendment right” would not be a good idea to enclose in quotes, as the
same idea might be framed as “first amendment gives us the right” or “the rights
of the first amendment.”

Site-Specific Searching
Most researchers I speak with don’t realize that Google doesn’t even let you
see beyond the first 1,000 results, at best. Even if you retrieved 1 million
results with your search, and even if you had many years to sift through them,
Google prohibits you from looking beyond that first 1,000.
To test this, set your Google search results to 100 results per page, just to
make this task easier. Now perform any broad, general search in Google, go to
the bottom of the page, and you will see up to 10 pages to which you can
navigate. The first page should have results 1 to 100, the second page results 101
to 200, and so on. Usually the results will stop far short of 1,000—maybe around
600 to 800 results. What this means is that if the results affecting academic
research are not in these top 1,000 results, you will never see them. Throwing
more words at Google will certainly bring different results to the top, but it may
also keep out results that would have helped you. There must be a better way—
and there is!
Site-specific searching means that you access Google’s indexing of only a
specific site. For example, it is often difficult to locate documents at my
university, the University of Denver. But if I do a site-specific Google search
using the syntax: site:du.edu, followed by my keywords, the result set is only
results from the du.edu Internet domain. This syntax has a couple of rules that
must be followed religiously: “site” must not be capitalized; and there must be
no space after the colon. Technically it is okay to have a “dot” after the colon.
For example, we can search site:.gov to find U.S. government information, but I
never teach this as a best practice. When I teach in front of groups, I fear that the
dot may be confused with a space, so I always omit it.

File Type Searching


Another way to work around Google’s 1,000-result viewing limit
is to limit by file type. File type or file format may be more familiar to those who
use Windows-based computers. Although Macintosh and Windows computers
have the option of concealing file type extensions, those who opt to see the
extensions are familiar with these common three-letter extensions after file
names: Adobe Acrobat (.pdf), Microsoft Word (.doc or .docx), Microsoft Excel
(.xls or .xlsx), Microsoft PowerPoint (.ppt or .pptx), and generic text file (.txt).
Limiting by these common file types with Google’s syntax is most helpful in
isolating research-related content. Very often substantive reports, studies, and
articles are posted on the Web in Adobe Acrobat or .pdf format. Using syntax
similar to site-specific searching, we can restrict results to .pdf format like this:
filetype:pdf. The Microsoft file types mentioned earlier have two versions: the
older extensions without the final “x,” and the newer ones with the “x.” Each
needs to be searched separately.
It is also possible to search by less common file types such as WordPerfect
(.wpd), Lotus AmiPro (.ami), generic rich text format (.rtf), Apple Keynote
presentation (.key), and so forth. Table 3.1 summarizes these three most
important power-searching strategies.

Other Considerations
There are those occasions when the advanced search interface is essential. For
example, if searching for local information, such as news local to a country, the
“region” limit on the advanced search page works really well and cannot easily
be performed apart from the advanced search page.

TABLE 3.1.   Commonly Used Power-Searching Strategies.


Search Type      Description      Example
Phrase searching      Searches a phrase as a literal string of characters      “united states”; “human rights”
Site-specific searching      Searches within a specified Internet domain      site:census.gov; site:state.co.us
File type searching      Searches for a specified file type      filetype:pdf; filetype:doc
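These operators are typed into the search box as part of an ordinary query, which the browser ultimately sends to Google as a URL query string. A small Python sketch of how such a query could be encoded into a search URL; the domain and keywords here are just examples, not a search recommended in the text:

```python
from urllib.parse import quote_plus

# Example query combining the operators from Table 3.1 (keywords invented).
query = 'site:du.edu filetype:pdf "annual report"'
url = "https://www.google.com/search?q=" + quote_plus(query)
print(url)
# https://www.google.com/search?q=site%3Adu.edu+filetype%3Apdf+%22annual+report%22
```

Note that the operators travel inside the `q` parameter unchanged; they carry meaning for Google's query parser, not for the URL itself.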

Google occasionally changes its search operators. For example, many readers
may be accustomed to using the plus sign (+) in searches. Formerly the plus sign
could be used to force a word or phrase to be present in the results, rather than
treating it as an optional OR. But Google intentionally removed the plus sign as
an operator so that it could be used with Google Plus. Unlike library databases,
which stick with the same search operators year after year, we are at the mercy
of the whims of Google.
A variation on phrase searching is something Google calls verbatim searching.
The syntax is verbatim:your search terms. Google claims that this will not
search for plurals or alternative spellings. Another way to do this is to go directly
to https://www.google.com/webhp?tbs=li:1. This defaults your search to a
verbatim search in Google. There is a significant difference between verbatim
searching and quotation (phrase) searching. Searching with quotes around terms
finds the exact string. But verbatim searching finds all the search terms,
with the exact spelling, though not necessarily together as a single string.
Thus verbatim searching will produce more results.
If you wanted to find a common misspelling of United States in Google, you
could search verbatim:untied states, although for whatever reason I find more
predictable results by visiting the URL mentioned in the previous paragraph.

Searching for Syllabi


Suppose you are preparing a course on comparative politics, and you need
to find out what textbooks you should select. Try this Google strategy: 1)
Search Google Web: site:edu comparative politics syllabus textbooks.
Doing a site-specific search limits the results to educational institutions,
the domain where syllabi are more likely to be posted. 2) Look at current
syllabi and make a list of textbooks used. 3) Use a voting method, and
see which books get the most votes. You should now be ready to make
recommendations.

KEYWORD FORMULATION
When selecting keywords for Google searches, you can be more generous
than when selecting keywords for library database searches. When searching a
library database, the default search is often not to search the full text of content,
but rather to search the metadata only. For example, library catalogs today search
the text within a catalog record. This definitely includes the title, subject
headings, and possibly notes, summaries, and at times tables of contents of the
work, but not the full text of materials. Reference librarians often advise that
users search online catalogs with broad terms to capture the most relevant record
set. When searching Google, which reaches down into the full text of content,
we can be more generous and use more specific search terms. I call this “going
for the gold.” Search as if you know you can find exactly what you are looking
for. If and when that strategy fails, think of synonyms, broader terms, and
related terms to change the search.
When you search Google Web with many keywords, Google presents results
even if all keywords were not present on the Web pages. Google does you the
favor of saying which keywords were not found. I did a keyword search looking
for subnational data from Uganda like this, making reference to two subnational
regions: uganda database wakiso masindi economy gdp. None of the Web pages
Google retrieved contained all of these keywords. On the first result Google
noted this: Missing: wakiso masindi. This was extremely helpful, informing me
that the top result was about a database on the Ugandan economy, but it did not
contain the names of the Wakiso or Masindi regions of Uganda.
Be as specific as possible when searching Google. Because library databases
have a small set of data that you are searching against, you often get no results
when framing a search with many search terms. But Google’s index is many
times larger; thus, you can afford to throw more words at Google. If Google
doesn’t find all the terms, it starts to selectively discard terms, as noted earlier,
so as to still give you some results.

EVALUATING WEB CONTENT


Many, perhaps most, university instructors will tell students not to use Google
in their research. The recent prevalence of “fake news” serves to illustrate the
need to confirm the veracity of sources we use in our scholarship and to evaluate
everything we encounter on the Internet. There are times when we do need
Google to locate information, and we will then need to justify our decisions to
our professors. One way to do this is to apply evaluative criteria to what we find
on the Web. This is not much different from what we do already in the print
universe. We question the author’s credentials, we examine the structure and
layout of the resource, we consider the date of the publication in relation to the
question at hand, and we notice if there is scholarly justification for what is
being said in the form of bibliographic references. In Table 3.2 there are some
basic evaluation criteria that apply to print information as well as Internet
information. The Web is full of many other sites that provide similar
information. Let’s say that you want to see what academic librarians have to say
about evaluating Web sites. Frame a Google Web search like this: site:edu web
site evaluation criteria.

TABLE 3.2.   Evaluation Criteria.

Evaluation Criteria     Explanation


Authorship     Who wrote it?
Publication Type     What is it?
Indications of Authority     Who published it? Is there evidence of a review process? Is there a bibliography?
Are sources attributed?
Intended Audience     Scholars? Juveniles? High schoolers?
Content     How is it organized? Is it useful for your research?
Bias     Is there a political agenda? Another agenda?
Date     Is the Web site/page current? Does currency matter for the topic?

In the Web world, the same criteria apply. It’s just that the methods of
determining the answers may differ, and different tools are available for
making these determinations. Let’s now look at some additional
evaluation considerations for Internet content.

URL Diagnostics
There is an additional criterion available to researchers when considering
Internet research: indicators given by the URL of the Web site. A URL can
contain hints that divulge much about a Web site’s authenticity or credibility. As
an example, a tilde (~) in a URL is often an indicator of a personal Web site.
Many universities have tilde sites for faculty, students, or staff. In the early days
of the Web, Congress issued tilde sites to senators and members of the House of
Representatives. For example, Senator Ted Kennedy had the site
www.senate.gov/~kennedy. Congress does not use this pattern today, but the
Internet Archive’s Wayback Machine has captured Kennedy’s old Web sites at
various points of time. By visiting the Wayback Machine at archive.org and
typing http://www.senate.gov/~kennedy/ into the Wayback Machine search box,
we can see old versions of Senator Kennedy’s Web sites from as far back as
1997. The point is that we need to pay attention to such markers within URLs, as
they can give clues as to the authority of the site.

Structure of a URL
To begin to evaluate content, it is necessary to know something about the
structure of a URL. Take this URL as an example:
http://www.census.gov/geo/maps-data/data/tiger.html.

FIGURE 3.1.   Basic Structure of a URL.


Figure 3.1 shows the constituent parts of a simple URL. Of course, it
gets more complicated when you consider URLs dynamically generated
from databases (such as a URL from a library catalog search), redirected
URLs, and the many other technologies that appear in links. The most
important thing to keep in mind is how to instantly recognize the TLD—or
top-level domain—and to recognize the secondary domain. This will tell
you about where the Web page or site is hosted and give hints as to
trustworthiness.
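The anatomy described above can also be explored programmatically. Here is a minimal Python sketch, using only the standard library, that pulls the hostname, secondary domain, and top-level domain out of the Census Bureau URL used as the example. The helper name is my own, not part of any standard API.

```python
from urllib.parse import urlparse

def domain_parts(url):
    """Split a URL's hostname into full host, secondary domain, and TLD."""
    host = urlparse(url).hostname      # e.g., 'www.census.gov'
    labels = host.split(".")
    tld = labels[-1]                   # top-level domain: 'gov'
    secondary = labels[-2]             # secondary domain: 'census'
    return host, secondary, tld

host, secondary, tld = domain_parts(
    "http://www.census.gov/geo/maps-data/data/tiger.html")
print(host)       # www.census.gov
print(secondary)  # census
print(tld)        # gov
```

Glancing at the last two labels of the hostname is usually enough to judge where a page is hosted, which is exactly the habit the paragraph above recommends.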

Another way to learn more about a Web site is to try what I call “backing off
the URL.” In this Web page about Halloween
(http://www.jeremiahproject.com/culture/halloween.html), we find many things
that make us suspect that there is an underlying agenda. By backing off the URL
—first taking off halloween.html, then removing culture/, we get back to the root
directory from which we can more clearly discern the background and intentions
of the authors.
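The “backing off” procedure is mechanical enough to sketch in code. The following Python fragment (the function name is my own invention) removes one path segment at a time, producing the same sequence of URLs you would type by hand.

```python
from urllib.parse import urlparse

def back_off(url):
    """Return successively shorter URLs, removing one path segment at a time."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    root = parsed.scheme + "://" + parsed.netloc
    steps = []
    while segments:
        segments.pop()  # drop the last path segment
        steps.append((root + "/" + "/".join(segments)) if segments else root)
    return steps

for step in back_off("http://www.jeremiahproject.com/culture/halloween.html"):
    print(step)
# http://www.jeremiahproject.com/culture
# http://www.jeremiahproject.com
```

Visiting each step in a browser, from the most specific page up to the root directory, is what lets you judge the background and intentions of the site’s authors.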

Currency
When was the Web site created? When was the page or site last updated? For
current news contexts, a current date and even time are very important.
However, sometimes the date on a Web page really doesn’t matter. Take this
URL, for example:
http://memory.loc.gov/ammem/aap/aaphome.html.

The title of the page is “African American Perspectives: Pamphlets from the
Daniel A. P. Murray Collection, 1818–1907.” Examining the URL we note that it
is published by the Library of Congress. It is part of their American Memory
Project. If we “back off” the URL we see the root site:
http://memory.loc.gov/ammem/index.html. But back to the page in question. At
the bottom of the Murray page we see that the date is “Oct-19-1998.” Should we
assume this is not an authoritative page because it is nearly 20 years old and
doesn’t seem to have been updated at all since that time? No, in this case that
wouldn’t make sense. This is historical content, posted on the Web, and not
needing any kind of updates. It’s just as valid and credible now as it was in 1998.
Google Web Date Searching: Not to Be Completely Trusted
Google has a challenging task: index all it can of the World Wide Web. If
only all Web publishers and all the content that Google ingests would
follow standards. Ideally every Web page would contain information,
either explicitly stated for all to see, or at least hidden in the metadata,
about who created the content and when it was created or modified. Such
data does exist for Google Scholar and Google Books records, because that
metadata comes into Google in normalized formats from publishers and
other sources following standards. But Google Web does not have that
luxury in many cases.
Some Web pages are created with all the proper metadata by software
produced by commercial firms or freely available products. Other pages
are created through automated processes that fail to incorporate basic
metadata into their pages. This is part of the reason for the metadata being
irregularly present.
Google Web results can be limited by date, but don’t be fooled. These
limits apply only to compliant pages: those that publish their created or
modified dates will show up in the results, while all other pages will be
silently missing, giving you a false sense of completeness.

Authorship/Creatorship
A Web page or site didn’t just happen; it was created by someone. It may have
been a person or several people, or it may be attributable to a group, what we
call a corporate entity. By corporate we don’t necessarily mean a corporation or
company; we mean a group entity of some type: a nonprofit organization, a
commercial company, a local government, an international organization, a
federal government agency, a religious group, or a political action committee. It
is important to know who created a Web site, because you need to know the
authority of the person or group (do they have sufficient credentials to speak to
the subject?) and because you need to properly cite the page in your cited
references.
Hopefully the author, whether personal or corporate, will be clearly stated. But if
it is not, a Web tool, WhoIs, is helpful in determining ownership of Web sites.
Just search Google for whois, and you will uncover many WhoIs servers. These
servers contain the public Internet registration information from those who
register domain names. You can discover names of the people who registered the
domain, which can sometimes assist in the background work.
We had previously referred to the Wayback Machine from the Internet
Archive. By going to archive.org and entering a domain or a Web page into the
Wayback Machine search box, you can see cached content over time and see
how pages have changed and if viewpoints have been modified over time.

CACHED CONTENT
The Web is volatile. Using Google, just type define:volatile. The second
definition given there is “liable to change rapidly and unpredictably, especially
for the worse.” We’ve all experienced Web content that has disappeared; it no
longer exists. But there is hope. Several techniques can be attempted to recover
content that is no longer accessible in the original manner.

Google’s Cache
Perhaps you have had the experience of using Google to locate information,
only to discover that by clicking on the title you get an error. This is because
when we search Google we are not searching the live Web—we are searching
Google’s indexing of the Web, which was done at some point in the past
(perhaps a week ago, perhaps several months ago). Sometime after Google’s last
indexing, the page or document disappeared. One way to recover from this
is to see if Google has a cached (or stored) version of the document. In Google
Web you notice that the Web page titles are in blue. Underneath you will see the
URL in green. To the right of the URL you may see a small green downward-
pointing arrow. Clicking this arrow gives you the possibility of accessing
Google’s cached version of the Web page—in other words, the version of the
Web page that Google indexed. This is extremely helpful in trying to recover
vanished content.

Wayback Machine
Another way to recover content is the previously mentioned Wayback
Machine. Take this government publication as an example of how the Wayback
Machine can recover content that was previously indexed by Google but is now
seemingly extinct:
United States Forest Service. 2006. Carpenter Ants in Alaska: Insect Pests of Wood Products.
Anchorage, AK: U.S. Dept. of Agriculture, Forest Service, Alaska Region 10, Forest Health
Protection, State and Private Forestry.

Several years ago a PDF of this document was available at
http://ublib.buffalo.edu/libraries/e-resources/ebooks/images/efc8610.pdf, but this
link no longer works. It is now dead. But by going to the Wayback Machine and
searching for this URL, we can see that the documents were cached as early as
2005 and that we can fully download the PDF from the cache of the Wayback
Machine.
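Such lookups can also be scripted. The Internet Archive exposes a simple availability API at archive.org/wayback/available; the sketch below only constructs the request URL (no network call is made), so you can paste the printed result into a browser. The endpoint is real, but verify the response format against the Archive’s own documentation; the function name is my own.

```python
from urllib.parse import urlencode

def wayback_lookup_url(dead_url, timestamp=None):
    """Build a request URL for the Internet Archive's availability API."""
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD: prefer snapshots near this date
    return "https://archive.org/wayback/available?" + urlencode(params)

print(wayback_lookup_url(
    "http://ublib.buffalo.edu/libraries/e-resources/ebooks/images/efc8610.pdf",
    timestamp="20050101"))
```

The API answers with JSON pointing to the closest archived snapshot, which in this case would lead you back to the 2005 capture of the Forest Service PDF.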

End of Term Archive


This interesting project is a collaboration between the University of North
Texas Libraries, the Internet Archive, and the Library of Congress. Every four
years when a U.S. presidential administration changes (even when the same
president continues in office), content is removed from the Web. For
various reasons—political, administrative, and just to have a fresh start—content
will be removed and not necessarily properly archived by the government. The
End of Term Archive project is an attempt to proactively capture this content.
Often these captures grab important PDF files that would otherwise have been
completely lost to history.

REFERENCE
Google. 2016. “Research at Google: Natural Language Processing.” Available at:
http://research.google.com/pubs/NaturalLanguageProcessing.html. Retrieved December 10, 2016.

4
Power Searching for Primary Sources
Using Google Web

The power of Google is the power of the full-text search. In the very early days
of the Web, search engines would not always scour every word of every document.
Sometimes search engines would not pursue links or index words buried deeply
down in the directory structure of Web sites. Other times long documents, such
as PDFs, would not be searched in their entirety. But that was then; today search
engines are capable of so much more power in terms of depth of searching and
tenacity at drilling down into the entire directory structures of Web sites.
When we search Google, Google decides which results come to the top.
Multiple factors affect Google search rankings. Among them are

Age of Internet domain (how long the Web site has been registered; longevity is assumed to be more
reliable than transience)
Domain history (drops in registration or ownership may imply unreliability)
Public vs. private domain registration (Are the registrants hiding something?)
Duplication of content, frequency of content updating, outbound linking, grammar, and spelling (a
signal of quality)
Broken links, cited references, and sources (a possible sign of scholarship)

These are only a few of the factors thought to go into Google’s rankings.

Perhaps you have heard of “search engine bias.” This tends to occur in the
business world of Web site positioning. When you want to eat Japanese food,
you would likely type Japanese restaurants into a search engine. The results you
see on top are typically ads. These ads are the result of Google knowing where
you are and what companies in your area are willing to pay for ads. But after the
ads you will see links to Web sites, with Google pushing to the top results that
are in your area mixed with results from Web sites that have paid for higher
positioning. Search engine optimization protocols are used to make Web sites
more visible and help to push them higher in the results list. Although much has
been written about these practices, that is beyond the purpose of this book. Here
we only care about using Google to assist us in our academic pursuits.

SITE-SPECIFIC GOOGLE SEARCHING


Because academic research tends to focus on more obscure scholarly topics
rather than popular, trending themes or business interests, it is very possible that
the resources you need for research will not appear in the first 10 results (the
first results page) or even in the top 100 results. The way to get around Google’s
1,000-document view limit and Google’s relevance ranking is to do site-specific
searching. The advantage of this kind of searching is that it only shows Google’s
indexing of a particular Web site. This presumes, of course, that you know which
Web site likely contains the information you need. This is not always easy. If
you wanted official information on the Zika virus, it would be a good guess that
the World Health Organization (WHO) might be an excellent starting point. The
first task is to discover what the Internet domain of the WHO is. Upon using
Google to search for “World Health Organization” we discover that the URL is
http://www.who.int/en/ (the en portion is a function of the WHO server detecting
that we are in the United States and presuming that we want to view English
language pages). The part of the URL we need to immediately identify is the
TLD, or top-level domain. In this case it is .int. If we want to limit our search to
WHO content only, then we need to put who.int in our search string. It is never
advisable to incorporate “www” into a site-specific search string. Thus, our ideal
search string here would be: site:who.int zika. For the purposes of academic
research, many valuable reports tend to be published on the Web in PDF format.
We can add a file type syntax to our search string to have Google just give us
PDF documents: site:who.int zika filetype:pdf.
These capabilities are not limited to English language searching. Here is a
search of the WHO site for Chinese language materials: site:who.int filetype:pdf
据寨卡病毒活跃传播的.
When constructing site-specific and file type–specific search strings, it is
important to note two things:

Never capitalize site or filetype
Never put a space after the colon in site: or filetype:

These mistakes will break the syntax and produce false or zero results because
Google does not recognize this as the advanced search syntax. It does not matter
where the “site:” or the “filetype:” appear in a search string, as long as they don’t
interfere with a search string that might be within quotes. In other words, these
searches all produce identical results:

site:who.int filetype:pdf zika statistics


zika site:who.int filetype:pdf statistics
statistics filetype:pdf zika site:who.int
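Because operator order is irrelevant, a search string can be assembled mechanically. Here is a minimal Python sketch (the helper name is my own) that builds a query obeying the two rules above: operators in lowercase, no space after the colon.

```python
def google_query(terms, site=None, filetype=None):
    """Assemble a Google query string; operator order does not affect results."""
    parts = []
    if site:
        parts.append("site:" + site)          # lowercase, no space after colon
    if filetype:
        parts.append("filetype:" + filetype)  # lowercase, no space after colon
    parts.extend(terms)
    return " ".join(parts)

print(google_query(["zika", "statistics"], site="who.int", filetype="pdf"))
# site:who.int filetype:pdf zika statistics
```

Building strings this way makes it easy to run the same topic search across several domains, a pattern that recurs throughout this chapter.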

With this background in mind, let’s apply site-specific searching to several
realms of what can be considered primary source materials that assist in our
academic pursuits.

SITE SEARCHING ON THE INTERNATIONAL LEVEL


There is a difference between international information and foreign
information that searchers need to be clear on. International information or
documents emanate from international organizations such as the United Nations
and its many associated agencies and other entities such as treaty organizations
like NATO, ASEAN, and APEC and other intergovernmental organizations.
Some of these are very well known, like the WHO that we already briefly
touched upon. Others are regional, and very few people may be aware of them.
This means that our first step in a research strategy is to search Google to
identify organizations that exist before we try searching generally for
information.
Let’s say you were searching for international human rights organizations.
Using Google to perform a search yields compiled directories of organizations,
selected nongovernmental organizations (NGOs), intergovernmental organizations
(IGOs), and some organizations affiliated with the United Nations. We will
discuss NGOs later, but the focus here will be on official IGOs, including UN
bodies.
If I want to clarify the complex Web sites used by the UN for human rights,
and after discovering that the main UN URL is un.org, I can perform this search:
site:un.org human rights bodies. After going through results I discover several
helpful pages, including http://research.un.org/en/docs/humanrights/charter and
http://www.unsceb.org/directory. From these results I can see that I need to
perform site-specific searching within the following domains:
un.org      United Nations main site
ohchr.org      Office of the United Nations High Commissioner for Human Rights
unocha.org      UN Office for the Coordination of Humanitarian Affairs
unhabitat.org      HABITAT
Site-specific searches could then be performed within each of these domains.
Let’s use the topic of human trafficking as an example.

site:un.org human trafficking


site:ohchr.org human trafficking
site:unocha.org human trafficking
site:unhabitat.org human trafficking

To each of these search examples, adding filetype:pdf to the string further


restricts the search to PDF files.

SITE SEARCHING ON THE FOREIGN GOVERNMENT LEVEL
Foreign governments each have their own TLDs. We can use our Google
skills to find the many places on the Web that contain these listings by simply
searching Google with the three letters, TLD. Highest on the list will likely be
the Wikipedia page, “List of Internet Top-Level Domains.” This page lists every
country of the world. Brief examples of selected countries are found in Table
4.1.
Searching Google for site:za will pull up Web pages hosted on servers from
South Africa. Often researchers want official information from government Web
sites. To do this we first need to find out if there is a specific government
subdomain for the country in question. In the case of South Africa, gov.za is the
government subdomain. However, .gov is not always the government subdomain.
Table 4.2 gives some notable exceptions to this.
It should also be noted that country-code domains are often sold for other,
purely commercial purposes (see Table 4.3).

TABLE 4.1.   TLDs from Selected Countries.

Country      TLD

Afghanistan      .af

Argentina      .ar


Brazil      .br
China (People’s Republic of)      .cn
Spain      .es
Japan      .jp
Netherlands      .nl
Saudi Arabia      .sa

Thailand      .th


United Kingdom      .uk
South Africa      .za

TABLE 4.2.   Variations in National Government Secondary Domains.


Country     Government Domain
Japan     .go.jp
China     .gov.cn; subnational government (province) domains are also used: ah.cn Anhui, bj.cn Beijing,
fj.cn Fujian, etc.
Ecuador     .gob.ec and .gov.ec are both used. Gobierno is “government” in Spanish
Canada     .gc.ca [Most government entities use their own second-level domain rather than .gc.ca]
United Kingdom     .gov.uk [Many government entities use their own second-level domains]
Germany     .de [Government entities use their own second-level domains], ex: bundestag.de, Parliament;
bundeskanzlerin.de, the Federal Chancellor; bundesregierung.de, Cabinet of Germany
France     .gouv.fr
Nigeria     .gov.ng
Indonesia     .go.id
India     .gov.in
Korea, Republic of     .go.kr
Mongolia     .gov.mn

TABLE 4.3.   Examples of TLDs Exploited for Commercial Purposes.


https://www.senate.mn/     Mongolian domain (.mn) used by the Minnesota (MN) State Senate
http://www.daa.nu/en/     .nu is the TLD for the island state of Niue. “Nu” is the word for “now” in Swedish,
and it is popular to buy this domain in Sweden and other Northern European
countries.
http://fashion.cd/     .cd is the TLD for the Democratic Republic of the Congo. CD also stands for
“compact disc” and can be used for formerly trendy sites.
http://pbs.tv/     Alternative site for Public Broadcasting Service videos. .tv is the TLD for Tuvalu.
http://www.pbb.me/     Commercial site capitalizing on the popularity of “me,” the TLD for Montenegro.
Many other examples can be found for marketable TLDs, such as .am (as in
AM radio—Armenia), .fm (as in FM radio—Federal States of Micronesia), and
.dj (as in disk jockey—Djibouti).
The point here is that the quest for in-country primary source information
across most countries can be done with the strategies mentioned here, but for
some countries little information may be available, and the primary purpose of
Internet is seen as purely commercial profit.

SITE SEARCHING ON THE U.S. NATIONAL LEVEL


We suggest several strategies when searching for U.S. federal information:

Use a government organizational directory. The official directory, published annually, is the United
States Government Manual. Available at https://www.gpo.gov/fdsys/browse/collection.action?
collectionCode=GOVMAN, the manual provides an understanding of the complexities of the federal
government as well as providing direct URLs to agencies.
USA.gov provides a comprehensive list of U.S. government agencies at https://www.usa.gov/federal-
agencies/a.
Search Google directly for the name of the government entity. Let’s say you wanted to find the
government agency that deals with safety on highways. Your initial Google search might look like this:
site:gov highway safety. You would then discover that the National Highway Traffic Safety
Administration (NHTSA) is the appropriate agency and that their URL is http://www.nhtsa.gov/.

Whether using the more explicative government manual or the more dynamic
USA.gov site, once you have identified an agency of interest, you can put your
site-specific searching to use in Google. For example, once you discover that the
National Aeronautics and Space Administration site is nasa.gov, you could
frame a search such as this: site:nasa.gov moon landing. Using the file type limit
we covered previously, we could narrow Google results to more substantive
pages: site:nasa.gov moon landing filetype:pdf.

Finding Primary Sources with Site-Specific Searching


Need to find primary sources to spice up your paper? Try going through
these steps.

1. Brainstorm your topic. What agency, association, government body, international
association, or interest group cares about the issue you are researching?
2. Use Google Web to find these associations. Suppose you are interested in child care
resources for the state of Connecticut. Find the government Web sites (remember that most
states have at least two). In this case you might search like this: site:ct.gov “child care.”
But also be sure to search with the other state domain: site:state.ct.us “child care.”
3. Refine your search with document-specific file type limits. For this search, a limit to
filetype:pdf works well: ct.gov “child care” filetype:pdf. And don’t forget: state.ct.us
“child care” filetype:pdf.
4. You might refine your search to find primary source statistics. Use xls and xlsx to find
Microsoft Excel spreadsheets, a common file type for statistics. Search: state.ct.us “child
care” filetype:xls; also search: ct.gov “child care” filetype:xls.

Let’s try a different topic. Let’s say you have a research project on the
Falun Gong religious sect, which has been banned in China. You can use
these steps.

1. Find the top-level domain for mainland China. Type tld into Google Web, and you can
access one of several lists of TLDs. Because the Wikipedia list is near the top, we can use
that list. We find that the TLD for China is .cn.
2. Search Google Web like this: site:cn falun gong. Now you will notice in the search
results that the secondary domain for Chinese government Web sites is gov.cn. We can
further narrow our search like this: site:gov.cn falun gong.
3. For a contrasting viewpoint, we can see what U.S. government sites have to say about this
topic. Search like this: site:gov falun gong. We note from the results list that the U.S. State
Department has something to say about this topic, so we can focus our search like this:
site:state.gov falun gong.
4. We can further refine these results by restriction to file type: filetype:pdf site:state.gov
falun gong.

Although most U.S. government sites use a .gov TLD, there are many
exceptions to this. The largest exception is U.S. military sites that use .mil as
their TLD. Several agencies use the older .us for their domain. The U.S. Forest
Service uses fs.fed.us. Because the U.S. Forest Service is the most prolific user
of this domain, we need to use our Google Web search skills here to discover
other federal sites that use .us. We first frame a search like this: site:fed.us. We
see that nearly all the top results are Forest Service related. We want to rule out
any Forest Service sites from the search results, so we incorporate a NOT
operator into the search this way: site:fed.us −site:fs.fed.us. Notice the minus (−)
sign directly before the second element of the search, telling Google to eliminate
results from the Forest Service. We now see that many U.S. courts also use the
.us TLD. Just to make it more complex, the United States Postal Service uses
.com (usps.com, redirected from usps.gov) and National Defense University,
under the Department of Defense, uses .edu (ndu.edu). These are not the only
examples of exceptions, but they illustrate the need to not make assumptions
about government top-level domains.
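The exclusion pattern above can be generated the same way as any other search string. One practical note: when typing such a search, use the plain ASCII hyphen directly before the second operator. The sketch below (helper name is my own) assembles the fed.us query that rules out Forest Service sites.

```python
def google_exclude(include_site, exclude_site, terms=()):
    """Build a query limited to one domain while excluding a subdomain."""
    parts = ["site:" + include_site,
             "-site:" + exclude_site]  # '-' immediately before the operator
    parts.extend(terms)
    return " ".join(parts)

print(google_exclude("fed.us", "fs.fed.us"))
# site:fed.us -site:fs.fed.us
```

The same helper works for any pair of domains, for example excluding a single agency from a broad .gov search.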

SITE SEARCHING ON THE STATE GOVERNMENT LEVEL


Searching U.S. state Web sites presents some challenges. Originally the .us
TLD was used for state and local governments. However, over time, the .us
domain was assigned to other entities within the United States that were not
government related. Originally states were all assigned a URL pattern within the
.us domain like this: state.xx.us, where xx is the two-letter postal code for the
state. Thus state.co.us would be the Colorado state site, and state.wi.us would be
the Wisconsin site. These patterns proved to not be popular with many states,
being too difficult for the public to remember and difficult to market. Many
states secured other domains for their official content. Colorado uses
colorado.gov for its official entry site, and Wisconsin uses wisconsin.gov.
However, most states still have a substantial amount of content on the older
state.xx.us sites.
Let’s put this to a test. I want to see what the state of Colorado has to say
about school district test results. First, searching (site:colorado.gov school
district test results) yielded 4,080 results on September 1, 2016. The same search
on the .us site (site:state.co.us school district test results) yielded 7,110 results at
the same time.
In the case of Wisconsin, site:wisconsin.gov school district test results gave
15,700 results, whereas site:state.wi.us school district test results gave 1,130
results. Colorado seems to have more content on the older .us site than
Wisconsin does. Yet both sites would need to be searched to retrieve
comprehensive results.
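Because most states split their content between a newer official domain and the legacy state.xx.us pattern, both must be searched every time. A short sketch (the function name is my own) that generates the pair of search strings from a state’s newer domain and its two-letter postal code:

```python
def state_searches(new_domain, postal_code, terms):
    """Return the pair of site-specific searches covering a state's two domains."""
    query = " ".join(terms)
    return [
        "site:" + new_domain + " " + query,            # newer official domain
        "site:state." + postal_code + ".us " + query,  # legacy .us domain
    ]

for q in state_searches("colorado.gov", "co",
                        ["school", "district", "test", "results"]):
    print(q)
# site:colorado.gov school district test results
# site:state.co.us school district test results
```

Running both queries and comparing result counts, as in the Colorado and Wisconsin examples above, shows which domain holds the bulk of a state’s content.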
To assist with the complexities of state site-specific searching, Table 4.4
should be consulted.

TABLE 4.4.   Domains for U.S. States and the District of Columbia

State     Domains with Site-Searchable Content


Alabama     alabama.gov and al.gov; little content under al.gov; most under alabama.gov and
state.al.us
Alaska     alaska.gov; content also under state.ak.us
Arizona     az.gov; also uses azleg.gov; azcourts.gov; and other.gov domains; content also
under state.az.us
Arkansas     arkansas.gov; ar.gov redirects to arkansas.gov; content also under state.ar.us
California     ca.gov; little content also under state.ca.us
Colorado     colorado.gov; co.gov redirects to colorado.gov; content also under state.co.us
Connecticut     ct.gov; content also under state.ct.us
Delaware     delaware.gov; de.gov redirects to delaware.gov; content also under state.de.us
District of Columbia     dc.gov; content also under dc.us (only for schools)
Florida     myflorida.com; florida.gov and fl.gov both redirect to myflorida.com, but much
content under both.gov domains; content also under state.fl.us
Georgia     georgia.gov; content also under ga.gov and state.ga.us
Hawaii     hawaii.gov redirects to ehawaii.gov, but content under both domains; content also
under state.hi.us
Idaho     idaho.gov; content also under state.id.us
Illinois     illinois.gov; il.gov redirects to illinois.gov, but content under both domains; content
also under state.il.us
Indiana     in.gov; content also under state.in.us
Iowa     iowa.gov; ia.gov redirects to iowa.gov; content also under state.ia.us
Kansas     kansas.gov; content also under ks.gov and state.ks.us
Kentucky     kentucky.gov; ky.gov redirects to kentucky.gov; content under both domains; content
also under state.ky.us
Louisiana     louisiana.gov; la.gov redirects to louisiana.gov; content under both domains;
content also under state.la.us
Maine     maine.gov; content also under state.me.us
Maryland     maryland.gov; md.gov redirects to maryland.gov; content under both domains;
content also under state.md.us
Massachusetts     mass.gov; ma.gov and massachusetts.gov both redirect to mass.gov; content
only under mass.gov and state.ma.us
Michigan     michigan.gov; mi.gov redirects to michigan.gov; content under michigan.gov,
mi.gov, and state.mi.us
Minnesota     mn.gov and minnesota.gov; most content under state.mn.us, with some content
under mn.gov and minnesota.gov
Mississippi     ms.gov and mississippi.gov, but content under both domains; content also under
state.ms.us
Missouri     missouri.gov redirects to mo.gov, with content under both domains; no content
under state.mo.us
Montana     montana.gov and mt.gov, most content under mt.gov, with little content under
montana.gov and no content under state.mt.us
Nebraska     nebraska.gov; ne.gov redirects to nebraska.gov, with content under both domains;
content also under state.ne.us
Nevada     nv.gov; almost no content under state.nv.us
New Hampshire     nh.gov; content also under state.nh.us
New Jersey     nj.gov and newjersey.gov; content also under state.nj.us
New Mexico     newmexico.gov; content also under nm.gov and state.nm.us
New York     ny.gov; content also under state.ny.us
North Carolina     nc.gov; content also under state.nc.us; some government entities use different
domains, for example Secretary of State is sosnc.gov and Motor Vehicles is
ncdot.gov
North Dakota     nd.gov; northdakota.gov redirects to nd.gov, but content only under nd.gov; content
also under state.nd.us
Ohio     ohio.gov; oh.gov redirects to ohio.gov; local content under oh.gov, state content
under ohio.gov; content also under state.oh.us

Oklahoma     ok.gov and oklahoma.gov, most content under ok.gov; content also under
state.ok.us
Oregon     oregon.gov; content under state.or.us
Pennsylvania     pa.gov; pennsylvania.gov redirects to pa.gov, content under pa.gov and state.pa.us
Puerto Rico     pr.gov
Rhode Island     ri.gov; rhodeisland.gov redirects to ri.gov, very little content under
rhodeisland.gov; most content only under ri.gov and state.ri.us
South Carolina     sc.gov; content also under state.sc.us
South Dakota     sd.gov; content also under state.sd.us
Tennessee     tn.gov and tennessee.gov; content also under state.tn.us
Texas     texas.gov, some content also under tx.gov; content also under state.tx.us
Utah     utah.gov; content also under state.ut.us
Vermont     vermont.gov; vt.gov redirects to vermont.gov; content under vt.gov, vermont.gov,
and state.vt.us
Virgin Islands     vi.gov (don’t confuse with British Virgin Islands, .vg)
Virginia     virginia.gov; content also under state.va.us
Washington     wa.gov; washington.gov redirects to wa.gov, but content only under wa.gov;
content also under state.wa.us
West Virginia     wv.gov; content also under westvirginia.gov and state.wv.us
Wisconsin     wisconsin.gov; wi.gov redirects to wisconsin.gov; content under wi.gov,
wisconsin.gov and state.wi.us
Wyoming     wyo.gov; wy.gov and wyoming.gov redirect to wyo.gov; content under wyo.gov,
wy.gov, wyoming.gov, and state.wy.us

Searchers should keep in mind that URLs used by states are subject to change
and updating at any time.

SITE SEARCHING ON THE LOCAL LEVEL (COUNTIES AND COMMUNITIES)
Local governments, meaning counties, cities, and other places such as
villages, share a similar URL situation to that of states. Counties were originally
assigned a URL structure under the .us domain. A county such as Arapahoe
County, Colorado, was assigned co.arapahoe.co.us. The first “co” here refers to
“county,” whereas the second “co” refers to “Colorado.” This can be seen very
clearly in the case of Walworth County, Wisconsin: co.walworth.wi.us. In these
two cases the counties have opted not to deviate from their original domain
assignment.
However, in Boulder County, Colorado, the domain bouldercounty.org is used.
Because Web content resides on both the bouldercounty.org site and the
co.boulder.co.us site, both will need to be searched for completeness:
site:bouldercounty.org minutes = 1,460 results on Sept. 1, 2016
site:co.boulder.co.us minutes = 36 results on Sept. 1, 2016

In this case, a majority of materials exist on the newer domain.


The same situation holds for city governments. Instead of “co” for county, the
pattern is “ci” for city. The city of Denver has the older domain ci.denver.co.us,
but also the newer domain, denvergov.org:
site:ci.denver.co.us minutes = 721 results on Sept. 1, 2016
site:denvergov.org minutes = 9,380 results on Sept. 1, 2016
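
Query pairs like these can be generated mechanically. The short Python sketch below simply produces one site: query per domain for the same search terms; the function name and structure are my own illustration, not anything Google provides.

```python
# Build Google queries that cover both a locality's legacy .us domain
# and its newer vanity domain, so that neither collection is missed.

def site_queries(terms, domains):
    """Return one 'site:' query per domain for the same search terms."""
    return [f"{terms} site:{d}" for d in domains]

# Domain pair for Denver, from the examples above
denver = site_queries("minutes", ["ci.denver.co.us", "denvergov.org"])
for query in denver:
    print(query)
```

Each generated query is then run separately in Google, and the result counts from the two domains can be compared by hand.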

According to the Census Bureau, in 2012 there were 3,031 counties, 19,519
municipalities, and 16,360 townships (Hogue 2012). It would not be possible to
list all the possible variations of Internet domains within this publication.
Besides, the results would change perhaps daily. This means that the skillful
searcher simply needs to apply the Google searching skills described in the book
to come up with the most comprehensive results.

SITE SEARCHING FOR COMMERCIAL CONTENT


So far the focus has been on governmental information, a very rich source for
content in the academic research process. But often valuable content can be
found on business (or .com) sites. Businesses vary as to how much information
they are willing to post to the Internet, but often valuable research, from
statistical data to technical reports, can be found on commercial sites.
Occasionally power Google searching will uncover content the company does
not want the public to see, either because it was in an old directory structure or
because information was erroneously loaded into a publicly accessible part of
the Web site.

Searching Other File Types


We have emphasized doing file type–specific searches for PDF files,
because these files tend to be substantive and valuable for research.
Although the Google Advanced Search page has a pull-down menu
suggesting several other file types that can be searched, you will be better
off following my advice here.
Sometimes PowerPoint presentations can help in the research process. It
is helpful to see how others have organized presentations, including the use
of graphics. The Google Advanced Search form suggests users search by
the file type ppt. However, many new PowerPoint presentations have the
extension pptx. If you were searching for PowerPoint presentations from
the U.S. federal government on terrorism preparedness, you could run
these two searches: terrorism preparedness site:gov filetype:ppt and
terrorism preparedness site:gov filetype:pptx. You will get different results
from each search.
Let’s say that you were looking for financial data, such as would be
found in an Excel spreadsheet. You would need two searches: one to
account for xls files, and another to account for xlsx files. To find financial
data from U.S. companies you might try these two searches: financial data
filetype:xls site:com and financial data filetype:xlsx site:com.
Other file types you can search this way: Word documents (doc and
docx), Microsoft Access databases (mdb and accdb), EndNote
bibliographies (enl), and simple text files (txt, rtf). Then there are the
graphics file types: png, gif, jpg, and bmp.
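
The same one-query-per-extension pattern applies to all of these paired Office formats, and it can be scripted. The following Python sketch is my own illustration (the helper name and the dictionary of pairs are invented for this example):

```python
# Google's filetype: operator matches a single extension, so the
# legacy and XML-era Office formats each need their own query.
OFFICE_PAIRS = {
    "powerpoint": ["ppt", "pptx"],
    "excel": ["xls", "xlsx"],
    "word": ["doc", "docx"],
}

def filetype_queries(terms, kind, site=None):
    """Return one query per extension for an Office format family."""
    base = f"{terms} site:{site}" if site else terms
    return [f"{base} filetype:{ext}" for ext in OFFICE_PAIRS[kind]]

for query in filetype_queries("terrorism preparedness", "powerpoint", site="gov"):
    print(query)
```

Running both queries and merging the results by hand gives the fuller picture that neither extension yields alone.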

SITE SEARCHING FOR NONPROFIT CONTENT


In the original scheme of things, nonprofit organizations were registered in the
United States under the .org domain. Unlike the .us situation, this scheme has
largely stayed intact. For other countries, nonprofits may be registered as
second-level domains as the following countries illustrate:

or.jp Japan organizations


org.cn China (mainland) organizations
org.uk Great Britain organizations
org.in India organizations
org.mx Mexico organizations
org.ec Ecuador organizations
org.au Australia organizations
org.ru Russia organizations

In these cases it’s best simply to search Google for the subject of the nonprofit
organization and restrict the search to the appropriate limits, either the TLD of
.org (in the case of organizations registered within the United States) or the
secondary-level TLD (in the case of organizations registered in other countries).
As an example, to find human rights associations in the United States, search:
site:org human rights. To find organizations in Japan, Russia, Mexico, and
Nigeria, search like this: site:or.jp human rights; site:org.ru human rights;
site:org.mx human rights; site:org.ng human rights. It’s even better if you can
search in the national language of the country: site:or.jp 人権; site:org.ru права
человека; site:org.mx derechos humanos.

REFERENCE
Hogue, Carma. 2012. “Government Organization Summary Report: 2012.” Census Bureau. Available at:
http://www2.census.gov/govs/cog/g12_org.pdf. Accessed December 11, 2016.

5
Google Scholar and Scholarly Content

Google’s early successes as a search engine led the company to undertake a very
smart experiment. Google must have realized that librarians and publishers are very
principled in the way they produce metadata, following basic standards, so that
citations to scholarly literature invariably contain elements like title, author, date
of publication, and control numbers such as international standard serial number
(ISSN) and international standard book number (ISBN). Leveraging these
metadata features, Google was able to produce a product so powerful that it
surpassed all existing indexing and abstracting tools in use in libraries in terms
of depth of searching.
Traditional article databases in libraries search across basic indexed fields
such as author, title, subjects or keywords, and abstract or summary of the
article. Google has these metadata elements in its giant index, but also has the
full text of a high percentage of scholarly publications. The default Google
Scholar search is to scour across the entire full text of articles, whereas the
default search in many, perhaps most, article databases found in libraries is to
search across all metadata fields, but not across all full text. The plus side of
searching only the metadata is that the user gets more relevant search results.
The negative is that the user gets fewer results. In Google’s case many more
results are retrieved, but the results are ranked by Google’s proprietary relevance
ranking scheme.
Of course, we don’t really know Google’s inside story, but from the product
that results, we can surmise the following. Google went to publishers and said to
them, “Give us your metadata—metadata to all scholarly articles that you have
ever published.” But why would any publisher ever want to do that? After all,
publishers have been able to sell this metadata in print, in compact disc, and later
through Web portals, at a very high price to libraries. But that wasn’t all. Google
further said to publishers, “Don’t stop with the metadata; give us your full PDFs
as well.” So why would any publisher be interested in doing this? The answer
would seem to be “to monetize it.” Publishers could give away access to
abstracts of scholarly articles in exchange for worldwide discovery of their obscure
journal content. This could be a win-win situation: a win for Google, being the
host of all this content (and later able to sell advertising), and a win for
publishers, who would not allow access to full-text content unless it was
purchased.
But Google didn’t just stop with this model. They went to libraries and
requested a listing of all journals to which a library subscribed, in both tangible
formats (print or microfiche) and in online format, through any and all
publishers and aggregators, and also the specific holdings for those journals (that
is, the dates that the library owned or had online access to).
How could any library, even academic libraries with large budgets, pull this
off? Well, most couldn’t do this without help, without a vendor prepared to take
on this task. This was the time when several vendors were prepared to do just
that: prepare a large XML-formatted file with a library’s complete journal
holdings, subscriptions, and ownership, covering all years, all vendors, and all
publishers. Google could then come along on a periodic basis and grab that
information and do what Google does best: index the information into its giant
Google database (see Figure 5.1).
The result of this merging of publisher and library holdings information is
depth of discovery provided by Google and direct journal access (hopefully)
provided by your local academic library. Because of this linkage of library
content with Google Scholar results, Scholar is not a tool that works in
opposition to libraries; rather it is one of the greatest proponents of the richness
of an academic library’s expensive investments. That’s why Scholar works best
for those associated with research libraries and does little to assist public
libraries.
FIGURE 5.1.   Google Scholar Content Ingest Model.

FIGURE 5.2.   Examples of Metadata in Google Scholar Records. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

In Figure 5.2, the boxes under each respective citation point out the metadata
supplied by publishers, and the boxes in the right margin point out the metadata
matches based on library holdings.

DEPTH OF SEARCHING IN GOOGLE SCHOLAR


It is important to point out the depth to which Google Scholar searches into
the full text of scholarly articles. To test for this, we just need to go into any
scholarly article, grab a string of words that appear to be unique, and see if that
string is retrieved with a Google Scholar phrase search.
Let’s take an article from 1999:
Smith, John WT. “The Deconstructed Journal—A New Model for Academic Publishing.” Learned
Publishing 12, no. 2 (1999): 79–91.

We will copy the word string from page 83 of the article (Figure 5.3) and
search it in Google Scholar.
Generally when doing these tests it is best not to cross line breaks, as they can
produce irregular results. But with this article, because the columns are so brief,
we had no choice but to take our text from a second line. Placing the search
within quotes in Google Scholar gives us the result we expect (Figure 5.4).
Of course we will find instances where this test fails. Although much of the
content in Google Scholar was “born digital”—that is, originally created in an
online format—some content contains scans of older articles created long before
the digital age. We might call this “legacy content” because it originally
appeared in a print format and had to be digitally scanned and then processed
with optical character recognition (OCR) software to make the words searchable
in online environments. The OCR process is not 100 percent perfect, leading to
some misfires when performing tests such as these. Nevertheless, the indexing
power of Google Scholar is impressive enough to capture the full-text indexing
of a very high percentage of scholarly publications over the years, at least in the
English language.

FIGURE 5.3.   How to Test for Google Scholar Search Depth.


FIGURE 5.4.   Testing Google Scholar Search Depth. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

FIGURE 5.5.   Testing Full-Text Searching Depth in Google Scholar.

In this 1893 journal article we attempt the same test (Figure 5.5) and the test
fails.
Steele, Theodore C. “Impressionalism.” Modern Art 1, no. 1 (1893).

The reason for this failure to retrieve the text becomes apparent when we copy
and paste the underlying text directly into the Google search bar: “gremlin”
characters appear in the pasted text (Figure 5.6):

FIGURE 5.6.   Reason for Failure Revealed. Google and the Google logo are registered trademarks of Google Inc., used
with permission.

The underlying OCR-ed text in this case contains errors. When we search for
the error-laden text, we do indeed retrieve the text of this article. Google does
not correct these errors; if it did take the time to micromanage every such error,
we certainly would not have the product we have today. Its OCR indexing is
good enough to get the job done. This means that there will be some degree of
failure when searching the full text of articles. But for the most part, the text is
retrievable.

Walk-ins: The Best Kept Secret in Academic Libraries


Most academic libraries have a provision that allows walk-in visitors
to search online databases from on campus or within the library.
This is especially true for taxpayer-funded state university campus
systems, but even many private colleges and universities have contractual
clauses in their vendor contracts that allow for walk-ins to access, search,
and download content from databases. There will be some exceptions in
cases of e-books, market research databases, and some other premium
content resources, but most databases from publishers and aggregators will
be able to provide visitors with scholarly journal articles when on campus.
When visiting the library as a guest, you will want to find out how to
connect your laptop or device to the campus Wi-Fi network. This will
enable you to see Google Scholar links in the right margin just as if you
were a student.

WHAT CONTENT DOES GOOGLE SCHOLAR RETRIEVE?


Most scholarly journal databases publish lists of the journals they cover in
their databases. For example, the National Library of Medicine’s PubMed
publishes a list of the 26,000 journals covered in the database
(https://www.nlm.nih.gov/bsd/serfile_addedinfo.html). The U.S. Department of
Education’s ERIC project likewise has a listing (http://eric.ed.gov/?journals).
But Google is silent on the scope of coverage within Google Scholar. The
Google “About” page only speaks very generally about coverage
(https://scholar.google.com/intl/en/scholar/about.html). Scholar includes articles
from individual authors, university repositories, large publishers, open access
journals, and small publishers via hosting services. Unlike library databases that
tend to be very forthcoming about what journals are represented in their
databases, Google does not provide such lists. That leaves it to the researcher to
try to figure it out on their own. Several years ago I attempted to come up with a
comprehensive list, and I discovered content from nearly every content provider
that academic libraries typically subscribe to. This includes large publishers like
Elsevier (Science Direct), Palgrave, and Wiley; societies such as the American
Physical Society and the American Society of Civil Engineers; university presses
such as Cambridge, Oxford, and MIT Press; archival journal projects such as
JSTOR; collaborative initiatives such as Project MUSE; and technical reports
from sources like NASA and the Office of Scientific and Technical Information.
Do not expect to find aggregated journal content from vendors such as Gale,
ProQuest, or EBSCO in Scholar; however, in some cases citations from these
vendors do make their way into Scholar.
Early concerns about Google Scholar focused on the secrecy of what Google
included in Scholar, the depth of indexing (the fact that in the beginning Scholar
did not index every word in PDF documents), and the frequency of updates to
Scholar content. As the years have passed, the depth of indexing and the
currency issues have not been seen to be problematic. However, Google still
remains secretive about the scope of coverage.
The frequency of updating is an issue that needs to be addressed. If a
researcher needs journal content as soon as it becomes available on the
publisher’s platform, they should know that Scholar does not provide that. It may take
several weeks, or perhaps months, for content to populate Scholar. In these cases
the researcher should consult the publisher’s Web site or find library-licensed
databases that provide current coverage for the journal in question.

Use Scholar to Solve Your Bibliography Problems


Let’s say you are a research assistant for a professor. She wants you to
convert her bibliography, created in Microsoft Word, into an EndNote
bibliography. You notice that Word does not have an export feature for its
bibliography, because it does not follow standard bibliographic export
formats. There is no way to automatically send citations from Word to
other bibliographic citation software. But you can use Google Scholar to
make this conversion task less onerous. Just look up every journal article
using Google Scholar. Then use the Cite button to save citations in a
format you can import, or automatically import citations one at a time by
clicking the EndNote button (assuming that you already have EndNote
software installed on your computer). You will need to take more care with
book chapters, however, because Google Scholar will often not format
these properly. Whole books can best be dealt with using WorldCat.org and
the citation generator contained within it. For primary sources such as UN
documents or WHO reports, you may need to create citations manually.

BOOKS—FROM GOOGLE BOOKS


One of the first things users will notice when using Google Scholar is the
large number of books that appear in the retrieved results. We will discuss
Google Books in depth in the next chapter. It can be assumed that books that
appear to Google Scholar to meet scholarly criteria are displayed in Google
Scholar results. Google states “Google Scholar automatically includes scholarly
works from Google Books Search” (Google 2016). This is both good and bad.
It’s good because it suggests materials beyond scholarly articles that may be
helpful. It’s also good because when Google Books results “bleed through” into
Google Scholar, they take on the characteristics of Scholar results with
bibliographic citations easily accessible, cited references showing up (as they do
not do in the Google Books interface), and the related articles feature. But it’s a
bit annoying that one cannot turn off the books and only view scholarly articles.
I wish there were a feature in the left margin to do just that. But users have to
realize that two different kinds of materials are offered in Scholar result sets:
journal articles that tend to be scholarly, and books that tend to be scholarly.

EVALUATING GOOGLE SCHOLAR CONTENT


Remember the discussion in Chapter 3 about evaluating Google Web content?
Well, we still need to call upon that information when considering Google
Scholar content as well. But why should we need to do this—everything in
Google Scholar is automatically scholarly, right? Well, not exactly. Generally the
content does tend to be scholarly. But there are many reasons why some material
that appears in Google Scholar would not meet the “scholarly” test.
Let’s revisit the table from Chapter 3 on evaluation and amend it slightly
(Table 5.1).
Google Scholar sometimes includes doctoral dissertations and master’s theses
because they were harvested from university institutional repositories. These are
generally not considered on the same scholarship level as a journal article that
was subject to a peer-review process.
Sometimes content from scholarly journals is included that is not in itself
scholarly. Examples of this might include editorial introductions to journal
issues, book reviews, and books that are not scholarly. There are many
“predatory publishers” out there trying to pass off their journals as legitimate
(Beall 2012). Users who really want to check whether a journal is peer reviewed
should be referred to Ulrich’s Periodicals Directory (print title) or its online
version, Ulrichsweb, found in many academic libraries.

TABLE 5.1.   Evaluation Criteria for Google Scholar.


Evaluation Criteria     Explanation
Authorship     Who wrote it? Are the author’s credentials given (university or other affiliation)? Are
the credentials relevant to the subject?
Publication Type     What is it? Is it a journal article, or is it a book (because books appear in Google
Scholar)? Is it a technical report or an expert report?
Indications of Authority     Who published it? Is there evidence of a review process? Is there a bibliography?
Are sources attributed? Is the publisher named?
Intended Audience     Is it intended for scholars? Can you determine a reading level?
Content     How is it organized? Is it useful for your research? Are there charts, tables, figures,
graphs? Is there a bibliography or notes? Is there a thesis statement? Are research
claims documented?
Bias     Is there a political agenda? Another agenda?
Date     Does currency matter for the topic?

RIGHT MARGIN LINKS


Linking to Locally Held Journal Content
One of the most important features of Google Scholar is the way it can display
links to library content. Academic libraries very often subscribe to journal
management modules that control and provide access to the many electronic
journals to which a library subscribes. This information can be passed along to
Google for monthly updating. Then users can easily see if their library
subscribes to the content or not. These services are common in medium to large
academic libraries and less common, because of expense, in smaller or
community college libraries. This is accomplished by the vendor posting a giant
XML file containing all the relevant journal titles, ISSNs, years that the journals
are held (whether in print or online), and which vendor provides the content.
This list is generally harvested monthly by Google.
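
The exact schema of these holdings files is negotiated between vendors and Google rather than described in this book; purely as an illustration, a single journal’s entry might look something like the following (all element names and values here are invented):

```xml
<!-- Hypothetical holdings entry; element names are illustrative only -->
<holding>
  <title>Journal of Example Studies</title>
  <issn>1234-5678</issn>
  <coverage from="1999" to="2016" format="online"/>
  <provider>Example Aggregator</provider>
</holding>
```

Whatever the actual markup, the essential payload is the same: a journal identifier, the years held, and the provider that supplies the content.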
When users perform Google Scholar searches from on campus, they can
retrieve full text the way they are accustomed to doing so: by clicking on the
article title. However, if their university does not subscribe to the title from the
publisher the link represents, they will be presented with a request for payment.
Off-campus users without academic affiliation will never be able to click
through to licensed content. That is why the institution-specific links are
necessary and useful in Google Scholar.

Setting Up the Linking


Links to local institution content need to be enabled within a local browser.
Google occasionally changes the way links are presented for doing this. But
generally users start at the main Google Scholar page (scholar.google.com) and
click the “Settings” link at the top right of the Web page. Then, clicking the
“Library Links” button on the left margin will enable users to search for, find,
and select their institution (see Figure 5.7).
Saving the newly checked option will enable off-campus searching from
anywhere by means of “cookies” locally saved on the laptop or other device. Up
to five libraries can be selected. Why would anyone need to check more than one
library? Some students and faculty have multiple affiliations with different
library subscriptions.
If you are not certain if your college or university has access to Google
Scholar content, you should ask a reference librarian at your institution. They
will be able to assist you. Even if your library is not able to set up customized
links in the right margin, if you are on campus, you still should be able to click
directly on article titles within Google Scholar. If your library subscribes, you
will be passed on through to licensed content. If not, you will be asked to pay for
content. Don’t pay if you are an academic researcher. Speak with your
reference librarian and ask what the best way is to initiate an interlibrary loan
request. You pay enough money already to your college or university. You don’t
want to be paying for each article you need for your research.
FIGURE 5.7.   Setting Google Scholar for a Specific Library. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

Using a Helper App with EZproxy


Library vendors have various kinds of proxy technologies to enable off-
campus access to licensed vendor content. OCLC’s EZproxy is likely the most
popular of such technologies. For libraries that use EZproxy there is a helper app
for both Google Chrome and Firefox. These apps are helpful when your
institution’s links are not showing up properly in Scholar. To find the Chrome
app, simply use Google to search: google chrome ezproxy. To find the Firefox
app search: firefox ezproxy helper.
I have found that at least two vendors’ links do not consistently generate local
journal content links within Google Scholar: JSTOR and HeinOnline. There may
be others as well. The helper apps are especially recommended for these
situations.

How Does the Linking Work?


It works through the magic of openURL resolution.
Invented by Herbert Van de Sompel at the University of Ghent in the 1990s, the
openURL has become an international standard (ANSI/NISO Z39.88-2004). It is
“open” because you can read the citation information directly from the URL
itself. Let’s take a closer look at an openURL. Google has its own underlying
linking system. This is a link underlying one of the university links from the
right margin:
https://scholar.google.com/scholar?
output=instlink&q=info:SNxr8YbKArkJ:scholar.google.com/&hl=en&num=20&as_sdt=0,6&inst=52723001017548261&s

This link, in turn, links to an openURL:


http://du-primo.hosted.exlibrisgroup.com/primo_library/libweb/action/openurl?
sid=google&auinit=JR&aulast=Jensen&atitle=Inland+wetland+change+detection+in+the+Everglades+Water+Conservatio
1112&vid=01UODE_SERVICES&institution=01UODE&url_ctx_val=&url_ctx_fmt=null&isSerivcesPage=true

This strangely long URL is actually understandable. It’s easy to identify the
field tags here: sid, auinit, aulast, atitle, volume, issue, date, spage, and issn.
It should be evident now why it is called an openURL. You can read the
metadata directly from the URL. The article title is plainly visible, as is the
source title, that is, the journal. Volume, issue, date, starting page, and ISSN
control number are also present. Google has some kind of underlying database
with various control numbers that somehow, mysteriously to us, points to and
generates the eventual openURL.
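
Because all of the citation metadata travels in ordinary query-string parameters, any URL library can read it back out. The Python sketch below parses a made-up openURL built from the field tags named above (sid, aulast, atitle, volume, issue, date, spage, issn); the resolver address and values are invented for illustration.

```python
# Parse citation metadata straight out of an openURL's query string.
from urllib.parse import urlparse, parse_qs

openurl = ("http://resolver.example.edu/openurl?"
           "sid=google&aulast=Smith&atitle=The+Deconstructed+Journal"
           "&title=Learned+Publishing&volume=12&issue=2&date=1999"
           "&spage=79&issn=1234-5678")

params = parse_qs(urlparse(openurl).query)
citation = {key: values[0] for key, values in params.items()}
print(citation["aulast"], citation["atitle"], citation["date"])
```

This transparency is exactly what “open” means in openURL: the citation can be reconstructed from the link itself.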
This is the opposite of what you want with online banking. We don’t want to
pass bank account numbers, passwords, and balance information through URLs.
This is why banking is done with sophisticated encryption technologies. But
bibliographic citations need no security; thus, openURLs make sense in these
contexts.
Van de Sompel’s invention was really a coup for the publishing world. It
created an international standard whereby publishers and aggregators that
compete with each other can nevertheless pass information along for the
purposes of information discovery. When a library link is clicked, the library’s
openURL resolver finds all the content, down to the specific journal article, that
meets the criteria.
Among the link resolvers libraries commonly subscribe to are SFX from Ex
Libris, 360 Link from Serials Solutions, LinkSource from EBSCO, WebBridge
from Innovative Interfaces, and GoldRush from the Colorado Alliance of
Research Libraries.

Linking to Free Content


Many times a local library will not have access to scholarly journal content
through library-subscribed databases. Google Scholar may, in some cases, be
able to solve this problem. When Google crawls scholarly content, it discovers
not only publisher content—very often freely available content is also found.
This may be content in institutional repositories, digital repositories, or simply
stored on personal Web sites. Google will place a link in the right margin for
these materials above the university links (if those are enabled and present). Very
often in PDF, but occasionally in HTML format, these documents may be an
open access version and thus freely available.

LEFT SIDEBAR
What makes Google Scholar distinct from Google Web is its separate interface
with unique features. These features are made possible because 100 percent of
Google Scholar’s content is metadata based. Let’s examine each of these features
(see Figure 5.8).
Let’s first examine the features available in the left sidebar of a Google
Scholar result set. It should be reiterated that these facets or limiters are evidence
of the underlying metadata that builds up the Google Scholar database. Unlike
in Google Web, these metadata-driven features attest to the reliability of the
dates and other metadata types.

Limit by Articles or Case Law


In the left margin of Google Scholar search results, the first option is to limit
by articles or case law. It would really have been much more useful if there were
a way to limit by book content within Scholar. As it is, when the default
“articles” selection is made, books show up in the results as well. There is
sufficient metadata in Scholar to make that kind of distinction, but unfortunately
Scholar does not provide that filtering option. Thus, one of the most basic
Scholar search skills to learn is how to distinguish book results from article
results. The “case law” limit flips the search away from scholarly articles and
books over to federal and state court cases, a completely different universe. We
will discuss case law later, but for now, because the realm of scholarly articles
and case law are such different worlds, most researchers wouldn’t need to use
the case law facet.
FIGURE 5.8.   Google Scholar Features. Google and the Google logo are registered trademarks of Google Inc., used with
permission.

Limit by Date
This is perhaps the most powerful feature of Google Scholar. Because, unlike
in Google Web, dates are consistently applied to all Scholar records, the date
limit is extremely reliable. Google Scholar ranks by its idea of relevancy,
incorporating older materials and newer materials together. The power behind
the date limit is to factor out materials that are not relevant to your research. If
you only care about research from the last five years on your topic, then adjust
the date accordingly. The custom date range feature works extremely well. If you
wanted to view medical research during the early days of the recognition,
understanding, and diagnosis of HIV/AIDS, you could easily restrict the dates to
say 1980 to 1984 and view only those scholarly articles.
Google Scholar has the ability to limit by date because of its underlying
metadata. In the left margin of a Scholar result set we see date limit suggestions,
as well as the possibility to limit by any other dates we choose. This is distinctly
different from Google Web’s ability to limit by date. Google Web indexes all
Web pages to which it has access, whether those pages contain principled
metadata or not. It is only those pages that contain adequate date information
that are available for Google’s date limit within Google Web. Thus, whenever
we use Google Web to limit by date, we are only retrieving those records that
have sufficient underlying metadata to appear in the result set. In other words,
records that should be in the Web result set are omitted simply because there was
insufficient metadata for them to be included. Google Scholar’s metadata, by
contrast, is all produced by publishers, libraries, or library vendors, which
enhances our confidence that any date limits have a high degree of accuracy
and completeness, unlike Google Web.

Sort by Relevance or Date


Scholar presents results ranked by relevance as the default, but users can
change that so that results are presented by date, showing only articles added in
the past year. There is a secret advantage to performing the date sort: a new
Scholar feature appears—the ability to restrict searches to only the abstracts in
the metadata. This really should be a feature across all Google Scholar content,
but it only shows up when a date sort is requested (see Figure 5.9).
By searching only within abstracts, we are constraining the “aboutness” of our
results. Some abstracts are author-generated, others are indexer-generated, but
they generally capture the thesis statement, main arguments, methodologies, and
conclusion of the article. In the example search in Figure 5.9, this search within
abstracts yields 647 results, whereas the same search within “everything” yields
about 41,500 results. This kind of restriction can really assist in ruling out
tangentially related content and ruling in only the most highly relevant
articles.

FIGURE 5.9.   Date Sort and Searching Only within Abstracts of Articles. Google and the Google logo are
registered trademarks of Google Inc., used with permission.

Include Patents/Include Citations


Because Scholar is Google’s powerful metadata-processing tool, Google
seems to have thrown many other metadata-intensive materials into the same
pot, whether it makes sense to put them there or not. Scholarly articles all come
with metadata, and we have already discussed legal cases, rich with metadata.
But Google Patents are also metadata intensive and are thrown into the Google
Scholar pot. It generally doesn’t hurt to leave the “include patents” box checked,
because scholarly articles and patents are often such different worlds that very
few users would even notice patents within the top search results. But the option
does exist for researchers to turn off patents so that they will not show up in
search results. Google Patents also has its own searchable interface, as we will
discuss later.
The “include citations” limit is perhaps poorly labeled. When the box is
checked (the default), results include records that are merely citations, for
which no full text can be located. Unchecking the box omits these abstracting
and indexing only (A&I) records, leaving only results that lead to full text
somewhere. In Figure 5.10 the left side shows the default selection, with
citations included. Note that the titles of the citation-only records are not
clickable. The right side of the image has the “include citations” box
unchecked, and none of these “citation only” results show up.

FIGURE 5.10.   Scholar’s Option to Search Only in Citations. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

Create Alert
Alerts are an extremely useful feature, especially for researchers doing long-
term research on a topic. Undergraduate students with only a passing interest and
who are just writing a quick research paper may not need the power of alerts.
Scholar alerts, like alerts within the Google Web interface, monitor new content
added to the Scholar database. When new content is added that would be
retrieved by your keywords, you receive an e-mail notifying you of the
additions. Think of this as a clipping service or a monitoring service for new
scholarly content that is added to the Google Scholar mix.

CITATION-SPECIFIC LINKS
Other features within Google Scholar appear underneath the keyword excerpts
in the center column of the results. Let’s go over these features in turn. Figure
5.11 illustrates the topics we will cover: “cited by,” “related articles,” “all xx
versions,” “cite,” “save,” and “library search.”

“Cited By”
Because Scholar ingests not just metadata, but also full text, it is able to scan
all the footnotes and bibliographic references contained within scholarly articles.
In compiling these, a linkage system is created to all subsequent articles that cite
the article in question. Thus, Google Scholar not only tracks what is in each
scholarly article’s bibliography, but also the cross-linking between articles.
“Cited by” is powerful in scholarship because it shows the interaction of other
scholars over time. What it is incapable of showing, however, is whether the
subsequent citation is a positive one (agreeing with the cited author) or a
negative one (taking issue with the cited author), or simply a neutral reference.
This function is similar to what Web of Science and Scopus are capable of
doing, but for much less money (in fact, no money at all). It should be noted that
the citations are forward looking (showing who cites a given article) and not
backward looking (showing whom a given article cites), as the expensive
software packages noted earlier are capable of doing.

FIGURE 5.11.   Google Scholar Features underneath Citations. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

Google, rather than limiting its citations to source lists, simply ingests all the
content it can. We can assume that Google has some criteria as to what journals
to include in Scholar and which sources to exclude. It’s just that Google isn’t
forthcoming with its criteria, so we are left to guess as to what they are doing. It
appears from what we see with Google Scholar that they include journals that,
for the most part, tend to be scholarly. Sometimes they let doctoral dissertations
into the mix, as well as newspaper content. They generally include university
institutional repository content, which can range widely: scholarly articles,
dissertations and theses, and even capstone projects. Some of this material does
not meet the scholarship level that Scopus or Web of Science would require,
which is among the reasons for the generally higher citation numbers from
Google Scholar over the other two citation tracking systems.

“Related Articles”
Google doesn’t say much about how Scholar’s related articles are gathered,
but it appears that Scholar looks for shared keywords and shared relevancy
when gathering the materials. If there is a sufficient number of related
articles, clicking “Related Articles” will invariably show 101 results, with the
first result being the article in question and the remaining 100 articles being
somehow related. This feature actually seems to work extremely well. In fact, it
could be that the relevancy Scholar provides will exceed the relevancy that
proprietary library search tools are able to provide, making this an important step
in the research process.

“All xx Versions”/Clustering of Results


Google Scholar, being metadata driven, encounters different versions of the
same scholarly article. Some of these versions may be from publishers, others
from indexing resources, aggregators, or institutional repositories. Google
clusters or groups these together rather than presenting duplicated records.
Clicking on the link with various versions will show separate records for each of
the versions. Viewing these alternate versions can be helpful when the metadata
of one version is not sufficient to “trigger” a match with the link resolver
available for your college or university, or when your university does not
subscribe to the top version being featured by Google Scholar (see Figure 5.12).

“Web of Science”
The Web of Science link within Google Scholar is visible, and works, only in
on-campus environments at universities that subscribe to Web of Science. These
links will not show up from off-campus locations. However, if
you are at a university that subscribes to Web of Science, you can get cited
references from it in addition to those supplied by Google.
FIGURE 5.12.   Clustering of Google Scholar Records. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

It needs to be noted that there are often significant differences between the
number of citations found by Web of Science and the number of citations found
by Google. This is because of differences in scope of coverage.

“Cite”/Bibliographic Citations
Because Google Scholar is metadata driven, bibliographic citations can easily
be provided, and that is exactly what Google Scholar does. Google provides
popular citation styles for Modern Language Association (MLA) style,
American Psychological Association (APA) style, Chicago style, Harvard style,
and Vancouver style. It needs to be noted that the Chicago style provided by
Google Scholar is the older Documentation I, or Notes and Bibliography, style
often preferred by researchers in the humanities. Many social science researchers
will prefer Chicago Documentation II, the Chicago Author-Date style, but this
style is not provided by Google Scholar (see Figure 5.13).
Using the Google Scholar citation link provides a limited number of citation
styles, but by linking out to commercial citation managers, an unlimited number
of citation options is available. Researchers needing one of the hundreds of other
existing styles should use the links below the citation and export the citation to
one of the four citation management software programs that are supported
(BibTeX, EndNote, RefMan, or RefWorks). BibTeX works well with Mendeley,
which is a free download. RefMan, short for Reference Manager, is no longer
supported by Thomson Reuters, although the export is really an RIS file, a
format developed by the company Research Information Systems that nearly all
citation programs are capable of importing. Most universities support EndNote
and/or RefWorks by providing subscriptions that students and faculty can access
through university-wide subscriptions. If this is not available, then individual
subscriptions or software purchases can be made at the researcher’s expense. In
any event, once a citation is imported into one of these software packages,
alternative citation styles can be selected. In this way researchers needing
Chicago Documentation II or Turabian styles can be accommodated.

FIGURE 5.13.   Citations in Google Scholar. Google and the Google logo are registered trademarks of Google Inc., used
with permission.

Scholar Citations and Bibliography Management Software


The Cite button in Scholar allows citations to be exported in several
different formats: BibTeX, EndNote, Reference Manager (RefMan), and
RefWorks. BibTeX is a format developed for the high-quality typesetting
system, LaTeX, and is often used in the sciences. EndNote and Reference
Manager, originally developed separately, are now both owned by
Thomson Reuters. ProQuest is the creator of RefWorks, a cloud-based
system that many academic libraries subscribe to on behalf of their student
body. But what if you use a different bibliographic management system?
Can Google Scholar citations be of any use to you? The answer is
absolutely yes! The best way to get citations into Zotero, Mendeley, or
most of the other citation systems is to use the RefMan link. This actually
generates a file with an .ris extension (it will be labeled “filename.ris”).
RIS is a file format designed for exchanging data between various citation
systems.
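As a concrete illustration, RIS is a simple tagged text format: each line carries a two-letter tag, a "  - " separator, and a value, and the tag "ER" closes each record. The Python sketch below (an illustrative parser, not an official RIS library; the record values are drawn from the Shi and Levy article used as an example in this chapter) shows how straightforward the format is to read:

```python
# Minimal sketch of reading an RIS export (illustrative only).
# RIS files are plain text: a two-letter tag, the separator "  - ",
# then the value; the tag "ER" marks the end of each record.

def parse_ris(text):
    records, current = [], {}
    for line in text.splitlines():
        if len(line) < 5 or line[2:5] != "  -":
            continue  # skip blank or malformed lines
        tag, value = line[:2], line[6:].strip()
        if tag == "ER":  # end of record: store it and start fresh
            records.append(current)
            current = {}
        else:
            # a tag such as AU (author) may repeat, so collect lists
            current.setdefault(tag, []).append(value)
    return records

sample = """TY  - JOUR
AU  - Shi, Xi
AU  - Levy, Sarah
TI  - An Empirical Review of Library Discovery Services
JO  - Journal of Service Science and Management
PY  - 2015
ER  -"""

records = parse_ris(sample)
print(records[0]["AU"])  # ['Shi, Xi', 'Levy, Sarah']
```

Remapping these tagged fields to whatever a local tool expects is essentially what citation managers do when they import an .ris file, which is why RIS works as a lingua franca among them.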
One of the disadvantages of Google Scholar is that citations must be exported
one at a time. This is a bit inconvenient because we become accustomed to the
functionality of vendor software from ProQuest, EBSCO, and the many other
vendors that allow export of many citations at a time to bibliographic
management software. Nevertheless, in most cases Google Scholar citations tend
to contain very clean metadata.
There are some notable exceptions to the generally clean metadata coming
from Google Scholar. Scholar seems to have trouble properly identifying book
chapters in some cases. Users need to check the accuracy of all Google Scholar–
generated citations, as they sometimes contain errors or omissions. In the case of
the article “An Empirical Review of Library Discovery Services,” the Google
Scholar citation omits the last two words of the article title, likely because there
was a line break in the original PDF. The Scholar citation looks like this:
Shi, Xi, and Sarah Levy. “An Empirical Review of Library.” Journal of Service Science and
Management 8.5 (2015): 716. [MLA Style]

The actual citation should look like this (bold added for emphasis):
Shi, Xi, and Sarah Levy. “An Empirical Review of Library Discovery Services.” Journal of Service
Science and Management 8.5 (2015): 716.

All this is to say that it is important not to blindly trust computer-generated
citations.
Let’s take as an example this chapter from an edited work:
Brown, C. C. Web Analytics Applied to Online Catalog Usage at the University of Denver. In T.
Farney and N. McHale (Eds.), Web Analytics Strategies for Information Professionals (167–181).
Chicago: American Library Association.

Notice that the Google Scholar citation does not make reference to the
chapter, but to the edited book as a whole (Figure 5.14).
The reason for this is that the metadata is supplied from the Google Books
module, which does not contain metadata for individual chapters.

“Save” to My Library
Google Scholar’s “Save” function allows saving to your personal “My
Library” page. Users first must log in to their personal Google accounts (or
Google will prompt for a login). Then citations are saved and can be accessed by
clicking the “My Library” link, as seen in Figure 5.15.
The saved articles page can be used to push citations to bibliographic
management software in groups. Although EndNote is listed as one of the export
features that are available for saved articles, RefWorks is not listed. To get
citations into RefWorks, select RefMan as the format, and an RIS file will be
saved to your computer. This RIS file can be uploaded into RefWorks rather
easily.

FIGURE 5.14.   Google Scholar Citation Failure for Book Chapters. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

FIGURE 5.15.   “My Library” Feature in Google Scholar. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

“Library Search”
This link appears when more than one library has been set up for OpenURL
linking in the Google Scholar settings. Because up to five libraries can be
specified, one of them, generally your primary library, will show up in the right
margin, while the other libraries show up under “Library Search.”

EFFECTIVE SEARCHING IN SCHOLAR


The more specifically you can frame your search, the better results you will
get. Here are some tips.
In controlled vocabulary databases, such as ERIC, PsycInfo, and Sociological
Abstracts, you are best off by spending some time studying the online thesaurus,
familiarizing yourself with the descriptor terms (broader terms, narrower terms,
related terms), and searching the descriptor or subject fields accordingly. But
when it comes to Google, there is no underlying thesaurus. Rather than thinking
like an indexer or a thesaurus, you need to “think like a full text.” In other
words, imagine the nomenclature of the discipline you are searching or search by
names of scholars you expect to be cited.
Although you can use the synonym operator (~) in front of search terms, it
doesn’t appear that this really does much in terms of retrieval of search results. A
search for plant pathogens diseases retrieved 1,310,000 results (Nov. 24, 2016),
but the search ~plant ~pathogens ~diseases retrieved 1,340,000, with the same
results appearing on the first page. Thus it seems that old-fashioned
brainstorming of keywords is necessary, just as it always has been.

JSTOR and HeinOnline in Scholar


Many times students will say “my professor told me to search JSTOR.”
Although one can certainly search the JSTOR interface if the library
subscribes, it should be noted that nearly all of JSTOR is covered in
Google Scholar. I always warn researchers that JSTOR is primarily a
journal archive that has a three- to five-year “moving wall.” In other
words, the most recent content will not likely be found in JSTOR.
Suppose the topic is soilborne plant pathogens and diseases. A search
for soilborne plant pathogens and diseases jstor will retrieve JSTOR
journal content on that topic. A problem arises, however, with JSTOR
content: the university links to content in the right margin do not always
show up. I have never been able to figure out why this is the case, but it
regularly happens. To get around this, if your library uses EZproxy to
provide off-campus authenticated access, one could use Google Chrome’s
EZproxy extension (to find this extension, just launch Google Web in
Chrome and type: google chrome ezproxy). Check with your research
librarian to see if this option is a possibility in your case.
HeinOnline contains nearly all law review content. Thus, adding the
term heinonline to any Google Scholar search retrieves law review content
on the topic. Suppose you only want legal perspectives on this topic:
“electoral college” history constitution heinonline. The retrieved content
will only be articles from the HeinOnline database—in other words, a limit
to law review articles. This strategy is useful when trying to get citations to
federal and state statutes, case law, and general legal analysis.

A simple approach to keyword selection works well.

1. Brainstorm possible keyword ideas.


2. Consider synonyms for some or all of your keywords.
3. Search for the intersection of all keywords first.
4. Search for the intersection of some, but not all, of the keywords or their synonyms.

Let’s take as an example a research topic: women employed in agriculture in
Nigeria. Ideally, you would like to see the intersection of these four notions, as
portrayed in Figure 5.16.
FIGURE 5.16.   Keyword Search Brainstorming Strategy.

You may want to change the search a bit by searching for synonyms, broader
terms, or narrower terms for each of these. For example, instead of Nigeria,
search the neighboring country Cameroon, or the broader term sub-Saharan
Africa, or simply Africa. Other synonyms you might consider: farming, labor,
and the British spelling labour.
Many times an instructor will not only require peer-reviewed journal articles,
but articles with a specific methodology. Methodologies may include qualitative
studies, qualitative methods, participant observation, focus groups, structured
interviews, field notes, reflexive journals, document analysis, or mixed methods.
If you are looking for a study with quantitative methods, put that in your search.
The reason this strategy usually works is that academic articles often clearly
state the research methods used either in the abstract or in the first several
paragraphs of the article.

GOOGLE SCHOLAR METRICS


One of the side effects of Google identifying and storing citing references is
that it can track how often articles are cited. This kind of power can be leveraged
in various ways. Web of Science uses a journal-level metric called the “impact
factor.” Google Scholar Metrics relies on a metric called the “h-index.”
detail, search Google Web like this: google scholar metrics.
REFERENCES
Beall, Jeffrey. 2012. “Predatory Publishers Are Corrupting Open Access.” Nature 489, no. 7415: 179.
Google. 2016. “Inclusion Guidelines for Webmasters.” Available at:
https://scholar.google.com/intl/en/scholar/inclusion.html#content. Accessed December 10, 2016.

6
Google Books

As early as 2002 Google executives began conceiving of a way to digitize books
and bring them into a search engine index. Google reached out to the University
of Michigan to explore the possibility of scanning the 7 million volumes held by
the library. In December 2004, the Google Print Library Project was formally
announced. The project was later renamed Google Books and expanded both
library partners (the Library Project) and publisher partners (Book Search
Partner Program) (Google 2016). Thus two types of content have contributed to
the project: legacy scans of print materials held in libraries, and “born digital”
book content contributed by publishers.
The Google Books project, if viewed as an end in itself, will seem
disappointing. The book scans contain errors, some pages are skewed,
occasional hands and fingers mar the content, and books still in copyright cannot
be viewed at all. But if viewed as a means to an end and as a way to make up for
that weak library tool, the library catalog, Google Books can be the greatest
books discovery tool to date in human history.
Let’s take a deeper look at two ways book content becomes part of Google
Books (Figure 6.1): the Library Project and the Google Books Partner Program.

LIBRARY PROJECT
Partner Libraries
Over 40 libraries have contributed or are contributing to the ongoing Google
Books project. It all started with the University of Michigan and their initial
digitizing of library books (Band 2006), and then was also taken up by the
Committee on Institutional Cooperation (CIC), later named the Big Ten
Academic Alliance. Member libraries worldwide that contribute to Google
Books include, in alphabetical order, Bavarian State Library (Germany); Big Ten
Academic Alliance (formerly known as the CIC), consisting of Indiana
University, Michigan State University, Northwestern University, The Ohio State
University, Pennsylvania State University, Purdue University, University of
Chicago, University of Illinois, University of Iowa, and University of Minnesota,
as well as University of Michigan and University of Wisconsin-Madison, which
have separate agreements, as noted later; Columbia University; Cornell
University; Ghent University (Belgium); Harvard University; Keio University
(Japan); National Library of Catalonia (Spain); New York Public Library;
Oxford University; Princeton University; Stanford University; Universidad
Complutense of Madrid (Spain); University Library of Lausanne (Switzerland);
University of California (including the California Digital Library and ten
campuses: Berkeley, Davis, Irvine, LA, Merced, Riverside, San Diego, San
Francisco, Santa Barbara, Santa Cruz); University of Michigan (part of Big Ten
Academic Alliance, but with a separate Google agreement) (Baksik 2009;
Google 2016).

FIGURE 6.1.   Google Books Content Sources.

The CIC becomes important in this process because the HathiTrust grew out
of the Michigan/Google contract as an effort to provide a permanent platform for
books digitized by Google from member libraries. In many cases, the Google
scans in the HathiTrust have more liberal access than does Google Books. More
on that later.

Not Just Books—Serials Too


One of the interesting features of the Library Project is that it is not only
books that are contributed to the project—serials are also included. “Serial” is a
term that covers any publication issued in successive parts, whether regularly or
irregularly. The value of this is primarily for journals published before the copyright public
domain cutoff date of 1923. Each of these journals should be available in full
view mode. To test this out, simply search for a periodical that existed before
1923. For example, if you search Google Books like this: atlantic monthly, you
will see several of the volumes positioned at the top of the results list. After
clicking any one of these titles and then clicking “about this book” in the left
margin, you will see “other editions” or other full years of the Atlantic Monthly
magazine. These will be available in full view, ads and all. For another test, let’s
try a historic union publication, The Journal of the Switchmen’s Union. By
clicking “about this book” you can peruse historical labor history and interesting
photos of trains. The frustrating part is that periodicals are treated as books, with
an entire library-bound issue in a single Google Books volume. It can take a long
time to navigate these results.
Academic titles work very well in the Google Books interface, including
American Journal of Mathematics, American Naturalist, Journal of the
American Medical Association, Psychological Review, and American Journal of
Education. Journals post-1922 may also be included in the Google Books
interface, but because these will not be in full view but most likely in snippet
view, they are better accessed through traditional library journal databases.

GOOGLE BOOKS PARTNER PROGRAM


Google invites book publishers to contribute directly to the Google Books
project. Publishers can choose what kind of visibility they want to provide.
Whatever their decision is, all book content will be fully searchable. Google
doesn’t say who the partners are or how much content has been contributed.
Because we don’t know what content is present or missing from Google
Books, we must approach searching it with a degree of skepticism. We can never
be certain that we are finding everything, only that we are finding something.
Some publishers want nothing to do with Google; they have their own ideas for
publicity. There are individuals who self-publish through Google Books and see
the platform as a way to publicize content that otherwise would never meet the
criteria for acceptance by a commercial publisher.
There has been discussion as to the current state of Google Books (Wu 2015).
Is it still an active project in the mind of Google’s executives? Is content still
being contributed at an aggressive rate? Are publishers seeing value for their
contributions, or is interest dwindling? We cannot say for sure.

GOOGLE BOOK VIEWS


Not all Google Books are created equal. Books in the public domain for which
there are no copyright restrictions can be fully viewed and usually printed. At the
other end of the spectrum are books whose authors or publishers do not want to
expose any content for viewing. Some books have famously, as a result of
extensive litigation, been removed from the Google Books database at the
request of the publisher or author. Works that publishers and authors wish to
promote, but not fully give away, are available with limited preview (with a
small percentage of the work viewable). Snippet view is for works that may still
be under copyright or the status is unknown.

Full View
Full view books are great when you can get them. Books published before
1923 are not subject to copyright restrictions. Other materials that may also be
available in full view include works where the author or publisher has authorized
it (although this is rare), works where copyright was never applied or has been
determined to have expired, and international documents, which are not under
copyright. It’s been estimated that less than 10 percent of the books are available
in full view (Chen 2012).
Some full view books have no e-book version available and thus cannot be
printed or downloaded. Others, typically those published before 1923, can be
downloaded in PDF format and thus printed. Because of
these variations in what is allowed and what is not allowed, the Google Books
experience will be quite mixed, even for full view books. It is best not to rely on
Google Books as the final destination for accessing or reading textual materials.
It’s best to try to find the book in a local library, either as a print book or as an e-
book. I call this final step, the step of actually getting a version of the book you
can work with, “fulfillment.” Google Books is excellent at discovery, but not so
good at fulfillment. For that, we still need the academic library.

Limited Preview
The most commonly encountered of the four views is the limited preview, at
least in terms of what academic users need in the course of their research. Most
major publishers, through the Partner Program, authorize the limited preview. In
theory, any part of the book may be exposed based on the relevance of the
keywords. But gaps or ellipses soon become evident as one tries to page through
the book.
In cases where the author or publisher has given Google permission, a limited
number of pages can be viewed as a preview. If a keyword search is carefully
framed, you may be able to view enough of the book to determine whether or not
it is highly relevant for your research purposes. Although you will not be able to
view enough of a work to use it, the discovery process exposes enough of the
book so that you know that you want it. You can then go through your normal
library procedures to get the book. This generally means 1) searching the library
catalog to see if the library owns or has access to the book in print or online
format; 2) requesting the book through your library’s interlibrary loan
procedures; or 3) consulting with a reference librarian or support staff to see if
there are other options. Sometimes works can be located through other means.

FIGURE 6.2.   Limited Preview in Google Books. Google and the Google logo are registered trademarks of Google
Inc., used with permission.

A small subset of Google’s limited preview content is available through
Google Play (Figure 6.2). Google makes Google Play available to authors and
publishers through the Google Books Partner Program. Although Google Play is
a distinct platform from Google Books, users can discover books through it and
then purchase them through Google Play. Publishers have the option of exposing
anywhere from 20 percent to 100 percent of the content of their works through
Google Books.
Snippet View
Snippet view is just basic metadata (like a card catalog/online catalog brief
record display) plus a few words around your search term showing a bit of the
context. It will show no more than three snippets. Also, for
reference works, there may be no snippets at all, because a snippet from such a
work could be divulging enough information so that the book need not be
purchased at all. But a snippet might just be enough to let you know whether a
book will work for your research or not.

No Preview
Just because some Google Books have no preview does not mean that they
have no value. Remember that the library online catalog is extremely weak and
that we need a way to discover full text in books, whether we can view the full
text through the discovery interface or not. No preview searching is like
searching blindfolded, but it is certainly better than no results at all.

SEARCHING AND NAVIGATING GOOGLE BOOKS


Like the other two Google interfaces (Google Web, Google Scholar) Google
Books has its own advanced search page. Because the location of this link
sometimes changes with interface updates, I recommend just searching Google
Web for google books advanced search. If you search in the field labeled “with
all of the words” you are in reality doing a Boolean AND search. Searching
“with the exact phrase” forces adjacency of terms. Most users do this in Google
by placing phrases within quotation marks. Searching “with at least one of the
words” is performing a Boolean OR search. The Boolean NOT search is done
when you use the “without the words” search box.
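The field-to-operator correspondence can be summarized as follows (the keywords are invented examples; the operators on the right are the ones Google accepts in an ordinary search box):

```
with all of the words:           soilborne pathogens  →  soilborne pathogens   (implicit AND)
with the exact phrase:           plant pathogens      →  "plant pathogens"
with at least one of the words:  farming labour       →  farming OR labour
without the words:               viruses              →  -viruses
```

Knowing these equivalents means the advanced search page is a convenience rather than a necessity: the same searches can be typed directly into the regular search box.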
Users have the option of limiting results to all books (all views), limited and
full views, full view only, and Google e-books only. For academic purposes I
recommend not restricting the search results. If you place initial limits on your
search, you may be excluding from the retrieved results the most relevant book for
your research. Other useful limits are searching by author or publisher. As with
the other Google interfaces (Google Web, Google Scholar), the more specific
you can make your search terms, the better the results will be, because Google
Books shows you where the search terms occur in the book.
Let’s take this sample search: African religious adherents Christianity. In
Figure 6.3 we can see several search term hits. A book with only one page
of search term hits likely would not be very helpful. The limited preview lets
you see the relevant pages, but probably not much else.

THE “FULFILLMENT” PART


Because Google Books doesn’t allow for reading or downloading entire
books, except for the 10 to 20 percent of books that are in the public domain,
users will have to find another way to get the full text. In other words, Google
Books is an excellent way to discover books, but it is probably not the place to
use the books, or in the terminology of library interlibrary loan departments,
acquire “fulfillment.” For fulfillment you can do one of two things: First, use the
“Find in a Library” feature, often found in the left margin of a Google Books
internal view, at other times found under the “Get this book in print” pull-down
menu, and sometimes not present at all. Here is how this Find in a Library
feature works. Google has partnered with OCLC (formerly known as the Online
Computer Library Center, and before that the Ohio College Library Center),
producers of the WorldCat bibliographic database, to link directly to libraries
with holdings recorded for individual books, serials, and other material types.
Because OCLC’s WorldCat is the world’s largest bibliographic database and its
most widely used utility for shared catalog records, there is a
very good chance that your local academic library has used these records to
catalog their materials. This feature is not without its problems, however. The
trend today is for libraries to load vendor-supplied machine readable cataloging
(MARC) records into local catalogs, which may or may not work in conjunction
with the WorldCat holdings. In addition, research libraries acquire records for
electronic books together with vendor-supplied MARC records. There is much
less chance that these records will be linked in the WorldCat database. Also,
many research libraries have demand-driven or patron-driven acquisition models
whereby they include MARC records in their local catalogs for books they do
not own but will gladly provide immediate access to whenever a user logs in.
For these reasons the Find in a Library feature may not be a true representation
of local library holdings.
FIGURE 6.3.   Search Term Hits with Google Books. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

That means the second option is the better one for most researchers: simply
look up the title you found in Google Books in your local library online
catalog. This way you will see all formats available for a given title, whether in
print or online as an e-book.

HOW COMPLETE IS GOOGLE BOOKS?


Whether Google Books is very comprehensive or not is really a two-fold
question: 1) How complete is the legacy scanning of books and serials through
the Library Project, and 2) How comprehensive is the publisher participation in
the ongoing contributions to Google Books?
There are several ways one can address the first question. One way would be
to go to the HathiTrust project Web site, because this is the primary avenue through
which older materials are contributed, and view their statistics
(https://www.hathitrust.org/statistics_visualizations). There is significant, but not
total, overlap between HathiTrust digital content and Google Books digital
content. Another way would be to review comments from the library science
scholarly literature. Chen (2012) notes that “[a]s of late 2010 and early 2011,
there were hardly any WorldCat books that Google Books could not retrieve.”
It should also be noted that certain kinds of materials do not fall within the
scope of the Google Books project because of format or size. Small pamphlets or
large folios do not fit the scanning model Google is currently using and are thus
passed over.
I occasionally put Google Books to my own test, just to see how complete it is
and how much I can trust it. I go to my local public library, grab random titles
off of shelves, and search the titles in Google Books. I do the same thing at the
academic library where I work. Each researcher should do this kind of testing
to ensure that they are getting the results they expect. Another way is to
look for online front lists of books from academic publishers and search their
titles in Google Books. To date I have been very heartened that Google Books
seems to be doing a good job of representing current publications.

Library Catalog vs. Google Books


You need to write a paper about a Latina attorney from Texas, Adelfa
Callejo. A search of the local online catalog reveals a single resource:
Latinas in the United States: A Historical Encyclopedia, but you want
more. Try these steps.

1. Search Google Books: “Adelfa Callejo”. This yields a couple of limited-view Google
Books results that lead you to want to get the books. The titles are Las Tejanas: 300 Years
of History and Texas Women: Their Histories, Their Lives.
2. Search your local library catalog. In the case of the first title, the University of Denver
Library had this title in print format. The second title was available online as an e-book.

The local catalog could not provide deep full-text access to book content
the way Google Books could. But once Google Books uncovered
additional helpful titles, it could not deliver the books themselves.
Switching back to the local library catalog with that additional
information in hand, you are able to get two additional resources.

GOVERNMENT DOCUMENTS IN GOOGLE BOOKS


Since the beginning of U.S. history all three branches of the federal
government have made it a priority to disseminate government information to
citizens. As evidenced by the publication of the massive United States
Congressional Serial Set, the systematic distribution of public documents to
libraries in every state, then later the modern Federal Depository Library
Program, and the current online dissemination, access to the workings of
government through publications is one of the hallmarks of this country.
Through the library partners program in Google Books, many of these legacy
print documents have been scanned and are included in Google Books.
In theory all U.S. government publications should be fully viewable with
printing and download privileges through Google Books. But in reality it doesn’t
work that way. A search for the book Invasion, Intervention, “Intervasion”: A
Concise History of the U.S. Army in Operation Uphold Democracy allows users
to read the entire work, but not to print or download anything. However, the
federal publication Battle Participation of Organizations of the American
Expeditionary Forces in France, Belgium, and Italy. 1917–1918 is not only
viewable but also able to be fully downloaded in PDF or EPUB formats.
Although U.S. government publications are generally free of any copyright
restrictions and should be freely available as full view books in Google Books,
Google does not do a very good job of curating this. Let’s take this congressional
hearing as an example:
Congressional Reports Elimination Act of 1982: Hearing Before a Subcommittee of the Committee on
Government Operations, House of Representatives, Ninety-seventh Congress, Second Session, on H.R.
6005 … July 29, 1982.
FIGURE 6.4.   U.S. Government Publication with Snippet View in Google Books. Google and the Google
logo are registered trademarks of Google Inc., used with permission.

Google Books shows this only as a snippet view (Figure 6.4).


The workaround for this is to make a beeline for HathiTrust (hathitrust.org).
Searching that same title in HathiTrust immediately gives us full-text access to
this congressional hearing (Figure 6.5).
One thing about full view books in HathiTrust needs to be mentioned. For
libraries that are not partner members of HathiTrust, entire public domain
documents may be viewed, but can only be downloaded one page at a time.
HathiTrust partners are able to download these materials in full.

MAGAZINES IN GOOGLE BOOKS


Why mention magazines in the context of searching Google for scholarly
purposes? Magazines can reveal the culture of the time. Full-page scans showing
advertising, headlines, and color photos can add to many kinds of research
endeavors. Google Magazines are very far from comprehensive. Most academic
libraries will have subscription-based tools that do far more than Google
Magazines can do. Nevertheless there are many libraries that do not have access
to these tools.
FIGURE 6.5.   HathiTrust Augments Google Books for Government Publications Full-Text Access.

There are two ways to access Google Magazines within Google Books: by
browsing and by searching. To browse English language magazines, search
Google Web like this: google books magazines, or simply go here:
https://books.google.com/books/magazines/language/en. Note that the last two
letters of this URL are en for English. That should raise your curiosity. What if
we substituted es instead? You would then find Spanish language magazines in
the project. Try also fr for French. Sorry, I haven’t discovered any other
languages.
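The URL pattern just described can be expressed as a tiny sketch. A minimal example in Python (the function name is my own, and only the three language codes mentioned above — en, es, and fr — are assumed to exist):

```python
# Build the Google Books magazine-browsing URL for a given language code.
# Only "en", "es", and "fr" are confirmed above; others are untested.
BASE = "https://books.google.com/books/magazines/language/"

def magazine_url(lang_code):
    """Return the browse URL for magazines in the given language."""
    return BASE + lang_code

for code in ("en", "es", "fr"):
    print(magazine_url(code))
```

Pasting any of the printed URLs into a browser takes you straight to the browse page for that language.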
To search within Google Magazines, use the Google Books advanced search
page (to find it, simply search Google for: Google Books advanced search). Here you can figure out
how to search by keywords and to limit by date. Because Google Books (and
Magazines along with it) is completely controlled by metadata (unlike much of
Google Web), you can be assured that your result set is extremely precise in
terms of date searching across the searchable content.

GOOGLE BOOKS AND HATHITRUST


The Google Books project is intertwined with the HathiTrust initiative. Not all
Google Books partners are members of HathiTrust, and not all HathiTrust
members contribute their scans to Google Books. It’s just that there is a
significant overlap between the two.
FIGURE 6.6.   Contrast between Google Books and HathiTrust for Out-of-Copyright Publication.
Google and the Google logo are registered trademarks of Google Inc., used with permission.

This becomes especially important when it comes to works in the public
domain, as we noted earlier in the discussion about government publications.
Very often Google will lock down a text that should have been publicly
viewable. This often happens in the case of U.S. public documents, which are
not under copyright protection and thus are in the public domain.
The basic principle is that whenever you find a public domain book or
document in Google Books, search that same title in HathiTrust. You will very
likely have a much better experience. HathiTrust does not allow downloading of
entire documents, except to member libraries, but users can view the entire
document through their interface in multiple modes of viewing (scrolling, page
flipping, thumbnails, page by page, or plain text), something not possible with
Google Books.
HathiTrust also helps with serial publications. Let’s take a serial that also
happens to be a U.S. government publication, United States Geological Survey
Yearbook. Searching this title in Google Books brings up a snippet view of the
1994 annual volume. We note that this scan is from the University of Michigan.
Under “other editions” we can see, in unsorted order, various volumes from the
1980s. It’s very difficult to make sense of these records, especially because they
seem to all be in snippet view. But searching that same title in HathiTrust we can
see the entire issue, the same Google scan that was so tightly locked down by
Google. Note that in Figure 6.6, on the left Google inserts a watermark
“copyrighted image.” Actually, it isn’t copyrighted at all. It’s a U.S. government
document, paid for with taxpayer dollars. HathiTrust rightly makes it available
for public viewing.
GOOGLE BOOKS AND THE INTERNET ARCHIVE
Although not as large a public domain archive as HathiTrust, the Internet
Archive’s eBooks and Texts subsection (search: internet archive texts) is rich
with content that is easy to read online and download in multiple formats. In
addition, Microsoft’s book digitization efforts from the past have been archived
in the Internet Archive. These are available for browsing or searching. Rather
than giving the long URL to the project, simply use Google to search: microsoft
book digitization site:archive.org.
As a reference librarian, whenever someone comes to the reference desk
wanting to borrow a microform of an old book, I look it up first in HathiTrust,
then Internet Archive, and finally, as a last resort, in Google Books.

Internet Archive vs. HathiTrust vs. Google Books


In addition to HathiTrust, the Internet Archive Open Library
(https://archive.org/details/texts) contains over 10 million fully accessible
e-books and texts. Some of these are in both Google Books and
HathiTrust, whereas others are only in the Internet Archive. If you find a
work in here, you have many ways to download the full text and an easy
page-flipping experience. The scans are generally of extremely high
quality.
As an example, Google Books has a full view Somali-English dictionary
from 1897:
De Larajasse, Evangeliste. Somali-English and English-Somali Dictionary. K. Paul, Trench,
Trübner & Company, 1897.

Although the e-book is free to download from Google Books, this work
is also in HathiTrust and Internet Archive. The HathiTrust version can only
be downloaded by HathiTrust member libraries, but the same edition in the
Internet Archive is free to be downloaded by anyone.

FINGERS AND HANDS IN GOOGLE BOOKS


As ambitious and great as Google Books is, it has been rightly criticized for
scans of employees’ fingers, hands, and skewed images, as noted by the Web site
http://theartofgooglebooks.tumblr.com/. But in fairness to Google, one can see
that some of these miscues have been fixed by the Google scanning partners.
Pages documented by the aforementioned Web site have occasionally been
replaced with corrected scans. All this is to say that the Google Books scanning
project is far from perfect, but its place in searching history has changed the
game forever.

CITING GOOGLE BOOKS


There are three ways one can get automatic citations to books in Google
Books:

1. Click the “About This Book” link and you will see some available citation formats at the bottom of
the page. These are online export options in BibTeX, EndNote, or RefMan formats. It is very likely
that this roundabout way will not work for many users, so one of the other two approaches may prove
better.
2. Take the title of the book and search for it in Google Scholar. If you are lucky, the book will appear
also in Google Scholar with the cite button clearly visible.
3. Look for the “Find in a Library” button for the book you are in. The link may be hidden under the
“Get this book in print” option. Sometimes that button does not exist at all (in cases where an OCLC
control number is not available in the underlying metadata). But when you do see that link you will
be passed along to the OCLC WorldCat interface. You will then see the “cite/export” link in the
upper-right area of the resulting Web page. This option gives you the greatest choice of citation
output formats, including Chicago author/date style, and Turabian style.

Use one of these three methods, leveraging Google Books metadata, to
accurately cite Google Books in your research.

REFERENCES
Baksik, Corinna. 2009. “Google Book Search Library Project.” In Encyclopedia of Library and Information
Sciences. DOI: 10.1081/E-ELIS3-120044502.
Band, Jonathan. 2006. “The Google Library Project: Both Sides of the Story.” Available at:
http://hdl.handle.net/2027/spo.5240451.0001.002. Accessed December 12, 2016.
Chen, Xiaotian. 2012. “Google Books and WorldCat: A Comparison of their Content.” Online Information
Review 36, no. 4: 507–516.
Google. 2016. “History of Google Books.” Available at:
https://books.google.com/intl/te/googlebooks/history.html. Accessed December 11, 2016.
Wu, Tim. 2015. “What Ever Happened to Google Books?” New Yorker, September 13, 2015. Available at:
http://www.newyorker.com/business/currency/what-ever-happened-to-google-books. Accessed
December 11, 2016.

7
Google as a Complement to Library
Tools

Librarians often hear from students that their instructor said not to use Google in
their research. I believe that what is meant in most cases is that students
shouldn’t only use resources that they just found with simple Google searches.
Too many times students cite Wikipedia as an authority for their research. But
what this book is arguing is that there is a proper place for Google in academic
research. Google augments what libraries can offer with the expensive licensed
content. What is horrifying to professors, as well as reference librarians, is that
students sometimes use Google as the only resource for academic research. This
could be a carryover of bad habits from high school, a lack of critical inquiry
skills, or ignorance of what resources libraries have that are beyond the reach of
Google Web.
As powerful as Google is, there is much content that is beyond its reach. We
have already discussed these limitations: Google Web doesn’t crawl where it is
not wanted; many databases have no Google Web presence because their
technology doesn’t allow for it; Google is primarily a text-based resource;
numerical, sound, and video data and metadata are often impossible to contribute
to the giant Google index; and passwords and firewalls prevent Google from
grabbing proprietary content, of which there is an entire universe. Although
Google Web does a good job of exposing primary source content, especially
current content hosted on governmental and organizational Web sites, it is very
spotty in its ability to dig into content contained in the archives of libraries,
museums, and historical societies, much of which has not been digitized or even
indexed in online finding aids.
Google Scholar likewise has its limitations. Not only is the most current
content not immediately contributed to Scholar, but not every scholarly journal is
represented. Sometimes only metadata is present, but not full text. In addition,
Scholar tends to overlook nonscholarly resources such as newspapers—both
current and historic—popular magazines, and trade journals. Dissertations,
although they may be in Scholar because of their presence in institutional
repositories from various universities, constitute only a small subset of all
dissertations and theses completed over time.
Google Books has its limitations as well. Looking back in time, not every
book is owned by the scanning partner institutions. Some books, even if owned,
do not fit the Google Books specifications in terms of size—they may be large
folios or small pamphlets and thus not be eligible for scanning. Google Books
does not contain metadata for book chapters, making retrieval and citation a
challenge at times. And then there is “the gap”—the nearly century-long period
from 1923 to the present when freely available online book content is hard to
discover. This is the reason why so many digitization projects end at 1922.
It is at this point that researchers should turn to their academic libraries. To
understand what libraries have to offer, it is essential to know how libraries
organize information and make it available. We will look at the tools available in
the typical academic library and how to approach these tools in relation to our
Google search skills.

WHAT’S WRONG WITH ACADEMIC LIBRARIES


This alarmist-titled subdivision is only intended to address search and
discovery problems encountered in academic libraries, not all problems like
staffing, budgets, and accessibility. It’s not an academic library’s fault that
technology is in the state it is today. But the academic researcher needs to be
aware of why libraries provide discovery options in the ways that they do and
how these often motivate students to run to Google first rather than to persevere
with quality library resources.

Library Catalogs—Not Powerful Enough


Chapter 1 briefly surveyed the evolution and development of library catalogs.
With all the power that they have, they still do not provide the level of deep
searching that researchers have come to expect. Many times when I am sitting at a
reference desk, students come up to me and say, “Your library doesn’t
have any books on my topic.” What that usually means is that they couldn’t find
any books, either because they weren’t searching with the library search
protocols that are expected or the library search tools simply did not search
deeply enough into the full text to retrieve the expected information.
Library catalogs function best as inventory control systems and only
perform basic searches across metadata, as described previously. Even though a
high percentage of new books added to academic library catalogs are online e-
books, the library catalog itself is not capable of reaching down into the texts to
provide discovery. So how can the researcher use Google Books to enhance
retrieval of book materials in an academic library? By using Google Books in
tandem with a local library catalog, users can have the best of both worlds:
searching across the full text of most English language book content and being
able to locate that book within the academic library. Using the University of
Denver catalog as an example (library.du.edu), let’s do a search for Japan AND
bullet trains AND history. Let’s ignore the fact that perhaps better or different
search terms could have been selected. When I perform this search in the
University of Denver catalog, I get two results, neither of them
really adequate. Doing the same search in Google Books yields 2,670 results. I
have many more results to review. One of the interesting titles I notice in the
Google Books results is The Second Age of Rail: A History of High Speed
Trains. Although for this book there is no preview in Google Books, the title
tells me that this book is highly relevant to my topic. Using the “find in a
library” button in the left margin, I can use WorldCat to see which libraries own this
book. I do a search in my local library catalog only to discover that the
University of Denver does not own this book. I can then pursue an interlibrary
loan through my home library.
A second title from my Google Books result set also looks promising:
Shinkansen: From Bullet Train to Symbol of Modern Japan. However, in this
case, the “find in a library” link is subordinated under the “Get this book in
print” pull-down menu. When I click the link, it leads to a different
book entirely. This is a case of international standard book numbers (ISBNs)
somehow not matching up.
The lesson is: if Google Books automatic library linking features work for
you, then you have no more work to do. If they fail, then simply look up the title
in local library catalogs.
As I keep going through the list of Google Books results, I come across Japan
in the 21st Century: Environment, Economy, and Society, which from the
preview looks relevant. I then search this title in the library catalog and discover
that a print version is owned by the library. What I couldn’t discover with the
library catalog I was able to discover using Google Books. We can’t blame
libraries for not being able to search the full text of everything on their shelves,
but we can think of Google Books as a library helper tool that makes our
research faster and more efficient.

Library Discovery Tools—Not Deep Enough


In the late 1990s and early 2000s library vendors attempted to solve the
problem they had partly created for libraries: how to provide access to the many
silos of electronic information owned and subscribed to by libraries. To rectify
this problem they created tools they called “federated search” tools. A better way
to describe these tools is metasearch or broadcast search tools. I argue that these
tools were never a true federated search, but rather a federation of results coming
from disparate information sources (see Figure 7.1).
FIGURE 7.1.   Broadcast Searching Model of Late 1990s to Early 2000s.

Under this model, users placed their search terms in the search box. When
they submitted the search a query was sent out to various information silos,
sometimes as many as 50 to 100 silos at once. Connection protocols then took
place, search terms were entered, and users waited—usually a very long time—
for results to return. Many times these searches would take over one minute,
with the search tools attempting to index these records “on the fly,” and to merge
and de-duplicate the records for presentation to the user. These endeavors were a
failure. In an age of immediate search results, with Google taking fractions of a
second, these tools would take 20 to 60 seconds or more. Users didn’t have the
patience for this.
Taking lessons from Google Scholar, vendors rethought the model of
information discovery and came up with tools they called “web-scale discovery,”
or simply “discovery” tools. Rather than linking out to numerous information
silos and waiting for results to return, vendors built a model much like Google
Scholar. They loaded metadata into a central server and indexed it. Some
vendors incorporated selected full text into the central server. But vendors had
a much more difficult time acquiring full text from publishers than Google
Scholar, with all of its leverage, did. These discovery tools really
were the true federated search in that one single server was being searched,
eliminating the outside protocol connections and the wait time for return of
information and the tedious tasks of merging and de-duplicating the records. But
vendors had already used the term “federated search” for the failed broadcast
search tools. For that reason they had to brand the new tools as “discovery”
tools.
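The architectural shift described above can be sketched in a few lines of illustrative Python. Everything here (silo names, records, merge logic) is a hypothetical toy, not any vendor's actual implementation; it only shows why searching one pre-built central index beats querying many silos at search time:

```python
# Toy contrast of broadcast ("federated") search vs. web-scale discovery.
# Silo names and records are hypothetical examples.
SILOS = {
    "catalog":  [{"id": 1, "title": "Shinkansen"}],
    "articles": [{"id": 2, "title": "High Speed Rail"},
                 {"id": 1, "title": "Shinkansen"}],  # duplicate of the catalog record
}

def broadcast_search(query):
    """Old model: query every silo at search time, then merge and
    de-duplicate the returned records on the fly (slow at 50-100 silos,
    since each silo is a separate remote round trip)."""
    seen, merged = set(), []
    for records in SILOS.values():
        for rec in records:
            if query.lower() in rec["title"].lower() and rec["id"] not in seen:
                seen.add(rec["id"])
                merged.append(rec)
    return merged

# Discovery model: all metadata is loaded and de-duplicated into one
# central index ahead of time, so each search is a single local lookup.
CENTRAL_INDEX = broadcast_search("")  # empty query matches everything

def discovery_search(query):
    return [r for r in CENTRAL_INDEX if query.lower() in r["title"].lower()]
```

The merging and de-duplication work has not disappeared in the discovery model; it has simply been moved from search time to load time, which is why results return in fractions of a second.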
The real power of these licensed (and expensive) discovery tools is in the fact
that the search takes place against a single “pot” of metadata. This pot is rather
broad in scope, encompassing metadata for books from the library catalog,
newspaper articles, magazine articles, dissertations and theses, scholarly articles,
videos, sound recordings, archival materials, and many other materials owned or
subscribed to by academic libraries. They do not search everything—only those
resources compliant enough to be able to have metadata contributed to the
respective discovery service. The strength is in the breadth of coverage, not the
depth of indexing.
In contrast, Google Scholar is narrow in breadth, searching across scholarly
journal articles from a variety of sources, as well as “bleed through” from
Google Books. The difference is that the depth of searching is nothing short of
transformationally impressive (see Figure 7.2).
As of the writing of this book, the biggest players in this market are ProQuest
with their Summon product, ExLibris (now a ProQuest company) with Primo,
EBSCO’s EDS or EBSCO Discovery Service, and OCLC’s WorldCat Discovery.
These products all vary as to underlying architecture and capabilities.

The Information Access Anomaly


Let’s examine why library tools appear to be (and actually are) weak in
the eyes of today’s researchers. Let us consider a typical print academic
book. For the sake of simplicity let’s assume that an average book contains 200
pages of actual text and that it has an average of 400 words per page
(WritersServices 2001). That means an average such book contains 80,000
words. Now when a library catalog performs a search, it doesn’t search the full
text of the book, it searches certain parts (fields) of the catalog record. Typically
a catalog keyword search would search all the author fields (personal author,
corporate author, and any added author entries), all the title fields (including the
title proper, but also series title, spine title, translated title, and any other such
fields), subject fields (whether they be subject headings, subject descriptors, or
subject keywords), and notes field. These catalog records are surrogates; they
stand in place of the whole work. If a cataloger/indexer has done an adequate
job, the surrogate record should accurately describe the work in terms of
physical description and extent, and it should also adequately describe the
general “aboutness” of the work. The catalog record, being a brief surrogate, is
only able to capture the main topics of the book, but not the details. Some
catalog records may contain tables of contents, chapter titles, and subheadings
that delineate the book outline in greater detail.
FIGURE 7.2.   Discovery Tools as True Federated Search Tools.

When catalog records are searched through an online catalog, generally it is
the author, subject, title, and notes fields, as described earlier, that are searched.
When you total up the indexable words in all the author, title, subject, and notes
fields, you may be searching 50 to 100 words. For the sake of this illustration,
let’s say that an average of 75 words are searched. This has huge implications in
terms of discovery of book information. Thus, a 200-page book, with 80,000
words would only have 75 words discoverable via the online catalog. I call this
problem the “information access anomaly.” The ratio of indexable words in a
typical catalog record (75) divided into 80,000 words of the entire book yields a
surrogate record to full-text ratio of roughly 1 to 1,067. This is why students so quickly
give up on library catalogs for discovery and go to Google.
Let’s look now at the surrogate record to full-text ratio of typical scholarly
journal articles. Journal articles are smaller in size than books. Let’s say for the
sake of argument that an average scholarly journal article is 15 pages in length,
with 400 words per page. This yields 6,000 words. But journal article index
records no longer need to economize on space. Instead, these articles are often assigned
multiple subject descriptors or subject headings, usually many more subjects
than are assigned to books. In addition, abstracts, whether they are author-
produced abstracts or added by professional indexers, tend to be several
paragraphs in length. For our argument, let’s just say that a surrogate record (that
is, an index record) for a scholarly article is 300 to 500 words in length. Let’s say
an average of 400 words. Dividing 400 into 6,000 yields a surrogate record to
full-text ratio of 1 to 15. This explains why journal articles tend to be easier to
discover than books for most researchers. Table 7.1 shows the differences in
depth of indexing between searching of surrogate records and full text in book
content and journal content.
Now let’s apply the idea of a ratio to Google. Google Scholar generally
indexes the full text of nearly all scholarly articles it has ingested. We can say
that Google Scholar has a surrogate record to full-text ratio of 1 to 1. Likewise,
even though not every book has been ingested into Google Books, for those that
are represented, the surrogate record to full-text ratio is 1 to 1.

TABLE 7.1. The Information Access Anomaly: Assumption of 400 Words per Page (WritersServices 2001).

                                  Book (Average)           Journal Article         Google (Scholar/Books)
Typical length full text (FT)     200 pages × 400 =        15 pages × 400 =
                                  80,000 words             6,000 words
Surrogate record (catalog or      50–100 words             300–500 words
index metadata) (SR)              (75 word avg)            (400 word avg)
SR to FT Ratio                    1 to 1,067               1 to 15                 1 to 1

Source for the assumption of 400 words per page: WritersServices 2001.
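The arithmetic behind Table 7.1 is easy to check; a quick sketch using the chapter's assumed figures (note that 80,000 ÷ 75 works out to roughly 1,067):

```python
# Surrogate-record-to-full-text ratios, using the chapter's assumptions:
# 400 words per page, a 75-word average catalog record, and a 400-word
# average article index record.
WORDS_PER_PAGE = 400

book_full_text = 200 * WORDS_PER_PAGE       # 80,000 words
book_surrogate = 75
article_full_text = 15 * WORDS_PER_PAGE     # 6,000 words
article_surrogate = 400

book_ratio = book_full_text / book_surrogate           # about 1,067
article_ratio = article_full_text / article_surrogate  # exactly 15

print(f"Book: 1 to {book_ratio:,.0f}")
print(f"Article: 1 to {article_ratio:,.0f}")
# Google Scholar and Google Books index the full text itself,
# so their surrogate-to-full-text ratio is 1 to 1.
```

The two orders of magnitude between the book ratio and the article ratio are the whole story of why articles are so much easier to discover through library tools than books are.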

THE THREE GOOGLES


Over the past decade I have presented a workshop I often call “The Three
Googles.” The catchy title highlights the fact that Google really has three distinct
interfaces, with different features, each of which has great potential to aid in the
academic research process. The 45-minute workshop briefly covers the topics
presented in more depth in this book: finding primary sources with Google Web,
locating relevant scholarly articles using Google Scholar, and going beyond what
your library online catalog can do by finding book content in Google Books.
I tie all this together with a diagram showing the features of each (Figure 7.3).
During the presentation I explain how to search Google Web for primary
sources: first by trying to figure out what entities care about the topic under
discussion. These may be international organizations, nongovernmental
organizations, foreign governments, agencies of the U.S. government,
subnational entities like state governments, and finally local governments. Then,
we restrict Google Web search results by top-level domain (TLD) and possibly
by file type. In cases where Google will not take us directly to the content we
desire, we need to explore indirect search strategies to find content on the deep
Web, also known as the hidden Internet. In most cases Google Web allows users
to discover and get primary sources and retrieve the actual document
(fulfillment), many times a PDF or other file format.
FIGURE 7.3.   The Three Googles Model.

I then go on to explain the background and development of Google Scholar—
how content from publishers is exposed to Google Scholar, both the metadata
and full PDFs in most cases. But academic libraries also give Google Scholar
their library holdings so that all this information together makes Scholar what it
is. Libraries that have a significant number of licensed subscriptions will enable
not only discovery of content, but also fulfillment in cases where they subscribe.
Google Books, however, is different. I show how discovery of book content is
much deeper than what is possible with library online catalogs but that
fulfillment will not happen. Unlike with Google Web and Scholar, users will not
be able to read, download, or print out the entire book. Users must search for the
title in their local library catalog or use the “Find in a library” feature within the
interface. As long as we realize what Google Books is good at and don’t expect
that we will be able to read full books through the interface, we will be prepared
to partner with our local academic libraries to fulfill our research needs.
The “three Googles” approach shows researchers that Google can do
much to advance the scholar’s work, but in itself it is not sufficient. The millions
of dollars spent each year in academic libraries buy much more than Google
can even begin to expose through the three Googles.

WHAT’S RIGHT WITH ACADEMIC LIBRARIES


Beyond Google Web: Primary Source Materials and Academic Libraries
We demonstrated in Chapter 4 how Google can be used to locate primary
source materials. Using site-specific domain searching we can leverage Google
to force resources relevant to our research to the top of the list. We can further
limit by file type to see PDF files or other relevant file types. Let’s examine
some of the categories of materials that libraries provide, both in tangible
formats and online.
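The site- and file-type restrictions just described can be sketched as a tiny query builder. This is a hypothetical helper for illustration only: Google's site: and filetype: operators are real, but this function is not part of any Google tool.

```python
def google_query(keywords, site=None, filetype=None):
    """Assemble a Google Web query string with optional site: and
    filetype: operators (hypothetical helper, for illustration only)."""
    parts = [keywords]
    if site:
        parts.append("site:" + site)          # restrict by TLD or domain, e.g. site:int
    if filetype:
        parts.append("filetype:" + filetype)  # restrict by file type, e.g. filetype:pdf
    return " ".join(parts)

# Example: look for PDFs on international-organization domains
print(google_query("refugee resettlement statistics", site="int", filetype="pdf"))
# -> refugee resettlement statistics site:int filetype:pdf
```

Pasting the resulting string into the Google Web search box applies both restrictions at once.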

Archival Materials
Archives and libraries are cousins in the cultural heritage world—a world that
includes archives, museums, and libraries. Whereas libraries tend to focus on
published materials (works such as books, of which multiple copies exist in many
libraries), archival collections tend to be unique holdings of letters,
papers, minutes, photographs, transactions, and even objects.
Recently there has been a movement among archives to place finding aids to
their collections on the Web and even to digitize content and make it available
through a library Web site or an institutional repository. In addition library
vendors have been tripping over each other to get this unique content digitized
and sold through their platforms to libraries. Among such vendors are Gale,
ProQuest, Adam Matthew, Newsbank, East View Information, and Alexander
Street. Although it could be argued that content libraries publish on their own
through institutional repositories is discoverable with Google Web, content
through the vendors is not. This content is not inexpensive. Libraries often have
the option of either subscribing to this content or purchasing it. This archival
content would not generally be discoverable by any of the “three Googles.”
Content of these vendor-sourced collections is sometimes arranged topically
(slavery, civil rights, fashion, etc.), sometimes chronologically, and sometimes
biographically. Content that for decades required a trip to a university archive,
with all the plane fares, hotel nights, and letters of introduction, now can take
place digitally. Of course, not every research library can afford these expensive
subscriptions, so in some cases travel is still necessary. The point is that Google is not
able to drill into these proprietary subscription databases.
On just the topic of slavery, vendors today offer collections including American
Slavery Collection, 1820–1922 (Newsbank); Slavery and Anti-Slavery: A
Transnational Archive (Gale); Slavery, Abolition and Social Justice (Adam
Matthew); Sources in U.S. History Online: Slavery in America (Gale); Black
Abolitionist Papers (ProQuest); and Slavery in America and the World: History,
Culture & Law (Hein).

Archived Magazine Content


Google Books has its magazine scanning project, but this cannot begin to
compare with the content academic libraries pay for through vendors. As an
example, the Vogue Archive (ProQuest) covers the entire run of U.S. Vogue
magazine from 1892 to the present. The Nation Archive (EBSCO), a historical
archive of the famous magazine, covers 1895 to the present.

The Internet: The New Microfilm


Remember microfilm, microfiche, the readers you had to use to access them,
and the slimy paper printouts from microform printers? You may not remember
the years when microforms were popular in libraries, but most research libraries
still have many cabinets filled with these things. During the 1960s through the
1990s microfiche distribution was a way for libraries to purchase newspaper
content, little-used periodicals, archival collections, and doctoral dissertations in
a way that saved space and money. This format was a kind of precursor to the
Internet: it took content that was difficult to handle in physical form, or that
existed only in a single archive, and made it possible to distribute it far and
wide. The reason I am discussing microforms is that many of the microform sets
that libraries paid for in the 1960s, 1970s, and 1980s are now available for
purchase again in online format. Many of these collections are not discoverable
via direct Google searching.

Historic Newspaper Content


Print newspapers are such a valuable resource. Many newspaper aggregators
present current newspaper articles as text-only files. But when they were
available in microform, users could see the entire broadsheets in context,
advertising and all. Research libraries today are able to purchase full newspaper
content—sometimes scanned from microforms, sometimes rescanned from the
original print—and search them in full text. These pricey products are not seen at
all by the Three Googles. Examples of this include the [London] Times Digital
Archive from 1785, the New York Times from 1851, the Los Angeles Times from
1881, and hundreds of other newspapers, every issue, from cover to cover.

Market Research Reports


These reports are compiled by private research firms under contract for
businesses. Sometimes these reports are made available to other parties, like
libraries, on a subscription basis. Individual reports and collections of reports
packaged for academic libraries can be very expensive, because they were very
costly to produce in the first place. Libraries with business school
programs are the most likely to have these. If you don’t believe me that these
reports are expensive, just try this Google search: market research report airline
industry site:com. Click on a few links in the result set and you will be
convinced.

Streaming Videos and Audio


We get so accustomed to looking to sources like YouTube for video content
that we may forget that libraries subscribe to large amounts of content through
various online streaming services. Some of these services provide feature films
on demand, others offer academic videos such as these projects: American
History in Video, Counseling and Therapy in Video, Opera in Video, Theatre in
Video, and World History in Video (each from Alexander Street); various
streaming documentaries from Docuseek2 and Kanopy; as well as many other
news, training, and performance video services. Expensive documentary films
are also typically collected by academic libraries. Few of these are freely
available, but they are further evidence of the value added by libraries.

Beyond Google Scholar: Journal Content through Subscription Databases
Although Google Scholar is fairly reliable and comprehensive for searching
scholarly content, there are reasons to search library databases as well. I often
advise students to start with Google Scholar first, because it is fast, efficient, and
produces more results quickly, but then to “play clean up” in specific library
databases. There are several advantages to searching scholarly articles in library
databases and not just relying on Google Scholar:

1. Library databases generally add current content faster than does Google Scholar.
2. Often individual databases offer controlled vocabularies. Although imperfectly applied in some cases,
controlled vocabularies (constructed and applied using thesauri) are an efficient way to get to “all and
only” the desired results.
3. Databases that contain scholarly articles often also contain related kinds of publications like book
chapters, dissertations or theses, conference proceedings, and book reviews, which may not be
covered adequately or at all in Scholar.
4. Proximity searching in library databases is often possible across full-text content. This usually allows
for a greater degree of precision, a feature often necessary for advanced research projects.
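The proximity searching mentioned in point 4 can be illustrated conceptually. This is a minimal sketch of a NEAR/5-style operator; the function and its behavior are illustrative assumptions, not any vendor's actual implementation.

```python
def near(text, term_a, term_b, distance=5):
    """Sketch of a NEAR/n proximity operator: True if term_a and term_b
    occur within `distance` words of each other in the text."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

print(near("the treaty formally ended the war", "treaty", "war"))
# -> True (the terms are within 5 words of each other)
print(near("the treaty was signed long before fighting in the war ended",
           "treaty", "war"))
# -> False (the terms are more than 5 words apart)
```

Matching terms only when they occur close together is what gives proximity operators their precision advantage over a plain AND search.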

Not counting the library discovery tools already discussed earlier in this
chapter, scholarly articles are generally featured in two kinds of databases:
publisher portals and aggregator databases.

Publisher Portals
Most journal publishers view the world as starting from themselves and
produce search interfaces that search only their content. These can be very
powerful, but also very limiting, in that only their publications are featured. If
libraries followed vendors’ advice, users would experience even more frustration
than they already experience, because they would need to search several dozen
interfaces, each with their own idiosyncrasies, in order to do comprehensive
research.
Many research libraries will have publisher databases from Elsevier
(ScienceDirect), Wiley, Springer, Sage, Taylor & Francis, and others. Partner
projects like JSTOR, although not directly publisher generated, work with
publishers as a hosting venue for older scholarly content. Many university
presses also have their own publisher portals, such as Cambridge University
Press and Oxford University Press.

Aggregator Databases
Aggregator databases operate under license agreements with publishers;
they sometimes index and point to publisher content via OpenURL technology,
but more often they have rights to host publisher PDF content on the
aggregator site itself. The advantage of this is that the scholarly publications of
many publishers can be searched at one time. In many cases the PDF publisher
content has been OCR-ed and thus can be searched in full text through the
aggregator interface.
Examples of these large aggregator products include EBSCO’s Academic
Search and Business Source (each with varying numbers of titles available),
ProQuest’s product ProQuest Central, and Gale’s Academic OneFile. In Chapter
1 we discussed the “back-generated” subjects created by aggregators, with the
strengths and weaknesses of this methodology. Nevertheless, these resources
should not be overlooked in the pursuit of scholarly journal articles.

Beyond Google Books: Book Content in Print and Online


Historic E-books
E-books are extremely popular today, even if they are annoying with all of the
digital rights management (DRM), software requirements, and other usage
limitations. But there are some stunning historic e-book collections that vendors
have put together. The Pollard and Redgrave Short Title Catalogue of Early
English Books from 1475 to 1640, originally accompanied by a microfiche
collection, is now completely available as an online e-book package. Wing’s
Short-Title Catalogue of Books Printed in England, Scotland, Ireland, Wales, and
British America, and of English Books Printed in Other Countries, 1641–1700
was also originally accompanied by a microform set and is likewise now available
through ProQuest for online access. It’s true that some of these books may be
available in Google Books, but Google Books generally does not scan rare
materials. HathiTrust is a more likely place, but even HathiTrust does not match
the comprehensive coverage of these two expansive collections.
Eighteenth Century Collections Online (ECCO) is a Gale project comprising over
180,000 English-language titles from the 18th century. Also originally in
microformat, this online series is invisible to Google.

Current E-book Content


Just as there are publisher portals for scholarly journals, there are also many
portals for e-books. There have been many mergers recently in the industry, but
currently ProQuest’s Ebook Central and EBSCO’s eBook Collection feature
books on all subjects from multiple publishers. Other major collections often
found in academic libraries include Books 24×7, Credo Reference, ebrary,
Knovel, Oxford Scholarly Editions, and Springer E-Book Series. Google Books
might feature titles from these publishers with a limited preview, but sometimes
with no preview at all. Many times libraries are able to catalog individual book
titles from each of these vendors and include them in the local online catalog,
but some libraries do not go to the expense of doing this. You should consult a
reference librarian to find out the local practices at your academic library.

TABLE 7.2.   Distinction among Resource Types.

Resource Type                  Function
Almanacs                       Annual facts, ready reference, or quick look-up information
Atlases                        Collections of maps
Bibliographies                 Indexes to books about books
Chronologies                   Timelines and chronological events
Concordances                   Indexing of every word, especially in canonical religious works like the Bible and Quran, and canonical literary authors like Shakespeare
Dictionaries (Language)        Word pronunciation, origins, and meanings
Dictionaries (Subject)         Definitions of terms within a specific discipline
Directories                    Alphabetical listings including phone books, business directories, and association/organization directories
Encyclopedias (General)        Long entries about general topics
Encyclopedias (Subject)        Subject-specific treatment of terms within specific disciplines
Gazetteers                     Indexes to maps and atlases
Handbooks and Guidebooks       Basic guidebooks on any topic
Index and Abstracting Tools    Provides access to articles, book chapters, books, dissertations, and essays within a subject or discipline
Lexicons                       A specialized language work; more than a dictionary
Statistical Compendia          Collections of data sets and statistical tables
Thesauri                       Two kinds: language thesauri (Roget’s) and controlled vocabularies (Thesaurus of ERIC Descriptors)

THE “FLATTENING” OF INFORMATION SOURCES


When we had libraries filled to the brim with print resources, we could more
easily see the variations in kind and function of materials. Nowhere could this be
more easily seen than in the academic library reference collection. This was a
microcosm of the library as a whole, a representative collection to navigate the
much larger collection. There were almanacs for quick look-ups of ready-reference
factoids; subject dictionaries in the humanities, social sciences, and sciences;
and atlases to navigate not only geography, but any feature that
could be represented graphically on a map. Table 7.2 delineates some of the
many resource types typically found in large academic reference collections.
The point here is that in an online environment, in the world of Google, the
functions and distinctions of these resources have all but disappeared: “out of
sight, out of mind.” If the researcher isn’t reminded of a resource type, it’s likely
that information seeking will not lead there. The information has been
“flattened.”

THE “HOLY GRAIL” OF SEARCH RESULTS IN THE GOOGLE AGE
As mentioned already, the primary goal of advanced academic searching is to
find “all and only” the desired results, no matter how many or how few those
results are. But having Google at our fingertips sometimes distorts our thinking.
I find that many searchers, whether in traditional database searching contexts or
with a search engine, simply throw words at a search box until they get that
magic number of results. I’m not sure what the magic number is, but it seems
like it is more than 5 and fewer than 50—maybe 35. But throwing ill-advised
keywords into a search box just to cut down the results will discard some results
that would have helped you and will rule in results that are perhaps not relevant
at all—all for the arbitrary goal of a result set you can manage.

REFERENCE
WritersServices. 2001. “Matching Word Count to Page Size.” Available at:
http://www.writersservices.com/writersservices-self-publishing/word-count-page. Accessed December
8, 2016.

8
Academic Research Hacks

Google is known for its “hacks,” or clever shortcuts for quick access to
information. To prove the point, try these searches in Google Web: tip
calculator, set timer for 15 minutes, songs by a-ha, books by dan brown, flight
UA815, time in tokyo. The list of these “ready reference” hacks seems
endless. There are even not-so-academic Google tricks, such as searching for
google mirror, do a barrel roll, and google guitar—just a few of the many
pastimes that can distract you from your research. In fact, if you use Google to
search google hacks you can see many lists people have compiled of useful
things Google will do for you. We have already covered the basic research hacks
needed for academic research such as site-specific searching and file type
searching in Google Web. But there are other hacks that may help researchers in
specific situations.

NO COUNTRY REDIRECTION (NCR) SEARCHING


Perhaps this has happened to you, as it did to me. You are studying in a
foreign country; in my case it was Japan. I got a reference question from a
student back in the United States and realized that it would be helpful to use
Google Web to answer it. But when I tried to go to Google.com, Google
automatically rerouted me to google.co.jp, the Japanese-language version of
Google Web. Google is trying to be helpful, but forcing Japanese kanji and local
Japanese ads on me is not helpful, as Figure 8.1 illustrates.
Then, when you perform a search, the results are difficult to access (see Figure
8.2).
There is a fix for this. Appending /ncr to the URL (for example, google.com/ncr)
will fix this for you. NCR stands for “no country redirection.” This is a way to
escape Google’s forcing you into the local Google view. This hack works for
Google Web and Google Scholar (Figure 8.3).

FIGURE 8.1.   Google and Google Scholar as Seen in Japan. Google and the Google logo are registered trademarks
of Google Inc., used with permission.

FIGURE 8.2.   Google Scholar Search Results from Japan. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

We’ve discussed how to navigate Google when you are visiting or residing in
a foreign country. But what if you are in the United States and you want to
search the Google interface of another country, either because you are from that
country and are accustomed to that Google experience or because you just want
to easily see news and events from that country?
FIGURE 8.3.   Accessing English-Language Google Scholar from Japan with NCR Fix. Google and the
Google logo are registered trademarks of Google Inc., used with permission.

FIGURE 8.4.   Accessing Google Web from Hungary with NCR Fix. Google and the Google logo are registered
trademarks of Google Inc., used with permission.

The easiest way to do this is to use Google Chrome as your Internet browser
and then download the NoCountryRedirect (NCR) add-in. To get to this, just use
Google to search for NoCountryRedirect (NCR) Chrome Web store. To get a list
of all Google domains available, search List of Google domains, then click on
the Wikipedia link. This will give you all the information you need to configure
the Google Chrome NCR add-in that you install. The add-in solves the “other
country” problems in both directions: it allows for searching U.S. Google while
in a foreign country, and it allows for searching a foreign Google interface while
in the United States. Google Web, when searched from Hungary, appears as
seen in Figure 8.4.
Here is how it works in another context. Let’s say you want British news, and
thus want the google.co.uk Google experience to be the default search engine.
Follow these steps.
1.  Configure the Google NCR add-in to point to co.uk (see Figure 8.5):

FIGURE 8.5.   Using the Chrome NCR Redirect Add-in. Google and the Google logo are registered trademarks of
Google Inc., used with permission.

2.  Go to Google—you likely will be initially directed to Google.com (in the United States).
3.  Click the Google NCR add-in button, and select “Open local ‘co.uk’ version” (see Figure 8.6).

FIGURE 8.6.   Using the NCR Redirect Add-in to Search a Different Local Version. Google and the Google
logo are registered trademarks of Google Inc., used with permission.
Now when you type “news” into the UK Google, you will get UK news at the
top (see Figure 8.7).
This should work for any of the countries featured on the Wikipedia “List of
Google Domains” page.

GOOGLE BOOKS NGRAM VIEWER


A term from computational linguistics, an “n-gram” is a sequence of n items
(n stands for number; in this case, the number of words or phrases) from a
given corpus of speech or text. The Google Books project has amassed over
25 million scanned books, and the Google Books Ngram Viewer
(https://books.google.com/ngrams) has taken the entire full text contained in
these works and allows searching for word or phrase frequencies across the
entire corpus. Although this is a subset of the estimated 130 million printed
books in existence, it can nevertheless be a research base in itself, given the huge
sample size. Michel et al. (2011) have been able to use Google Books Ngram
Viewer to document the size of the English lexicon over time. They also
demonstrate some of the many historical and social science applications that can
be made of the project.
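The underlying idea can be shown in a few lines of code. This sketch simply extracts the word n-grams from a text, which is the unit the Viewer counts, year by year, across its corpus:

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive words) in a text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("to be or not to be", 2))
# -> [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

The Viewer then plots how often a given n-gram occurs in each year's books relative to all n-grams of that length, which is what makes frequency comparisons across centuries possible.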
One of the more interesting demonstrations of the power suggested by Michel
et al. is to search for the popularity of various American food favorites over time
(Figure 8.8).
FIGURE 8.7.   Viewing UK News from the United States with Chrome NCR Add-in. Google and the
Google logo are registered trademarks of Google Inc., used with permission.

There are countless applications in humanities, social sciences, and the history
of science. Try these interesting searches: Jane Austen, William Blake, Geoffrey
Chaucer, Charles Dickens; telegraph, radio, television, Internet; railroad,
carriage, automobile, aeroplane, airplane; football, baseball, basketball,
hockey; George Washington, John Adams, James Madison, Benjamin Franklin,
Thomas Jefferson, Abraham Lincoln.
FIGURE 8.8.   Google Books Ngram Viewer Showing Popularity of Various Foods. Google and the Google
logo are registered trademarks of Google Inc., used with permission.
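Each comparison like the one in Figure 8.8 is ultimately just a query string on the Viewer's graph page. Here is a hedged sketch of building such a URL; the parameter names content, year_start, and year_end are assumptions taken from URLs the Viewer displays in the browser, not a documented API.

```python
from urllib.parse import urlencode

def ngram_viewer_url(terms, year_start=1800, year_end=2000):
    """Build a Google Books Ngram Viewer URL comparing several terms.
    Parameter names are assumptions based on the Viewer's visible query string."""
    params = {
        "content": ",".join(terms),  # comma-separated terms to compare
        "year_start": year_start,
        "year_end": year_end,
    }
    return "https://books.google.com/ngrams/graph?" + urlencode(params)

print(ngram_viewer_url(["radio", "television", "Internet"]))
```

Pasting the resulting URL into a browser should reproduce the comparison chart, and swapping in any of the search lists above gives the corresponding graph.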

IMAGE SEARCHING
Google has very powerful image-searching capabilities. The problem is that
for academic purposes, we need to get permission before just appropriating other
people’s creations. We’ll get to that soon. But first, let’s just focus on searching.
We all know how to search for images. But how do we use this powerful feature
to assist us in our academic work?

Image Searching by Keywords


You can start from Google Web, search for your keywords, then click the
image link; or you can start from Google Images directly
(https://images.google.com/). You can also search: “google image advanced
search” to bring up a page where you can limit by image size, aspect ratio, color,
type of image (face, photo, clip art, line drawing, or animated), region, and most
importantly, usage rights. If you intend to use the image in a presentation or
publication, you need to be concerned about licensing so that you don’t find
yourself being sued. Whether you start with the advanced search page or just
begin with Google Web, you will be able to filter by usage rights and other
features after you view the results. The variations on usage rights in Google
Images include not filtered by license, labeled for reuse with modification,
labeled for reuse, labeled for noncommercial reuse with modification, and
labeled for noncommercial reuse. You need to select the appropriate usage rights
from the outset. But even doing this does not necessarily keep you out of court.
You are responsible for doing due diligence to ensure that you are not violating
someone’s copyright.
FIGURE 8.9.   Example of Google Images Used to Identify a Place Photo. Google and the Google logo are
registered trademarks of Google Inc., used with permission.

Searching by Image
Searching can also be done by image. The image may be something you see
on the Internet, a computer image file on your computer, or a print photograph
that you digitize with a digital camera.
You can either paste an image URL or upload an image. If you paste an image
URL, make certain that it is the URL of the image itself, not of the Web page on
which the image lives. In Figure 8.9 we see a publisher’s book catalog with an
attractive photo on the front cover. Inside there is no attribution telling us where
this photo was taken. By using screen capture software (I used Jing from
Techsmith) to grab just the photo of the library, saving the image to my
computer, and then uploading the image into Google Image Search, we are able
to have Google make a “best guess” as to where this image is from. This method
works well for places, especially famous places, and doesn’t really work at all
for people.

Image Usage Rights


It is important in research to attribute creative work to the owner. Individuals,
universities, and publishers can be sued for failure to do so. Google simply
gathers URLs for images into its huge searchable index. It should generally be
assumed that images are not free to use; the default assumption should be that
the researcher needs to ask for permission to use them. When searching Google
Images, clicking on the Tools button will reveal additional pull-down menus, one
of which is “usage rights.” Limiting options under here include not filtered by
license, labeled for reuse with modification, labeled for reuse, labeled for
noncommercial reuse with modification, and labeled for noncommercial use. I
find this rather confusing. I find the way usage rights are labeled on the Google
Advanced Image Search form (https://www.google.com/advanced_image_search,
or simply search: google advanced image search) less confusing. There you will
see the same rights stated differently: not filtered by license; free to use or share;
free to use or share, even commercially; free to use, share, or modify; free to use,
share, or modify, even commercially.
As with other Google functions, users should not place ultimate faith in
Google. Image-tagging rights may be incorrect, and users should take it upon
themselves to research any possible fees or permissions.

GOOGLE TRANSLATE
Google Translate (https://translate.google.com/) is a very powerful feature.
Much research has been done to evaluate the accuracy of Google’s translation
ability. One study concluded that although Google Translate is far from error
free, it approaches the minimum standards necessary for university admissions at
many institutions (Groves and Mundt 2015).
The intuitive interface has an auto-detect feature. It usually correctly guesses
the language you pasted into the search box. Currently Google Translate can
translate to or from 104 languages. This can be extremely useful for rendering
foreign phrases encountered in books and articles. Care should be taken when
attempting to translate technical reports, legal documents, or medical diagnoses,
as less-than-accurate results may occur.
Google Translate, rather than serving as an absolute authoritative translation
service, provides a relatively helpful service and deserves its place among the
many helpful tools Google provides the academic researcher.

Fun with Google Translate


Strange turns of phrase can occur when translating from English to
another language and then back-translating to English. To see the
dangers of this, let’s translate the Preamble of the U.S. Constitution into
Chinese and then back-translate into English again. The original:
We the People of the United States, in Order to form a more perfect Union, establish Justice,
insure domestic Tranquility, provide for the common defense, promote the general Welfare,
and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish
this Constitution for the United States of America.

After translation into Chinese and then back-translating into English, we


end up with this:
The American people, in order to form a more perfect union, establish justice, guarantee
domestic tranquility, provide common defense, promote universal welfare, and ensure the
blessings of liberty ourselves and our consequences, do anything and establish the
Constitution of the United States of America.

For another example, here is a paragraph from President Barack


Obama’s remarks on lighting the national Christmas tree, December 1,
2016:
And this is just another example of why the holidays here at the White House are so special.
Last week, I pardoned a turkey. [Laughter] Tonight we’re lighting the National Christmas
Tree. This one is easier because a tree does not move. It does not gobble. [Laughter] You
just push a button, and it’s electrified, which is exactly what you don’t want to have happen
at a turkey pardon. [Laughter] I thought that was funny, Michelle. [Laughter] Thankfully,
both events have gone off without a hitch.

Using Google Translate to render this into Arabic and then back again to
English gives us this:
This is just another example of why the holidays here at the White House is very special.
Last week, he pardoned a turkey. [Laughter] tonight we feast lighting National Christmas
Tree. This one is easier because the tree does not move. Do not swallow it. [Laughter] You
just push a button, and it electrified, and that’s exactly what you do not want to happen in
the amnesty Turkey. [Laughter] I thought that was funny, Michelle. [Laughter] Thankfully,
the two events had gone without a hitch. [Bold added]

Let’s take the first sentence of Lincoln’s Gettysburg Address:


Four score and seven years ago our fathers brought forth on this continent, a new nation,
conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now translate it into French and back-translate to English:


There are four points and seven years, our fathers brought back on this continent, a new
nation, conceived in freedom, and dedicated to the proposition that all men are created
equal.

LEGAL CASES
Google Scholar incorporates many U.S. legal cases. Although no law
school student would be advised to rely on Scholar for getting through law
school, and no attorney would use it for billable hours, it can be useful to the
lay student who just needs to “pull a case” for quick reference. People outside
the legal community do not generally have access (unless they are willing to pay
a lot of money) to prime legal resources like Lexis, Westlaw, or Bloomberg Law.
These databases are given to law school students in hopes of addicting them to
their products. But they do not appear on database lists for general academic
libraries. Instead, academic libraries may subscribe to the greatly pared down
versions: LexisNexis Academic and Westlaw Campus.
Google Scholar’s case law can generally be viewed as a massive experiment.
It contains citations to the cases in question by other cases, without the added,
and necessary, indication of whether the law is “good law” (still in force) or not.
For that, researchers are best served by using the full Lexis and Westlaw services.
Scholar can be configured to feature courts in a given state. Using the “Select
courts …” link in the left margin of Scholar, you can select which courts, state
and federal, you want to see in the result set. Google already knows what state
you are in, so it will feature that state court more highly in the result list.
A nice feature incorporated into the federal and state cases in Google Scholar
is “star pagination.” Because the text shown in Scholar is displayed in HTML
format and not PDF format, page numbers are not obvious. Scholars need to cite
to specific page numbers when citing cases, either in legal publications or other
kinds of social science scholarship. Page numbers in Scholar are denoted with an
asterisk or star so that researchers can tell exactly where a new page begins. This
has long been the practice of licensed online legal databases, but now has been
incorporated into Google Scholar as well.
One hint on legal research deserves mention here. Many times researchers
only want a legal perspective on an issue, that is, a perspective from a law
review or law journal. HeinOnline is fully indexed in Google Scholar and is a
resource that contains the full text of most law reviews. Framing a search that
contains heinonline as one of the keywords will restrict the results to all and only
content contained in HeinOnline—a very powerful limiter.

SEARCHING PATENTS
A patent is a protection on an invention so that the inventor can prevent others
from making, using, or selling the invention. So important are patents that they
are part of the U.S. Constitution. Congress is empowered “To promote the
Progress of Science and useful Arts, by securing for limited Times to Authors
and Inventors the exclusive Right to their respective Writings and Discoveries”
[Constitution, Article I, Section 8]. Patents are interesting from a research
perspective because they are simultaneously legal documents and engineering
documents. A professional patent researcher will have a strong background in
law as well as in engineering.
The problem for the researcher is that sometimes patents are written with
obfuscation in mind. A “hide the ball” attitude is sometimes built into patent
applications so that others will have a more difficult time uncovering them.
Combine this with one of the most detailed and complicated classification
systems imaginable, and you have a real challenge. This section on patents is not
intended as a full legal or technical discussion, but merely as a brief introduction
to a colorful part of the history of technology as it relates to general research.
Patents are one of the three areas of intellectual property, the others being
trademarks and copyrights. Copyrights are handled by the Library of
Congress, whereas patents and trademarks are the purview of the U.S. Patent and
Trademark Office (PTO). Patents from 1976 onward are fully searchable through
the U.S. PTO databases, even searching into the full text of these patents. Before
then, from 1790 to 1975, although the full images of patents had been digitized
by the U.S. PTO, the only intellectual access to them was by patent number or
classification code—not a very user-friendly way to search.
But then along comes Google, taking the images that were already in the
public domain and performing their own OCR on them. Now patents from 1790
to present are all full-text searchable. The OCR is far from perfect, but Google
does seem to correct the OCR text for patents that are often accessed.

Fun with Patents


Find an object around the house or office—the older, the better—and see if
it has a U.S. patent number on it. Then look up that patent number in
Google Patents and look for the drawings.
If you cannot easily locate something with a patent number, then have
fun this way: search Google Patents for these patent numbers (Table 8.1).

TABLE 8.1.   Interesting U.S. Patents.


Patent Number      Description
223898      Light bulb—Thomas Edison
174465      Telephone—Alexander Graham Bell
6469      Buoying vessels over shoals—Abraham Lincoln
821393      Flying machine—Wright Brothers
D11023      Statue of Liberty (design patent)—Frédéric Auguste Bartholdi
139121      Blue jeans—Jacob W. Davis for Levi Strauss and Co.
D183626      Frisbee (design patent)—Walter Frederick Morrison
2415012      Slinky—Richard T. James
D285,687      Original Macintosh computer design—Steven P. Jobs, et al.
D469,109      iPod design elements—Steven P. Jobs, et al.
7,479,949      Touchscreen—Steven P. Jobs, et al.
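Rather than typing each number into the search box, you can jump straight to a record by URL. The sketch below assumes Google Patents resolves addresses of the form patents.google.com/patent/US&lt;number&gt; (commas in the table above must be stripped first); verify any generated link in a browser.

```python
def patent_url(number):
    """Build a Google Patents URL for a U.S. patent number, e.g. "223898" or "D285,687".

    Assumes the patents.google.com/patent/US<number> URL pattern.
    """
    return "https://patents.google.com/patent/US" + number.replace(",", "")

# Edison's light bulb and the original Macintosh design patent from Table 8.1
print(patent_url("223898"))    # https://patents.google.com/patent/US223898
print(patent_url("D285,687"))  # https://patents.google.com/patent/USD285687
```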

TABLE 8.2.   World Patent Office Content in Google Patents (as of December 16, 2016).
Patent Office      Patents Granted      Patent Applications
JP Japan        7,898,919      18,179,932
CN China        8,990,822        6,911,241
US United States      10,343,415        4,737,851
EP European Patent Office        1,500,257        4,206,017
WO World Intellectual Property Organization (WIPO)                      0        3,644,348
DE Germany        4,678,577        2,816,209
GB Great Britain        1,271,621        2,380,956
KR South Korea        1,997,371        2,236,991
FR France        2,065,869        1,034,096
CA Canada        2,125,610           933,663
ES Spain           803,352           617,339
RU Russia           848,083           494,766
NL Netherlands           206,706           434,920
FI Finland           220,993           367,188
DK Denmark           446,203           113,578
LU Luxembourg                  953             62,633
BE Belgium             586,727                        0
Source: https://patents.google.com/

Google Patents can be accessed at https://patents.google.com/. Not only are
U.S. patents from 1790 to the present searchable through the interface, but recent
developments by Google have also folded in many foreign patents. Google provides a
breakdown of patents included by patent office (Table 8.2).
Full-text coverage varies from one patent office to another, but Google’s
coverage is very impressive, although not totally comprehensive. The Google Patents
advanced search page (https://patents.google.com/advanced) allows for
restricting results to any one of the world patent offices, as well as limiting by
assignee, filing date, inventor, language, filing status, patent type, citing patent,
and CPC (Cooperative Patent Classification), which is a jointly devised
classification system between the European Patent Office and the U.S. Patent
Office.

REFERENCES
Groves, Michael, and Klaus Mundt. 2015. “Friend or Foe? Google Translate in Language for Academic
Purposes.” English for Specific Purposes 37: 112–121.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P.
Pickett, … Erez Lieberman Aiden. 2011. “Quantitative Analysis of Culture Using Millions of Digitized
Books.” Science 331, no. 6014: 176–182.

9
Case Studies in Academic Research

This chapter attempts to put into practice all the Google research skills
mentioned in the book so far. This is done with case studies of actual research
topics I have encountered in the course of research consultations. In each case
study I pose a question; then we explore how Google Web, Google
Scholar, and Google Books can be used to help answer the question. But we
don’t stop there. Google cannot do everything. We look to licensed, specialized
library resources typically found in academic libraries to see how we can go
beyond Google for further assistance.

CASE STUDY 1: RESOURCES ON HUMAN TRAFFICKING


Question: I need resources on human trafficking, specifically sex trafficking
and labor trafficking, for my academic research paper. I need peer-reviewed
journal articles as well as some information from official sources.

Google Scholar
We will use Google Scholar to find scholarly, peer-reviewed articles. A search
for “human trafficking” “sex trafficking” yielded about 9,380 results. Limiting
to articles since 2010 cut the results down to 6,940. Because labor trafficking is
often linked to sex trafficking, we can add to the search terms the idea of labor to
further restrict the results: “human trafficking” “sex trafficking” labor. A
relevant article appears to be:
Alvarez, M. B., & Alessi, E. J. (2012). Human Trafficking Is More than Sex Trafficking and
Prostitution Implications for Social Work. Affilia, 27(2), 142–152. [citation derived by clicking “cite”
link within Google Scholar].
[Click the “Cite” button under the article listing.] This article is cited by 43 subsequent sources. We
examine these citations.
[Click the “Related articles” button under the article listing.] Also note that the 100 related articles
reveal a few more gems.

Google Web
Now on to primary sources. We search Google Web: “human trafficking”
“sex trafficking” labor and retrieves about 403,000 results. Looking through the
first few pages of results, we try to see what agencies are concerned about this
issue, looking especially for international organizations and U.S. government
agencies. We make a list of nongovernmental organizations (NGOs),
intergovernmental organizations (IGOs), and U.S. government agencies. Not
seeing enough UN resources toward the top, we amend the search to: “human
trafficking” “sex trafficking” labor united nations. Table 9.1 shows some of the
stakeholder agencies arranged by category.
Now that we have a tentative working list of “who cares” about this topic, we
need to get to work with site-specific searching. Searching site:unodc.org sex
trafficking yields 2,290 results. Noting that many of these results are HTML
pages, we limit the search by PDF file type like this because substantive
materials very often appear in PDF format: site:unodc.org sex trafficking
filetype:pdf. We keep doing this kind of hunt for primary sources from
authoritative agencies, carrying it over to U.S. agencies, like this: site:state.gov
sex trafficking filetype:pdf. This will take some time with so many agencies to
investigate.
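With so many agencies to work through, it can help to generate the search URLs mechanically. The following is a minimal sketch (the domain list and keywords are simply the examples above); it only composes properly encoded Google Web search URLs using the site: and filetype: operators discussed here.

```python
from urllib.parse import quote_plus

def google_search_url(keywords, site=None, filetype=None):
    """Compose a Google Web search URL using site: and filetype: operators."""
    terms = keywords
    if site:
        terms = f"site:{site} {terms}"
    if filetype:
        terms = f"{terms} filetype:{filetype}"
    # quote_plus encodes spaces as "+" and ":" as "%3A" for a valid query string
    return "https://www.google.com/search?q=" + quote_plus(terms)

# One PDF-limited search per stakeholder agency identified above
for domain in ["unodc.org", "state.gov"]:
    print(google_search_url("sex trafficking", site=domain, filetype="pdf"))
```

The same helper works for the plain site-restricted hunts later in this chapter, e.g. `google_search_url("budget", site="go.tz")`.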

Google Books
We saw a few book results in Google Scholar, but they were buried down
around the 25th result. Searching Google Books will focus things a bit more on
book content. Doing the same search “human trafficking” “sex trafficking”
labor in Google Books yields 6,360 results. One particular title jumped out: Sex
Trafficking: Inside the Business of Modern Slavery. This is a limited preview
book, but clicking some of the markers on the right gave us enough indication
that this book will be useful (see Figure 9.1).

TABLE 9.1.   Finding Web Domains from Entities That Care about the Topic.

NGOs      IGOs      U.S. Agencies
polarisproject.org      ungift.org      acf.hhs.gov
traffickingresourcecenter.org      unodc.org      nationalservice.gov
aclu.org      ungift.org      state.gov
hrw.org      unitar.org      nij.gov
     un-act.org      fbi.gov

FIGURE 9.1.   Google Books Highlights Page Hits. Google and the Google logo are registered trademarks of Google
Inc., used with permission.

We then search that book title in the University of Denver library catalog and
see that the library has both a print version and an online version of the book.
Note that the book was discovered using Google Books, something that could
not be adequately done using just the library catalog, but eventually this led us
back to the library catalog for “fulfillment.”

Added Value of Library Resources


There are many U.S. congressional materials on this topic. Although
congressional hearings and House and Senate reports are freely available and
discoverable by Google, ProQuest Congressional provides a level of
bibliographic control and enhanced access that is not possible with open Web
searching.
Library access to scholarly, peer-reviewed journal articles is also important. In
addition to excellent coverage of this topic in the general databases provided by
ProQuest, Gale, and Ebsco that are ubiquitous in academic libraries, the
Homeland Security Digital Library (free from hsdl.org, but with twice the content
when accessed through a federal depository library) and Oxford Analytica (or the
ProQuest version of this, called OxResearch) contain much information about
this topic. Our student now has a lot to do with scholarly secondary sources from
Google Scholar, primary sources in the form of research reports, statistical
studies and official analyses from Google Web, and general book content from
Google Books. Library resources augmented the Google results with the help of
licensed database content.
In these case studies it is impossible to know which strategies were the most
useful for the users. The intent is to demonstrate how the three Google interfaces
can be put to use in scholarly contexts and how library resources can augment
these results. We cannot know how users actually used the advice given.

CASE STUDY 2: FUNDING FOR RELIGIOUS SCHOOLS IN TANZANIA AND SOUTH AFRICA
Question: Is state funding for religious schools allowed in Tanzania and South
Africa?

Google Web
Let’s start with Google Web to try to find some primary sources. First we need
to find the top-level domain (TLD) for Tanzania and South Africa. By typing tld
into Google Web, then clicking on the “List of Internet top-level domains—
Wikipedia” link, we discover that the TLD for Tanzania is .tz and the TLD for
South Africa is .za.
Starting with Tanzania, let’s search like this: site:tz religious schools. We
really don’t intend to go to any of the results we see here. We just want to see if
we can identify Internet domains coming from the Tanzanian government to try
to get an official perspective on the issue. One of the top results is a Web site
from the domain .go.tz, obviously a government site. Now we can amend our
Google search to restrict it to only government sites and perhaps also to get
down to the funding level. We frame our search like this: site:go.tz religious
schools funding. We further want to restrict results to more substantive things, so
we restrict results to PDF format: site:go.tz religious schools funding
filetype:pdf. Hopefully reading through some of these documents will point us in
the right direction.
Next, let’s do the same thing for South Africa, beginning with the search:
site:za religious schools. We had to look a bit further down the page to discover
that the government domain for South Africa is .gov.za. We can thus amend our
search to: site:gov.za religious schools. Again, to restrict to more important
reports, amend the search to: site:gov.za religious schools filetype:pdf.
We should also search the respective government domains for information
about the budget or finance ministries. Searches like site:go.tz budget and
site:gov.za finance ministry will be helpful in this regard.
Even though we have searched for primary sources from the governments of
Tanzania and South Africa on the topic, there may be international bodies that
have discussed the issue. Because many international bodies use .org as their
TLD, let’s do an initial search like this: “religious schools” funding “south
africa” government policy site:org. Ignoring the Wikipedia entries at the top of
the result list and looking further down, we see results from .worldbank.org and
.unesco.org. Although there are likely other organizations we should call
attention to, let’s just go with these two for this case study. We next search:
“religious schools” funding “south africa” government policy
site:worldbank.org. After optionally limiting to filetype:pdf and examining these
results, we can then search for: “religious schools” funding “south africa”
government policy site:unesco.org. Some international organizations use an .int
Internet domain, so it would be advisable to search like this as well: “religious
schools” funding “south africa” government policy site:int. Then, of course, do
the same thing all over again for Tanzania.

Google Scholar
Let’s now shift from primary sources from the two African governments to
secondary sources that are from scholarly sources by searching Google Scholar.
Although the search should ultimately be done for Tanzania and South Africa
separately, let’s first search for articles that might deal with both countries at the
same time by searching Scholar like this: religious schools funding tanzania
“south africa”. We place South Africa in quotes because we know that we can
force the phrase—in this case, the invariant name of a country. We can restrict
the results to more recent dates, say from 2010 to the present, by placing 2010 in
the first box of “Custom Range,” and leaving the second box blank. Not seeing
any obvious titles in the first page of the result set, we can opt to place religious
schools in quotes to force a greater degree of relevancy: “religious schools”
funding tanzania “south africa”.
Let’s shift the focus to just Tanzania and amend the search just to articles that
speak about government policy: “religious schools” funding tanzania
government policy. We can then amend the search to articles from 2012 onward.
In the result list we see an article that seems to be on point: “ ‘Affordable’
Private Schools in South Africa. Affordable for Whom?” We notice that it is
available through our university, so we know we can access the full text of the
article. But while we are in Google Scholar, let’s save the citation by clicking the
Cite button. If the professor requires APA style, then the citation would look like
this:
Languille, S. (2016). ‘Affordable’Private Schools in South Africa. Affordable for Whom?. Oxford
Review of Education, 42(5), 528–542.

We notice two problems with the Google Scholar–generated citation. There is
no space between “affordable” and “private,” and there is an extra period after
the question mark in the article title. No computer-generated citation system is
perfect. Whenever automated citation processes are used, researchers must
proofread them carefully to avoid embarrassment.
This article, recently published, has no “related articles” link at this time. If it
had contained this link, we would be able to link out to about 100 other articles
that share some of the concepts.

Google Books
Let’s now move on to Google Books. By searching “religious schools”
funding “south africa” government policy in Google Books, attention is drawn to
Law and Religion in Africa: The Quest for the Common Good in Pluralistic
Societies. This book seems to include some lengthy discussions in the
context of South Africa. Because we cannot read much of the book in Google
Books, we need to find it in our local academic library or pursue borrowing it
from another library. But first we need a citation to the book so that we can
include it in the bibliography. Depending on the way Google Books features a
book, there may be several ways to get the citation. The first way is to see if
Google Books has a “Find in a library” link somewhere on the page. In this case,
there is such a link. This links us over to Worldcat.org. At the top right of the
Worldcat.org page is a “cite/export” link. From this link we can get the citation
styles we need. A second way to get a citation is to search for the book title in
Google Scholar. In this case, however, that strategy did not work.

Added Value of Library Resources


Let’s look at some library resources beyond what Google is typically able to
discover.
General article databases: Databases such as Expanded Academic ASAP
(Gale), ProQuest Central, and Academic Search (Ebsco) would have content
from newspapers, popular magazines, and scholarly journal content.
Dissertations and theses: ProQuest’s Dissertations and Theses database, which
covers content historically covered in Dissertation Abstracts International,
might help with master’s theses and doctoral dissertations on the topic.
Reference works: The Political Handbook of the World covers all political
parties in each country of the world, including policy perspectives. The
Statesman’s Yearbook and Europa World Year Book (known online as
Europaworld) are ways to navigate the government agencies likely to deal with
these matters.
For this case study our student now has primary sources concerning the
religious school budgets from Tanzania and South Africa. Additional primary
source material was supplied by various international organizations. Using
Google Scholar we augmented the primary source materials with secondary
scholarly journal articles. Google Books offered some additional resources from
book-length materials. Beyond Google, we discovered that some general library
databases and online reference materials could shed light on this topic.

CASE STUDY 3: PARIS TANGO HOUSE FASHION FROM THE 1920S
Question: How can I locate information about tango house fashion in Paris in
the 1920s? I need scholarly articles, books, and images.

Google Web
Let’s start with Google Web and a very direct search: tango
houses paris 1920s. We find several promising results, including some from BBC
News and other sites that look interesting but likely are not of academic
quality. We need to formulate a search that accounts for hidden Internet content
—content that would not have enough direct metadata exposed to Google.
Perhaps we can find a searchable database that specialized in fashion images by
searching: fashion image database.
An initial search for sites located in France did not appear to be directly
fruitful: historic fashion paris site:fr. However, we note that one of the results is
from Palais Galliera, Museum of Fashion, apparently a fashion museum in Paris
with an Internet domain of parismusees.paris.fr. This paris.fr domain may help
us to narrow things down. Many French Web sites only have content in French.
For this reason we can use Google Translate to help us frame a search in French.
When placing paris tango fashion history into Google Translate, we get “Paris
tango histoire de la mode.” Back at Google Web we can now search like this:
site:paris.fr Paris tango histoire de la mode. Among the results is a Web domain
parismuseescollections.paris.fr that has sketches of tango fashions.

Google Scholar
Now on to Google Scholar. Doing the search: paris tango houses fashion
history brings up a typical mixture of scholarly articles and book content
bleeding through from Google Books. An article that seems relevant is titled
“Globalization and the Tango.” In addition to getting the citation by clicking the
“cite” button in Scholar, we can click the “cited by” button to find other scholars
that cite this article in subsequent publications. In addition the “related articles”
button will point to 100 other items that may assist in this question.

Google Books
Google Books has some results when searching: paris tango houses fashion
history. A couple of reference works come up in the search results: Historical
Dictionary of the Fashion Industry and The Greenwood Encyclopedia of
Clothing through World History: 1801 to the Present. We should know by now
that by checking our local library catalog, we can possibly get the full text, either
in print or in electronic format.
But now let’s do something different with Google Books. With our results
from the previous search on the screen, click the “search tools” link in the
Google Books navigation menu. A new menu now appears with the options:
“Any books; Any document; Any time; and Sorted by.” If you select the “Any
document” pull-down menu, you will see selections possible for “magazines.”
When we try our search under the “magazines” limit we see some results. Now
we can adjust the time frame using the “Any time” pull-down feature.
Unfortunately this did not produce any results. So we need to rethink our
strategy. Instead of restricting to magazines with a limit to the 1920s, let’s focus
on books with a limit of dates between 1920 and 1930. This time we get some
results from the time period we are interested in examining. Most of these will
be after the 1923 international copyright cutoff date, so their content, including
images, could still be under copyright restrictions. However, we can see if our
academic library has access to these in print or online somehow.

Added Value of Library Resources


In addition to scholarly journal content that we can cross-check through the
many library-licensed databases, we can find newspaper content from some of
the many historical newspapers from the 1920s that have been fully digitized.
Sources like the [London] Times Digital Archive (Gale), ProQuest Historical
Newspapers, and America’s Historical Newspapers (Newsbank) can be used as
primary sources to show the mood of the time. These licensed databases have
stunning views of entire newspapers, ads and all. In addition, the library may
have access to the full digital content of many contemporary magazines. Then
there is the stunning Berg Fashion Library database from
Bloomsbury Publishing, which has images and text covering all eras of fashion.
Our approach to this question was a bit different from the previous cases,
because we are dealing with a historical question. We looked to Google Web to
discover historic images and discovered an online treasure trove from Paris.
Google Scholar and Google Books helped out a bit with this question, but the
subscription newspaper content from licensed databases really helped in
illustrating the mood of the times, going far beyond what any of the Google
interfaces can offer.

CASE STUDY 4: DATA ON GEOTHERMAL HEAT PUMPS


Question: I am looking for data on geothermal heat pumps. Specifically, I
need data on how many installations there are, consumption, revenue,
employment within the industry, efficiency of heat pumps, and any other
relevant data.

Google Web
You didn’t say if the data you are interested in is for the United States, for
another country, or internationally. So I will address all of these areas with some
strategies. Let’s begin by using Google Web to find some primary sources, both
research reports and statistical sources. To find out what U.S. agencies are
interested in this, do a general search like this: site:gov geothermal heat pumps. I
note that some prominent U.S. agencies are energy.gov, nrel.gov, energystar.gov,
epa.gov, and pnnl.gov. There are others, but we’ll start with these. Now take
each of these domains in turn and drill down with more focused Google Web
searching: site:energy.gov geothermal heat pumps statistics; site:nrel.gov
geothermal heat pumps data; site:nrel.gov geothermal heat pumps filetype:pdf;
and site:gov geothermal heat pumps. Notice that I searched by closely related
terms (data vs. statistics) and that I sometimes limited to the PDF file type to
discover more substantive materials.
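The “closely related terms” trick generalizes: cross each candidate domain with each synonym to produce a checklist of searches to run. A minimal sketch, using only the domains and terms from this case study:

```python
from itertools import product

domains = ["energy.gov", "nrel.gov", "energystar.gov", "epa.gov", "pnnl.gov"]
synonyms = ["statistics", "data"]  # closely related terms, searched separately

# One search string per (domain, synonym) pair
searches = [f"site:{d} geothermal heat pumps {s}" for d, s in product(domains, synonyms)]
for s in searches:
    print(s)
```

Five domains times two synonyms yields ten searches; appending filetype:pdf to any of them narrows to the more substantive reports.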
If you wanted to do the same kind of digging for international organizations,
first find the stakeholder entities by searching like this: site:org geothermal heat
pumps statistics international. Relevant domains from this search appear to be
iea.org; worldenergy.org; iea-gia.org; and unep.org. Now do the same kind of
searching as we did for the .gov sites. These results should uncover major
research reports, as well as statistical and data sources from primary
stakeholders.

Google Scholar
Now let’s try to find scholarly secondary sources using Google Scholar. Not
only do we want to find articles on the topic, we also want to discover data
sources that were used in the research to see if these might be useful to us as
well. Searching Scholar like this: geothermal heat pumps statistics “united
states” produces what appears to be a highly relevant article, “Analysis of
renewable energy development to power generation in the United States.”
Clicking the “cited by” link takes us to subsequent articles that have interacted
with this content. Then clicking “related articles” gives us 100 additional
citations that may share relevant keywords. Don’t forget to use the “cite” feature
to save your citations in the desired citation format.

Google Books
Moving to Google Books, we can initially search the same way: geothermal
heat pumps statistics “united states”. One of the prominent results is the Census
Bureau’s Statistical Abstract of the United States. Rather than struggling to use
the Google Books version of the Statistical Abstract, we can go to Google Web
and find this publication on the Census Bureau Web site. It should be noted,
however, that the Census Bureau ceased publication of this title in 2012. We will
need library resources (see below) to get more recent data.

Added Value of Library Resources


As noted earlier, the Statistical Abstract of the United States ceased
publication in 2012. However, ProQuest took it upon itself to continue this
publication to the best of their ability. Many libraries subscribe to this product.
Searching for geothermal within the ProQuest Statistical Abstract does reveal
some helpful results.
Another statistical database, Statista, has many results on geothermal heat
pumps, including tables on capacity of U.S. geothermal heat pump shipments,
consumption from heat pump use, employment in the industry, and worldwide
market share.
Market research reports can be very expensive, given the amount of money
spent to gather the research. Libraries can subscribe to some of these for
academic purposes. One of these tools, MarketResearch.com Academic, has a report
that contains U.S. Shipments of Geothermal Heat Pumps by Model Type, 1997
to 2023.
General article databases, like Academic Search, Business Search, ProQuest
Central, ABI Inform, and the Gale Business Collection, each contain valuable
information on this topic that is not available via Google Scholar. Resources
such as trade publications and newspaper articles, although not scholarly,
provide valuable information on trends in the industry.
Engineering databases like IEEE Xplore, ENGnetBASE, ACM Digital Library,
and the Engineering Village are among the many databases in the field of
engineering that provide specialized portals to the topic. These engineering
databases often have conference proceedings, book chapters, and dissertations
that are not included in Google Scholar or Google Books.
This case study presented the challenge of searching for statistical material.
Google Web searching uncovered statistics from several stakeholder entities,
both American and international. Google Scholar pointed us to relevant articles
beyond those initially discovered. Google Books pointed us to a standard
reference tool that, even though no longer published, is augmented by a library
subscription database. Engineering databases, through library licenses, provided
access to numerous conference proceedings, book chapters, and dissertations
beyond what is exposed through the three Googles.
This chapter illustrated both the power of the three Googles (Google Web for
primary sources, Google Scholar for secondary scholarly sources, and Google
Books for book content), as well as the necessary contribution of the academic
library. The success of Google Scholar and Google Books would not be possible
without the academic library in the background supplying the linkages to
scholarly articles and the print and electronic books necessary to carry on
research. The numerous textual, audiovisual, and creative resources that
libraries provide beyond Google should prove the value of academic libraries
and the investments they make in them. Libraries
continually struggle with how to advertise these many resources. But the bottom
line is that reference librarians should always be consulted for any serious
research topic, because they have intimate knowledge of their institution’s
licensed resources.

10
Searching for Statistics

Many librarians are deathly afraid of reference questions involving statistics. Yet
statistics are one of the most common categories of questions that users ask at
reference desks. This is very likely because of the numerical nature of the
results, the fact that the answers are very often in hidden Internet databases far
beyond the direct reach of Google, and the indirect search strategies that may
need to be employed to uncover them. This chapter will show how Google can
assist in alleviating this fear and actually work toward unearthing the desired
statistics. Because statistics involve numerical data, and because search engines
are strong at searching and delivering textual data, we have a problem: How can
we find meaningful statistics? How can we frame a search that works?

STEP ONE: DETERMINE WHO CARES ABOUT THE STATISTICS YOU ARE SEEKING
Before we start, however, a bit of background on how to approach statistical
questions. The first question to ask is, “Who cares about these statistics?” In
other words, what kind of entity would issue these and make them available?
Would it be a local government, a state government, a national government, or
an international agency? Would it be the purview of an association, a business, a
commercial publisher, a research institute, or a university? The kind of entity
will be a huge hint as to the Internet top-level domain (TLD) that they use (see
Table 10.1).
What is challenging about statistics is that we are searching for numerical data
in a textual universe. For this reason we must employ the indirect searching
techniques delineated in Chapter 2. To work through this topic, let’s suppose that
we are seeking statistics on fisheries, any kind of fisheries, anywhere in the
world.

TABLE 10.1.   Entities Likely to Issue Statistics about Fisheries.

Statistical Issuing Body                       Possible TLDs
International organization (like a UN body)    .org, .int
Foreign governments                            Consult a TLD list for countries (search tld in Google Web; the Wikipedia page is recommended)
U.S. government                                .gov, .mil
State governments                              .us (see Chapter 4 for a table of exceptions)
Local governments                              .us (see the discussion in Chapter 4 for exceptions)
Associations                                   .org; or foreign equivalents such as or.jp
Businesses                                     .com for the United States; varies for foreign countries (e.g., co.jp for Japan, com.mx for Mexico)

STEP TWO: SEARCH WITHIN THE INTERNET DOMAIN OF THE ENTITIES LIKELY TO ISSUE STATISTICS
I want to suggest two ways of determining organizations likely to issue
statistics. The first way is pure Google. We might start with international
organizations and search these two ways: fisheries statistics site:int; and then
fisheries statistics site:org. From this we can see that the Northwest Atlantic
Fisheries Organization (nafo.int) has something, as well as the Food and
Agriculture Organization of the United Nations (fao.org). We can see if selected
foreign governments have statistics, say Norway and Japan. Looking up their
respective TLDs (.no and .jp), we can frame searches accordingly: fisheries
statistics site:no and fisheries statistics site:jp, respectively. In the case of
Norway, no clear government subdomain is evident; instead, the dedicated site
fisheries.no stands out. We should know by now how to drill down more deeply into that
site for statistical information. In the case of Japan, we clearly see that the
government subdomain is go.jp, enabling us to now drill into those Web sites.
Thinking that the U.S. government should have some statistics, we can further
search Google Web like this: fisheries statistics site:gov. We notice that
stakeholder agencies include the National Oceanic and Atmospheric
Administration (noaa.gov), the U.S. Geological Survey (usgs.gov), the U.S. Fish
and Wildlife Service (fws.gov), the U.S. Department of Agriculture (usda.gov),
and Fishwatch.gov, among the many government entities that have statistics.
Armed with these subdomains you can continue the drilling-down process to
acquire the statistics you need.
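This step is essentially a cross-product of one topic with a list of candidate domains, so it can be sketched in a few lines of code. The Python snippet below is only an illustration: the entity-to-TLD mapping is a small hypothetical sample drawn from Table 10.1, not an exhaustive list. It prints query strings ready to paste into Google Web.

```python
# Sketch: generate site-restricted Google Web queries for a topic.
# ENTITY_DOMAINS is a sample mapping based on Table 10.1 (hypothetical,
# not exhaustive); extend it with any TLDs relevant to your question.
ENTITY_DOMAINS = {
    "international organizations": ["int", "org"],
    "U.S. government": ["gov", "mil"],
    "Norway": ["no"],
    "Japan": ["jp"],
}

def build_queries(topic):
    """Return one '<topic> statistics site:<tld>' query per candidate domain."""
    return [
        f"{topic} statistics site:{tld}"
        for tlds in ENTITY_DOMAINS.values()
        for tld in tlds
    ]

for query in build_queries("fisheries"):
    print(query)  # e.g., "fisheries statistics site:int"
```

Nothing here talks to Google itself; the point is simply that keeping the entity list in one place makes it easy to rerun the whole battery of searches for a new topic.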
Next, let’s take on state statistics. If we take Alaska, we need to account for
statistics on both the alaska.gov domain and state.ak.us. We frame searches like
this: fisheries statistics site:alaska.gov and fisheries statistics site:state.ak.us. At
this point we may want to find official reports containing statistics by amending
our search like this: fisheries statistics site:state.ak.us filetype:pdf. You have
enough knowledge now to be able to tackle any state or local government.
Association data can be tricky. Many times association data is only available
for sale or to members. But it never hurts to try. Try starting by finding the
names of U.S. associations: site:org united states fisheries associations.
Prominent among the results is the American Fisheries Society, so refine your
search like this: site:fisheries.org fisheries statistics.
A second way of finding how to search for specific entities that issue statistics
is to get a little help from a commercial database product. ProQuest’s Statistical
Insight is a database commonly found in academic libraries. Searching the term
fisheries in that database reveals entities in all the categories we have already
discussed: international organizations, foreign governments, U.S. government
agencies, state agencies, and commercial interests. Even though Statistical
Insight has some statistical tables embedded within it, these are often not the
most recent data; more recent issuances can often be found by locating the
agency's domain with Google and then doing the same site-specific searching
illustrated earlier.

KEYWORD SEARCHING AND STATISTICS


In Chapter 2 we discussed two general ways to think about framing a search: a
top-down search, where searches are based on general topics, likely titles of
documents, and the general “aboutness” of what you are trying to find, and the
bottom-up approach, where searches are based on expected content within the
desired document. For obvious reasons, the second of these strategies is not
available to us when searching for statistics. It would be ridiculous for us to
search Google like this: site:state.co.us 1,345,567 34,678. You likely don’t know
the numerical values in advance. So that means that when searching Google for
statistics, we need to search with keywords that describe the statistical data set
we are trying to acquire. We can search for the keywords we would expect to
find in the titles of statistical tables.
There are many possible synonyms related to data and statistics that we should
try when searching. Table 10.2 suggests various keywords depending on the
statistical realm you are searching.
Be sure to combine these keywords with the site-specific domain you are
searching. You may end up getting a direct Google hit on a PDF that contains the
exact data you are looking for, or you may discover a publicly available database
that can be searched.

TABLE 10.2.   Brainstorming Keywords for Statistical Questions.

Reference Frame       Keywords
General               Statistics, statistical, numbers, database, data, data set, tables, compendium/compendia, repositories, archive, percent/percentage, visualization, sampling, time series analysis, fast facts, fun facts, comparative, compare
Financial             Spending, budget, balance sheets, income statement, ratios, filing, accounts
Demography/Census     Census, demography, population, counts, estimates, projections, survey
Public Opinion        Sampling, survey, public opinion, polling
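One way to work through these synonyms methodically is to generate every keyword-plus-domain combination up front and check them off as you search. Here is a minimal sketch; the synonym list is just a subset of Table 10.2, and the helper name is my own, not from the book.

```python
# Sketch: cross a subset of the Table 10.2 synonyms with a site
# restriction, optionally limiting results to PDF reports.
GENERAL_TERMS = ["statistics", "data", "tables", "database", "fast facts"]

def keyword_variants(topic, domain, pdf_only=False):
    """Build one query per synonym for a single site-restricted search."""
    suffix = " filetype:pdf" if pdf_only else ""
    return [f"{topic} {term} site:{domain}{suffix}" for term in GENERAL_TERMS]

for query in keyword_variants("fisheries", "state.ak.us", pdf_only=True):
    print(query)
```

The first line printed is the same search framed earlier for Alaska, fisheries statistics site:state.ak.us filetype:pdf; the rest are the synonym variations worth trying when that one comes up empty.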

STATISTICS CASE STUDY 1: IMMIGRANT STATISTICS FOR THE UNITED STATES
Question: I am looking for statistics on how many people in the United States
are immigrants. I’m looking for any and all kinds of breakdowns: legal vs.
illegal immigrants, countries of origin, etc.
Let’s walk this through the logical steps presented in this chapter. First, we
need to determine who (what government entities in this case) care about this
issue and who is likely to keep and publish statistics. We initiate a Google Web
search: site:gov immigration statistics.
We note the following Internet domains that seem, at first glance, to have
relevant information. We just want to make a list at this point, not get
sidetracked with answers.
dhs.gov—U.S. Department of Homeland Security
uscis.gov—U.S. Citizenship and Immigration Services (under Homeland Security)
ice.gov—U.S. Immigration and Customs Enforcement (under Homeland Security)
state.gov—U.S. State Department
justice.gov—U.S. Justice Department
bjs.gov—U.S. Bureau of Justice Statistics
gao.gov—U.S. Government Accountability Office

Now we do a different keyword search: site:gov immigrants census.

census.gov—U.S. Census Bureau
You can see we really have our work cut out for us.
Now, we take each of these domains in turn and see what kinds of data each
may contain (see Table 10.3).

TABLE 10.3.   Summary of Domain Searches.

site:dhs.gov        PDFs and raw data files on refugees and asylees, naturalizations, and nonimmigrant admissions; Yearbook of Immigration Statistics
site:uscis.gov      Provides forms for filing; points to dhs.gov for data; has data for each form filing by immigration office in the United States
site:ice.gov        Immigration removal reports
site:state.gov      Immigrant visa statistics
site:justice.gov    Asylum statistics; Statistics Yearbook, including Immigration Court statistics
site:bjs.gov        Data on noncitizens in state or federal prisons or jails and on persons held by Immigration and Customs Enforcement (ICE)
site:gao.gov        PDF reports on “criminal alien” statistics and other aspects of immigration over time
site:census.gov     Publications on “foreign born” and the American FactFinder database with statistics from the Decennial Census and the American Community Survey

By systematically uncovering the stakeholder agencies and then carefully
reviewing what each agency has to offer in terms of data and statistics, we stand
a better chance of providing a more complete answer.
Now that we have some preliminary statistics, we can take the time to delve
more deeply into the Census Bureau’s American FactFinder tool. You can do
that yourself by visiting http://factfinder.census.gov/.
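Once the stakeholder domains are listed, each site-restricted search can also be turned into a shareable, clickable URL by percent-encoding the query against Google's public search endpoint. A small sketch using only the Python standard library (the domain list is the one compiled above):

```python
from urllib.parse import urlencode

# Sketch: turn each site-restricted query into a Google Web search URL.
# urlencode handles the escaping of spaces and the ":" in "site:".
DOMAINS = ["dhs.gov", "uscis.gov", "ice.gov", "state.gov",
           "justice.gov", "bjs.gov", "gao.gov", "census.gov"]

def google_url(query):
    """Return a Google Web search URL for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})

for domain in DOMAINS:
    print(google_url(f"immigration statistics site:{domain}"))
```

A list of URLs like this is a convenient way to hand a patron the complete set of searches, rather than asking them to retype each operator-laden query.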

STATISTICS CASE STUDY 2: SUBNATIONAL DATA FOR UGANDA
Question: A researcher is looking for subnational (regional, within the country)
data for Uganda, specifically gross domestic product (GDP).
We first look up the TLD for Uganda and discover that it is .ug. We then search
site:ug government and discover that go.ug is the governmental subdomain for
the country. Government agencies such as the Ministry of Foreign Affairs
(mofa.go.ug), the Ugandan Statehouse (statehouse.go.ug), the Ministry of
Finance (finance.go.ug), and the Uganda Revenue Authority (ura.go.ug) each
have materials relating to the
budget. Various initial searches can be done, such as site:go.ug statistics,
site:go.ug database gdp, site:go.ug district gdp, and site:go.ug subnational gdp.
As it turns out, just doing the simple search uganda statistics showed that
Uganda uses a simple .org domain to host its official statistical site, not a .ug
domain. The Web site http://www.ubos.org/ contained the country’s statistical
abstract, which has many subnational breakdowns, but not for GDP. Although
we gave this a good try, sometimes we simply have to ask someone. The Web site
lists an official e-mail address, so we can write and ask whether this information
is available.
This question illustrates the fact that many times Google comes up short on
the ability to discover all results. When Google can’t uncover the desired
statistics, and when reference librarians run out of answers, sometimes an old-
fashioned communication with a government official turns out to be the best
answer.
Statistical research is among the most challenging for researchers and
reference librarians alike. But if we first figure out “who cares” about
issuing the desired statistics, then systematically search Google Web for the
relevant primary sources, and finally search Google Scholar and Google Books
for secondary sources that may cite those primary sources, we will be well on
our way to solving these difficult academic challenges.

Conclusion
Some readers may have the impression that this book, written by a reference
librarian, is implying that Google has everything and that we really don’t need
academic libraries any longer. Nothing could be further from the truth.
I can’t remember the last time I ever heard parents say that they based
selection of where their children attended college on the strength of library
holdings or databases subscribed to. Although this should be among the top
criteria, it rarely is. It’s true that libraries are becoming more “vanilla” in terms
of the databases they subscribe to. But specialized resources increasingly
distinguish the top schools from the mediocre ones.
Although Google Web can get to some current primary source materials, there
is so much that it cannot expose. As for Google Scholar, even though it exposes
a wide swath of scholarly journal articles to discover, library database products
expose the same content through different interfaces. I generally tell researchers
that it’s beneficial to examine the same material through two or more search
engines, because each one searches, retrieves, and ranks that material in a
different manner. Also, Google Scholar, although deep in terms of
searching, is very narrow in breadth: it doesn’t cover trade journals, popular
magazines, newspapers, dissertations and theses, technical reports, and market
research reports, among other things. Each of these has its own place in the
research process. Google Books, as powerful as it is, has gaps. We may not even
know what all the gaps are, but there definitely are gaps.
I also have a running “wish list” of what I would like Google to change or
implement to make research easier. Here are some of the items on my list:

Pass-through between the three Googles. Being able to pass a search from Google Web to Google
Books or Google Scholar, and from Google Scholar to Google Books, etc.
Citation links under Google Books metadata from the initial browse screen, following the pattern set
forth in Google Scholar.
Better grouping of serial items in Google Books, following the way HathiTrust groups serial holdings
under a single serial title.
Ability to restrict searching in Google Scholar to abstracts only, as you can do when a date sort is done.
This function, available in a limited fashion, needs to be expanded to the entire interface.

Google is not the only place to go for academic information. It may not even
be the first place to search, but it cannot be ignored.

Index
Aggregator content, 3, 6, 44, 48, 53, 59, 91–93
aggregator databases, 93
ANSI/NISO Z39.88 standard, 53
Archival materials, 81, 85, 89–90

Big Ten Academic Alliance, 67–68


Boolean operators, 4–5, 8–9, 19, 72
Bottom-up searching, 16–18, 125
Broadcast search model, 83–85

Card catalogs, 1–2, 7–8, 71


Commercial content, 39–40, 123, 125
Controlled vocabularies, 2, 4–9, 64, 92, 94
in law, 5–6
in medicine, 5–6
thesauri, 4

Digital rights management (DRM), 3, 93


Doctoral dissertations and master’s theses, 50, 59, 82, 85, 90, 92, 94, 116, 120, 129

E-books, 3, 48, 70, 72, 74, 75, 79, 82, 93–94


current e-books, 93–94
historic e-books, 93
E-journals, 3, 44, 48–54, 58–59, 64, 81–82, 85, 87, 92–93, 113, 116, 118, 129
EBSCO databases, 6, 8, 48, 54, 62, 90, 113
Academic Search, 8, 93, 116
Business Source, 93
eBook Collection, 93–94
EBSCO Discovery Service (EDS), 85
Nation Archive, 90
End of Term Archive, 28
Evaluation, authorship, 26–27
currency, 25–26
Google Scholar content and, 50–51
URL diagnostics, 24–25
Web content, 23–24

Federated search tools, 83–86


File type searching, 4, 8–9, 21–22, 30, 34–35, 88, 89, 112, 119
other file types (besides PDF), 40
Foreign government content, 31–34, 124–125
Full text, 3–6, 8–9, 23, 29, 43–47, 57, 58
full text searching, 8–9, 16–17, 43–47, 64
Google Books and, 72–79
Google Scholar and, 43–47, 51, 58
Google Patents and, 57
Google Web and, 23, 29

Gale databases, 6, 8, 48, 90, 93, 113, 118, 120


Academic OneFile, 93
Expanded Academic ASAP, 116
Google Books, 67–80
case studies, 112–113, 116, 117–118, 119–120
citations, 80
completeness, 74
content sources, 68
depth of searching, 87
“find in a library” feature, 72–73, 80
fingers and hands, 79–80
“fulfillment” (get the full book), 72, 88–89
full view, 70, 72
government documents, 75–76, 78
HathiTrust and, 77–78
Internet Archive and, 79
Library Project, 67–69
limited preview, 70–71
magazine content, 76–77
Ngram viewer, 101–103
no preview, 72
OCLC and, 73
partner libraries, 67–68
Partner Program, 69
scan quality, 67, 79
searching, 72
serial content, 68–69
snippet view, 71
views, 69–72
Google Patents, 57, 108–110
Google Scholar, 43–66
alerts, 58
article versions, 59
bibliographic citations, 60–62
bibliographic management software, 60–62
browser settings, 51
case studies, 111–112, 115–116, 117, 119
“cite” feature, 60–62
“cited by” feature, 58–59
clustering of results, 59
date limiting, 55–56
depth of searching, 45–47, 87
evaluating content, 50–51
EZproxy and, 52–53, 65
free content, 54
frequency of updating, 48–49, 81
Google Books “bleed through”, 49–50, 80
HeinOnline and, 64–65
integration of library holdings, 44–45
integration of publisher metadata, 43–44
journal linking, 51–54
JSTOR and, 64–65
legal cases, 54–55
legal cases, searching for, 107–108
“library search” feature, 64
metadata searching, 56
metrics, 66
patents, 57
“related articles” feature, 59
“save” feature, 62–63
scope of coverage, 48–50
secondary scholarly sources, 88–89
sorting results, 56
Web of Science and, 59
Google Translate, 105–107, 117
Google Web, 29–42
1,000 result limit, 20–21
advanced search, 20
cached content, 27
case studies, 112, 114–115, 117, 119, 126–127
depth of searching, 87
disappeared content, 15
file type searching, 21–22
firewalls, 15
in China, 13
indexing, 12
keyword formulation, 23
limitations on crawling, 13–15
numerical data, 15, 17
password restrictions, 15
phrase searching, 19–20, 22
plus sign in searching, 22
politeness (in Google crawling), 13
primary sources, 88–89
ranking of results, 29
relevance ranking, 9, 12
searching, 9
searching from another country, 97–101
site-specific searching, 20–22, 30–41
use in academic assignments, 81–82

HeinOnline database, 52, 64–65, 90, 107, 108


Hidden Internet, 12–13, 17–18, 88, 117, 123

Images, image searching, 103–105


image use rights, 104–105
Indexing, 1, 29
“back-generated”, 6, 8, 93
Indirect searching, 17–18, 88, 123
Information access anomaly, 85–88
Information sources, “flattening” of, 95
various types, 94–95
International organizations, 31–32, 88, 112, 114, 115, 116, 119, 123, 124, 125

JSTOR database, 48, 52, 64, 92


Journal publishers, 43–44, 92

Keyword searching, 2, 3, 9, 16, 19, 21, 23, 43, 58, 59, 64, 65–66, 70, 77, 85, 86, 95, 103, 108, 119, 125, 126

Legal cases, 6, 107–108


Libraries, 1, 2, 6, 11, 14, 15, 17, 43–45, 47, 48, 51, 52, 54, 56, 61, 64, 67, 68, 73, 75, 76, 81, 82, 95, 107,
111, 113, 120–121, 125, 129
strengths of, 89–94
weaknesses of, 82–88
Library catalogs, 2, 6, 7, 11, 17, 23, 25, 67, 71, 74, 85, 87, 113, 118
and Google Books, 74–75
lack of power, 82–83
Library discovery tools (“web-scale” discovery), 83–86, 92
Library of Congress Subject Headings, 1, 2, 6–8
Library resources, added value, 113–114, 116–117, 118, 120–121
Library vendor databases, 3–9, 12, 44, 48, 51, 52, 56, 62, 73, 83, 85, 89, 90, 92, 93, 94
relevance ranking, 12
Library visitors, 47–48
Link resolvers, 53–54, 59
Local government content, 26, 36, 39, 88, 123–125

Machine readable cataloging (MARC) records, 73


Magazine content, 1–2, 76–77, 82, 85, 90, 116, 118, 129
Market research reports, 48, 91, 120, 129
Metadata, 4–5, 12, 23, 26, 43–45, 53–60, 62, 71, 77, 80, 82, 84–89, 117, 130
Microform formats, 44, 79, 90–91, 93

Newsbank databases, 90, 118


Newspaper content, 15, 59, 82, 85, 90, 91, 116, 118, 120, 129
Ngram viewer, 101–103
Chrome add-in, 102
No country redirection (NCR) searching, 97–101
Nonprofit content, 26, 31, 40–41

OpenURLs, 53–54, 64, 93


Optical character recognition (OCR), 46–47, 93, 108, 110

Patents, 57, 108–110


Poole, William Frederick, 1
Primary source materials, 15, 29–41, 49, 81, 88, 89–90, 112–120, 128–129
ProQuest databases, 6, 8, 48, 61, 62, 85, 90, 93, 113, 120
Congressional, 113
Dissertations and Theses, 116
Ebook Central, 93
Historical Newspapers, 118
Legislative Histories, 15
ProQuest Central, 93, 116
Statistical Abstract of the United States, 120
Statistical Insight, 125
Proximity operators, 4–5, 9, 92
Publisher portals, 43, 92–93, 120

Robots, robot exclusion protocols, 13–14

Search engines, 5, 11–13, 17, 19–20, 29, 43, 67, 95, 100, 123, 129
crawling, 11–12
indexing, 11–12
search engine bias, 29–30
search engine optimization, 17, 30
Searching, 8–9
“all and only” (retrieval), 95
“left-anchored” searching, 2
bibliographic references, 58–59
bottom-up search strategy, 16–17
commercial content, 39–40
depth of searching, 45–47
file type searching, 4, 8–9
foreign government content, 32–34
history of, 1–3
images, searching by, 104
images, searching for, 103–105
indirect, 17–18
international organization content, 31–32
local government content, 39
natural language, 20
nonprofit content, 40–41
phrase searching, 19–20, 72
search strategies, 9
searching from another country, 97–101
state government content, 36–39
statistical data, 17, 123–128
top-down search strategy, 16
U.S. government content, 34–36
Site-specific searching, 20–22, 30–41, 89, 97, 112, 124–125
State government content, 36–39, 88, 123–124
Statistics, 17–18, 31, 35, 119–120
keyword searching, 125–126
searching for, 17, 123–128
site-specific searching, 124–128
stakeholders, 123–124, 127
Streaming video and audio, 91
Subject descriptors, 4, 7–9, 86
postcoordination, 7
Subject headings, 1–2, 5–6, 7–8, 16, 23, 86–87
precoordination, 7
rule of three, 7
semantic notions, 7
Subscription databases, 92–94

Three Googles, 88–90, 120


suggested improvements, 129–130
Top-down searching, 16, 18, 125
Top-level domains (TLD), 20–21, 25, 30, 32, 35–36, 88, 114, 123

U.S. government content, 21, 34–36, 75–78, 88, 112, 124–126


URL structure, 25, 39

Wayback Machine, 24, 27–28


Wilson, H. W., 1, 8

About the Author


CHRISTOPHER C. BROWN is Reference Technology Information Librarian
at the University of Denver Main Library. He has been providing reference
services in academic settings since the mid-1980s. Brown is also an active
government information librarian, having served for three years on the
Depository Library Council, advising the Director of the U.S. Government
Publishing Office. His research has taken him to Japan many times, where he
compiled a bibliography of the works of the United Nations Centre for Regional
Development. As an active reference librarian, he enjoys daily consultations
with undergraduate and graduate students in areas of public policy, international
studies, statistics, and government information. Brown has been teaching as an
affiliate faculty member in the University of Denver Library and Information
Science program since 1999.
