THE POWER OF
GOOGLE
What Every Researcher Should Know
Christopher C. Brown
Contents
Illustrations
Introduction
Chapter 1: Searching Generally
History of Searching
Tension between Controlled Vocabulary and Full Textuality
Subject Headings vs. Subject Descriptors
Full-Text Searching: A Different Way of Thinking
Reference
Chapter 2: How Google Works
Crawling and Indexing
Ranking of Results
Beyond Google’s Reach
General Google Search Tips
Indirect Searching—Hidden Internet Content
Reference
Chapter 3: Searching Google Web
Basic Search Techniques
Power Search Techniques
Keyword Formulation
Evaluating Web Content
Cached Content
Reference
Chapter 4: Power Searching for Primary Sources Using Google Web
Site-Specific Google Searching
Site Searching on the International Level
Site Searching on the Foreign Government Level
Site Searching on the U.S. National Level
Site Searching on the State Government Level
Site Searching on the Local Level (Counties and Communities)
Site Searching for Commercial Content
Site Searching for Nonprofit Content
Reference
Chapter 5: Google Scholar and Scholarly Content
Depth of Searching in Google Scholar
What Content Does Google Scholar Retrieve?
Books—From Google Books
Evaluating Google Scholar Content
Right Margin Links
Left Sidebar
Citation-Specific Links
Effective Searching in Scholar
Google Scholar Metrics
References
Chapter 6: Google Books
Library Project
Google Books Partner Program
Google Book Views
Searching and Navigating Google Books
The “Fulfillment” Part
How Complete Is Google Books?
Government Documents in Google Books
Magazines in Google Books
Google Books and HathiTrust
Google Books and the Internet Archive
Fingers and Hands in Google Books
Citing Google Books
References
Chapter 7: Google as a Complement to Library Tools
What’s Wrong with Academic Libraries
The Three Googles
What’s Right with Academic Libraries
The “Flattening” of Information Sources
The “Holy Grail” of Search Results in the Google Age
Reference
Chapter 8: Academic Research Hacks
No Country Redirection (NCR) searching
Google Books Ngram Viewer
Image Searching
Google Translate
Legal Cases
Searching Patents
References
Chapter 9: Case Studies in Academic Research
Case Study 1: Resources on Human Trafficking
Case Study 2: Funding for Religious Schools in Tanzania and South Africa
Case Study 3: Paris Tango House Fashion from the 1920s
Case Study 4: Data on Geothermal Heat Pumps
Chapter 10: Searching for Statistics
Step One: Determine Who Cares about the Statistics You Are Seeking
Step Two: Search within the Internet Domain of the Entities Likely to Issue
Statistics
Keyword Searching and Statistics
Statistics Case Study 1: Immigrant Statistics for the United States
Statistics Case Study 2: Subnational Data for Uganda
Conclusion
Index
Illustrations
FIGURES
Figure 1.1. Key Events in Recent Searching History
Figure 1.2. Contrast in Thinking When Searching by Controlled Vocabulary vs.
Searching Full Text
Figure 2.1. Robots Exclusion Example
Figure 2.2. Example of Numerical Data Not Exposed to Google Searching
Figure 2.3. Indirect Searching to Uncover Hidden Internet Content
Figure 3.1. Basic Structure of a URL
Figure 5.1. Google Scholar Content Ingest Model
Figure 5.2. Examples of Metadata in Google Scholar Records
Figure 5.3. How to Test for Google Scholar Search Depth
Figure 5.4. Testing Google Scholar Search Depth
Figure 5.5. Testing Full-Text Searching Depth in Google Scholar
Figure 5.6. Reason for Failure Revealed
Figure 5.7. Setting Google Scholar for a Specific Library
Figure 5.8. Google Scholar Features
Figure 5.9. Date Sort and Searching Only within Abstracts of Articles
Figure 5.10. Scholar’s Option to Search Only in Citations
Figure 5.11. Google Scholar Features underneath Citations
Figure 5.12. Clustering of Google Scholar Records
Figure 5.13. Citations in Google Scholar
Figure 5.14. Google Scholar Citation Failure for Book Chapters
Figure 5.15. “My Library” Feature in Google Scholar
Figure 5.16. Keyword Search Brainstorming Strategy
Figure 6.1. Google Books Content Sources
Figure 6.2. Limited Preview in Google Books
Figure 6.3. Search Term Hits with Google Books
Figure 6.4. U.S. Government Publication with Snippet View in Google Books
Figure 6.5. HathiTrust Augments Google Books for Government Publications
Full-Text Access
Figure 6.6. Contrast between Google Books and HathiTrust for Out-of-
Copyright Publication
Figure 7.1. Broadcast Searching Model of Late 1990s to Early 2000s
Figure 7.2. Discovery Tools as True Federated Search Tools
Figure 7.3. The Three Googles Model
Figure 8.1. Google and Google Scholar as Seen in Japan
Figure 8.2. Google Scholar Search Results from Japan
Figure 8.3. Accessing English-Language Google Scholar from Japan with NCR
Fix
Figure 8.4. Accessing Google Web from Hungary with NCR Fix
Figure 8.5. Using the Chrome NCR Redirect Add-in
Figure 8.6. Using the NCR Redirect Add-in to Search a Different Local Version
Figure 8.7. Viewing UK News from the United States with Chrome NCR Add-in
Figure 8.8. Google Books Ngram Viewer Showing Popularity of Various Foods
Figure 8.9. Example of Google Images Used to Identify a Place Photo
Figure 9.1. Google Books Highlights Page Hits
TABLES
Table 1.1 Contrast between E-books and E-journals
Table 2.1 Examples of Robot Exclusion from Popular Web Sites
Table 3.1 Commonly Used Power-Searching Strategies
Table 3.2 Evaluation Criteria
Table 4.1 TLDs from Selected Countries
Table 4.2 Variations in National Government Secondary Domains
Table 4.3 Examples of TLDs Exploited for Commercial Purposes
Table 4.4 Domains for U.S. States and the District of Columbia
Table 5.1 Evaluation Criteria for Google Scholar
Table 7.1 The Information Access Anomaly: Assumption of 400 Words per Page
(WritersServices 2001)
Table 7.2 Distinction among Resource Types
Table 8.1 Interesting U.S. Patents
Table 8.2 World Patent Office Content in Google Patents (as of December 16,
2016)
Table 9.1 Finding Web Domains from Entities that Care about the Topic
Table 10.1 Entities Likely to Issue Statistics about Fisheries
Table 10.2 Brainstorming Keywords for Statistical Questions
Table 10.3 Summary of Domain Searches
Introduction
Why is a reference librarian writing about using Google for academic research?
Don’t professors tell students not to use Google in their research? Isn’t Google a
threat to librarianship, and won’t it eventually replace the need for librarians?
This book will suggest that Google is extremely valuable in the academic
research process, but users need to understand what is being searched, how to
constrain searches to academically relevant resources, how to evaluate what is
found on the Web, and how to cite what is found.
It’s not uncommon for new university students to think they already know
how to search. Their first paper comes due, and what do they do? They resort to
using Google. They think they know it all—or at least where everything can be
found. But when they get that first paper back with an unsatisfactory grade and
comments like, “you need to cite peer-reviewed articles,” “don’t rely on
Google,” and “you need reliable sources to support your arguments,” many of
them show up for research consultations and to meet with reference (research)
librarians. It takes a librarian to really show them how to search.
I’ll let you in on a little secret: all reference librarians, academic or otherwise,
use search engines, especially Google. The extent to which it is used varies, of
course, but Google can be the single best starting point for navigation down the
right path. When someone has a question, they don’t know the answer. This
seems obvious, but it is profoundly important. If someone doesn’t know
something, they may not even know how to visualize what the answer looks like
or what path to pursue. Trying to navigate in a fog is nearly impossible. But
Google is there to correct our misspellings, suggest new pathways, and clear the
fog.
This book is not intended to cover every feature of Google. We intentionally
gloss over Google Earth, most of the Google widgets, Google personalization
features, linkages to Google+ and other Google properties, and even some of the
search capabilities that have little to do with academic research. Many books
already cover those features. To find them, all you have to do is search Google
Web like this: how to search google, or Google Books like this: how to search
google. This book is focused on assisting students, researchers, teachers,
professors, and librarians in finding primary and secondary sources using
Google.
1
Searching Generally
HISTORY OF SEARCHING
The blossoming of magazine and journal publishing soon necessitated a way
to discover all this content. Thus modern indexing was born. A glance at a
technology timeline will help give us a historical perspective (see Figure 1.1).
The mid-1800s saw the beginning of periodical indexing with pioneers like
William Frederick Poole and H. W. Wilson. Poole’s Index to Periodical
Literature, published in the mid-1800s through various editions and
supplements, economized space with abbreviations and small print and was
tedious and challenging to use, but it worked. Wilson, whose work endures to
this day, also saved space with abbreviations, but incorporated a technology that
was being developed in his day: subject headings along the lines of Library of
Congress subject headings.
Some of our older readers will remember libraries with card catalogs—those
handsome wooden cabinets with tiny drawers to accommodate cards with
information about the books or other materials owned by the library. The most
common scheme for library card searching was the dictionary catalog approach:
cards arranged alphabetically for authors, cards for subjects, and cards for titles.
But it gets more complex than just these three simple categories. When there are
multiple authors or editors, another card set needs to be created. Every subject
assigned generates more card sets. Title cards would account not only for the
main title, but also for additional titles such as series titles, varying forms of the
title, translated title, etc.
There was no keyword searching in the physical library catalog days. Access,
at least for English language materials, was “left-anchored.” That is, searchers
had to start from the left to look up an author, subject, or title. If an author’s
name was Christopher C. Brown, the name was inverted “phonebook” style to
Brown, Christopher C. Subjects were governed with controlled vocabularies
such as the Library of Congress subject headings. Titles had special
considerations as well. Users had to omit initial articles (for English, omitting a,
an, or the from the beginning of titles). This greatly inhibited the access to
materials, but because that was the state of the art at that time, users didn’t know
what the future held.
The late 1970s and 1980s saw the development of online catalogs. Because
libraries were worried that the public would not accept the new technology,
online catalog records were made to look like printed catalog cards. But there
was one major advancement of technology that was transformative: the ability to
search the online catalog record by keyword. In other words, users wouldn’t
need to think about inverting an author’s name. If all that was known of a title
was several words within the title and perhaps an author’s first name, the book
could likely be found. Users unaware of the proper formation of subject
headings could nevertheless locate materials simply by searching using
keywords. This technology was a huge forward leap and should be fully
appreciated.
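The contrast between left-anchored lookup and keyword searching can be sketched in a few lines of Python. This is only an illustration of the two ways of matching, not a model of any real catalog system; the function names and the sample titles are invented for the example.

```python
import re

ARTICLES = ("a ", "an ", "the ")  # initial English articles dropped when filing titles

def filing_title(title):
    """Return a title as a card catalog would file it: lowercased, initial article removed."""
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered

def invert_name(name):
    """Invert 'Christopher C. Brown' to 'Brown, Christopher C.' (phonebook style)."""
    parts = name.split()
    return f"{parts[-1]}, {' '.join(parts[:-1])}"

def left_anchored_search(headings, query):
    """Match only headings that begin with the query, as in a card catalog."""
    q = query.lower()
    return [h for h in headings if filing_title(h).startswith(q)]

def keyword_search(headings, query):
    """Match headings that contain every query word, in any position."""
    words = query.lower().split()
    return [h for h in headings
            if all(w in re.findall(r"\w+", h.lower()) for w in words)]

# Invented sample titles for illustration only.
titles = ["The Power of Google", "A History of Searching", "Google Scholar Basics"]
```

With these sample titles, a left-anchored search for google finds only the title that files under G, while a keyword search for google also finds the title in which the word appears at the end.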
The quest for magazines or scholarly journal content experienced a
transformation similar to books in library contexts. When indexes no longer had
to be printed out, economy of space was less important. Abstracts of articles
could be incorporated into the entry for each article. Keyword searching likewise
transformed access to scholarly journal literature.
But the technology didn’t stop there. Full texts of e-books and articles began
to appear in the late 1990s and early 2000s. Early e-books were not fun to use.
Digital rights management (DRM) systems made access onerous. In order to sell
their new e-book model to print publishers, e-book vendors tried to replicate the
print user experience with their digital books, making the argument that the
online model completely replicated a print user model: one user at a time per
e-book. But users didn’t understand it that way: they wanted unlimited access to
online content, not “one simultaneous user.” Other vendors applied DRM with
helper applications such as Adobe Digital Editions. Downloading of auxiliary
software places additional barriers before the user, as evidenced by the many
assistance calls placed to reference desks. These and more e-book barriers persist
to this day.
Where e-books failed, e-journals succeeded (see Table 1.1). Books often had
DRM protections, but journals didn’t need this, because it was the part
(individual articles) and not the whole (entire books) that was being exposed.
The only barriers users faced with e-journal content were whether their library
subscribed and the initial authentication process. Once users were
authenticated, they could easily save entire e-journal articles, print them out, and
read them either online or as a printout.
Google entered the world of full text of both e-books and e-journals with
Google Books and Google Scholar, respectively. But more about those models
later. Suffice it to say that these two additional Google initiatives transformed the
way students think about content. From the initial search to finally accessing the
full text, the discovery and fulfillment process was forever changed.
The ProQuest and Gale products, because they don’t have to deal with an
environment with subject headings thrown in with subject descriptors, have
back-generated thesauri that show only subject descriptors with only a single
semantic notion in each one.
FIGURE 1.2. Contrast in Thinking When Searching by Controlled Vocabulary vs. Searching Full
Text.
2
How Google Works
Search engines are one of the marvels of our time. They provide responses from
billions of pages of Web content in a fraction of a second. But how do they
actually work?
RANKING OF RESULTS
Google’s indexing of the Web alone is not sufficient. With billions of Web
pages indexed, some sense must be made of the ranking of results in response to a
user query. This relevance ranking is proprietary to each search engine, and is, in
fact, the feature that distinguishes the good search engines from the great search
engines. Google is constantly tweaking its relevance ranking algorithms, but
over the years Google has proven itself to have superior relevance ranking.
Library vendors, either online catalog vendors or online database vendors,
have offered their versions of relevance ranking. In general these companies are
forthcoming with how they rank the search results within their result sets. But
Google does not tell us how they do it. For one thing, this is how they position
themselves within the competitive search engine market, and for another thing,
they are continually modifying the way their relevance ranking works. So don’t
expect Google to tell us why certain results appear before others within search
results.
BEYOND GOOGLE’S REACH
Not all Web information is findable with Google. Often called the “hidden
Internet,” the “deep Web,” or the “invisible Web,” much content, perhaps much
more than Google actually is able to crawl, is completely invisible to Google.
Some have estimated that this hidden content is 500 times what is findable in
Google (Bergman 2001). There are many reasons for this, and we need to
discuss each of them. Failure to understand why Google does not find everything
will only perpetuate the myth that everything and anything can be found with
Google.
Google Is Polite
Google is not always wanted in all parts of the world. Perhaps you have seen
news stories about China blocking all access to Google, favoring its own
powerful search tool, Baidu. Lawsuits, both domestic and foreign, occasionally
mandate that Google take down certain Web content because it is offensive,
illegal, or disputed as part of legal actions. These kinds of actions have a big
impact on Chinese students who return home after studying in other countries, or
on researchers visiting China, but otherwise will have little impact on
researchers in the United States.
Google will not crawl where it is not wanted. Google pays attention to robot
exclusion protocols. A robot is another term for a Web crawler. One of the oldest
of these protocols, still in use today, is the robots.txt file posted at the
root of many major Web sites. This file tells search engines where they
should not crawl. Try this: go to your favorite Web site, and after the forward
slash from the root directory, add robots.txt. For my university, the University of
Denver, the root domain is: www.du.edu. Adding robots.txt to the root Web
address gives us this URL: http://www.du.edu/robots.txt. Here you will see
portions of the university Web site that Google need not bother to crawl, either
because it is a waste of resources or because it doesn’t add anything to the
discoverability of information the university wants discovered. Here is what
shows up on that page (Figure 2.1).
Not all Web sites employ this old technology. Many have newer tools to
exclude Google and other search engines. But as a fun experiment, try to see
how many robot exclusion files you can discover. Here are just a few examples
(Table 2.1).
The point of this is that Google stays out when it is not wanted, which is at
least a small part of the reason why not all information is available to Google.
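For readers who want to experiment, the robots.txt check described above can be sketched with Python's standard library. The rules below are hypothetical and are not the contents of the University of Denver file or any other real site; the helper names robots_url and may_crawl are likewise invented for this sketch.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

def robots_url(page_url):
    """Return the robots.txt address at the root of the site hosting page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Hypothetical rules for illustration; a real file would be read from robots_url(...).
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
""".splitlines()

parser = RobotFileParser()
parser.parse(SAMPLE_RULES)  # parse the sample lines instead of fetching over the network

def may_crawl(path, agent="Googlebot"):
    """True if the sample rules allow the named crawler to fetch the path."""
    return parser.can_fetch(agent, path)
```

Under these sample rules, may_crawl("/private/report.html") is False and may_crawl("/admissions/") is True. A polite crawler makes the same kind of check before requesting a page.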
Technology Exclusion
Google is not able to go where the technology does not permit. Many
databases are closed to search engines because of the way they work. Other
databases contain information that, even if it could be crawled, would be
meaningless. For example, the U.S. Naval Observatory has a database
(http://aa.usno.navy.mil/data/docs/RS_OneYear.php) that gives sunrise/sunset
data for each day of the year for every place in every state. An example can be
seen in Figure 2.2.
Even if Google were able to index this database, what’s the point? It’s all
numerical data. Google does have its own way of serving up sunrise and sunset
data, through nicely integrated widgets, but not via Google searches directly.
Password and Firewall Exclusion
Google doesn’t have access rights to content that is proprietary. We can easily
illustrate this with library database content you may already be familiar with.
Academic libraries subscribe to many wonderful resources such as Access World
News (local newspaper content from around the world), Alternative Press Index
(indexing of alternative press articles), Archives Unbound (digitized archival
collections of primary sources from various archives), Berg Fashion Library
(texts and images of fashion history), Legislative Histories (a ProQuest database
with exhaustive legislative histories going back to the first Congress),
MarketResearch.com Academic (expensive research reports for use in business
research), United States Congressional Serial Set (digitized copies of
congressional reports and documents from the early 1800s to present), and Web
of Science (cited references across all science, social science, and humanities
disciplines). Because of licensing restrictions that require password access,
Google is not allowed to crawl these resources. Academic libraries typically
subscribe to many hundreds of such databases. It’s true that some of the content
can be found in Google by other means. But generally these expensive interfaces
provide enhanced access that makes the subscription well worth the investment.
Disappeared Content
Sometimes content is removed from the Web for any number of reasons: it has
become obsolete and has been superseded, it is outdated, the funding for the Web
site has stopped, a legal action forced the content to be removed, there are
temporary server problems, and hundreds of other reasons. This is the reason
that most citation styles require researchers to include the access date in their
bibliographic citations. The claim is being made that, at least on the date stated,
the content was viewed. No claim can be made for any other date.
Suppose you were looking for export data between Japan and Argentina,
specifically the most recent available statistics for cars exported from Japan into
Argentina. You could perform a Google search for automobiles export Japan
Argentina statistics. Although this search has some promising results, it doesn’t
really give the precise data you need. The top-down strategy didn’t work, and a
bottom-up style strategy won’t work, since we cannot envision what the
statistical answers would be in this case. An indirect search strategy would
enable us to find hidden Internet content. We could frame a search like this:
foreign trade database by country by commodity. This search turns up the UN
Comtrade Database, which indeed contains the answers we need.
REFERENCE
Bergman, Michael K. 2001. “White Paper: The Deep Web: Surfacing Hidden Value.” Journal of Electronic
Publishing 7, no. 1.
3
Searching Google Web
When I started teaching Internet Reference in 1999, Google was not the search
engine I favored. In 1999 the World Wide Web was just six years old, and the
popular search engines were AltaVista and soon after that AllTheWeb (known as
Fast.com). Google existed at that time, but I paid little attention to it. But within
a few years I became a convert: Google had figured out the relevance ranking
magic and was continually developing new ideas through Google Labs. It was
evident that Google was developing an interest in searching and discovery
beyond what other search engines were interested in accomplishing. Labs was
retired in 2011, but while it existed, it demonstrated the excitement and
determination Google had in developing new ideas.
As time went by, Google-like searching became so ingrained in the culture that
library search engines had to amend their search capabilities to keep up with user
expectations. For example, the phrase contained within quotes, a search engine
staple, is now a standard search feature in many, if not most, library-subscribed
databases.
Phrase Searching
Phrase searching is accomplished by enclosing your search term within
quotes. When we say “quotes,” we mean what is sometimes referred to as
“double quotation marks.” As you read this book you see smart, or curly,
quotation marks used within the text. However, these must not be used in
Google. Copying and pasting items containing curly quotes from Microsoft
Word into a Google search box will sometimes give undesirable results.
Although placement of a phrase in quotes is very powerful as a constraining
mechanism, it should not be overused. Only enclose a phrase in quotes if it is
really a “frozen phrase” in the language in which you are searching. “United
States” is a frozen phrase, as we never say “the States that are United.” However,
“first amendment right” would not be a good idea to enclose in quotes, as the
same idea might be framed as “first amendment gives us the right” or “the rights
of the first amendment.”
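One way to avoid the curly-quote problem when pasting from a word processor is to convert the characters before searching. The following is a minimal sketch; the function names are invented, and it assumes only the four common curly quotation characters need replacing.

```python
# Map the four common "smart" quotation characters to their straight equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})

def straighten_quotes(query):
    """Replace curly quotation marks with straight ones."""
    return query.translate(CURLY_TO_STRAIGHT)

def as_phrase(words):
    """Wrap the words in straight double quotes for a phrase search."""
    cleaned = straighten_quotes(words).strip('"').strip()
    return '"' + cleaned + '"'
```

For example, as_phrase applied to United States, with or without curly quotes around it, returns the phrase enclosed in straight double quotes, ready to paste into a search box.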
Site-Specific Searching
Most researchers I speak with don’t realize that Google doesn’t even let you
see beyond the first 1,000 results, at best. Even if your search retrieved
1 million results and you had many years to sift through them, Google would not
let you look at anything past that point.
To test this, set your Google search results to 100 results per page, just to
make this task easier. Now perform any broad, general search in Google, go to
the bottom of the page, and you will see up to 10 pages to which you can
navigate. The first page should have results 1 to 100, the second page results 101
to 200, and so on. Usually the results will stop far short of 1,000—maybe around
600 to 800 results. What this means is that if the results affecting academic
research are not in these top 1,000 results, you will never see them. Throwing
more words at Google will certainly bring different results to the top, but it may
also keep out results that would have helped you. There must be a better way—
and there is!
Site-specific searching means that you access Google’s indexing of only a
specific site. For example, it is often difficult to locate documents at my
university, the University of Denver. But if I do a site-specific Google search
using the syntax: site:du.edu, followed by my keywords, the result set is only
results from the du.edu Internet domain. This syntax has a couple of rules that
must be followed religiously: “site” must not be capitalized; and there must be
no space after the colon. Technically it is okay to have a “dot” after the colon.
For example, we can search site:.gov to find U.S. government information, but I
never teach this as a best practice. When I teach in front of groups, I fear that the
dot may be confused with a space, so I always omit it.
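The syntax rules above can be captured in a small helper that assembles a site-specific query. This is a sketch only: the function name site_query and the sample keywords are invented, and dropping a leading dot simply mirrors the teaching practice just described.

```python
def site_query(domain, keywords):
    """Build a site-specific query such as 'site:du.edu parking permits'."""
    domain = domain.strip().lower().lstrip(".")  # omit any leading dot
    if not domain or " " in domain:
        raise ValueError("domain must be a single token with no spaces")
    # Lowercase operator, and no space between the colon and the domain.
    return f"site:{domain} {keywords.strip()}".strip()
```

Calling site_query("du.edu", "parking permits") produces site:du.edu parking permits, which can be pasted directly into the Google search box.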
Other Considerations
There are those occasions when the advanced search interface is essential. For
example, if searching for local information, such as news local to a country, the
“region” limit on the advanced search page works really well and cannot easily
be performed apart from the advanced search page.
Google occasionally changes its search operators. For example, many readers
may be accustomed to using the plus sign (+) in searches. Formerly the plus sign
could be used to force a word or phrase to be present in the results, rather than
treating it as an optional OR. But Google intentionally removed the plus sign as
an operator so that it could be used with Google Plus. Library databases stick
with the same search operators year after year, but with Google we are at the
mercy of its whims.
A variation on phrase searching is something Google calls verbatim searching.
The syntax is verbatim:your search terms. Google claims that this will not
search for plurals or alternative spellings. Another way to do this is to go directly
to https://www.google.com/webhp?tbs=li:1. This defaults your search to a
verbatim search in Google. There is a significant difference between verbatim
searching and quotation (phrase) searching. Searching with quotes around terms
finds the exact strings. But doing verbatim searching finds all the search terms,
with the exact spelling, but not necessarily as a single contiguous string. Thus
verbatim searching will produce more results.
If you wanted to find a common misspelling of United States in Google, you
could search verbatim:untied states, although for whatever reason I find more
predictable results by visiting the URL mentioned in the previous paragraph.
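The difference between the two kinds of matching can be illustrated with a toy model in Python. This is not how Google implements either feature; it only mirrors the distinction as described here, and the function names and sample sentences are invented.

```python
import re

def tokens(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_match(text, phrase):
    """True if the words of the phrase appear consecutively and in order."""
    t, p = tokens(text), tokens(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

def verbatim_match(text, terms):
    """True if every term appears with its exact spelling, anywhere in the text."""
    words = set(tokens(text))
    return all(term in words for term in tokens(terms))

page = "The states were untied from the agreement"  # invented sample text
```

For this sample sentence, verbatim_match(page, "untied states") is True because both words occur, while phrase_match(page, "untied states") is False because they never occur side by side. That is why the verbatim approach returns the larger result set.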
KEYWORD FORMULATION
When selecting keywords for Google searches, you can be more generous
than when selecting keywords for library database searches. When searching a
library database, the default search is often not to search the full text of content,
but rather to search the metadata only. For example, library catalogs today search
the text within a catalog record. This definitely includes the title, subject
headings, and possibly notes, summaries, and at times tables of contents of the
work, but not the full text of materials. Reference librarians often advise that
users search online catalogs with broad terms to capture the most relevant record
set. When searching Google, which reaches down into the full text of content,
we can be more generous and use more specific search terms. I call this “going
for the gold.” Search as if you know you can find exactly what you are looking
for. If and when that strategy fails, then think of synonyms, broader terms and
related terms to change the search.
When you search Google Web with many keywords, Google presents results
even if not all of the keywords are present on the Web pages. Google does you the
favor of saying which keywords were not found. I did a keyword search looking
for subnational data from Uganda like this, making reference to two subnational
regions: uganda database wakiso masindi economy gdp. None of the Web pages
Google retrieved contained all of these keywords. On the first result Google
noted this: Missing: wakiso masindi. This was extremely helpful, informing me
that the top result was about a database on the Ugandan economy, but it did not
contain the names of the Wakiso or Masindi regions of Uganda.
Be as specific as possible when searching Google. Because library databases
have a small set of data that you are searching against, you often get no results
when framing a search with many search terms. But Google’s index is many
times larger; thus, you can afford to throw more words at Google. If Google
doesn’t find all the terms, it starts to selectively discard terms, as noted earlier,
so as to still give you some results.
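The "Missing:" note can be imitated with a short Python sketch that compares a query against the words on a page. The page text below is invented for illustration, and the function is only an approximation of the idea, not a description of how Google decides which terms to drop.

```python
import re

def missing_keywords(page_text, query):
    """Return the query words that do not appear anywhere in the page text."""
    words_on_page = set(re.findall(r"\w+", page_text.lower()))
    return [w for w in query.lower().split() if w not in words_on_page]

page = "Uganda economy database: national GDP figures"  # invented sample page text
query = "uganda database wakiso masindi economy gdp"

print("Missing:", " ".join(missing_keywords(page, query)))
# prints: Missing: wakiso masindi
```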
In the Web world, the same criteria apply. It’s just that the methods of
determining the answers may be different, and different tools are available to use
to make these determinations. Let’s look back at some of the additional
evaluation considerations for Internet content.
URL Diagnostics
There is an additional criterion available to researchers when considering
Internet research: indicators given by the URL of the Web site. A URL can
contain hints that divulge much about a Web site’s authenticity or credibility. As
an example, a tilde (~) in a URL is often an indicator of a personal Web site.
Many universities have tilde sites for faculty, students, or staff. In the early days
of the Web, Congress issued tilde sites to senators and members of the House of
Representatives. For example, Senator Ted Kennedy had the site
www.senate.gov/~kennedy. Congress does not use this pattern today, but the
Internet Archive’s Wayback Machine has captured Kennedy’s old Web sites at
various points in time. By visiting the Wayback Machine at archive.org and
typing http://www.senate.gov/~kennedy/ into the Wayback Machine search box,
we can see old versions of Senator Kennedy’s Web sites from as far back as
1997. The point is that we need to pay attention to such markers within URLs, as
they can give clues as to the authority of the site.
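A tilde check of this kind is easy to automate. The sketch below uses Python's standard urllib.parse module; the function name is invented, and a tilde is only a hint of a personal site, not proof.

```python
from urllib.parse import urlsplit

def personal_site_owner(url):
    """Return the name following a tilde in the URL path, or None if there is none."""
    if "//" not in url:
        url = "//" + url  # let urlsplit read a bare 'www.example.org/...' as host plus path
    for segment in urlsplit(url).path.split("/"):
        if segment.startswith("~") and len(segment) > 1:
            return segment[1:]
    return None
```

Given www.senate.gov/~kennedy, the function returns kennedy; given a URL with no tilde segment, it returns None.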
Structure of a URL
To begin to evaluate content, it is necessary to know something about the
structure of a URL. Take this URL as an example:
http://www.census.gov/geo/maps-data/data/tiger.html.
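The parts of this URL can be pulled apart with Python's standard urllib.parse module. The variable names are chosen for this sketch, and treating the final label of the host name as the top-level domain is a simplification that works for this example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.census.gov/geo/maps-data/data/tiger.html")

scheme = parts.scheme                                # 'http'
host = parts.hostname                                # 'www.census.gov'
top_level_domain = host.rsplit(".", 1)[-1]           # 'gov'
directories = parts.path.strip("/").split("/")[:-1]  # ['geo', 'maps-data', 'data']
filename = parts.path.rsplit("/", 1)[-1]             # 'tiger.html'
```

Reading the pieces from left to right tells us how the page is delivered, who hosts it (a U.S. government domain), and where the file sits within the site's directory structure.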
Another way to learn more about a Web site is to try what I call “backing off
the URL.” In this Web page about Halloween
(http://www.jeremiahproject.com/culture/halloween.html), we find many things
that make us suspect that there is an underlying agenda. By backing off the URL
—first taking off halloween.html, then removing culture/, we get back to the root
directory from which we can more clearly discern the background and intentions
of the authors.
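Backing off a URL is mechanical enough to sketch in code. The generator below, whose name is invented for this example, removes one path segment at a time until it reaches the root directory.

```python
from urllib.parse import urlsplit, urlunsplit

def back_off(url):
    """Yield successively shorter URLs, ending at the site's root directory."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    while segments:
        segments.pop()  # drop the last file or directory name
        path = "/" + "/".join(segments) + ("/" if segments else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

for shorter in back_off("http://www.jeremiahproject.com/culture/halloween.html"):
    print(shorter)
# prints:
# http://www.jeremiahproject.com/culture/
# http://www.jeremiahproject.com/
```

Each shorter address is worth visiting in turn, since intermediate directories sometimes reveal as much about a site's purpose as the root does.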
Currency
When was the Web site created? When was the page or site last updated? For
current news contexts, a current date and even time are very important.
However, sometimes the date on a Web page really doesn’t matter. Take this
URL, for example:
http://memory.loc.gov/ammem/aap/aaphome.html.
The title of the page is “African American Perspectives: Pamphlets from the
Daniel A. P. Murray Collection, 1818–1907.” Examining the URL we note that it
is published by the Library of Congress. It is part of their American Memory
Project. If we “back off” the URL we see the root site:
http://memory.loc.gov/ammem/index.html. But back to the page in question. At
the bottom of the Murray page we see that the date is “Oct-19-1998.” Should we
assume this is not an authoritative page because it is nearly 20 years old and
doesn’t seem to have been updated at all since that time? No, in this case that
wouldn’t make sense. This is historical content, posted on the Web, and not
needing any kind of updates. It’s just as valid and credible now as it was in 1998.
Google Web Date Searching: Not to Be Completely Trusted
Google has a challenging task: index all it can of the World Wide Web. If
only all Web publishers and all the content that Google ingests would
follow standards. Ideally every Web page would contain information,
either explicitly stated for all to see, or at least hidden in the metadata,
about who created the content and when it was created or modified. Such
data does exist for Google Scholar and Google Books records, because that
metadata comes into Google in normalized formats from publishers and
other sources following standards. But Google Web does not have that
luxury in many cases.
Some Web pages are created with all the proper metadata by software
produced by commercial firms or freely available products. Other pages
are created through automated processes that fail to incorporate basic
metadata into their pages. This is part of the reason for the metadata being
irregularly present.
Google Web results can be limited by date, but don’t be fooled. These
limits only apply to compliant pages. Only those pages that publish their
created or modified dates will show up in the results; all other results will
be lacking, and you will get a false sense of completeness.
Authorship/Creatorship
A Web page or site didn’t just happen; it was created by someone. It may have
been a person or several people, or it may be attributable to a group, what we
call a corporate entity. By corporate we don’t necessarily mean a corporation or
company; we mean a group entity of some type: a nonprofit organization, a
commercial company, a local government, an international organization, a
federal government agency, a religious group, or a political action committee. It
is important to know who created a Web site, because you need to know the
authority of the person or group (do they have sufficient credentials to speak to
the subject?) and because you need to properly cite the page in your cited
references.
Hopefully the author, whether personal or corporate, will be clearly stated. But if
it is not, a Web tool called WhoIs is helpful in determining ownership of Web sites.
Just search Google for whois, and you will uncover many WhoIs servers. These
servers contain the public Internet registration information from those who
register domain names. You can discover names of the people who registered the
domain, which can sometimes assist in the background work.
We had previously referred to the Wayback Machine from the Internet
Archive. By going to archive.org and entering a domain or a Web page into the
Wayback Machine search box, you can see cached content over time and see
how pages have changed and if viewpoints have been modified over time.
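Instead of typing into the search box, a Wayback Machine address can usually be built directly. The sketch below assumes the commonly seen pattern web.archive.org/web/TIMESTAMP/URL, where a partial timestamp such as a year leads to a nearby capture and an asterisk leads to the list of all captures. The function name wayback_url is ours, and the pattern should be treated as an observed convention that could change.

```python
def wayback_url(url, timestamp="*"):
    """Build a Wayback Machine address for a page.

    timestamp may be a full YYYYMMDDhhmmss string, a prefix such as a
    year, or "*" to request the overview of all captures.
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"

print(wayback_url("http://www.senate.gov/~kennedy/", "1997"))
print(wayback_url("http://www.senate.gov/~kennedy/"))
```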
CACHED CONTENT
The Web is volatile. Using Google, just type define:volatile. The second
definition given there is “liable to change rapidly and unpredictably, especially
for the worse.” We’ve all experienced Web content that has disappeared; it no
longer exists. But there is hope. Several techniques can be attempted to recover
content that is no longer accessible in the original manner.
Google’s Cache
Perhaps you have had the experience of using Google to locate information,
only to discover that by clicking on the title you get an error. This is because
when we search Google we are not searching the live Web—we are searching
Google’s indexing of the Web, which was done at some point in the past
(perhaps a week ago, perhaps several months ago). Sometime after Google’s last
indexing the page or documents have disappeared. One way to recover from this
is to see if Google has a cached (or stored) version of the document. In Google
Web you notice that the Web page titles are in blue. Underneath you will see the
URL in green. To the right of the URL you may see a small green downward-
pointing arrow. Clicking this arrow gives you the possibility of accessing
Google’s cached version of the Web page—in other words, the version of the
Web page that Google indexed. This is extremely helpful in trying to recover
vanished content.
Wayback Machine
Another way to recover content is the previously mentioned Wayback
Machine. Take this government publication as an example of how the Wayback
Machine can recover content that was previously indexed by Google but is now
seemingly extinct:
United States Forest Service. 2006. Carpenter Ants in Alaska: Insect Pests of Wood Products.
Anchorage, AK: U.S. Dept. of Agriculture, Forest Service, Alaska Region 10, Forest Health
Protection, State and Private Forestry.
REFERENCE
Google. 2016. “Research at Google: Natural Language Processing.” Available at:
http://research.google.com/pubs/NaturalLanguageProcessing.html. Retrieved December 10, 2016.
4
Power Searching for Primary Sources
Using Google Web
The power of Google is the power of the full-text search. In the very early days
of the Web, search engines would not always scour every word of every document.
Sometimes search engines would not pursue links or index words buried deeply
down in the directory structure of Web sites. Other times long documents, such
as PDFs, would not be searched in their entirety. But that was then; today search
engines are capable of so much more power in terms of depth of searching and
tenacity at drilling down into the entire directory structures of Web sites.
When we search Google, Google decides which results come to the top.
Multiple factors affect Google search rankings. Among them are
Age of Internet domain (how long the Web site has been registered; longevity is assumed to be more
reliable than transience)
Domain history (drops in registration or ownership may imply unreliability)
Public vs. private domain registration (Are the registrants hiding something?)
Duplication of content, frequency of content updating, outbound linking, grammar, and spelling (a
signal of quality)
Broken links and cited references and sources (a possible sign of scholarship)
These are only a few of the factors thought to go into Google rankings.
Perhaps you have heard of “search engine bias.” This tends to occur in the
business world of Web site positioning. When you want to eat Japanese food,
you would likely type Japanese restaurants into a search engine. The results you
see on top are typically ads. These ads are the result of Google knowing where
you are and what companies in your area are willing to pay for ads. But after the
ads you will see links to Web sites, with Google pushing to the top results that
are in your area mixed with results from Web sites that have paid for higher
positioning. Search engine optimization protocols are used to make Web sites
more visible and help to push them higher in the results list. Although much has
been written about these practices, that is beyond the purpose of this book. Here
we only care about using Google to assist us in our academic pursuits.
Use a government organizational directory. The official directory, published annually, is the United
States Government Manual. Available at https://www.gpo.gov/fdsys/browse/collection.action?collectionCode=GOVMAN,
the manual provides an understanding of the complexities of the federal
government as well as providing direct URLs to agencies.
USA.gov provides a comprehensive list of U.S. government agencies at
https://www.usa.gov/federal-agencies/a.
Search Google directly for the name of the government entity. Let’s say you wanted to find the
government agency that deals with safety on highways. Your initial Google search might look like this:
site:gov highway safety. You would then discover that the National Highway Traffic Safety
Administration (NHTSA) is the appropriate agency and that their URL is http://www.nhtsa.gov/.
Whether using the more explicative government manual or the more dynamic
USA.gov site, once you have identified an agency of interest, you can put your
site-specific searching to use in Google. For example, once you discover that the
National Aeronautics and Space Administration site is nasa.gov, you could
frame a search such as this: site:nasa.gov moon landing. Using the file type limit
we covered previously, we could narrow Google results to more substantive
pages: site:nasa.gov moon landing filetype:pdf.
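Searches of this shape follow a regular pattern, so they can be assembled by a small helper. This is a sketch only: the function google_query is hypothetical, and it simply produces the text that would be typed into the Google search box.

```python
def google_query(terms, site=None, filetype=None):
    """Assemble a Google Web search string with optional site and file type limits."""
    parts = []
    if site:
        parts.append(f"site:{site}")        # restrict results to one domain
    parts.append(terms)
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to one file type
    return " ".join(parts)

print(google_query("moon landing", site="nasa.gov"))
print(google_query("moon landing", site="nasa.gov", filetype="pdf"))
```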
Let’s try a different topic. Let’s say you have a research project on the
Falun Gong religious sect, which has been banned in China. You can use
these steps.
1. Find the top-level domain for mainland China. Type tld into Google Web, and you can
access one of several lists of TLDs. Because the Wikipedia list is near the top, we can use
that list. We find that the TLD for China is .cn.
2. Search Google Web like this: site:cn falun gong. Now you will notice in the search
results that the secondary domain for Chinese government Web sites is gov.cn. We can
further narrow our search like this: site:gov.cn falun gong.
3. For a contrasting viewpoint, we can see what U.S. government sites have to say about this
topic. Search like this: site:gov falun gong. We note from the results list that the U.S. State
Department has something to say about this topic, so we can focus our search like this:
site:state.gov falun gong.
4. We can further refine these results by restriction to file type: filetype:pdf site:state.gov
falun gong.
Although most U.S. government sites use a .gov TLD, there are many
exceptions to this. The largest exception is U.S. military sites that use .mil as
their TLD. Several agencies use the older .us for their domain. The U.S. Forest
Service uses fs.fed.us. Because the U.S. Forest Service is the most prolific user
of this domain, we need to use our Google Web search skills here to discover
other federal sites that use .us. We first frame a search like this: site:fed.us. We
see that nearly all the top results are Forest Service related. We want to rule out
any Forest Service sites from the search results, so we incorporate a NOT
operator into the search this way: site:fed.us −site:fs.fed.us. Notice the minus (−)
sign directly before the second element of the search, telling Google to eliminate
results from the Forest Service. We now see that many U.S. courts also use the
.us TLD. Just to make it more complex, the United States Postal Service uses
.com (usps.com, redirected from usps.gov) and National Defense University,
under the Department of Defense, uses .edu (ndu.edu). These are not the only
examples of exceptions, but they illustrate the need to not make assumptions
about government top-level domains.
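To see what the combination of site: and the minus operator is doing, it can help to model it on a list of host names. The sketch below is a simplified picture of the logic, not Google's implementation. The helper names are ours, and the host court.example.fed.us is made up for illustration.

```python
def in_site(host, site):
    """True if host is the site itself or one of its subdomains."""
    return host == site or host.endswith("." + site)

def filter_hosts(hosts, include, exclude):
    """Mimic 'site:include -site:exclude' on a list of host names."""
    return [h for h in hosts if in_site(h, include) and not in_site(h, exclude)]

# Illustrative host names only; court.example.fed.us is hypothetical.
hosts = ["www.fs.fed.us", "fs.fed.us", "court.example.fed.us", "www.nhtsa.gov"]
remaining = filter_hosts(hosts, include="fed.us", exclude="fs.fed.us")
print(remaining)
```

Only hosts under fed.us that are not under fs.fed.us survive, which is the effect of the search site:fed.us −site:fs.fed.us.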
Oklahoma: ok.gov and oklahoma.gov, most content under ok.gov; content also under state.ok.us
Oregon: oregon.gov; content under state.or.us
Pennsylvania: pa.gov; pennsylvania.gov redirects to pa.gov, content under pa.gov and state.pa.us
Puerto Rico: pr.gov
Rhode Island: ri.gov; rhodeisland.gov redirects to ri.gov, very little content under rhodeisland.gov; most content only under ri.gov and state.ri.us
South Carolina: sc.gov; content also under state.sc.us
South Dakota: sd.gov; content also under state.sd.us
Tennessee: tn.gov and tennessee.gov; content also under state.tn.us
Texas: texas.gov, some content also under tx.gov; content also under state.tx.us
Utah: utah.gov; content also under state.ut.us
Vermont: vermont.gov; vt.gov redirects to vermont.gov; content under vt.gov, vermont.gov, and state.vt.us
Virgin Islands: vi.gov (don’t confuse with British Virgin Islands, .vg)
Virginia: virginia.gov; content also under state.va.us
Washington: wa.gov; washington.gov redirects to wa.gov, but content only under wa.gov; content also under state.wa.us
West Virginia: wv.gov; content also under westvirginia.gov and state.wv.us
Wisconsin: wisconsin.gov; wi.gov redirects to wisconsin.gov; content under wi.gov, wisconsin.gov and state.wi.us
Wyoming: wyo.gov; wy.gov and wyoming.gov redirect to wyo.gov; content under wyo.gov, wy.gov, wyoming.gov and state.wy.us
Searchers should keep in mind that URLs used by states are subject to change
and updating at any time.
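Because a single state may publish under several domains, one search can cover all of them by joining site: limits with OR. The sketch below assumes that Google accepts OR between site: operators, which has generally been the case. The function state_query and the sample topic "water quality" are ours, and only three states from the table are included.

```python
# Domains copied from the table above for three states.
STATE_DOMAINS = {
    "Oklahoma": ["ok.gov", "oklahoma.gov", "state.ok.us"],
    "Oregon": ["oregon.gov", "state.or.us"],
    "Wyoming": ["wyo.gov", "wy.gov", "wyoming.gov", "state.wy.us"],
}

def state_query(state, terms):
    """Build one search covering every known domain for a state."""
    sites = " OR ".join(f"site:{domain}" for domain in STATE_DOMAINS[state])
    return f"{sites} {terms}"

print(state_query("Oregon", "water quality"))
```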
According to the Census Bureau, in 2012 there were 3,031 counties, 19,519
municipalities, and 16,360 townships (Hogue 2012). It would not be possible to
list all the possible variations of Internet domains within this publication.
Besides, the results would change perhaps daily. This means that the skillful
searcher simply needs to apply the Google searching skills described in the book
to come up with the most comprehensive results.
In these cases it’s best simply to search Google for the subject of the nonprofit
organization and restrict the search to the appropriate limits, either the TLD of
.org (in the case of organizations registered within the United States) or the
secondary-level TLD (in the case of organizations registered in other countries).
As an example, to find human rights associations in the United States, search:
site:org human rights. To find organizations in Japan, Russia, Mexico, and
Nigeria, search like this: site:or.jp human rights; site:org.ru human rights;
site:org.mx human rights; site:org.ng human rights. It’s even better if you can
search in the national language of the country: site:or.jp 人権; site:org.ru права
человека; site:org.mx derechos humanos; site:org.ng droits de l’homme.
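Searches in other scripts can be typed straight into Google, but if they are saved or shared as links, the non-Latin characters have to be percent-encoded. The following sketch uses Python's standard urllib.parse for that. The function name search_url is ours, and the google.com/search?q= address is the familiar results address rather than a documented programming interface.

```python
from urllib.parse import quote_plus, unquote_plus

QUERIES = [
    "site:or.jp 人権",
    "site:org.ru права человека",
    "site:org.mx derechos humanos",
]

def search_url(query):
    """Percent-encode a query so it can travel safely inside a URL."""
    return "https://www.google.com/search?q=" + quote_plus(query)

for query in QUERIES:
    url = search_url(query)
    # Decoding the encoded part gives back the original query unchanged.
    assert unquote_plus(url.split("q=", 1)[1]) == query
    print(url)
```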
REFERENCE
Hogue, Carma. 2012. “Government Organization Summary Report: 2012.” Census Bureau. Available at:
http://www2.census.gov/govs/cog/g12_org.pdf. Accessed December 11, 2016.
5
Google Scholar and Scholarly Content
Google’s early successes as a search engine led them to undertake a very smart
experiment. Google must have realized that librarians and publishers are very
principled in the way they produce metadata following basic standards so that
citations to scholarly literature invariably contain elements like title, author, date
of publication, and control numbers such as international standard serial number
(ISSN) and international standard book number (ISBN). Leveraging these
metadata features, Google was able to produce a product so powerful that it
surpassed all existing indexing and abstracting tools in use in libraries in terms
of depth of searching.
Traditional article databases in libraries search across basic indexed fields
such as author, title, subjects or keywords, and abstract or summary of the
article. Google has these metadata elements in its giant index, but also has the
full text of a high percentage of scholarly publications. The default Google
Scholar search is to scour across the entire full text of articles, whereas the
default search in many, perhaps most, article databases found in libraries is to
search across all metadata fields, but not across all full text. The plus side of
searching only the metadata is that the user gets more relevant search results.
The negative is that the user gets fewer results. In Google’s case many more
results are retrieved, but the results are ranked by Google’s proprietary relevance
ranking scheme.
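The trade-off between the two kinds of searching can be shown with a toy example. The two records below are invented for illustration and the search function is deliberately naive (a simple substring match), so this is a sketch of the principle rather than a model of any real database.

```python
# Two made-up records, for illustration only.
RECORDS = [
    {
        "title": "Urban beekeeping policy",
        "abstract": "A survey of municipal beekeeping ordinances.",
        "full_text": "Municipal ordinances vary. Several cities also regulate "
                     "backyard chickens alongside beehives.",
    },
    {
        "title": "Backyard chickens and zoning",
        "abstract": "How zoning codes treat backyard chickens.",
        "full_text": "Zoning codes increasingly permit small flocks of chickens.",
    },
]

def search(term, fields):
    """Return titles of records where the term appears in any listed field."""
    term = term.lower()
    return [r["title"] for r in RECORDS
            if any(term in r[f].lower() for f in fields)]

metadata_hits = search("chickens", ["title", "abstract"])
full_text_hits = search("chickens", ["title", "abstract", "full_text"])
print(metadata_hits)
print(full_text_hits)
```

The metadata search returns only the article that is about chickens. The full-text search also returns the beekeeping article, which mentions chickens only in passing: more results, but less relevant ones.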
Of course, we don’t really know Google’s inside story, but from the product
that results, we can surmise the following. Google went to publishers and said to
them, “Give us your metadata—metadata to all scholarly articles that you have
ever published.” But why would any publisher ever want to do that? After all,
publishers have been able to sell this metadata in print, in compact disc, and later
through Web portals, at a very high price to libraries. But that wasn’t all. Google
further said to publishers, “Don’t stop with the metadata; give us your full PDFs
as well.” So why would any publisher be interested in doing this? The answer
would seem to be “to monetize it.” Publishers could give away access to
abstracts of scholarly articles in exchange for worldwide access to their obscure
journal content. This could be a win-win situation: a win for Google, being the
host of all this content (and later able to sell advertising), and a win for
publishers, who would not allow access to full-text content unless it was
purchased.
But Google didn’t just stop with this model. They went to libraries and
requested a listing of all journals to which a library subscribed, in both tangible
formats (print or microfiche) and in online format, through any and all
publishers and aggregators, and also the specific holdings for those journals (that
is, the dates that the library owned or had online access to).
How could any library, even academic libraries with large budgets, pull this
off? Well, most couldn’t do this without help, without a vendor prepared to take
on this task. This was the time when several vendors were prepared to do just
that: prepare a large XML-formatted file with a library’s complete journal
holdings, subscriptions, and ownership, covering all years, all vendors, and all
publishers. Google could then come along on a periodic basis and grab that
information and do what Google does best: index the information into its giant
Google database (see Figure 5.1).
The result of this merging of publisher and library holdings information is
depth of discovery provided by Google and direct journal access (hopefully)
provided by your local academic library. Because of this linkage of library
content with Google Scholar results, Scholar is not a tool that works in
opposition to libraries; rather it is one of the greatest proponents of the richness
of an academic library’s expensive investments. That’s why Scholar works best
for those associated with research libraries and does little to assist public
libraries.
FIGURE 5.1. Google Scholar Content Ingest Model.
FIGURE 5.2. Examples of Metadata in Google Scholar Records. Google and the Google logo are registered
trademarks of Google Inc., used with permission.
In Figure 5.2, the boxes under each respective citation point out the metadata
supplied by publishers, and the boxes in the right margin point out the metadata
matches based on library holdings.
We will copy the word string from page 83 of the article (Figure 5.3) and
search it in Google Scholar.
Generally when doing these tests it is best not to cross line breaks, as they can
produce irregular results. But with this article, because the columns are so brief,
we had no choice but to take our text from a second line. Placing the search
within quotes in Google Scholar gives us the result we expect (Figure 5.4).
Of course we will find instances where this test fails. Although much of the
content in Google Scholar was “born digital”—that is, originally created in an
online format—some content contains scans of older articles created long before
the digital age. We might call this “legacy content” because it originally
appeared in a print format and had to be digitally scanned and then processed
with optical character recognition (OCR) software to make the words searchable
in online environments. The OCR process is not 100 percent perfect, leading to
some misfires when performing tests such as these. Nevertheless, the indexing
power of Google Scholar is impressive enough to capture the full-text indexing
of a very high percentage of scholarly publications over the years, at least in the
English language.
In this 1893 journal article we attempt the same test (Figure 5.5) and the test
fails.
Steele, Theodore C. “Impressionalism.” Modern Art 1, no. 1 (1893).
The reason for this failure to retrieve the text can be seen when copying and
pasting the underlying text. When we copy and paste the text directly into the
Google search bar, we see “gremlin” characters in the pasted text (Figure 5.6):
FIGURE 5.6. Reason for Failure Revealed.
The underlying OCR-ed text in this case contains errors. When we search for
the error-laden text, we do indeed retrieve the text of this article. Google does
not correct these errors; if it did take the time to micromanage every such error,
we certainly would not have the product we have today. Its OCR indexing is
good enough to get the job done. This means that there will be some degree of
failure when searching the full text of articles. But for the most part, the text is
retrievable.
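A small simulation shows why the phrase test fails on badly scanned text yet succeeds when the error-laden string itself is searched. Both strings below are hypothetical; they are not taken from the article in question, and the matching is a plain substring test rather than anything Google actually does.

```python
# Hypothetical strings: what the printed page says vs. what the OCR stored.
printed = "the impression of nature upon the painter"
ocr_text = "the impreffion of natnre upon the painter"

def phrase_search(phrase, indexed_text):
    """Exact phrase match against the text that was actually indexed."""
    return phrase.lower() in indexed_text.lower()

print(phrase_search("impression of nature", ocr_text))   # correct words, bad index
print(phrase_search("impreffion of natnre", ocr_text))   # the OCR errors themselves
```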
This strangely long URL is actually understandable. It’s easy to identify the
field tags here: sid, auinit, aulast, atitle, volume, issue, date, spage, and issn.
It should be evident now why it is called an openURL. You can read the
metadata directly from the URL. The article title is plainly visible, as is the
source title, that is, the journal. Volume, issue, date, starting page, and ISSN
control number are also present. Google has some kind of underlying database
with various control numbers that somehow, mysteriously to us, points to and
generates the eventual openURL.
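To make the point concrete, the sketch below builds an openURL from citation fields and then reads the metadata back out of it. The resolver address is hypothetical, the ISSN is a placeholder, and the sid value is invented; the remaining values come from the Shi and Levy article cited later in this chapter. Real resolvers may expect additional or differently named fields.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://resolver.example.edu/openurl"   # hypothetical resolver address

fields = {
    "sid": "google",                 # illustrative source identifier
    "auinit": "X",
    "aulast": "Shi",
    "atitle": "An Empirical Review of Library Discovery Services",
    "volume": "8",
    "issue": "5",
    "date": "2015",
    "spage": "716",
    "issn": "0000-0000",             # placeholder, not the journal's real ISSN
}

open_url = BASE + "?" + urlencode(fields)
print(open_url)

# Reading the URL back: every field tag maps to its value again.
decoded = {tag: values[0]
           for tag, values in parse_qs(urlsplit(open_url).query).items()}
for tag, value in decoded.items():
    print(f"{tag} = {value}")
```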
This is the opposite of what you want with online banking. We don’t want to
pass bank account numbers, passwords, and balance information through URLs.
This is why banking is done with sophisticated encryption technologies. But
bibliographic citations need no security; thus, openURLs make sense in these
contexts.
Van de Sompel’s invention was really a coup for the publishing world. It
created an international standard whereby publishers and aggregators that
compete with each other can nevertheless pass information along for the
purposes of information discovery. When a library link is clicked, the library’s
openURL resolver finds all the content, down to the specific journal article, that
meets the criteria.
Among the link resolvers libraries commonly subscribe to are SFX from Ex
Libris, 360 Link from Serials Solutions, LinkSource from EBSCO, WebBridge
from Innovative Interfaces, and GoldRush from the Colorado Alliance of
Research Libraries.
LEFT SIDEBAR
What makes Google Scholar distinct from Google Web is its separate interface
with unique features. These features are made possible because 100 percent of
Google Scholar’s content is metadata based. Let’s examine each of these features
(see Figure 5.8).
Let’s first examine the features available in the left sidebar of a Google
Scholar result set. It should be reiterated that these facets or limiters are evidence
of the underlying metadata that builds up the Google Scholar database. Unlike
Google Web, these metadata-driven features are evidence of the reliability of the
dates and other metadata types.
Limit by Date
This is perhaps the most powerful feature of Google Scholar. Because, unlike
in Google Web, dates are consistently applied to all Scholar records, the date
limit is extremely reliable. Google Scholar ranks by its idea of relevancy,
incorporating older materials and newer materials together. The power behind
the date limit is to factor out materials that are not relevant to your research. If
you only care about research from the last five years on your topic, then adjust
the date accordingly. The custom date range feature works extremely well. If you
wanted to view medical research during the early days of the recognition,
understanding, and diagnosis of HIV/AIDS, you could easily restrict the dates to
say 1980 to 1984 and view only those scholarly articles.
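The custom date range also shows up in the address of a Scholar results page, which makes date-limited searches easy to bookmark. The sketch below assumes the as_ylo and as_yhi parameters that appear in Scholar result URLs; these are observed conventions rather than a documented interface and could change. The function name scholar_url is ours.

```python
from urllib.parse import urlencode

def scholar_url(query, year_from=None, year_to=None):
    """Build a Google Scholar results address with an optional date range."""
    params = {"q": query}
    if year_from is not None:
        params["as_ylo"] = year_from    # earliest publication year (assumed name)
    if year_to is not None:
        params["as_yhi"] = year_to      # latest publication year (assumed name)
    return "https://scholar.google.com/scholar?" + urlencode(params)

print(scholar_url("HIV AIDS", 1980, 1984))
```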
Google Scholar has the ability to limit by date because of its underlying
metadata. In the left margin of a Scholar result set we see date limit suggestions,
as well as the possibility to limit by any other dates we choose. This is distinctly
different from Google Web’s ability to limit by date. Google Web indexes all
Web pages to which it has access, whether those pages contain principled
metadata or not. It is only those pages that contain adequate date information
that are available for Google’s date limit within Google Web. Thus, whenever
we use Google Web to limit by date, we are only retrieving those records that
have sufficient underlying metadata to appear in the result set. In other words,
records that should be in the Web result set are omitted simply because there was
insufficient metadata for them to be included. Google Scholar, by contrast, is
completely metadata driven: its metadata is all produced by publishers, libraries,
or library vendors. This enhances our confidence that date limits in Scholar have
a high degree of accuracy and completeness, unlike those in Google Web.
FIGURE 5.9. Date Sort and Searching Only within Abstracts of Articles.
FIGURE 5.10. Scholar’s Option to Search Only in Citations.
Create Alert
Alerts are an extremely useful feature, especially for researchers doing long-
term research on a topic. Undergraduate students with only a passing interest and
who are just writing a quick research paper may not need the power of alerts.
Scholar alerts, like alerts within the Google Web interface, monitor new content
added to the Scholar database. When new content is added that would be
retrieved by your keywords, you receive an e-mail notifying you of the
additions. Think of this as a clipping service or a monitoring service for new
scholarly content that is added to the Google Scholar mix.
CITATION-SPECIFIC LINKS
Other features within Google Scholar appear underneath the keyword excerpts
in the center column of the results. Let’s go over these features in turn. Figure
5.11 illustrates the topics we will cover: “cited by,” “related articles,” “all xx
versions,” “cite,” “save,” and “library search.”
“Cited By”
Because Scholar ingests not just metadata, but also full text, it is able to scan
all the footnotes and bibliographic references contained within scholarly articles.
In compiling these, a linkage system is created to all subsequent articles that cite
the article in question. Thus, Google Scholar not only tracks what is in each
scholarly article’s bibliography, but also the cross-linking between articles.
“Cited by” is powerful in scholarship because it shows the interaction of other
scholars over time. What it is incapable of showing, however, is whether the
subsequent citation is a positive one (agreeing with the cited author) or a
negative one (taking issue with the cited author), or simply a neutral reference.
This function is similar to what Web of Science and Scopus are capable of
doing, but for much less money (in fact, no money at all). It should be noted that
the citations are forward looking (“who cites whom”) and not backward looking
(“whom who cites”), as the expensive software packages noted earlier are
capable of doing.
FIGURE 5.11. Google Scholar Features underneath Citations.
Google, rather than limiting its citations to source lists, simply ingests all the
content it can. We can assume that Google has some criteria as to what journals
to include in Scholar and which sources to exclude. It’s just that Google isn’t
forthcoming with its criteria, so we are left to guess as to what they are doing. It
appears from what we see with Google Scholar that they include journals that,
for the most part, tend to be scholarly. Sometimes they let doctoral dissertations
into the mix, as well as newspaper content. They generally include university
institutional repository content, which can range widely from scholarly content
to dissertations and theses and even capstone projects. Some of these are not at the
same level of scholarship as Scopus or Web of Science would include, but these are
among the reasons for the generally higher citation numbers for Google Scholar
over the other two citation tracking systems.
“Related Articles”
Google Scholar doesn’t say much about how their related articles are
gathered. But it appears that it somehow looks for shared keywords and shared
relevancy when gathering the materials. If there is a sufficient number of
articles, clicking “Related Articles” will invariably show 101 results, with the
first result being the article in question and the remaining 100 articles being
somehow related. This feature actually seems to work extremely well. In fact, it
could be that the relevancy Scholar provides will exceed the relevancy that
proprietary library search tools are able to provide, making this an important step
in the research process.
“Web of Science”
The Web of Science link within Google Scholar will only be visible and will
only work in on-campus environments from universities that subscribe to Web of
Science. These links will not show up from off-campus locations. However, if
you are at a university that subscribes to Web of Science, you can get cited
references from it in addition to those supplied by Google.
FIGURE 5.12. Clustering of Google Scholar Records.
It needs to be noted that there are often significant differences between the
number of citations found by Web of Science and the number of citations found
by Google. This is because of differences in scope of coverage.
“Cite”/Bibliographic Citations
Because Google Scholar is metadata driven, bibliographic citations can easily
be provided, and that is exactly what Google Scholar does. Google provides
citations in several popular styles: Modern Language Association (MLA),
American Psychological Association (APA), Chicago, Harvard,
and Vancouver. It needs to be noted that the Chicago style provided by
Google Scholar is the older Documentation I, or Notes and Bibliography, style
often preferred by researchers in the humanities. Many social science researchers
will prefer Chicago Documentation II, the Chicago Author-Date style, but this
style is not provided by Google Scholar (see Figure 5.13).
Using the Google Scholar citation link provides a limited number of citation
styles, but by linking out to commercial citation managers, an unlimited number
of citation options is available. Researchers needing one of the hundreds of other
existing styles should use the links below the citation and export the citation to
one of the four citation management software programs that are supported
(BibTeX, EndNote, RefMan, or RefWorks). BibTeX works well with Mendeley,
which is a free download. RefMan, short for Reference Manager, is no longer
supported by Thomson Reuters, although the style is really RIS format, a format
developed by the company Research Information Systems, that nearly all
citation programs are capable of importing. Most universities support EndNote
and/or RefWorks through university-wide subscriptions that students and faculty
can access. If these are not available, then individual
subscriptions or software purchases can be made at the researcher’s expense. In
any event, once a citation is imported into one of these software packages,
alternative citation styles can be selected. In this way researchers needing
Chicago Documentation II or Turabian styles can be accommodated.
FIGURE 5.13. Citations in Google Scholar.
The actual citation should look like this (bold added for emphasis):
Shi, Xi, and Sarah Levy. “An Empirical Review of Library Discovery Services.” Journal of Service
Science and Management 8.5 (2015): 716.
Notice that the Google Scholar citation does not make reference to the
chapter, but to the edited book as a whole (Figure 5.14).
The reason for this is that the metadata is supplied from the Google Books
module, which does not contain metadata for individual chapters.
“Save” to My Library
Google Scholar’s “Save” function allows saving to your personal “My
Library” page. Users first must log in to their personal Google accounts (or
Google will prompt for a login). Then citations are saved and can be accessed by
clicking the “My Library” link, as seen in Figure 5.15.
The saved articles page can be used to push citations to bibliographic
management software in groups. Although EndNote is listed as one of the export
features that are available for saved articles, RefWorks is not listed. To get
citations into RefWorks, select RefMan as the format, and an RIS file will be
saved to your computer. This RIS file can be uploaded into RefWorks rather
easily.
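For readers curious about what such an RIS file contains, here is a minimal sketch in Python that writes the Shi and Levy citation shown earlier as an RIS record. The tags used (TY, AU, TI, JO, VL, IS, SP, PY, ER) are standard RIS tags, but the exact fields Google Scholar exports are not shown in this chapter, so the record and the helper name to_ris are illustrative assumptions.

```python
# Minimal sketch: build an RIS record by hand.
# This is illustrative and is not Google Scholar's exact output.


def to_ris(fields):
    """Turn a list of (tag, value) pairs into RIS text.

    Each RIS line is a two-character tag, two spaces, a hyphen,
    a space, and then the value.
    """
    lines = [f"{tag}  - {value}" for tag, value in fields]
    lines.append("ER  - ")  # ER marks the end of the record
    return "\n".join(lines) + "\n"


# Citation details taken from the example citation in this chapter.
record = to_ris([
    ("TY", "JOUR"),  # reference type: journal article
    ("AU", "Shi, Xi"),
    ("AU", "Levy, Sarah"),
    ("TI", "An Empirical Review of Library Discovery Services"),
    ("JO", "Journal of Service Science and Management"),
    ("VL", "8"),
    ("IS", "5"),
    ("SP", "716"),
    ("PY", "2015"),
])

print(record)
```

Saving text like this with a .ris extension produces the general kind of file that RefWorks and most other citation managers can import.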
FIGURE 5.14. Google Scholar Citation Failure for Book Chapters. Google and the Google logo are registered
trademarks of Google Inc., used with permission.
FIGURE 5.15. “My Library” Feature in Google Scholar. Google and the Google logo are registered trademarks of
Google Inc., used with permission.
“Library Search”
This link appears when more than one library has been set up for OpenURL linking
in the Google Settings options. Because up to five libraries can be specified, one
of the libraries will show up in the right margin, generally your primary library,
but the other libraries show up under “Library Search.”
You may want to change the search a bit by searching for synonyms, broader
terms, or narrower terms for each of these. For example, instead of Nigeria,
search the neighboring country Cameroon, or the broader term sub-Saharan
Africa, or simply Africa. Other synonyms you might consider: farming, labor,
and the British spelling labour.
Many times an instructor will not only require peer-reviewed journal articles,
but articles with a specific methodology. Methodologies may include qualitative
studies, qualitative methods, participant observation, focus groups, structured
interviews, field notes, reflexive journals, document analysis, or mixed methods.
If you are looking for a study with quantitative methods, put that in your search.
The reason this strategy usually works is that academic articles often clearly
state the research methods used either in the abstract or in the first several
paragraphs of the article.
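The synonym and methodology suggestions above can be combined into a single query string. The following Python sketch is illustrative only: the helper names, the grouping of terms, and the choice of the phrase qualitative methods are assumptions. It also assumes that Google treats terms joined by OR as alternatives and a quoted string as an exact phrase.

```python
from urllib.parse import urlencode


def or_group(terms):
    """Join alternative terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " OR ".join(quoted)


def scholar_url(query):
    """Build a Google Scholar search URL for a query string."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})


# Alternatives suggested in the text above.
places = ["Nigeria", "Cameroon", "sub-Saharan Africa", "Africa"]
topics = ["farming", "labor", "labour"]
method = '"qualitative methods"'  # one of the methodology phrases listed

query = " ".join([or_group(places), or_group(topics), method])
print(query)
print(scholar_url(query))
```

Swapping in a different methodology phrase, such as focus groups or participant observation, only requires changing the method string.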
6
Google Books
LIBRARY PROJECT
Partner Libraries
Over 40 libraries have contributed or are contributing to the ongoing Google
Books project. It all started with the University of Michigan and their initial
digitizing of library books (Band 2006), and then was also taken up by the
Committee on Institutional Cooperation (CIC), later named the Big Ten
Academic Alliance. Member libraries worldwide that contribute to Google
Books include, in alphabetical order, Bavarian State Library (Germany); Big Ten
Academic Alliance (formerly known as the CIC), consisting of Indiana
University, Michigan State University, Northwestern University, The Ohio State
University, Pennsylvania State University, Purdue University, University of
Chicago, University of Illinois, University of Iowa, and University of Minnesota,
as well as University of Michigan and University of Wisconsin-Madison, which
have separate agreements, as noted later; Columbia University; Cornell
University; Ghent University (Belgium); Harvard University; Keio University
(Japan); National Library of Catalonia (Spain); New York Public Library;
Oxford University; Princeton University; Stanford University; Universidad
Complutense of Madrid (Spain); University Library of Lausanne (Switzerland);
University of California (including the California Digital Library and ten
campuses: Berkeley, Davis, Irvine, LA, Merced, Riverside, San Diego, San
Francisco, Santa Barbara, Santa Cruz); University of Michigan (part of Big Ten
Academic Alliance, but with a separate Google agreement) (Baksik 2009;
Google 2016).
The CIC becomes important in this process because the HathiTrust grew out
of the Michigan/Google contract as an effort to provide a permanent platform for
books digitized by Google from member libraries. In many cases, the Google
scans in the HathiTrust have more liberal access than does Google Books. More
on that later.
Full View
Full view books are great when you can get them. Books published before
1923 are not subject to copyright restrictions. Other materials that may also be
available in full view include works where the author or publisher has authorized
it (although this is rare), works where copyright was never applied or has been
determined to have expired, and international documents, which are not under
copyright. It’s been estimated that less than 10 percent of the books are available
in full view (Chen 2012).
Some full view books have no e-book version available, and thus cannot be
printed or downloaded. Others, typically those published before 1923, can be
downloaded in PDF format, and thus can be printed. Because of
these variations in what is allowed and what is not allowed, the Google Books
experience will be quite mixed, even for full view books. It is best not to rely on
Google Books as the final destination for accessing or reading textual materials.
It’s best to try to find the book in a local library, either as a print book or as an e-
book. I call this final step, the step of actually getting a version of the book you
can work with, “fulfillment.” Google Books is excellent at discovery, but not so
good at fulfillment. For that, we still need the academic library.
Limited Preview
The most commonly encountered of the four views is the limited preview, at
least in terms of what academic users need in the course of their research. Most
major publishers, through the Partner Program, authorize the limited preview. In
theory, any part of the books may be exposed based on the relevance of the
keywords. But gaps or ellipses soon become evident as one tries to page through
the book.
In cases where the author or publisher has given Google permission, a limited
number of pages can be viewed as a preview. If a keyword search is carefully
framed, you may be able to view enough of the book to determine whether or not
it is highly relevant for your research purposes. Although you will not be able to
view enough of a work to use it, the discovery process exposes enough of the
book for you to know that you want it. You can then go through your normal
library procedures to get the book. This generally means 1) searching the library
catalog to see if the library owns or has access to the book in print or online
format; 2) requesting the book through your library’s interlibrary loan
procedures; or 3) consulting with a reference librarian or support staff to see if
there are other options. Sometimes works can be located through other means.
FIGURE 6.2. Limited Preview in Google Books. Google and the Google logo are registered trademarks of Google
Inc., used with permission.
No Preview
Just because some Google Books have no preview does not mean that they
have no value. Remember that the library online catalog is extremely weak and
that we need a way to discover full text in books, whether we can view the full
text through the discovery interface or not. No preview searching is like
searching blindfolded, but it is certainly better than no results at all.
That means that the second option is a better one for most researchers: 2)
Simply look up the title you found in Google Books in your local library online
catalog. This way you will see all formats available for a given title, whether in
print or online as an e-book.
1. Search Google Books: “Adelfa Callejo”. This yields a couple of limited view Google
Books results that lead you to want to get the books. The titles are Las Tejanas: 300 Years
of History and Texas Women: Their Histories, Their Lives.
2. Search your local library catalog. In the case of the first title, the University of Denver
Library had this title in print format. The second title was available online as an e-book.
The local catalog could not provide deep full-text access to book content
the way Google Books could. Once Google Books uncovered
additional helpful titles, however, it was not possible to use Google Books to
actually accomplish the research. By switching back to the local library
catalog with additional information in hand, you are able to get two
additional resources.
There are two ways to access Google Magazines within Google Books: by
browsing and by searching. To browse English language magazines, search
Google Web like this: google books magazines, or simply go here:
https://books.google.com/books/magazines/language/en. Note that the last two
letters of this URL are en for English. That should raise your curiosity. What if
we substituted es instead? You would then find Spanish language magazines in
the project. Try also fr for French. Sorry, I haven’t discovered any other
languages.
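The URL substitution just described can be sketched in a few lines of Python. The function name magazines_url is hypothetical, and only the en, es, and fr codes are mentioned in the text; other codes may return nothing.

```python
# Base address taken from the URL given in the text.
BASE = "https://books.google.com/books/magazines/language/"


def magazines_url(lang_code):
    """Return the magazine-browsing URL for a two-letter language code."""
    return BASE + lang_code.lower()


# The three language codes discussed above.
for code in ["en", "es", "fr"]:
    print(code, magazines_url(code))
```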
To search within Google Magazines, go to the Google Books advanced search
page; to find it, simply search Google for Google Books advanced search. There
you can search by keywords and limit by date. Because Google Books (and
Magazines along with it) is completely controlled by metadata (unlike much of
Google Web), you can be assured that your result set is extremely precise in
terms of date searching across the searchable content.
Although the e-book is free to download from Google Books, this work
is also in HathiTrust and Internet Archive. The HathiTrust version can only
be downloaded by HathiTrust member libraries, but the same edition in the
Internet Archive is free to be downloaded by anyone.
1. Click the “About This Book” link and you will see some available citation formats at the bottom of
the page. These are online export options in BibTeX, EndNote, or RefMan formats. It is very likely
that this roundabout way will not work for many users, so one of the other two approaches may prove
better.
2. Take the title of the book and search for it in Google Scholar. If you are lucky, the book will appear
also in Google Scholar with the cite button clearly visible.
3. Look for the “Find in a Library” button for the book you are in. The link may be hidden under the
“Get this book in print” option. Sometimes that button does not exist at all (in cases where an OCLC
control number is not available in the underlying metadata). But when you do see that link you will
be passed along to the OCLC WorldCat interface. You will then see the “cite/export” link in the
upper-right area of the resulting Web page. This option gives you the greatest choice of citation
output formats, including Chicago author/date style, and Turabian style.
REFERENCES
Baksik, Corinna. 2009. “Google Book Search Library Project.” In Encyclopedia of Library and Information
Sciences. DOI: 10.1081/E-ELIS3-120044502.
Band, Jonathan. 2006. “The Google Library Project: Both Sides of the Story.” Available at:
http://hdl.handle.net/2027/spo.5240451.0001.002. Accessed December 12, 2016.
Chen, Xiaotian. 2012. “Google Books and WorldCat: A Comparison of their Content.” Online Information
Review 36, no. 4: 507–516.
Google. 2016. “History of Google Books.” Available at:
https://books.google.com/intl/te/googlebooks/history.html. Accessed December 11, 2016.
Wu, Tim. 2015. “What Ever Happened to Google Books?” New Yorker, September 13, 2015. Available at:
http://www.newyorker.com/business/currency/what-ever-happened-to-google-books. Accessed
December 11, 2016.
7
Google as a Complement to Library Tools
Librarians often hear from students that their instructor said not to use Google in
their research. I believe that what is meant in most cases is that students
shouldn’t only use resources that they just found with simple Google searches.
Too many times students cite Wikipedia as an authority for their research. But
what this book is arguing is that there is a proper place for Google in academic
research. Google augments what libraries can offer with the expensive licensed
content. What is horrifying to professors, as well as reference librarians, is that
students sometimes use Google as the only resource for academic research. This
could be a carryover of bad habits from high school, a lack of critical inquiry
skills, or ignorance of what resources libraries have that are beyond the reach of
Google Web.
As powerful as Google is, there is much content that is beyond its reach. We
have already discussed these limitations: Google Web doesn’t crawl where it is
not wanted; many databases have no Google Web presence because their
technology doesn’t allow for it; Google is primarily a text-based resource;
numerical, sound, and video data and metadata are often impossible to contribute
to the giant Google index; and passwords and firewalls prevent Google from
grabbing proprietary content, of which there is an entire universe. Although
Google Web does a good job of exposing primary source content, especially
current content hosted on governmental and organizational Web sites, it is very
spotty in its ability to dig into content contained in the archives of libraries,
museums, and historical societies, much of which has not been digitized or even
indexed in online finding aids.
Google Scholar likewise has its limitations. Not only is the most current
content not immediately contributed to Scholar, but not every scholarly journal is
represented. Sometimes only metadata is present, but not full text. In addition,
Scholar tends to overlook nonscholarly resources such as newspapers—both
current and historic—popular magazines, and trade journals. Dissertations,
although they may be in Scholar because of their presence in institutional
repositories from various universities, constitute only a small subset of all
dissertations and theses completed over time.
Google Books has its limitations as well. Looking back in time, not every
book is owned by the scanning partner institutions. Some books, even if owned,
do not fit the Google Books specifications in terms of size—they may be large
folios or small pamphlets and thus not be eligible for scanning. Google Books
does not contain metadata for book chapters, making retrieval and citation a
challenge at times. And then there is “the gap”—the nearly century-long period
from 1923 to the present when freely available online book content is hard to
discover. This is the reason why so many digitization projects end at 1922.
It is at this point that researchers should turn to their academic libraries. To
understand what libraries have to offer, it is essential to know how libraries
organize information and make it available. We will look at the tools available in
the typical academic library and how to approach these tools in relation to our
Google search skills.
Under this model, users placed their search terms in the search box. When
they submitted the search a query was sent out to various information silos,
sometimes as many as 50 to 100 silos at once. Connection protocols then took
place, search terms were entered, and users waited—usually a very long time—
for results to return. Many times these searches would take over one minute,
with the search tools attempting to index these records “on the fly,” and to merge
and de-duplicate the records for presentation to the user. These endeavors were a
failure. In an age of immediate search results, with Google taking fractions of a
second, these tools would take 20 to 60 seconds or more. Users didn’t have the
patience for this.
Taking lessons from Google Scholar, vendors rethought the model of
information discovery and came up with tools they called “web-scale discovery,”
or simply “discovery” tools. Rather than linking out to numerous information
silos and waiting for results to return, vendors built a model much like Google
Scholar. They loaded metadata into a central server and indexed it. Some
vendors incorporated selected full text into the central server. But vendors would
have a much more difficult time acquiring full text from publishers than Google
Scholar, with all of its leveraging, was able to do. These discovery tools really
were the true federated search in that one single server was being searched,
eliminating the outside protocol connections and the wait time for return of
information and the tedious tasks of merging and de-duplicating the records. But
vendors had already used the term “federated search” for the failed broadcast
search tools. For that reason they had to brand the new tools as “discovery”
tools.
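The difference between the two models can be illustrated with a toy sketch in Python. Everything here is made up for illustration (the silos, the records, and the function names), and real products involve network protocols and far larger indexes. The point is only to show where the work happens: at search time in the broadcast model, ahead of time in the discovery model.

```python
# Toy contrast between broadcast (federated) search and a central discovery index.
# The silos, records, and function names are hypothetical.

SILOS = {
    "catalog": [{"id": "b1", "title": "Slavery in America"}],
    "articles": [
        {"id": "a1", "title": "Library discovery services"},
        {"id": "b1", "title": "Slavery in America"},  # duplicate held by two silos
    ],
    "news": [{"id": "n1", "title": "Discovery tools in libraries"}],
}


def matches(record, term):
    """True if the term is one of the words in the record's title."""
    return term.lower() in record["title"].lower().split()


def broadcast_search(term):
    """Query every silo at search time, then merge and de-duplicate."""
    seen, results = set(), []
    for records in SILOS.values():  # stands in for one connection per silo
        for record in records:
            if matches(record, term) and record["id"] not in seen:
                seen.add(record["id"])
                results.append(record)
    return results


def build_central_index():
    """Load all metadata once, de-duplicate ahead of time, and index by word."""
    by_id = {}
    for records in SILOS.values():
        for record in records:
            by_id.setdefault(record["id"], record)  # keep the first copy only
    index = {}
    for record in by_id.values():
        for word in record["title"].lower().split():
            index.setdefault(word, []).append(record)
    return index


INDEX = build_central_index()  # built before any user searches


def discovery_search(term):
    """Answer from the prebuilt index with a single lookup."""
    return INDEX.get(term.lower(), [])


print([r["id"] for r in broadcast_search("discovery")])
print([r["id"] for r in discovery_search("discovery")])
```

Both functions return the same records, but the broadcast version repeats the merging and de-duplicating on every search, while the discovery version does that work once when the index is built.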
The real power of these licensed (and expensive) discovery tools is in the fact
that the search takes place against a single “pot” of metadata. This pot is rather
broad in scope, encompassing metadata for books from the library catalog,
newspaper articles, magazine articles, dissertations and theses, scholarly articles,
videos, sound recordings, archival materials, and many other materials owned or
subscribed to by academic libraries. They do not search everything—only those
resources compliant enough to be able to have metadata contributed to the
respective discovery service. The strength is in the breadth of coverage, not the
depth of indexing.
In contrast, Google Scholar is narrow in breadth, searching across scholarly
journal articles from a variety of sources, as well as “bleed through” from
Google Books. The difference is that the depth of searching is nothing short of
transformationally impressive (see Figure 7.2).
As of the writing of this book, the biggest players in this market are ProQuest
with their Summon product, Ex Libris (now a ProQuest company) with Primo,
EBSCO’s EDS or EBSCO Discovery Service, and OCLC’s WorldCat Discovery.
These products all vary as to underlying architecture and capabilities.
TABLE 7.1. The Information Access Anomaly: Assumption of 400 Words per Page (WritersServices 2001).
Typical length full text (FT): 200 pages × 400 = 80,000 words | 15 pages × 400 = 6,000 words
Surrogate record (catalog or index metadata) (SR): 50–100 words (75 word avg) | 300–500 words (400 word avg)
SR to FT ratio: 1 to 1,067 | 1 to 15 | 1 to 1
Source for assumption of 400 words per page is WritersServices 2001
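The ratios in the table follow from dividing the full-text length by the average surrogate-record length. A short Python sketch of that arithmetic is given below; the function name is hypothetical, and the inputs are the page counts and averages stated in the table.

```python
WORDS_PER_PAGE = 400  # assumption stated in the table caption


def sr_to_ft_ratio(pages, surrogate_words):
    """Return how many full-text words stand behind each surrogate-record word."""
    full_text_words = pages * WORDS_PER_PAGE
    return full_text_words / surrogate_words


book = sr_to_ft_ratio(pages=200, surrogate_words=75)     # 80,000 / 75
article = sr_to_ft_ratio(pages=15, surrogate_words=400)  # 6,000 / 400

print(f"200-page work: 1 to {book:,.0f}")
print(f"15-page work: 1 to {article:,.0f}")
```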
Archival Materials
Archives and libraries are cousins in the cultural heritage world—a world that
includes archives, museums, and libraries. Whereas libraries tend to focus on
published materials, of which multiple copies of works such as books exist in
multiple libraries, archival collections tend to be unique holdings of letters,
papers, minutes, photographs, transactions, and even objects.
Recently there has been a movement among archives to place finding aids to
their collections on the Web and even to digitize content and make it available
through a library Web site or an institutional repository. In addition library
vendors have been tripping over each other to get this unique content digitized
and sold through their platforms to libraries. Among such vendors are Gale,
ProQuest, Adam Matthew, Newsbank, East View Information, and Alexander
Street. Although it could be argued that content libraries publish on their own
through institutional repositories is discoverable with Google Web, content
through the vendors is not. This content is not inexpensive. Libraries often have
the option of either subscribing to this content or purchasing it. This archival
content would not generally be discoverable by any of the “three Googles.”
Content of these vendor-sourced collections is sometimes arranged topically
(slavery, civil rights, fashion, etc.), sometimes chronologically, and sometimes
biographically. Content that for decades required a trip to a university archive,
with all the plane fares, hotel nights, and letters of introduction, now can take
place digitally. Of course, not every library is able to afford these expensive
subscriptions, so in some cases travel is still necessary if one’s own research
library is not able to afford these subscriptions. The point is that Google is not
able to drill into these proprietary subscription databases.
On just the topic of slavery, vendors today offer these collections, among many
others: American Slavery Collection, 1820–1922 (Newsbank); Slavery and
Anti-Slavery: A Transnational Archive (Gale); Slavery, Abolition, and Social
Justice (Adam Matthew); Sources in U.S. History Online: Slavery in America
(Gale); Black Abolitionist Papers (ProQuest); and Slavery in America and the
World: History, Culture & Law (Hein).
1. Library databases generally add current content faster than does Google Scholar.
2. Often individual databases offer controlled vocabularies. Although imperfectly applied in some cases,
controlled vocabularies (constructed and applied using thesauri) are an efficient way to get to “all and
only” the desired results.
3. Databases that contain scholarly articles often also contain related kinds of publications like book
chapters, dissertations or theses, conference proceedings, and book reviews, which may not be
covered adequately or at all in Scholar.
4. Proximity searching in library databases is often possible across full-text content. This usually allows
for a greater degree of precision, a feature often necessary for advanced research projects.
Not counting the library discovery tools already discussed earlier in this
chapter, scholarly articles are generally featured in two kinds of databases:
publisher portals and aggregator databases.
Publisher Portals
Most journal publishers view the world as starting from themselves and
produce search interfaces that search only their content. These can be very
powerful, but also very limiting, in that only their publications are featured. If
libraries followed vendors’ advice, users would experience even more frustration
than they already experience, because they would need to search several dozen
interfaces, each with their own idiosyncrasies, in order to do comprehensive
research.
Many research libraries will have publisher databases from Elsevier
(ScienceDirect), Wiley, Springer, Sage, Taylor & Francis, and others. Partner
projects like JSTOR, although not directly publisher generated, work with
publishers as a hosting venue for older scholarly content. Many university
presses also have their own publisher portals, such as Cambridge University
Press and Oxford University Press.
Aggregator Databases
Aggregator databases operate under license agreements with publishers,
sometimes indexing and pointing to publisher content via OpenURL technology,
but more often have rights to host PDF content from publishers on the
aggregator site itself. The advantage of this is that the scholarly publications of
many publishers can be searched at one time. In many cases the PDF publisher
content has been OCR-ed and thus can be searched in full text through the
aggregator interface.
Examples of these large aggregator products include EBSCO’s Academic
Search and Business Source (each with varying numbers of titles available),
ProQuest’s product ProQuest Central, and Gale’s Academic OneFile. In Chapter
1 we discussed the “back-generated” subjects created by aggregators, with the
strengths and weaknesses of this methodology. Nevertheless, these resources
should not be overlooked in the pursuit for scholarly journal articles.
REFERENCE
WritersServices. 2001. “Matching Word Count to Page Size.” Available at:
http://www.writersservices.com/writersservices-self-publishing/word-count-page. Accessed December
8, 2016.
8
Academic Research Hacks
Google is known for its “hacks,” or clever shortcuts for quick access to
information. To prove the point, try these searches in Google Web: tip
calculator, set timer for 15 minutes, songs by a-ha, books by dan brown, flight
UA815, and time in tokyo. The list of these “ready reference” hacks seems
endless. There are even not-so-academic Google tricks, such as searching for
google mirror, do a barrel roll, and google guitar—just a few of the many
pastimes that can distract you from your research. In fact, if you use Google to
search google hacks you can see many lists people have compiled of useful
things Google will do for you. We have already covered the basic research hacks
needed for academic research such as site-specific searching and file type
searching in Google Web. But there are other hacks that may help researchers in
specific situations.
FIGURE 8.1. Google and Google Scholar as Seen in Japan. Google and the Google logo are registered trademarks
of Google Inc., used with permission.
FIGURE 8.2. Google Scholar Search Results from Japan. Google and the Google logo are registered trademarks of
Google Inc., used with permission.
We’ve discussed how to navigate Google when you are visiting or residing in
a foreign country. But what if you are in the United States and you want to
search the Google interface of another country, either because you are from that
country and are accustomed to that Google experience or because you just want
to easily see news and events from that country?
FIGURE 8.3. Accessing English-Language Google Scholar from Japan with NCR Fix. Google and the
Google logo are registered trademarks of Google Inc., used with permission.
FIGURE 8.4. Accessing Google Web from Hungary with NCR Fix. Google and the Google logo are registered
trademarks of Google Inc., used with permission.
The easiest way to do this is to use Google Chrome as your Internet browser
and then download the NoCountryRedirect (NCR) add-in. To get to this, just use
Google to search for NoCountryRedirect (NCR) Chrome Web store. To get a list
of all Google domains available, search List of Google domains, then click on
the Wikipedia link. This will give you all the information you need to configure
the Google Chrome NCR add-in that you install. The add-in solves the “other
country” problems in both directions: it allows for searching U.S. Google while
in a foreign country, and it allows for searching a foreign Google interface while
in the United States. Google Scholar, when searched from Hungary, appears as
seen in Figure 8.4.
Here is how it works in another context. Let’s say you want British news, and
thus want the google.co.uk Google experience to be the default search engine.
Follow these steps.
1. Configure the Google NCR add-in to point to co.uk (see Figure 8.5):
FIGURE 8.5. Using the Chrome NCR Redirect Add-in. Google and the Google logo are registered trademarks of
Google Inc., used with permission.
2. Go to Google—you likely will be initially directed to Google.com (in the United States).
3. Click the Google NCR add-in button, and select “Open local ‘co.uk’ version” (see Figure 8.6).
FIGURE 8.6. Using the NCR Redirect Add-in to Search a Different Local Version. Google and the Google
logo are registered trademarks of Google Inc., used with permission.
Now when you type “news” into the UK Google, you will get UK news at the
top (see Figure 8.7).
This should work for any of the countries featured on the Wikipedia “List of
Google Domains” page.
There are countless applications in humanities, social sciences, and the history
of science. Try these interesting searches: Jane Austen, William Blake, Geoffrey
Chaucer, Charles Dickens; telegraph, radio, television, Internet; railroad,
carriage, automobile, aeroplane, airplane; football, baseball, basketball,
hockey; George Washington, John Adams, James Madison, Benjamin Franklin,
Thomas Jefferson, Abraham Lincoln.
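If you want to share or bookmark one of these comparisons, a link can be built from the list of terms. The Python sketch below assumes that the Ngram Viewer reads its comma-separated terms from a content parameter in the URL, as its own result-page addresses appear to do. The function name ngram_url is hypothetical.

```python
from urllib.parse import urlencode

NGRAM_BASE = "https://books.google.com/ngrams/graph"


def ngram_url(terms):
    """Build an Ngram Viewer URL that compares the given terms."""
    # Terms are joined with commas, then URL-encoded (a comma becomes %2C).
    return NGRAM_BASE + "?" + urlencode({"content": ",".join(terms)})


print(ngram_url(["railroad", "carriage", "automobile", "aeroplane", "airplane"]))
print(ngram_url(["football", "baseball", "basketball", "hockey"]))
```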
FIGURE 8.8. Google Books Ngram Viewer Showing Popularity of Various Foods. Google and the Google
logo are registered trademarks of Google Inc., used with permission.
IMAGE SEARCHING
Google has very powerful image-searching capabilities. The problem is that
for academic purposes, we need to get permission before just appropriating other
people’s creations. We’ll get to that soon. But first, let’s just focus on searching.
We all know how to search for images. But how do we use this powerful feature
to assist us in our academic work?
Searching by Image
Searching can also be done by image. The image may be something you see
on the Internet, a computer image file on your computer, or a print photograph
that you digitize with a digital camera.
You can either paste an image URL or upload an image. If you paste an image
URL, make certain that it is the URL directly to the image, not the Web page on
which the image lives. In Figure 8.9 we see a publisher’s book catalog with an
attractive photo on the front cover. Inside there is no attribution telling us where
this photo was taken. By using screen capture software (I used Jing from
Techsmith) to grab just the photo of the library, saving the image to my
computer, and then uploading the image into Google Image Search, we are able
to have Google make a “best guess” as to where this image is from. This method
works well for places, especially famous places, and doesn’t really work at all
for people.
GOOGLE TRANSLATE
Google Translate (https://translate.google.com/) is a very powerful feature.
Much research has been done to evaluate the accuracy of Google’s translation
ability. One study concluded that although Google Translate is far from error
free, it approaches the minimum standards necessary for university admissions at
many institutions (Groves and Mundt 2015).
The intuitive interface has an auto-detect feature. It usually correctly guesses
the language you pasted into the search box. Currently Google Translate can
translate to or from 104 languages. This can be extremely useful for rendering
foreign phrases encountered in books and articles. Care should be taken when
attempting to translate technical reports, legal documents, or medical diagnoses,
as less-than-accurate results may occur.
Google Translate, rather than serving as an absolute authoritative translation
service, provides a relatively helpful service and deserves its place among the
many helpful tools Google provides the academic researcher.
Using Google Translate to render this into Arabic and then back again to
English gives us this:
This is just another example of why the holidays here at the White House is very special.
Last week, he pardoned a turkey. [Laughter] tonight we feast lighting National Christmas
Tree. This one is easier because the tree does not move. Do not swallow it. [Laughter] You
just push a button, and it electrified, and that’s exactly what you do not want to happen in
the amnesty Turkey. [Laughter] I thought that was funny, Michelle. [Laughter] Thankfully,
the two events had gone without a hitch. [Bold added]
LEGAL CASES
Google Scholar incorporates many U.S. legal cases. Although no law
school student would be advised to rely on Scholar for getting through law
school, and no attorney would use this for billable hours, it can be useful to the
lay student who just needs to “pull a case” for quick reference. People outside
the legal community do not generally have access (unless they are willing to pay
a lot of money) to prime legal resources like Lexis, Westlaw, or Bloomberg Law.
These databases are given to law school students in hopes of addicting them to
their products. But they do not appear on database lists for general academic
libraries. Instead, academic libraries may subscribe to the greatly pared down
versions: LexisNexis Academic and Westlaw Campus.
Google Scholar’s case law can generally be viewed as a massive experiment.
It contains citations to the cases in question by other cases, without the added,
and necessary, indication if the law is “good law”—still in force—or not. For
that, researchers are best served by using the full Lexis and Westlaw services.
Scholar can be configured to feature courts in a given state. Using the “Select
courts …” link in the left margin of Scholar, you can select which courts, state
and federal, you want to see in the result set. Google already knows what state
you are in, so it will feature that state court more highly in the result list.
A nice feature incorporated into the federal and state cases in Google Scholar
is “star pagination.” Because the text shown in Scholar is displayed in HTML
format and not PDF format, page numbers are not obvious. Scholars need to cite
to specific page numbers when citing cases, either in legal publications or other
kinds of social science scholarship. Page numbers in Scholar are denoted with an
asterisk or star so that researchers can tell exactly where a new page begins. This
has long been the practice of licensed online legal databases, but now has been
incorporated into Google Scholar as well.
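To make the star pagination idea concrete, here is a minimal sketch of how case text could be split into pages once it has been copied out of Scholar. It assumes, purely for illustration, that each page break appears inline as an asterisk followed by the page number (for example *348); the function names and the sample text are hypothetical and are not part of any Google tool.

```python
import re

# Assumption: a page break is marked inline as "*" plus digits, e.g. "*348".
STAR_PAGE = re.compile(r"\*(\d+)")


def split_by_star_pages(text, first_page=None):
    """Return (page_number, passage) pairs for text with star-page markers.

    Text before the first marker is assigned to first_page (None if unknown).
    """
    pages = []
    current = first_page
    position = 0
    for match in STAR_PAGE.finditer(text):
        passage = text[position:match.start()].strip()
        if passage:
            pages.append((current, passage))
        current = int(match.group(1))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        pages.append((current, tail))
    return pages


def page_of(text, phrase, first_page=None):
    """Return the page on which phrase occurs, or None if it is not found.

    Phrases that straddle a page break are not matched in this sketch.
    """
    for page, passage in split_by_star_pages(text, first_page):
        if phrase in passage:
            return page
    return None


# Hypothetical sample text, not taken from a real opinion.
sample = ("The court held that the statute applied. *348 The dissent "
          "disagreed. *349 Judgment affirmed.")
print(page_of(sample, "dissent", first_page=347))  # prints 348
```

A pinpoint citation can then be checked by looking up the quoted phrase and confirming the page it falls on.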
One hint on legal research deserves mention here. Many times researchers
only want a legal perspective on an issue, that is, a perspective from a law
review or law journal. HeinOnline is fully indexed in Google Scholar and is a
resource that contains the full text of most law reviews. Framing a search that
contains heinonline as one of the keywords will restrict the results to all and only
the content in HeinOnline, a very powerful limiter.
SEARCHING PATENTS
A patent is a protection on an invention so that the inventor can prevent others
from making, using, or selling the invention. So important are patents that they
are part of the U.S. Constitution. Congress is empowered “To promote the
Progress of Science and useful Arts, by securing for limited Times to Authors
and Inventors the exclusive Right to their respective Writings and Discoveries”
[Constitution, Article I, Section 8]. Patents are interesting from a research
perspective because they are simultaneously legal documents and engineering
documents. A professional patent researcher will have a strong background in
law as well as in engineering.
The problem for the researcher is that sometimes patents are written with
obfuscation in mind. A “hide the ball” attitude is sometimes built into patent
applications so that others will have a more difficult time uncovering them.
Combine this with one of the most detailed and complicated classification
systems imaginable, and you have a real challenge. This section on patents is not
intended as a full legal or technical discussion, but merely as a brief introduction
to a colorful part of the history of technology as it relates to general research.
Patents are one of the three areas of intellectual property, alongside
trademarks and copyrights. Copyrights are handled by the Library of
Congress, whereas patents and trademarks are the purview of the U.S. Patent and
Trademark Office (PTO). Patents from 1976 onward are fully searchable through
the U.S. PTO databases, including the full text of these patents. For earlier
patents, from 1790 to 1975, although the full images had been digitized
by the U.S. PTO, the only intellectual access to them is by patent number or
classification code, not a very user-friendly way to search.
But then along comes Google, taking the images that were already in the
public domain and performing its own OCR on them. Now patents from 1790
to present are all full-text searchable. The OCR is far from perfect, but Google
does seem to correct the OCR text for patents that are often accessed.
TABLE 8.2. World Patent Office Content in Google Patents (as of December 16, 2016).
Code  Patent Office                                    Patents Granted  Patent Applications
JP    Japan                                            7,898,919        18,179,932
CN    China                                            8,990,822        6,911,241
US    United States                                    10,343,415       4,737,851
EP    European Patent Office                           1,500,257        4,206,017
WO    World Intellectual Property Organization (WIPO)  0                3,644,348
DE    Germany                                          4,678,577        2,816,209
GB    Great Britain                                    1,271,621        2,380,956
KR    South Korea                                      1,997,371        2,236,991
FR    France                                           2,065,869        1,034,096
CA    Canada                                           2,125,610        933,663
ES    Spain                                            803,352          617,339
RU    Russia                                           848,083          494,766
NL    Netherlands                                      206,706          434,920
FI    Finland                                          220,993          367,188
DK    Denmark                                          446,203          113,578
LU    Luxembourg                                       953              62,633
BE    Belgium                                          586,727          0
Source: https://patents.google.com/
REFERENCES
Groves, Michael, and Klaus Mundt. 2015. “Friend or Foe? Google Translate in Language for Academic
Purposes.” English for Specific Purposes 37: 112–121.
Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P.
Pickett, … Erez Lieberman Aiden. 2011. “Quantitative Analysis of Culture Using Millions of Digitized
Books.” Science 331, no. 6014: 176–182.
9
Case Studies in Academic Research
This chapter attempts to put into practice all the Google research skills
mentioned in the book so far. This is done with case studies of actual research
topics I have encountered in the course of research consultations. In each case
study I pose a question; then we explore how Google Web, Google
Scholar, and Google Books can be used to help answer the question. But we
don’t stop there. Google cannot do everything. We look to licensed, specialized
library resources typically found in academic libraries to see how we can go
beyond Google for further assistance.
Google Scholar
We will use Google Scholar to find scholarly, peer-reviewed articles. A search
for “human trafficking” “sex trafficking” yielded about 9,380 results. Limiting
to articles since 2010 cut the results down to 6,940. Because labor trafficking is
often linked to sex trafficking, we can add to the search terms the idea of labor to
further restrict the results: “human trafficking” “sex trafficking” labor. A
relevant article appears to be:
Alvarez, M. B., & Alessi, E. J. (2012). Human Trafficking Is More than Sex Trafficking and
Prostitution: Implications for Social Work. Affilia, 27(2), 142–152. [citation derived by clicking “cite”
link within Google Scholar].
[Click the “Cited by” link under article listing.] This article is cited by 43 subsequent sources. We
examine these citations.
[Click the “Related articles” button under article listing.] Also note that the 100 related articles reveal a
few more gems.
Google Web
Now on to primary sources. We search Google Web for “human trafficking”
“sex trafficking” labor and retrieve about 403,000 results. Looking through the
first few pages of results, we try to see what agencies are concerned about this
issue, looking especially for international organizations and U.S. government
agencies. We make a list of nongovernmental organizations (NGOs),
intergovernmental organizations (IGOs), and U.S. government agencies. Not
seeing enough UN resources toward the top, we amend the search to: “human
trafficking” “sex trafficking” labor united nations. Table 9.1 shows some of the
stakeholder agencies arranged by category.
Now that we have a tentative working list of “who cares” about this topic, we
need to get to work with site-specific searching. Searching site:unodc.org sex
trafficking yields 2,290 results. Many of these results are HTML pages, and
because substantive materials very often appear in PDF format, we limit the
search by PDF file type: site:unodc.org sex trafficking
filetype:pdf. We keep doing this kind of hunt for primary sources from
authoritative agencies, carrying it over to U.S. agencies, like this: site:state.gov
sex trafficking filetype:pdf. This will take some time with so many agencies to
investigate.
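Because the same pattern is repeated for every agency on the list, the query strings can be generated rather than typed. The following is a minimal sketch, not a Google API: the function names are hypothetical, the domain list is only an example drawn from the searches above, and the only assumption about Google itself is that a search can be opened from a URL carrying the query in the q parameter.

```python
from urllib.parse import urlencode


def site_query(domain, terms, filetype=None):
    """Build a site-specific query string, optionally limited to a file type."""
    parts = [f"site:{domain}", terms]
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a Google Web search URL (q parameter only)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Example stakeholder domains; extend the list with the agencies you identify.
for domain in ["unodc.org", "state.gov"]:
    query = site_query(domain, "sex trafficking", filetype="pdf")
    print(query)
    print(search_url(query))
```

Pasting each printed URL into a browser runs the corresponding search, which keeps the hunt consistent across a long list of agencies.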
Google Books
We saw a few book results in Google Scholar, but they were buried down
around the 25th result. Searching Google Books will focus things a bit more on
book content. Doing the same search “human trafficking” “sex trafficking”
labor in Google Books yields 6,360 results. One particular title jumped out: Sex
Trafficking: Inside the Business of Modern Slavery. This is a limited preview
book, but clicking some of the markers on the right gave us enough indication
that this book will be useful (see Figure 9.1).
TABLE 9.1. Finding Web Domains from Entities That Care about the Topic.
FIGURE 9.1. Google Books Highlights Page Hits. Google and the Google logo are registered trademarks of Google
Inc., used with permission.
We then search that book title in the University of Denver library catalog and
see that the library has both a print version and an online version of the book.
Note that the book was discovered using Google Books, something that could
not be adequately done using just the library catalog, but eventually this led us
back to the library catalog for “fulfillment.”
Google Web
Let’s start with Google Web to try to find some primary sources. First we need
to find the top-level domain (TLD) for Tanzania and South Africa. By typing tld
into Google Web, then clicking on the “List of Internet top-level domains—
Wikipedia” link, we discover that the TLD for Tanzania is .tz and the TLD for
South Africa is .za.
Starting with Tanzania, let’s search like this: site:tz religious schools. We
really don’t intend to go to any of the results we see here. We just want to see if
we can identify Internet domains coming from the Tanzanian government to try
to get an official perspective on the issue. One of the top results is a Web site
from the domain .go.tz, obviously a government site. Now we can amend our
Google search to restrict it to only government sites and perhaps also to get
down to the funding level. We frame our search like this: site:go.tz religious
schools funding. We further want to restrict results to more substantive things, so
we restrict results to PDF format: site:go.tz religious schools funding
filetype:pdf. Hopefully reading through some of these documents will point us in
the right direction.
Next, let’s do the same thing for South Africa, beginning with the search:
site:za religious schools. We had to look a bit further down the page to discover
that the government domain for South Africa is .gov.za. We can thus amend our
search to: site:gov.za religious schools. Again, to restrict to more important
reports, amend the search to: site:gov.za religious schools filetype:pdf.
We should also search the respective government domains for information
about the budget or finance ministries. Searches like site:go.tz budget and
site:gov.za finance ministry will be helpful in this regard.
Even though we have searched for primary sources from the governments of
Tanzania and South Africa on the topic, there may be international bodies that
have some discussions. Because many international bodies use .org as their
TLD, let’s do an initial search like this: “religious schools” funding “south
africa” government policy site:org. Ignoring the Wikipedia entries at the top of
the result list and looking further down, we see results from .worldbank.org and
.unesco.org. Although there are likely other organizations we should call
attention to, let’s just go with these two for this case study. We next search:
“religious schools” funding “south africa” government policy
site:worldbank.org. After optionally limiting to filetype:pdf and examining these
results, we can then search for: “religious schools” funding “south africa”
government policy site:unesco.org. Some international organizations use an .int
Internet domain, so it would be advisable to search like this as well: “religious
schools” funding “south africa” government policy site:int. Then, of course, do
the same thing all over again for Tanzania.
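Repeating the search for each country and each organizational domain is mechanical, so it can help to lay out the full set of queries in advance. The sketch below is only an illustration: the function name and its default arguments are hypothetical, and the rule of quoting multiword country names simply mirrors the phrase searching used above.

```python
from itertools import product


def org_queries(countries, domains,
                topic='"religious schools" funding',
                suffix="government policy"):
    """Return one query per (country, domain) pair.

    Multiword country names are wrapped in quotes to force the phrase.
    """
    queries = []
    for country, domain in product(countries, domains):
        place = f'"{country}"' if " " in country else country
        queries.append(f"{topic} {place} {suffix} site:{domain}")
    return queries


for query in org_queries(["south africa", "tanzania"],
                         ["worldbank.org", "unesco.org", "int"]):
    print(query)
```

Working from a printed list like this makes it harder to forget a combination, such as the .int search for Tanzania.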
Google Scholar
Let’s now shift from primary sources from the two African governments to
scholarly secondary sources by searching Google Scholar.
Although the search should ultimately be done for Tanzania and South Africa
separately, let’s first search for articles that might deal with both countries at the
same time by searching Scholar like this: religious schools funding tanzania
“south africa”. We place South Africa in quotes because we know that we can
force the phrase—in this case, the invariant name of a country. We can restrict
the results to more recent dates, say from 2010 to the present, by placing 2010 in
the first box of “Custom Range,” and leaving the second box blank. Not seeing
any obvious titles in the first page of the result set, we can opt to place religious
schools in quotes to force a greater degree of relevancy: “religious schools”
funding tanzania “south africa”.
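The same search, with its date restriction, can also be captured as a link for later reuse. This is a hedged sketch rather than a documented interface: it assumes that Scholar result URLs carry the query in q and the custom range in as_ylo (first year) and as_yhi (last year), as can be observed in the browser address bar after setting a custom range. The function name is hypothetical.

```python
from urllib.parse import urlencode


def scholar_url(query, year_from=None, year_to=None):
    """Build a Google Scholar search URL with an optional year range.

    Assumption: as_ylo and as_yhi are the lower and upper year limits seen in
    Scholar result URLs; leaving one out leaves that end of the range open.
    """
    params = {"q": query}
    if year_from is not None:
        params["as_ylo"] = year_from
    if year_to is not None:
        params["as_yhi"] = year_to
    return "https://scholar.google.com/scholar?" + urlencode(params)


print(scholar_url('"religious schools" funding tanzania "south africa"',
                  year_from=2010))
```

Saving such a link alongside your notes records exactly which phrases were quoted and which date limit was applied.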
Let’s shift the focus to just Tanzania and amend the search just to articles that
speak about government policy: “religious schools” funding tanzania
government policy. We can then amend the search to articles from 2012 onward.
In the result list we see an article that seems to be on point: “ ‘Affordable’
Private Schools in South Africa. Affordable for Whom?” We notice that it is
available through our university, so we know we can access the full text of the
article. But while we are in Google Scholar, let’s save the citation by clicking the
Cite button. If the professor requires APA style, then the citation would look like
this:
Languille, S. (2016). ‘Affordable’ Private Schools in South Africa. Affordable for Whom? Oxford
Review of Education, 42(5), 528–542.
Google Books
Let’s now move on to Google Books. By searching “religious schools”
funding “south africa” government policy in Google Books, attention is drawn to
Law and Religion in Africa: The Quest for the Common Good in Pluralistic
Societies. The book retrieved seems to have some lengthy discussions in the
context of South Africa. Because we cannot read much of the book in Google
Books, we need to find it in our local academic library or pursue borrowing it
from another library. But first we need a citation to the book so that we can
include it in the bibliography. Depending on the way Google Books features a
book, there may be several ways to get the citation. The first way is to see if
Google Books has a “Find in a library” link somewhere on the page. In this case,
there is such a link. This links us over to Worldcat.org. At the top right of the
Worldcat.org page is a “cite/export” link. From this link we can get the citation
styles we need. A second way to get a citation is to search for the book title in
Google Scholar. In this case, however, that strategy did not work.
Google Web
Let’s start first with Google Web. We can start with a very direct search: tango
houses paris 1920s. We find several results from BBC News and other sites that look
excellent but likely are not of academic quality. We need to formulate a search
that accounts for hidden Internet content
—content that would not have enough direct metadata exposed to Google.
Perhaps we can find a searchable database that specializes in fashion images by
searching: fashion image database.
An initial search for sites located in France did not appear to be directly
fruitful: historic fashion paris site:fr. However, we note that one of the results is
from Palais Galliera, Museum of Fashion, apparently a fashion museum in Paris
with an Internet domain of parismusees.paris.fr. This paris.fr domain may help
us to narrow things down. Many French Web sites only have content in French.
For this reason we can use Google Translate to help us frame a search in French.
When placing paris tango fashion history into Google Translate, we get “Paris
tango histoire de la mode.” Back at Google Web we can now search like this:
site:paris.fr Paris tango histoire de la mode. Among the results is a Web domain
parismuseescollections.paris.fr that has sketches of tango fashions.
Google Scholar
Now on to Google Scholar. Doing the search: paris tango houses fashion
history brings up a typical mixture of scholarly articles and book content
bleeding through from Google Books. An article that seems relevant is titled
“Globalization and the Tango.” In addition to getting the citation by clicking the
“cite” button in Scholar, we can click the “cited by” button to find other scholars
that cite this article in subsequent publications. In addition the “related articles”
button will point to 100 other items that may assist in this question.
Google Books
Google Books has some results when searching: paris tango houses fashion
history. A couple of reference works come up in the search results: Historical
Dictionary of the Fashion Industry and The Greenwood Encyclopedia of
Clothing through World History: 1801 to the Present. We should know by now
that by checking our local library catalog, we can possibly get the full text, either
in print or in electronic format.
But now let’s do something different with Google Books. With our results
from the previous search on the screen, click the “search tools” link in the
Google Books navigation menu. A new menu now appears with the options:
“Any books; Any document; Any time; and Sorted by.” If you select the “Any
document” pull-down menu, you will see selections possible for “magazines.”
When we try our search under the “magazines” limit we see some results. Now
we can adjust the time frame using the “Any time” pull-down feature.
Unfortunately this did not produce any results. So we need to rethink our
strategy. Instead of restricting to magazines with a limit to the 1920s, let’s focus
on books with a limit of dates between 1920 and 1930. This time we get some
results from the time period we are interested in examining. Most of these will
be after the 1923 international copyright cutoff date, so their content, including
images, could still be under copyright restrictions. However, we can see if our
academic library has access to these in print or online somehow.
Google Web
You didn’t say if the data you are interested in is for the United States, for
another country, or internationally. So I will address all of these areas with some
strategies. Let’s begin by using Google Web to find some primary sources, both
research reports and statistical sources. To find out what U.S. agencies are
interested in this, do a general search like this: site:gov geothermal heat pumps. I
note that some prominent U.S. agencies are energy.gov, nrel.gov, energystar.gov,
epa.gov, and pnnl.gov. There are others, but we’ll start with these. Now take
each of these domains in turn and drill down with more focused Google Web
searching: site:energy.gov geothermal heat pumps statistics; site:nrel.gov
geothermal heat pumps data; site:nrel.gov geothermal heat pumps filetype:pdf;
and site:gov geothermal heat pumps. Notice that I searched by closely related
terms (data vs. statistics) and that I sometimes limited to the PDF file type to
discover more substantive materials.
If you wanted to do the same kind of digging for international organizations,
first find the stakeholder entities by searching like this: site:org geothermal heat
pumps statistics international. Relevant domains from this search appear to be
iea.org; worldenergy.org; iea-gia.org; and unep.org. Now do the same kind of
searching as we did for the .gov sites. These results should uncover major
research reports, as well as statistical and data sources from primary
stakeholders.
Google Scholar
Now let’s try to find scholarly secondary sources using Google Scholar. Not
only do we want to find articles on the topic, we also want to discover data
sources that were used in the research to see if these might be useful to us as
well. Searching Scholar like this: geothermal heat pumps statistics “united
states” produces what appears to be a highly relevant article, “Analysis of
renewable energy development to power generation in the United States.”
Clicking the “cited by” link takes us to subsequent articles that have interacted
with this content. Then clicking “related articles” gives us 100 additional
citations that may share relevant keywords. Don’t forget to use the “cite” feature
to save your citations in the desired citation format.
Google Books
Moving to Google Books, we can initially search the same way: geothermal
heat pumps statistics “united states”. One of the prominent results is the Census
Bureau’s Statistical Abstract of the United States. Rather than struggling to use
the Google Books version of the Statistical Abstract, we can go to Google Web
and find this publication on the Census Bureau Web site. It should be noted,
however, that the Census Bureau ceased publication of this in 2012. We will
need the library resources (see next) to get more recent data.
10
Searching for Statistics
Many librarians are deathly afraid of reference questions involving statistics. Yet
statistics are one of the most common categories of questions that users ask at
reference desks. This is very likely because of the numerical nature of the
results, the fact that the answers are very often in hidden Internet databases far
beyond the direct reach of Google, and the indirect search strategies that may
need to be employed to uncover them. This chapter will show how Google can
assist in alleviating this fear and actually work toward unearthing the desired
statistics. Because statistics involves numerical data, and because search engines
are strong at searching and delivering textual data, we have a problem: How can
we find meaningful statistics? How can we frame a search that works?
site:bjs.gov     Data on noncitizens in state or federal prisons or jails and on persons held by Immigration and Customs Enforcement (ICE)
site:gao.gov     PDF reports on “criminal alien” statistics and other aspects of immigration over time
site:census.gov  Publications on “foreign born” and the American FactFinder database with statistics from the Decennial Census and the American Community Survey
Conclusion
Some readers may have the impression that this book, written by a reference
librarian, is implying that Google has everything and that we really don’t need
academic libraries any longer. Nothing could be further from the truth.
I can’t remember the last time I heard parents say that they based
their choice of where their children attended college on the strength of library
holdings or databases subscribed to. Although this should be among the top
criteria, it rarely is. It’s true that libraries are becoming more “vanilla” in terms
of databases subscribed to. But the specialized resources are more and more
distinguishing the top schools from the mediocre ones.
Although Google Web can get to some current primary source materials, there
is so much that it cannot expose. As for Google Scholar, even though it exposes
a wide swath of scholarly journal articles to discover, library database products
expose the same content through different interfaces. I generally tell researchers
that it’s beneficial to examine the same material through the power of two or
more search engines, because they will all search, retrieve, and rank the same
results in a different manner. Also, Google Scholar, although deep in terms of
searching, is very narrow in breadth: it doesn’t cover trade journals, popular
magazines, newspapers, dissertations and theses, technical reports, and market
research reports, among other things. Each of these has its own place in the
research process. Google Books, as powerful as it is, has gaps. We may not even
know what all the gaps are, but there definitely are gaps.
I also have a running “wish list” of what I would like Google to change or
implement to make research easier. Here are some of the items on my list:
Pass-through between the three Googles. Being able to pass a search from Google Web to Google
Books or Google Scholar, and from Google Scholar to Google Books, etc.
Citation links under Google Books metadata from the initial browse screen, following the pattern set
forth in Google Scholar.
Better grouping of serial items in Google Books, following the way HathiTrust groups serial holdings
under a single serial title.
Ability to restrict searching in Google Scholar to abstracts only, as you can do when a date sort is done.
This function, available in a limited fashion, needs to be expanded to the entire interface.
Google is not the only place to go for academic information. It may not even
be the first place to search, but it cannot be ignored.