Industry-Scale Knowledge Graphs: Lessons and Challenges

Five diverse technology companies show how it’s done.

BY NATASHA NOY, YUQING GAO, ANSHU JAIN, ANANT NARAYANAN, ALAN PATTERSON, AND JAMIE TAYLOR

DOI:10.1145/3331166
Article development led by queue.acm.org

Knowledge graphs are critical to many enterprises today: They provide the structured data and factual knowledge that drive many products and make them more intelligent and “magical.”

In general, a knowledge graph describes objects of interest and connections between them. For example, a knowledge graph may have nodes for a movie, the actors in this movie, the director, and so on. Each node may have properties such as an actor’s name and age. There may be nodes for multiple movies involving a particular actor. The user can then traverse the knowledge graph to collect information on all the movies in which the actor appeared or, if applicable, directed.

Many practical implementations impose constraints on the links in knowledge graphs by defining a schema or ontology. For example, a link from a movie to its director must connect an object of type Movie to an object of type Person. In some cases the links themselves might have their own properties: a link connecting an actor and a movie might have the name of the specific role the actor played. Similarly, a link connecting a politician with a specific role in government might have the time period during which the politician held that role.

Knowledge graphs and similar structures usually provide a shared substrate of knowledge within an organization, allowing different products and applications to use similar vocabulary and to reuse definitions and descriptions that others create. Furthermore, they usually provide a compact formal representation that developers can use to infer new facts and build up the knowledge, for example, using the graph connecting movies and actors to find out which actors frequently appear in movies together.

This article looks at the knowledge graphs of five diverse tech companies, comparing the similarities and differences in their respective experiences of building and using the graphs, and discussing the challenges that all knowledge-driven enterprises face today.

The collection of knowledge graphs discussed here covers the breadth of applications, from search, to product descriptions, to social networks:

• Both Microsoft’s Bing knowledge graph and the Google Knowledge Graph support search and answering questions in search and during conversations. Starting with the descriptions and connections of people, places, things, and organizations, these graphs include general knowledge about the world.
• Facebook has the world’s largest social graph, which also includes information about music, movies, celebrities, and places that Facebook users care about.
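
The movie example from the opening paragraphs (typed nodes, schema-checked links, links that carry their own properties, and inferring which actors frequently appear together) can be illustrated with a small Python sketch. Everything named here is a hypothetical illustration and not any company's actual data model: the `SCHEMA` table, the `Graph` class, the predicate names, and the placeholder people and movies are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations

# Hypothetical schema: predicate -> (required source type, required target type).
SCHEMA = {
    "directed_by": ("Movie", "Person"),
    "acted_in": ("Person", "Movie"),
}


@dataclass
class Graph:
    node_types: dict = field(default_factory=dict)  # node id -> type name
    node_props: dict = field(default_factory=dict)  # node id -> node properties
    links: list = field(default_factory=list)       # (source, predicate, target, link properties)

    def add_node(self, node_id, node_type, **props):
        self.node_types[node_id] = node_type
        self.node_props[node_id] = props

    def add_link(self, source, predicate, target, **props):
        # Reject links whose endpoint types do not match the schema.
        domain, range_ = SCHEMA[predicate]
        if self.node_types[source] != domain or self.node_types[target] != range_:
            raise TypeError(f"{predicate} must connect {domain} to {range_}")
        self.links.append((source, predicate, target, props))

    def movies_of(self, actor):
        # Traverse the graph to collect every movie the actor appeared in.
        return sorted(t for s, p, t, _ in self.links if p == "acted_in" and s == actor)

    def frequent_co_actors(self, min_movies=2):
        # Infer a new fact: pairs of actors who share at least min_movies movies.
        cast = {}
        for s, p, t, _ in self.links:
            if p == "acted_in":
                cast.setdefault(t, set()).add(s)
        pairs = Counter()
        for actors in cast.values():
            pairs.update(combinations(sorted(actors), 2))
        return {pair: n for pair, n in pairs.items() if n >= min_movies}


def build_example():
    # Placeholder data, invented for illustration only.
    g = Graph()
    for person in ("ann", "bob", "cy"):
        g.add_node(person, "Person", name=person.title())
    for movie in ("m1", "m2"):
        g.add_node(movie, "Movie")
    g.add_link("m1", "directed_by", "cy")
    g.add_link("ann", "acted_in", "m1", role="Pilot")  # the link itself carries a property
    g.add_link("bob", "acted_in", "m1", role="Engineer")
    g.add_link("ann", "acted_in", "m2", role="Detective")
    g.add_link("bob", "acted_in", "m2", role="Witness")
    g.add_link("cy", "acted_in", "m2", role="Judge")
    return g
```

With this placeholder data, `build_example().frequent_co_actors()` reports the single pair `("ann", "bob")`, and trying to add a `directed_by` link that starts at a Person raises `TypeError`, which is the kind of constraint a schema or ontology imposes.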
practice
AUGUST 2019 | VOL. 62 | NO. 8 | COMMUNICATIONS OF THE ACM 37

The goal here is not to describe these knowledge graphs exhaustively, but rather to use the authors’ practical experiences in building knowledge graphs in some of the largest technology companies today as a scaffolding to highlight the challenges that any [...]

[...] challenges are shared by all the enterprises. The accompanying table summarizes the properties of these knowledge graphs.

Microsoft. Engineers and scientists at Microsoft have been working on large-scale graphs for many years. This [...]

[...] powers question answering on Bing. It contains entities such as people, places, things, organizations, locations, and so on, as well as the actions that a user might take (for example, to play a video or buy a song). This is the largest knowledge graph at Microsoft, as its
aim is to contain general knowledge about the entire world.
• The Academic graph is a collection of entities such as people, publications, fields of study, conferences, and locations. It allows a user to see connections between researchers and pieces of research that may otherwise be hard to determine.
• The LinkedIn graph contains entities such as people, jobs, skills, companies, locations, and so on. The LinkedIn Economic graph is based on 590 million members and 30 million companies, and is used to find economy-level insights for countries and regions.

The Bing search engine displays a knowledge panel from the Bing knowledge graph when there is additional useful information. For example, a search for the film director James Cameron reveals information such as his date of birth, height, movies and TV shows he directed, previous romantic partners, TED Talks he gave, and Reddit “Ask Me Anything” questions and answers (through partnership with Reddit). A search for a different type of entity returns completely different information. For example, searching for “Woodblock restaurant” results in an extract from the menu, professional critic and user reviews, as well as the option to book a table.

All of these graph systems, as would probably be the case with any large graph system, have three key determinants of quality and usefulness:

• Coverage. Does the graph have all the required information? The answer is always effectively no, because developers are always looking for new ways to provide value to users and for new sources of information.
• Correctness. Is the information correct? How do you know if two sources of information are actually about the same fact, and what do you do if they conflict? Answering these questions is a huge area of study and investment by itself.
• Freshness. Is the content up to date? It may have been correct at one time but gone stale. Freshness will vary for something that changes almost constantly (a stock price) compared with something that changes rarely (the capital of a country), with many different kinds of information in between.

To generate knowledge about the world, data is ingested from multiple sources, which may be very noisy and contradictory, and must be collated into a single, consistent, and accurate graph. The final fact that a user sees is the tip of an iceberg: a huge amount of work and complexity is hidden below. For example, there are 200 Will Smiths in Wikipedia alone, and the Bing knowledge result for the actor Will Smith is composed from 108,000 facts taken from 41 websites.

From search to conversation. Knowledge graphs power advanced AI, allowing single queries to be turned into an ongoing conversation. Specifically, this allows a user to have a conversation with the system and to have the system maintain the context through each turn of the conversation. For example, in a future scenario a user could say to Bing, “Show me all the countries in the world where it’s over 70 degrees Fahrenheit right now,” and once the system returns the answer, the user could say, “Show me those within a two-hour flight.”

You can take the same idea further to enable a full conversational experience. For example, a user could say, “I want to travel to NYC two days before Thanksgiving and stay for a week,” and the system would use the underlying knowledge graph to make sense of the query and then request missing pieces of information. In this example, the system needs to know that “NYC” could mean “JFK Airport” and that Thanksgiving is November 22. It then must know how to carry out a flight search, which requires a start location and a destination location. The system would then have to know the next line of the conversation must determine the start location, so it would say, “Okay, booking a flight to JFK from November 20 to 27. Where will you be flying from?”

Google. With more than 70 billion assertions describing a billion entities, the Google Knowledge Graph covers a wide swath of subject matter and is the result of more than a decade of data-contribution activity from a diverse set of individuals, most of whom have never had experience with knowledge-management systems.

Perhaps more important, Knowledge Graph serves as a long-term, stable source of class and entity identity that many Google products and features use behind the scenes. Outside users and developers can observe these features when they use services such as YouTube and Google Cloud APIs. This focus on identity has allowed Google to transition to “things not strings.” Rather than simply returning the traditional “10 blue links,” Knowledge Graph helps Google products interpret user requests as references to concepts in the world of the user and to respond appropriately.

Google’s Knowledge Graph is perhaps most visible when users issue queries about entities and the search results include an array of facts about the entities that are served from Knowledge Graph. For example, a query for “I.M. Pei” produces a small panel in the search results with information about the architect’s education, awards, and [...]

Common characteristics of the knowledge graphs.

Company | Data model | Size of the graph | Development stage
Microsoft | The types of entities, relations, and attributes in the graph are defined in an ontology. | ~2 billion primary entities, ~55 billion facts | Actively used in products
Google | Strongly typed entities, relations with domain and range inference | 1 billion entities, 70 billion assertions | Actively used in products
Facebook | All of the attributes and relations are structured and strongly typed, and optionally indexed to enable efficient retrieval, search, and traversal. | ~50 million primary entities, ~500 million assertions | Actively used in products
eBay | Entities and relations, well-structured and strongly typed | Expect around 100 million products, >1 billion triples | Early stages of development and deployment
IBM | Entities and relations with evidence information associated with them. | Various sizes. Proven on scales: documents >100 million, relationships >5 billion, entities >100 million | Actively used in products and by clients
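
The travel request in the “From search to conversation” passage involves a few concrete steps: resolve “NYC” to an airport, turn “two days before Thanksgiving” and “stay for a week” into dates, and notice that the start location is still missing. The Python sketch below walks through those steps under stated assumptions. The year 2018 is assumed because it is a year in which Thanksgiving falls on November 22; `PLACE_ALIASES` is a hypothetical stand-in for a knowledge-graph lookup; and the function names are illustrative, not part of any Bing API.

```python
from datetime import date, timedelta

# Hypothetical alias table standing in for a knowledge-graph lookup.
PLACE_ALIASES = {"NYC": "JFK"}


def thanksgiving(year):
    # US Thanksgiving falls on the fourth Thursday of November.
    first = date(year, 11, 1)
    days_to_thursday = (3 - first.weekday()) % 7  # Monday is 0, Thursday is 3
    return first + timedelta(days=days_to_thursday + 21)


def parse_trip(destination, days_before_holiday, stay_days, year, origin=None):
    # Fill the slots the request provides; origin stays None until the user supplies it.
    start = thanksgiving(year) - timedelta(days=days_before_holiday)
    return {
        "origin": origin,
        "destination": PLACE_ALIASES.get(destination, destination),
        "depart_on": start,
        "return_on": start + timedelta(days=stay_days),
    }


def next_prompt(trip):
    # A flight search needs both endpoints, so ask for the missing one first.
    if trip["origin"] is None:
        return (
            f"Okay, booking a flight to {trip['destination']} from "
            f"{trip['depart_on']:%B %d} to {trip['return_on']:%B %d}. "
            "Where will you be flying from?"
        )
    return "Searching flights."


trip = parse_trip("NYC", days_before_holiday=2, stay_days=7, year=2018)
print(next_prompt(trip))
```

Under these assumptions the computed stay runs from November 20 to November 27, matching the dates in the dialogue above. A production system would draw the alias and the holiday date from the graph itself rather than from hard-coded rules.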
At the scale of the Google Knowledge Graph, a single individual cannot remember, let alone manage, the detailed structures used throughout the graph. To ensure the system remains consistent over time, Google built its Knowledge Graph from a basic set of low-level structures. It replicated similar structures and reasoning mechanisms at different levels of abstraction, conceptually bootstrapping the structure from a number of basic assertions. For example, to check specific invariant constructions, Google leveraged the idea that types were themselves instances of types to introduce the notion of metatypes. It could then reason about the metatypes to verify the finer-grained types did not violate the invariants it was interested in. It can validate that time-independent identities are not subclasses of structures, which are time-dependent. This scalable level of abstraction was relatively easy to add in a manner that worked out of the box because it was built upon the same low-level entailments on which the rest of the system was based.

This meta-level schema also allows validation of data at scale. For example, you can validate that painters existed before their works of art were created by identifying the painters as the “origin” of their painted work “products” and applying a general check on all relations between these metaclasses.

At a slightly higher conceptual level, Knowledge Graph “understands” that authors are distinct from their creative works, even though these entities are frequently conflated in colloquial expressions. Similarly, creative works may have multiple expressions that are themselves distinct. This ontological knowledge helps maintain the identity of entities as the graph grows.

Building the Knowledge Graph through these self-describing layers not only simplifies consistency checking by machines, but also makes the Knowledge Graph easier for internal users to understand. Once new developers have [...]

[...] finding and comprehending new schema structures is simplified for internal developers.

Facebook is known for having the world’s largest social graph. Facebook engineers have built technology over the past decade to enable rich connections between people. Now they are applying the same technology to building a deeper understanding of not just people, but also the things that people care about.

By modeling the world in a structured manner and at scale, Facebook engineers were able to unlock use cases that a social graph by itself could not fulfill. Even seemingly simple things, such as structured understanding of music and lyrics when combined with software that detects when people are referencing them, can enable serendipitous moments between individuals. Many experiences in Facebook’s products today, such as helping people plan movie outings on Messenger, are powered by the knowledge graph.

Facebook’s knowledge graph focuses on the most socially relevant entities, such as those that are most commonly discussed by its users: celebrities, places, movies, and music. As the Facebook knowledge graph continues to grow, developers focus on those domains that have the greatest chance of delivering utility and delightful user experiences.

Coverage, correctness, structure, and constant change all drive the design of the Facebook knowledge graph:

• Coverage means being exhaustive in a domain that is being modeled. The default stance is multiprovider, which means that the entire graph-production system is built with the assumption that data will be received from multiple sources, all providing (sometimes conflicting) information about overlapping sets of entities. The Facebook knowledge graph deals with the conflicting information in one of two ways: the information is deemed to be sufficiently low confidence to justify dropping it; or conflicting views are incorporated into the entity by retaining [...]
[...] products that go together as a set, say in bundles, kits, or even fashion outfits.

As with other knowledge graphs, eBay must cope with scale. At any one time there may be more than one billion active listings across thousands of categories. These listings might include hundreds of millions of products and tens of billions of attributes specified for those products.

There are several different users of the eBay Knowledge Graph, and these users have very different service-level requirements. When the search service needs to understand a user’s query, the knowledge graph must power an answer that takes milliseconds. At the other end of the scale, large graph queries could take hours to run.

To cope with these challenges, eBay engineers have designed an architecture that provides them with flexibility, while ensuring that the data is consistent. The knowledge graph uses a replicated log for all writes and edits to the graph. The log provides a consistent ordered view of the data. This approach enables multiple back-end data stores that meet different use cases. Specifically, there is a flattened document store for serving search queries with low latency and a graph store for doing long-running graph analysis. Each of these stores simply appends its operations to the write log and gets the additions and edits to the graph in a guaranteed order. As a result, each store will be consistent.

IBM developed its Knowledge Graph Framework, used by Watson Discovery Services and its associated offerings, which have been deployed in many industry settings outside of IBM. IBM Watson uses the Knowledge Graph Framework in two distinct ways: First, the framework directly powers Watson Discovery, which focuses on using structured and unstructured knowledge to discover new, nonobvious information, and the associated vertical offerings on top of Discovery; second, the framework allows others to build their own knowledge graphs with the prebuilt knowledge graph as the core.

The Discovery use case creates new knowledge that is not directly present in domain documents or data sources. This new knowledge can be surprising and anomalous. While search and exploration tools access knowledge that is already available in the sources available to the system, they are necessary but not sufficient for Discovery. Nonobvious discovery includes new links between entities (for example, a new side effect of a drug, an emerging company as an acquisition target or sales lead), a potential new important entity in the domain (for example, a new material for display technologies, a new investor for a particular investment area), or changing significance of an existing entity (an increasing stake by an investor in an organization, or increasing interaction between a person of interest and some criminal in an intelligence-gathering scenario).

Given its wide enterprise customer base applying cognitive technologies in various domains, IBM focused on creating a framework for clients and client teams to build their own knowledge graphs. Industry teams at IBM leverage this framework to build domain-specific instances. Clients exist in several domains ranging from consumer-oriented research in banking and finance, insurance, IT services, media and entertainment, retail, and customer service, to industries focused almost entirely on deep discovery, especially scientific domains such as life sciences, oil and gas, chemicals and petroleum, defense, and space exploration. This breadth requires that the framework have all of the machinery that clients need to build and manage a knowledge graph themselves. Some of the key technologies built into the framework include document conversion, document extraction, passage storage, and entity normalization.

The following are some of the key insights and lessons that IBM engineers learned from both building the knowledge graph for Watson Discovery and deploying the system in other industries.

• Polymorphic stores offer a solution. The IBM Watson Knowledge Graph uses a polymorphic store, supporting multiple indices, database structures, in-memory, and graph stores. This architecture splits the actual data (often redundantly) into one or more of these stores, allowing each store to address specific requirements and workloads. IBM engineers and researchers addressed a number of challenges such as keeping these multiple stores in sync, allowing communication between the stores through microservices, and allowing ingestion of new knowledge or reprocessing raw data in a way that does not require reloading or rebuilding the entire graph.
• Evidence must be primitive to the system. The main link between the real world (which developers often try to model) and the data structures holding the extracted knowledge is the “evidence” of the knowledge. This evidence is often the raw documents, databases, dictionaries, or image, text, and video files from which the knowledge is derived. When it comes to making pointed and useful contextual queries during a discovery process, the metadata and other associated information often play a role in inference of the knowledge. Thus, it is critical not to lose the linkage between the relationships stored in the graph and where those relationships come from.
• Push entity resolution to runtime through context. Resolving ambiguous references to entities referenced by partial names, surface forms, or multiple entities having the same names is a classic problem in understanding natural language. In the field of knowledge discovery, however, developers often look for the nonobvious patterns where an entity is not behaving in its well-understood form or appears in a novel context. Thus, a disambiguation of an entity too early in the process of knowledge-graph creation conflicts with the very goal of discovery. It is better to leave those utterances unresolved or disambiguate them to multiple entities, and then during runtime use the context of the query to resolve the entity name.

Challenges Ahead

The requirements, coverage, and architectures of the knowledge graphs discussed here differ quite a bit, but many of the challenges appear consistently across most implementations. These include challenges of scale, disambiguation, extraction of knowledge from heterogeneous and unstructured sources, and managing knowledge evolution. These challenges have been at the forefront of research for years, yet they continue to baffle industry practitioners. Some of the challenges are present in some of the systems but may
be less relevant in other settings.

Entity disambiguation and managing identity. While entity disambiguation and resolution is an active research area in the semantic Web, and now in knowledge graphs for several years, it is almost surprising that it continues to be one of the top challenges in the industry almost across the board. In its simplest form, the challenge is in assigning a unique normalized identity and a type to an utterance or a mention of an entity. Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be incorrectly associated with wrong facts and result in incorrect inference downstream.

While these problems might seem obvious in smaller systems, when identity management must be done with a heterogeneous contributor base and at scale, the problem becomes much more challenging. How can identity be described in a way that different teams can agree on it and know what the other teams are describing? How can developers be sure to have enough human-readable information to adjudicate conflicts?

Type membership and resolution. Most current knowledge-graph systems allow each entity to have multiple types, and the specific type may matter in different circumstances. For example, Barack Obama is a person, but also a politician and actor (a vastly more popular politician and not a very well-known actor). Cuba can be a country or may refer to its government. In some cases, knowledge-graph systems defer the type assignment to runtime: Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.

While criteria for class membership might be straightforward early on, as the universe of instances grows, enforcing these criteria while maintaining semantic stability becomes challenging. For example, when Google defined the category for “sports” in its knowledge graph, e-sports did not exist. So, how does Google maintain the category identity for sports while also including e-sports?

Managing changing knowledge. An effective entity-linking system also needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break an existing entity into multiples. When a company acquires another company does the acquiring company change identity? What about a division being spun out? Does identity follow the acquisition of the rights to a name?

While most knowledge-graph frameworks are becoming efficient at storing a point-in-time version of a knowledge graph and managing instantaneous changes to the knowledge graphs to evolve the graph, there is a gap in being able to manage highly dynamic knowledge in the graphs [4]. A fundamental understanding of temporal constructs, history, and change with history is needed to capture these changes. Furthermore, the ability to manage updates through multiple stores (for example, IBM’s polymorphic stores) is necessary.

There are a lot of considerations around the integrity of the update process, eventual consistency, conflicting updates, and, simply, runtime performance. There may be an opportunity to think of different variations of existing distributed data stores designed to handle incremental cascade updates.

It is also critical to manage changing schemas and type systems, without creating inconsistencies with the knowledge already in the system. Google, for example, addresses this problem by conceptualizing the metamodel layer into multiple layers. The basic lower layers remain fairly constant and higher levels are built through the notion of metatypes (which are really instances of types), which can be used to enrich the type system.

Knowledge extraction from multiple structured and unstructured sources. Despite the recent advances in natural language understanding, the extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing the graphs at scale requires not only manual approaches, but also unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

For example, in the eBay Product Knowledge Graph, many graph relationships are extracted from unstructured text in listings and seller catalogs; the IBM Discovery knowledge graph relies on documents as evidence for the facts represented in the graphs. Traditional supervised machine-learning frameworks require labor-intensive human annotations to train knowledge-extraction systems. This high cost can be alleviated or eliminated by adopting fully unsupervised approaches (clustering with vector representations) or semi-supervised techniques (distant supervision with existing knowledge, multi-instance learning, active learning, and so on). Entity recognition, classification, text, and entity embeddings all prove useful tools to link our unstructured text to entities we know about in the graph [3].

Managing operations at scale. It is probably not surprising that all of the knowledge-graph systems described here face the challenge of managing the graphs at scale. This dimension often makes the problems that have been addressed in multiple forms in the academic and research community (such as disambiguation and unstructured data extraction) present new challenges in industry settings. Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs as at IBM or managing consistency on a large evolving knowledge graph as at Google [1].

Other Key Challenges

In addition to these truly pervasive challenges, the following challenges will be critical to the efforts described in this article. These are interesting and intriguing subjects for research and academic communities.

Knowledge-graph semantic embeddings. With a large-scale knowledge graph, developers can build high-dimensional representations of entities and relations. The resulting embeddings will greatly benefit many machine-learning, NLP, and AI tasks as sources of features and constraints, and
can form the basis for more sophisticated inferences and ways to curate training data. Deep-learning techniques can be applied to problems of entity deduplication and attribute inference [2].

Knowledge inference and verification. Making sure that facts are correct is a core task in constructing a knowledge graph, and with a huge scale it is not remotely possible to verify everything manually. This requires an automated approach: advances in knowledge representation and reasoning, probabilistic graphical models, and natural language inferences can be used to construct an automatic or semi-automatic system for consistency checking and fact verification.

Federation of global, domain-specific, and customer-specific knowledge. In a case like IBM clients, who build their own custom knowledge graphs, the clients are not expected to tell the graph about basic knowledge. For example, a cancer researcher is not going to teach the knowledge graph that skin is a form of tissue, or that St. Jude is a hospital in Memphis, Tennessee. This is known as “general knowledge,” captured in a general knowledge graph.

The next level of information is knowledge that is well known to anybody in the domain, for example, carcinoma is a form of cancer or NHL more [...]

[...] to a small enough size to be shippable to mobile devices. This will allow developers to keep providing user value in a privacy-respecting manner by doing more on-device learning and computation, over local small knowledge-graph instances. (We are eager to collaborate with the research community in pursuit of this goal.)

Multilingual knowledge systems. A comprehensive knowledge graph must cover facts expressed in multiple languages and conflate the concepts expressed in those languages into a cohesive set. In addition to the challenges in knowledge extraction from multilingual sources, different cultures may conceptualize the world in subtly different ways, which poses challenges in the design of the ontology as well.

Conclusion

The natural question from our discussion in this article is whether different knowledge graphs can someday share certain core elements, such as descriptions of people, places, and similar entities. One of the avenues toward sharing these descriptions could be to contribute them to Wikidata as a common, multilingual core. In the nearer term, we hope to continue sharing the results of research that each of us may have done with researchers and practi[...]

The article summarizes and expands on a panel discussion the authors conducted at the International Semantic Web Conference in Asilomar, CA, in Oct. 2018 (https://bit.ly/2ZYVLJh). The discussion is based on practical experiences and represents the views of the authors and not necessarily their employers.

Related articles on queue.acm.org

Schema.org: Evolution of Structured Data on the Web
R.V. Guha, D. Brickley, and S. Macbeth
https://queue.acm.org/detail.cfm?id=2857276

Hazy: Making it Easier to Build and Maintain Big-data Analytics
A. Kumar, F. Niu, and C. Ré
https://queue.acm.org/detail.cfm?id=2431055

A Primer on Provenance
L. Carata, et al.
https://queue.acm.org/detail.cfm?id=2602651

References

1. Höffner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J. and Ngonga Ngomo, A.C. Survey on challenges of question answering in the semantic Web. Semantic Web 8, 6 (2017), 895–920.
2. Lin, Y., Liu, Z., Sun, M., Liu, Y. and Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Assoc. Advancement of Artificial Intelligence 15 (2015), 2181–2187.
3. Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. A review of relational machine learning for knowledge graphs. In Proceedings of the IEEE 104, 1 (2016), 11–33.
4. Paulheim, H. Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic Web 8, 3 (2017), 489–508.