Industry-Scale Knowledge Graphs: Lessons and Challenges

practice
DOI:10.1145/3331166

Article development led by queue.acm.org

Five diverse technology companies show how it’s done.

BY NATASHA NOY, YUQING GAO, ANSHU JAIN, ANANT NARAYANAN, ALAN PATTERSON, AND JAMIE TAYLOR

Knowledge graphs are critical to many enterprises today: They provide the structured data and factual knowledge that drive many products and make them more intelligent and “magical.”

In general, a knowledge graph describes objects of interest and connections between them. For example, a knowledge graph may have nodes for a movie, the actors in this movie, the director, and so on. Each node may have properties such as an actor’s name and age. There may be nodes for multiple movies involving a particular actor. The user can then traverse the knowledge graph to collect information on all the movies in which the actor appeared or, if applicable, directed.

Many practical implementations impose constraints on the links in knowledge graphs by defining a schema or ontology. For example, a link from a movie to its director must connect an object of type Movie to an object of type Person. In some cases the links themselves might have their own properties: a link connecting an actor and a movie might have the name of the specific role the actor played. Similarly, a link connecting a politician with a specific role in government might have the time period during which the politician held that role.

Knowledge graphs and similar structures usually provide a shared substrate of knowledge within an organization, allowing different products and applications to use similar vocabulary and to reuse definitions and descriptions that others create. Furthermore, they usually provide a compact formal representation that developers can use to infer new facts and build up the knowledge; for example, they can use the graph connecting movies and actors to find out which actors frequently appear in movies together.

This article looks at the knowledge graphs of five diverse tech companies, comparing the similarities and differences in their respective experiences of building and using the graphs, and discussing the challenges that all knowledge-driven enterprises face today. The collection of knowledge graphs discussed here covers the breadth of applications, from search, to product descriptions, to social networks:
• Both Microsoft’s Bing knowledge graph and the Google Knowledge Graph support search and answering questions in search and during conversations. Starting with the descriptions and connections of people, places, things, and organizations, these graphs include general knowledge about the world.
• Facebook has the world’s largest social graph, which also includes information about music, movies, celebrities, and places that Facebook users care about.
36 COMMUNICATIONS OF THE ACM | AUGUST 2019 | VOL. 62 | NO. 8
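The schema constraint and link properties described in the introduction (a director link must join a Movie to a Person, and an acting link can carry the name of a role) can be illustrated with a minimal Python sketch. It is not taken from any of the five systems discussed in this article. The `Graph` class, the `SCHEMA` table, the link names `directed_by` and `acted_in`, and the sample identifiers such as `movie_a` and `actor_x` are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations

# Hypothetical schema: link name -> (required source type, required target type).
SCHEMA = {
    "directed_by": ("Movie", "Person"),
    "acted_in": ("Person", "Movie"),
}


@dataclass
class Graph:
    node_types: dict = field(default_factory=dict)  # node id -> type name
    links: list = field(default_factory=list)       # (source, link, target, properties)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_link(self, source, link, target, **properties):
        """Store a link only if its endpoints have the types the schema requires."""
        expected = SCHEMA[link]
        actual = (self.node_types[source], self.node_types[target])
        if actual != expected:
            raise TypeError(f"{link} expects {expected}, got {actual}")
        self.links.append((source, link, target, properties))

    def movies_of(self, actor):
        """Traverse acted_in links to collect the movies an actor appeared in."""
        return sorted(
            target
            for source, name, target, _ in self.links
            if name == "acted_in" and source == actor
        )

    def costar_counts(self):
        """Infer a new fact: how many movies each pair of actors shares."""
        cast = {}
        for source, name, target, _ in self.links:
            if name == "acted_in":
                cast.setdefault(target, set()).add(source)
        counts = Counter()
        for actors in cast.values():
            for pair in combinations(sorted(actors), 2):
                counts[pair] += 1
        return counts


def build_example():
    """Build a tiny made-up graph: two movies, two actors, one director."""
    g = Graph()
    for movie in ("movie_a", "movie_b"):
        g.add_node(movie, "Movie")
    for person in ("actor_x", "actor_y", "director_z"):
        g.add_node(person, "Person")
    g.add_link("movie_a", "directed_by", "director_z")
    # The role name is a property stored on the link itself.
    g.add_link("actor_x", "acted_in", "movie_a", role="Detective")
    g.add_link("actor_y", "acted_in", "movie_a", role="Witness")
    g.add_link("actor_x", "acted_in", "movie_b", role="Pilot")
    g.add_link("actor_y", "acted_in", "movie_b", role="Navigator")
    return g
```

Calling `build_example().movies_of("actor_x")` walks the acting links for one actor, and `costar_counts()` derives which actors appear together, the kind of inferred fact the introduction mentions. A production graph would keep types and constraints in an ontology rather than in a hard-coded dictionary; the dictionary here only keeps the sketch short.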

• The Product Knowledge Graph at eBay, currently under development, will encode semantic knowledge about products, entities, and the relationships between them and the external world.
• The Knowledge Graph Framework for IBM’s Watson Discovery offerings addresses two requirements: one focusing on the use case of discovering nonobvious information, the other on offering a “Build your own knowledge graph” framework.

The goal here is not to describe these knowledge graphs exhaustively, but rather to use the authors’ practical experiences in building knowledge graphs in some of the largest technology companies today as a scaffolding to highlight the challenges that any enterprise-scale knowledge graph will face and where some innovative research is needed.

What’s In a Graph? Design Decisions

Let’s start by describing the five knowledge graphs and the decisions that went into each design and determining the scope of each graph. The different applications and product goals for each one resulted in different approaches and architectures, though many of the challenges are shared by all the enterprises. The accompanying table summarizes the properties of these knowledge graphs.

Microsoft. Engineers and scientists at Microsoft have been working on large-scale graphs for many years. This work included building the end-to-end system from the underlying research, as well as a global-scale service for hundreds of millions of users. Across the company, there are several major graph systems, each bringing specific challenges around creating the graph and keeping it up to date. Many different products can use a knowledge graph to bring value to consumers. The following are some of the graphs at Microsoft:
• The Bing knowledge graph contains information about the world and powers question answering on Bing. It contains entities such as people, places, things, organizations, locations, and so on, as well as the actions that a user might take (for example, to play a video or buy a song). This is the largest knowledge graph at Microsoft, as its
aim is to contain general knowledge about the entire world.
• The Academic graph is a collection of entities such as people, publications, fields of study, conferences, and locations. It allows a user to see connections between researchers and pieces of research that may otherwise be hard to determine.
• The LinkedIn graph contains entities such as people, jobs, skills, companies, locations, and so on. The LinkedIn Economic graph is based on 590 million members and 30 million companies, and is used to find economy-level insights for countries and regions.

The Bing search engine displays a knowledge panel from the Bing knowledge graph when there is additional useful information. For example, a search for the film director James Cameron reveals information such as his date of birth, height, movies and TV shows he directed, previous romantic partners, TED Talks he gave, and Reddit “Ask Me Anything” questions and answers (through partnership with Reddit). A search for a different type of entity returns completely different information; for example, searching for “Woodblock restaurant” results in an extract from the menu, professional critic and user reviews, as well as the option to book a table.

All of these graph systems, as would probably be the case with any large graph system, have three key determinants of quality and usefulness:
• Coverage. Does the graph have all the required information? The answer is always effectively no, because developers are always looking for new ways to provide value to users and for new sources of information.
• Correctness. Is the information correct? How do you know if two sources of information are actually about the same fact, and what do you do if they conflict? Answering these questions is a huge area of study and investment by itself.
• Freshness. Is the content up to date? It may have been correct at one time but gone stale. Freshness will vary for something that changes almost constantly (a stock price) compared with something that changes rarely (the capital of a country), with many different kinds of information in between.

To generate knowledge about the world, data is ingested from multiple sources, which may be very noisy and contradictory, and must be collated into a single, consistent, and accurate graph. The final fact that a user sees is the tip of an iceberg: a huge amount of work and complexity is hidden below. For example, there are 200 Will Smiths in Wikipedia alone, and the Bing knowledge result for the actor Will Smith is composed from 108,000 facts taken from 41 websites.

From search to conversation. Knowledge graphs power advanced AI, allowing single queries to be turned into an ongoing conversation. Specifically, this allows a user to have a conversation with the system and to have the system maintain the context through each turn of the conversation. For example, in a future scenario a user could say to Bing, “Show me all the countries in the world where it’s over 70 degrees Fahrenheit right now,” and once the system returns the answer, the user could say, “Show me those within a two-hour flight.”

You can take the same idea further to enable a full conversational experience. For example, a user could say, “I want to travel to NYC two days before Thanksgiving and stay for a week,” and the system would use the underlying knowledge graph to make sense of the query and then request missing pieces of information. In this example, the system needs to know that “NYC” could mean “JFK Airport” and that Thanksgiving is November 22. It then must know how to carry out a flight search, which requires a start location and a destination location. The system would then have to know the next line of the conversation must determine the start location, so it would say, “Okay, booking a flight to JFK from November 20 to 27. Where will you be flying from?”

Common characteristics of the knowledge graphs.
Company | Data model | Size of the graph | Development stage
Microsoft | The types of entities, relations, and attributes in the graph are defined in an ontology. | ~2 billion primary entities, ~55 billion facts | Actively used in products
Google | Strongly typed entities, relations with domain and range inference | 1 billion entities, 70 billion assertions | Actively used in products
Facebook | All of the attributes and relations are structured and strongly typed, and optionally indexed to enable efficient retrieval, search, and traversal. | ~50 million primary entities, ~500 million assertions | Actively used in products
eBay | Entities and relations, well structured and strongly typed | Expect around 100 million products, >1 billion triples | Early stages of development and deployment
IBM | Entities and relations with evidence information associated with them. | Various sizes. Proven at scales of >100 million documents, >5 billion relationships, and >100 million entities | Actively used in products and by clients

Google. With more than 70 billion assertions describing a billion entities, the Google Knowledge Graph covers a wide swath of subject matter and is the result of more than a decade of data-contribution activity from a diverse set of individuals, most of whom have never had experience with knowledge-management systems.

Perhaps more important, Knowledge Graph serves as a long-term, stable source of class and entity identity that many Google products and features use behind the scenes. Outside users and developers can observe these features when they use services such as YouTube and Google Cloud APIs. This focus on identity has allowed Google to transition to “things not strings.” Rather than simply returning the traditional “10 blue links,” Knowledge Graph helps Google products interpret user requests as references to concepts in the world of the user and to respond appropriately.

Google’s Knowledge Graph is perhaps most visible when users issue queries about entities and the search results include an array of facts about the entities that are served from Knowledge Graph. For example, a query for “I.M. Pei” produces a small panel in the search results with information about the architect’s education, awards, and
the significant structures he designed.

The Knowledge Graph also recognizes that certain kinds of interactions can take place with different entities. A query for “The Russian Tea Room” provides a button to make a reservation, while a query for “Rita Ora” provides links to her music on various music services.

At the scale of the Google Knowledge Graph, a single individual cannot remember, let alone manage, the detailed structures used throughout the graph. To ensure the system remains consistent over time, Google built its Knowledge Graph from a basic set of low-level structures. It replicated similar structures and reasoning mechanisms at different levels of abstraction, conceptually bootstrapping the structure from a number of basic assertions. For example, to check specific invariant constructions, Google leveraged the idea that types were themselves instances of types to introduce the notion of metatypes. It could then reason about the metatypes to verify the finer-grained types did not violate the invariants it was interested in. For instance, it can validate that time-independent identities are not subclasses of time-dependent structures. This scalable level of abstraction was relatively easy to add in a manner that worked out of the box because it was built upon the same low-level entailments on which the rest of the system was based.

This meta-level schema also allows validation of data at scale. For example, you can validate that painters existed before their works of art were created by identifying the painters as the “origin” of their painted work “products” and applying a general check on all relations between these metaclasses.

At a slightly higher conceptual level, Knowledge Graph “understands” that authors are distinct from their creative works, even though these entities are frequently conflated in colloquial expressions. Similarly, creative works may have multiple expressions that are themselves distinct. This ontological knowledge helps maintain the identity of entities as the graph grows.

Building the Knowledge Graph through these self-describing layers not only simplifies consistency checking by machines, but also makes the Knowledge Graph easier for internal users to understand. Once new developers have been trained on the fundamentals of Knowledge Graph organization, they can understand the full extent of its inventory of structures. Similarly, keeping the structure of the graph tied to a few core principles and exposing meta-relations explicitly in schemas makes it simpler for internal developers to find and comprehend new schema structures.

Facebook is known for having the world’s largest social graph. Facebook engineers have built technology over the past decade to enable rich connections between people. Now they are applying the same technology to building a deeper understanding of not just people, but also the things that people care about.

By modeling the world in a structured manner and at scale, Facebook engineers were able to unlock use cases that a social graph by itself could not fulfill. Even seemingly simple things, such as structured understanding of music and lyrics when combined with software that detects when people are referencing them, can enable serendipitous moments between individuals. Many experiences in Facebook’s products today, such as helping people plan movie outings on Messenger, are powered by the knowledge graph.

Facebook’s knowledge graph focuses on the most socially relevant entities, such as those that are most commonly discussed by its users: celebrities, places, movies, and music. As the Facebook knowledge graph continues to grow, developers focus on those domains that have the greatest chance of delivering utility and delightful user experiences.

Coverage, correctness, structure, and constant change all drive the design of the Facebook knowledge graph:
• Coverage means being exhaustive in a domain that is being modeled. The default stance is multiprovider, which means that the entire graph-production system is built with the assumption that data will be received from multiple sources, all providing (sometimes conflicting) information about overlapping sets of entities. The Facebook knowledge graph deals with the conflicting information in one of two ways: the information is deemed to be sufficiently low confidence to justify dropping it; or conflicting views are incorporated into the entity by retaining
provenance and an inferred confidence level about the assertion.
• Correctness does not mean the knowledge graph always knows the “right” value for an attribute, but rather that it is always able to explain why a certain assertion was made. Therefore, it keeps provenance for all data that flows through the system, from data acquisition to the serving layer.
• Structure means the knowledge graph must be self-describing. If a piece of data is not strongly typed or does not fit the schema describing the entity, then the graph attempts to do one of the following: convert the data into the expected type (for example, performing simple type coercion, handling incorrectly formatted dates); extract structured data that matches the type (for example, running natural language processing (NLP) on unstructured text such as user reviews to convert it into typed slots); or leave it out entirely.
• Lastly, the Facebook knowledge graph is designed for constant change. The graph is not a single representation in a database that is updated when new information is received. Instead, the graph is built from scratch, from the sources, every day, and the build system is idempotent, producing a complete graph at the end of each run.

An obvious place for a Facebook knowledge graph to start is the Facebook pages ecosystem. Businesses and people create pages on Facebook to represent a huge range of ideas and interests. Furthermore, having the owner of an entity make assertions about it is a valuable source of data. As with any crowd-sourced data, however, it is not without its challenges.

Facebook pages are very public facing, and millions of people interact with them every day. Thus, the interests of a page owner don’t always align with the requirements of a knowledge graph.

Most commonly, pages and entities do not have a strict 1:1 mapping, as pages can represent collections of entities (for example, movie franchises). Data can also be incomplete or very unstructured (blobs of text), which makes it more difficult to use in the context of a knowledge graph.

Facebook’s biggest challenge has been to leverage data found on its pages and to combine it with other more structured sources of data to achieve the goals of a clean, structured knowledge graph. A useful tool for Facebook has been to think of the graph as the model and a Facebook page as the view: a projection of an entity or collection of entities that reside in the graph.

eBay is building its Product Knowledge Graph, which will encode semantic knowledge about products, entities, and their relationships with each other and the external world. This knowledge will be key to understanding what a seller is offering and a buyer is looking for and intelligently connecting the two, a key part of eBay’s marketplace technology.

For example, eBay’s knowledge graph can relate products to real-world entities, defining the identity of a product and why it might be valuable to a buyer. A basketball jersey for the Chicago Bulls is one product, but if it is signed by Michael Jordan, it is a very different product. A postcard from 1940 in Paris might be just a postcard; knowing that Paris is in France and that 1940 is during World War II changes the product entirely.

Entities in the knowledge graph can also relate products to each other. If a user searches for memorabilia of Lionel Messi and the graph indicates that Lionel Messi plays for Futbol Club Barcelona, then, maybe, merchandise for that club is of interest, too. Perhaps memorabilia for other famous Barcelona players will be of interest to this shopper. Related merchandise should include soccer-based products such as signed shirts, strips, boots, and balls. This idea can extend from sports to music, film, literature, historical events, and much more.

Just as important as entity relations is understanding the products themselves and their relationships. Knowing that one product is an iPhone and another is a case for an iPhone is obviously important. But the case might fit some phones and not others, so eBay needs to model the parts and accessory sizes. Knowing the many variants and relationships of products is also important: Which products are manufacturer variants of one product? Do they come in different sizes, capacities, or colors? Which are comparable, meaning they have mostly the same specifications but perhaps different brands or colors? The system also needs to understand
products that go together as a set, say in bundles, kits, or even fashion outfits.

As with other knowledge graphs, eBay must cope with scale. At any one time there may be more than one billion active listings across thousands of categories. These listings might include hundreds of millions of products and tens of billions of attributes specified for those products.

There are several different users of the eBay Knowledge Graph, and these users have very different service-level requirements. When the search service needs to understand a user’s query, the knowledge graph must power an answer that takes milliseconds. At the other end of the scale, large graph queries could take hours to run.

To cope with these challenges, eBay engineers have designed an architecture that provides them with flexibility, while ensuring that the data is consistent. The knowledge graph uses a replicated log for all writes and edits to the graph. The log provides a consistent ordered view of the data. This approach enables multiple back-end data stores that meet different use cases. Specifically, there is a flattened document store for serving search queries with low latency and a graph store for doing long-running graph analysis. Each of these stores simply appends its operations to the write log and gets the additions and edits to the graph in a guaranteed order. As a result, each store will be consistent.

IBM developed its Knowledge Graph Framework, used by Watson Discovery Services and its associated offerings, which have been deployed in many industry settings outside of IBM. IBM Watson uses the Knowledge Graph Framework in two distinct ways: First, the framework directly powers Watson Discovery, which focuses on using structured and unstructured knowledge to discover new, nonobvious information, and the associated vertical offerings on top of Discovery; second, the framework allows others to build their own knowledge graphs with the prebuilt knowledge graph as the core.

The Discovery use case creates new knowledge that is not directly present in domain documents or data sources. This new knowledge can be surprising and anomalous. While search and exploration tools access knowledge that is already available in the sources available to the system, they are necessary but not sufficient for Discovery. Nonobvious discovery includes new links between entities (for example, a new side effect of a drug, an emerging company as an acquisition target or sales lead), a potential new important entity in the domain (for example, a new material for display technologies, a new investor for a particular investment area), or changing significance of an existing entity (an increasing stake by an investor in an organization, or increasing interaction between a person of interest and some criminal in an intelligence-gathering scenario).

Given its wide enterprise customer base applying cognitive technologies in various domains, IBM focused on creating a framework for clients and client teams to build their own knowledge graphs. Industry teams at IBM leverage this framework to build domain-specific instances. Clients exist in several domains ranging from consumer-oriented research in banking and finance, insurance, IT services, media and entertainment, retail, and customer service, to industries focused almost entirely on deep discovery, especially scientific domains such as life sciences, oil and gas, chemicals and petroleum, defense, and space exploration. This breadth requires that the framework have all of the machinery that clients need to build and manage a knowledge graph themselves. Some of the key technologies built into the framework include document conversion, document extraction, passage storage, and entity normalization.

The following are some of the key insights and lessons that IBM engineers learned from both building the knowledge graph for Watson Discovery and deploying the system in other industries.
• Polymorphic stores offer a solution. The IBM Watson Knowledge Graph uses a polymorphic store, supporting multiple indices, database structures, in-memory, and graph stores. This architecture splits the actual data (often redundantly) into one or more of these stores, allowing each store to address specific requirements and workloads. IBM engineers and researchers addressed a number of challenges such as keeping these multiple stores in sync, allowing communication between the stores through microservices, and allowing ingestion of new knowledge or reprocessing raw data in a way that does not require reloading or rebuilding the entire graph.
• Evidence must be primitive to the system. The main link between the real world (which developers often try to model) and the data structures holding the extracted knowledge is the “evidence” of the knowledge. This evidence is often the raw documents, databases, dictionaries, or image, text, and video files from which the knowledge is derived. When it comes to making pointed and useful contextual queries during a discovery process, the metadata and other associated information often play a role in inference of the knowledge. Thus, it is critical not to lose the linkage between the relationships stored in the graph and where those relationships come from.
• Push entity resolution to runtime through context. Resolving ambiguous references to entities, whether partial names, surface forms, or names shared by multiple entities, is a classic problem in understanding natural language. In the field of knowledge discovery, however, developers often look for the nonobvious patterns where an entity is not behaving in its well-understood form or appears in a novel context. Thus, disambiguating an entity too early in the process of knowledge-graph creation conflicts with the very goal of discovery. It is better to leave those utterances unresolved or disambiguate them to multiple entities, and then during runtime use the context of the query to resolve the entity name.

Challenges Ahead

The requirements, coverage, and architectures of the knowledge graphs discussed here differ quite a bit, but many of the challenges appear consistently across most implementations. These include challenges of scale, disambiguation, extraction of knowledge from heterogeneous and unstructured sources, and managing knowledge evolution. These challenges have been at the forefront of research for years, yet they continue to baffle industry practitioners. Some of the challenges are present in some of the systems but may
be less relevant in other settings.

Entity disambiguation and managing identity. Entity disambiguation and resolution has been an active research area in the semantic Web, and now in knowledge graphs, for several years, so it is almost surprising that it continues to be one of the top challenges in the industry almost across the board. In its simplest form, the challenge is in assigning a unique normalized identity and a type to an utterance or a mention of an entity. Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be incorrectly associated with wrong facts and result in incorrect inference downstream.

While these problems might seem obvious in smaller systems, when identity management must be done with a heterogeneous contributor base and at scale, the problem becomes much more challenging. How can identity be described in a way that different teams can agree on it and know what the other teams are describing? How can developers be sure to have enough human-readable information to adjudicate conflicts?

Type membership and resolution. Most current knowledge-graph systems allow each entity to have multiple types, and the specific type may matter in different circumstances. For example, Barack Obama is a person, but also a politician and actor (far better known as a politician than as an actor). Cuba can be a country or may refer to its government. In some cases, knowledge-graph systems defer the type assignment to runtime: Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.

While criteria for class membership might be straightforward early on, as the universe of instances grows, enforcing these criteria while maintaining semantic stability becomes challenging. For example, when Google defined the category for “sports” in its knowledge graph, e-sports did not exist. So, how does Google maintain the category identity for sports while also including e-sports?

Managing changing knowledge. An effective entity-linking system also needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break an existing entity into multiples. When a company acquires another company, does the acquiring company change identity? What about a division being spun out? Does identity follow the acquisition of the rights to a name?

While most knowledge-graph frameworks are becoming efficient at storing a point-in-time version of a knowledge graph and managing instantaneous changes to the knowledge graphs to evolve the graph, there is a gap in being able to manage highly dynamic knowledge in the graphs [4]. A fundamental understanding of temporal constructs, history, and change with history is needed to capture these changes. Furthermore, the ability to manage updates through multiple stores (for example, IBM’s polymorphic stores) is necessary.

There are a lot of considerations around the integrity of the update process, eventual consistency, conflicting updates, and, simply, runtime performance. There may be an opportunity to think of different variations of existing distributed data stores designed to handle incremental cascade updates.

It is also critical to manage changing schemas and type systems, without creating inconsistencies with the knowledge already in the system. Google, for example, addresses this problem by conceptualizing the metamodel layer into multiple layers. The basic lower layers remain fairly constant and higher levels are built through the notion of metatypes (which are really instances of types), which can be used to enrich the type system.

Knowledge extraction from multiple structured and unstructured sources. Despite the recent advances in natural language understanding, the extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing the graphs at scale requires not only manual approaches, but also unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

For example, in the eBay Product Knowledge Graph, many graph relationships are extracted from unstructured text in listings and seller catalogs; the IBM Discovery knowledge graph relies on documents as evidence for the facts represented in the graphs. Traditional supervised machine-learning frameworks require labor-intensive human annotations to train knowledge-extraction systems. This high cost can be alleviated or eliminated by adopting fully unsupervised approaches (clustering with vector representations) or semi-supervised techniques (distant supervision with existing knowledge, multi-instance learning, active learning, and so on). Entity recognition, classification, text, and entity embeddings all prove useful tools to link our unstructured text to entities we know about in the graph [3].

Managing operations at scale. It is probably not surprising that all of the knowledge-graph systems described here face the challenge of managing the graphs at scale. This dimension often makes the problems that have been addressed in multiple forms in the academic and research community (such as disambiguation and unstructured data extraction) present new challenges in industry settings. Managing scale is the underlying challenge that affects several operations related to performance and workload directly. It also manifests itself indirectly as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs as at IBM or managing consistency on a large evolving knowledge graph as at Google [1].

Other Key Challenges

In addition to these truly pervasive challenges, the following challenges will be critical to the efforts described in this article. These are interesting and intriguing subjects for research and academic communities.

Knowledge-graph semantic embeddings. With a large-scale knowledge graph, developers can build high-dimensional representations of entities and relations. The resulting embeddings will greatly benefit many machine-learning, NLP, and AI tasks as sources of features and constraints, and
42 COMMUNICATIONS OF THE ACM | AUGUST 2019 | VOL. 62 | NO. 8

can form the basis for more sophisticated inferences and ways to curate training data. Deep-learning techniques can be applied to problems of entity deduplication and attribute inference.2

Knowledge inference and verification. Making sure that facts are correct is a core task in constructing a knowledge graph, and with a huge scale it is not remotely possible to verify everything manually. This requires an automated approach: advances in knowledge representation and reasoning, probabilistic graphical models, and natural language inferences can be used to construct an automatic or semi-automatic system for consistency checking and fact verification.

Federation of global, domain-specific, and customer-specific knowledge. In a case like IBM clients, who build their own custom knowledge graphs, the clients are not expected to tell the graph about basic knowledge. For example, a cancer researcher is not going to teach the knowledge graph that skin is a form of tissue, or that St. Jude is a hospital in Memphis, Tennessee. This is known as “general knowledge,” captured in a general knowledge graph.

The next level of information is knowledge that is well known to anybody in the domain. For example, carcinoma is a form of cancer or NHL more often stands for non-Hodgkin lymphoma than National Hockey League (though in some contexts it may still mean that, say, in the patient record of an NHL player). The client should need to input only the private and confidential knowledge or any knowledge that the system does not yet know. Isolation, federation, and online updates of the base and domain layers are some of the major issues that surface because of this requirement.

Security and privacy for personalized, on-device knowledge graphs. Knowledge graphs by definition are enormous, since they aspire to create an entity for every noun in the world, and thus can only reasonably run in the cloud. Realistically, however, most people do not care about all entities that exist in the world, but rather a small fraction or subset that is personally relevant to them. There is a lot of promise in the area of personalizing knowledge graphs for individual users, perhaps even to the extent that they can shrink to a small enough size to be shippable to mobile devices. This will allow developers to keep providing user value in a privacy-respecting manner by doing more on-device learning and computation over local small knowledge-graph instances. (We are eager to collaborate with the research community in pursuit of this goal.)

Multilingual knowledge systems. A comprehensive knowledge graph must cover facts expressed in multiple languages and conflate the concepts expressed in those languages into a cohesive set. In addition to the challenges in knowledge extraction from multilingual sources, different cultures may conceptualize the world in subtly different ways, which poses challenges in the design of the ontology as well.

Conclusion
The natural question from our discussion in this article is whether different knowledge graphs can someday share certain core elements, such as descriptions of people, places, and similar entities. One of the avenues toward sharing these descriptions could be to contribute them to Wikidata as a common, multilingual core. In the nearer term, we hope to continue sharing the results of research that each of us may have done with researchers and practitioners outside of our companies.

Knowledge representation is a difficult skill to learn on the job. The pace of development and the scale at which knowledge-representation choices impact users and data do not foster an environment in which to understand and explore its principles and alternatives. The importance of knowledge representation in diverse industry settings, as evidenced by the discussion in this article, should reinforce the idea that knowledge representation should be a fundamental part of a computer science curriculum, as fundamental as data structures and algorithms.

Finally, we all agree that AI systems will unlock new opportunities for organizations in how they interact with customers, provide unique value in their space, and transform their operations and workforces. To realize this promise, these organizations must figure out how to build new systems that unlock knowledge to make them truly intelligent organizations.

The article summarizes and expands on a panel discussion the authors conducted at the International Semantic Web Conference in Asilomar, CA, in Oct. 2018 (https://bit.ly/2ZYVLJh). The discussion is based on practical experiences and represents the views of the authors and not necessarily their employers.

Related articles on queue.acm.org

Schema.org: Evolution of Structured Data on the Web
R.V. Guha, D. Brickley, and S. Macbeth
https://queue.acm.org/detail.cfm?id=2857276

Hazy: Making it Easier to Build and Maintain Big-data Analytics
A. Kumar, F. Niu, and C. Ré
https://queue.acm.org/detail.cfm?id=2431055

A Primer on Provenance
L. Carata, et al.
https://queue.acm.org/detail.cfm?id=2602651

References
1. Höffner, K., Walter, S., Marx, E., Usbeck, R., Lehmann, J. and Ngonga Ngomo, A.C. Survey on challenges of question answering in the semantic Web. Semantic Web 8, 6 (2017), 895–920.
2. Lin, Y., Liu, Z., Sun, M., Liu, Y. and Zhu, X. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the Assoc. Advancement of Artificial Intelligence 15 (2015), 2181–2187.
3. Nickel, M., Murphy, K., Tresp, V. and Gabrilovich, E. A review of relational machine learning for knowledge graphs. In Proceedings of the IEEE 104, 1 (2016), 11–33.
4. Paulheim, H. Knowledge graph refinement: a survey of approaches and evaluation methods. Semantic Web 8, 3 (2017), 489–508.

Natasha Noy is a scientist at Google, where she works on making structured data accessible and leads Google Dataset Search. Previously, she worked on ontology engineering and semantic Web at Stanford University, Stanford, CA, USA.

Yuqing Gao is the general manager of Microsoft’s Artificial Intelligence – Knowledge Graph organization. She has been a key leader behind intelligent features for Microsoft Office products, Bing Entity Search, and other prominent AI knowledge-driven Microsoft technologies.

Anshu Jain works at IBM Watson, where he is responsible for the architecture of the core knowledge and language capabilities, including Knowledge Graph, natural language understanding, and Watson Knowledge Studio, among others.

Anant Narayanan is an engineering manager at Facebook, where he helps build knowledge platforms to develop a deeper understanding of entities and relationships. Previously, he led the development of large-scale data pipelines at Ozlo to support conversational AI systems.

Alan Patterson is a Distinguished Engineer at eBay, heading up eBay’s efforts to build a product knowledge graph that contains eBay’s knowledge of products, as well as organizations, brands, people, places, and standards. Previously, he worked at the startup True Knowledge (also Evi.com).

Jamie Taylor manages the Schema Team for Google’s Knowledge Graph. The team’s responsibilities include extending KG’s underlying semantic representation, growing coverage of the ontology, and enforcing semantic policy. Previously, he worked for Metaweb Technologies.

Copyright held by authors/owners. Publication rights licensed to ACM.
