Traversing Knowledge Graphs in Vector Space
Kelvin Guu, John Miller, Percy Liang
This talk is about how to traverse knowledge graphs in vector space and the surprising benefits you get from doing so.
Knowledge graphs

What languages are spoken by people in Portugal?
portugal / location / language

portugal --location--> {fernando_pessoa, jorge_sampaio} --language--> {english, french, portuguese}

One of their most powerful aspects is that they support compositional queries.
For example, we can take a natural question like this and convert it into a path query, which can then be executed on a knowledge graph such as Freebase via graph traversal.
You start with the entity Portugal, traverse to all the people located in Portugal, then traverse to all the languages they speak.
Knowledge graphs

(fernando_pessoa, location, portugal)
fact = edge = triple

[Figure: the same graph, with the labeled edge fernando_pessoa --location--> portugal highlighted.]

Here, each fact in the knowledge graph is simply a (subject, predicate, object) triple, depicted as a labeled edge in the knowledge graph. I will use the terms fact, edge, and triple interchangeably for the rest of the talk.
Knowledge graphs

93.8% of persons from Freebase have no place of birth, and 78.5% have no nationality [Min et al, 2013]

strength: compositionality
weakness: incompleteness

To grasp the magnitude of the problem, consider that in 2013, 93.8% of people in Freebase had no recorded place of birth, and 78.5% had no nationality.
So, knowledge graphs are good for compositionality, but suffer from incompleteness.
[Figure: entities barack_obama, united_states, fernando_pessoa, and portugal embedded as points in vector space, with "location" edges depicted as spatial relationships between the points.]

compression → generalization

Many methods have been developed to infer missing facts. One interesting class of methods is to embed the entire knowledge graph in vector space.
Each entity in the knowledge graph is represented by a point in vector space, and the relationships that hold between entities are reflected in the spatial relationships between points.
By forcing a vector space model to squeeze a large number of facts into a low-dimensional vector space, we force the model to compress knowledge in a way that makes it predict new, previously unseen facts.
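To make "compression → generalization" concrete, here is a toy sketch of my own (not from the talk): we factor a binary fact matrix with a truncated SVD, and the low-rank reconstruction assigns a positive score to a plausible fact that was never observed. The entities and columns are hypothetical stand-ins.

```python
import numpy as np

# Toy fact matrix: rows are people, columns are facts (1 = observed in the graph).
# Columns: located_in_us, located_in_pt, speaks_english, speaks_portuguese.
facts = np.array([
    [1., 0., 1., 0.],   # barack_obama
    [0., 1., 0., 1.],   # fernando_pessoa
    [0., 0., 0., 1.],   # jorge_sampaio: location fact missing from the graph
])

# Compress to rank 2: the model can no longer memorize each row separately,
# so people who share facts end up with similar low-dimensional profiles.
U, S, Vt = np.linalg.svd(facts, full_matrices=False)
recon = U[:, :2] @ np.diag(S[:2]) @ Vt[:2, :]

# The missing fact (jorge_sampaio, located_in_pt) now gets a clearly positive
# score, because sampaio's reconstructed row is pulled toward pessoa's.
print(recon[2].round(2))   # -> approximately [0.   0.45 0.   0.72]
```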
There has been a significant amount of work on this topic, and I've just listed a few related papers here.
They all excel at handling incompleteness, but none of them directly address how to answer compositional queries, which was the original strength of knowledge graphs.
This talk

                      Compositional queries   Handle incompleteness
Graph databases       yes                     no
Vector space models   ?                       yes

So, we have seen two different ways to store knowledge, each with their own pros and cons.
Graph databases are very precise and can handle compositional queries, but are largely incomplete.
Vector space models can infer missing facts through compression, but so far it is not clear how they can support compositional queries.
This talk is about making it possible for vector space models to also handle compositional queries, despite the inherent errors introduced by compression.
And surprisingly, we show that when we extend vector space models to handle compositional queries, they also improve at their original purpose of inferring missing facts.
So, roughly speaking, this talk will have two parts, one for each contribution.
Outline
PART I
PART II
Path queries

What languages are spoken by people in Portugal?
portugal / location / language

portugal  → {portugal}
location  → {fernando_pessoa, jorge_sampaio, vasco_da_gama}
language  → {portuguese, spanish, english}

We will focus on how to answer a particular kind of compositional query: path queries.
We start with a single token, denoting a set containing just one entity, portugal.
At every stage, we compute a set of entities, which is all we need to compute the next set.
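As a concrete reference point, here is a minimal sketch (mine, not the talk's) of exact path-query evaluation by set-based graph traversal; the tiny graph is a hypothetical stand-in for Freebase.

```python
# Exact path-query evaluation over a graph stored as (subject, relation) -> objects.
graph = {
    ("portugal", "location"): {"fernando_pessoa", "jorge_sampaio", "vasco_da_gama"},
    ("fernando_pessoa", "language"): {"portuguese", "english"},
    ("jorge_sampaio", "language"): {"portuguese", "spanish"},
}

def traverse(start, relations):
    """Follow each relation in turn, carrying a set of entities at every stage."""
    current = {start}
    for r in relations:
        current = set().union(*(graph.get((e, r), set()) for e in current))
    return current

print(traverse("portugal", ["location", "language"]))
# -> {'portuguese', 'english', 'spanish'}
```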
sparse vectors → dense, low-dimensional vectors

location → {fernando_pessoa, jorge_sampaio, vasco_da_gama}
language → {portuguese, spanish, english}

sets → vectors!  adjacency matrices → traversal operators!

You can imagine that each of these sets is represented as a sparse vector.
We can then traverse from one set to another by multiplying the set vector by a relation's adjacency matrix.
The result is another vector whose non-zero entries represent the new set.
Now, the point of this is to connect graph traversal with vector space models.
Since vector space models rely on limiting their dimensionality to achieve generalization, let's imagine that we can take these sparse adjacency matrices and somehow compress them into dense matrices.
In the resulting model, we can still identify these vectors as set vectors.
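Here is a minimal sketch of the sparse half of this correspondence (my own illustration, not the talk's code): one adjacency matrix per relation, traversal as matrix-vector products. The entity indexing and scipy usage are assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix

entities = ["portugal", "fernando_pessoa", "jorge_sampaio", "portuguese", "english"]
idx = {e: i for i, e in enumerate(entities)}
n = len(entities)

def adjacency(edges):
    """Build a relation's adjacency matrix: A[t, s] = 1 if (s, relation, t) holds."""
    rows = [idx[t] for _, t in edges]
    cols = [idx[s] for s, _ in edges]
    return csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

A_location = adjacency([("portugal", "fernando_pessoa"), ("portugal", "jorge_sampaio")])
A_language = adjacency([("fernando_pessoa", "portuguese"), ("fernando_pessoa", "english"),
                        ("jorge_sampaio", "portuguese")])

v = np.zeros(n); v[idx["portugal"]] = 1.0   # one-hot set vector {portugal}
answer = A_language @ (A_location @ v)       # traverse location, then language
print([entities[i] for i in np.flatnonzero(answer)])  # -> ['portuguese', 'english']
```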
membership operator: dot product

score = x_english · (answer vector)

In the sparse vector setup, you can check membership by taking the dot product of the set vector with a one-hot vector representing english. If the resulting score is non-zero, english is in the set.
In the new dense setup, you can still analogously take the dot product with a dense vector representing english.
If the score is high, we say that english is in the set. If the score is low, then it is not.
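Continuing the sketch above (it reuses n, idx, and answer from the previous snippet), membership is a dot product in both regimes; the random dense embeddings and the idea of thresholding their scores are assumptions for illustration only.

```python
# Sparse regime: exact membership (non-zero dot product with a one-hot vector).
x_english = np.zeros(n); x_english[idx["english"]] = 1.0
print(x_english @ answer > 0)   # -> True: english is in the answer set

# Dense regime: the same dot product, but against learned vectors; membership
# becomes a soft score that we threshold (or rank) instead of testing non-zero.
rng = np.random.default_rng(0)
dense = rng.normal(size=(n, 16))                    # stand-in learned embeddings
dense_answer = dense.T @ answer                     # a dense "set vector"
print(float(dense[idx["english"]] @ dense_answer))  # high score ~ member
```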
q = s / r1 / r2 / … / rk
Is t an answer to q?
score(q, t) = x_s^T W_{r1} W_{r2} ⋯ W_{rk} x_t

Now let's look at what we just covered in more abstract form.
To compute the answer set, we start with x_s and multiply by a sequence of compressed adjacency matrices.
To check whether an entity t is in the answer set, we take the dot product of the answer vector with x_t to get a membership score.
So now the question is: where do we get these dense entity vectors x and compressed adjacency matrices W?
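Here is a minimal numpy sketch of this compositional score; the dimensions and random parameters are placeholders for learned ones.

```python
import numpy as np

d, n_entities, n_relations = 16, 100, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n_entities, d))                   # entity vectors x_e
W = rng.normal(size=(n_relations, d, d)) / np.sqrt(d)  # one traversal matrix per relation

def score(s, relations, t):
    """score(q, t) = x_s^T W_{r1} ... W_{rk} x_t for the path query q = s/r1/.../rk."""
    v = X[s]
    for r in relations:
        v = v @ W[r]          # traverse one relation in vector space
    return float(v @ X[t])    # membership via dot product

print(score(0, [1, 3], 42))
```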
Training

Training examples:  (q, t)
Margin:             margin(q, t, t′) = score(q, t) − score(q, t′)
Objective:          ∑_{i=1}^{N} ∑_{t′ ∈ N(q_i)} [1 − margin(q_i, t_i, t′)]_+
Algorithm:          SGD

Suppose we have a bunch of query-answer pairs to train on (q is the query, t is one answer).
We define the margin to be the difference between the score of a correct answer and the score of an incorrect answer t′; the objective penalizes any incorrect answer in the negative set N(q_i) whose score comes within 1 of the correct answer's score.
Optimization of that objective can be achieved in many ways, but we just choose SGD, or one of its variants.
So, I've just finished fully describing one possible model for answering path queries. But I said earlier that I would show how you can generalize many existing vector space models to answer path queries.
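A compact sketch of one SGD step on this margin objective, reusing X, W, and score from the snippet above. To keep it short, it only takes subgradients with respect to the entity vectors at the path's end; treating that as the whole update is my simplification, not the talk's procedure.

```python
def sgd_step(s, relations, t, t_neg, lr=0.01):
    """One SGD step on the hinge loss [1 - score(q, t) + score(q, t')]_+.
    A full implementation would also backpropagate into x_s and the W_r's."""
    v = X[s]
    for r in relations:
        v = v @ W[r]                      # v = x_s^T W_r1 ... W_rk
    loss = 1.0 - v @ X[t] + v @ X[t_neg]
    if loss > 0:                          # subgradient of the hinge
        X[t]     += lr * v                # push the correct answer's score up
        X[t_neg] -= lr * v                # push the negative's score down
    return max(float(loss), 0.0)

# Example: train on the path query 0/r1/r3 with answer 42 and a random negative.
print(sgd_step(0, [1, 3], t=42, t_neg=int(rng.integers(n_entities))))
```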
score(obama, nationality, united_states) = HIGH
score(obama, nationality, germany) = LOW
score(obama, nationality, france)?

scoring a single triple = a special case of answering path queries

The existing models that we will generalize all have the following form: the score should be high if the triple is true, and low otherwise.
You train the model to discriminate between true and false triples, and then at test time you predict missing facts by classifying unseen triples.
The main connection I want to make is that scoring triples is just a special case of answering path queries of length 1.
It's easy to see this pictorially.
single-hop:  s --r--> t
multi-hop:   s --r1--> · --r2--> · --r3--> · --r4--> t

Existing vector space models predict whether the single triple (s, r, t) is true.
The new path query models we propose predict whether a path exists between s and t.
                            Original model (single edge)   New model (path)
Bilinear (Nickel+, 2012)    x_s^T W_r x_t                  x_s^T W_{r1} W_{r2} ⋯ W_{rk} x_t
TransE (Bordes+, 2013)      −‖x_s + w_r − x_t‖₂²           −‖x_s + w_{r1} + ⋯ + w_{rk} − x_t‖₂²
More generally (this work)  M(T_r(x_s), x_t)               M(T_{rk}(⋯ T_{r1}(x_s)), x_t)

The bilinear model proposed by Nickel et al has a triple scoring function that looks like this.
This turns out to be the length-1 special case of the model we proposed five slides ago.
You can see that we identify W_r as a traversal operator and repeatedly apply it.
The TransE model proposed by Bordes et al is another triple scoring function. Again, we can identify translation by the vector w_r as the traversal operator and repeatedly apply it to answer path queries.
More generally, any triple-scoring function that decomposes into a traversal operator T_r and a membership operator M can be generalized to answer path queries.
So, now we can generalize any of these existing vector space models to answer path queries.
But perhaps more importantly, this generalization gives us a new way to train existing models.
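To make the general recipe concrete, here is a small sketch (mine) of the M(T(…)) decomposition, instantiated with TransE; the random parameters stand in for learned ones.

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
x = rng.normal(size=(100, d))   # entity vectors
w = rng.normal(size=(5, d))     # TransE translation vectors, one per relation

# Generic recipe: any model that factors into a traversal operator T_r and a
# membership operator M can be "path-ified" by iterating T before applying M.
def path_score(s, rels, t, T, M):
    v = x[s]
    for r in rels:
        v = T(r, v)
    return M(v, x[t])

# TransE instance: traversal is translation, membership is negative squared distance.
transe_T = lambda r, v: v + w[r]
transe_M = lambda v, xt: -float(np.sum((v - xt) ** 2))

print(path_score(0, [1, 3], 42, transe_T, transe_M))
```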
single-edge training:  s --r--> t
path training:         s --r1--> · --r2--> · --r3--> · --r4--> t

In the old way of training these models, you trained them to score single edges correctly. We will call this single-edge training.
We can now instead train these models to score full paths, of which single edges are a special case. We will call this path training.
Outline
PART I
PART II
We're now on to the second part of the talk, where we will see the surprising benefits of path training, and why it is better than single-edge training.
Experiments

A 2×2 grid: TRAINING ∈ {single-edge, path} × TASK ∈ {single-edge, path}; the path task is our running example, portugal/location/language?

Here is a simple way to think about the experimental results you are about to see.
One task is the path query task, where we evaluate a model's ability to answer path queries, just like the running example we've been using throughout this talk.
The second is the single-edge prediction task. This measures a model's ability to correctly classify triples as true or false. This is the original task that all of the previously mentioned vector space models were designed for.
But remember that the single-edge prediction task is just equivalent to answering path queries of length 1.
Note that in the matched (diagonal) cells of the grid, the training example distribution matches the test example distribution, whereas it does not in the off-diagonal cells.
Datasets [Chen et al, 2013]

              WordNet    Freebase
Entities      39,000     75,000
Relations     11         13
Train edges   113,000    316,000
Test edges    11,000     24,000

We work with the knowledge base completion datasets released by Chen, Socher, et al. in 2013.
WordNet is a knowledge graph where each node is a word, and the edges indicate important relationships between words.
Freebase, as you have already seen, contains common facts about important entities.
For both knowledge graphs, roughly 90% of the facts were used for training and 10% were held out to create a test set.
                       WordNet      Freebase
Single-edge training   113,000      316,000
Path training          2,000,000    6,000,000
Single-edge test       11,000       24,000
Path test              45,000       110,000

An important factor to note is that if a knowledge graph has several hundred thousand edges, it can contain over 100 times as many paths formed from those edges.
We sampled just a small subset of those paths to train and test on.
But that small subset is still about 20 times more data than the original edges.
Path sampling

[Figure: a sampled path s --r1--> · --r2--> · --r3--> · --r4--> t drawn from the knowledge graph.]

To better establish the effect of path training, we applied it to three different vector space models: Bilinear, Bilinear-Diag, and TransE. All of them were previously reported to achieve state-of-the-art results under different setups.
I've listed the traversal operators that each model corresponds to.
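The talk doesn't spell out the sampling procedure, so here is one plausible sketch: uniform random walks of bounded length over the training edges. Treat it as an assumption, not the authors' exact method.

```python
import random

def sample_paths(edges, num_paths, max_len=4, seed=0):
    """Sample (source, relations, target) training paths by random walks."""
    rng = random.Random(seed)
    out = {}                                 # adjacency: node -> [(relation, next)]
    for s, r, t in edges:
        out.setdefault(s, []).append((r, t))
    starts = list(out)
    paths = []
    while len(paths) < num_paths:
        start = node = rng.choice(starts)    # random start entity
        rels = []
        for _ in range(rng.randint(1, max_len)):
            if node not in out:
                break
            r, node = rng.choice(out[node])  # follow one outgoing edge
            rels.append(r)
        if rels:
            paths.append((start, tuple(rels), node))
    return paths

edges = [("portugal", "location", "fernando_pessoa"),
         ("fernando_pessoa", "language", "portuguese")]
print(sample_paths(edges, 3))
```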
Evaluation metric

portugal / location / language

1. klingon
2. portuguese   ← correct answer
3. latin
4. javascript
5. chinese

quantile = 75%

Given a path query, we score all possible answers. This results in a ranking over all answers.
We define the quantile to be the percentage of negatives ranked after the correct answer.
In this case, we would score 75% because three quarters of the wrong answers are ranked after the right one.
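A minimal sketch of this metric, assuming scores where higher is better; how the authors handled ties is not stated, so this version counts only negatives ranked strictly below the correct answer.

```python
def quantile(score_correct, negative_scores):
    """Percentage of negative answers ranked strictly below the correct answer."""
    below = sum(s < score_correct for s in negative_scores)
    return 100.0 * below / len(negative_scores)

# klingon outranks portuguese; latin, javascript, and chinese rank below it.
print(quantile(score_correct=3.0,
               negative_scores=[4.1, 2.5, 1.7, 0.2]))   # -> 75.0
```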
Experiments: the path query task

[Chart: mean quantile (axis 50 to 100) for Edge vs Path training of the Bilinear, Bilinear-Diag, and TransE models, on WordNet and Freebase.]

Looking first at the Bilinear model's performance on Freebase, we see that path training performs significantly better on the path task.
Expanding our view to both datasets and all three models, we see that this trend holds up across the board.
Path training definitely helps all these models improve at the path query task.
Why does path training help here? The first answer is not hard: our train and test distributions match.
But I think there's still a second question, which is: why should single-edge training be so bad at path queries? After all, if we know about all the individual edges, shouldn't we know about the paths that they form?
Cascading errors

Who is Tad Lincoln's grandparent?

We hypothesize that the problem is due to cascading errors, which arise as a side-effect of how vector space models try to compress knowledge into low-dimensional space.
Suppose that you want to know who Tad Lincoln's grandparent is.
And suppose that the "parent" traversal operation is represented by a translation to the right, as in this figure.
The single-edge training objective forces the vector for abraham_lincoln to be close to where it should be, denoted by the red circle. But because we are compressing facts into a low-dimensional space, and because we're using a ranking objective, abraham_lincoln never makes it to exactly where he should be. This results in some error, shown by the blue noise cloud.
After another step of traversal from abraham_lincoln to thomas_lincoln, you've accumulated more error.
As you traverse, the correct answer drifts farther and farther away from where you expect it to be.

path training

In contrast, the path training objective is directly sensitive to the gap between where thomas_lincoln is and where he should be. This error is quite large and easier for the model to detect, compared to many small errors.
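A toy simulation of this drift (my illustration, not the talk's experiment): composing noisy translation steps makes the expected deviation from the exact traversal grow with path length.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials, noise = 16, 1000, 0.1
w_parent = rng.normal(size=d)            # ideal "parent" translation

drift = np.zeros(4)
for _ in range(trials):
    v_true = v_hat = rng.normal(size=d)  # both traversals start at the same point
    for k in range(4):
        v_true = v_true + w_parent                             # exact traversal
        v_hat = v_hat + w_parent + noise * rng.normal(size=d)  # learned, noisy traversal
        drift[k] += np.linalg.norm(v_hat - v_true) / trials

print(drift.round(2))   # deviation grows each hop, roughly [0.4, 0.57, 0.69, 0.8]
```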
Example

We can actually check this hypothesis by looking at how much drift is happening in a path-trained model versus a single-edge-trained model.
We can do this because each prefix of a path query is itself a path query. So we can just look at the quality of the results at each stage.
Experiments: the single-edge task

Okay, so we finished looking at the path task and saw that path training was better.
But you might suspect that it doesn't fare so well on the single-edge task, since the training and test distributions no longer match.
We certainly had serious doubts about this ourselves, because multi-task training is notoriously tricky to get right.
Surprisingly, path training actually does better on the single-edge task as well, across both datasets and all models.
The one exception is TransE on Freebase, where path training doesn't hurt, but doesn't help much.
[Chart: mean quantile (axis 50 to 100) for Edge vs Path training of the Bilinear, Bilinear-Diag, and TransE models on the single-edge task, on WordNet and Freebase.]
Now, we definitely want an explanation for why path training does better even when the train and test distributions don't match.
So, weve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.
This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.
To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.
Then, this greatly increases our prior belief that perhaps C is As child as well.
We see that the head of the Horn clause is actually a path from A to C.
So, weve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.
This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.
To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.
Then, this greatly increases our prior belief that perhaps C is As child as well.
We see that the head of the Horn clause is actually a path from A to C.
So, weve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.
This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.
To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.
Then, this greatly increases our prior belief that perhaps C is As child as well.
We see that the head of the Horn clause is actually a path from A to C.
So, weve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.
This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.
To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.
Then, this greatly increases our prior belief that perhaps C is As child as well.
We see that the head of the Horn clause is actually a path from A to C.
So, weve known for a while that paths in a knowledge graph are actually important features that can help infer missing edges.
This was addressed in several papers, including the Path Ranking Algorithm by Ni Lao, and more recent works by Nickel et al and Neelakantan et al.
To give a quick example, suppose we know that A and B are married, and furthermore B has a child C.
Then, this greatly increases our prior belief that perhaps C is As child as well.
We see that the head of the Horn clause is actually a path from A to C.
[Diagram: A —spouse— B —child→ C, with a dashed candidate edge child? from A to C]
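To make the traversal concrete, here is a minimal sketch (not from the talk; the entity names are made up) of walking the body of this Horn clause over a toy triple store:

```python
# Toy triple store; the Horn clause is
#   spouse(A, B) AND child(B, C)  =>  child(A, C)
triples = {
    ("alice", "spouse", "bob"),
    ("bob", "child", "carol"),
}

def follow(source, relation):
    """All objects reachable from `source` via one `relation` edge."""
    return {t for (s, r, t) in triples if s == source and r == relation}

# Walk the clause body: alice --spouse--> ? --child--> ?
candidates = {c for b in follow("alice", "spouse") for c in follow(b, "child")}
print(candidates)  # {'carol'}: evidence for the missing edge child(alice, carol)
```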
The next point is that if you cannot even model these paths correctly, then you have no chance of using them to infer missing edges.
If you can't assert the body of a Horn clause, you can't infer the head of the Horn clause.
And we know from earlier results that the single-edge model is not very good at modeling paths.
[Table: results broken down by training regime (single-edge vs. path) and evaluation task (single-edge vs. path)]
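As one illustration of what path training asks of a model, here is a sketch under TransE-style assumptions (toy random vectors; the choice of operator is mine, for concreteness): the traversal operator is vector addition, and a path query is scored by how close the composed traversal lands to the target entity.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
entity = {"alice": rng.normal(size=dim), "carol": rng.normal(size=dim)}
relation = {"spouse": rng.normal(size=dim), "child": rng.normal(size=dim)}

def traverse(x, path):
    # TransE traversal: T_r(x) = x + w_r, applied once per relation on the path.
    for r in path:
        x = x + relation[r]
    return x

def path_score(source, path, target):
    # Less negative = better: the composed traversal should land near the target.
    return -np.linalg.norm(traverse(entity[source], path) - entity[target])

# Path training optimizes this score for observed paths (against corrupted ones).
print(path_score("alice", ["spouse", "child"], "carol"))
```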
Based on this hunch, we did an experiment to assess whether path training was really helping models learn Horn clauses.
We grouped a lot of Horn clauses into 3 categories: high precision, low precision, and zero coverage.
Then, for each Horn clause, we measured how similar the traversal operation representing the body of the Horn clause was to the head of the Horn clause.
When the head and body are very similar, then the model is in some sense implementing the Horn clause.
We found that when a Horn clause has high precision, path training pulls the body and head of the Horn clause closer together.
When it has low precision, path training does not care too much.
And when a Horn clause has zero coverage, path training has almost no effect in most cases.
So this definitely provides some interesting evidence in favor of our hypothesis, but we certainly think this could be investigated more carefully.
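A sketch of the kind of measurement involved, again under TransE-style composition (toy vectors, hypothetical relations): the clause body spouse/child composes to w_spouse + w_child, and we compare it to the vector of the head relation.

```python
import numpy as np

rng = np.random.default_rng(1)
w = {r: rng.normal(size=8) for r in ["spouse", "child"]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

body = w["spouse"] + w["child"]  # composed traversal for the clause body
head = w["child"]                # relation predicted by the clause head
print(cosine(body, head))        # near 1.0 means the model "implements" the clause
```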
To recap
Graph databases
Compositional queries
Handle incompleteness
So, to recap! We've demonstrated how to take existing vector space models and generalize them to handle path queries.
In the process, we also discovered that path training leads to stronger performance on predicting missing facts.
We have some theories on why this is, but there is still more to investigate.
Connections and speculative thoughts
Since we still have a little time, I'd like to end with some connections to other work, and some speculative thoughts.
RNNs
Repeated application of traversal operators can be thought of as implementing a recurrent neural network.
There's been a lot of work on RNNs, and we think this helps build a connection between RNNs and knowledge bases.
Neelakantan et al. also previously explored using RNNs for knowledge base completion and saw great results.
$T_{r_k}(\cdots T_{r_2}(T_{r_1}(x_s)) \cdots)$
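A sketch of this view (a toy setup of my own; the bilinear-plus-nonlinearity operator is one possible choice, not the talk's): the source entity's vector is the initial hidden state, and each relation on the path acts like an input token that updates the state.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8
W = {r: rng.normal(size=(dim, dim)) / np.sqrt(dim) for r in ["r1", "r2", "rk"]}

def rnn_traverse(x_s, path):
    h = x_s
    for r in path:             # one "time step" per relation on the path
        h = np.tanh(W[r] @ h)  # hidden-state update, h_t = T_{r_t}(h_{t-1})
    return h                   # = T_rk(... T_r2(T_r1(x_s)) ...)

print(rnn_traverse(rng.normal(size=dim), ["r1", "r2", "rk"]))
```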
Matrix factorization
[Figure: a relation's adjacency matrix factored into three parts, $E^\top W_r E$]
[Nickel et al., 2013], [Bordes et al., 2011]
One angle we did not cover much at all is the interpretation of knowledge graph embedding as low-rank tensor or matrix factorization.
In this view, we are taking an adjacency matrix and factorizing it into three parts.
[Figure: the third power of the adjacency matrix, factored in the same low-rank form]
With path training, it is as if we raised the adjacency matrix to a few powers, and are now performing low-rank matrix factorization on that higher-power adjacency matrix.
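A toy numerical illustration of that intuition (my own example, not from the slides): squaring a 0/1 adjacency matrix counts length-2 paths, and factorizing the squared matrix at low rank is the matrix-power analogue of path training.

```python
import numpy as np

A = np.array([[0, 1, 0],   # alice -> bob
              [0, 0, 1],   # bob -> carol
              [0, 0, 0]], dtype=float)

A2 = A @ A                 # A2[i, j] = number of length-2 paths from i to j
print(A2)                  # the alice -> carol entry is 1

# Truncated SVD gives a low-rank factorization of this "path" matrix,
# analogous to learning embeddings that preserve two-step reachability.
U, S, Vt = np.linalg.svd(A2)
rank = 1
print(np.round(U[:, :rank] * S[:rank] @ Vt[:rank, :], 2))
```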
Graph-like data
Knowledge graphs are not the only kind of graph-like data, and it would be interesting to see if path training can be helpful there too.
[Diagram: a textual entailment graph — sentences such as "the boy happily ate ice cream" and "the child hated dessert" as nodes, joined by entailment and negation edges]
[Bowman et al., 2015], [Dagan et al., 2009]
Textual entailment is another domain that has graph-like structure.
Here, each node in the graph is actually a sentence, and each edge is an entailment relation.
Since we can embed sentences into vector space, path training might serve as a useful form of regularization on the sentence embedding function.
Graph-like data
word co-occurrences
[Levy, 2014], [Pennington, 2014], [Mikolov, 2013]
Finally, word co-occurrence probabilities form a very dense graph structure, and many word embedding models have been linked to matrix factorization, so it would be interesting to see if path training could be helpful there as well.
Thank you!