Academic documents
Professional documents
Cultural documents
[Figure: translating a source sentence F into English one word at a time, producing "... am giving a talk at CMU (end)". At each position i the model assigns a probability to every candidate next word, e.g. P(e_1 = "talk" | F) = 0.03, P(e_1 = "it" | F) = 0.01, ...]

The probability of the whole translation E decomposes into one prediction per word (including the final "(end)" symbol):

P(E | F) = \prod_{i=1}^{I+1} P(e_i | F, e_1^{i-1})
A Translation Algorithm

    i = 0
    while e_i is not equal to (end):
        i = i + 1
        e_i = argmax_e P(e_i = e | F, e_1^{i-1})
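A minimal Python sketch of this greedy loop; the predict_distribution(F, prefix) helper, which would return a dict mapping each candidate next word to P(e_i | F, e_1^{i-1}), is a hypothetical stand-in for whatever model is used:

    def greedy_translate(F, predict_distribution, max_len=100):
        # Greedily pick the most probable next word until "(end)" is produced.
        output = []
        while len(output) < max_len:
            dist = predict_distribution(F, output)   # P(e_i | F, e_1^{i-1})
            next_word = max(dist, key=dist.get)      # argmax over candidate words
            output.append(next_word)
            if next_word == "(end)":
                break
        return output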
Amazing results:
  Within three years of invention, outperforming models developed over the past 15 years, and deployed in commercial systems
Incredibly simple implementation:
  Traditional machine translation (e.g. 6k lines of Python)
  Neural machine translation (e.g. 280 lines of Python)
Machine translation as machine learning:
  Easy to apply new machine learning techniques directly
Predicting Probabilities

P(E | F) = \prod_{i=1}^{I+1} P(e_i | F, e_1^{i-1})
Predicting by Counting

P(w_i | w_1 ... w_{i-1}) = c(w_1 ... w_i) / c(w_1 ... w_{i-1})

  i live in pittsburgh . </s>
  i am a graduate student . </s>
  my home is in michigan . </s>

P(live | <s> i) = c(<s> i live) / c(<s> i) = 1 / 2 = 0.5
P(am | <s> i) = c(<s> i am) / c(<s> i) = 1 / 2 = 0.5
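A small Python sketch of this counting estimator over the three example sentences (the <s> start symbol is added explicitly, as the counts above assume):

    from collections import Counter

    sentences = [
        "i live in pittsburgh . </s>",
        "i am a graduate student . </s>",
        "my home is in michigan . </s>",
    ]

    counts = Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        # Count every prefix starting at the sentence beginning.
        for i in range(1, len(words) + 1):
            counts[tuple(words[:i])] += 1

    def prob(word, *history):
        return counts[tuple(history) + (word,)] / counts[tuple(history)]

    print(prob("live", "<s>", "i"))   # 0.5
    print(prob("am", "<s>", "i"))     # 0.5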
[Figure: a linear model scoring candidate next words (a, the, talk, gift, hat) given the preceding words "giving a". The total score is the sum of feature weight vectors:

    b            = ( 3.0,  2.5, -0.2, 0.1,  1.2)    how likely are the words we're predicting overall?
    w_{1,a}      = (-6.0, -5.1,  0.2, 0.1,  0.6)    how likely is each word given the previous word is "a"?
    w_{2,giving} = (-0.2, -0.3,  1.0, 2.0, -1.2)    how likely is each word given the word two back is "giving"?
    s (total)    = (-3.2, -2.9,  1.0, 2.2,  0.6)  ]
The scores are turned into probabilities with a softmax:

p(e_i = x | e_{i-n+1}^{i-1}) = \frac{\exp s(e_i = x | e_{i-n+1}^{i-1})}{\sum_{\tilde{x}} \exp s(e_i = \tilde{x} | e_{i-n+1}^{i-1})}

p(e_i | e_{i-n+1}^{i-1}) = softmax( s(e_i | e_{i-n+1}^{i-1}) )

[Figure: for the candidates (a, the, talk, gift, hat) with scores s = (-3.2, -2.9, 1.0, 2.2, 0.6), the slide shows the resulting probabilities p = (0.002, 0.003, 0.329, 0.444, 0.090).]
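A minimal sketch of the score-plus-softmax computation in Python (numpy assumed; the vectors are the illustrative values from the figures above, and the variable names are mine):

    import numpy as np

    words = ["a", "the", "talk", "gift", "hat"]

    b         = np.array([ 3.0,  2.5, -0.2, 0.1,  1.2])   # overall word likelihood
    w1_a      = np.array([-6.0, -5.1,  0.2, 0.1,  0.6])   # previous word is "a"
    w2_giving = np.array([-0.2, -0.3,  1.0, 2.0, -1.2])   # word two back is "giving"

    def softmax(s):
        # Subtract the max for numerical stability; the result sums to 1.
        z = np.exp(s - s.max())
        return z / z.sum()

    s = b + w1_a + w2_giving        # total score per candidate word
    p = softmax(s)                  # probability distribution over candidates
    print(dict(zip(words, p.round(3))))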
Learning: adjust the weights in the direction of the gradient of the probability of the training data,

d = \frac{d}{dw} p(e_i | e_{i-n+1}^{i-1})    (gradient of the probability)

w \leftarrow w + \eta d

where \eta is a learning rate.
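A sketch of one such gradient step for the bias vector b of the model above (numpy assumed). The slide writes the gradient of the probability; for illustration this sketch uses the log-probability instead, whose softmax gradient has the simple "one-hot minus p" form, and the learning rate value is my choice:

    import numpy as np

    def softmax(s):
        z = np.exp(s - s.max())
        return z / z.sum()

    eta = 0.1                                            # learning rate (illustrative)
    b        = np.array([3.0, 2.5, -0.2, 0.1, 1.2])
    features = np.array([-6.2, -5.4, 1.2, 2.1, -0.6])    # sum of the other weight vectors (held fixed here)
    correct  = 3                                         # index of the word observed in training ("gift")

    p = softmax(b + features)
    d = np.eye(len(b))[correct] - p      # d/db log p(correct word) = onehot(correct) - p
    b = b + eta * d                      # w <- w + eta * d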
[Figure: a purely linear model needs explicit combination features over pairs of context words to capture their interaction, e.g. weight vectors w_{2,1,farmers,eat} = (2.0, -2.1) and w_{2,1,cows,eat} = (-1.2, 2.9).]
Neural Networks
[Figure: a multi-layer perceptron. Inputs φ_0[0] and φ_0[1] feed two hidden units with step activations: φ_1[0] = step(w_{0,0} · φ_0 + b_{0,0}) with w_{0,0} = (1, 1), b_{0,0} = -1, and φ_1[1] = step(w_{0,1} · φ_0 + b_{0,1}) with w_{0,1} = (-1, -1), b_{0,1} = -1.]
Example

[Figure: in the hidden-layer space the training points become linearly separable: φ_1(x1) = {-1, -1}, φ_1(x2) = {1, -1}, φ_1(x3) = {-1, 1}, φ_1(x4) = {-1, -1}. A final unit combines φ_1[0] and φ_1[1] with weights (1, 1) and bias 1 to produce the output φ_2[0] = y.]
Example

[Figure: the same network with tanh activations in place of the step functions, using the same weights.]
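A minimal numpy sketch of this two-layer network, with the weights read off the figures (treat the exact values as illustrative):

    import numpy as np

    def forward(x, act=np.tanh):
        # Hidden layer: two units with weights (1, 1) and (-1, -1), both with bias -1.
        h = act(np.array([x @ np.array([ 1.0,  1.0]) - 1.0,
                          x @ np.array([-1.0, -1.0]) - 1.0]))
        # Output unit: weights (1, 1), bias +1.
        return act(h @ np.array([1.0, 1.0]) + 1.0)

    for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
        print(x, round(float(forward(np.array(x, dtype=float))), 2))
    # The output is positive when the two inputs agree and negative otherwise,
    # a function a single linear layer could not represent.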
[Figure: a neural language model without a hidden layer: the previous n-1 words are fed through matrices W_1, ..., W_{n-1}, summed with a bias b, and passed through a softmax.]

p_i = softmax\Big( b + \sum_{k=1}^{n-1} W_k e_{i-k} \Big)
Adding a hidden layer with a tanh non-linearity:

h_i = tanh\Big( b + \sum_{k=1}^{n-1} W_k e_{i-k} \Big)

p_i = softmax( W_h h_i )

[Figure: the tanh activation function, which squashes its input into the range (-1, 1).]
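A rough numpy sketch of this forward pass for a model that looks at the two previous words (n = 3); the vocabulary size, hidden size, and random weights are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    V, H = 1000, 64                                       # vocabulary and hidden sizes (placeholders)
    W = [rng.normal(0, 0.1, (H, V)) for _ in range(2)]    # W_1, W_2 for the two previous words
    b = np.zeros(H)
    W_h = rng.normal(0, 0.1, (V, H))

    def softmax(s):
        z = np.exp(s - s.max())
        return z / z.sum()

    def predict(prev_word_ids):
        # h_i = tanh(b + sum_k W_k e_{i-k}); one-hot multiplication becomes a column lookup.
        h = np.tanh(b + sum(W[k][:, w] for k, w in enumerate(prev_word_ids)))
        return softmax(W_h @ h)                           # p_i = softmax(W_h h_i)

    p = predict([3, 17])                                  # distribution over all next words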
[Figure: where strength is shared. Context words that behave similarly (cows, horses, sheep, goats) can map to the same hidden feature through W_1, and that feature can raise the scores of related output words (eat, consume, ingest), so the model generalizes across similar words.]
[Figure: the feed-forward neural language model applied to a sentence ("<s> this is ... pen </s>"): at each position the previous words are combined through W_1, W_2, a tanh hidden layer h_i, and W_h, and a softmax produces p_i.]
Recurrent neural networks pass the output of the hidden layer from the last time step back to the hidden layer in the next time step.

[Figure: a recurrent language model: the hidden state h_i is computed with tanh from the previous words (through W_1, W_2), a bias b, and the previous hidden state (through W_r), and p_i = softmax(W_h h_i).]

Example:
  He doesn't have very much confidence in himself
  She doesn't have very much confidence in herself
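A minimal numpy sketch of one step of such a recurrent language model; the sizes, the random weights, and the choice to feed only the single previous word are my assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    V, H = 1000, 64                          # vocabulary and hidden sizes (placeholders)
    W_1 = rng.normal(0, 0.1, (H, V))         # previous-word weights
    W_r = rng.normal(0, 0.1, (H, H))         # recurrent weights
    W_h = rng.normal(0, 0.1, (V, H))
    b = np.zeros(H)

    def softmax(s):
        z = np.exp(s - s.max())
        return z / z.sum()

    def rnn_step(prev_word_id, h_prev):
        # h_i = tanh(b + W_1 e_{i-1} + W_r h_{i-1});  p_i = softmax(W_h h_i)
        h = np.tanh(b + W_1[:, prev_word_id] + W_r @ h_prev)
        return h, softmax(W_h @ h)

    h = np.zeros(H)
    for w in [5, 42, 7]:                     # token ids of the sentence so far
        h, p = rnn_step(w, h)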
[Figure: the recurrent network unrolled over time, with inputs x_1, ..., x_4 feeding a chain of network cells that produce outputs y_2, ..., y_4.]
[Figure: the recurrent language model reading a sentence ("<s> this is ... pen </s>") token by token.]
[Figure: the unrolled network with inputs x_1, ..., x_4, outputs y_2, ..., y_4, and per-time-step terms o_1, ..., o_4 used during training.]
[Figure: the vanishing gradient problem: as the signal is propagated back through the unrolled network, it shrinks from "medium" to "small" to "tiny" at the earlier time steps.]
Encoder-Decoder Translation Model [Kalchbrenner+ 13, Sutskever+ 14]
[Figure: the encoder reads the source words f_1^J into a hidden representation, and the decoder predicts each target word from that representation and the previously generated words:]

P(E | F) = \prod_i P(e_i | f_1^J, e_1^{i-1})
Example of Generation

[Figure: the decoder generating the translation one word at a time ("this", "is", ...), each word chosen greedily:]

\hat{e}_i = argmax_e P(e_i = e | f_1^J, \hat{e}_1^{i-1})
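A compact sketch of encoder-decoder generation with plain RNN cells (numpy; the sizes, random weights, and end-of-sentence id are placeholders rather than the models described in the cited papers):

    import numpy as np

    rng = np.random.default_rng(0)
    V, H, EOS = 1000, 64, 0                  # vocab size, hidden size, end-of-sentence id (placeholders)
    E_src = rng.normal(0, 0.1, (H, V));  E_trg = rng.normal(0, 0.1, (H, V))
    W_enc = rng.normal(0, 0.1, (H, H));  W_dec = rng.normal(0, 0.1, (H, H))
    W_out = rng.normal(0, 0.1, (V, H))

    def softmax(s):
        z = np.exp(s - s.max())
        return z / z.sum()

    def translate(src_ids, max_len=20):
        # Encoder: compress the source sentence into the final hidden state.
        h = np.zeros(H)
        for f in src_ids:
            h = np.tanh(E_src[:, f] + W_enc @ h)
        # Decoder: generate greedily, word by word, until the end symbol.
        out, prev = [], EOS
        for _ in range(max_len):
            h = np.tanh(E_trg[:, prev] + W_dec @ h)
            prev = int(np.argmax(softmax(W_out @ h)))
            if prev == EOS:
                break
            out.append(prev)
        return out

    print(translate([12, 7, 431]))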
[Table: BLEU comparison; the extracted scores are 33.30, 26.17, and 34.81, but the system labels were lost.]
Trick 1: Beam Search

[Figure: searching for a translation of the reference "an animal": starting from "<s>", the search keeps several candidate continuations (e.g. "an", "the", "dog", "animal") and expands the most promising hypotheses rather than committing to a single greedy choice.]
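A minimal beam-search sketch over the same kind of next-word distribution used in the greedy algorithm earlier; the predict_distribution helper and the beam size are assumptions:

    import math

    def beam_search(F, predict_distribution, beam_size=3, max_len=20):
        # Each hypothesis is (log probability, list of words generated so far).
        beam = [(0.0, [])]
        finished = []
        for _ in range(max_len):
            candidates = []
            for logp, words in beam:
                dist = predict_distribution(F, words)        # P(next word | F, prefix)
                for w, p in dist.items():
                    candidates.append((logp + math.log(p), words + [w]))
            # Keep only the best few hypotheses.
            candidates.sort(key=lambda c: c[0], reverse=True)
            beam = []
            for logp, words in candidates[:beam_size]:
                (finished if words[-1] == "(end)" else beam).append((logp, words))
            if not beam:
                break
        return max(finished + beam, key=lambda c: c[0])[1]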
Trick 2: Ensembling

Combine the predictions of multiple independently trained models, e.g. by averaging their probability distributions:

  word   Model 1 (p1)   Model 2 (p2)   Model 1+2 (pe)
  a      0.002          0.030          0.016
  the    0.003          0.021          0.012
  talk   0.329          0.410          0.370
  gift   0.444          0.065          0.260
  hat    0.090          0.440          0.265
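A sketch of linear-interpolation ensembling of the two distributions shown above (numpy assumed; equal interpolation weights are my choice):

    import numpy as np

    words = ["a", "the", "talk", "gift", "hat"]
    p1 = np.array([0.002, 0.003, 0.329, 0.444, 0.090])
    p2 = np.array([0.030, 0.021, 0.410, 0.065, 0.440])

    # Equal-weight linear interpolation; the result is still a valid distribution.
    pe = 0.5 * p1 + 0.5 * p2
    print(dict(zip(words, pe.round(3))))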
  System           BLEU   RIBES
  Moses PBMT       38.6   80.3
  Encoder-Decoder  39.0   82.9
Problems with the encoder-decoder. The slide compares an Input sentence with the True translation and the PBMT and EncDec outputs (the label-to-sentence mapping was lost in extraction); the extracted sentences are "you 'll have to have a cast .", "i have a .", and "you have to have a chance .". One noted failure mode is repeating words.
Attentional Models
[Figure: PBMT vs. RNN encoder-decoder results, from Pouget-Abadie+ 2014.]
Looking Carefully at (One Step of) Attention

[Figure: the encoder states for the source words "this", "is", "a", "pen" receive attention scores a_1, ..., a_4. A softmax turns the scores into weights, the context vector is the weighted sum α_1·h_1 + α_2·h_2 + α_3·h_3 + α_4·h_4, and this vector is transformed (·W + b) and passed through a softmax to give P(e_1 | F).]
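A minimal numpy sketch of this single attention step; the dot-product scoring function, the sizes, and the random values are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    H, V = 8, 1000                            # hidden size, vocabulary size (placeholders)
    enc = rng.normal(size=(4, H))             # encoder states for "this is a pen"
    query = rng.normal(size=H)                # current decoder state
    W_out = rng.normal(0, 0.1, (V, H))
    b_out = np.zeros(V)

    def softmax(s):
        z = np.exp(s - s.max())
        return z / z.sum()

    a = enc @ query                           # attention scores a_1..a_4 (dot-product scoring assumed)
    alpha = softmax(a)                        # normalized attention weights
    context = alpha @ enc                     # weighted sum of encoder states
    p_e1 = softmax(W_out @ context + b_out)   # softmax(context * W + b) = P(e_1 | F)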
Exciting Results!
  IWSLT 2015: best results on de-en
  WMT 2016: best results on most language pairs
  WAT 2016: best results on most language pairs (NAIST/CMU model 1st on ja-en)
Conclusion/Tutorials/Papers
Conclusion
Neural MT is exciting!
Tutorials
References

P. Arthur, G. Neubig, and S. Nakamura. Incorporating discrete translation lexicons into neural machine translation. In Proc. EMNLP, 2016.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proc. ICLR, 2015.
Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, volume 194, pages 137-186. 2006.
S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37-50, Jan 2000.
J. Chung, K. Cho, and Y. Bengio. A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147, 2016.
O. Firat, K. Cho, and Y. Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proc. NAACL, pages 866-875, 2016.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In Proc. ACL, pages 1-10, 2015.
M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In Proc. EMNLP, pages 1700-1709, Seattle, Washington, USA, 2013. Association for Computational Linguistics.
M.-T. Luong, I. Sutskever, Q. Le, O. Vinyals, and W. Zaremba. Addressing the rare word problem in neural machine translation. In Proc. ACL, pages 11-19, 2015.
T. Luong, R. Socher, and C. Manning. Better word representations with recursive neural networks for morphology. In Proc. CoNLL, pages 104-113, 2013.
References

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Proc. InterSpeech, pages 1045-1048, 2010.
A. Mnih and Y. W. Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
M. Nakamura, K. Maruyama, T. Kawabata, and K. Shikano. Neural network approach to word category prediction for English texts. In Proc. COLING, 1990.
M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In Proc. ICLR, 2016.
R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In Proc. ACL, pages 1715-1725, 2016.
S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In Proc. ACL, pages 1683-1692, 2016.
R. Socher, C. C. Lin, C. Manning, and A. Y. Ng. Parsing natural scenes and natural language with recursive neural networks. Pages 129-136, 2011.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, pages 3104-3112, 2014.
A. Vaswani, Y. Zhao, V. Fossum, and D. Chiang. Decoding with large-scale neural language models improves translation. In Proc. EMNLP, pages 1387-1392, 2013.
S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proc. EMNLP, pages 1296-1306, 2016.
B. Zoph, D. Yuret, J. May, and K. Knight. Transfer learning for low-resource neural machine translation. In Proc. EMNLP, pages 1568-1575, 2016.