Recent success with neural networks
[Figure: neural networks mapping a voice signal to a transcription, an image to a text caption, and a game state to the next move]
• Associationism
– Humans learn through association
• 400 BC to 1900 AD: Plato, David Hume, Ivan Pavlov, ...
What are “Associations”?
• Association!
Observation: The Brain
• In 1873, Bain postulated that there must be one million neurons and
5 billion connections relating to 200,000 “acquisitions”
• In 1883, Bain was concerned that he hadn’t taken into account the
number of “partially formed associations” and the number of neurons
responsible for recall/learning
[Figure: a conventional computer separates the program and data in memory from the processor, whereas a connectionist network holds both processing and memory in the network itself]
[Figure: a biological neuron, with dendrites receiving inputs and an axon connecting to the dendrite of neuron Y]
• Frank Rosenblatt
– Psychologist, Logician
– Inventor of the solution to everything, aka the Perceptron (1958)
Simplified mathematical model
Y = 1 if Σ_i w_i x_i + b > 0, else 0
His “Simple” Perceptron
• Originally assumed could represent any Boolean circuit and
perform any logic
– “the embryo of an electronic computer that [the Navy] expects
will be able to walk, talk, see, write, reproduce itself and be
conscious of its existence,” New York Times (8 July) 1958
– “Frankenstein Monster Designed by Navy That Thinks,” Tulsa,
Oklahoma Times 1958
Also provided a learning algorithm
w ← w + η (d(x) − y(x)) x
Sequential learning:
d(x) is the desired output in response to input x
y(x) is the actual output in response to x
• Boolean tasks
• Update the weights whenever the perceptron
output is wrong
• Proved convergence
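A minimal sketch of this update rule (the AND task, learning rate, and variable names below are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_output(w, b, x):
    # Threshold unit: fires if the weighted sum plus bias is positive
    return 1 if np.dot(w, x) + b > 0 else 0

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Sequential perceptron learning:
    w <- w + eta * (d(x) - y(x)) * x, applied whenever the output is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, d):
            y = perceptron_output(w, b, x)
            if y != target:
                w += eta * (target - y) * x
                b += eta * (target - y)
                errors += 1
        if errors == 0:   # converged: every training point is classified correctly
            break
    return w, b

# Toy task: Boolean AND (linearly separable, so the rule is guaranteed to converge)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, d)
print(w, b, [perceptron_output(w, b, x) for x in X])
```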
Perceptron
[Figure: single perceptrons computing X∧Y (weights 1, 1, threshold 2), X∨Y (weights 1, 1, threshold 1), and X̄ (weight −1); no choice of weights lets a single perceptron compute X⊕Y, but a two-layer network can, e.g. with hidden units X∨Y and X̄∨Ȳ feeding an output unit]
• XOR
– The first layer is a “hidden” layer
– Also originally suggested by Minsky and Papert, 1968
A more generic model
( A & X̄ & Z | A & Ȳ ) & ( X & Y | X̄ & Z )
[Figure: a multi-layer network of threshold units over inputs X, Y, Z, A that computes this Boolean function]
• A “multi-layer” perceptron
• Can compose arbitrarily complicated Boolean
functions!
– More on this in the next part
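As an illustrative sketch of such composition (the weights and thresholds below are one choice of mine, not the ones in the figures), X⊕Y can be built from three threshold units: an OR, a NAND, and an AND:

```python
def step(z):
    # Threshold activation: fires (1) if z >= 0, else 0
    return 1 if z >= 0 else 0

def xor_mlp(x, y):
    h1 = step(x + y - 1)        # hidden unit 1: X OR Y
    h2 = step(-x - y + 1.5)     # hidden unit 2: NOT (X AND Y), i.e. NAND
    return step(h1 + h2 - 2)    # output unit: h1 AND h2  ->  X XOR Y

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", xor_mlp(x, y))
```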
Story so far
• Neural networks began as computational models of the brain
• Neural network models are connectionist machines
– They comprise networks of neural units
• McCulloch and Pitts model: Neurons as Boolean threshold units
– Models the brain as performing propositional logic
– But no learning rule
• Hebb’s learning rule: Neurons that fire together wire together
– Unstable
• Rosenblatt’s perceptron: A variant of the McCulloch and Pitts neuron with
a provably convergent learning rule
– But individual perceptrons are limited in their capacity (Minsky and Papert)
• Multi-layer perceptrons can model arbitrarily complex Boolean functions
But our brain is not Boolean
[Figure: a perceptron with real-valued inputs x1 … xN; the line w1x1 + w2x2 = T splits the (x1, x2) plane into a region where the output is 1 and a region where it is 0]
y = 1 if Σ_i w_i x_i ≥ T, else 0
• A perceptron operates on real-valued vectors
– This is a linear classifier
Boolean functions with a real perceptron
[Figure: Boolean inputs plotted at the corners (0,0), (0,1), (1,0), (1,1) of the unit square, with lines showing how a real-valued perceptron realizes Boolean functions as linear separators]
Booleans over the reals
[Figure: five perceptrons y1 … y5, one per edge of a pentagon in the (x1, x2) plane, feeding a unit that asks Σ_{i=1}^{5} y_i ≥ 5?; the sum reaches 5 only inside the pentagon, so the second-layer unit is an AND of the five half-planes; such AND regions can in turn be combined by an OR unit]
Complex decision boundaries
[Figure: deciding whether a 784-dimensional MNIST input is “Not 2” requires a highly complex decision boundary]
Story so far
• MLPs are connectionist computational models
– Individual perceptrons are the computational equivalent of neurons
– The MLP is a layered composition of many perceptrons
So what does the perceptron really model?
• Is there a “semantic” interpretation?
Let’s look at the weights
[Figure: a perceptron over inputs x1 … xN with weights w1 … wN and threshold T]
z = Σ_i w_i x_i − T
y = 1 if z ≥ 0, else 0;  equivalently  y = 1 if xᵀw ≥ T, else 0
• A threshold unit
– “Fires” if the weighted sum of inputs and the
“bias” T is positive
The “soft” perceptron
[Figure: inputs x1 … xN with weights w1 … wN and bias −T feeding a summing unit]
z = Σ_i w_i x_i − T
y = 1 / (1 + exp(−z))
[Plots: the sigmoid and tanh activation functions]
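A small sketch contrasting the hard threshold unit with the “soft” sigmoid unit defined above (weights, threshold, and inputs are illustrative):

```python
import numpy as np

def hard_perceptron(w, x, T):
    # Fires only when the weighted sum reaches the threshold T
    return 1 if np.dot(w, x) >= T else 0

def soft_perceptron(w, x, T):
    # Sigmoid of z = sum_i w_i x_i - T : a graded value in (0, 1)
    z = np.dot(w, x) - T
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, 1.0])
T = 1.5
for x in [np.array([0.0, 0.0]), np.array([0.7, 0.7]), np.array([1.0, 1.0])]:
    print(x, hard_perceptron(w, x, T), round(soft_perceptron(w, x, T), 3))
```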
• A network of perceptrons
– Generally “layered”
Aside: Note on “depth”
[Figure: the two-layer XOR network and the deeper Boolean network over X, Y, Z, A from earlier, used to illustrate what counts as the “depth” of a network]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions
MLP as Boolean Functions
( A & X̄ & Z | A & Ȳ ) & ( X & Y | X̄ & Z )
[Figure: the multi-layer threshold network over inputs X, Y, Z, A computing this function, repeated from earlier]
X1 X2 X3 X4 X5 Y
0 0 1 1 0 1
0 1 0 1 1 1
0 1 1 0 0 1
1 0 0 0 1 1
1 0 1 1 1 1
1 1 0 0 1 1
• DNF form:
– Find groups
– Express as reduced DNF
Reducing a Boolean Function
[Karnaugh map over variables W, X, Y, Z (rows WX, columns YZ, each in order 00, 01, 11, 10)]
Basic DNF formula will require 7 terms
Reducing a Boolean Function
[Karnaugh map over W, X, Y, Z with the groups of adjacent 1s circled]
O = Ȳ Z̄ + W X̄ Ȳ + X Ȳ Z̄
[Karnaugh maps for a Boolean function of six variables U, V, W, X, Y, Z: four 4×4 maps over (WX, YZ), one per value of UV]
• How many neurons in a DNF (one-hidden-layer) MLP for this Boolean function of 6 variables?
Width of a single-layer Boolean MLP
[Karnaugh maps repeated]
Can be generalized: will require 2^(N−1) perceptrons in the hidden layer
– Exponential in N
10 11
01
How manyUVunits
00 01if 11we 10
use00multiple
YZ layers?
• How many neurons in a DNF (one-hidden-
layer) MLP for this Boolean function
Width of a deep MLP
[Karnaugh maps: the checkerboard patterns of the 4-variable and 6-variable parity functions]
O = W ⊕ X ⊕ Y ⊕ Z        O = U ⊕ V ⊕ W ⊕ X ⊕ Y ⊕ Z
Multi-layer perceptron XOR
[Figure: the two-layer XOR network with hidden units X∨Y and X̄∨Ȳ, repeated from earlier]
[Karnaugh maps of the 4-variable and 6-variable parity (XOR) functions and the deep MLPs that compute them]
O = W ⊕ X ⊕ Y ⊕ Z : 9 perceptrons over inputs W, X, Y, Z
O = U ⊕ V ⊕ W ⊕ X ⊕ Y ⊕ Z : 15 perceptrons over inputs U, V, W, X, Y, Z
• More generally, the XOR of N variables will require 3(N−1) perceptrons!
O = X1 ⊕ X2 ⊕ ⋯ ⊕ XN
[Figure: a deep network over inputs X1 … XN; if Z1 … ZM are the outputs of an intermediate hidden layer, then O = Z1 ⊕ Z2 ⊕ ⋯ ⊕ ZM]
• Using only K hidden layers will require O(2^(N−K/2)) neurons in the K-th layer
– Because the output can be shown to be the XOR of all the outputs of the (K−1)-th hidden layer
– I.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the required number of neurons cannot model the function
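A sketch of the 3(N−1) construction (my own chaining of 2-input XOR blocks, consistent with the counts above but not copied from the figures):

```python
from itertools import product

def step(z):
    return 1 if z >= 0 else 0

def xor2(a, b):
    # One 2-input XOR block = 3 threshold perceptrons (OR, NAND, AND)
    h1 = step(a + b - 1)       # a OR b
    h2 = step(-a - b + 1.5)    # a NAND b
    return step(h1 + h2 - 2)   # h1 AND h2

def xor_n(bits):
    # Chain N-1 XOR blocks -> 3*(N-1) perceptrons in total
    out = bits[0]
    for b in bits[1:]:
        out = xor2(out, b)
    return out

# Verify against parity for N = 4
for bits in product((0, 1), repeat=4):
    assert xor_n(list(bits)) == sum(bits) % 2
print("3*(N-1) =", 3 * (4 - 1), "perceptrons for N = 4")
```

A chain like this has depth proportional to N; arranging the same blocks as a balanced tree keeps the 3(N−1) count while reducing depth to O(log N).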
Recap: The need for depth
• Deep Boolean MLPs that scale linearly with
the number of inputs …
• … can become exponentially large if recast
using only one layer
• It gets worse..
The need for depth
a ⊕ b ⊕ c ⊕ d ⊕ e ⊕ f
[Figure: the XOR appears at a higher layer of the network, over intermediate values a … f that are themselves functions of the inputs X1 … X5]
• The wide function can happen at any layer
• Having a few extra layers can greatly reduce network
size
Depth vs Size in Boolean Circuits
• The XOR is really a parity problem
Network size: summary
• An MLP is a universal Boolean function
[Figure: the 784-dimensional MNIST “Not 2” problem, repeated from earlier]
[Recap figure: a perceptron with real-valued inputs x1 … xN and the separating line w1x1 + w2x2 = T]
y = 1 if Σ_i w_i x_i ≥ T, else 0
• A perceptron operates on real-valued vectors
– This is a linear classifier
Booleans over the reals
[Figure: five perceptrons y1 … y5, one per edge of a pentagon, feeding a unit that asks Σ_{i=1}^{N} y_i ≥ 5?; such AND regions over (x1, x2) are combined by an OR unit]
• Can compose arbitrarily complex decision boundaries
Complex decision boundaries
[Figure: an OR over AND units, each built from perceptrons over x1, x2]
• Can compose arbitrarily complex decision boundaries
– With only one hidden layer!
– How?
Exercise: compose this with one hidden layer
[Figure: a target decision region over (x1, x2), and a one-hidden-layer network with a unit asking Σ_i y_i ≥ 4?]
Composing a pentagon
[Figure: five edge perceptrons y1 … y5 for a pentagon; the sum Σ_{i=1}^{5} y_i equals 5 inside the pentagon and smaller values (4, 3, 2, …) in the regions outside, so thresholding the sum at 5 isolates the pentagon]
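A sketch of this construction (the pentagon vertices and the helper names half_plane_units and in_polygon are assumptions for illustration): one perceptron per edge, and a second-layer unit that fires only when the sum of their outputs reaches 5:

```python
import numpy as np

def half_plane_units(vertices):
    """One perceptron per polygon edge: fires (1) on the interior side of that edge."""
    units = []
    center = vertices.mean(axis=0)
    for i in range(len(vertices)):
        p, q = vertices[i], vertices[(i + 1) % len(vertices)]
        # Normal to edge pq, oriented so the polygon's center is on the positive side
        n = np.array([q[1] - p[1], p[0] - q[0]])
        if np.dot(n, center - p) < 0:
            n = -n
        units.append((n, np.dot(n, p)))   # weights n, threshold n.p
    return units

def in_polygon(x, units):
    # Second-layer unit: fires if the sum of the edge perceptrons reaches N (an AND)
    y = [1 if np.dot(w, x) >= T else 0 for (w, T) in units]
    return 1 if sum(y) >= len(units) else 0

# Regular pentagon of radius 1 around the origin (illustrative)
angles = np.linspace(0, 2 * np.pi, 5, endpoint=False)
pentagon = np.stack([np.cos(angles), np.sin(angles)], axis=1)
units = half_plane_units(pentagon)
print(in_polygon(np.array([0.0, 0.0]), units))   # 1: inside
print(in_polygon(np.array([2.0, 0.0]), units))   # 0: outside
```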
Composing a hexagon
[Figure: six edge perceptrons y1 … y6 for a hexagon; the sum Σ_{i=1}^{N} y_i equals 6 inside the hexagon and less outside, so thresholding at 6 isolates the hexagon]
How about a heptagon
[Figure: the same construction for an arbitrary polygon: N edge perceptrons y1 … y5 over (x1, x2), thresholded by Σ_i y_i ≥ N?]
• In the limit of many sides, Σ_i y_i = N (1 − (1/π) arccos(min(1, radius / |x − center|)))
– The sum equals N inside the circle and approaches N/2 far outside
[Figure: summing the outputs of K such circles (KN perceptrons in total) and asking Σ_i y_i − KN/2 > 0?, with K → ∞, composes an arbitrary decision region over (x1, x2)]
Optimal depth in generic nets
• We look at a different pattern:
– “worst case” decision boundaries
Optimal depth
[Figure: a “worst case” decision boundary composed of a grid of basic regions Y1, Y2, …, Y64, each built from perceptrons as above, with K → ∞]
• Two hidden layers: 608 hidden neurons
– 64 in layer 1
– 544 in layer 2
• 609 total neurons (including output neuron)
Optimal depth
[Figure: subtracting two threshold units with thresholds T1 and T2 yields a pulse f(x) that is 1 between T1 and T2 and 0 elsewhere; scaling such pulses by heights h1 … hn and summing them approximates an arbitrary function]
Sufficiency of architecture
A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly
[Figure: the target decision region]
Why?
Sufficiency of architecture
This effect arises because we use the threshold activation: it gates information in the input from reaching later layers.
Continuous activation functions result in graded output at the layer.
Width vs. Activations vs. Depth
• Narrow layers can still pass information to
subsequent layers if the activation function is
sufficiently graded
Sufficiency of architecture
• A network with insufficient capacity cannot exactly model a function that requires
a greater minimal number of convex hulls than the capacity of the network
– But can approximate it with error
The “capacity” of a network
• VC dimension
• A separate lecture
– Koiran and Sontag (1998): For “linear” or threshold units, VC
dimension is proportional to the number of weights
• For units with piecewise linear activation it is proportional to the
square of the number of weights
– Harvey, Liaw, Mehrabian “Nearly-tight VC-dimension bounds for
piecewise linear neural networks” (2017):
• For any W, L s.t. W > CL > C², there exists a ReLU network with ≤ L layers and ≤ W weights with VC dimension ≥ (WL/C) log₂(W/L)
– Friedland, Krell, “A Capacity Scaling Law for Artificial Neural
Networks” (2017):
• VC dimension of a linear/threshold net is 𝒪(𝑀𝐾), 𝑀 is the overall
number of hidden neurons, 𝐾 is the weights per neuron
Lessons
• MLPs are universal Boolean functions
• MLPs are universal classifiers
• MLPs are universal function approximators
Learning the network
E = Σ_i (y_i − f(x_i, W))²
[Figure: decision boundaries learned by networks of different depths from the same 10000 training instances]
• Deeper networks seem to learn better, for the same number of total neurons
– Implicit smoothness constraints
• As opposed to explicit constraints from more conventional classification models
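For concreteness, a minimal sketch of evaluating this squared-error objective for a small one-hidden-layer network (the architecture, data, and initialization are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(X, W1, b1, W2, b2):
    # One hidden layer of "soft" (sigmoid) perceptrons, linear output unit
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    return H @ W2 + b2

def squared_error(Y, Y_hat):
    # E = sum_i (y_i - f(x_i, W))^2
    return float(np.sum((Y - Y_hat) ** 2))

# Placeholder data and weights
X = rng.normal(size=(100, 2))
Y = (X[:, :1] * X[:, 1:] > 0).astype(float)      # a non-linear target
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
print("E =", squared_error(Y, mlp(X, W1, b1, W2, b2)))
```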
[Figure: two classes of points over (x1, x2) that are not cleanly separated]
• In reality
– In general not really cleanly separated
• So what is the function we learn?
A trivial MLP: a single perceptron
[Figure: two classes over (x1, x2), with the perceptron output 1 on one side of a line and 0 on the other]
The simplest MLP: a single perceptron
[Figure: a single perceptron over inputs x1 … xN with weights w1 … wN and bias b, and the separating line it defines over (x1, x2)]
• Given a number of input-output pairs, learn the weights and bias
– y = 1 if Σ_{i=1}^{N} w_i X_i − b ≥ 0, else 0
– Learn W = [w1 … wN] and b, given several (X, y) pairs
Restating the perceptron
[Figure: the bias folded in as an extra weight W_{N+1} on a constant input x_{N+1} = 1]
y = 1 if Σ_{i=1}^{N+1} w_i X_i ≥ 0, else 0,  where X_{N+1} = 1
The Perceptron Problem
• Find the hyperplane Σ_{i=1}^{N+1} w_i X_i = 0 that perfectly separates the two groups of points
A simple learner: Perceptron Algorithm
• Given 𝑁 training instances 𝑋1 , 𝑌1 , 𝑋2 , 𝑌2 , … , (𝑋𝑁 , 𝑌𝑁 )
– 𝑌𝑖 = +1 or −1 (instances are either positive or negative)
The Perceptron Algorithm
[Figure sequence: +1 (blue) and −1 (red) instances in the plane; the hyperplane is initialized, and whenever an instance (from either class) is misclassified the weight vector is updated from W_old to W_new using that instance, rotating the hyperplane, until all instances are correctly separated]
In reality: Trivial linear example
[Figure: a two-dimensional example over (x1, x2)]
• Two-dimensional example
– Blue dots (on the floor) on the “red” side
– Red dots (suspended at Y=1) on the “blue” side
– No line will cleanly separate the two colors
Non-linearly separable data: 1-D example
[Figure: a 1-D example with 10 instances of each class overlapping along x; the output is modeled as P(y | x) = 1 / (1 + e^{−(w0 + w1 x)})]
[Figure: the two-dimensional example repeated: blue dots on the “red” side, red dots (at Y=1) on the “blue” side; no line will cleanly separate the two colors]
Logistic regression
P(Y = 1 | X) = 1 / (1 + exp(−(Σ_i w_i x_i + w0)))        Decision: y > 0.5?
[Figure: a one-layer network with weights w0, w1, w2 over inputs x1, x2; in the 1-D case, P(y | x) = f(x) = 1 / (1 + e^{−(w0 + w1 x)})]
Estimating the model
• Likelihood
P(Training data) = Π_i P(X_i) · 1 / (1 + e^{−y_i (w0 + wᵀX_i)})
• Log likelihood
log P(Training data) = Σ_i log P(X_i) − Σ_i log(1 + e^{−y_i (w0 + wᵀX_i)})
Maximum Likelihood Estimate
(ŵ0, ŵ1) = argmax_{w0, w1} log P(Training data)
• Equals (note argmin rather than argmax)
(ŵ0, ŵ1) = argmin_{w0, w} Σ_i log(1 + e^{−y_i (w0 + wᵀX_i)})
• Identical to minimizing the KL divergence between the desired output y and the actual output 1 / (1 + e^{−(w0 + wᵀX_i)})
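A minimal sketch of this minimization (synthetic data and plain gradient descent are my assumptions; labels are in {−1, +1} as above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data with labels y_i in {-1, +1}
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)), rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

def loss(w0, w):
    # sum_i log(1 + exp(-y_i (w0 + w.x_i)))
    return np.sum(np.log1p(np.exp(-y * (w0 + X @ w))))

w0, w = 0.0, np.zeros(2)
lr = 0.1
for _ in range(2000):
    margins = y * (w0 + X @ w)
    g = -y / (1.0 + np.exp(margins))    # derivative of log(1 + e^(-margin)) w.r.t. (w0 + w.x)
    w0 -= lr * np.mean(g)
    w  -= lr * (X.T @ g) / len(y)

print("loss:", round(float(loss(w0, w)), 2), "w0:", round(float(w0), 2), "w:", w.round(2))
```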
• Non-linear classifiers..
First consider the separable case..
[Figure sequence: a non-linearly-separable two-class problem over (x1, x2); successive layers transform the data into a space (y1, y2) in which the classes become linearly separable]
• The rest of the network may be viewed as a transformation that transforms data
from non-linear classes to linearly separable features
– We can now attach any linear classifier above it for perfect classification
– Need not be a perceptron
– In fact, for binary classifiers an SVM on top of the features may be more generalizable!
First consider the separable case..
[Figure: the network and the transformed features (y1, y2) for the separable case]
• This is true of any sufficient structure
– Not just the optimal one
• For insufficient structures, the network may attempt to transform the inputs to
linearly separable features
– Will fail to separate
– Still, for binary problems, using an SVM with slack may be more effective than a final perceptron!
Mathematically..
[Figure: an MLP over inputs x1, x2 whose final output unit y_out operates on the features Y = f(X) computed by the earlier layers]
• y_out = 1 / (1 + exp(b + WᵀY)) = 1 / (1 + exp(b + Wᵀf(X)))
• The data are (almost) linearly separable in the space of Y
• The network up to the second-to-last layer is a non-linear function f(X) that converts the input space of X into the feature space Y where the classes are maximally linearly separable
Story so far
• A classification MLP actually comprises two
components
– A “feature extraction network” that converts the
inputs into linearly separable features
• Or nearly linearly separable features
– A final linear classifier that operates on the
linearly separable features
How about the lower layers?
[Figure: the MLP over inputs x1, x2 with intermediate features y1, y2]
• How do the lower layers respond?
– They too compute features
– But what do they look like?
• Manifold hypothesis: For separable classes, the classes are linearly separable on a
non-linear manifold
• Layers sequentially “straighten” the data manifold
– Until the final layer, which fully linearizes it
The behavior of the layers
• CIFAR
[Figure: layer-by-layer visualization of the features learned on CIFAR data]
When the data are not separable and boundaries are not linear..
[Figure: non-separable data over (x1, x2); the network maps them to features (y1, y2) and the output models P(y | x) = f(x) = 1 / (1 + e^{−(w0 + wᵀx)})]
Recall: The basic perceptron
[Figure: a perceptron over inputs x1 … xN]
y = 1 if Σ_i w_i x_i ≥ T, else 0;  equivalently  y = 1 if xᵀw ≥ T, else 0
• The perceptron fires if the input is within a specified angle of the weight
– Represents a convex region on the surface of the sphere!
– The network is a Boolean function over these regions.
• The overall decision region can be arbitrarily nonconvex
• Neuron fires if the input vector is close enough to the weight vector.
– If the input pattern matches the weight pattern closely enough
Recall: The weight as a template
[Figure: the weight vector W viewed as a template matched against the input X; features Y computed from X, from which the signal might be recomposed]
• The signal could be (partially) reconstructed using these features
– Will retain all the significant components of the signal
• Simply recompose the detected features
– Will this work?
Making it explicit
[Figure: the same structure drawn explicitly: features Y computed from the input X, and X recomposed from Y through weights Wᵀ]
Making it explicit: an autoencoder
[Figure: an autoencoder: the input X is encoded into features Y and decoded back into a reconstruction of X through weights Wᵀ]
• A neural network can be trained to predict the input itself
• This is an autoencoder
• An encoder learns to detect all the most significant patterns in the signals
• A decoder recomposes the signal from the patterns
The Simplest Autoencoder
[Figure: a single hidden unit with encoder weight w and decoder weight wᵀ]
• A single hidden unit
• Hidden unit has linear activation
• What will this learn?
The Simplest Autoencoder
Training: learning w by minimizing the L2 divergence
x̂ = wᵀw x
div(x̂, x) = ‖x − x̂‖² = ‖x − wᵀw x‖²
Ŵ = argmin_W E[div(x̂, x)] = argmin_W E[‖x − wᵀw x‖²]
The Simplest Autoencoder
[Figure: varying the hidden value traces out a line along the major axis of the data; the same happens when the decoder weight uᵀ is learned separately from the encoder weight]
• Simply varying the hidden representation will result in an output that lies along the major axis
• This will happen even if the learned output weight is separate from the input weight
– The minimum-error direction is the principal eigenvector
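A sketch checking this claim numerically (the synthetic data, tied encoder/decoder weight w, and plain gradient descent are assumptions): the learned direction lines up with the principal eigenvector of the data, up to sign and scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data; its principal direction is roughly [1, 1] / sqrt(2)
X = rng.normal(size=(500, 1)) * np.array([1.0, 1.0]) + 0.1 * rng.normal(size=(500, 2))

w = 0.1 * rng.normal(size=2)        # encoder weight; the decoder weight is w itself (tied)
lr = 0.01
for _ in range(500):
    h = X @ w                       # hidden value for each sample
    X_hat = np.outer(h, w)          # reconstruction  x_hat = w (w . x)
    err = X_hat - X
    # Gradient of mean ||x - w (w.x)||^2 with respect to w
    grad = 2.0 * (err.T @ h + X.T @ (err @ w)) / len(X)
    w -= lr * grad

pc1 = np.linalg.eigh(np.cov(X.T))[1][:, -1]   # principal eigenvector of the covariance
print("learned direction:  ", w / np.linalg.norm(w))
print("principal direction:", pc1)
```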
For more detailed AEs without a non-linearity
[Figure: encoder weights W map the input X to features Y; decoder weights Wᵀ map Y back to a reconstruction of X]
Y = WX,   X̂ = WᵀY
E = ‖X − WᵀWX‖²
Find W to minimize Avg[E]
• Terminology:
– Encoder: The “Analysis” net which computes the hidden
representation
– Decoder: The “Synthesis” which recomposes the data from the
hidden representation
Introducing nonlinearity
[Figure: encoder (X → Y through W) and decoder (Y → X̂ through Wᵀ)]
• When the hidden layer has a linear activation the decoder represents the best linear manifold to fit
the data
– Varying the hidden value will move along this linear manifold
• When the hidden layer has non-linear activation, the net performs nonlinear PCA
– The decoder represents the best non-linear manifold to fit the data
– Varying the hidden value will move along this non-linear manifold
The AE
[Figure: a deep autoencoder with a multi-layer encoder and decoder]
• With non-linearity
– “Non linear” PCA
– Deeper networks can capture more complicated manifolds
• “Deep” autoencoders
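A compact sketch of such a deep autoencoder (PyTorch; the synthetic data, layer sizes, and training settings are assumptions): a one-dimensional bottleneck with nonlinear encoder and decoder:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 2-D data lying near a curved (non-linear) manifold
t = torch.rand(1000, 1) * 6.28
X = torch.cat([torch.cos(t), torch.sin(t) * 0.5], dim=1) + 0.05 * torch.randn(1000, 2)

encoder = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))   # bottleneck of size 1
decoder = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(500):
    X_hat = decoder(encoder(X))
    loss = loss_fn(X_hat, X)        # reconstruction (L2) error
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final reconstruction MSE:", loss.item())
# Varying the bottleneck value z traces out the learned 1-D manifold in input space
z = torch.linspace(-2, 2, 5).unsqueeze(1)
print(decoder(z))
```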
Some examples
• 2-D input
• Encoder and decoder have 2 hidden layers of 100 neurons, but
hidden representation is unidimensional
• Extending the hidden “z” value beyond the values seen in training
extends the helix linearly
Some examples
• When the hidden representation is of lower dimensionality
than the input, often called a “bottleneck” network
– Nonlinear PCA
– Learns the manifold for the data
• If properly trained
The AE
[Figure: the trained decoder reused as a “dictionary”: hidden values are composed and passed through copies of the decoder to synthesize signals]
Dictionary-based techniques
[Figure: separate decoders for guitar music and drum music; a mixture is explained by composing the outputs of the two decoders]
Learning Dictionaries
[Figure: learning dictionaries for source separation: encoder networks f_EN1(), f_EN2() and decoder networks f_DE1(), f_DE2() for two sources, with spectrogram frames D1(0, t) … D1(F, t) and D2(0, t) … D2(F, t); the mixture is explained by weighted combinations (α, β) of the two decoders' outputs, compared against the original signals]
• Separating music
Story for the day
• Classification networks learn to predict the a posteriori
probabilities of classes
– The network up to the final layer is a feature extractor that
converts the input data to be (almost) linearly separable
– The final layer is a classifier/predictor that operates on linearly
separable data