
5 years from now, everyone will learn their features
(you might as well start now)

Yann LeCun
Courant Institute of Mathematical Sciences
and
Center for Neural Science,
New York University


I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities.
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space and time) similar to that of images and videos:
Heavy correlation between neighboring variables
Local patches of variables have structure, and are representable by feature vectors.
I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.

The Unity of Recognition Architectures


Most Recognition Systems Are Built on the Same Architecture

[Diagram: Filter Bank -> Non-Linearity -> feature Pooling -> Normalization, repeated for two stages, followed by a Classifier]

First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders...
Second stage: K-means, sparse coding, LCC...
Pooling: average, L2, max, max with bias (elastic templates)...
Convolutional Nets: same architecture, but everything is trained.
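As a rough, hedged illustration (not code from the talk), here is a minimal NumPy/SciPy sketch of one such stage; the random filters, abs rectification, average pooling, and the single global normalization constant are all arbitrary choices of mine.

```python
import numpy as np
from scipy.signal import convolve2d

def stage(image, filters, pool=2, eps=1e-6):
    """One generic stage: filter bank -> non-linearity -> pooling -> normalization."""
    maps = []
    for w in filters:                                    # filter bank (convolutions)
        r = np.abs(convolve2d(image, w, mode='valid'))   # rectification non-linearity
        h, wd = r.shape
        r = r[:h - h % pool, :wd - wd % pool]            # crop to a multiple of the pool size
        r = r.reshape(r.shape[0] // pool, pool,
                      r.shape[1] // pool, pool).mean(axis=(1, 3))  # average pooling
        maps.append(r)
    maps = np.stack(maps)
    return maps / (maps.std() + eps)                     # crude contrast normalization

# Toy usage: 8 random 5x5 filters on a random 32x32 "image".
rng = np.random.default_rng(0)
features = stage(rng.standard_normal((32, 32)), rng.standard_normal((8, 5, 5)))
```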

Filter Bank + Non-Linearity + Pooling + Normalization

[Diagram: Filter Bank -> Non-Linearity -> Spatial Pooling]

This model of a feature extraction stage is biologically inspired
...whether you like it or not (just ask David Lowe).
Inspired by [Hubel and Wiesel 1962].
The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60's).

How well does this work?

[Diagram: two stages of (Filter Bank -> Non-Linearity -> feature Pooling). Stage 1: Oriented Edges / SIFT, with Winner-Takes-All. Stage 2: K-means or Sparse Coding, with Histogram (sum) or Pyramid pooling. Classifier: SVM or another simple classifier; Histogram, Elastic parts Models, ...]

Some results on C101 (I know, I know...)
SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.], and it runs in REAL TIME

Convolutional Networks (ConvNets) fit that model


Why do two stages work better than one stage?

[Diagram: (Filter Bank -> Non-Linearity -> Pooling -> Normalization) -> (Filter Bank -> Non-Linearity -> Pooling -> Normalization) -> Classifier]

The second stage extracts mid-level features.
Having multiple stages helps the selectivity-invariance dilemma.



Learning Hierarchical Representations

[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier; Learned Internal Representation]
I agree with David Lowe: we should learn the features.
It worked for speech, handwriting, NLP...
In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you.

Sometimes, Biology gives you good hints
Example: contrast normalization


Harsh Non-Linearity + Contrast Normalization + Sparsity
C: Convolutions (filter bank)
Soft Thresholding + Abs
N: Subtractive and Divisive Local Normalization
P: Pooling/downsampling layer: average or max?
[Diagram of one stage: Convolutions; Thresholding; Rectification; subtractive + divisive contrast normalization; Pooling, subsampling]


THIS IS ONE STAGE OF THE CONVNET

Soft Thresholding Non-Linearity
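A minimal sketch of a soft-thresholding (shrinkage) non-linearity of the kind referred to here; the threshold value in the usage comment is an arbitrary example.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Shrinkage: zero out values smaller than thresh, shrink the rest toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

# soft_threshold(np.array([-2.0, -0.1, 0.3, 1.5]), 0.5) -> array([-1.5, 0., 0., 1.])
```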


Local Contrast Normalization
Performed on the state of every layer, including the input.
Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter).
Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
Subtractive + Divisive LCN performs a kind of approximate whitening.
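A minimal NumPy/SciPy sketch of subtractive + divisive normalization on a single 2D map, assuming a Gaussian-weighted neighborhood; the slide's version also pools the divisive statistic across feature maps, which is omitted here, and sigma/eps are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_norm(x, sigma=2.0, eps=1e-6):
    """Subtractive then divisive normalization with a Gaussian-weighted neighborhood."""
    v = x - gaussian_filter(x, sigma)                     # subtractive step (high-pass filter)
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma))   # local standard deviation estimate
    return v / np.maximum(local_std, eps)                 # divisive step
```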


C101 Performance (I know, I know)

Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, no rectification, no normalization: 29%
Tanh non-linearity, rectification, normalization: 65%
Shrink non-linearity, rectification, normalization, sparsity penalty: 71%


Results on Caltech101 with sigmoid non-linearity
(like HMAX model)


Feature Learning Works Really Well on everything but C101


C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet.
Some results on NORB:
[Table: results on NORB with no normalization, random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters]


Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output

E(Y^i, Z) = ||Y^i - W_d Z||^2 + ||Z - g_e(W_e, Y^i)||^2 + \lambda \sum_j |z_j|

Z^* = \arg\min_Z E(Y^i, Z; W)

[Diagram: INPUT Y^i -> code Z -> FEATURES, with decoder reconstruction term ||Y^i - W_d Z||^2, encoder prediction term ||Z - g_e(W_e, Y^i)||^2, and sparsity term \sum_j |z_j|]
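A hedged sketch of this inference, run as proximal (ISTA-style) gradient descent on the energy above, starting from the encoder output; a plain linear encoder stands in for g_e(W_e, Y), and the step size, lambda, and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def psd_infer(y, Wd, We, lam=0.1, step=0.05, n_iters=50):
    """Minimize ||y - Wd z||^2 + ||z - We y||^2 + lam * |z|_1 over z,
    starting from the encoder prediction We @ y (a linear stand-in for g_e)."""
    z_enc = We @ y                  # encoder output: the initial guess
    z = z_enc.copy()
    for _ in range(n_iters):
        grad = 2 * Wd.T @ (Wd @ z - y) + 2 * (z - z_enc)  # gradient of the smooth terms
        z = soft_threshold(z - step * grad, step * lam)   # proximal step for the L1 term
    return z
```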


Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation

[Diagram: input Y^i -> encoder g_e(W_e, Y^i) with |z_j| -> second encoder with |z_j| -> classifier -> FEATURES]
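The layer-wise recipe above, sketched as toy Python; `train_psd_layer` is only a random-projection stand-in for a real PSD trainer (an assumption for illustration), and the data sizes and classifier choice are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_psd_layer(X, n_features):
    """Stand-in for PSD training: returns a random linear encoder.
    A real implementation would learn We/Wd with reconstruction + sparsity losses."""
    return 0.1 * rng.standard_normal((n_features, X.shape[1]))

def encode(X, We):
    """Phases 2 and 4: encoder followed by absolute value."""
    return np.abs(X @ We.T)

X = rng.standard_normal((100, 64))      # toy inputs (stand-in for image patches)

We1 = train_psd_layer(X, 32)            # Phase 1: train first layer
H1 = encode(X, We1)                     # Phase 2: first-layer features
We2 = train_psd_layer(H1, 16)           # Phase 3: train second layer on those features
H2 = encode(H1, We2)                    # Phase 4: second-layer features

# Phase 5: train any supervised classifier (e.g. logistic regression / linear SVM) on H2.
# Phase 6 (optional): fine-tune the whole stack with supervised back-propagation.
```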


Learned Features on natural patches: V1-like receptive fields


Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD with Linear-Sigmoid-Diagonal Encoder


Convolutional Sparse Coding
[Kavukcuoglu et al. NIPS 2010]: convolutional PSD
[Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
[Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
[Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
[Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension


Convolutional Training
Problem:
With patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector.
But when the filters are used convolutionally, neighboring feature vectors will be highly redundant.
Patch-level training produces lots of filters that are shifted versions of each other.


Convolutional Sparse Coding
Replace the dot products with dictionary elements by convolutions.
Input Y is a full image.
Each code component Zk is a feature map (an image).
Each dictionary element is a convolution kernel.
Regular sparse coding: E(Y, Z) = ||Y - \sum_k W_k Z_k||^2 + \lambda \sum_k |Z_k| (each W_k a dictionary element, each Z_k a code component)
Convolutional S.C.: E(Y, Z) = ||Y - \sum_k W_k * Z_k||^2 + \lambda \sum_k |Z_k|_1 (each W_k a convolution kernel, each Z_k a feature map)
"Deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
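A sketch of the convolutional sparse-coding energy above in NumPy/SciPy; boundary handling (mode='same') and the lambda value are my assumptions, and inference/learning are not shown.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_sc_energy(Y, W, Z, lam=0.1):
    """||Y - sum_k W_k * Z_k||^2 + lam * sum_k |Z_k|_1,
    with W of shape (K, kh, kw) (kernels) and Z of shape (K, H, W) (feature maps)."""
    recon = sum(convolve2d(Z[k], W[k], mode='same') for k in range(len(W)))
    return np.sum((Y - recon) ** 2) + lam * np.sum(np.abs(Z))

# Toy usage: an 8-kernel dictionary on a 16x16 image with full-size feature maps.
rng = np.random.default_rng(0)
Y = rng.standard_normal((16, 16))
W = rng.standard_normal((8, 5, 5))
Z = rng.standard_normal((8, 16, 16))
print(conv_sc_energy(Y, W, Z))
```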

Convolutional PSD: Encoder with a soft sh() Function
Convolutional Formulation: extend sparse coding from PATCH to IMAGE.
[Figure: PATCH-based learning vs. CONVOLUTIONAL learning]

Cifar-10 Dataset
Dataset of tiny images: 32x32 color images.
10 object categories, with 50,000 training and 10,000 test images.
[Figure: example images]

Comparative Results on Cifar-10 Dataset

* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.

Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1)
32x32 images
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA.
Entry No. 6 is humans!

Pedestrian Detection (INRIA Dataset)


[Sermanet et al., rejected from ICCV 2011]

Pedestrian Detection: Examples


[Kavukcuoglu et al. NIPS 2010]

Learning Invariant Features


Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with a suitable non-linearity)
2. Arrange the filter outputs on a 2D plane
3. Square the filter outputs
4. Minimize the square root of the sum of each block of squared filter outputs (sketched below)
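A minimal sketch of steps 2-4 as a penalty on a code vector; the 8x8 grid and non-overlapping 2x2 blocks are arbitrary simplifications (topographic versions typically use overlapping neighborhoods).

```python
import numpy as np

def topographic_pool_penalty(z, grid=(8, 8), block=2, eps=1e-8):
    """Arrange code z on a 2D grid, square it, and sum sqrt of block-wise sums of squares."""
    g = z.reshape(grid) ** 2                       # steps 2-3: 2D layout, then square
    penalty = 0.0
    for i in range(0, grid[0], block):             # step 4: sqrt of each block's sum
        for j in range(0, grid[1], block):
            penalty += np.sqrt(g[i:i + block, j:j + block].sum() + eps)
    return penalty

# e.g. for a 64-dimensional code: topographic_pool_penalty(np.random.randn(64))
```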


Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
They are invariant to local transformations of the input: for some it's translations, for others rotations or other transformations.


Pinwheels?
Does that look pinwheely to you?


Sparsity through Lateral Inhibition


Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix.
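One common way to write such a term, given here as an assumption rather than the exact formulation from the talk: a penalty of the form sum_ij S_ij |z_i| |z_j| with a non-negative matrix S whose zero entries mark unit pairs that are allowed to be co-active.

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Sparsity via lateral inhibition: sum_ij S_ij |z_i| |z_j|.
    Zero entries of S (e.g. arranged along a tree or a ring) let those units co-activate freely."""
    a = np.abs(z)
    return a @ S @ a
```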


Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure.


Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Input patches are high-pass filtered.


Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Left: no high-pass filtering of the input. Right: patch-level mean removal.


Invariant Features: Short-Range Lateral Excitation + L1


Disentangling the Explanatory Factors of Images


Separating
I used to think that recognition was all about eliminating irrelevant information while keeping the useful information:
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors of variation:
Separating what and where
Separating content from instantiation parameters
Hinton's capsules; Karol Gregor's what-where auto-encoders


Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters [Hinton 1981]

[Karol Gregor et al.]
[Figure: object type vs. object size (small, medium, large)]

Invariant Features through Temporal Constancy
[Diagram: architecture over frames S^t, S^{t-1}, S^{t-2}, with Encoder f(W^1), Decoder W^1, and second-stage weights W^2; labeled quantities include Input, Predicted input, Inferred code, and Predicted code]

Invariant Features through Temporal Constancy
[Figure: C1 code (where) and C2 code (what)]


Generating from the Network
[Figure: Input]


What is the right criterion to train hierarchical feature extraction architectures?


Flattening the Data Manifold?
The manifold of all images of <Category-X> is low-dimensional and highly curvy.
Feature extractors should flatten the manifold.


Flattening the Data Manifold?


The Ultimate Recognition System

[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier; Learned Internal Representation]
Bottom-up and top-down information:
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down inference
Integrated supervised and unsupervised learning: capture the dependencies between all observed variables
Compositionality: each stage has latent instantiation variables
