
5 years from now, everyone will learn their features
(you might as well start now)

Yann LeCun
Courant Institute of Mathematical Sciences
and
Center for Neural Science,
New York University


I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities.
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space and time) similar to that of images and videos:
Heavy correlation between neighboring variables
Local patches of variables have structure, and are representable by feature vectors.
I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.

The Unity of Recognition Architectures


Most Recognition Systems Are Built on the Same Architecture

[Diagram: Filter Bank -> Non-Linearity -> feature Pooling -> Normalization, repeated for two stages, followed by a Classifier]

First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders...
Second stage: K-means, sparse coding, LCC...
Pooling: average, L2, max, max with bias (elastic templates)...
Convolutional Nets: same architecture, but everything is trained.
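As a rough, hedged illustration (not code from the talk), here is a minimal NumPy/SciPy sketch of one such stage; the random filters, abs rectification, average pooling, and the single global normalization constant are all arbitrary choices of mine.

```python
import numpy as np
from scipy.signal import convolve2d

def stage(image, filters, pool=2, eps=1e-6):
    """One generic stage: filter bank -> non-linearity -> pooling -> normalization."""
    maps = []
    for w in filters:                                    # filter bank (convolutions)
        r = np.abs(convolve2d(image, w, mode='valid'))   # rectification non-linearity
        h, wd = r.shape
        r = r[:h - h % pool, :wd - wd % pool]            # crop to a multiple of the pool size
        r = r.reshape(r.shape[0] // pool, pool,
                      r.shape[1] // pool, pool).mean(axis=(1, 3))  # average pooling
        maps.append(r)
    maps = np.stack(maps)
    return maps / (maps.std() + eps)                     # crude contrast normalization

# Toy usage: 8 random 5x5 filters on a random 32x32 "image".
rng = np.random.default_rng(0)
features = stage(rng.standard_normal((32, 32)), rng.standard_normal((8, 5, 5)))
```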

Filter Bank + Non-Linearity + Pooling + Normalization

[Diagram: Filter Bank -> Non-Linearity -> Spatial Pooling]

This model of a feature extraction stage is biologically inspired
...whether you like it or not (just ask David Lowe).
Inspired by [Hubel and Wiesel 1962].
The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60's).

How well does this work?

[Diagram: two stages of (Filter Bank -> Non-Linearity -> feature Pooling). Stage 1: Oriented Edges / SIFT, with Winner-Takes-All. Stage 2: K-means or Sparse Coding, with Histogram (sum) or Pyramid pooling. Classifier: SVM or another simple classifier; Histogram, Elastic parts Models, ...]

Some results on C101 (I know, I know...)
SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.], and it runs in REAL TIME

Convolutional Networks (ConvNets) fit that model


Why do two stages work better than one stage?

[Diagram: (Filter Bank -> Non-Linearity -> Pooling -> Normalization) -> (Filter Bank -> Non-Linearity -> Pooling -> Normalization) -> Classifier]

The second stage extracts mid-level features.
Having multiple stages helps the selectivity-invariance dilemma.



Learning Hierarchical Representations

[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier; Learned Internal Representation]
I agree with David Lowe: we should learn the features.
It worked for speech, handwriting, NLP...
In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you.

Sometimes, Biology gives you good hints
Example: contrast normalization


Harsh Non-Linearity + Contrast Normalization + Sparsity
C: Convolutions (filter bank)
Soft Thresholding + Abs
N: Subtractive and Divisive Local Normalization
P: Pooling/downsampling layer: average or max?
[Diagram of one stage: Convolutions; Thresholding; Rectification; subtractive + divisive contrast normalization; Pooling, subsampling]


THIS IS ONE STAGE OF THE CONVNET

Soft Thresholding Non-Linearity
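A minimal sketch of a soft-thresholding (shrinkage) non-linearity of the kind referred to here; the threshold value in the usage comment is an arbitrary example.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Shrinkage: zero out values smaller than thresh, shrink the rest toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

# soft_threshold(np.array([-2.0, -0.1, 0.3, 1.5]), 0.5) -> array([-1.5, 0., 0., 1.])
```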


Local Contrast Normalization
Performed on the state of every layer, including the input.
Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter).
Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
Subtractive + Divisive LCN performs a kind of approximate whitening.
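A minimal NumPy/SciPy sketch of subtractive + divisive normalization on a single 2D map, assuming a Gaussian-weighted neighborhood; the slide's version also pools the divisive statistic across feature maps, which is omitted here, and sigma/eps are arbitrary.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_norm(x, sigma=2.0, eps=1e-6):
    """Subtractive then divisive normalization with a Gaussian-weighted neighborhood."""
    v = x - gaussian_filter(x, sigma)                     # subtractive step (high-pass filter)
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma))   # local standard deviation estimate
    return v / np.maximum(local_std, eps)                 # divisive step
```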


C101 Performance (I know, I know)

Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, no rectification, no normalization: 29%
Tanh non-linearity, rectification, normalization: 65%
Shrink non-linearity, rectification, normalization, sparsity penalty: 71%


Results on Caltech101 with sigmoid non-linearity
(like HMAX model)


Feature Learning Works Really Well on everything but C101


C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet.
Some results on NORB:
[Table: results on NORB with no normalization, random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters]


Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output

E(Y^i, Z) = ||Y^i - W_d Z||^2 + ||Z - g_e(W_e, Y^i)||^2 + \lambda \sum_j |z_j|

Z^* = \arg\min_Z E(Y^i, Z; W)

[Diagram: INPUT Y^i -> code Z -> FEATURES, with decoder reconstruction term ||Y^i - W_d Z||^2, encoder prediction term ||Z - g_e(W_e, Y^i)||^2, and sparsity term \sum_j |z_j|]
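A hedged sketch of this inference, run as proximal (ISTA-style) gradient descent on the energy above, starting from the encoder output; a plain linear encoder stands in for g_e(W_e, Y), and the step size, lambda, and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def psd_infer(y, Wd, We, lam=0.1, step=0.05, n_iters=50):
    """Minimize ||y - Wd z||^2 + ||z - We y||^2 + lam * |z|_1 over z,
    starting from the encoder prediction We @ y (a linear stand-in for g_e)."""
    z_enc = We @ y                  # encoder output: the initial guess
    z = z_enc.copy()
    for _ in range(n_iters):
        grad = 2 * Wd.T @ (Wd @ z - y) + 2 * (z - z_enc)  # gradient of the smooth terms
        z = soft_threshold(z - step * grad, step * lam)   # proximal step for the L1 term
    return z
```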


Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation

[Diagram: input Y^i -> encoder g_e(W_e, Y^i) with |z_j| -> second encoder with |z_j| -> classifier -> FEATURES]
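The layer-wise recipe above, sketched as toy Python; `train_psd_layer` is only a random-projection stand-in for a real PSD trainer (an assumption for illustration), and the data sizes and classifier choice are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_psd_layer(X, n_features):
    """Stand-in for PSD training: returns a random linear encoder.
    A real implementation would learn We/Wd with reconstruction + sparsity losses."""
    return 0.1 * rng.standard_normal((n_features, X.shape[1]))

def encode(X, We):
    """Phases 2 and 4: encoder followed by absolute value."""
    return np.abs(X @ We.T)

X = rng.standard_normal((100, 64))      # toy inputs (stand-in for image patches)

We1 = train_psd_layer(X, 32)            # Phase 1: train first layer
H1 = encode(X, We1)                     # Phase 2: first-layer features
We2 = train_psd_layer(H1, 16)           # Phase 3: train second layer on those features
H2 = encode(H1, We2)                    # Phase 4: second-layer features

# Phase 5: train any supervised classifier (e.g. logistic regression / linear SVM) on H2.
# Phase 6 (optional): fine-tune the whole stack with supervised back-propagation.
```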


Learned Features on natural patches: V1-like receptive fields


Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD with Linear-Sigmoid-Diagonal Encoder


Convolutional Sparse Coding
[Kavukcuoglu et al. NIPS 2010]: convolutional PSD
[Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
[Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
[Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
[Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension


Convolutional Training
Problem:
With patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector.
But when the filters are used convolutionally, neighboring feature vectors will be highly redundant.
Patch-level training produces lots of filters that are shifted versions of each other.


Convolutional Sparse Coding
Replace the dot products with dictionary elements by convolutions.
Input Y is a full image.
Each code component Zk is a feature map (an image).
Each dictionary element is a convolution kernel.
Regular sparse coding: E(Y, Z) = ||Y - \sum_k W_k Z_k||^2 + \lambda \sum_k |Z_k| (each W_k a dictionary element, each Z_k a code component)
Convolutional S.C.: E(Y, Z) = ||Y - \sum_k W_k * Z_k||^2 + \lambda \sum_k |Z_k|_1 (each W_k a convolution kernel, each Z_k a feature map)
"Deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
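A sketch of the convolutional sparse-coding energy above in NumPy/SciPy; boundary handling (mode='same') and the lambda value are my assumptions, and inference/learning are not shown.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_sc_energy(Y, W, Z, lam=0.1):
    """||Y - sum_k W_k * Z_k||^2 + lam * sum_k |Z_k|_1,
    with W of shape (K, kh, kw) (kernels) and Z of shape (K, H, W) (feature maps)."""
    recon = sum(convolve2d(Z[k], W[k], mode='same') for k in range(len(W)))
    return np.sum((Y - recon) ** 2) + lam * np.sum(np.abs(Z))

# Toy usage: an 8-kernel dictionary on a 16x16 image with full-size feature maps.
rng = np.random.default_rng(0)
Y = rng.standard_normal((16, 16))
W = rng.standard_normal((8, 5, 5))
Z = rng.standard_normal((8, 16, 16))
print(conv_sc_energy(Y, W, Z))
```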

Convolutional PSD: Encoder with a soft sh() Function
Convolutional Formulation: extend sparse coding from PATCH to IMAGE.
[Figure: PATCH-based learning vs. CONVOLUTIONAL learning]

Cifar-10 Dataset
Dataset of tiny images: 32x32 color images.
10 object categories, with 50,000 training and 10,000 test images.
[Figure: example images]

Comparative Results on Cifar-10 Dataset

* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.

Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1)
32x32 images
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA.
Entry No. 6 is humans!

Pedestrian Detection (INRIA Dataset)


[Sermanet et al., rejected from ICCV 2011]

Pedestrian Detection: Examples


[Kavukcuoglu et al. NIPS 2010]

Learning Invariant Features


Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with a suitable non-linearity)
2. Arrange the filter outputs on a 2D plane
3. Square the filter outputs
4. Minimize the square root of the sum of each block of squared filter outputs (sketched below)
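A minimal sketch of steps 2-4 as a penalty on a code vector; the 8x8 grid and non-overlapping 2x2 blocks are arbitrary simplifications (topographic versions typically use overlapping neighborhoods).

```python
import numpy as np

def topographic_pool_penalty(z, grid=(8, 8), block=2, eps=1e-8):
    """Arrange code z on a 2D grid, square it, and sum sqrt of block-wise sums of squares."""
    g = z.reshape(grid) ** 2                       # steps 2-3: 2D layout, then square
    penalty = 0.0
    for i in range(0, grid[0], block):             # step 4: sqrt of each block's sum
        for j in range(0, grid[1], block):
            penalty += np.sqrt(g[i:i + block, j:j + block].sum() + eps)
    return penalty

# e.g. for a 64-dimensional code: topographic_pool_penalty(np.random.randn(64))
```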


Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
They are invariant to local transformations of the input: for some it's translations, for others rotations or other transformations.


Pinwheels?
Does that look pinwheely to you?


Sparsity through Lateral Inhibition


Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix.
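One common way to write such a term, given here as an assumption rather than the exact formulation from the talk: a penalty of the form sum_ij S_ij |z_i| |z_j| with a non-negative matrix S whose zero entries mark unit pairs that are allowed to be co-active.

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Sparsity via lateral inhibition: sum_ij S_ij |z_i| |z_j|.
    Zero entries of S (e.g. arranged along a tree or a ring) let those units co-activate freely."""
    a = np.abs(z)
    return a @ S @ a
```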


Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure.


Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Input patches are high-pass filtered.


Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Left: no high-pass filtering of the input. Right: patch-level mean removal.


Invariant Features: Short-Range Lateral Excitation + L1


Disentangling the Explanatory Factors of Images


Separating
I used to think that recognition was all about eliminating irrelevant information while keeping the useful information:
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors of variation:
Separating what and where
Separating content from instantiation parameters
Hinton's capsules; Karol Gregor's what-where auto-encoders


Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters [Hinton 1981]

[Karol Gregor et al.]
[Figure: object type vs. object size (small, medium, large)]

Invariant Features through Temporal Constancy
[Diagram: architecture over frames S^t, S^{t-1}, S^{t-2}, with Encoder f(W^1), Decoder W^1, and second-stage weights W^2; labeled quantities include Input, Predicted input, Inferred code, and Predicted code]

Invariant Features through Temporal Constancy
[Figure: C1 code (where) and C2 code (what)]


Generating from the Network
[Figure: Input]


What is the right criterion to train hierarchical feature extraction architectures?


Flattening the Data Manifold?
The manifold of all images of <Category-X> is low-dimensional and highly curvy.
Feature extractors should flatten the manifold.


Flattening the Data Manifold?


The Ultimate Recognition System

[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier; Learned Internal Representation]
Bottom-up and top-down information:
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down inference
Integrated supervised and unsupervised learning: capture the dependencies between all observed variables
Compositionality: each stage has latent instantiation variables
