
Deep Residual Learning for

Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research

presented by Tomoki Tsuchida

Deeper Models Are Better


In the ImageNet (ILSVRC) competition, the winning models have gotten deeper every year.

So why not just train more layers (if you've got the hardware)?

Problems with Going Deeper


1. Vanishing / exploding gradient problem?

Better weight initialization and batch normalization have practically solved it (even for deeper nets).

2. Performance degradation problem:

Networks with more layers perform worse on the test data.

This is not overfitting, since they also do worse on the training data.

[Figure: training error (%) and test error (%) vs. iter. (1e4) for 20-layer and 56-layer plain networks; the 56-layer network has higher error in both plots.]

In practice, just stacking more layers makes the network harder to train.

Deep Residual Learning

In principle, a deeper network should be able to do at least as well as a shallower one by using identity mappings for the extra layers.

But in practice, these identity mappings (or better) are not learned.

Idea: learn the residual with respect to the input.

Instead of learning H(x) directly, learn F(x) s.t. H(x) = F(x) + x.

Hypothesis: F(x) is easier to learn than H(x).

e.g. for an identity mapping, learning F(x) = 0 is easier than learning H(x) = x (a code sketch follows the block diagram below).
[Figure: residual block. The input x passes through weight layer → relu → weight layer to produce F(x); a shortcut connection carries x (identity) around the block, and the sum F(x)+x goes through a final relu.]
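Below is a minimal sketch of such a residual block in PyTorch (my own illustration for this presentation, not the authors' released code); the two 3x3 conv layers play the role of the weight layers in the diagram, and the shortcut adds the input back before the final ReLU. The class name and channel count are arbitrary choices for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Illustrative residual block: output = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # Two 3x3 conv layers implement the residual function F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # this is F(x)
        return F.relu(out + x)            # shortcut: add x, then relu


# Usage: the feature map keeps its shape, so the identity shortcut is valid.
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))     # -> torch.Size([1, 64, 56, 56])
```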

[Figure: side-by-side architectures of VGG-19, the 34-layer plain network, and the 34-layer residual network, shown as stacks of 3x3 conv layers (64, 128, 256, 512 filters) with pooling or stride-2 downsampling at output sizes 224, 112, 56, 28, and 14.]
Related Ideas

Highway networks (Srivastava, Greff, and Schmidhuber, 2015)

A superset of the residual network architecture: the residual transformation and the identity map are gated, and the gates are parameterized (see the sketch below).

LSTM

LSTMs unrolled in time also have the same information flow (with gates).
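For comparison, here is a toy fully-connected highway layer (a sketch of the idea under my own simplifications, not Srivastava et al.'s implementation): a learned gate T(x) mixes the transform with the identity, whereas a residual block leaves both paths permanently open instead of gating them.

```python
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    """Toy fully-connected highway layer (illustration only)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # parameters of the gate T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))       # T(x) in (0, 1)
        # Gated mix of transform and identity. A residual block is the
        # ungated case: the transform and the identity always both pass.
        return h * t + x * (1.0 - t)
```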

Results (ImageNet)

[Figure: error (%) vs. iter. (1e4) for 18-layer and 34-layer networks. Left: plain-18 and plain-34. Right: ResNet-18 and ResNet-34. Thin curves show training error, bold curves show validation error.]

With the residual architecture, they lowered both the training (thin) and validation (bold) errors with deeper nets, as was hoped.

Shortcut Parameters
A. zero-padding for increasing dimensions
B. projection shortcuts for increasing dimensions

y = F(x, {Wi}) + Ws x

C. all shortcuts are projections

Extra projection parameters didn't yield much improvement (option B is sketched below).
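As an illustration of option B, here is a PyTorch sketch (names and hyperparameters are my own, not the paper's code) of a block whose shortcut uses a 1x1 convolution Ws to match the increased channel count and the stride-2 downsampling, so that y = F(x, {Wi}) + Ws x is well defined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionResidualBlock(nn.Module):
    """Illustrative block for y = F(x, {Wi}) + Ws*x when dimensions change."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Ws: a 1x1 conv on the shortcut that matches channels and stride.
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(f + self.proj(x))


# e.g. going from a 64-channel 56x56 map to a 128-channel 28x28 map.
y = ProjectionResidualBlock(64, 128)(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```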

Going Deeper (on CIFAR-10)


[Figure: error (%) vs. iter. (1e4) on CIFAR-10. Left: plain nets (plain-20/32/44/56), with the 20-layer and 56-layer curves labeled. Middle: ResNets (ResNet-20/32/44/56/110), with the 20-layer and 110-layer curves labeled. Right: residual-110 vs. residual-1202.]

Layer Responses
[Figure: std of layer responses for plain-20, plain-56, ResNet-20, ResNet-56, and ResNet-110. Top: by original layer index. Bottom: sorted by magnitude.]

ResNets generally produce smaller layer responses than plain nets, since the residuals are closer to zero.
(But these are measured after BN, so why are there differences? Small values?)
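One way to reproduce this analysis, sketched below under my own assumptions (hooking every BatchNorm2d output as a proxy for the "3x3 conv response after BN, before the nonlinearity/addition"): record each layer's response standard deviation with forward hooks during a single forward pass, then compare a plain net against a ResNet.

```python
import torch
import torch.nn as nn


def layer_response_stds(model, inputs):
    """Std of each BatchNorm2d output during one forward pass (illustrative)."""
    stds, hooks = [], []

    def hook(_module, _inputs, output):
        stds.append(output.detach().std().item())

    # Attach a forward hook to every BN layer in the network.
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(hook))

    model.eval()
    with torch.no_grad():
        model(inputs)

    for h in hooks:
        h.remove()
    return stds
```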

Conclusion

The ResNet architecture allows training of very deep nets (>30 layers) without performance degradation.

Won 1st place in the ImageNet 2015 and COCO 2015 competitions.

The key idea is to model the residual relative to the input instead of the full mapping.

Can be thought of as a simplified version of the highway network design, which is in turn a simplified version of a time-unrolled LSTM.

It's a trade-off between flexibility and where you use your parameters.
