
Deep Residual Learning for

Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Microsoft Research

presented by Tomoki Tsuchida

Deeper Models Are Better


In the ImageNet (ILSVRC) competition, the winning models have gotten deeper every year.

So why not just train more layers (if you've got the hardware)?

Problems with Going Deeper


1. Vanishing / exploding gradient problem?

Better weight initialization and batch normalization have practically solved it (even for deeper nets).

2. Performance degradation problem:

Networks with more layers perform worse on the test data.

This is not overfitting, since they also do worse on the training data.

[Figure: training error (%) and test error (%) vs. iter. (1e4) for 20-layer and 56-layer plain networks; the 56-layer network has higher error in both plots.]

In practice, just stacking more layers makes the network harder to train.

Deep Residual Learning

In principle, a deeper network should be able to do at least as well as a shallower one by using identity mappings for the extra layers.

But in practice, these identity mappings (or better) are not learned.

Idea: learn the residual with respect to the input.

Instead of learning H(x) directly, learn F(x) s.t. H(x) = F(x) + x.

Hypothesis: F(x) is easier to learn than H(x).

e.g. for an identity mapping, learning F(x) = 0 is easier than learning H(x) = x (a code sketch follows the block diagram below).
[Figure: residual block. The input x passes through weight layer → relu → weight layer to produce F(x); a shortcut connection carries x (identity) around the block, and the sum F(x)+x goes through a final relu.]
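Below is a minimal sketch of such a residual block in PyTorch (my own illustration for this presentation, not the authors' released code); the two 3x3 conv layers play the role of the weight layers in the diagram, and the shortcut adds the input back before the final ReLU. The class name and channel count are arbitrary choices for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """Illustrative residual block: output = relu(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # Two 3x3 conv layers implement the residual function F(x).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))   # this is F(x)
        return F.relu(out + x)            # shortcut: add x, then relu


# Usage: the feature map keeps its shape, so the identity shortcut is valid.
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))     # -> torch.Size([1, 64, 56, 56])
```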

[Figure: side-by-side architectures of VGG-19, the 34-layer plain network, and the 34-layer residual network, shown as stacks of 3x3 conv layers (64, 128, 256, 512 filters) with pooling or stride-2 downsampling at output sizes 224, 112, 56, 28, and 14.]
Related Ideas

Highway networks (Srivastava, Greff, and Schmidhuber, 2015)

A superset of the residual network architecture: the residual transformation and the identity map are gated, and the gates are parameterized (see the sketch below).

LSTM

LSTMs unrolled in time also have the same information flow (with gates).
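For comparison, here is a toy fully-connected highway layer (a sketch of the idea under my own simplifications, not Srivastava et al.'s implementation): a learned gate T(x) mixes the transform with the identity, whereas a residual block leaves both paths permanently open instead of gating them.

```python
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    """Toy fully-connected highway layer (illustration only)."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x)
        self.gate = nn.Linear(dim, dim)       # parameters of the gate T(x)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))       # T(x) in (0, 1)
        # Gated mix of transform and identity. A residual block is the
        # ungated case: the transform and the identity always both pass.
        return h * t + x * (1.0 - t)
```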

Results (ImageNet)

[Figure: error (%) vs. iter. (1e4) for 18-layer and 34-layer networks. Left: plain-18 and plain-34. Right: ResNet-18 and ResNet-34. Thin curves show training error, bold curves show validation error.]

With the residual architecture, they lowered both the training (thin) and validation (bold) errors with deeper nets, as was hoped.

Shortcut Parameters
A. zero-padding for increasing dimensions
B. projection shortcuts for increasing dimensions

y = F(x, {Wi}) + Ws x

C. all shortcuts are projections

Extra projection parameters didn't yield much improvement (option B is sketched below).
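As an illustration of option B, here is a PyTorch sketch (names and hyperparameters are my own, not the paper's code) of a block whose shortcut uses a 1x1 convolution Ws to match the increased channel count and the stride-2 downsampling, so that y = F(x, {Wi}) + Ws x is well defined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionResidualBlock(nn.Module):
    """Illustrative block for y = F(x, {Wi}) + Ws*x when dimensions change."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Ws: a 1x1 conv on the shortcut that matches channels and stride.
        self.proj = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)

    def forward(self, x):
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(f + self.proj(x))


# e.g. going from a 64-channel 56x56 map to a 128-channel 28x28 map.
y = ProjectionResidualBlock(64, 128)(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```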

Going Deeper (on CIFAR-10)


[Figure: error (%) vs. iter. (1e4) on CIFAR-10. Left: plain nets (plain-20/32/44/56), with the 20-layer and 56-layer curves labeled. Middle: ResNets (ResNet-20/32/44/56/110), with the 20-layer and 110-layer curves labeled. Right: residual-110 vs. residual-1202.]

Layer Responses
[Figure: std of layer responses for plain-20, plain-56, ResNet-20, ResNet-56, and ResNet-110. Top: by original layer index. Bottom: sorted by magnitude.]

ResNets generally produce smaller layer responses than plain nets, since the residuals are closer to zero.
(But these are measured after BN, so why are there differences? Small values?)
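One way to reproduce this analysis, sketched below under my own assumptions (hooking every BatchNorm2d output as a proxy for the "3x3 conv response after BN, before the nonlinearity/addition"): record each layer's response standard deviation with forward hooks during a single forward pass, then compare a plain net against a ResNet.

```python
import torch
import torch.nn as nn


def layer_response_stds(model, inputs):
    """Std of each BatchNorm2d output during one forward pass (illustrative)."""
    stds, hooks = [], []

    def hook(_module, _inputs, output):
        stds.append(output.detach().std().item())

    # Attach a forward hook to every BN layer in the network.
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(hook))

    model.eval()
    with torch.no_grad():
        model(inputs)

    for h in hooks:
        h.remove()
    return stds
```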

Conclusion

The ResNet architecture allows training of very deep nets (>30 layers) without performance degradation.

Won 1st place in the ImageNet 2015 and COCO 2015 competitions.

The key idea is to model the residual relative to the input instead of the full mapping.

Can be thought of as a simplified version of the highway network design, which is in turn a simplified version of a time-unrolled LSTM.

It's a trade-off between flexibility and where you use your parameters.
