
10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Deep Learning (CNNs)

Matt Gormley
Lecture 21
April 05, 2017

Deep Learning Readings:
Murphy 28
Bishop --
HTF --
Mitchell --

1
Reminders
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring
• Homework 7: Deep Learning
  – Release: Wed, Apr. 05
  – Watch for multiple due dates!!

2
BACKPROPAGATION

3
Background
A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

4
Training Backpropagation

Whiteboard
– Example: Backpropagation for Calculus Quiz #1

Calculus Quiz #1:
Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

5
Training Backpropagation
Automatic Differentiation – Reverse Mode (a.k.a. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable ui with inputs v1, …, vN:
   a. Compute ui = gi(v1, …, vN)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/duj to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable ui = gi(v1, …, vN):
   a. We already know dy/dui
   b. Increment dy/dvj by (dy/dui)(dui/dvj)
      (Choice of algorithm ensures computing (dui/dvj) is easy)

Return partial derivatives dy/dui for all variables.

6


Training Backpropagation
Simple Example: The goal is to compute J = cos(sin(x²) + 3x²) on the forward pass and the derivative dJ/dx on the backward pass.

Forward          Backward
J = cos(u)       dJ/du = -sin(u)
u = u1 + u2      dJ/du1 = (dJ/du)(du/du1),  du/du1 = 1;   dJ/du2 = (dJ/du)(du/du2),  du/du2 = 1
u1 = sin(t)      dJ/dt += (dJ/du1)(du1/dt),  du1/dt = cos(t)
u2 = 3t          dJ/dt += (dJ/du2)(du2/dt),  du2/dt = 3
t = x²           dJ/dx = (dJ/dt)(dt/dx),  dt/dx = 2x

7
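The table above transcribes directly into code. This sketch evaluates the forward pass, accumulates the backward pass line by line (note the `+=` where t receives contributions from both u1 and u2), and checks dJ/dx against a finite-difference estimate:

```python
import math

def forward(x):
    t  = x ** 2
    u1 = math.sin(t)
    u2 = 3 * t
    u  = u1 + u2
    J  = math.cos(u)
    return J, (t, u1, u2, u)

def backward(x):
    J, (t, u1, u2, u) = forward(x)
    dJ_du  = -math.sin(u)          # dJ/du = -sin(u)
    dJ_du1 = dJ_du * 1             # du/du1 = 1
    dJ_du2 = dJ_du * 1             # du/du2 = 1
    dJ_dt  = dJ_du1 * math.cos(t)  # du1/dt = cos(t)
    dJ_dt += dJ_du2 * 3            # du2/dt = 3 (increment!)
    dJ_dx  = dJ_dt * 2 * x         # dt/dx = 2x
    return dJ_dx

x, eps = 2.0, 1e-6
fd = (forward(x + eps)[0] - forward(x - eps)[0]) / (2 * eps)
assert abs(backward(x) - fd) < 1e-6
```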
Training Backpropagation
Case 1: Logistic Regression
(Network: inputs x_1, …, x_M with parameters θ_1, θ_2, θ_3, …, θ_M feeding a single output)

Forward                                Backward
J = y* log y + (1 - y*) log(1 - y)     dJ/dy = y*/y + (1 - y*)/(y - 1)
y = 1 / (1 + exp(-a))                  dJ/da = (dJ/dy)(dy/da),   dy/da = exp(-a) / (exp(-a) + 1)²
a = Σ_{j=0}^{D} θ_j x_j                dJ/dθ_j = (dJ/da)(da/dθ_j),   da/dθ_j = x_j
                                       dJ/dx_j = (dJ/da)(da/dx_j),   da/dx_j = θ_j

9
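The logistic regression forward/backward columns above can be sketched as follows. The code follows the slide's formulas verbatim (including its form of the log-likelihood objective); the specific values of θ, x, and y* below are made up purely for the finite-difference check.

```python
import math

def forward(theta, x, y_star):
    a = sum(th * xj for th, xj in zip(theta, x))  # a = sum_j theta_j x_j
    y = 1.0 / (1.0 + math.exp(-a))                # sigmoid
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)
    return J, a, y

def backward(theta, x, y_star):
    J, a, y = forward(theta, x, y_star)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = math.exp(-a) / (math.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = [dJ_da * xj for xj in x]      # da/dtheta_j = x_j
    dJ_dx     = [dJ_da * th for th in theta]  # da/dx_j = theta_j
    return dJ_dtheta, dJ_dx

# Finite-difference check on theta_0 (example values are illustrative)
theta, x, y_star = [0.5, -0.3, 0.8], [1.0, 2.0, -1.0], 1.0
g, _ = backward(theta, x, y_star)
eps = 1e-6
tp = theta[:]; tp[0] += eps
tm = theta[:]; tm[0] -= eps
fd = (forward(tp, x, y_star)[0] - forward(tm, x, y_star)[0]) / (2 * eps)
assert abs(g[0] - fd) < 1e-6
```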
Training Backpropagation

(F) Loss:             J = ½ (y - y*)²
(E) Output (sigmoid): y = 1 / (1 + exp(-b))
(D) Output (linear):  b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j
(B) Hidden (linear):  a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input:            Given x_i, ∀i

(Network: Input → Hidden Layer → Output)

10
Training Backpropagation
Case 2: Neural Network

Forward                                              Backward

(Loss)     J = y* log y + (1 - y*) log(1 - y)        dJ/dy = y*/y + (1 - y*)/(y - 1)
(Sigmoid)  y = 1 / (1 + exp(-b))                     dJ/db = (dJ/dy)(dy/db),   dy/db = exp(-b) / (exp(-b) + 1)²
(Linear)   b = Σ_{j=0}^{D} β_j z_j                   dJ/dβ_j = (dJ/db)(db/dβ_j),   db/dβ_j = z_j
                                                     dJ/dz_j = (dJ/db)(db/dz_j),   db/dz_j = β_j
(Sigmoid)  z_j = 1 / (1 + exp(-a_j))                 dJ/da_j = (dJ/dz_j)(dz_j/da_j),   dz_j/da_j = exp(-a_j) / (exp(-a_j) + 1)²
(Linear)   a_j = Σ_{i=0}^{M} α_{ji} x_i              dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}),   da_j/dα_{ji} = x_i
                                                     dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j)(da_j/dx_i),   da_j/dx_i = α_{ji}

12
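The neural network case above can be sketched directly from the Forward/Backward columns. This is an illustrative sketch, not the course's implementation; the tiny α, β, x, y* values below are invented solely so a finite-difference check can verify the gradient of one weight.

```python
import math

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))

def forward(alpha, beta, x, y_star):
    a = [sum(alpha[j][i] * x[i] for i in range(len(x)))
         for j in range(len(alpha))]                # hidden linear
    z = [sigmoid(aj) for aj in a]                   # hidden sigmoid
    b = sum(beta[j] * z[j] for j in range(len(z)))  # output linear
    y = sigmoid(b)                                  # output sigmoid
    J = y_star * math.log(y) + (1 - y_star) * math.log(1 - y)
    return J, a, z, b, y

def backward(alpha, beta, x, y_star):
    J, a, z, b, y = forward(alpha, beta, x, y_star)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * math.exp(-b) / (math.exp(-b) + 1) ** 2
    dJ_dbeta = [dJ_db * zj for zj in z]             # db/dbeta_j = z_j
    dJ_dz = [dJ_db * bj for bj in beta]             # db/dz_j = beta_j
    dJ_da = [dJ_dz[j] * math.exp(-a[j]) / (math.exp(-a[j]) + 1) ** 2
             for j in range(len(a))]
    dJ_dalpha = [[dJ_da[j] * xi for xi in x] for j in range(len(a))]
    return dJ_dalpha, dJ_dbeta

# Finite-difference check on one hidden weight alpha[0][0]
alpha = [[0.1, -0.2], [0.3, 0.4]]
beta = [0.5, -0.6]
x, y_star = [1.0, 2.0], 1.0
dA, dB = backward(alpha, beta, x, y_star)
eps = 1e-6
alpha[0][0] += eps; Jp = forward(alpha, beta, x, y_star)[0]
alpha[0][0] -= 2 * eps; Jm = forward(alpha, beta, x, y_star)[0]
alpha[0][0] += eps
fd = (Jp - Jm) / (2 * eps)
assert abs(dA[0][0] - fd) < 1e-5
```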
Training Backpropagation

Whiteboard
– SGD  for  Neural  Network
– Example:  Backpropagation  for  Neural  Network

14
Training Backpropagation
Backpropagation (Automatic Differentiation – Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/duj to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable ui = gi(v1, …, vN):
   a. We already know dy/dui
   b. Increment dy/dvj by (dy/dui)(dui/dvj)
      (Choice of algorithm ensures computing (dui/dvj) is easy)

Return partial derivatives dy/dui for all variables.

15
Background
A Recipe for Machine Learning (Gradients)

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

16
Summary
1. Neural Networks…
   – provide a way of learning features
   – are highly nonlinear prediction functions
   – (can be) a highly parallel network of logistic regression classifiers
   – discover useful hidden representations of the input
2. Backpropagation…
   – provides an efficient way to compute gradients
   – is a special case of reverse-mode automatic differentiation

17
DEEP  LEARNING

18
Deep Learning Outline
• Background: Computer Vision
  – Image Classification
  – ILSVRC 2010 - 2016
  – Traditional Feature Extraction Methods
  – Convolution as Feature Extraction
• Convolutional Neural Networks (CNNs)
  – Learning Feature Abstractions
  – Common CNN Layers:
    • Convolutional Layer
    • Max-Pooling Layer
    • Fully-connected Layer (w/ tensor input)
    • Softmax Layer
    • ReLU Layer
  – Background: Subgradient
  – Architecture: LeNet
  – Architecture: AlexNet
• Training a CNN
  – SGD for CNNs
  – Backpropagation for CNNs

19
Motivation
Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
  – DeepMind: acquired by Google for $400 million
  – DNNResearch: three-person startup (including Geoff Hinton) acquired by Google for unknown price tag
  – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times

20
Motivation
Why is everyone talking about Deep Learning?

Deep learning:
– Has won numerous pattern recognition competitions
– Does so with minimal feature engineering

This wasn't always the case! Since the 1980s: form of models hasn't changed much, but lots of new tricks…
– More hidden units
– Better (online) optimization
– New nonlinear functions (ReLUs)
– Faster computers (CPUs and GPUs)

(Timeline on slide: 1960s, 1980s, 1990s, 2006, 2016)

21
BACKGROUND:  COMPUTER  VISION

22
Example: Image Classification
• ImageNet LSVRC-2011 contest:
  – Dataset: 1.2 million labeled images, 1000 classes
  – Task: Given a new image, label it with the correct class
  – Multiclass classification problem
• Examples from http://image-net.org/

23
Example:  Image  Classification
Traditional  Feature  Extraction  for  Images:
– SIFT
– HOG

27
Example: Image Classification
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012)
15.3% error on ImageNet LSVRC-2012 contest

Input image (pixels) → five convolutional layers (w/ max-pooling) → three fully connected layers → 1000-way softmax

28

[Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities]

CNNs for Image Recognition
(slide from Kaiming He's recent presentation)

29
CONVOLUTION

30
What's a convolution?
• Basic idea:
  – Pick a 3x3 matrix F of weights
  – Slide this over an image and compute the "inner product" (similarity) of F and the corresponding field of the image, and replace the pixel in the center of the field with the output of the inner product operation
• Key point:
  – Different convolutions extract different types of low-level "features" from an image
  – All that we need to vary to generate these different features is the weights of F

Slide adapted from William Cohen


Background: Image Processing
A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc.

Input Image:
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Convolution:
0 0 0
0 1 1
0 1 0

Convolved Image:
3 2 2 3 1
2 0 2 1 0
2 2 1 0 0
3 1 0 0 0
1 0 0 0 0

33
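The sliding-window computation above can be reproduced in a few lines. A note on conventions: the slides slide the kernel without flipping it (strictly speaking, cross-correlation), and this sketch does the same. The image and 3x3 kernel are the ones from the example, as reconstructed here from the slide.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no flipping, stride 1,
    'valid' region only) and take the inner product at each position."""
    n, m, k = len(image), len(image[0]), len(kernel)
    out = []
    for r in range(n - k + 1):
        row = []
        for c in range(m - k + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[0,0,0,0,0,0,0],
         [0,1,1,1,1,1,0],
         [0,1,0,0,1,0,0],
         [0,1,0,1,0,0,0],
         [0,1,1,0,0,0,0],
         [0,1,0,0,0,0,0],
         [0,0,0,0,0,0,0]]
kernel = [[0,0,0],
          [0,1,1],
          [0,1,0]]
assert convolve2d(image, kernel) == [[3,2,2,3,1],
                                     [2,0,2,1,0],
                                     [2,2,1,0,0],
                                     [3,1,0,0,0],
                                     [1,0,0,0,0]]
```

A 7x7 input with a 3x3 kernel yields a 5x5 output, matching the slide.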
Background: Image Processing
A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc.

Input Image:
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Identity Convolution:
0 0 0
0 1 0
0 0 0

Convolved Image:
1 1 1 1 1
1 0 0 1 0
1 0 1 0 0
1 1 0 0 0
1 0 0 0 0

45
Background: Image Processing
A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc.

Input Image:
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Blurring Convolution:
.1 .1 .1
.1 .2 .1
.1 .1 .1

Convolved Image:
.4 .5 .5 .5 .4
.4 .2 .3 .6 .3
.5 .4 .4 .2 .1
.5 .6 .2 .1 0
.4 .3 .1 0 0

46
What's a convolution?
http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo

Slide from William Cohen

Downsampling
• Suppose we use a convolution with stride 2
• Only 9 patches visited in input, so only 9 pixels in output

Input Image:
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Convolution:
1 1
1 1

Convolved Image:
3 3 1
3 1 0
1 0 0

54
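The stride-2 example above can be sketched by adding a stride parameter to the sliding loop. The 2x2 all-ones kernel is reconstructed here from the slide's example; with stride 2 on a 6x6 input, only 9 patches are visited, so the output is 3x3.

```python
def convolve2d_strided(image, kernel, stride):
    """Cross-correlation over the 'valid' region, stepping by `stride`."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[1,1,1,1,1,0],
         [1,0,0,1,0,0],
         [1,0,1,0,0,0],
         [1,1,0,0,0,0],
         [1,0,0,0,0,0],
         [0,0,0,0,0,0]]
ones2x2 = [[1,1],
           [1,1]]
assert convolve2d_strided(image, ones2x2, 2) == [[3,3,1],
                                                 [3,1,0],
                                                 [1,0,0]]
```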
CONVOLUTIONAL  NEURAL  NETS

64
Deep Learning Outline
• Background: Computer Vision
  – Image Classification
  – ILSVRC 2010 - 2016
  – Traditional Feature Extraction Methods
  – Convolution as Feature Extraction
• Convolutional Neural Networks (CNNs)
  – Learning Feature Abstractions
  – Common CNN Layers:
    • Convolutional Layer
    • Max-Pooling Layer
    • Fully-connected Layer (w/ tensor input)
    • Softmax Layer
    • ReLU Layer
  – Background: Subgradient
  – Architecture: LeNet
  – Architecture: AlexNet
• Training a CNN
  – SGD for CNNs
  – Backpropagation for CNNs

65
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies

Architecture #1: LeNet-5

66
Convolutional Layer
CNN key idea: treat the convolution matrix as parameters and learn them!

Input Image:
0 0 0 0 0 0 0
0 1 1 1 1 1 0
0 1 0 0 1 0 0
0 1 0 1 0 0 0
0 1 1 0 0 0 0
0 1 0 0 0 0 0
0 0 0 0 0 0 0

Learned Convolution:
θ11 θ12 θ13
θ21 θ22 θ23
θ31 θ32 θ33

Convolved Image:
.4 .5 .5 .5 .4
.4 .2 .3 .6 .3
.5 .4 .4 .2 .1
.5 .6 .2 .1 0
.4 .3 .1 0 0

67
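Since the kernel entries θ are now parameters, training needs dJ/dθ. The slides defer this to the whiteboard; as a hedged sketch of the idea, each θ_mn touches every output position, so dJ/dθ_mn sums the upstream gradient times the input pixel under that kernel entry. The toy image, kernel values, and the downstream loss J = sum of all outputs are assumptions made here purely so the gradient can be checked by finite differences.

```python
def conv2d(image, kernel):
    """Cross-correlation over the 'valid' region, stride 1."""
    k = len(kernel)
    return [[sum(kernel[m][n] * image[r + m][c + n]
                 for m in range(k) for n in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]

def kernel_grad(image, dJ_dout, k):
    """dJ/dtheta_mn = sum_{r,c} dJ/dout[r][c] * image[r+m][c+n]."""
    return [[sum(dJ_dout[r][c] * image[r + m][c + n]
                 for r in range(len(dJ_dout))
                 for c in range(len(dJ_dout[0])))
             for n in range(k)] for m in range(k)]

image = [[0,0,0,0,0],
         [0,1,2,1,0],
         [0,2,0,2,0],
         [0,1,2,1,0],
         [0,0,0,0,0]]
theta = [[0.1,0.2,0.3],[0.4,0.5,0.6],[0.7,0.8,0.9]]

# Toy downstream loss J = sum of all output values, so dJ/dout = 1 everywhere.
out = conv2d(image, theta)
ones = [[1.0] * len(out[0]) for _ in out]
grad = kernel_grad(image, ones, 3)

# Finite-difference check on theta[0][0]
eps = 1e-6
theta[0][0] += eps
Jp = sum(map(sum, conv2d(image, theta)))
theta[0][0] -= 2 * eps
Jm = sum(map(sum, conv2d(image, theta)))
theta[0][0] += eps
assert abs(grad[0][0] - (Jp - Jm) / (2 * eps)) < 1e-6
```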
Downsampling by Averaging
• Downsampling by averaging used to be a common approach
• This is a special case of convolution where the weights are fixed to a uniform distribution
• The example below uses a stride of 2

Input Image:
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Convolution:
1/4 1/4
1/4 1/4

Convolved Image:
3/4 3/4 1/4
3/4 1/4 0
1/4 0 0

68
Max-Pooling
• Max-pooling is another (common) form of downsampling
• Instead of averaging, we take the max value within the same range as the equivalently-sized convolution
• The example below uses a stride of 2

Input Image:
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0

Max-pooling: out = max(x_{i,j}, x_{i,j+1}, x_{i+1,j}, x_{i+1,j+1})

Max-Pooled Image:
1 1 1
1 1 0
1 0 0

69
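Max-pooling has the same sliding structure as the strided convolution above, with the inner product replaced by a max. A minimal sketch, using the slide's 6x6 input with a 2x2 window and stride 2:

```python
def max_pool(image, size=2, stride=2):
    """Take the max over each size x size window, stepping by `stride`."""
    out = []
    for r in range(0, len(image) - size + 1, stride):
        row = []
        for c in range(0, len(image[0]) - size + 1, stride):
            row.append(max(image[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

image = [[1,1,1,1,1,0],
         [1,0,0,1,0,0],
         [1,0,1,0,0,0],
         [1,1,0,0,0,0],
         [1,0,0,0,0,0],
         [0,0,0,0,0,0]]
assert max_pool(image) == [[1,1,1],
                           [1,1,0],
                           [1,0,0]]
```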
Multi-Class Output

(Network: Input → Hidden Layer → Output)

71
Multi-Class Output

Softmax Layer:

(F) Loss:               J = Σ_{k=1}^{K} y_k* log(y_k)
(E) Output (softmax):   y_k = exp(b_k) / Σ_{l=1}^{K} exp(b_l)
(D) Output (linear):    b_k = Σ_{j=0}^{D} β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear):    a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input:              Given x_i, ∀i

(Network: Input → Hidden Layer → Output)

72
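The softmax layer above can be sketched as follows. One caveat: subtracting max(b) before exponentiating is a standard numerical-stability trick added here, not something on the slide; it leaves the output unchanged because the shift cancels in the ratio.

```python
import math

def softmax_layer(b):
    """y_k = exp(b_k) / sum_l exp(b_l), shifted by max(b) for stability."""
    m = max(b)
    exps = [math.exp(bk - m) for bk in b]
    s = sum(exps)
    return [e / s for e in exps]

def loss(y, y_star):
    """J = sum_k y*_k log(y_k), following the slide's form of the loss."""
    return sum(ys * math.log(yk) for ys, yk in zip(y_star, y))

y = softmax_layer([2.0, 1.0, 0.1])
assert abs(sum(y) - 1.0) < 1e-12   # a proper distribution over K classes
assert y[0] > y[1] > y[2]          # larger b_k gives larger y_k
```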
Training  a  CNN
Whiteboard
– SGD  for  CNNs
– Backpropagation  for  CNNs

73
Common CNN Layers
Whiteboard
– ReLU Layer
– Background: Subgradient
– Fully-connected Layer (w/ tensor input)
– Softmax Layer
– Convolutional Layer
– Max-Pooling Layer

74
Convolutional  Layer

75
Convolutional  Layer

76
Max-­‐Pooling  Layer

77
Max-­‐Pooling  Layer

78
Convolutional Neural Network (CNN)
• Typical layers include:
  – Convolutional layer
  – Max-pooling layer
  – Fully-connected (Linear) layer
  – ReLU layer (or some other nonlinear activation function)
  – Softmax
• These can be arranged into arbitrarily deep topologies

Architecture #1: LeNet-5

79
Architecture #2: AlexNet
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012)
15.3% error on ImageNet LSVRC-2012 contest

Input image (pixels) → five convolutional layers (w/ max-pooling) → three fully connected layers → 1000-way softmax

80

[Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities]

CNNs for Image Recognition
(slide from Kaiming He's recent presentation)

81
CNN  VISUALIZATIONS

83
3D  Visualization  of  CNN
http://scs.ryerson.ca/~aharley/vis/conv/
Convolution of a Color Image
• Color images consist of 3 floats per pixel for RGB (red, green, blue) color values
• Convolution must also be 3-dimensional

A closer look at spatial dimensions:
[Figure: a 32x32x3 image is convolved (slid) over all spatial locations with a 5x5x3 filter, producing a 28x28x1 activation map]

85
Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)
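The 3-dimensional convolution described above can be sketched to confirm the shape arithmetic: the filter spans the full channel depth, so each spatial position produces a single number, and a 32x32x3 image with a 5x5x3 filter yields a 28x28 activation map (32 - 5 + 1 = 28). The zero image and all-ones filter below are placeholders for the check.

```python
def conv3d_valid(image, filt):
    """image: H x W x C, filt: k x k x C.
    The filter spans all C channels, so the output is a 2-D
    (H - k + 1) x (W - k + 1) activation map."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(filt)
    return [[sum(filt[i][j][ch] * image[r + i][c + j][ch]
                 for i in range(k) for j in range(k) for ch in range(C))
             for c in range(W - k + 1)]
            for r in range(H - k + 1)]

# A 32x32x3 image and a 5x5x3 filter give a 28x28 activation map.
image = [[[0.0] * 3 for _ in range(32)] for _ in range(32)]
filt = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
amap = conv3d_valid(image, filt)
assert len(amap) == 28 and len(amap[0]) == 28
```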
Animation  of  3D  Convolution
http://cs231n.github.io/convolutional-networks/

86
Figure  from  Fei-­‐Fei Li  &  Andrej  Karpathy &  Justin  Johnson  (CS231N)  
MNIST  Digit  Recognition  with  CNNs  
(in  your  browser)
https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

87
Figure  from  Andrej  Karpathy
CNN Summary
CNNs
– Are used for all aspects of computer vision, and have won numerous pattern recognition competitions
– Are able to learn interpretable features at different levels of abstraction
– Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers

Other Resources:
– Readings on course website
– Andrej Karpathy, CS231n Notes: http://cs231n.github.io/convolutional-networks/

88
