Segmentation and Semantic Segmentation
C. V. Jawahar and Girish Varma
IIIT Hyderabad
The 3 core problems
• Reconstruction
• Reorganization (Grouping)
• Recognition

Machine Learning and Convolutional Networks connect all three.
Segmentation

Goal: find coherent “blobs” or specific “objects”; accurate boundary delineation is often required.

Segmentation spans a spectrum: lower-level tasks (e.g. “superpixels”), a large grey area in between, and higher-level tasks (e.g. cars, humans, or organs).
Segmentation and contour detection
The BSDS Benchmark (2001)
• 5 humans annotate boundaries; take the union
• Algorithm assigns a “probability of boundary” to each pixel
• Thresholding gives a binary map
History from Low/Mid-Level Vision
Malik, BSDS, ICCV 2001

Boundary-detection F-scores over the decades: Sobel (1968, 0.48), Canny (1986, 0.54), Martin (2004, 0.63), Maire (2008, 0.70), Human (0.79).

Use any visual cue as input; learn to combine the individual inputs.

Maire, Arbeláez, et al., IEEE PAMI 2011
More on edge/contour [F is same or better than Human]

Richer Convolutional Features for Edge Detection, CVPR 2017
Figure 7: Some examples of RCF. From top to bottom: BSDS500 [2], NYUD [49], Multicue-Boundary [41], and Multicue-Edge [41]. From left to right: original image, ground truth, and RCF edge map.
Another View Point: Finer Understanding

Is there a dog in this image? If yes, where is the dog?
Which pixels exactly? What breed? (e.g. American Bulldog)

PASCAL VOC 2005-2012
20 object classes, 22,591 images
• Classification: person, motorcycle
• Detection: person, motorcycle (bounding boxes)
• Segmentation: person, motorcycle (pixel masks)
• Action: riding bicycle

Everingham, Van Gool, Williams, Winn and Zisserman.
The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.
Space of Computer Vision Tasks
• Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
• Classification + Localization: CAT (single object)
• Object Detection: DOG, DOG, CAT (multiple objects)
• Instance Segmentation: DOG, DOG, CAT (multiple objects)
Semantic Segmentation
§ Semantic Segmentation
§ Labelling every pixel in an image
§ A key part of Scene Understanding

§ Applications
§ Autonomous navigation
§ Assisting the partially sighted
§ Medical diagnosis
§ Image editing

(Clockwise from top) [1] Cityscapes Dataset. [2] ISBI Challenge 2015, dental x-ray images. [3] Royal National Institute of Blind People
A quick tour of “Segmentation”
Age Old Methods
Example: assume known probability distributions
$P_1 = N(\mu_1, \sigma)$ and $P_2 = N(\mu_2, \sigma)$

Thresholding can be derived as a statistical decision: a likelihood ratio test. With
$\rho_p := \log \frac{P_1(I_p)}{P_2(I_p)}$,
where $P_1$ and $P_2$ are the known object and background colour models:

$\rho_p \ge 0 \Rightarrow$ pixel $p$ is object
$\rho_p < 0 \Rightarrow$ pixel $p$ is background

For equal variances, $\rho_p \ge 0 \Leftrightarrow I_p \ge T$ with threshold $T = \frac{\mu_1 + \mu_2}{2}$.
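To make the decision rule concrete, here is a minimal sketch in Python/NumPy, assuming two known equal-variance Gaussian intensity models; the image and the parameter values are placeholders, not part of the slides.

```python
# A minimal sketch of thresholding as a likelihood-ratio test.
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 170.0, 60.0, 25.0   # hypothetical object/background models

img = np.random.randint(0, 256, (64, 64)).astype(np.float64)  # stand-in image

# Log-likelihood ratio: rho_p = log P1(I_p) - log P2(I_p)
rho = norm.logpdf(img, mu1, sigma) - norm.logpdf(img, mu2, sigma)
mask = rho >= 0                        # rho >= 0  =>  pixel p is object

# With equal variances this reduces to a threshold at T = (mu1 + mu2) / 2:
assert np.array_equal(mask, img >= (mu1 + mu2) / 2)
```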
Segmentation as clustering (Unsupervised Learning)

Distance based on color only


Source: K. Grauman
Segmentation as clustering

Distance based
on color and
position

Source: K. Grauman
Segmentation Unsupervised

K Means
K Means Algorithm
• There are K clusters {1, 2, …, K} with means µ1, …, µK
• The least-squares error is defined as $J = \sum_{k=1}^{K} \sum_{p \in C_k} \lVert x_p - \mu_k \rVert^2$, where $C_k$ is the set of pixels assigned to cluster $k$
• Problem: out of all possible partitions into K clusters, choose the one that minimizes J.
• Solution: (i) assign pixels to the K clusters, (ii) compute the mean of each cluster, (iii) reassign pixels and repeat. A minimal sketch follows this list.
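As a concrete sketch of these three steps, the following assumes scikit-learn is available; the random image and the choice K = 4 are placeholders.

```python
# A minimal sketch of segmentation as K-means clustering.
import numpy as np
from sklearn.cluster import KMeans

img = np.random.rand(80, 120, 3)          # stand-in RGB image
h, w, _ = img.shape

feats = img.reshape(-1, 3)                # distance based on colour only

# For colour + position, append normalized (x, y) coordinates:
ys, xs = np.mgrid[0:h, 0:w]
feats_xy = np.column_stack([feats, xs.ravel() / w, ys.ravel() / h])

# KMeans iterates: (i) assign pixels, (ii) recompute means, until J converges.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(feats_xy)
segments = labels.reshape(h, w)           # one cluster id per pixel
```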
K-means vs. GMM (optional material)

K-means:
■ hard assignment to clusters: separates data points into multiple Gaussian blobs
■ only estimates the means μi (Σi can also be added as a cluster parameter: elliptic K-means)
■ computationally cheap (block-coordinate descent)
■ sensitive to local minima
■ scales to higher dimensions (kernel K-means)

GMM:
■ soft mode searching: estimates the data distribution with multiple Gaussian modes
■ estimates both mean μi and (co)variance Σi for each mode
■ more expensive (EM algorithm, Ref: Szeliski Sec. 5.3.1)
■ sensitive to local minima
■ does not scale to higher dimensions

Hard clustering may not work well when clusters overlap (generally not a problem in segmentation, since objects do not “overlap” in RGBXY). Even an optimal GMM is hard to find with the standard EM algorithm due to local minima.
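A small sketch of the difference, assuming scikit-learn's GaussianMixture; the random features and the component count are placeholders.

```python
# Soft GMM responsibilities vs. a collapsed hard labelling.
import numpy as np
from sklearn.mixture import GaussianMixture

feats = np.random.rand(1000, 5)           # stand-in per-pixel RGBXY features

gmm = GaussianMixture(n_components=4, covariance_type='full').fit(feats)
resp = gmm.predict_proba(feats)           # soft assignments; rows sum to 1
hard = resp.argmax(axis=1)                # collapse to a hard labelling

# Each mode carries a full covariance, unlike plain K-means (means only):
print(gmm.means_.shape, gmm.covariances_.shape)   # (4, 5) (4, 5, 5)
```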
GMM and FG/BG Segmentation in Videos
Spatial Structures
Segmentation Unsupervised
• K Means
• Normalized Cut [Shi and Malik 2000]


Images as graphs

(Figure: graph over pixels, with nodes i, j joined by an edge of weight wij.)

• Node for every pixel
• Edge between every pair of pixels (or every pair of “sufficiently close” pixels)
• Each edge is weighted by the affinity or similarity of the two nodes
Source: S. Seitz
Segmentation by graph partitioning

• Break the graph into segments (e.g. A, B, C)
• Delete links that cross between segments
• Easiest to break links that have low affinity
  • similar pixels should be in the same segment
  • dissimilar pixels should be in different segments
Source: S. Seitz
Graph cut

• Set of edges whose removal makes a graph disconnected
• Cost of a cut: sum of the weights of the cut edges
• A graph cut gives us a segmentation
• What is a “good” graph cut and how do we find one?
Source: S. Seitz
Normalized cut algorithm

1. Represent the image as a weighted graph G = (V, E); summarize the information in the weight matrix W and the degree matrix D, where D(i, i) = Σj W(i, j).
2. Solve (D − W)y = λDy for the eigenvector y with the second smallest eigenvalue.
3. Use the entries of this eigenvector to bipartition the graph.
4. Recursively partition the segmented parts, if necessary.

Shi and Malik, 2000
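A minimal sketch of steps 1-3, assuming SciPy and a dense colour-only affinity for clarity (a real implementation would use sparse affinities over nearby pixels only); the tiny random image and the bandwidth sigma are placeholders.

```python
# The normalized-cut relaxation: solve (D - W) y = lambda * D y.
import numpy as np
from scipy.linalg import eigh

img = np.random.rand(10, 10)                     # stand-in grey-level image
vals = img.ravel()

# Dense colour-only affinity: W(i, j) = exp(-(I_i - I_j)^2 / sigma^2).
sigma = 0.2
W = np.exp(-((vals[:, None] - vals[None, :]) ** 2) / sigma ** 2)
np.fill_diagonal(W, 0)

D = np.diag(W.sum(axis=1))                       # D(i, i) = sum_j W(i, j)

# Generalized eigenproblem; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(D - W, D)
y = eigvecs[:, 1]                                # second-smallest eigenvector

# Bipartition by thresholding the eigenvector entries.
partition = (y >= np.median(y)).reshape(img.shape)
```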


Success of N-Cut
• Formulation of segmentation as graph partitioning
• Pixels are not classified in “isolation”
• A nice and meaningful objective function
• A solution based on eigenvectors
• A generic framework that can work with many features
• Cons: (i) high storage requirement, (ii) bias towards equal-sized segments, (iii) approximates a discrete labeling problem (needs to threshold the eigenvectors)
Graph Formulation
• Node yi: pixel label
• Edge: constrained pair

Problem of Labeling: Input Image → GraphCut → Output Labelled Image

Cut = separating source and sink = segmentation
Adding regional properties
Another segmentation example [Boykov & Jolly ’01]

Appearance models q0 and q1 can also be obtained from user seeds.
Graph cuts as energy optimization for segmentation S [Boykov & Jolly 2001]

(Figure: grid graph with source s and sink t; t-links carry the unary costs $D_p(s)$ and $D_p(t)$, n-links between neighbouring pixels carry weights $w_{pq}$; a cut through this graph is a segmentation.)

Segmentation ⇔ cut, with $S_p \in \{0, 1\}$:

$E(S) = \text{cost(cut)} = \sum_{p \in S} D_p(1) + \sum_{p \in \bar{S}} D_p(0) + \sum_{pq \in N} w_{pq} \cdot [S_p \neq S_q]$

The first two sums are the cost of the severed t-links: unary terms encoding the regional properties of S. The last sum is the cost of the severed n-links: pairwise terms encoding boundary smoothness for S.
Segmentation as Graph Cut
Boykov (2001)
• Cut: separating source and sink
• Energy: the sum of weights over a collection of edges
• Min cut: globally minimal energy in polynomial time, under some regularity conditions

Segmentation is formulated as “labeling” each pixel as either foreground (0) or background (1).
Graph Cut: By Energy Minimization
• Node yi: pixel label
• Edge: constrained pair
• Unary: cost to assign a label to each pixel
• Pairwise: cost to assign a pair of labels to each pair of connected pixels
Example

Associative potentials: pay a cost when neighbouring pixels take different labels.
• Unary: cost to assign a label to each pixel
• Pairwise: cost to assign a pair of labels to each pair of connected pixels
How to obtain unary and pairwise?
• Unary:
  • Probability of the pixel being FG or BG
  • Can be “defined”; can be “learnt”
  • Also called the “data” term
• Pairwise:
  • Encodes a joint probability
  • Enforces some amount of “smoothness” or spatial coherence
  • Can be defined or learnt; many possible “valid” functions
A minimal sketch of both terms follows this list.
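The following sketch wires such unary and pairwise terms into a min-cut, assuming the PyMaxflow package is installed; the toy image, colour models, and constant Potts weight are placeholders.

```python
# A minimal sketch of binary FG/BG segmentation as a min-cut (PyMaxflow).
import numpy as np
import maxflow

img = np.random.rand(60, 80)              # stand-in grey-level image in [0, 1]

# Unary ("data") terms, e.g. negative log-likelihoods under FG/BG models;
# here simple squared distances to two hypothetical mean intensities.
D_fg = (img - 0.8) ** 2                   # cost of labelling a pixel FG
D_bg = (img - 0.2) ** 2                   # cost of labelling a pixel BG

g = maxflow.Graph[float]()
nodeids = g.add_grid_nodes(img.shape)

# Pairwise terms: a constant Potts penalty on the 4-connected grid
# (contrast-sensitive weights w_pq are the usual refinement).
g.add_grid_edges(nodeids, 0.1)

# t-links: source capacity D_bg, sink capacity D_fg, so that keeping a
# pixel on the source (FG) side pays D_fg via the severed sink link.
g.add_grid_tedges(nodeids, D_bg, D_fg)

g.maxflow()
sink_side = g.get_grid_segments(nodeids)  # True for sink-side pixels
fg_mask = ~sink_side                      # source side = foreground here
```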
Graph Cuts – Pros & Cons
Pros:
• Very fast inference; sub-second implementations
• Can incorporate data likelihoods and priors
• Applicable to a wide range of problems (stereo, labelling, recognition)
• Has nice connections to Markov Random Fields (MRFs) and Bayesian inference
• Many nice theoretical results (like global optima)

Cons:
• Not always globally optimal (the optimality theorem needs appropriate potentials)
• Approximate for multi-label problems (example: α-expansion)
Learning from Humans
Grab Cut [Rother et al. 2004]

User Interaction Provides the Supervision


Grab Cut
1. Define the graph: usually 4-connected or 8-connected
2. Define unary potentials
   • Colour histogram or mixture of Gaussians for background and foreground
3. Define pairwise potentials
   • Based on agreement of neighbouring pixels
4. Apply graph cuts
5. Return to 2, using the current labels to compute the foreground and background models

Details: unary/pairwise. (A sketch using OpenCV follows.)
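OpenCV ships this iterated loop as cv2.grabCut; a minimal sketch, where the image path, rectangle, and iteration count are placeholders.

```python
# GrabCut via OpenCV's built-in implementation.
import numpy as np
import cv2

img = cv2.imread('photo.jpg')                   # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)
rect = (50, 50, 300, 200)                       # user box: x, y, w, h

# Internal GMM state for background/foreground (fixed-size buffers).
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# 5 iterations of: fit GMMs to current labels, then graph cut (steps 2-5).
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Definite/probable foreground -> 1, everything else -> 0.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
result = img * fg[:, :, None].astype(img.dtype)
```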
Grab Cut using Iterated Graph Cuts

User initialisation, then iterations 1, 2, 3, 4; the energy decreases after each iteration. Guaranteed to converge?

Alternate two steps:
• Learn the foreground colour model
• Run graph cuts to infer the foreground
Learning Potentials with GMM

Iterated graph cut, with the colour models represented as Gaussian Mixture Models (typically 5-8 components).
Examples

… GrabCut completes automatically


Difficult Examples
From initial rectangle to initial result, failure modes include:
• Camouflage & low contrast
• Fine structure
• No telepathy (the system cannot guess the user’s intent)
GrabCut – Interactive Foreground Extraction


Segmentation of
“Categories”
Pixels to Superpixels
Graph based Models for Semantic Segmentation

Pipeline: Input image → Graph construction → Training of potentials (Learning) → MAP (Inference) → Final segmentation
Semantic Segmentation: Summary
• Supervision: none, human interaction, annotated examples of pixels and superpixels, weaker supervision from multiple images, use of priors
• Problems: “labeling” and “classification”
• Tools: graph cuts, MRFs, SVMs, random forests, etc.
Era of Deep Learning
Convolutional Neural Network (CNN)

A sequence of local & shift-invariant layers
Overview of CNNs
Data = 3D tensors

There is a vector of feature channels (e.g. RGB) at each spatial location (pixel), so an image with H × W pixels and C channels is an H × W × C tensor. Example: a convolution layer maps one such 3D tensor to another.
Linear / non-linear chains

The basic blueprint of most architectures: input data x passes through an alternating chain of filtering (with filter banks F), ReLU non-linearities, and downsampling, producing output data y.
DL and Image Classification

Input: Image
Output: P(c) (A vector/distribution of probabilities)
Semantic Segmentation Idea: Fully Convolutional

Convolutions, then pixel-wise classification: design the network as a stack of convolutional layers to make predictions for all pixels at once.

Input: 3 × H × W → convolutions: D × H × W → scores: C × H × W → argmax → predictions: H × W

Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 11, May 10, 2017
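A minimal sketch of this idea in PyTorch; the layer widths and the class count C are placeholders (real networks are much deeper and usually downsample internally for efficiency).

```python
# A tiny fully convolutional network: per-pixel class scores at full resolution.
import torch
import torch.nn as nn

C = 21                                  # e.g. 20 PASCAL VOC classes + background

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, C, kernel_size=1),    # 1x1 conv: D x H x W -> C x H x W
)

x = torch.randn(1, 3, 128, 128)         # 3 x H x W input
scores = net(x)                         # C x H x W class scores
preds = scores.argmax(dim=1)            # H x W label map
```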
In general: a convolutional encoder (conv1-conv6, interleaved with max pooling) feeds a convolutional decoder (deconv6-deconv1, interleaved with unpooling) that produces the prediction.

Figure 2: Architecture of the proposed fully convolutional encoder-decoder network.

From the accompanying paper: the first layer of the decoder, “deconv6”, performs dimension reduction, projecting the 4096-d “conv6” to 512-d with a 1×1 kernel so that the pooling switches from “conv5” can be re-used to upscale the feature maps by a factor of two in the following “deconv5” layer.
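A small sketch of the pooling-switch mechanism the decoder relies on, in PyTorch; the tensor shapes are placeholders.

```python
# Max pooling records argmax indices ("switches"); unpooling reuses them
# to place values back at their original spatial positions.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)          # a stand-in conv feature map

y, switches = pool(x)                   # encoder: downsample, keep argmaxes
up = unpool(y, switches)                # decoder: upscale by 2 using them

print(y.shape, up.shape)                # (1, 64, 16, 16) (1, 64, 32, 32)
```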
More (wait for GANs)
Summary/History/Relationships
• Initial methods
  • Simple unsupervised learning / clustering / partitioning
• Introduction of spatial relationships and formalisms
  • Graphs, graph cuts and energy minimization frameworks
• Learning from “one” image
  • Learn the unary potentials from human interaction; iterate
• Learning from many labelled examples
  • Popular semantic/instance segmentation
  • Finer understanding of the visual content
• Input: Image. Output: Image
  • Many low-level vision tasks in the same framework
  • Low-level vision, segmentation, generation, ??
More on Newer Methods/Trends

Acknowledge the extensive use of slides/images/content available online.
