Segmentation and Semantic Segmentation
C. V. Jawahar and Girish Varma
IIIT Hyderabad
The 3 core problems
• Reconstruction
• Reorganization (Grouping)
• Recognition

Machine Learning and Convolutional Networks connect all three.
Segmentation

Goal: find coherent “blobs” or specific “objects”; accurate boundary delineation is often required.

Segmentation spans a spectrum: lower-level tasks (e.g. “superpixels”), a large grey area in between, and higher-level tasks (e.g. cars, humans, or organs).
Segmentation and contour detection
The BSDS Benchmark (2001)
• 5 humans annotate boundaries; take the union
• Algorithm assigns a “probability of boundary” to each pixel
• Thresholding gives a binary map
History from Low/Mid-Level Vision
Malik, BSDS, ICCV 2001

Boundary-detection F-scores over the decades: Sobel (1968, 0.48), Canny (1986, 0.54), Martin (2004, 0.63), Maire (2008, 0.70), Human (0.79).

Use any visual cue as input; learn to combine the individual inputs.

Maire, Arbeláez, et al., IEEE PAMI 2011
More on edge/contour [F is same or better than Human]

Richer Convolutional Features for Edge Detection, CVPR 2017
Figure 7: Some examples of RCF. From top to bottom: BSDS500 [2], NYUD [49], Multicue-Boundary [41], and Multicue-Edge [41]. From left to right: original image, ground truth, and RCF edge map.
Another View Point: Finer Understanding

Is there a dog in this image? If yes, where is the dog?
Which pixels exactly? What breed? (e.g. American Bulldog)

PASCAL VOC 2005-2012
20 object classes, 22,591 images
• Classification: person, motorcycle
• Detection: person, motorcycle (bounding boxes)
• Segmentation: person, motorcycle (pixel masks)
• Action: riding bicycle

Everingham, Van Gool, Williams, Winn and Zisserman.
The PASCAL Visual Object Classes (VOC) Challenge. IJCV 2010.
Space of Computer Vision Tasks
• Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
• Classification + Localization: CAT (single object)
• Object Detection: DOG, DOG, CAT (multiple objects)
• Instance Segmentation: DOG, DOG, CAT (multiple objects)
Semantic Segmentation
§ Semantic Segmentation
§ Labelling every pixel in an image
§ A key part of Scene Understanding

§ Applications
§ Autonomous navigation
§ Assisting the partially sighted
§ Medical diagnosis
§ Image editing

(Clockwise from top) [1] Cityscapes Dataset. [2] ISBI Challenge 2015, dental x-ray images. [3] Royal National Institute of Blind People
A quick tour of “Segmentation”
Age Old Methods
Example: assume known probability distributions
$P_1 = N(\mu_1, \sigma)$ and $P_2 = N(\mu_2, \sigma)$

Thresholding can be derived as a statistical decision: a likelihood ratio test. With
$\rho_p := \log \frac{P_1(I_p)}{P_2(I_p)}$,
where $P_1$ and $P_2$ are the known object and background colour models:

$\rho_p \ge 0 \Rightarrow$ pixel $p$ is object
$\rho_p < 0 \Rightarrow$ pixel $p$ is background

For equal variances, $\rho_p \ge 0 \Leftrightarrow I_p \ge T$ with threshold $T = \frac{\mu_1 + \mu_2}{2}$.
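To make the decision rule concrete, here is a minimal sketch in Python/NumPy, assuming two known equal-variance Gaussian intensity models; the image and the parameter values are placeholders, not part of the slides.

```python
# A minimal sketch of thresholding as a likelihood-ratio test.
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 170.0, 60.0, 25.0   # hypothetical object/background models

img = np.random.randint(0, 256, (64, 64)).astype(np.float64)  # stand-in image

# Log-likelihood ratio: rho_p = log P1(I_p) - log P2(I_p)
rho = norm.logpdf(img, mu1, sigma) - norm.logpdf(img, mu2, sigma)
mask = rho >= 0                        # rho >= 0  =>  pixel p is object

# With equal variances this reduces to a threshold at T = (mu1 + mu2) / 2:
assert np.array_equal(mask, img >= (mu1 + mu2) / 2)
```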
Segmentation as clustering (Unsupervised Learning)

Distance based on color only


Source: K. Grauman
Segmentation as clustering

Distance based
on color and
position

Source: K. Grauman
Segmentation Unsupervised

K Means
K Means Algorithm
• There are K clusters {1, 2, …, K} with means µ1, …, µK
• The least-squares error is defined as $J = \sum_{k=1}^{K} \sum_{p \in C_k} \lVert x_p - \mu_k \rVert^2$, where $C_k$ is the set of pixels assigned to cluster $k$
• Problem: out of all possible partitions into K clusters, choose the one that minimizes J.
• Solution: (i) assign pixels to the K clusters, (ii) compute the mean of each cluster, (iii) reassign pixels and repeat. A minimal sketch follows this list.
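As a concrete sketch of these three steps, the following assumes scikit-learn is available; the random image and the choice K = 4 are placeholders.

```python
# A minimal sketch of segmentation as K-means clustering.
import numpy as np
from sklearn.cluster import KMeans

img = np.random.rand(80, 120, 3)          # stand-in RGB image
h, w, _ = img.shape

feats = img.reshape(-1, 3)                # distance based on colour only

# For colour + position, append normalized (x, y) coordinates:
ys, xs = np.mgrid[0:h, 0:w]
feats_xy = np.column_stack([feats, xs.ravel() / w, ys.ravel() / h])

# KMeans iterates: (i) assign pixels, (ii) recompute means, until J converges.
labels = KMeans(n_clusters=4, n_init=10).fit_predict(feats_xy)
segments = labels.reshape(h, w)           # one cluster id per pixel
```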
K-means vs. GMM (optional material)

K-means:
■ hard assignment to clusters: separates data points into multiple Gaussian blobs
■ only estimates the means μi (Σi can also be added as a cluster parameter: elliptic K-means)
■ computationally cheap (block-coordinate descent)
■ sensitive to local minima
■ scales to higher dimensions (kernel K-means)

GMM:
■ soft mode searching: estimates the data distribution with multiple Gaussian modes
■ estimates both mean μi and (co)variance Σi for each mode
■ more expensive (EM algorithm, Ref: Szeliski Sec. 5.3.1)
■ sensitive to local minima
■ does not scale to higher dimensions

Hard clustering may not work well when clusters overlap (generally not a problem in segmentation, since objects do not “overlap” in RGBXY). Even an optimal GMM is hard to find with the standard EM algorithm due to local minima.
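A small sketch of the difference, assuming scikit-learn's GaussianMixture; the random features and the component count are placeholders.

```python
# Soft GMM responsibilities vs. a collapsed hard labelling.
import numpy as np
from sklearn.mixture import GaussianMixture

feats = np.random.rand(1000, 5)           # stand-in per-pixel RGBXY features

gmm = GaussianMixture(n_components=4, covariance_type='full').fit(feats)
resp = gmm.predict_proba(feats)           # soft assignments; rows sum to 1
hard = resp.argmax(axis=1)                # collapse to a hard labelling

# Each mode carries a full covariance, unlike plain K-means (means only):
print(gmm.means_.shape, gmm.covariances_.shape)   # (4, 5) (4, 5, 5)
```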
GMM and FG/BG Segmentation in Videos
Spatial Structures
Segmentation Unsupervised
• K Means
• Normalized Cut [Shi and Malik 2000]


Images as graphs

(Figure: graph over pixels, with nodes i, j joined by an edge of weight wij.)

• Node for every pixel
• Edge between every pair of pixels (or every pair of “sufficiently close” pixels)
• Each edge is weighted by the affinity or similarity of the two nodes
Source: S. Seitz
Segmentation by graph partitioning

• Break the graph into segments (e.g. A, B, C)
• Delete links that cross between segments
• Easiest to break links that have low affinity
  • similar pixels should be in the same segment
  • dissimilar pixels should be in different segments
Source: S. Seitz
Graph cut

• Set of edges whose removal makes a graph disconnected
• Cost of a cut: sum of the weights of the cut edges
• A graph cut gives us a segmentation
• What is a “good” graph cut and how do we find one?
Source: S. Seitz
Normalized cut algorithm

1. Represent the image as a weighted graph G = (V, E); summarize the information in the weight matrix W and the degree matrix D, where D(i, i) = Σj W(i, j).
2. Solve (D − W)y = λDy for the eigenvector y with the second smallest eigenvalue.
3. Use the entries of this eigenvector to bipartition the graph.
4. Recursively partition the segmented parts, if necessary.

Shi and Malik, 2000
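A minimal sketch of steps 1-3, assuming SciPy and a dense colour-only affinity for clarity (a real implementation would use sparse affinities over nearby pixels only); the tiny random image and the bandwidth sigma are placeholders.

```python
# The normalized-cut relaxation: solve (D - W) y = lambda * D y.
import numpy as np
from scipy.linalg import eigh

img = np.random.rand(10, 10)                     # stand-in grey-level image
vals = img.ravel()

# Dense colour-only affinity: W(i, j) = exp(-(I_i - I_j)^2 / sigma^2).
sigma = 0.2
W = np.exp(-((vals[:, None] - vals[None, :]) ** 2) / sigma ** 2)
np.fill_diagonal(W, 0)

D = np.diag(W.sum(axis=1))                       # D(i, i) = sum_j W(i, j)

# Generalized eigenproblem; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = eigh(D - W, D)
y = eigvecs[:, 1]                                # second-smallest eigenvector

# Bipartition by thresholding the eigenvector entries.
partition = (y >= np.median(y)).reshape(img.shape)
```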


Success of N-Cut
• Formulation of segmentation as graph partitioning
• Pixels are not classified in “isolation”
• A nice and meaningful objective function
• A solution based on eigenvectors
• A generic framework that can work with many features
• Cons: (i) high storage requirement, (ii) bias towards equal-sized segments, (iii) approximates a discrete labeling problem (needs to threshold the eigenvectors)
Graph Formulation
• Node yi: pixel label
• Edge: constrained pair

Problem of Labeling: Input Image → GraphCut → Output Labelled Image

Cut = separating source and sink = segmentation
Adding regional properties
Another segmentation example [Boykov & Jolly ’01]

Appearance models q0 and q1 can also be obtained from user seeds.
Graph cuts as energy optimization for segmentation S [Boykov & Jolly 2001]

(Figure: grid graph with source s and sink t; t-links carry the unary costs $D_p(s)$ and $D_p(t)$, n-links between neighbouring pixels carry weights $w_{pq}$; a cut through this graph is a segmentation.)

Segmentation ⇔ cut, with $S_p \in \{0, 1\}$:

$E(S) = \text{cost(cut)} = \sum_{p \in S} D_p(1) + \sum_{p \in \bar{S}} D_p(0) + \sum_{pq \in N} w_{pq} \cdot [S_p \neq S_q]$

The first two sums are the cost of the severed t-links: unary terms encoding the regional properties of S. The last sum is the cost of the severed n-links: pairwise terms encoding boundary smoothness for S.
Segmentation as Graph Cut
Boykov (2001)
• Cut: separating source and sink
• Energy: the sum of weights over a collection of edges
• Min cut: globally minimal energy in polynomial time, under some regularity conditions

Segmentation is formulated as “labeling” each pixel as either foreground (0) or background (1).
Graph Cut: By Energy Minimization
• Node yi: pixel label
• Edge: constrained pair
• Unary: cost to assign a label to each pixel
• Pairwise: cost to assign a pair of labels to each pair of connected pixels
Example

Associative potentials: pay a cost when neighbouring pixels take different labels.
• Unary: cost to assign a label to each pixel
• Pairwise: cost to assign a pair of labels to each pair of connected pixels
How to obtain unary and pairwise?
• Unary:
  • Probability of the pixel being FG or BG
  • Can be “defined”; can be “learnt”
  • Also called the “data” term
• Pairwise:
  • Encodes a joint probability
  • Enforces some amount of “smoothness” or spatial coherence
  • Can be defined or learnt; many possible “valid” functions
A minimal sketch of both terms follows this list.
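The following sketch wires such unary and pairwise terms into a min-cut, assuming the PyMaxflow package is installed; the toy image, colour models, and constant Potts weight are placeholders.

```python
# A minimal sketch of binary FG/BG segmentation as a min-cut (PyMaxflow).
import numpy as np
import maxflow

img = np.random.rand(60, 80)              # stand-in grey-level image in [0, 1]

# Unary ("data") terms, e.g. negative log-likelihoods under FG/BG models;
# here simple squared distances to two hypothetical mean intensities.
D_fg = (img - 0.8) ** 2                   # cost of labelling a pixel FG
D_bg = (img - 0.2) ** 2                   # cost of labelling a pixel BG

g = maxflow.Graph[float]()
nodeids = g.add_grid_nodes(img.shape)

# Pairwise terms: a constant Potts penalty on the 4-connected grid
# (contrast-sensitive weights w_pq are the usual refinement).
g.add_grid_edges(nodeids, 0.1)

# t-links: source capacity D_bg, sink capacity D_fg, so that keeping a
# pixel on the source (FG) side pays D_fg via the severed sink link.
g.add_grid_tedges(nodeids, D_bg, D_fg)

g.maxflow()
sink_side = g.get_grid_segments(nodeids)  # True for sink-side pixels
fg_mask = ~sink_side                      # source side = foreground here
```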
Graph Cuts – Pros & Cons
Pros:
• Very fast inference; sub-second implementations
• Can incorporate data likelihoods and priors
• Applicable to a wide range of problems (stereo, labelling, recognition)
• Has nice connections to Markov Random Fields (MRFs) and Bayesian inference
• Many nice theoretical results (like global optima)

Cons:
• Not always globally optimal (the optimality theorem needs appropriate potentials)
• Approximate for multi-label problems (example: α-expansion)
Learning from Humans
Grab Cut [Rother et al. 2004]

User Interaction Provides the Supervision


Grab Cut
1. Define the graph: usually 4-connected or 8-connected
2. Define unary potentials
   • Colour histogram or mixture of Gaussians for background and foreground
3. Define pairwise potentials
   • Based on agreement of neighbouring pixels
4. Apply graph cuts
5. Return to 2, using the current labels to compute the foreground and background models

Details: unary/pairwise. (A sketch using OpenCV follows.)
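OpenCV ships this iterated loop as cv2.grabCut; a minimal sketch, where the image path, rectangle, and iteration count are placeholders.

```python
# GrabCut via OpenCV's built-in implementation.
import numpy as np
import cv2

img = cv2.imread('photo.jpg')                   # hypothetical input image
mask = np.zeros(img.shape[:2], np.uint8)
rect = (50, 50, 300, 200)                       # user box: x, y, w, h

# Internal GMM state for background/foreground (fixed-size buffers).
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# 5 iterations of: fit GMMs to current labels, then graph cut (steps 2-5).
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Definite/probable foreground -> 1, everything else -> 0.
fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
result = img * fg[:, :, None].astype(img.dtype)
```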
Grab Cut using Iterated Graph Cuts

User initialisation, then iterations 1, 2, 3, 4; the energy decreases after each iteration. Guaranteed to converge?

Alternate two steps:
• Learn the foreground colour model
• Run graph cuts to infer the foreground
Learning Potentials with GMM

Iterated graph cut, with the colour models represented as Gaussian Mixture Models (typically 5-8 components).
Examples

… GrabCut completes automatically


Difficult Examples
From initial rectangle to initial result, failure modes include:
• Camouflage & low contrast
• Fine structure
• No telepathy (the system cannot guess the user’s intent)
GrabCut – Interactive Foreground Extraction


Segmentation of
“Categories”
Pixels to Superpixels
Graph based Models for Semantic Segmentation

Pipeline: Input image → Graph construction → Training of potentials (Learning) → MAP (Inference) → Final segmentation
Semantic Segmentation: Summary
• Supervision: none, human interaction, annotated examples of pixels and superpixels, weaker supervision from multiple images, use of priors
• Problems: “labeling” and “classification”
• Tools: graph cuts, MRFs, SVMs, random forests, etc.
Era of Deep Learning
Convolutional Neural Network (CNN)

A sequence of local & shift-invariant layers
Overview of CNNs
Data = 3D tensors

There is a vector of feature channels (e.g. RGB) at each spatial location (pixel), so an image with H × W pixels and C channels is an H × W × C tensor. Example: a convolution layer maps one such 3D tensor to another.
Linear / non-linear chains

The basic blueprint of most architectures: input data x passes through an alternating chain of filtering (with filter banks F), ReLU non-linearities, and downsampling, producing output data y.
DL and Image Classification

Input: Image
Output: P(c) (A vector/distribution of probabilities)
Semantic Segmentation Idea: Fully Convolutional

Convolutions, then pixel-wise classification: design the network as a stack of convolutional layers to make predictions for all pixels at once.

Input: 3 × H × W → convolutions: D × H × W → scores: C × H × W → argmax → predictions: H × W

Fei-Fei Li, Justin Johnson & Serena Yeung, Lecture 11, May 10, 2017
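A minimal sketch of this idea in PyTorch; the layer widths and the class count C are placeholders (real networks are much deeper and usually downsample internally for efficiency).

```python
# A tiny fully convolutional network: per-pixel class scores at full resolution.
import torch
import torch.nn as nn

C = 21                                  # e.g. 20 PASCAL VOC classes + background

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, C, kernel_size=1),    # 1x1 conv: D x H x W -> C x H x W
)

x = torch.randn(1, 3, 128, 128)         # 3 x H x W input
scores = net(x)                         # C x H x W class scores
preds = scores.argmax(dim=1)            # H x W label map
```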
In general: a convolutional encoder (conv1-conv6, interleaved with max pooling) feeds a convolutional decoder (deconv6-deconv1, interleaved with unpooling) that produces the prediction.

Figure 2: Architecture of the proposed fully convolutional encoder-decoder network.

From the accompanying paper: the first layer of the decoder, “deconv6”, performs dimension reduction, projecting the 4096-d “conv6” to 512-d with a 1×1 kernel so that the pooling switches from “conv5” can be re-used to upscale the feature maps by a factor of two in the following “deconv5” layer.
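A small sketch of the pooling-switch mechanism the decoder relies on, in PyTorch; the tensor shapes are placeholders.

```python
# Max pooling records argmax indices ("switches"); unpooling reuses them
# to place values back at their original spatial positions.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)          # a stand-in conv feature map

y, switches = pool(x)                   # encoder: downsample, keep argmaxes
up = unpool(y, switches)                # decoder: upscale by 2 using them

print(y.shape, up.shape)                # (1, 64, 16, 16) (1, 64, 32, 32)
```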
More (wait for GANs)
Summary/History/Relationships
• Initial methods
  • Simple unsupervised learning / clustering / partitioning
• Introduction of spatial relationships and formalisms
  • Graphs, graph cuts and energy minimization frameworks
• Learning from “one” image
  • Learn the unary potentials from human interaction; iterate
• Learning from many labelled examples
  • Popular semantic/instance segmentation
  • Finer understanding of the visual content
• Input: Image. Output: Image
  • Many low-level vision tasks in the same framework
  • Low-level vision, segmentation, generation, ??
More on Newer Methods/Trends

Acknowledge the extensive use of slides/images/content available online.
