This thesis was proposed in April 2003. The dissertation itself will be published in May 2005. While the dissertation subsumes and modifies much of the material in this proposal, I have made the proposal available (as URCS TR 862) as a historical supplement documenting the details and results of the MASSES (Material and Spatial Experimental Scenes) prototype.
Probabilistic Modeling for Semantic
Scene Classification
by
Matthew R. Boutell
Thesis Proposal
Doctor of Philosophy
Supervised by
Christopher M. Brown
University of Rochester
Rochester, New York
2005
Abstract
Scene classification, the automatic categorization of images into semantic classes
such as beach, field, or party, is useful in applications such as content-based im-
age organization and context-sensitive digital enhancement. Most current scene-
classification systems use low-level features and pattern recognition techniques;
they achieve some success on limited domains.
However, our simple Bayes net is not expressive enough to model the faulty
detection at the level of individual regions. As future work, we propose first to
evaluate full (DAG) Bayesian networks and Markov Random Fields as potential
probabilistic frameworks. We then plan to extend the chosen framework for our
problem. Finally, we will compare our results on real and simulated sets of images
with those obtained by other systems using spatial features represented implicitly.
Table of Contents

Abstract
1 Introduction
1.1 Motivation
1.4.1 Philosophy
2 Related Work
2.1.1 Features
3 Methodology
4 Experimental Results
5 Proposed Research
6 Acknowledgments
Bibliography
List of Tables

4.3 MASSES with faulty material detection: Accuracy with and without spatial information.
List of Figures

3.1 Ground-truth labeling of a beach scene. Sky, water, and sand regions are clearly shown.
3.2 Prototypical beach scenes. (a) A simple beach scene without background objects. (b) Because we make no attempt to detect it, we consider the sailboat to be “background”. (c) A more complicated scene: a developed beachfront. (d) A scene from a more distant field-of-view. (e) A crowded beach.
3.3 Prototypical urban scenes. (a) The most common urban scene, containing sky, buildings, and roads. (b),(c) The sky is not simply above the buildings in these images. (d) Roads are not necessary. (e) Perspective views induce varied spatial relationships. (f) Close views can preclude the presence of sky.
3.5 The MASSES environment. Statistics from labeled scenes are used to bootstrap the generative model, which can then produce new virtual scenes for training or testing the inference module.
3.7 Sampling the scene type yields class C. Then we sample to find the materials present in the image, in this case, M1, M3, and M4. Finally, we sample to find the relationships between each pair of these material regions.
5.1 Classifier input (labeled regions) and output (classification and confidence).
5.3 Proposed DAG Bayesian network. Note that separate material and region layers are needed.
1 Introduction
1.1 Motivation
With digital libraries growing so quickly in size, accurate and efficient techniques for content-based image retrieval (CBIR) become more and more important. Many current systems allow a user to specify an image and then search for images “similar” to it, where similarity is often defined only by color or texture properties. Because a score must be computed for each image in the potentially large database, this approach is somewhat inefficient (though individual calculations vary in complexity).
Such low-level similarity measures often give inadequate results [68]. Sometimes the match between the retrieved and the query images is hard to understand, while other times, the match is understandable but contains no semantic value. For instance, with simple color features, a query for a rose can return a picture of a man wearing a red shirt, especially if the background colors are similar as well.
Knowledge of the semantic category of a scene helps narrow the search space
dramatically [37]. If the categories of the query image and the database images
have been assigned (either manually or by an algorithm), they can be exploited
to improve both efficiency and accuracy. For example, knowing what constitutes
a party scene allows us to consider only potential party scenes in our search and
thus helps to answer the query “find photos of Mary’s birthday party”. This way,
search time is reduced, the hit rate is higher, and the false alarm rate is expected
to be lower. Visual examples can be found in [76].
Knowledge about the scene category can also find application in digital enhancement [73]. Digital photofinishing processes involve three steps: digitizing
the image if necessary (if the original source was film), applying enhancement
algorithms, and outputting the image in either hardcopy or electronic form. En-
hancement consists primarily of color balancing, exposure enhancement, and noise
reduction. Currently, enhancement is generic (i.e. without knowledge of the scene
content). Unfortunately, while a balancing algorithm might enhance the quality
of some classes of pictures, it degrades others.
Other images that are negatively affected by color balancing are those con-
taining skin-type colors. Correctly balanced skin colors are important to human
Figure 1.1: Content-ignorant color balancing can destroy the brilliance of sunset images, such as those pictured, which have the same global color distribution as indoor, incandescent-illuminated images.
Rather than applying generic color balancing and exposure adjustment to all
images, knowledge of the scene’s semantic classification allows us to customize
them to the scene. Following the example above, we could retain or boost sun-
set scenes’ brilliant colors while reducing a tungsten-illuminated indoor scene’s
yellowish cast.
On one hand, isn’t scene classification preceded by image understanding, the “holy
grail” of vision? What makes us think we can achieve results? On the other hand,
isn’t scene classification just an extension of object recognition, for which many
techniques have been proposed with varying success? How is scene classification
different from these two related fields?
Scene classification is more tractable than the general image understanding problem, and can be used to ease other image understanding tasks [75]. For example, knowing that a scene is of a beach constrains where in the scene one should look for people.
Again, the areas of scene classification and object recognition are related;
knowing the identity of some of the scene’s objects will certainly help to classify the
scene, while knowing the scene type affects the expected likelihood and location
of the objects it contains.
Most of the current systems primarily use low-level features to classify scenes
and achieve some success on constrained problems. These systems tend to be
exemplar-based, in which features are extracted from images, and pattern recog-
nition techniques are used to learn the statistics of a training set and to classify
novel test images. Very few systems are model-based, in which the expected
configuration of the objects in the scenes is specified by a human expert.
The limited success of scene classification systems using low-level features forces us
to look for other solutions. Currently, good semantic material detectors and object
recognizers are available [70; 38; 63] and have begun to be successfully applied to
scene classification [37]. However, the presence or absence of certain objects is
not always enough to determine a scene type. Furthermore, object detection is
still developing and is far from perfect. Faulty detection causes brittle rule-based
systems to break.
Our central claim is that spatial modeling of semantic objects and materials
can be used to disambiguate certain scene types as well as mitigate the effects of
faulty detectors. Furthermore, an appropriate probabilistic inference mechanism
must be developed to handle the loose spatial structure found in real images.
Current research into spatial modeling relies on (fuzzy) logic and subgraph
matching [44; 83]. While we have found no research that incorporates spatial
modeling in a probabilistic framework, we argue that a probabilistic approach
would be more appropriate. First, logic (even fuzzy variants) is not equipped to
handle exceptions efficiently [50], a concern we address in more detail in Section
2.4. Second, semantic material detectors often yield belief in the materials. While
it is not obvious how to use belief values, it seems desirable to exploit the uncer-
tainty in calculating the overall belief in each scene type. A nice side effect of true
belief-based calculation is the ease with which a “don't know” option can be added
to the classifier: simply threshold the final belief value.
1) Baseline (no spatial relationships). Use naive Bayes classification rules using
the presence or absence of materials only.
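A minimal sketch of such a baseline is shown below. The material list, class list, and Laplace smoothing are illustrative assumptions, not the configuration of the actual system; the point is only that classification reduces to combining per-class priors with per-material presence probabilities.

import numpy as np

# Hypothetical materials and classes, for illustration only.
MATERIALS = ["sky", "water", "sand", "building", "road"]
CLASSES = ["beach", "urban", "field"]

def train(presence, labels):
    """presence: (n_images, n_materials) 0/1 array; labels: np.ndarray of class indices."""
    n_classes, n_materials = len(CLASSES), len(MATERIALS)
    prior = np.zeros(n_classes)
    p_present = np.zeros((n_classes, n_materials))
    for c in range(n_classes):
        rows = presence[labels == c]
        prior[c] = len(rows) / len(presence)
        # Laplace smoothing avoids zero probabilities for unseen materials.
        p_present[c] = (rows.sum(axis=0) + 1) / (len(rows) + 2)
    return prior, p_present

def classify(x, prior, p_present):
    """x: 0/1 vector of detected materials; returns (class name, posterior belief)."""
    log_post = np.log(prior) + (
        x * np.log(p_present) + (1 - x) * np.log(1 - p_present)
    ).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    best = int(post.argmax())
    return CLASSES[best], post[best]

Thresholding the returned posterior gives the “don't know” option mentioned above.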
1.4.1 Philosophy
The success of our approach seems to hinge on the strength of the underlying
detectors. Consider two scenarios. First, if the detectors are reasonably accurate,
then we can expect to overcome some faults using spatial relationships. However,
if they are extremely weak, we would be left with a very ambitious goal: from a
very pessimistic view of the world (loose spatial structure and weak detectors),
pull order from the chaos and accurately apply a discrete label to a configuration
of objects.
In this latter case, prior research suggests that the task is not promising. For instance, Selinger found that if an object could be recognized with moderate success using a single camera view, additional views could improve recognition substantially. However, if the single view gave weak detection, then multiple views could not compensate. She states [62] (p. 106):
Therefore, while we cannot expect to use our technique to classify scenes for which the detectors are completely inaccurate, we stand a reasonable chance of improving accuracy if the detectors are reasonably strong themselves.
However, when simulating faulty detectors, we found that the network is not
expressive enough to capture the necessary information, actually leading to lower
accuracy when spatial relationships were used.
2 Related Work
Scene classification is a young, emerging field. The first section of this chapter
is taken in large part from our earlier review of the state of the art in scene
classification [6]; because this thesis is a work in progress, there is much overlap
between the two. Here we focus our attention on systems using approaches directly
related to our proposed thesis. Readers desiring a more comprehensive survey or
more detail are referred to the original review.
All systems classifying scenes must extract appropriate features and use some
sort of learning or inference engine to classify the image. We start by outlining
the options available for features and classifiers. We then present a number of
systems that we have deemed to be good representatives of the field.
2.1.1 Features
The classifiers used in these types of systems differ in how they extract information from the training data. In Table 2.2, we present a summary of the major classifiers used in the realm of scene classification.
As stated, many of the systems proposed in the literature for scene classification
are exemplar-based, but a few are model-based, relying on expert knowledge to
model scene types, usually in terms of the expected configuration of objects in
the scene. In this section, we describe briefly some of these systems and point
out some of their limitations en route to differentiating our proposed method. We
organize the systems by feature type and by their use of spatial information, as
shown in Table 2.3. Features are grouped into low-level, mid-level, and high-level
(semantic) features, while spatial information is grouped into those that model
the spatial relationships explicitly in the inference stage and those that do not.
1-Nearest-Neighbor (1NN): Classifies a test sample with the same class as the exemplar closest to it in the feature space.

K-Nearest-Neighbor (kNN) [18]: Generalization of 1NN in which the sample is given the label of the majority of the k closest exemplars.

Learning Vector Quantization (LVQ) [31; 32]: A representative set of exemplars, called a codebook, is extracted. The codebook size and learning rate must be chosen in advance.

Maximum a Posteriori (MAP) [77]: Combines the class likelihoods (which must be modeled, e.g., with a mixture of Gaussians) with class priors using Bayes' rule.

Support Vector Machine (SVM) [8; 61]: Finds an optimal hyperplane separating two classes. Maps the data into higher dimensions, using a kernel function, to increase separability. The kernel and associated parameters must be chosen in advance.

Artificial Neural Networks (ANN) [1]: Function approximators in which the inputs are mapped, through a series of linear combinations and non-linear activation functions, to outputs. The weights are learned using a technique called backpropagation.
Table 2.3: Related work in scene classification, organized by feature type and use of spatial information.

Low-level features, implicit or no spatial information: Vailaya et al.; Oliva et al.; Szummer & Picard; Serrano et al.; Paek & Chang; Carson et al.; Wang et al.
Low-level features, explicit spatial information: Lipson et al.; Ratan & Grimson; Smith & Li
Mid-level features, implicit or no spatial information: Oliva et al.
High-level (semantic) features, implicit or no spatial information: Luo et al.; Song & Zhang
High-level (semantic) features, explicit spatial information: Mulhem et al.; Proposed Method
The indoor vs. outdoor classifiers’ accuracy approaches 90% on tough (e.g.,
consumer) image sets. On the outdoor scene classification problem, mid-90%
accuracy is reported. This may be due to the use of constrained data sets (e.g.
from the Corel stock photo library), because on less constrained (e.g., consumer)
image sets, we found the results to be lower. The generalizability of the technique
is also called into question by the discrepancies in the numbers reported for image
orientation detection by some of the same researchers [79; 80].
¹ While image orientation detection is a different level of semantic classification, many of the techniques used are similar.
Pseudo-Object Features
The Blobworld system was developed at Berkeley primarily for content-based indexing and retrieval; however, it was also applied to the scene classification problem in [9]. The researchers segment the image and use statistics computed for each re-
gion (e.g., color, texture, location with respect to a 3 × 3 grid) without performing
object recognition. Admittedly, this is a more general approach for scene types
containing no recognizable objects. However, we can hope for more using object
recognition. Finally, a maximum likelihood classifier performs the classification.
The systems above either ignore spatial information or encode it implicitly using a feature vector. However, other bodies of research imply that spatial information is valuable and should be encoded explicitly and used by the
inference engine. In this section, we review this body of research, describing a
number of systems using spatial information to model the expected configuration
of the scene.
Configural Recognition
Lipson, Grimson, and Sinha at MIT use an approach they call “configural recog-
nition” [34; 35], using relative spatial and color relationships between pixels in low
resolution images to match the images with class models.
The specific features extracted are very simple. The image is smoothed and
subsampled at a low resolution (ranging from 8 × 8 to 32 × 32). Each pixel
represents the average color of a block in the original image; no segmentation
is performed. For each pixel, only its luminance, RGB values, and position are
extracted.
The hand-crafted models are also extremely simple. For example, a template
for a snowy mountain image is a blue region over a white region over a dark
region; one for a field image is a large bluish region over a large greener region.
In general, the model contains relative x- and y-coordinates, relative R-, G-, B-,
and luminance values, and relative sizes of regions in the image.
The matching process uses the relative values of the colors in an attempt to
achieve illumination invariance. Furthermore, using relative positions mimics the
performance of a deformable template: as the model is compared to the image,
the model can be deformed by moving the patch around so that it best matches
the image. A model-image match occurs if any one configuration of the model
matches the image. However, this criterion may be extended to include the degree
of deformation and multiple matches depending on how well the model is expected
to match the scene.
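To make the flavor of this matching concrete, here is a deliberately toy sketch. It is our own one-dimensional simplification, not the authors' model or parameters: it collapses each row of a smoothed, subsampled image to a mean color and looks for some vertical ordering satisfying a “bluish over whitish over dark” snowy-mountain template, allowing the rows to shift as a crude stand-in for template deformation.

import numpy as np

def matches_snowy_mountain(img_lowres):
    """img_lowres: (H, W, 3) RGB array, already smoothed and subsampled
    (e.g., 8x8 to 32x32). Thresholds below are illustrative only."""
    h = img_lowres.shape[0]
    rows = img_lowres.reshape(h, -1, 3).mean(axis=1)   # mean color of each row
    for top in range(h - 2):
        for mid in range(top + 1, h - 1):
            for bot in range(mid + 1, h):
                r1, r2, r3 = rows[top], rows[mid], rows[bot]
                sky_like = r1[2] > r1[0]                 # more blue than red
                snow_like = r2.min() > 0.6 * 255         # bright, near-white
                dark = r3.mean() < r2.mean() - 30        # darker than the snow row
                if sky_like and snow_like and dark:
                    return True
    return False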
Classification is binary for each classifier. On a test set containing 700 pro-
fessional images (the Corel Fields, Sunsets and Sunrises, Glaciers and Mountains,
Coasts, California Coasts, Waterfalls, and Lakes and Rivers CDs), the authors
report recall using four classifiers: fields (80%), snowy mountains (75%), snowy
mountains with lakes (67%), and waterfalls (33%). Unfortunately, exact precision
numbers cannot be calculated from the results given.
The strength of the system lies in the flexibility of the template, in terms of
both luminance and position. However, one limitation the authors state is that
each class model captured only a narrow band of images within the class and that
multiple models were needed to span a class.
During the query process, the common configurations are broken into smaller
parts and converted to a vector format, in which feature i corresponds to the
probability that sub-configuration i is present in the image.
These features are reported to give better retrieval performance than other measures such as color histograms, wavelet coefficients, and Gabor filter outputs.
Note that the spatial information is explicitly encoded in the features, but is
used directly in the inference process.
Composite Region Templates (CRTs) are configurations of segmented image regions [69]. The configurations are
limited to those occurring in the vertical direction: each vertical column is stored
as a region string and statistics are computed for various sequences occurring in
the strings. While an interesting approach, one unfortunate limitation of their experimental work is that the training and testing sets were both extremely small.
Oliva and Torralba [46; 47] propose what they call a “scene-centered” description
of images. They use an underlying framework of low-level features (multiscale
Gabor filters), coupled with supervised learning to estimate the “spatial envelope”
properties of a scene. They classify images with respect to “verticalness” (vertical
vs. horizontal), “naturalness” (vs. man-made), “openness” (presence of a horizon
line), “roughness” (fractal complexity), “busyness” (sense of clutter in man-made
scenes), “expansion” (perspective in man-made scenes), “ruggedness” (deviation
from the horizon in natural scenes), and “depth range”.
Images are then projected into this 8-dimensional space in which the dimen-
sions correspond to the spatial envelope features. They measure their success first
on individual dimensions through a ranking experiment. They then claim that
their features are highly correlated with the semantic categories of the images
(e.g., “highway” scenes are “open” and exhibit high “expansion”), demonstrating
some success on their set of images. It is unclear how their results generalize.
Luo and Savakis extended the method of [65] by incorporating semantic mate-
rial detection [37]. A Bayesian Network was trained for inference, with evidence
coming from low-level (color, texture) features and semantic (sky, grass) features.
Detected semantic features (which are not completely accurate) produced a gain
of over 2% and “best-case” (100% accurate) semantics gave a gain of almost 8%
over low-level features alone. The network used conditional probabilities of the
form P(sky present | outdoor). While this work showed the advantage of using
semantic material detection for certain types of scene classification, it stopped
short of using spatial relationships.
Song and Zhang investigate the use of semantic features within the context of
image retrieval [70]. Their results are impressive, showing that semantic features
greatly outperform typical low-level features, including color histograms, color
coherence vectors, and wavelet texture for retrieval.
They use the illumination topology of images (using a variant of contour trees)
to identify image regions and combine this with other features to classify the
regions into semantic categories such as sky, water, trees, waves, placid water,
lawn, and snow.
While they do not apply their work directly to scene classification, their success
with semantic features confirms our hypothesis that they help bridge the semantic
gap between pixel-representations and high-level understanding.
Mulhem, Leow, and Lee [44] present a novel variation of fuzzy conceptual graphs
for use in scene classification. Conceptual graphs are used for representing knowl-
edge in logic-based applications, since they can be converted to expressions of
first-order logic. Fuzzy conceptual graphs extend this by adding a method of
handling uncertainty.
Model graphs for prototypical scenes are hand-crafted, and contain crisp concepts and fuzzy relations and attributes. For example, a “mountain-over-lake” scene must contain a mountain and water, but the spatial relations are not guaranteed to hold. A fuzzy relation such as “smaller than” may hold most of the time, but not always.
Image graphs contain fuzzy concepts and crisp relations and attributes. This
is intuitive: while a material detector calculates the boundaries of objects and
can therefore calculate relations (e.g., “to the left of”) between them, it can be
uncertain as to the actual classification of the material (consider the difficulty of
distinguishing between cloudy sky and snow, or of rock and sand). The ability to
handle uncertainty on the part of the material detectors is an advantage of this
framework.
Two subgraphs are matched using graph projection, a mapping such that each
part of a subgraph of the model graph exists in the image graph, and a metric
for linearly combining the strength of match between concepts, relations, and
attributes. A subgraph isomorphism algorithm is used to find the subgraph of the model that best matches the image.
The basic idea of the algorithms is to decompose the model and image into
arches (two concepts connected by a relation), seed a subgraph with the best
matching pair of arches, and incrementally add other model arches that match
well.
They found that the image matching metric worked well on a small database of
two hundred images and four scene models (of mountain/lake scenes) generated by
hand. Fuzzy classification of materials was done using color histograms and Gabor
texture features. The method of generating confidence levels of the classification
is not specified.
Referring back to the summary of prior work in semantic scene classification given
in Table 2.3, we see that our work is closest to that of Mulhem, et al., but differs in
one key aspect: while theirs is logic-based, our proposed method is founded upon
probability theory, leading to principled methods of handling variability in scene
configurations. Our proposed method also learns the model parameters from a
set of training data, while theirs are fixed.
Regier and Carlson [55] propose a computational model of spatial relations based
on human perception. They consider up, down, left, and right, to be symmetric,
and so focus their work on the above relation.
They call the reference object the landmark and the located object the tra-
jector. For example, “the ball (trajector) is above the table (landmark)”. The
model is designed to handle 2D landmarks, but only point trajectors. However,
the researchers state that they are in the process of extending the model.
3. Hybrid Model (PC-BB). This model extends the PC model by adding the
BB model’s height term. The height term gives the presence of a ”grazing
line” at the top of the landmark, an effect that was observed experimentally.
The model has four parameters: the slope, y-intercept, and relative weight
of the PC model plus the gain on the height function’s sigmoid.
4. Attentional Vector Sum (AVS). This model incorporates two human percep-
tual elements:
In the AVS model, the angle between the landmark and the trajector is
calculated as the weighted sum of angles between the points in the landmark
area and the trajector. The weights in the sum are related to attention. The
center of attention on the landmark is the point closest to the trajector; its
angle receives the most weight. As the landmark points get further from
the center of attention, they are weighted less, dropping off exponentially.
Lastly, the BB model’s height function is used again (for which they can
give no physiological or perceptual basis, but only because it was observed
experimentally). The model has four parameters: the width of the beam of
attention, the slope and y-intercept of the linear function relating angle to
match strength (as in the PC model) and the gain on the height function’s
sigmoid.
Optimal parameters for each model were found by fitting each model to another researcher's data set. A series of experiments was then performed to dis-
tinguish between the models. The AVS model fit each experiment’s data most
consistently.
The AVS method also gives a measure of “how above” one region is compared to another. This measure may potentially be used to our advantage.
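To fix ideas, the vector-sum component can be sketched as follows. This is a rough illustration only: it assumes (x, y) coordinates with y pointing up, uses an exponential attention decay with an arbitrary constant, and omits the attention-beam width, the linear angle-to-strength mapping, and the height/sigmoid term of the full four-parameter model.

import numpy as np

def avs_aboveness(landmark_pts, trajector, decay=0.5):
    """Simplified attentional-vector-sum sketch. landmark_pts: (N, 2) array of
    (x, y) points; trajector: (x, y). Returns the deviation (radians) of the
    attention-weighted vector sum from straight 'up'; smaller means 'more above'."""
    pts = np.asarray(landmark_pts, dtype=float)
    traj = np.asarray(trajector, dtype=float)
    d_focus = np.linalg.norm(pts - traj, axis=1)
    focus = pts[d_focus.argmin()]                    # center of attention
    # Attention decays exponentially with distance from the focus point.
    weights = np.exp(-decay * np.linalg.norm(pts - focus, axis=1))
    vecs = traj - pts
    norms = np.linalg.norm(vecs, axis=1)
    norms[norms == 0] = 1.0
    unit = vecs / norms[:, None]
    summed = (weights[:, None] * unit).sum(axis=0)
    angle = np.arctan2(summed[0], summed[1])         # 0 when the sum points up
    return abs(angle)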
Weighted Walkthroughs

Weighted walkthroughs consider pairs of pixels (one drawn from each of the regions). For some pairs, A will lie northeast (NE) of B; for others, A will lie SE, NW, or SW. The fraction of pairs falling in each of the four quadrants is computed, giving four “weights”: w_NE, w_NW, w_SE, and w_SW.

One advantage of this method is its ability to handle 2D, occluded (i.e., disconnected) landmarks and trajectors.
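The four weights admit a direct, if brute-force, computation. The sketch below is our own illustration; pairs lying exactly on either axis are simply dropped, so the weights need not sum to one, and efficient implementations avoid the quadratic pair enumeration.

import numpy as np

def weighted_walkthroughs(region_a, region_b):
    """Fraction of pixel pairs (a, b), a in region A and b in region B, for which
    a lies in each quadrant relative to b. Regions are arrays of (row, col)
    image coordinates (row grows downward)."""
    a = np.asarray(region_a, dtype=float)
    b = np.asarray(region_b, dtype=float)
    drow = a[:, None, 0] - b[None, :, 0]   # > 0 means a is below b (south)
    dcol = a[:, None, 1] - b[None, :, 1]   # > 0 means a is east of b
    n = drow.size
    return {
        "NE": np.sum((drow < 0) & (dcol > 0)) / n,
        "NW": np.sum((drow < 0) & (dcol < 0)) / n,
        "SE": np.sum((drow > 0) & (dcol > 0)) / n,
        "SW": np.sum((drow > 0) & (dcol < 0)) / n,
    }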
Hybrid Approach
In their research, Luo and Zhu [40] use a hybrid approach, combining the bounding-box and weighted-walkthrough methods. The method was designed for modeling spatial relations between materials in natural scenes, and so favors above/below calculations. It skips weighted walkthroughs when object bounding boxes do not overlap. It does not handle some obscure cases correctly, but is fast and correct in the cases that arise in practice.
The final decision of which spatial relationship model is most appropriate for my work depends in large part on whether the AVS method can be extended to 2D trajectors while being made computationally tractable. One answer may be to work at a higher conceptual level than individual pixels.
While computationally more expensive and possibly too sensitive, more detailed
spatial information may be necessary to distinguish some scene types. Rather
than just encoding the direction (such as “above”) in our knowledge framework,
we could incorporate a direction and a distance. For example, Rimey encoded
spatial relationships using an expected area net [56].
Early research involving spatial relationships between objects used logic [2]. This
approach was not particularly successful on natural scenes: while logic is certain,
life is uncertain. In an attempt to overcome this limitation, more recent work has
extended the framework to use fuzzy logic [44].
However, Pearl [50] argues that logic cannot be extended to the uncertainty of life, where many rules have exceptions. The rules of Boolean logic contain no method of combining exceptions. Furthermore, logical inference proceeds in stages, allowing for efficient computation. We would like to handle uncertain evidence incrementally as well, but unless one makes strict independence assumptions, this is impossible with logic, and computing the effect of all evidence in one global step is intractable.
assumptions are too strong, except under the strongest of independence assump-
tions. They cause the following problems semantically.
Some attempts have been made to overcome this last limitation, such as bounds
propagation or user-defined combination; however, each approach introduces fur-
ther difficulties.
We are fully aware that there is not universal agreement with Pearl philo-
sophically regarding the superiority of probability over logic. (Witness the heated
rebuttals to Cheeseman’s argument for probability [14] by the logic community!)
Still, we think his arguments are sound.
Two probabilistic frameworks are natural candidates: Bayesian networks, described next, and Markov random fields, a framework that has been used primarily for low-level vision problems (finding boundaries, growing regions) but has recently been used for object detection.
Bayesian (or belief ) networks are used to model causal probabilistic relationships
[13] between a system of random variables. The causal relationships are repre-
sented by a directed acyclic graph (DAG) in which each link connects a cause (the
“parent” node) to an effect (the “child” node). The strength of the link between
the two is represented as the conditional probability of the child given the parent.
The directed nature of the graph allows conditional independence to be specified;
in particular, a node is conditionally independent of all of its non-successors, given
its parent(s).
• Prior probabilities are the initial beliefs about the root node(s) in the net-
work when no evidence is presented.
• Each node has a conditional probability matrix (CPM) associated with it,
representing the causality between the node and its parents. These can be
assigned by an expert or learned from data.
• Posteriors are the output of the network. Their value is calculated from
the product of priors and likelihoods arising from the evidence (as in Bayes’
Rule).
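As a minimal concrete instance of how these pieces combine (generic notation, not the specific network developed later), consider a root class node C with a single observed child whose value is e. The posterior follows Bayes' rule:

P(C \mid e) = \frac{P(e \mid C)\, P(C)}{\sum_{C'} P(e \mid C')\, P(C')},

where P(C) is the prior on the root and P(e | C) is read from the conditional probability matrix on the link.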
Trees
If the graph is tree-structured, with each node having exactly one parent node,
each node’s exact posterior belief can be calculated quickly and in a distributed
fashion using a simple message-passing scheme. Feedback is avoided by separating
causal and diagnostic (evidential) support for each variable using top-down and
bottom-up propagation of messages, respectively.
Causal Polytrees
The message-passing schemes for trees generalize to polytrees, and exact pos-
terior beliefs can be calculated. One drawback is that each variable is conditioned
on the combination of its parents’ values. Estimating the values in the condi-
tional probability matrix may be difficult because its size is exponential in the
number of parent nodes. Large numbers of parents for a node can induce con-
siderable computational complexity, since the message involves a summation over
each combination of parent values.
Models for multicausal interactions, such as the noisy-OR gate, have been
developed to solve this problem. They are modeled after human reasoning and
reduce the complexity of the messages from a node to O(p), linear in the number
of its parents. The messages in the noisy-OR gate model can be computed in
closed form (see [50]).
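For reference, the standard noisy-OR parameterization (stated in generic notation; see [50] for the corresponding message computations) gives each active cause i an independent inhibition probability q_i:

P(Y = 0 \mid x_1, \ldots, x_p) = \prod_{i :\, x_i = 1} q_i, \qquad P(Y = 1 \mid x_1, \ldots, x_p) = 1 - \prod_{i :\, x_i = 1} q_i,

so the conditional distribution requires only p parameters rather than the 2^p entries of a full conditional probability matrix.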
A nice summary of the inference processes for trees and polytrees given in [50]
can be found in [66].
The most general case is a DAG that contains undirected loops. While a DAG
cannot contain a directed cycle, its underlying undirected graph may contain a
cycle, as shown in Figure 2.1.
Loops cause problems for Bayesian networks, both architectural and semantic.
First, the message passing algorithm fails, since messages may cycle around the
loop. Second, the posterior probabilities may not be correct, since the conditional
independence assumption is violated. In Figure 2.1, variables B and C may be correlated through their common parent, so they cannot safely be treated as independent.
There exist a number of methods for coping with loops [50]. Two methods,
clustering and conditioning, are tractable only for sparse graphs. Another method,
stochastic simulation, involves sampling the Bayesian network. We use a simple
top-down version, called logic sampling, as a generative model and describe it in
Section 3.2.2. However, it is inefficient in the face of instantiated evidence, since
it involves rejecting each sample that does not agree with the evidence.
In computer vision, Bayesian networks have been used in many applications in-
cluding indoor vs. outdoor image classification [37; 48], main subject detection
[66], and control of selective perception [57]. An advantage of Bayesian networks
is that they are able to fuse different types of sensory data (e.g. low-level and
semantic features) in a well-founded manner.
We now discuss the basic concepts of MRFs, drawing from our in-house review
[7] of the typical treatments in the literature [30; 16; 22; 15].
The probabilities in Equation 2.4.2 are called local characteristics and intu-
itively describe a locality condition, namely that the value of any variable
in the field depends only on its neighbors.
Positivity Condition The positivity condition states that for every configura-
tion ω ∈ Ω, P (x = ω) > 0.
Markov Random Field Any random field satisfying both the Markov property
and the positivity condition. Also called a Markov Network.
“Two-Layer” describes the network topology of the MRF. The top layer
represents the input, or evidence, while the bottom layer represents the
relationships between neighboring nodes (Figure 2.2).
In typical computer vision problems, inter-level links between the top and bottom layers enforce compatibility between image evidence and the underlying scene. Intra-level links in the scene layer of the MRF leverage a priori knowledge about relationships between parts of the underlying scene to enforce consistency between neighboring nodes in the underlying scene [16].
Figure 2.2: Portion of a typical two-layer MRF. In low-level computer vision problems, the top layer (black) represents the external evidence of the observed image while the bottom layer (white) expresses the a priori knowledge about relationships between parts of the underlying scene.
In a pairwise MRF, the joint distribution over the MRF is captured by a set
of compatibility functions that describe the statistical relationships between
pairs of random variables in the MRF. For inferential purposes, this means
that the graphical model representing the MRF has no cliques larger than
size two.
Inference
The HCF algorithm is used for MAP estimation, finding local maxima of
the posterior distribution. It is a deterministic procedure founded on the
principle of least commitment. Scene nodes connected to image nodes with
the strongest external evidence (i.e. a hypothesis with a large ratio of the
maximum-likelihood hypothesis to the others) are “committed” first, since
they are unlikely to change (based on compatibility with neighbors). Nodes
with weak evidence commit later and are based primarily on their compat-
ibility with their “committed” neighbors.
b_i(x_i) = k\, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)

m_{ij}(x_j) = \sum_{x_i} \phi(x_i, y_i)\, \psi(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)
In the rare case that the graph contains no loops, it can be shown that the marginals are exact. However, some experimental work suggests that, at least for certain problems, the approximations are good even in typical "loopy" networks, where evidence may be "double-counted" [81].
One can calculate the MAP estimate at each node by replacing the summa-
tions in the messages with max.
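The update equations above translate almost directly into code. The sketch below is a generic synchronous sum-product implementation with our own data structures (dictionaries of evidence vectors and pairwise compatibility matrices); it is not tied to any particular vision system and omits practical details such as damping and convergence checks.

import numpy as np

def loopy_bp(phi, psi, edges, n_iters=20):
    """phi: dict node -> local evidence vector phi_i(x_i) (observed y_i folded in).
    psi: dict (i, j) -> compatibility matrix indexed [x_i, x_j].
    edges: list of undirected edges (i, j); messages are passed both ways."""
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.ones(len(phi[j])) / len(phi[j])
        msgs[(j, i)] = np.ones(len(phi[i])) / len(phi[i])
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    for _ in range(n_iters):
        new = {}
        for (i, j) in msgs:
            # Product of local evidence and all incoming messages except from j.
            prod = phi[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            pairwise = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            m = pairwise.T @ prod            # sum over x_i
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in phi:
        b = phi[i].copy()
        for k in nbrs.get(i, []):
            b *= msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

Replacing the summation in the message update with a maximization turns this into the max-product (MAP) variant mentioned above.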
GBP has been found to perform much better than BP on graphs with short
loops. The drawback is that the complexity is exponential in the cluster
size, but again, if the graph has short loops (and thus necessitates only
small clusters), the increased complexity can be minimal.
An advantage of GBP is that it can be used to vary the cluster size in order
to make a trade-off between accuracy and complexity.
Their model uses a Markov random field to represent general spatial relationships between the parts. However, rather than using the typical square lattice, they use a tree.
Inference in the MRF is both exact and efficient, due to the tree structure.
Their MAP estimation algorithm is based on dynamic programming and is very
similar in flavor to the Viterbi algorithm for Hidden Markov Models. In fact, the
brief literature in the field on using Hidden Markov Models for object and people
detection [20] might be better cast in an MRF framework.
Our treatment is taken in large part from [7; 45; 41; 23].
Monte Carlo methods are used for sampling. The goal is to characterize a
distribution using a set of well-chosen samples. These can be used for approximate
MAP estimation, computing expectations, etc., and are especially helpful when
the expectations cannot be calculated analytically.
How the representative samples are drawn and weighted depends on the Monte
Carlo method used. One must keep in mind that the number of iterations of the
various algorithms that are needed to obtain independent samples may be large.
The drawback is that our assumption that p(x) is “easy” is often not valid;
p(x) is often not easy to sample from, and so we need a search mechanism
to draw good samples. Furthermore, we must be careful that this search
mechanism does not bias the results.
In Monte Carlo Markov Chain (MCMC) methods, the samples are drawn
from the end of a random walk.
The chain can be specified using the initial probability, p_0(x) = P(X^0 = x), and the transition probabilities P(X^{t+1} | X^t). The transition probability of moving from state x to state y at time t is denoted T_t(x, y), which can be summarized in a transition matrix, T_t.
We take an initial distribution across the state space (which, in the case of
MRFs, is the set of possible configurations of the individual variables).
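One step of the walk then updates this distribution through the transition matrix (a standard identity, stated here only to fix notation):

p_{t+1}(y) = \sum_{x} p_t(x)\, T_t(x, y),

and, for an ergodic chain whose stationary distribution is the target distribution, p_t converges to that target regardless of the initial distribution.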
In the Gibbs sampling algorithm, each step of the random walk is taken
along one dimension, conditioned on the present values of the other
dimensions. In a MRF problem, it is assumed that the conditional
probabilities are known, since they are local (by the Markov property).
This method was developed in the field of physics, but was first applied
to low-level computer vision problems by Geman and Geman on the
problem of image restoration [23]. Geman and Geman furthermore
combined Gibbs sampling with simulated annealing to obtain not just
a sample, but the MAP estimate of their distribution.
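A generic Gibbs sweep over a pairwise MRF can be sketched as follows, reusing the evidence/compatibility conventions from the belief propagation sketch above. This is our own illustration of the resampling step, not the Geman and Geman restoration system, and it omits simulated annealing.

import numpy as np

def gibbs_sample_mrf(phi, psi, nbrs, n_sweeps=100, rng=None):
    """phi: dict node -> evidence vector (floats); psi: dict (i, j) ->
    compatibility matrix indexed [x_i, x_j]; nbrs: dict node -> neighbor list.
    Each sweep resamples every variable conditioned on its neighbors' current values."""
    rng = rng or np.random.default_rng()
    state = {i: rng.integers(len(phi[i])) for i in phi}
    for _ in range(n_sweeps):
        for i in phi:
            p = phi[i].copy()
            for j in nbrs.get(i, []):
                mat = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
                p = p * mat[:, state[j]]     # condition on neighbor j's value
            p /= p.sum()
            state[i] = rng.choice(len(p), p=p)
    return state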
The efficiency of the method depends on how tight c is and how close the functions f(x) and g(x) are, and is not useful in practice for "difficult" problems [45].
While we need to settle this issue in our minds, we argue that even if the models are technically equivalent, their ease of use is not necessarily equivalent; each definitely has its particular merits.³ Bayesian networks model causal dependencies and allow for efficient inference in sparse graphs. Their utility in high-level vision problems has been proven. In contrast, Markov Random Fields have been used primarily for low-level vision problems in which the spatial constraints can be specified directly and only recently have been applied to object recognition (in which case a tree-structured MRF was used). We believe that each model warrants further investigation.

³ Consider the theoretical equivalence of Turing machines with modern computers; which is easier to program for practical tasks?
3 Methodology
For our prototype scene classifier using spatial information, we have focused on the problem of classifying outdoor scenery (i.e., natural) scenes. Many other attempts have been made to classify this type of scene (e.g., [4; 9; 27; 35; 44; 46; 77]), all with some degree of success. One reason is that in natural scenes, low-level features tend to be more correlated with semantics than in scenes of human activity. Another is that there is a limited range of object types present in these scenes [70]. Some research suggests that natural scene databases are less complex (by a certain metric) than those of people [52]. We do note that these findings are by no means conclusive, because the range of features used in the experiments was extremely narrow.
Finally, the ultimate test for our research is to solve a practical, yet very
interesting problem: classifying real consumer-type images. In much of the scene
classification work, as we have seen, experimental databases are usually limited
to professional stock photo libraries. If we can successfully classify consumer images, which are much less constrained in color and composition, our work could potentially generalize to other domains. Robot vision, for instance, needs to
operate in non-structured environments with low-quality sensors.
When the time comes to experiment with real images and real detectors, we
would be well-equipped to detect the materials in these scenes. We have access
to state-of-the-art detectors for sky (blue [38] and mixed [67]), as well as grass,
water, and snow [39].
For a small set of these natural images, we have marked the location and extent of
each material in each image. While gathering data about the presence or absence
of materials is fairly straightforward (and can be represented by text), learning the
spatial relationships between them requires labeling the materials in each image
in such a way that the spatial information can be extracted. Because we want to
compare the effects of qualitative and quantitative spatial relationships, it is not
enough to collect data in the form “the sky is above the sand”. We need to outline
each material’s region in each image in some manner, so that the relationships
can be calculated. A sample of the desired information is captured in Figure 3.1.
Figure 3.1: Ground-truth labeling of a beach scene. Sky, water, and sand regions
are clearly shown.
Ground truth collection involved two stages, collecting images of each class and
labeling the material regions in each image.
In general, images chosen did not have a dominant main subject. This is
legitimate; for example, a close-up image containing a horse in a field would be
considered a “horse” image, not a “field” image. We do plan to investigate later
how to adapt the framework for such images.
Our region labeling was done in a principled fashion, using the following method-
ology:
• Define each material precisely. For instance, how is a crowd of people de-
fined? Is wet sand on a beach considered part of a sand or water region? We
defined “sand/gravel” as dry sand or gravel, and ignored wet sand, some-
times causing wide borders between the sand and water regions as a result.
For a detailed list of definitions, see [39].
For each image region, we selected a polygon contained strictly in the interior
of the region, yielding a “pure” sample of that region. Avoiding the boundary
altogether is simpler, yet leaves spaces between the regions (as shown in Figure
3.1). We could, of course, achieve a closer approximation using a minimization
refinement such as snakes [71]; however, it is doubtful whether such a fine-grained segmentation would help.
We labeled a total of 321 images, belonging to the beach, urban, and field
classes.
Figure 3.2: Prototypical beach scenes. (a) A simple beach scene without back-
ground objects. (b) Because we make no attempt to detect it, we consider the
sailboat to be “background”. (c) A more complicated scene: a developed beach-
front. (d) A scene from a more distant field-of-view. (e) A crowded beach.
the beside relation, because they were found to be more helpful for outdoor scenes.
We use their algorithm for its proven performance on similar scenes (and because
it is faster than the other methods discussed).
This also solves the issue of conflicting relations between a material’s subre-
gions in an intuitive fashion. As a simple example, see Figure 3.4. We have two
regions (A1 and A2 ) of material MA and one region (B) of material MB . A1
is beside B and A2 is above B. However, by considering A1 and A2 as one re-
Figure 3.3: Prototypical urban scenes. (a) The most common urban scene, con-
taining sky, buildings, and roads. (b),(c) The sky is not simply above the buildings
in these images. (d) Roads are not necessary. (e) Perspective views induce varied
spatial relationships. (f) Close views can preclude the presence of sky.
gion, the resulting bounding box overlaps region B and the weighted-walkthrough
algorithm yields a relationship favoring the larger region.
(Figure 3.4: region A1 lies beside region B; region A2 lies above B.)
One goal of our research is to develop and demonstrate representations and al-
gorithms for difficult semantic scene classification problems. We start with prop-
erties of image regions and their spatial relations. We exploit probabilistic con-
straints on interpretations (discrete labels) of regions arising from cultural and
physical regularity and predictability. A simulated environment is essential for us
to manipulate and explore the various relationships, properties, algorithms, and
assumptions available to us without being constrained by the “accidental” (in the
philosophical sense) nature of any particular domain.
These goals can stand in competition with each other: abstracting away too much in a simulated environment can be totally unrealistic and give little hope of solving the original problem. We aim to balance the two.
We can safely abstract away the bank of material and object detectors available
by representing them by their probabilities of correct and incorrect detection,
along with their expected beliefs. For some detectors, we learn these beliefs from
training data, and for others, we estimate them. We can also abstract away the
images (to some degree) by using a probabilistic generative model, as will be
described. We do not have to label a large number of images, and yet we learn
the probabilities from the images we have labeled, thus achieving some balance.
MASSES is similar in spirit to Rimey's T-world [56; 57], except that T-world modeled scenes with tight structure (e.g., a table setting) and should generalize to domains such as traffic scenes or medical diagnosis, where parts of the image follow precise rules. It is doubtful that T-world could model natural scenes, which have a much looser structure.
We now discuss the specific advantages of idealizing the world with MASSES.
1. It is a standalone work, under our control and free from outside influences.
Therefore, we are not dependent upon others’ ground truth data collection
or upon their detectors. Intellectual property issues are avoided.
2. It can embody the ideas we hold about the world, namely the strength and
granularity of the spatial organization and the accuracy of the detectors.
4. Perhaps most importantly, we can easily quantify the value of added infor-
mation. We can answer questions such as:
(b) How does performance decline when we replace true material knowledge
with faulty detectors?
(c) How does a better sky detector help? We can vary the performance
continuously and track the effects.
(d) How does adding another spatial constraint help? For instance, cliques
in a region adjacency graph (RAG) [71] may contain discriminatory
information.
(e) How strict do the spatial rules have to be before we can benefit from
them? How fine-grained do the measurements have to be?
The expected output of the simulator is a series of virtual scenes, which can be
used to train a classifier or to test its performance. In this section, we describe
our scene generative model.
Figure 3.5: The MASSES environment. Statistics from labeled scenes are used to
bootstrap the generative model, which can then produce new virtual scenes for
training or testing the inference module.
Bayesian Network
The network has been designed with the following assumptions in mind:
4. We assume flat priors on the scene classes, thus eliminating their effect. While this assumption does not hold in practice, it simplifies the analysis and we lose no generality.
With this in mind, the method of sampling the Bayes Net is straightforward, using its priors and the conditional probability matrices of its links, because the network is a directed and acyclic graph (DAG). The algorithm, called probabilistic logic sampling for DAGs, was proposed by Henrion [25] (and extended by Cowell [17]).²
1. Generate the class, C, of the scene by randomly sampling the priors, P(class = C).

2. Generate the list of materials present in the scene. Each of the materials has as its parent the scene's class, so we sample from P(material_i | class = C) (one row in the CPM on the link) for each material i.

3. Generate spatial relationships for each pair of materials <i, j> present in the scene by sampling P(M_i rel_ij M_j | class = C).
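The three steps above translate into a short sampling routine. The sketch below uses hypothetical data structures for the priors, the material CPM rows, and the per-pair relationship distributions; in the actual system these are the probabilities learned from the labeled images.

import numpy as np

def sample_virtual_scene(priors, p_material, p_rel, rng=None):
    """priors: dict class -> P(class)
    p_material: dict (material, class) -> P(material present | class)
    p_rel: dict (mi, mj, class) -> dict relation -> probability"""
    rng = rng or np.random.default_rng()
    # Step 1: sample the scene class from its prior.
    classes = list(priors)
    c = rng.choice(classes, p=[priors[k] for k in classes])
    # Step 2: sample the presence of each material given the class.
    materials = sorted({m for (m, cls) in p_material if cls == c})
    present = [m for m in materials if rng.random() < p_material[(m, c)]]
    # Step 3: sample a spatial relation for each pair of present materials.
    relations = {}
    for a in range(len(present)):
        for b in range(a + 1, len(present)):
            mi, mj = present[a], present[b]
            dist = p_rel[(mi, mj, c)]
            rels = list(dist)
            relations[(mi, mj)] = rng.choice(rels, p=[dist[r] for r in rels])
    return c, present, relations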
Figure 3.7 shows an example. Note that there is a separate node in the network for each material and for each pair of materials in the image, a total of \binom{m}{2} + m nodes for m materials.
² If the graph is not a DAG, a more complicated method such as rejection or a different framework such as a nested junction tree [17] can be used.
Figure 3.7: Sampling the scene type yields class C. Then we sample to find the
materials present in the image, in this case, M1, M3, and M4. Finally, we sample
to find the relationships between each pair of these material regions.
If we were to have full knowledge of the materials in a scene, then many scenes could be distinguished by the mere presence or absence of materials. For
instance, consider the outdoor scenes described above. An image containing only
sky, a skyscraper, and roads is almost guaranteed to be an urban scene.
Sand_Detector
Material M      P(D_s fires | M)    1 − P(D_s fires | M)
Sky                  0.05                0.95
Sand                 0.95                0.05
Road                 0.20                0.80
Water                0.20                0.80
Building             0.25                0.75
Background           0.30                0.70
The first column gives P (Ds |M ), the probability that the sand detector Ds
fires, given the true material M. The second column gives the probability 1 − P(Ds|M) that the detector does not fire. In this example, the sand detector has
95% recall of true sand, and detects sand falsely on 20% of the water, for instance
(perhaps due to a brown reflection in the water or shallow sand). It may also fire
falsely on 30% of the background (unmodeled) regions in the images because they
have similar colors.
Assume we have a detector for each material of interest. Each of these detec-
tors will be connected to any given region in the image and will have a certain
probability of firing given the true material in that region. The evidence consists
of the detectors that actually fired and the corresponding beliefs. Take a sand
region; each of the detectors’ characteristics may be as follows:
True Sand
Detector D_m          P(D_m fires | Sand)    1 − P(D_m fires | Sand)
Sky_detector                0.03                  0.97
Sand_detector               0.95                  0.05
Road_detector               0.20                  0.80
Water_detector              0.20                  0.80
Building_detector           0.10                  0.90
In this case, the first column gives, for each material detector D_m, P(D_m|S), the probability that the detector fires on a true sand region.
Each detector is linked to each material in the image, as shown in Figure 3.8.
Note that the subgraph’s “root” node corresponds not to a specific material, but
to a specific region.
Inference on this subgraph allows the combined likelihood of each true material
to be calculated, yielding a soft belief in each material. Let M_i be a material and D be the set of detectors, D = {D_1, D_2, ..., D_n}.
Figure 3.8: Bayesian network subgraph showing relationship between regions and
detectors.
\lambda(R) = \alpha \prod_i \lambda_{D_i}(R)    (3.1)
           = \alpha \prod_i M_{D_i|R}\, \lambda(D_i)    (3.2)

where M_{y|x} \equiv P(y|x).
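Equations 3.1 and 3.2 amount to a product of matrix-vector terms, one per detector. The sketch below uses the same column convention as the detector tables above; the detector characteristics and belief values are placeholders. The short usage example reproduces the single-detector computation worked out later in this section (Equations 3.4 and 3.5).

import numpy as np

# Row order matches the tables above.
MATERIALS = ["Sky", "Sand", "Road", "Water", "Building", "Background"]

def region_likelihood(detector_cpms, detector_beliefs):
    """Combine detector evidence into a normalized likelihood over true materials.
    detector_cpms: dict name -> (len(MATERIALS), 2) array of
                   [P(fires | material), P(does not fire | material)]
    detector_beliefs: dict name -> (belief fired, belief not fired)"""
    lam = np.ones(len(MATERIALS))
    for name, cpm in detector_cpms.items():
        lam_d = np.asarray(detector_beliefs.get(name, (1.0, 1.0)))
        lam *= cpm @ lam_d          # message from one detector to the region
    return lam / lam.sum()          # normalization plays the role of alpha

# Example: the sand detector alone, with belief (0.6, 0.4), reproduces the
# unnormalized vector of Equations 3.4-3.5.
sand_cpm = np.array([[0.05, 0.95], [0.95, 0.05], [0.20, 0.80],
                     [0.20, 0.80], [0.25, 0.75], [0.30, 0.70]])
print(sand_cpm @ np.array([0.6, 0.4]))   # [0.41 0.59 0.44 0.44 0.45 0.46]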
At this point, we have two options. The likelihoods, λ(R) can be passed on to
the remainder of the network (i.e. attaching the subgraph to each material leaf
in the original network) or only the material with the maximum likelihood (ML)
can be passed on by taking arg max_i P(M_i | D). This second option corresponds to
performing inference on the subgraph off-line. Each option has advantages and disadvantages.
1. Pass up the belief in each material, given which detectors fired on the re-
gion. Maintaining uncertainty concerning the true material is desirable. Our
2. Use a Maximum Likelihood approach for detecting one material per region.
Even though we lose the uncertainty in the detection, we decided to choose
this approach initially, since incorporating it into the network was straight-
forward.
In the generative stage for test images, we first generate the scene as described
earlier in this section, including true materials and spatial relations. Consider a
single region with known material. We can approximate, by sampling, the proba-
bility of each configuration of detectors firing, and the material chosen using the
ML approach. Doing this offline once can save time and lend insight into the
true quality of the detectors. The result is a distribution of ML detections given
the true material; sampling from this distribution yields a perturbed (possibly
faulty) material that we substitute for the true one. We keep the spatial relation-
ships originally generated; the composition of the region was mis-detected, not its
location.
We start by describing how to perturb a single material region. The idea is that
we can use each detector’s belief in the region to determine the material with the
maximum likelihood.
2. For each detector that fires, sample the belief distribution to determine the
confidence in the detection.
3. Propagate the beliefs in the Bayesian network to determine the overall like-
lihood of each material.
4. Replace the true material with the (possibly different) material with the
maximum likelihood.
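The perturbation of one region can be sketched as below. The data structures are our own; in particular, the initial step of sampling which detectors fire from their characteristic probabilities is implied by the surrounding text but reconstructed here, and non-firing detectors are assumed to contribute hard "did not fire" evidence. The text that follows works through one concrete instance (Equations 3.3 through 3.5).

import numpy as np

def perturb_region(true_material, detector_cpms, belief_dists, rng=None):
    """Return the maximum-likelihood material after simulated faulty detection
    of a region whose true material is known.
    detector_cpms: dict name -> dict material -> P(detector fires | material)
    belief_dists: dict name -> callable(rng) returning a detection confidence in [0, 1]"""
    rng = rng or np.random.default_rng()
    materials = sorted({m for cpm in detector_cpms.values() for m in cpm})
    lam = {m: 1.0 for m in materials}
    for name, cpm in detector_cpms.items():
        fired = rng.random() < cpm[true_material]   # does this detector fire?
        if fired:
            conf = belief_dists[name](rng)          # sampled detection confidence
            lam_d = (conf, 1.0 - conf)
        else:
            lam_d = (0.0, 1.0)                      # assumed hard "did not fire" evidence
        for m in materials:
            p_fire = cpm[m]
            lam[m] *= p_fire * lam_d[0] + (1.0 - p_fire) * lam_d[1]
    return max(lam, key=lam.get)   # possibly different from true_material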
\lambda(R) = \alpha \prod_{i \in \{KD, SD, RD, WD, BD\}} \lambda_{D_i}(R)    (3.3)

\lambda_{D_{SD}}(R) = \begin{pmatrix} 0.05 & 0.95 \\ 0.95 & 0.05 \\ 0.20 & 0.80 \\ 0.20 & 0.80 \\ 0.25 & 0.75 \\ 0.30 & 0.70 \end{pmatrix} \begin{pmatrix} 0.6 \\ 0.4 \end{pmatrix}    (3.4)

                    = (0.41, 0.59, 0.44, 0.44, 0.45, 0.46)^T    (3.5)
Multiplying the individual detectors' likelihood vectors yields the overall likelihood of each material for the region. In this somewhat unlikely case, because the road material has the greater likelihood, the ML approach would choose it.
in its place. We achieve this off-line by simply running the Material Perturbation
Algorithm repeatedly, keeping tallies of each material detected and normalizing.
Using a given set of five detector characteristics (see Appendix B), we find the
distributions in Table 3.2.
Note that under an ML framework, the detectors are not as good as they first appeared. The most common detector error is for none of the detectors to fire (or to fire only with low belief), causing the region not to be detected at all.
Because we assume the orientation of the image is known, we could also incorporate an orientation model into the material fusion module, which prior work suggests yields a substantial gain in accuracy (11% in [40]).
2) Not using orientation info in the detectors gives lousy detection overall, so
we can expect better when we use it. However, we can just interpret these results
as what we would get in the case of bad detectors!
between materials. Background regions can be treated like other material regions: their probability of occurrence and their spatial relationships with other regions can be calculated. It is not yet clear whether spatial relationships involving the background would be helpful.
When our framework includes faulty detectors, there is always a chance that
true materials of interest will not be detected by any detector (missed regions),
and would therefore be considered background, or that background regions could
falsely be detected as materials of interest (false positives). We currently account
for misses by ignoring (by not instantiating) the background and any relationships
involving it.
4. Modify the conditional probabilities to make the model more robust to errors. For instance, 0s in the matrix should be changed to a small ε.
4 Experimental Results
1. B -scenes always contain water, sky, and sand/gravel. The sky is above or
far above the other materials, and the water tends to lie above the sand,
although it occasionally can lie to the side or below the sand (accounting
for certain less-likely perspectives).
2. L-scenes always contain water and sky, and often contain sand/gravel. How-
ever, the sand tends to lie at the horizon line, above the water.
We have modeled these three simulated scenes after beach, lake/ocean, and
urban scenes. They are similar to their real-life counterparts, except that we
model the land at the horizon as sand/gravel (since the horizon may be confused
with these in images). The adjectives “occasionally”, and “often” in the definitions
above are quantified by the conditional probability distributions in the Bayesian
network from which we are sampling.
We based the CPMs for the B- and U-scenes on those learned from data (120 beach scenes and 100 urban scenes), but using only a subset of the materials present: sky, water, sand, building, and road. We hand-modeled the L-scene CPMs ourselves. The scene priors were chosen to be uniform.
Accuracy on 100,000 scenes was 89.94% without spatial information and 96.99%
with spatial information, a gain of 7%. The absolute performance gain is not
important; what is important is that the hypothesis was verified under the as-
sumption of best-case material detection.
Table 4.1: MASSES with best-case material detection: Accuracy with and without
spatial information.
correctly (and with high confidence) if the sand region was between the sky and
the water regions.
Baseline Success The urban scenes could be discriminated from the other scene types using material presence alone. In our simplified model, buildings were present if and only if the image was a U-scene, so this was expected. Furthermore, L-scenes not containing sand were correctly classified as well, without spatial relationships.
Continued failure Which scenes were misclassified even with spatial relations? First, with small probability, the sand region was below the water region in some L-scenes; these were mistaken for B-scenes. Second, when the sand and water regions were next to each other, the two scene types were often confused. In these cases, the final likelihoods (and hence the scene type) were determined by the spatial relations between each region and the sky region. In B-scenes, sand tends to lie below the sky and water far below the sky, whereas in L-scenes the opposite is true; the small fraction of images not holding to this relationship were confused. Third, some simulated scenes contained seemingly conflicting relations, primarily due to the “far” modifier. Take, for instance, a scene with water below the sky and sand far below the sky. We could reason from this that the sand should lie below the water, but often the sand lay above the water. Because we assumed each pairwise relationship to be independent of the others, this situation occurred.
We classified the 120 material-marked beach scenes in our collection using the same single-level Bayesian networks used in MASSES. As stated earlier, the MASSES net is designed to distinguish between the B-, L-, and U-scenes.
Table 4.2: MASSES with best-case material detection: Accuracy with and without
spatial information.
These results are not surprising, given our definitions. We had defined beach
scenes (see Table 3.1) as ones with at least 5% of the image occupied by each
of water, sand, and sky regions, with no regard for the spatial relationships be-
tween these regions. Therefore, we expect accuracy to be 100% without spatial
information.
Furthermore, we are not surprised that accuracy went down when spatial in-
formation was added to the inference process. Upon observation of the images,
three contained sand above water, due to camera location (e.g., on a boat), and
could be classified as L-scenes. They all contain large water regions and small
sand regions.
The last image, on the right, contains sand beside the water (which alone is ambiguous), but the sky being closer to the sand causes a higher belief in lake. The rationale for calling it a beach scene is that the sand region dominates the water region, which has nothing to do with spatial relationships.
Before the inference stage, we perturb the results according to the method described in Section 3.2.3. Note that the detector model is weak, so we can expect a large percentage of the materials to be misclassified. Results are shown in Table 4.3.
Table 4.3: MASSES with faulty material detection: Accuracy with and without
spatial information.
The 10 images missed when the material-only Bayesian network was used all
contained undetected sand. These fall into two categories.
1. Eight images contain sky, water, and background, and are thus classified as
L-scenes (Figure 4.2).
Figure 4.2: Images incorrectly classified using faulty detectors and no spatial relationships. The actual materials are shown in the top row; the detected materials are shown below each.
2. The other two images mis-detected a building as well and were thus classified
as U-scenes (Figure 4.3).
The 12 images missed when the spatial relationships were added to the Bayesian
network included the 10 missed above. In addition, spatial relationships caused
an incorrect classification in two cases.
1. Spatial relationships were unable to rescue the two images above in which the sand was missed and a building was falsely detected; they were still classified as U-scenes. In the first image, spatial relationships should be able to catch the mis-detected regions: the building and road regions are beside each other, and the building is far below the sky region, both of which are unlikely configurations. However, the belief in the U-scene still dominated that in the other scenes. Mis-detection in the second image was such that it was indistinguishable from a normal beach scene.
2. Spatial relationships were unable to rescue the eight images containing sky,
water, and background: they are still classified as L-scenes.
3. Two additional images were misclassified (Figure 4.4). Both contain a region misclassified as sand lying above the water region, causing them to be classified as L-scenes. In the first image, a large sand region was undetected, and the small remaining sand region was above the water, leading to a higher belief in an L-scene than in a B-scene. The second image continued to be misclassified as an L-scene, due to camera location (as above).
5 Proposed Research
While semantic detectors are not perfect, they continue to increase in speed and accuracy; they are now mature enough to be reliably useful. We have access to state-of-the-art detectors for blue and mixed sky, grass, skin, and faces, as well as less-reliable detectors for water and snow fields, still under development [39; 38; 67]. Current research into related frameworks that rely on high-level features like these, though still in its infancy, is promising [37; 44; 48; 70]. However, no scene classification system makes use of the spatial relationships between the detected objects.
Experience developing the Bayesian tree that served as our initial inference mechanism was encouraging, but we found it insufficiently expressive to handle evidence from faulty detectors when that evidence corresponds to individual regions and their spatial relationships.
With this in mind, our central claim is as follows: Spatial modeling of semantic
objects and materials can be used to disambiguate certain scene types as well as
mitigate the effects of faulty detectors. Furthermore, an appropriate probabilistic
inference mechanism must be developed to handle the loose spatial structure found
in real images.
• Evaluate general-topology (DAG) Bayesian networks and Markov Random Fields as candidate inference frameworks.
• Extend the chosen framework, because neither fits our needs exactly.
• Measure the efficacy both of our features and of our inference mechanism
by conducting experiments both on simulated and real images.
We define the input and output of our system, then elaborate on our research
plan in the following sections.
We also assume the existence of a mechanism for fusing the material beliefs
of each region and for adjudicating between discrepancies in the segmentation
induced by each map. Material detectors and the associated fusion mechanism
are documented in the literature [40; 39].
Figure 5.1: Classifier input (labeled regions) and output (classification and confidence). Each input region carries material beliefs (e.g., Bel(sky) = 0.78, Bel(sand) = 0.68, Bel(water) = 0.53), and the output is a scene label (BEACH) with a belief in each scene type (e.g., Bel(Beach) = 0.73, Bel(Urban) = 0.04).
We would also like to output a belief in at least the most likely scene type
(if not each scene type). The probabilistic framework allows us to output these
beliefs, which can be thresholded to label a scene as “unknown”.
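A minimal sketch of this thresholding (the threshold value is an assumption, not a design decision we have made):

    def label_scene(scene_beliefs, threshold=0.5):
        # Return the most likely scene type, or "unknown" if no belief is strong enough.
        best = max(scene_beliefs, key=scene_beliefs.get)
        return best if scene_beliefs[best] >= threshold else "unknown"

    # label_scene({"beach": 0.73, "urban": 0.04, "lake": 0.23}) -> "beach"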
As we have seen, the generative model (single-level Bayes net) of our MASSES
simulator generates materials, not individual regions. Under best-case material
detection, this is not problematic. However, when materials are falsely detected,
it is limited. Consider the case shown in Figure 5.2. Initially, we have the re-
lationships (1) sky Abovewater , (2) sand Abovewater , and (3) sky Abovesand . Consider
the case where the water is mis-detected as sky. Now the relationships are (1)
sky Abovesky , (2) sand Abovesky , and (3) sky Abovesand . Relation (1) can be safely ig-
nored if we consider it was a sky region broken into two pieces. However, relations
(2) and (3) are contradictory.
Figure 5.2: A layout with sky above sand above water (left), and the same layout with the water region mis-detected as sky (right).
It is not clear how to handle conflicting evidence. Splitting belief within a single
relationship node is not correct, and adding multiple nodes to capture multiple
binary relationships between two materials is ad hoc.
The limitation seems to stem from the current system’s architecture, which
only encodes materials and not individual regions in the nodes. It is true that
some non-adjacent regions of the same material could be merged using model-
based or belief-based grouping [67] (and creating occlusion relationships in the
process). However, since multiple regions may have different associated material
beliefs, information is lost.
It is clear that the single-level Bayesian network is not expressive enough to encode
our knowledge of the world. We have identified two potential frameworks to
explore: general-topology Bayesian networks and Markov Random Fields.
Figure 5.3: Proposed DAG Bayesian network. Note that separate material and region layers are needed: the class node connects to nodes for the materials present and their spatial relationships, which in turn connect to nodes for the individual regions and their spatial relationships.
The network shown is a two-level Bayesian network, in which one level corresponds to the materials present and the other to the individual regions.
The two levels are combined using a fuzzy-OR logical relation: for example, if any
region is detected as sky, then the sky node is activated.
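A sketch of one way the region-to-material combination could be realized (the max-based fuzzy OR below is our reading of that combination, not a settled design):

    def material_belief(region_beliefs, material):
        # Fuzzy OR across regions: the material node is activated as strongly
        # as its most strongly detected region.
        return max((b.get(material, 0.0) for b in region_beliefs), default=0.0)

    # e.g., material_belief([{"sky": 0.78, "water": 0.07}, {"sand": 0.68}], "sky") -> 0.78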
However, we have some concerns with this network. First, while the approx-
imate inference algorithms discussed in Section 2.4.1 are designed for networks
with loops, it is unclear how well they would perform on dense graphs such as
this. Pearl hints that both the clustering and conditioning algorithms are tractable
only for sparse graphs. We need to investigate the stochastic simulation and belief
propagation algorithms further.
Second, it is not clear how to quantify the links connecting the region and
material nodes, because they represent logical relationships of variables with mul-
tiple hypotheses. Formal mechanisms such as the noisy-OR gate are designed for
binary variables.
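For reference, the standard noisy-OR gate assumes binary parents X_1, ..., X_n with individual activation probabilities q_i and combines them as

P(Y = 1 \mid x_1, \ldots, x_n) = 1 - \prod_{i : x_i = 1} (1 - q_i),

which does not carry over directly to multi-valued material and region variables.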
Furthermore, given the debate over the relative merits of logic and probability [14; 50], mixing the two by adding logical constraints to a probabilistic model raises technical concerns in our minds.
For each scene-specific MRF, the top level corresponds to the combined material beliefs in each region. The bottom level corresponds to the scene model, which is given by a distribution of “typical” material configurations for that scene.
Neither MRFs nor Bayesian networks fit our needs exactly. We briefly describe
shortcomings in each framework and our initial ideas for extending them.
3. In cases where the detector evidence is weak (or the evidence is conflicting, the configuration is unlikely, or the scene is unmodeled), we would like to label the image as “unknown”. This necessitates modeling an “other” category, which would require a large training set. This problem is also much more difficult than forced-choice classification.
MRFs are typically used on a single scene model and generate a MAP estimate
of the scene given the observed image. To approximate the MAP, one does not
need to calculate the actual probabilities:
\arg\max_S P(S|I) = \arg\max_S \frac{P(I|S)\,P(S)}{P(I)}    (5.1)
                  = \arg\max_S P(I|S)\,P(S)    (5.2)
the probability of the image given that scene). We label the image with the class
of the MRF model producing the “Maximum of the MAP estimates” (MMAP).
Of course, cross-model comparison necessitates normalization, which is our first
challenge.
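A sketch of the MMAP decision rule (the per-scene scoring functions are hypothetical placeholders for the scene-specific MRFs; the scores only become comparable across models once each normalization constant is accounted for):

    import math

    def classify_mmap(image, scene_models):
        # scene_models maps a scene label to a function returning the
        # unnormalized MAP score max_S P(I|S)P(S) under that scene's MRF.
        scores = {label: math.log(model(image)) for label, model in scene_models.items()}
        best = max(scores, key=scores.get)
        return best, scores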
First, using a small number of nodes allows direct computation of the normalization constant; having a node correspond to each region, or using a small (e.g., 4 × 4) square lattice, should suffice (see the sketch after this list).
4. Because they operate at the pixel level, typical MRFs for vision are constructed on regular lattices. Our high-level vision problem starts with segmented regions rather than raw pixels.
(a) Encode irregular regions and spatial relationships using a general graph structure. The literature shows few departures from the lattice structure, except for use in hypertext link modeling [11] and object recognition [19].
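The sketch below (Python; the energy function, node count, and label count are placeholders) shows why keeping the number of nodes small makes the normalization constant tractable: the partition function can be computed by enumerating all labelings directly.

    import itertools
    import math

    def partition_function(n_nodes, n_labels, energy):
        # Z = sum over all joint labelings of exp(-E(labeling)).
        # Enumeration is exponential in the number of nodes, so it is only
        # feasible for small graphs, e.g., one node per region.
        return sum(math.exp(-energy(labeling))
                   for labeling in itertools.product(range(n_labels), repeat=n_nodes))

    # With 6 region nodes and 6 material labels there are 6**6 = 46,656
    # labelings, which is cheap to enumerate; a full pixel lattice would not be.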
In order to validate our claims about semantic features and spatial relationships,
we must compare the efficacy of our system with a system using low-level features
and pattern classification techniques. We plan to use a benchmark set of outdoor
scene images.
Our proposed framework is not specific to outdoor scenes; the probabilities of the
presence of materials and the spatial relationships between them are learned from
data. We plan to experiment with other object-centric scene types that have spatial structure, such as objects arranged on a table. While the spatial relationships vary more for scenes captured at close range, this will test the generalizability of our methods.
Some researchers, in particular Oliva and Torralba [47], argue that human
observers can recover scene identity even when objects are unrecognizable
in isolation and thus propose the “scene-centered” features discussed in Sec-
tion 2.2.3. This appears to be true in certain cases, but I would agree with
Oliva and Torralba’s assessment that their approach is complementary to
approaches like the proposed one which are based on semantic objects. An
interesting direction would be to begin the process of uniting the two com-
plementary approaches.
• Comparison of the two schemes in Section 2.3.1 (AVS and Weighted Walkthrough) for computing qualitative relations, in order to achieve a balance between computational cost and perceptual integrity.
Our experiments revealed some ambiguity in the distinction between real beach
and lake photos. There is more to semantic understanding than materials and
spatial relations. For example, if I take a picture while standing on the beach
looking out over the water, and no sand is in the image, we think of it as a lake
scene. If a large sand region dominates the foreground, then we consider it to be
a beach scene. If the camera is positioned above the water and its field of view
contains the beach, it may be classified as either: if I am standing ten feet from
the shore, then it would be a beach semantically; if I am on a boat three miles
from the shore, then it would be a lake semantically. In this case, our system
classifies each as a lake, because sand appears above water in the image.
Accounting for one summer working in industry, the anticipated defense date
is May, 2005.
6 Acknowledgments
This research was supported by a grant from Eastman Kodak Company, by the
NSF under grant number EIA-0080124, and by the Department of Education
(GAANN) under grant number P200A000306.
Bibliography
[4] D.C. Becalick. Natural scene classification that avoids a segmentation stage.
Technical report, ECE Department, Imperial College of Science, Technology,
and Medicine, 1996.
[6] Matthew Boutell, Jiebo Luo, and Christopher Brown. Review of the state of
the art in semantic scene classification. Technical Report 799, University of
Rochester, Rochester, NY, December 2002.
[7] Matthew Boutell and Brandon Sanders. Markov random fields: An overview. Technical report, University of Rochester, Rochester, NY, 2003. In preparation.
[12] Peng Chang and John Krumm. Object recognition with color co-occurrence histograms. In IEEE Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, June 23–25 1999.
[15] Rama Chellappa and Anil Jain, editors. Markov Random Fields: Theory and
Application. Academic Press, San Diego, CA, 1993.
[16] Paul Chou. The Theory and Practice of Bayesian Image Labeling. PhD thesis,
University of Rochester, Rochester, NY, 1988.
[18] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, 2nd edition, 2001.
[20] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002.
[22] W.T. Freeman, E.C. Pasztor, and O.T. Carmichael. Learning low-level vision.
International Journal of Computer Vision, 40(1):24–57, October 2000.
[23] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, November 1984.
[24] A. Hauptmann and M. Smith. Text, speech, and vision for video segmentation: The Informedia project. In AAAI Symposium on Computational Models for Integrating Language and Vision, Fall 1995.
[26] P. Hong, T. Huang, and R. Wang. Learning patterns from images by com-
bining soft decisions and hard decisions. In IEEE Conference on Computer
Vision and Pattern Recognition, volume 1, pages 78–83, Hilton Head, SC,
June 13-15 2000.
[27] Q. Iqbal and J. K. Aggarwal. Combining structure, color and texture for
image retrieval: A performance evaluation. In International Conference on
Pattern Recognition (ICPR), August 2002.
[28] A.K. Jain, R.P.W. Duin, and Jianchang Mao. Statistical pattern recognition:
a review. IEEE PAMI, 22(1):4–37, January 2000.
[30] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their
Applications, volume 1. American Mathematical Society, Providence, RI,
1980.
[36] Y. Lu, C. Hu, X. Zhu, H. J. Zhang, and Q. Yang. A unified framework for
semantics and feature based relevance feedback in image retrieval systems.
In ACM Multimedia Conference, Los Angeles, October 2000.
[37] J. Luo and A. Savakis. Indoor vs. outdoor classification of consumer pho-
tographs using low-level and semantic features. In IEEE International Con-
ference on Image Processing, Thessaloniki, Greece, October 2001.
[38] Jiebo Luo and Stephen Etz. A physics-motivated approach to detecting sky in
photographs. In International Conference on Pattern Recognition, volume 1,
Quebec City, QC, Canada, August 11 - 15 2002.
[39] Jiebo Luo, Amit Singhal, and Weiyu Zhu. Natural object detection in outdoor
scenes based on probabilistic spatial context models. In ICME, Baltimore,
MD, July 2003.
[40] Jiebo Luo, Amit Singhal, and Weiyu Zhu. Towards holistic scene content
classification using spatial context-aware scene models. In IEEE Conference
on Computer Vision and Pattern Recognition, Madison, WI, June 2003.
[43] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, San Francisco, 1982.
[44] Philippe Mulhem, Wee Kheng Leow, and Yoong Keok Lee. Fuzzy conceptual
graphs for matching images of natural scenes. In IJCAI, pages 1397–1404,
2001.
[45] R. M. Neal. Probabilistic inference using markov chain monte carlo methods.
Technical Report CRG-TR-93-1, Dept. of Computer Science, University of
Toronto, 1993.
[46] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic repre-
sentation of the spatial envelope. International Journal of Computer Vision,
42(3):145–175, 2001.
[48] S. Paek and S.-F. Chang. A knowledge engineering approach for image clas-
sification based on probabilistic reasoning systems. In IEEE International
Conference on Multimedia and Expo. (ICME-2000), volume II, pages 1133–
1136, New York City, NY, Jul 30-Aug 2 2000.
[49] G. Pass, R. Zabih, and J. Miller. Comparing images using color coherence
vectors. In Proceedings of the 4th ACM International Conference on Multi-
media, pages 65–73, Boston, Massachusetts, November 1996.
[51] T. Randen and J.M. Husoy. Filtering for texture classification. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, 21(4):291–310, April
1999.
[52] A. Rao, R. Srihari, L. Zhu, and A. Zhang. A theory for measuring the
complexity of image databases. IEEE Transactions on Multimedia, 4(2):160–
173, June 2002.
[53] Rajesh Rao and Dana Ballard. Efficient encoding of natural time varying
images produces oriented space-time receptive fields. Technical Report 97.4,
University of Rochester, Rochester, NY, August 1997.
[54] A. Ratan and W.E.L. Grimson. Training templates for scene classification
using a few examples. In Proceedings of IEEE Content Based Access of Image
and Video Libraries, San Juan, 1997.
[55] Terry Regier and Laura Carlson. Grounding spatial language in perception:
An empirical and computational investigation. Journal of Experimental Psy-
chology: General, 130(2):273–298, 2001.
[57] Raymond D. Rimey. Control of Selective Perception using Bayes Nets and De-
cision Theory. PhD thesis, Computer Science Dept., U. Rochester, Rochester,
NY, December 1993.
[60] Henry Schneiderman and Takeo Kanade. A statistical method for 3d object
detection applied to faces and cars. In IEEE Conference on Computer Vision
and Pattern Recognition, 2000.
[64] Satoshi Semba, Masayoshi Shimizu, and Shoji Suzuki. Skin color based light-
ness correction method for digital color images. In PICS, pages 399–402,
2001.
[65] Navid Serrano, Andreas Savakis, and Jiebo Luo. A computationally efficient
approach to indoor/outdoor scene classification. In International Conference
on Pattern Recognition, September 2002.
[66] A. Singhal. Bayesian Evidence Combination for Region Labeling. PhD thesis,
University of Rochester, Rochester, NY, 2001.
[67] Amit Singhal and Jiebo Luo. Hybrid approach to classifying sky regions in
natural images. In Proceedings of the SPIE, volume 5022, July 2003.
[69] J. R. Smith and C.-S. Li. Image classification and querying using composite
region templates. Computer Vision and Image Understanding, 75(1/2):165 –
174, July/August 1999.
[70] Y. Song and A. Zhang. Analyzing scenery images by monotonic tree. ACM
Multimedia Systems Journal, 2002.
[71] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis, and Machine Vision. Brooks/Cole Publishing, Pacific Grove, CA, 2nd edition, 1999.
[75] A. Torralba and P. Sinha. Statistical context priming for object detection.
In Proceedings of the International Conference on Computer Vision, pages
763–770, Vancouver, Canada, 2001.
[78] A. Vailaya, A. K. Jain, and H.-J. Zhang. On image classification: City images
vs. landscapes. Pattern Recognition, 31:1921–1936, December 1998.
[79] A. Vailaya, H.J. Zhang, and A. Jain. Automatic image orientation detection.
In Proc. IEEE International Conference on Image Processing, Kobe, Japan,
October 1999.
[82] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation
and its generalizations. International Joint Conference on Artificial Intelli-
gence (IJCAI), Distinguished Presentations, August 2001.
We give the spatial relationship counts calculated from the 220 labeled beach and
urban scenes.
The row is the landmark object and the column is the trajector object, so, for instance, Above(foliage, sky) = 24 in the first table means that sky was located above foliage in 24 beach images. Blue patch (of sky) was incorporated into sky; we do not differentiate based on cloudy/clear sky.
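A sketch of how such counts are converted into the relationship probabilities used for inference (the smoothing value is an assumption; the example count comes from the first table below):

    def relation_probability(count, n_images, eps=1e-3):
        # Estimate P(relationship holds | scene) from labeled images; a zero
        # count is replaced by a small epsilon, as proposed earlier.
        return max(count, eps) / n_images

    # From the BEACH "above" table: sky was above foliage in 24 of 120 images.
    p_sky_above_foliage = relation_probability(24, 120)   # 0.2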
BEACH
Relation above
sky blu gra fol wat sno san roc roa bui cro
sky 0 0 0 1 0 0 0 0 0 0 0
bluePatch 0 0 0 0 0 0 0 0 0 0 0
grass 0 0 0 1 0 0 0 0 0 0 0
foliage 24 0 0 0 2 0 8 0 0 1 1
water 97 0 0 12 0 0 5 8 1 7 0
snow 0 0 0 0 0 0 0 0 0 0 0
sand 16 0 0 5 51 0 0 9 0 4 10
rock 11 0 0 4 3 0 3 0 0 0 0
road 1 0 0 0 1 0 0 0 0 1 0
building 13 0 0 1 0 0 0 1 0 0 0
crowd 3 0 0 2 6 0 0 0 0 2 0
Relation below
sky blu gra fol wat sno san roc roa bui cro
sky 0 0 0 24 97 0 16 11 1 13 3
bluePatch 0 0 0 0 0 0 0 0 0 0 0
grass 0 0 0 0 0 0 0 0 0 0 0
foliage 1 0 1 0 12 0 5 4 0 1 2
water 0 0 0 2 0 0 51 3 1 0 6
snow 0 0 0 0 0 0 0 0 0 0 0
sand 0 0 0 8 5 0 0 3 0 0 0
rock 0 0 0 0 8 0 9 0 0 1 0
road 0 0 0 0 1 0 0 0 0 0 0
building 0 0 0 1 7 0 4 0 1 0 2
crowd 0 0 0 1 0 0 10 0 0 0 0
URBAN
Relation above
sky blu gra fol wat sno san roc roa bui cro
sky 0 0 0 1 0 0 0 0 0 0 0
bluePatch 0 0 0 0 0 0 0 0 0 0 0
grass 0 0 0 6 0 0 0 0 0 1 0
foliage 9 0 0 0 1 0 0 0 0 22 0
water 6 0 0 0 0 0 0 0 0 14 0
snow 0 0 0 0 0 0 0 0 0 0 0
sand 0 0 0 0 0 0 0 0 0 0 0
rock 0 0 0 0 0 0 0 0 0 0 0
road 1 0 1 4 2 0 0 0 0 19 0
building 77 0 0 2 1 0 0 1 0 0 0
crowd 0 0 0 0 0 0 0 0 0 0 0
Relation below
sky blu gra fol wat sno san roc roa bui cro
sky 0 0 0 9 6 0 0 0 1 77 0
bluePatch 0 0 0 0 0 0 0 0 0 0 0
grass 0 0 0 0 0 0 0 0 1 0 0
foliage 1 0 6 0 0 0 0 0 4 2 0
water 0 0 0 1 0 0 0 0 2 1 0
snow 0 0 0 0 0 0 0 0 0 0 0
sand 0 0 0 0 0 0 0 0 0 0 0
rock 0 0 0 0 0 0 0 0 0 1 0
road 0 0 0 0 0 0 0 0 0 0 0
building 0 0 1 22 14 0 0 0 19 0 0
crowd 0 0 0 0 0 0 0 0 0 0 0
B Detector Characteristics
Each row gives the probability of the detector firing and not firing on each true
material.
node-name Sky_Detector
0.95 0.05 Sky
0.03 0.97 Sand
0.05 0.95 Road
0.18 0.82 Water
0.05 0.95 Building
0.01 0.99 Background
node-name Sand_Detector
0.05 0.95 Sky
0.95 0.05 Sand
0.20 0.80 Road
0.20 0.80 Water
0.25 0.75 Building
0.30 0.70 Background
node-name Road_Detector
0.05 0.95 Sky
0.20 0.80 Sand
0.85 0.15 Road
0.15 0.85 Water
0.40 0.60 Building
0.25 0.75 Background
node-name Water_Detector
node-name Building_Detector
0.05 0.95 Sky
0.10 0.90 Sand
0.15 0.85 Road
0.05 0.95 Water
0.80 0.20 Building
0.10 0.90 Background
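As a small illustration of how these characteristics drive the simulation (the dictionary encoding and the use of Python's random module are ours; only three of the five detectors are shown), a region's firing detectors can be sampled directly from the table rows:

    import random

    # P(detector fires | true material), copied from the tables above.
    FIRE_PROB = {
        "Sky_Detector":      {"Sky": 0.95, "Sand": 0.03, "Road": 0.05,
                              "Water": 0.18, "Building": 0.05, "Background": 0.01},
        "Sand_Detector":     {"Sky": 0.05, "Sand": 0.95, "Road": 0.20,
                              "Water": 0.20, "Building": 0.25, "Background": 0.30},
        "Building_Detector": {"Sky": 0.05, "Sand": 0.10, "Road": 0.15,
                              "Water": 0.05, "Building": 0.80, "Background": 0.10},
    }

    def sample_firings(true_material):
        # The firing-sampling step of the Material Perturbation Algorithm:
        # each detector fires with the probability listed for the region's true material.
        return [name for name, probs in FIRE_PROB.items()
                if random.random() < probs[true_material]]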