Enhancement of Object Detection and


Definition Using Spatio-Temporal Clustering
(April 2015)
Michael J. Seese

Abstract—This document describes the motivation, research,
implementation, results, and conclusions behind the exploitation
of spatio-temporal clustering to enhance object detection and
definition within imagery data. This paper utilizes standard
feature detection algorithms to find points of interest. These
features are assigned descriptors and correlated between frames.
Point trajectories are calculated by how the features change
between frames; these point trajectories are clustered using a
motion dynamics model. The clusters formed enhance object
detection and better define the extents of these objects. This
paper also presents a Graphical User Interface (GUI) for
analyzing candidate video segmentation algorithms; this GUI
lays a framework for advanced algorithm development.
Index Terms—spatio-temporal, clustering, unsupervised
learning, image segmentation, object detection, object definition,
video segmentation

I. INTRODUCTION

IMAGE segmentation techniques often over-segment by
separating an object by different features. Due to different
colors, shapes, and other features, the reported segments are
incorrect. When looking at only a single frame, it is very
difficult to determine that these features belong to the same
object without prior learning to classify that specific object.
To achieve correlation between these features, spatio-temporal
clustering can be exploited to capture how the features
change over time. By
using clustering techniques, an unsupervised learning
algorithm can be realized to define object boundaries in video
sequences by correlating different features that have the same
object dynamics.
As described in Section IV, there are a few popular spatio-temporal features that are used for clustering. Point trajectories
describe how a specific feature/pixel changes between frames.
Region trajectories generalize this by describing how regions or
segments change between frames. This paper focuses on using
point trajectories as the spatio-temporal features; however, the
algorithm presented could incorporate a mixture of point
trajectories and region trajectories.

M. J. Seese is a graduate student at the University of Florida studying
towards a Master of Science in Electrical and Computer Engineering. Seese
holds a Bachelor of Science in Electrical Engineering and a Bachelor of
Science in Computer Engineering, earned from the University of Florida in
May 2013 (e-mail: mjseese@ufl.edu).


The methods used for clustering these spatio-temporal
features vary between papers. Hierarchical clustering has been
seen as a semi-popular method for spatio-temporal clustering.
Many papers have favored spectral clustering of the spatio-temporal features to determine if behavior and location cluster
together (if two objects are moving at the same rate but are
separated in space, spectral clustering will help distinguish
these objects). This paper focuses on a simplistic clustering
algorithm that minimizes computational complexity
while providing a framework for further research and investigation.
Spatio-temporal clustering can be done by realizing possible
motion dynamics. Motion dynamics are a set of prior
knowledge, or assumptions, of how objects may move from
one frame to another. The higher the order used for this
dynamics model, the better an algorithm can determine
accurate object boundaries. However, adding complexity to
the motion dynamics model adds computational complexity
([3] presents O(n^k) computational complexity for order k
with n trajectories). Higher orders also add ambiguity:
since there is noise when determining the spatio-temporal features
(i.e., point trajectories), it becomes difficult
to determine whether an observed deviation is noise or
another object.
Due to time restrictions, the original algorithm could not be
fully implemented and tested. However, the research presented
in this paper lays out a framework for advanced algorithm
development for video segmentation. This framework will be
vital to future work in the area of video segmentation and can
be used by other researchers to design new algorithms. Further
investigation will be conducted to look at higher order motion
dynamics and spectral clustering.
II. MOTIVATION
Although there are many applications for the research
conducted in this paper (see Section IX), the author was
motivated towards autonomous system applications and
unsupervised learning. Autonomous systems currently
struggle with interacting with unknown objects; many
autonomous systems today will use training data to help
classify and recognize objects desired for interaction.
Interaction requires knowledge of shape, size, motion
dynamics, and other features of the object. The main
motivation, but not the only application, of this research is to

provide a foundation for object detection and definition for
autonomous interaction with untrained objects.
III. SPATIO-TEMPORAL
Spatio-temporal refers to something belonging to both space
and time. This paper uses spatio-temporal features within
video sequences for defining objects.
First, this paper outlines a method for obtaining these
spatio-temporal features. These features describe how features
within an image change between frames over time. Next, we
show a candidate clustering algorithm for associating and
correlating these features together into distinct objects.
There has been a substantial amount of research in the area
of spatio-temporal clustering. Rather than clustering on space
or time individually, this type of clustering shows how data
correlates between both time and space. Many papers focus on
the aspect that events happen in certain areas at certain times;
this draws conclusions about distinct events. Another analysis
of spatio-temporal features looks at how data points change
over time.
The latter of these, which this paper focuses on, shows
promising results towards
unsupervised learning. We can draw conclusions about data by
observing how they change over time. Clusters of certain
behaviors can be observed and decisions can be made about
this similar behavior without having strict prior training.
IV. RELATED WORK
Much research in this area observes image feature changes
within a video sequence in order to improve image
segmentation performance. Properly segmenting an image provides
boundaries of objects. This task is vital to a handful of
applications shown in Section IX.
A. Point Trajectories
A point trajectory describes how a point in an image has
changed from the previous image. To calculate a point
trajectory, one must correlate the desired pixel value to the
previous frame. Once the correlation is made between this
frame and the previous frame, the point trajectory is the vector
from the previous pixel location to the current pixel location.
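As a concrete illustration, the correlate-then-difference computation described above can be sketched in Python; the brute-force descriptor matching and the `max_dist` threshold are illustrative assumptions, not details of the paper's MATLAB implementation.

```python
import numpy as np

def point_trajectories(prev_pts, prev_desc, curr_pts, curr_desc, max_dist=0.3):
    """For each feature in the previous frame, find the closest descriptor in
    the current frame; the point trajectory is the displacement vector from
    the previous pixel location to the matched current pixel location."""
    trajectories = []
    for i, d in enumerate(prev_desc):
        # Euclidean distance from this descriptor to every current descriptor.
        dists = np.linalg.norm(curr_desc - d, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < max_dist:  # accept only confident correlations
            trajectories.append((prev_pts[i], curr_pts[j] - prev_pts[i]))
    return trajectories
```

In practice the descriptors would come from a detector/descriptor such as SURF, and the displacement is only valid when the match is unambiguous.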
As addressed in [3], point trajectories inherently capture
translational change but lack higher order dynamics
(i.e., rotation and scale). [3] attempts to address this
issue by using hypergraphs. By executing a projection, the
hypergraph is transferred to a normal graph for use in spectral
clustering. Unfortunately, the computational complexity is
O(n^k) for k-affinities. [3] mitigates this problem by reducing
the number of hyperedges to be considered for clustering by
sampling the edges.
B. Region Trajectories
A region trajectory differs from a point trajectory by looking
at how regions, or segments, change from frame to frame.
The calculation of a region trajectory also differs: [2] shows
how to generate region trajectories in Section 2.2 by using an
acyclic graph.

Region trajectories are useful for video segmentation;


however, the image segmentation algorithm needs to be
accurate and provide meaningful regions/segments. Seen in
[2], over-segmentation can lose information about the object
boundaries. Region trajectories may also lose resolution in
optical flow; this can cause over-clustering within the video.
C. Spectral Clustering
Spatio-temporal clustering techniques have favored spectral
clustering in many research articles [1], [3]. Spectral
clustering attempts to reduce the dimensions of non-linear
models by using the eigenvalues of the similarity matrix.
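A minimal two-way sketch of this idea, assuming a symmetric similarity matrix `W` built from trajectory affinities (this paper's implementation did not use spectral clustering; the sketch is illustrative only):

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way spectral clustering sketch: build the unnormalized graph
    Laplacian L = D - W from a symmetric similarity matrix W and split the
    points by the sign of the Fiedler vector (the eigenvector of the
    second-smallest eigenvalue)."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)   # eigh: symmetric eigendecomposition
    fiedler = vecs[:, 1]             # eigenvalues are sorted ascending
    return (fiedler > 0).astype(int) # cluster label per point
```

Multi-way spectral clustering generalizes this by running k-means on the rows of the first few eigenvectors.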
V. CONCEPT
The concept for this algorithm uses image feature detectors
to find points of interest within each frame as a basis for
determining object boundaries. Most algorithms sample
most pixels within a frame to determine the optical flow of
the image. This is time-consuming, so some subsampling is
needed. [3] introduces a pseudo-random sampling mechanism
to reduce the number of computations for clustering.
The algorithm presented here finds features within an image
such as FAST corners, Sobel edges, and Speeded-Up Robust
Features (SURF). These features are assigned descriptors for
correlation to other frames. SURF already incorporates a
descriptor; this descriptor was used to describe the other
features.
The original concept for the algorithm was to use higher
order motion dynamics to include rotational and scaling
objects. This is almost necessary for most video sequences
since objects have relative rotational and scaling properties
when they move or the field of view moves. However, due to
time constraints, only a 2-D translational motion dynamics
model was realized.
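Under such a 2-D translational model, each trajectory reduces to a per-feature velocity vector, and clustering amounts to grouping similar headings and rates. A hedged sketch, with illustrative tolerances (not values from the implementation) and ignoring angle wrap-around at ±π for brevity:

```python
import numpy as np

def cluster_trajectories(velocities, heading_tol=0.3, rate_tol=1.0):
    """Greedy nearest-neighbor style grouping: a trajectory joins an existing
    cluster if its heading (radians) and rate (pixels/frame) are within
    tolerance of the cluster's running mean; otherwise it seeds a new
    cluster, so the number of clusters emerges from the data."""
    clusters = []  # list of (mean_heading, mean_rate, member_indices)
    labels = []
    for i, v in enumerate(velocities):
        heading = np.arctan2(v[1], v[0])
        rate = np.hypot(v[0], v[1])
        for k, (h, r, members) in enumerate(clusters):
            if abs(heading - h) < heading_tol and abs(rate - r) < rate_tol:
                members.append(i)
                n = len(members)
                # Update the cluster's running means with the new member.
                clusters[k] = ((h * (n - 1) + heading) / n,
                               (r * (n - 1) + rate) / n, members)
                labels.append(k)
                break
        else:
            clusters.append((heading, rate, [i]))
            labels.append(len(clusters) - 1)
    return labels
```

Unlike k-means, no cluster count is fixed in advance, which matches the flexibility discussed for the clustering step in Section VI.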
VI. ALGORITHM
The proposed algorithm consists of several components.
Figure 1 shows the overview of the algorithm, which loops for
every frame captured. First, features in the frame are
extracted. These features are spatial only and have no
dependency on prior frames. Figure 2 shows the "Get
Features" sub-function; it can be seen that Sobel Edge, FAST
Corner, Color-Based Blob, and Speeded-Up Robust Features
(SURF) detectors are used as sample features. Unfortunately,
only enough time was permitted to test Sobel Edge, FAST,
and SURF detectors in the clustering algorithm. This
clustering algorithm can be extended to other types of feature
detectors. Figure 3 shows the "Process Features" sub-function
where feature descriptors are calculated using the SURF
descriptor implementation.
[6] was used for obtaining SURF features and for computing
descriptors for non-SURF features (i.e., Sobel edges). [7] was
used for FAST features. Standard MATLAB edge detection
functions were used for Sobel edge detection.
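For illustration, a minimal Sobel edge detector along the lines of MATLAB's built-in edge function might look like the following sketch; the kernel orientation and threshold are assumptions, and a real implementation would use vectorized convolution.

```python
import numpy as np

def sobel_edges(img, thresh=1.0):
    """Naive Sobel edge detector sketch: convolve the image with the
    horizontal and vertical Sobel kernels and threshold the gradient
    magnitude. Returns a boolean edge map for the image interior."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)  # horizontal gradient
            gy[i, j] = np.sum(patch * ky)  # vertical gradient
    mag = np.hypot(gx, gy)                 # gradient magnitude
    return mag > thresh
```

The edge pixels found this way are then assigned SURF descriptors, as described above, so they can be correlated between frames.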

Clustering of these point trajectories is done using a nearest-neighbor
mechanism. Point trajectories with a similar heading
and rate are clustered together. This allows for flexibility in
terms of how many clusters are formed. Other clustering
techniques such as k-means require separating the data into a
specific number of clusters even if the data should belong to
fewer or more clusters.
However, this method for clustering is not robust. It appears
that non-rigid cameras generate jitter in the video, which
causes this algorithm to over-generalize and put too many
features into the same cluster.

Figure 1. Overview of Algorithm

Figure 2. Obtain Image Features.

Figure 3. Process Image Features and Calculate Spatio-Temporal Features.

VII. EXPERIMENTATION
A. Graphical User Interface (GUI)
To assist with experimentation, a test Graphical User
Interface (GUI) was created in MATLAB. This GUI allows
users to load a picture or video. Once a file is loaded, the GUI
processes it with candidate algorithms. A simple video
player allows the user to navigate between frames. The
user has the option to overlay data onto the image displayed in
the video player. Figure 6 shows the GUI in use.
When the user opens the GUI, the user selects a folder
using the Browse functionality. Once a folder is selected, a
navigation list is populated with all the videos and pictures
within the folder; this makes it easy to quickly change
between files. Figure 4 shows the navigation list.

Figure 4. Test GUI Navigation List

Figure 5. Test GUI Video Player
Once the user selects a file and presses Load, the program
will read in the file and do all possible processing. This
preprocessing helps with analysis of algorithms as some of the
feature detectors are slower than real-time. The user is alerted
of the progress of loading and processing the file.
Once the file is loaded and image features are detected, the
user may navigate between frames with a video player
interface. Figure 5 shows this video player. Users can skip to a
frame, go next or previous, and play forward or backward.
The GUI allows the user to select what type of data is
overlaid on the image displayed. This includes the image features
detected using different algorithms (i.e., SURF, FAST, color-based
blobs, etc.) as well as clustering output. Figure 6 shows
an image with overlaid features.

Figure 6. Test GUI in use. Features are overlaid onto an image within a video
sequence. Sobel edges can be seen as magenta, FAST corners can be seen as
green/red, and SURF features can be seen as blue.
Figure 7. marple2 frame 24. Spatio-temporal clusters are shown as colored
dots.

VIII. EVALUATION
A. Berkeley Motion Segmentation Benchmark
A benchmark suite [5] for training and testing video
segmentation is introduced in [4]. This benchmark suite
contains 100 videos with ground-truth labeling and error
metrics. The ground-truth labeled/annotated videos were
annotated by four separate individuals indicating how the video
sequences should be segmented. This provides a reliable
benchmark suite to compare algorithms ([1] - [4] all use this
dataset for performance evaluation). The benchmark even
provides software for evaluating algorithms and generating
error metrics.
Unfortunately, this benchmark suite was not used for
evaluating the presented algorithm. It appears that the
benchmark suite only annotates a few frames out of each video
sequence. The algorithm presented focuses on continuous
video data that does not skip several frames between samples,
so testing against the benchmark suite [5] was not
applicable to this algorithm at the time of writing this
paper. Further research will allow the algorithm to be robust
enough to compete against the benchmark tests.
B. Results
Although there was not a rigorous truthing analysis to
determine error metrics, results could still be seen using the
GUI. For certain translational scenes, like that of Figure 7,
semi-accurate clustering was shown. It can be seen in this
image that the sleeve of the arm grabbing for the phone was
clustered together, separate from the clusters of the chair and
of the picture on the wall.

Sample video was taken with a mobile camera to test


specific cases. In Figure 8, a handful of stationary objects are
on the floor. The camera moves around these objects to gain
observability. Since the briefcase is closer than the textbook,
the briefcase has a faster relative rate as it gets closer to
the camera. Due to the lack of robustness, the presented algorithm
had difficulty distinguishing slow motion. This slow motion
caused little difference in movement and multiple objects were
clustered together. However, it can be seen that clustering is
starting to occur with this simplified algorithm. Adding a
higher order will show promise for improved performance.

Figure 8. 20150322_182607 frame 90. Initial clustering can be seen; however,
it appears that over-segmentation is occurring for slow motion.

IX. APPLICATIONS
Video segmentation has many applications. As mentioned
in Section II, machine learning algorithms can improve given
accurate segmentation. If range estimates were known of
objects within video, which could be achieved with binocular
sensors, machine learning algorithms can make better
predictions about size and shape of unknown objects. This can

open new doors for object interaction (i.e. pushing, grabbing,
accurate avoiding, etc.) for robots using unsupervised
learning.
Anomaly detection can also be realized by looking for
behavior that is uncommon to the surrounding behavior. This
could be large objects passing through or a change/lack of
flow. This could be looking for anomalies on a manufacturing
line or traffic in a street. Anomaly detection can help
businesses, first responders, and other entities detect issues
early on and react quickly.
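As a sketch of this idea, trajectories whose rate deviates strongly from the scene's typical motion could be flagged with a robust z-score; the threshold and the MAD scaling constant below are illustrative assumptions, not part of the presented algorithm.

```python
import numpy as np

def flag_anomalies(rates, z=3.0):
    """Mark trajectories whose rate deviates from the scene's median rate
    by more than z robust standard deviations, estimated via the median
    absolute deviation (MAD) so a few outliers do not skew the baseline."""
    rates = np.asarray(rates, dtype=float)
    med = np.median(rates)
    mad = np.median(np.abs(rates - med)) or 1e-9  # guard against zero MAD
    # 1.4826 scales MAD to a standard-deviation estimate for Gaussian data.
    return np.abs(rates - med) / (1.4826 * mad) > z
```

The same scheme could use headings instead of rates to flag motion against the prevailing flow.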
X. LIMITATIONS
Due to time constraints, only limited research could be
conducted. The current algorithm focuses on a simple 2-D
dynamics model whose performance lacks for rotating and
scaling objects. Since nearly all video exhibits some form of
these higher order motion dynamics and is almost never purely
translational, the clustering algorithm did not accurately
portray objects.
Further research is necessary to incorporate higher order
motion dynamics as well as more advanced clustering
techniques such as spectral clustering. The framework
presented in this paper is a good foundation for further
development and will ease future research.
XI. CONCLUSIONS AND FUTURE WORK
There is plenty of research in the area of video
segmentation. My initial research when the proposal was
written showed minimal amounts of prior work. It wasn't until
rigorous research was conducted, and I found the specific
keywords for the field, that much relevant work appeared. I
was able to conduct sufficient research to gather theories about
a path forward for future investigation; however, there was
limited time to develop these theories.
[3] presents a strong algorithm for higher order motion
dynamics. As a base, this algorithm shows promise, but the
computational complexity of O(n^k) is too large. Further
research will be conducted to reduce this computational
complexity and see how higher order motion dynamics can be
used efficiently.
The framework and GUI presented in this research paper
provide a core development environment for further
investigation in video segmentation. Further development will
add capability for quick benchmark testing using the Berkeley
Motion Segmentation Benchmark presented in [4].

REFERENCES
[1] F. Galasso, M. Iwasaki, K. Nobori, and R. Cipolla, "Spatio-temporal clustering of probabilistic region trajectories," in IEEE ICCV, 2011.
[2] G. Zhang, Z. Yuan, D. Chen, Y. Liu, and N. Zheng, "Video object segmentation by clustering region trajectories," in ICPR, 2012.
[3] P. Ochs and T. Brox, "Higher order motion models and spectral clustering," in CVPR, 2012.
[4] T. Brox and J. Malik, "Object segmentation by long-term analysis of point trajectories," in European Conference on Computer Vision (ECCV), 2010.
[5] F. Galasso, N. Nagaraja, T. Cardenas, T. Brox, and B. Schiele, "A unified video segmentation benchmark: Annotation, metrics and analysis," in ICCV, 2013.
[6] D. Kroon, OpenSURF MATLAB Detector, University of Twente, 2010. Available: http://www.mathworks.com/matlabcentral/fileexchange/28300-opensurf--including-image-warp
[7] E. Rosten, FAST Corner Detector, 2013. Available: http://www.edwardrosten.com/work/fast.html
