
Hindawi Publishing Corporation

International Journal of Distributed Sensor Networks


Volume 2014, Article ID 394942, 10 pages
http://dx.doi.org/10.1155/2014/394942

Research Article
Self-Supervised Sensor Learning and Its Application:
Building 3D Semantic Maps Using Terrain Classification
Chuho Yi,1 Donghui Song,2 and Jungwon Cho3,4

1 Future IT R&D Lab, LG Electronics, Seoul 137-724, Republic of Korea
2 Department of HCI and Robotics, University of Science and Technology, Daejeon 305-350, Republic of Korea
3 Department of Computer Education, Jeju National University, Jeju 690-756, Republic of Korea
4 Department of Mathematics and Computer Science, Salisbury University, Salisbury, MD 21801, USA

Correspondence should be addressed to Jungwon Cho; jwcho@jejunu.ac.kr


Received 2 January 2014; Accepted 28 February 2014; Published 7 April 2014
Academic Editor: Tai-hoon Kim
Copyright © 2014 Chuho Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
An autonomous robot in an outdoor environment needs to recognize the surrounding environment to move to a desired location safely; that is, a map is needed to classify/perceive the terrain. This paper proposes a method that enables a robot to classify terrain in various outdoor environments using terrain information that it recognizes without the assistance of a user; it then creates a three-dimensional (3D) semantic map. The proposed self-supervised learning system stores data on the appearance of the ground, using image features extracted by observing the movement of humans and vehicles while the robot is stopped. It learns about the surrounding environment using a support vector machine trained on the stored data, which is divided into terrain where people or vehicles have moved and other regions. This makes it possible to learn which terrain an object can travel on using self-supervised learning and image-processing methods. The robot can then recognize the current environment and simultaneously build a 3D map using the RGB-D iterative closest point algorithm with an RGB-D sensor (Kinect). To complete the 3D semantic map, it adds semantic terrain information to the map.

1. Introduction
As seen in the Defense Advanced Research Projects Agency
(DARPA) Grand Challenge, USA, robotics has progressed
markedly from industrial robots that perform only given
tasks to autonomous mobile robots that determine how to
travel to a target. For robots to operate more intelligently,
research on mobile robots needs to consider the following:
(1) how to recognize certain objects, humans, or specific patterns; (2) simultaneous localization and mapping (SLAM)
[1]; and (3) navigation to a specific destination. When an
autonomous robot needs to reach a destination, the most
important basic process before moving is for it to assess the
safety of the surrounding environment and find a safe path
for movement [2]. This process could incorporate global
positioning system (GPS) and mapping technology, but a
GPS system is not accurate enough to identify an exact position, and maps do not include all objects, especially moving
objects. Therefore, a moving robot must be able to recognize
the surrounding environment. One recognition method is

terrain classification, in which a robot uses sensor responses


to recognize the surrounding environment and determine
possible safe pathways. This paper presents a method by which a robot equipped with a vision sensor can stop and classify terrain using a self-supervised learning system, identifying roads and sidewalks from passing vehicles and humans. Then
it introduces the method used to create a three-dimensional
(3D) map using a red/green/blue-depth (RGB-D) sensor
(Kinect) and to add terrain information to create a 3D semantic map.
Figure 1(a) shows the environment of interest for this
paper. It is an urban environment with a road for vehicles and
a sidewalk for people. Figure 1(b) is the expected 3D semantic
map after adding semantic information to the dense map.
The remainder of this paper is organized as follows. We
summarize related work in Section 2 and introduce the
terrain-classification method and learning method for distinguishing between road and sidewalk through observations
using a camera sensor in Section 3. Section 4 introduces


(a)

(b)

Figure 1: (a) Example of an urban environment with a road, sidewalk, and buildings. (b) Example of a 3D semantic map (blue region: road;
green region: sidewalk; red region: obstacles).

the RGB-D iterative closest point (ICP) method, which uses an


RGB-D sensor to make a 3D dense map and to develop a 3D
semantic map by integrating the results of Section 3. Experimental results showing the effectiveness of the proposed
method are presented in Section 5 and the conclusions and
future research are presented in Section 6.

2. Related Works
2.1. Terrain Classification. In recent years, several studies have proposed methods for terrain classification. One
study developed a method that involved searching for possible obstacles using a stereo camera, eliminating candidates
based on texture and color clues, and then modeling terrain
after obstacles had been defined [3]. Another study focused
on avoiding trees in a forest; it used a stereo camera to
recognize trees and classify terrain to find a safe pathway
[4]. Other studies used a vibrating sensor to classify terrain
that had already been traveled, based on various vibration
frequencies [5].
However, these techniques only work in specific environments. Robots need to be able to learn about unknown
terrain. Some studies have focused on supervised learning
that requires human intervention when a robot reaches an
unknown area. Due to the limitations of supervised learning,
many researchers are now working on self-supervised or
unsupervised techniques, in which a robot can learn about
an environment on its own, without any human supervision.
One recent study developed a technique in which a robot
can calculate the depth of a ground plane using a depth map
generated by a stereo camera and can classify and learn about
the ground and obstacles within 12 m. Based on these data,
it can recognize very distant regions, as far as 30–40 m [6].
Another study developed an unsupervised learning method
that deletes incorrect detections about a wide variety of terrain types (e.g., trees, rocks, tall grass, bushes, and logs) while
the robot navigates and collects data [7]. Yet another study
involved self-supervised classification using two classifiers: an
offline classifier that used vibration frequencies to provide the
other classifier, an online and visual classifier, with labels for

various observed terrains. This allowed the visual classifier to


learn about, and recognize, new environments [8].
However, some of these methods require more than one
sensor; some use stereo cameras or vibrating sensors with
monocular cameras, and most assume either that the robot
is facing a flat plane through which it can navigate or that the
robot will learn about the terrain after it navigates through it.
2.2. 3D Map Building. The construction of 3D maps using
various sensors has been studied, including range scanners
[9], stereo cameras [10], and single cameras. The biggest problem in constructing a 3D map is the alignment of captured
images. To process 3D laser data, the ICP algorithm is widely
used [9]. This algorithm repeatedly finds the rigid transformation that minimizes the distance between the points of one frame and the closest points of the other frame. A passive stereo image system can extract depth data for features from the paired images. The feature points may then be combined via an optimization similar to the iterative process of ICP. Then, additional
algorithms such as the random sample consensus (RANSAC)
algorithm can be used to solve the problem of consistency
[10].
Recent research on creating 3D maps has obtained depth
information for each pixel using a sensor that simultaneously
extracts a color depth image and a general video, such as by
combining Kinect with a time-of-flight (TOF) camera [11].
Kim et al. [12] built a 3D map using a fixed TOF camera that
had a frame unrelated to the order of time. In contrast, Henry
et al. [11] proposed a 3D map-building method that used a
freely moving RGB-D sensor and information on time, shape,
and appearance simultaneously.

3. Self-Supervised Terrain Classification


The proposed method is based on a self-supervised framework. The robot observes moving objects and determines
their movement along roads and sidewalks. From these data, it extracts image patches of the terrain and classifies them into one of three classes (road, sidewalk, or background).
This framework consists of three parts: detection and tracking


(a)

(b)

Figure 2: Detection, classification, and tracking of moving objects ((a) the result of background subtraction, (b) the result of object classification (blue square: human, green: vehicle) and tracking based on size and location).

of moving objects, recognition of paths taken by moving


objects, and learning the terrain patches that were extracted
from the paths taken by moving objects and classifying the
environment.
The proposed method has two classifiers. One is for classifying moving objects; this is offline and supervised learning.
The other is for classifying terrain; this is self-supervised
learning, and it also learns image patches, based on labels
generated by the object classifier. Both classifiers are Support
Vector Machines (SVM) as proposed by Vapnik [13] and both
use only visual features.
3.1. Detection and Tracking of Moving Objects. For this study,
we assumed that only humans and vehicles move in the
outdoor environment. Thus, moving objects are defined in
two classes: human and vehicle. Background subtraction is
used to detect moving objects; this involves a mixture of
adaptive Gaussians [14]. The system tracks objects based on
their size and location. Figure 2 shows the results of detection,
classification, and tracking of moving objects.
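For illustration, a minimal Python/OpenCV sketch of this detection step is given below. It uses OpenCV's MOG2 background subtractor as a stand-in for the adaptive Gaussian mixture of [14]; the video name, blob-size threshold, and morphology settings are assumptions, not the paper's values.

import cv2

cap = cv2.VideoCapture("street.avi")          # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)            # foreground mask of moving pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    detections = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h > 400:                        # ignore small blobs (noise)
            detections.append((x, y, w, h))    # size/location feed the tracker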
3.2. Object Classification. To identify whether a detected object is a human or a vehicle, an SVM is used as the object classifier. Edge data from the object's region are used as the feature vector. This classifier also provides the label information indicating which class of terrain is involved.
(1) Classifier. The first SVM in this system classifies objects as either human or vehicle. This binary classification can be expressed as (1), where the classification function is f: R^n → {−1, 1}, x_i is a training vector, and y_i ∈ {−1, 1} is its class label. Here, x_i is an object's edge feature vector; a positive output classifies the object as human and a negative output classifies it as vehicle:

f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b,   (1)

where l is the total number of training data and K(·,·) is a kernel function; we used the radial basis function (RBF) kernel, K(x_i, x) = \exp(−\|x_i − x\|^2 / 2\sigma^2). The coefficients \alpha_i and the bias b are weights that reduce the number of wrong classifications by maximizing the margin between the separating hyperplane and the support vectors.

Figure 3: Path extraction for moving objects; the lines show their paths (blue: human, green: vehicle).

These weights are calculated by converting the problem into the optimization problem

\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

subject to \sum_{i=1}^{l} \alpha_i y_i = 0,  0 \le \alpha_i \le C,   (2)

where C is a constant that penalizes incorrect classification and b is calculated from the \alpha_i and the support vectors using (1) [15, 16].
This SVM is implemented using LIBSVM [17, 18].
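A minimal sketch of this binary classifier is shown below using scikit-learn's SVC, which wraps LIBSVM; the file names and parameter values are illustrative, with gamma playing the role of 1/(2σ²) in the RBF kernel of (1) and C bounding the α_i as in (2).

import numpy as np
from sklearn.svm import SVC

X = np.load("edge_features.npy")   # hypothetical (n_samples, 5) edge-histogram vectors
y = np.load("labels.npy")          # +1 = human, -1 = vehicle

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # gamma ~ 1/(2*sigma^2); C as in (2)
clf.fit(X, y)
print(clf.predict(X[:5]))                    # -> array of +1 / -1 labels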
(2) Features. The feature vector for the object classifier is a global edge histogram of the object's region [19, 20]; it is shown in (3) and consists of the responses for four edge orientations plus a non-edge component:

f_R = [e_v, e_h, e_{45}, e_{135}, e_{non}],   (3)

where f_R is the feature vector of edge responses of an object, R is the region of the object, e_v expresses the response for the vertical orientation of the object's region, e_h is for the horizontal orientation, e_{45} is for 45°, e_{135} is for 135°, and e_{non} is the response for non-edges. These feature values are normalized to the range 0 to 1.
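A sketch of such a five-bin edge histogram is given below; it uses MPEG-7-style 2 × 2 edge filters in the spirit of [19], but the exact filter coefficients, block size, and edge-strength threshold are assumptions rather than the paper's settings.

import numpy as np

SQRT2 = np.sqrt(2.0)
FILTERS = {
    "vertical":   np.array([[1, -1], [1, -1]], float),
    "horizontal": np.array([[1, 1], [-1, -1]], float),
    "diag45":     np.array([[SQRT2, 0], [0, -SQRT2]]),
    "diag135":    np.array([[0, SQRT2], [-SQRT2, 0]]),
    "nondir":     np.array([[2, -2], [-2, 2]], float),
}

def edge_histogram(gray, threshold=11.0):
    """Return [e_v, e_h, e_45, e_135, e_non] for a grayscale object region."""
    h, w = gray.shape
    counts = np.zeros(5)
    for by in range(0, h - 1, 2):               # non-overlapping 2x2 blocks
        for bx in range(0, w - 1, 2):
            block = gray[by:by + 2, bx:bx + 2].astype(float)
            strengths = [abs((block * f).sum()) for f in FILTERS.values()]
            k = int(np.argmax(strengths))
            # weak blocks are counted as "non-edge" (index 4)
            counts[k if strengths[k] >= threshold else 4] += 1
    return counts / max(counts.sum(), 1.0)       # normalized to [0, 1]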
3.3. Data Association and Path Extraction. To extract human
and vehicle movement paths (see Figure 3), the system saves
data about objects at every frame during the classification


(1) Observe an object D_j that is moving (D_j is the object detected at this frame).
(2) Confirm whether the presently detected object D_j was seen before by comparing it with
every object O_i that was already detected, based on the size and position data of the objects O_i.
(2a) If the object was seen before (e.g., V_j and (x, y)_{j,1} are similar to V_i and (x, y)_{i,n_i}),
update the data of O_i: P_i ← P_i ∪ {(x, y)_{j,1}}, V_i = {w_j, h_j}, n_i = n_i + 1, and c_i = c_i + c_j.
(2b) If the object is seen for the first time, create new object data O_{N+1} equal to D_j.
Thus, the total number of detected objects increases to N + 1.
(3) Check whether N is a multiple of m.
(3a) If it is, do path extraction.
(3b) If it is not, keep observing until N becomes the next multiple of m.
Algorithm 1: Data collection and association.
N = the total number of detected objects
γ = the threshold for defining a moving object
Sidewalk map: a map that includes the total paths of human movement
Road map: a map that includes the total paths of vehicle movement
Input: all detected objects O_i (i = 1, 2, . . . , N)
Goal: find all paths on which humans and vehicles move (Sidewalk: S, Road: R)
(0) Initialize S, R, Sidewalk map, and Road map
(1) for i = 1 to N do
(2)   if n_i ≥ γ then
(3)     O_i is a moving object
(4)     for k = 1 to n_i do
(5)       (x_p, y_p)_{i,k} = (x_{i,k}, y_{i,k} − h_i/2)
(6)     end for
(7)     if c_i > 0 then
(8)       for k = 1 to (n_i − 1) do
(9)         Draw a line from (x_p, y_p)_{i,k} to (x_p, y_p)_{i,k+1} into the Sidewalk map
(10)      end for
(11)    else
(12)      for k = 1 to (n_i − 1) do
(13)        Draw a line from (x_p, y_p)_{i,k} to (x_p, y_p)_{i,k+1} into the Road map
(14)      end for
(15)  else
(16)    O_i is noise
(17) end for
(18) for all pixels (x, y) of the Sidewalk and Road maps do
(19)   if Sidewalk(x, y) != 0 then S ← S ∪ {(x, y)}
(20)   if Road(x, y) != 0 then R ← R ∪ {(x, y)}
(21) end for
Algorithm 2: Path extraction.

and tracking of moving objects, where D_j denotes the jth appearance of a moving object during the observation. These data include the ith object's locations P_i in image coordinates and its volume V_i (which consists of the object's width w_i and height h_i), the number of times n_i the ith object has been tracked, and the accumulated label value c_i generated by classification. These data are shown as follows:

O_i (⊃ {P_i, V_i}),
P_i = {(x, y)_{i,1}, (x, y)_{i,2}, . . . , (x, y)_{i,n_i}},
V_i = {w_i, h_i},   (4)
(i = 1, 2, . . . , N).

Here, N is the total number of objects that are detected as moving objects; our system defines the bottom-left of the image as zero in the image coordinates. Algorithm 1 shows the entire process of data collection and association for a moving object.
In Algorithm 1 (2a), the location data P_i grow by one element, with (x, y)_{j,1} appended as (x, y)_{i,n_i+1}; the object's volume is replaced by {w_j, h_j}; the number of iterative trackings n_i is increased by one; and the label value c_i, which indicates the class of the object, is accumulated with the classification result c_j of the newly detected object D_j.
We can use the data about moving objects to identify movement paths, in which the sidewalk path S and the road path R are sets of image coordinates. Algorithm 1 (3) presents how and when the data are sufficient for the system to extract a path. Here, m = 5, which means that path extraction proceeds each time the number of observed objects reaches another multiple of m (= 5). Algorithm 2 shows the process of path extraction in pseudocode.


In Algorithm 2, lines (2), (3), (15), and (16) show how observed objects are defined as useful or not, based on the assumption that a useful object is observed over a certain number of iterative frames (γ = 30), measured by its tracking count n_i. Line (5) shows that the y-coordinate of the bottom of the object becomes the path's y-coordinate, and the path's x-coordinate is the same as the object's x-coordinate. Lines (7) and (11) determine whether an object observed over a sufficient number of frames is a human or a vehicle based on its accumulated label value c_i: when c_i is positive, the object is a human; when c_i is negative, the object is a vehicle. Lines (18)–(21) show how the system obtains the path data S and R from the Sidewalk map and the Road map (cf. Sidewalk(x, y) refers to the intensity of the Sidewalk map at image coordinates (x, y); it is nonzero when a point of a path line appears at these coordinates).
Additionally, we use maps, which are one-channel image spaces, to save the objects' movements and to conduct random sampling of patches based on the path data generated from the maps. If we used raw position data instead of path data, patch sampling would depend heavily on the objects' locations rather than being random, because objects appear at similar image locations when they are located beyond the camera's focal length. We solve this problem by drawing lines onto the maps and using these to generate the path data. The number of elements of each path data set should be more than 0.4ℓ, where ℓ is the length of the image diagonal. If it is not, the system returns to the observation step.
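As an illustration of this path-map bookkeeping, the following Python/OpenCV sketch draws the bottom points of a track into a one-channel map and reads the path pixels back out. The image size and helper names are illustrative; note that with OpenCV's top-left origin the bottom of an object is y + h/2, whereas Algorithm 2 (bottom-left origin) uses y − h/2.

import numpy as np
import cv2

H, W = 360, 640
sidewalk_map = np.zeros((H, W), np.uint8)
road_map = np.zeros((H, W), np.uint8)

def add_path(track_points, label_sum, obj_height):
    """track_points: [(x, y), ...] object centers; label_sum: accumulated label c_i."""
    target = sidewalk_map if label_sum > 0 else road_map   # human -> sidewalk, vehicle -> road
    pts = [(int(x), int(y + obj_height / 2)) for x, y in track_points]  # object bottom
    for p, q in zip(pts[:-1], pts[1:]):
        cv2.line(target, p, q, 255, 1)                     # draw the path segment

def path_pixels(path_map):
    ys, xs = np.nonzero(path_map)
    return list(zip(xs, ys))                               # path data set S or R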
3.4. Terrain Data Extraction. Terrain data for sidewalks and roads are randomly extracted from each path S and R. Then, to extract non-pathway regions (in this study, these regions are defined as background), the means μ_{T,d} and standard deviations σ_{T,d} of the feature values (see (8)) of the extracted sidewalk and road data are calculated, and candidate background patches are extracted by global random sampling. Here, T is an index of terrains, expressing either the sidewalk (S) or the road (R), and d is the index of the dth feature; d ranges from 1 to D, where D is the dimension of the feature vector (i.e., 11). The background candidates are clustered using the mean-shift algorithm [21], and mod_{k,d} denotes the dth feature's mode of the kth cluster.
The kth cluster of background candidates is selected as background data, that is, BG_k = 1, when the gap Gap_{T,k} between the mode of the cluster and the mean of the extracted sidewalk and road data is larger than a distance Dist_T, which is calculated using the standard deviations σ_{T,d} (see Figure 4):

Gap_{T,k} = \sum_{d=1}^{D} |mod_{k,d} − μ_{T,d}|,   (5)

Dist_T = w \sum_{d=1}^{D} λ σ_{T,d},   (6)

BG_k = 1 if Gap_{S,k} > Dist_S and Gap_{R,k} > Dist_R, and BG_k = 0 otherwise.   (7)

Figure 4: Example of 2D feature clustering and comparison of the clusters with the sidewalk and road data (the figure labels Gap_{S,1}, Gap_{R,1}, Dist_S, and Dist_R refer to cluster number 1). Deep blue: distribution of the sidewalk data; deep red: distribution of the road data; small dots: candidates of the background; the same color represents the same cluster.

In this study, we assumed that the distributions of the extracted sidewalk and road data were Gaussian. Thus, λ in (6) is the multiple of the standard deviation that specifies the width of the confidence interval (i.e., λ = 1: 68%, 2: 95%, 3: 99.7%) and is set to 2. w is a weight that scales the distance; it is likewise derived from the 95% confidence interval and set to 2.
Figure 4 shows an example of 2D feature clustering and the notation.
The size of the terrain data, that is, the patch size, is set to 12 × 12; the numbers of extracted patches are set to 200 for the sidewalk and road classes and 400 for the background class, based on experimental results on classification performance with various numbers of training data.
Figure 5 shows experiments on clustering the background candidates, in which clusters were selected as background or not.
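A minimal Python sketch of this background-selection test, under the stated Gaussian assumption, could look as follows; scikit-learn's MeanShift is used for the clustering of [21], and the variable names mirror (5)–(7). The helper name and return format are our own.

import numpy as np
from sklearn.cluster import MeanShift

def select_background(candidates, sidewalk_feats, road_feats, lam=2.0, w=2.0):
    """candidates, *_feats: arrays of shape (n, D) with D = 11 patch features."""
    mu_S, sig_S = sidewalk_feats.mean(0), sidewalk_feats.std(0)
    mu_R, sig_R = road_feats.mean(0), road_feats.std(0)
    dist_S = w * np.sum(lam * sig_S)          # Dist_S in (6)
    dist_R = w * np.sum(lam * sig_R)          # Dist_R in (6)

    ms = MeanShift().fit(candidates)
    background = []
    for k in range(ms.cluster_centers_.shape[0]):
        mode = ms.cluster_centers_[k]         # per-feature mode of cluster k
        gap_S = np.abs(mode - mu_S).sum()     # Gap_{S,k} in (5)
        gap_R = np.abs(mode - mu_R).sum()     # Gap_{R,k} in (5)
        if gap_S > dist_S and gap_R > dist_R: # BG_k = 1 in (7)
            background.append(candidates[ms.labels_ == k])
    return np.vstack(background) if background else np.empty((0, candidates.shape[1]))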
3.5. Terrain Classification. The terrain classifier learns the terrain data generated from the observed paths, represented as feature vectors (see (8)), and determines where a robot can move within the surrounding environment; this constitutes a self-supervised learning method. When the classes have different numbers of training data, we weigh each terrain class according to the number of data in its own class relative to the total number of data in all classes. The use of features and classifiers to assign class labels to image data has been popular in recent years [22].
(1) Classifier. We defined terrain using three classes (sidewalk,
road, and background) and used a multiclass SVM for terrain
classification.
SVM systems were initially designed for binary classification, but two common methods allow them to handle more than two classes: one combines numerous binary SVMs, and the other treats the task as a single optimization problem that considers all data simultaneously.
We selected the former approach, specifically the one-against-one method. It constructs k(k − 1)/2 binary classifiers for k classes and then uses Max Wins, a voting scheme, to predict the final class as the one with the most votes [23].
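A minimal sketch of such a classifier with scikit-learn is given below; SVC already implements the one-against-one scheme with majority voting for multiclass problems, and the class weights shown (inverse to the 200/200/400 patch counts) are only one plausible way to realize the weighting mentioned above.

from sklearn.svm import SVC

# class labels are assumed to be the strings "sidewalk", "road", "background"
terrain_clf = SVC(kernel="rbf", decision_function_shape="ovo",
                  class_weight={"sidewalk": 2.0, "road": 2.0, "background": 1.0})
# terrain_clf.fit(patch_features, patch_labels)   # hypothetical training arrays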
(2) Features. We used two visual features, color and texture,
for the terrain classifier. We used RGB and Lab color spaces
for color features; these have been proven as good features


(a)

(b)

Figure 5: (a) The result of clustering (small dots: candidates of the background; same color represents same cluster), (b) the result of terrain
data extraction (red dots: background, blue dots: sidewalk, green dots: road, the others: excluded clusters of candidates from background
class).

(a)

(b)

Figure 6: (a) A sidewalk with complex texture, such as fallen leaves. (b) The road plane estimated using a V-disparity map.

for scene classification [24], and we used a global edge histogram for the texture features. Equation (8) presents the visual feature vector used for the terrain classifier:

f_P = [μ_R, μ_G, μ_B, μ_L, μ_a, μ_b, e_v, e_h, e_{45}, e_{135}, e_{non}],   (8)

where f_P is the feature vector of an image patch P. μ_R, μ_G, μ_B, μ_L, μ_a, and μ_b are the mean values of each channel of the RGB and Lab color spaces. e_v expresses the edge response for the vertical orientation of the patch, e_h is for the horizontal orientation, e_{45} is for 45°, e_{135} is for 135°, and e_{non} is the response for non-edges. These feature values are normalized to the range 0 to 1.
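For reference, the following Python sketch assembles this 11-dimensional patch feature; the Lab conversion uses OpenCV's 8-bit scaling, and edge_histogram refers to the hypothetical helper sketched after Section 3.2.

import cv2
import numpy as np

def patch_feature(patch_bgr):
    """patch_bgr: 12x12 BGR image patch; returns an 11-D vector scaled to [0, 1]."""
    rgb = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2RGB)
    lab = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2LAB)   # 8-bit Lab, range 0-255
    color = np.concatenate([rgb.reshape(-1, 3).mean(0),
                            lab.reshape(-1, 3).mean(0)]) / 255.0
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    edges = edge_histogram(gray)                       # hypothetical helper from Section 3.2
    return np.concatenate([color, edges])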

4. 3D Semantic Map Building


This paper constructs a 3D semantic map that shows regions
where people or vehicles can move. To build the 3D map, it
is necessary to determine the position of the sensor at every
time point. It is possible to estimate the location with the
sensor or to do so more precisely using multiple sensors.
Normal position estimation can use the motion model of the
robot and it is also possible to make use of GPS or an image
process. We build the 3D map using the position of the robot
estimated by matching the point cloud and features from the
image and odometry data.
4.1. Ground Plane Estimation with Vertical Disparity Map. Even when learning the environment from the trajectories of passing vehicles and humans, some terrain can still be classified incorrectly because its texture and color differ completely from the training data. For example, as shown in Figure 6(a), a sidewalk might be recognized as an obstacle because it is covered with leaves of different colors and textures. However, we can assume that a sidewalk or road lies in the same plane once the ground plane has been obtained. We therefore resolve this problem by assigning such regions to the class that forms the principal part of the plane.
To estimate the plane, V-disparity is widely used to detect obstacles and the ground plane [25]. The horizontal axis of the V-disparity map indicates the disparity (depth), and the vertical axis corresponds to the image rows; each row of the map is built by accumulating the pixels of that image row that share the same disparity. A plane in the 3D world appears as a line in the V-disparity map, so extracting a strong line from the V-disparity map yields the ground plane. Extraction of the ground plane is shown in Figure 6(b). The class of the plane is then determined with the Max Wins voting method, using the terrain-class votes of the regions belonging to the plane. A value rising above the plane in the column direction of the V-disparity map is treated as an obstacle [26].
4.2. Map Building Using RGB-D Iterative Closest Point. The ICP algorithm searches for the rigid transformation that minimizes the distance between the source point cloud, P_s, and the target point cloud, P_t. The process repeats the optimization


(0) F_s = Extract_RGB_Point_Features(P_s)
(1) F_t = Extract_RGB_Point_Features(P_t)
(2) (t*, A_f) = Perform_RANSAC_Alignment(F_s, F_t)
(3) repeat
(4)   A_d = Compute_Closest_Points(t*, P_s, P_t)
(5)   t* = argmin_t { α (1/|A_f|) Σ_{i∈A_f} ||t(f_s^i) − f_t^i||^2 + (1 − α) (1/|A_d|) Σ_{j∈A_d} ||t(p_s^j) − p_t^j||^2 }
(6) until (Error_Change(t*) ≤ θ) or (maxIter reached)
(7) return t*
Algorithm 3: RGBD-ICP algorithm.

Figure 7: An example of a 3D map using the RGB-D ICP algorithm.


and alignment of the cloud data until convergence. The ICP algorithm works well when the two point clouds are already partially aligned. Conversely, when the data association between the two point clouds is inaccurate, the optimization can converge to a local minimum, that is, an incorrect alignment [11]. In contrast, visual alignment makes use of feature-point matching. The main advantage of using visual features is that no initialization is required and mismatches can be rejected with the RANSAC algorithm. However, because of scale inaccuracy, two-dimensional (2D) matching does not guarantee an optimal solution.
In this paper, we apply the RGB-D ICP algorithm, which has the advantage of matching RGB-D data in both of these ways. The algorithm is described in Algorithm 3. It takes the source P_s and target P_t from the RGB-D sensor and obtains a rigid transformation t* of the camera. Lines (0) and (1) in Algorithm 3 describe the extraction of visual features, and line (2) the association between F_s and F_t; we use Harris corner points as the visual features. To calculate the optimal rigid transformation between the two feature point sets, the RANSAC algorithm is applied in line (2): Perform_RANSAC_Alignment repeatedly selects three matched feature pairs, estimates the transformation they induce, and counts the inliers among the remaining feature points. The transformation supported by the most inlier pairs is kept, and the association A_f of matched feature points is determined by this transformation.
Lines (3) to (6) are the main loop of the ICP algorithm. The association A_d between the point clouds is calculated at line (4); this step transforms the source point cloud using the current transformation t*, which is first initialized with the visual RANSAC transformation. Consequently, the point clouds can be matched without knowing their relative orientation in advance. Line (5) minimizes the alignment error of the point-cloud association and of the visual feature matching between the clouds. The first term of the error function represents the average distance over the associated visual features, and the second term represents the error distance over the associated cloud points; the two errors are combined with a weight α. The process ends when the error change is sufficiently small or the predetermined maximum number of iterations is reached. When RANSAC does not find sufficient inliers, the ICP transformation is determined using the robot's odometry data.
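As a concrete illustration of the closest-point association and rigid-transformation steps in this loop, the following Python sketch implements a plain point-to-point ICP iteration. It is a simplified stand-in that omits the visual-feature term weighted by α and the point-to-plane error of Algorithm 3, and the function names are our own.

import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares R, t aligning src to dst (Kabsch/SVD)."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(source, target, init_R=np.eye(3), init_t=np.zeros(3), iters=30, tol=1e-6):
    tree, R, t, prev_err = cKDTree(target), init_R, init_t, np.inf
    for _ in range(iters):
        moved = source @ R.T + t               # apply the current transformation
        dist, idx = tree.query(moved)          # closest-point association (line (4))
        R, t = best_rigid_transform(source, target[idx])   # error minimization (line (5))
        err = np.mean(dist ** 2)
        if abs(prev_err - err) < tol:          # error change below threshold (line (6))
            break
        prev_err = err
    return R, t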
Figure 7 shows a 3D map created using the RGB-D ICP algorithm in an outdoor environment. This map will become a 3D semantic map when the results of the terrain classification are added.

Figure 8: Classification performance (mean true positive, %) with various numbers of training data (each of sidewalk and road, and background) in the whole dataset.

5. Experiments
5.1. Experimental Environment. We conducted experiments at four locations in Seongdong-gu, Seoul, Republic of Korea, and captured test datasets using a camera at a resolution of 640 × 360.
The experiments involved a robot observing moving objects and learning about the surrounding terrain based on the paths of the moving objects; the robot was motionless, under the assumption that it was in an unknown location.
To determine a reasonable number of patches to extract, a test was conducted to classify terrain with various numbers of training data. Figure 8 presents the results.


(a)

(b)

(c)

(d)

(e)

Figure 9: Experimental results. Columns present the results and processes of experiments over the whole dataset. (a) Experimental environments, (b) extraction of training patches based on paths after observing moving objects, (c) ground-truth images for terrain classification (red: background, green: road, blue: sidewalk), (d) supervised terrain classification based on the ground-truth images, (e) classification results of the proposed method.

We evaluated the proposed method by comparing its results with those of supervised learning, which ensured correct labels using ground-truth images. Both methods randomly extracted training data for each terrain class (sidewalk: 200, road: 200, background: 400) based on Figure 9. To train the object classifier, which involves offline learning, about 1968 training samples were obtained from the NICTA [27] and UIUC [28] datasets. Each object class (i.e., humans and vehicles) had the same amount of data; the RBF kernel parameter used for training was set to 23.5 and was generated using a tool provided by LIBSVM [17]. The performance of the object classifier was 94.5408% based on 10-fold cross-validation, which was also provided by the LIBSVM tool.
5.2. Results of Self-Supervised Terrain Classification. In this study, self-supervised classification was compared to supervised classification, and the classification error rates of both methods were computed against ground-truth images. Table 1 lists the terrain classification error rates as mean and standard deviation over the four datasets. Tests were conducted 10 times for patch extraction, training, and classification for each dataset.
Table 1 shows that the error rates of the proposed method were approximately 3–9 percentage points higher than those of supervised

(a)

(b)

Figure 10: (a) 3D map of the university, (b) RGB images of the same
scene.


(a)

(b)

Figure 11: (a) The 3D semantic map. (b) Magnification of part of the 3D semantic map (red: obstacles; blue: road; green: sidewalk).

Table 1: Comparison of self-supervised classification and the supervised learning method.

          Mean error rate (std. dev.)
Dataset   Supervised classification   Self-supervised classification
1         4.58% (0.32%)               8.79% (1.14%)
2         5.23% (0.29%)               14.39% (0.56%)
3         3.30% (0.73%)               6.08% (0.36%)
4         8.00% (0.24%)               17.11% (1.41%)

classification. The biggest difference between the supervised and proposed classification results occurred in dataset 2 (14.39% versus 5.23% mean error). As shown in Figure 9, most misclassifications appeared at the boundaries between roads and sidewalks, because no humans or vehicles passed through these regions and because their color and texture differ from both the sidewalk and the road. The same effect can be seen in datasets 3, 4, and 1, which produced the second-, third-, and fourth-worst classification results. In summary, the performance of the proposed method is close to that of the supervised method, which relies on human intervention, apart from the problem that appears most critically in dataset 2, namely, the boundary regions.
Figure 9 presents, for each dataset, the experimental environments, the extraction of patches for the terrain classes based on the paths of moving objects, the ground-truth images of the terrain, the supervised terrain classification, and the results of the proposed method.
5.3. Results of 3D Map Building with Terrain Classification.
After self-supervised learning by observing objects moving
in the unknown environment, the robot moves around to
build a 3D map with the RGB-D sensor. The result is shown
in Figure 10. Finally, Figure 11 shows a 3D semantic map that
includes the result of the terrain classification. Compared to
Figure 10, the map was created by classifying terrain regions
as road, sidewalk, and obstacles that the robot cannot move
through, such as trees and buildings. However, the proposed
terrain classification has a disadvantage in that the region
between the sidewalk and road is classified as an obstacle that
the robot cannot pass through.

6. Conclusions
We proposed a self-supervised terrain classification framework in which a robot observes moving objects and learns
about its environment based on the paths of the moving
objects captured using a monocular camera. The results
were similar (ca. 3–9% worse) to those of supervised terrain classification methods, which is sufficient for an autonomous robot to recognize its environment. In addition, we built a 3D dense map with an RGB-D sensor using the ICP algorithm.
Finally, the 3D semantic map was created by adding the
results of the terrain classification to the 3D map.
In future research, we will focus on better ways to extract
background data, using a vanishing point or depth data and
a total framework for navigation. We will also conduct tests
in various environments, such as indoors and unstructured
outdoor environments.

Conflict of Interests
The authors declare that there is no conflict of interests
regarding the publication of this paper.

Acknowledgment
This research was supported by the 2014 Scientific Promotion
Program funded by Jeju National University.

References
[1] C. Yi, S. Jeong, and J. Cho, "Map representation for robots," Smart Computing Review, vol. 2, no. 1, pp. 18–27, 2012.
[2] X. Li and B. J. Choi, "Design of obstacle avoidance system for mobile robot using fuzzy logic systems," International Journal of Smart Home, vol. 7, no. 3, pp. 321–328, 2013.
[3] A. Talukder, R. Manduchi, R. Castano et al., "Autonomous terrain characterisation and modelling for dynamic control of unmanned vehicles," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pp. 708–713, October 2002.
[4] A. Huertas, L. Matthies, and A. Rankin, "Stereo-based tree traversability analysis for autonomous off-road navigation," in Proceedings of the 7th IEEE Workshop on Applications of Computer Vision (WACV '05), pp. 210–217, January 2005.
[5] C. A. Brooks and K. Iagnemma, "Vibration-based terrain classification for planetary exploration rovers," IEEE Transactions on Robotics, vol. 21, no. 6, pp. 1185–1191, 2005.
[6] M. J. Procopio, J. Mulligan, and G. Grudic, "Learning terrain segmentation with classifier ensembles for autonomous robot navigation in unstructured environments," Journal of Field Robotics, vol. 26, no. 2, pp. 145–175, 2009.
[7] D. Kim, J. Sun, S. M. Oh, J. M. Rehg, and A. F. Bobick, "Traversability classification using unsupervised on-line visual learning for outdoor robot navigation," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '06), pp. 518–525, May 2006.
[8] A. B. Christopher and K. Iagnemma, "Self-supervised terrain classification for planetary rovers," in Proceedings of the NASA Science Technology Conference, 2007.
[9] S. May, D. Droeschel, D. Holz et al., "Three-dimensional mapping with time-of-flight cameras," Journal of Field Robotics, vol. 26, no. 11-12, pp. 934–965, 2009.
[10] K. Konolige and M. Agrawal, "FrameSLAM: from bundle adjustment to real-time visual mapping," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1066–1077, 2008.
[11] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments," in Proceedings of the International Symposium on Experimental Robotics (ISER '10), 2010.
[12] Y. M. Kim, C. Theobalt, J. Diebel, J. Kosecka, B. Miscusik, and S. Thrun, "Multi-view image and ToF sensor fusion for dense 3D reconstruction," in Proceedings of the IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops '09), pp. 1542–1546, October 2009.
[13] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[14] M. Piccardi, "Background subtraction techniques: a review," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '04), vol. 4, pp. 3099–3104, October 2004.
[15] H. Sidenbladh, "Detecting human motion with support vector machines," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), vol. 2, pp. 188–191, August 2004.
[16] D. Mezghani, S. Boujelbene, and N. Ellouze, "Evaluation of SVM kernels and conventional machine learning algorithms for speaker identification," International Journal of Hybrid Information Technology, vol. 3, no. 3, pp. 23–34, 2010.
[17] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2013, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[18] J.-C. Liu, C.-H. Lin, J.-L. Yu, W.-S. Lai, and C.-H. Ho, "Anomaly detection using LibSVM training tools," International Journal of Security and Its Applications, vol. 2, no. 4, pp. 89–98, 2008.
[19] C. S. Won, D. K. Park, and S.-J. Park, "Efficient use of MPEG-7 edge histogram descriptor," ETRI Journal, vol. 24, no. 1, pp. 23–30, 2002.
[20] I. Sarker and S. Iqbal, "Content-based image retrieval using haar wavelet transform and color moment," Smart Computing Review, vol. 3, no. 3, pp. 155–165, 2013.
[21] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603–619, 2002.
[22] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[23] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
[24] S. Achar, B. Sankaran, S. Nuske, S. Scherer, and S. Singh, "Self-supervised segmentation of river scenes," in Proceedings of the IEEE International Conference on Robotics and Automation, 2011.
[25] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection in stereovision on non-flat road geometry through v-disparity representation," in Proceedings of the IEEE Intelligent Vehicle Symposium, vol. 2, pp. 646–651, June 2002.
[26] Y. Cong, J.-J. Peng, J. Sun, L.-L. Zhu, and Y.-D. Tang, "V-disparity based UGV obstacle detection in rough outdoor terrain," Acta Automatica Sinica, vol. 36, no. 5, pp. 667–673, 2010.
[27] G. Overett, L. Petersson, N. Brewer, L. Andersson, and N. Pettersson, "A new pedestrian dataset for supervised learning," in Proceedings of the IEEE Intelligent Vehicles Symposium (IV '08), pp. 373–378, June 2008.
[28] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1475–1490, 2004.
