
Robot Vision To Recognize Both Object And Rotation For Robot Pick-And-Place Operation


Hsien-I Lin, Yi-Yu Chen, and Yung-Yao Chen

Abstract: This paper presents a robot vision system that recognizes both different target objects and their poses, and that can be incorporated in robot programming by demonstration (robot PbD). In recent years, robot PbD has become a popular topic in the field of robotics. Programming robots is a time-consuming process and requires technical knowledge; however, robot task programming can be simplified by using a vision system to analyze human demonstration. Usually, the design of a user-friendly human-robot interaction interface is crucial to implementing the robot PbD process. We take the pick-and-place operation as an example of robot PbD. Several objects are placed on a table without overlapping, and a person picks them up and places them elsewhere, one after another. Before learning from the human demonstration, if the robot can obtain initial information, e.g., recognize the target object and estimate its pose, it can refine its subsequent decisions, such as selecting the gripper type or the grip angle. Thus, the authors develop the robot vision system as the first step toward realizing such a working robot. The developed vision system has the advantage of interactive trainability with the help of the proposed graphical user interface (GUI), in which interpretable features can be easily selected by ordinary users. We also propose a simple scale-invariant pose estimation method. Experimental results support the feasibility and effectiveness of the system.


I. INTRODUCTION
Over the last few decades, robotics research has expanded its topics from industrial robotics, which releases the human operator from risky tasks, to service robotics, which focuses more on assisting humans and on the interaction between human and robot [1]. However, to be used by ordinary people without specialized knowledge, robots must be easy to operate through a flexible programming system. Robot programming by demonstration (robot PbD) has thus become a central topic in the field of robotics; it means that robot task programming is simplified by using a vision system to learn from human demonstration. Artificial neural networks are widely used in this type of robot programming. M. Stoica et al. developed a system to train an artificial neural network that can be incorporated in industrial robot programming by demonstration [2]. Their work explored how to deal with the training samples for the neural network and discussed their influence for the purpose of robot PbD.
Human-robot interaction interfaces are also essential in robot PbD. This involves user-interface design, human factors, performance evaluation, and so on. The goal is to design a user-friendly interface that efficiently lowers the difficulty of robot PbD. J. Aleotti et al. proposed a demonstration interface of a PbD system for robotic assembly tasks [3]. In their interface, in addition to graphical information, other virtual fixtures such as auditory and tactile sensory feedback are taken into account. They investigated whether there is a dominant artificial fixture that can simply be introduced to improve the effectiveness of user interaction.
This work was supported in part by the National Chung-Shan Institute of Science & Technology under Grant NCSIST-803-V101 and the National Science Council under Grants NSC 102-2218-E-027-016-MY2 and NSC 102-2221-E-027-085.
Hsien-I Lin is with the Grad. Inst. of Automation Technology, Taipei Tech. University, Taipei, Taiwan (e-mail: sofin@mail.ntut.edu.tw).
Yi-Yu Chen is with the Grad. Inst. of Automation Technology, Taipei Tech. University, Taipei, Taiwan (e-mail: z7172930@yahoo.com.tw).
Yung-Yao Chen is with the Grad. Inst. of Automation Technology, Taipei Tech. University, Taipei, Taiwan (corresponding author; phone: +886-2-2771-2171; fax: +886-8773-3217; e-mail: yungyaochen@mail.ntut.edu.tw).


Figure 1. Goal of the proposed robot programming by demonstration example. (a) The assumed situation: after learning from a human, the robot can finish the pick-and-place operation with a suitable grip angle and fine manipulation. Objects are placed far enough from each other, so the image occlusion problem is not considered here. (b) Hardware configuration of our system: a robotic arm equipped with a camera on its top (indicated by the red arrow mark).

The robot pick-and-place task, as shown in Fig. 1, is common for indoor service robots and automatic assembly lines. In order to achieve autonomous robot pick-and-place operation, many studies have been performed in areas such as anthropomorphic grasper design and sensor-motor coordination [4, 5]. The goal is to perform fine manipulation in a stable and robust way while pursuing computational simplicity. Robot vision systems play an important role in the pick-and-place task: they can be used to detect objects and to measure their locations or poses. W. Cai et al. proposed a vision-based kinematic calibration method that calibrates the downward-looking vision system of the pick-and-place robot in a flip-chip bonder [6]. Their method estimated the exact pose parameters of chip and pad, and also improved the placement accuracy. T. P. Cabre et al. proposed an automation cell that combines a computer vision system and robotic control [7]. In their automation cell, the vision system was used to detect a small object placed randomly on a target surface. A. Pretto et al. proposed a flexible vision system for 3D localization of planar parts for industrial bin-picking robots [8]. Based on a mono-camera vision system, their approach estimated the pose of planar objects from a single 2D image so that the robotic gripper could grasp the object in a suitable way. However, the target objects used in their system are restricted to planar shapes with thickness up to several centimeters. C. Choi et al. proposed a voting-based pose estimation method using a 3D sensor for bin-picking robots [9]. They did not restrict the shape of the target object, but a 3D CAD model of the object was required in advance. Pose estimation thus involves the correspondences between 3D features in the real object and the predefined CAD model database. Their method solved the correspondence problem in the presence of sensor noise, occlusions, and clutter, and performed detection and pose estimation of the target object using the 3D point cloud scanned by the 3D sensor.
In this paper, we design a robot PbD procedure for the pick-and-place task that does not restrict the target object shapes. Using a simple camera, we propose a robot vision system that performs object recognition and 2D pose (rotational angle) estimation for the robot pick-and-place operation. However, when the target objects are unrestricted, it is difficult to achieve fully autonomous vision-based robot PbD, since the possible objects have countless shapes. Our goal is to integrate the robot vision system into a user-friendly interface that simplifies the robot PbD process and helps users to select interpretable features for predictive model construction. With the help of the proposed GUI, the potential users do not need to have specialized knowledge. The results of object and pose recognition are used as the prior knowledge for the robot before human demonstration, as shown in Fig. 2.
The remainder of this paper is organized as follows. In Sec. 2, we briefly describe the notations used in this paper, the proposed graphical user interface, and the image processing techniques behind the interface. In Sec. 3, experimental results are provided. Finally, we draw our conclusions in Sec. 4.

II. ALGORITHM

For robot PbD in the pick-and-place task, it is useful if, before human demonstration, the robot can obtain initial information, e.g., recognize the target object and estimate its pose. When learning from human demonstration, such information can be viewed as prior knowledge for the subsequent robot decision-making problem. Therefore, in this paper we develop the robot vision system as the first step toward realizing such a working robot.
A. Preliminary
We use a superscript to indicate a color channel, e.g., red (R), green (G), or blue (B), or to indicate a grayscale image (gray). We use $[m] = [m, n]$ to represent discrete pixel coordinates. We use a subscript to distinguish whether an image belongs to the training data or the test data, and to emphasize whether the image is background or foreground. For example, the function $I_{test}[m]$ denotes a color test image, and it contains three channels: $I^{R}_{test}[m]$, $I^{G}_{test}[m]$, and $I^{B}_{test}[m]$. The corresponding grayscale image $I^{gray}_{test}[m]$ is calculated as a weighted sum of the three linear-intensity values:

$$I^{gray}_{test}[m] = 0.213\, I^{R}_{test}[m] + 0.715\, I^{G}_{test}[m] + 0.072\, I^{B}_{test}[m]. \qquad (1)$$
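As a quick illustration of (1), the conversion is a per-pixel weighted sum that can be vectorized directly; the minimal NumPy sketch below assumes an H x W x 3 array in R, G, B channel order (the function and variable names are ours, not the paper's).

```python
import numpy as np

def to_gray(rgb_image: np.ndarray) -> np.ndarray:
    """Grayscale conversion as the weighted sum of Eq. (1)."""
    rgb = rgb_image.astype(np.float32)
    return 0.213 * rgb[..., 0] + 0.715 * rgb[..., 1] + 0.072 * rgb[..., 2]
```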

Pick-and-place operation is a common task for indoor service robots and automatic assembly lines. Without loss of generality, a limited set of target objects is predefined and used as training data. In this paper, five objects (a stapler, a screw, a screwdriver, a correction tape, and a whiteboard wiper, as shown in Fig. 3) are selected since they are commonly used in the work environments mentioned above. For the training data, each target object is placed on a dial scale (a disc-shaped protractor), as shown in Figs. 3(a)-(e). In total, 360 training samples (72 per object) are acquired at every five degrees of rotation from 5 to 360 degrees, generating the training data set $S_{training} = \{I_{training,i}\}$, where $i = 1, \ldots, 360$ indexes the $i$-th training sample. For the test data, the target objects are placed within a 20 cm x 20 cm square area, with an arbitrary rotational angle $\theta$ and an arbitrary displacement $d$, as shown in Fig. 3(f).

(a)

(b)

(d)

(e)

(c)

(a)

(b)
Figure 2. Overall frameworks. (a) Framework of the robot PbD. In this paper,
we focus on the blocks within green rectangle; (b) framework of the proposed
robot vision system, whose goal is to provide prior knowledge for the robot
before human demonstration.


Figure 3. Illustration of the training data and the test environment. In this paper, five target objects are selected since they are commonly used in our target work environments. For each object, the training samples are acquired at every five degrees of rotation from 5 to 360 degrees, i.e., $\theta = 5j$ degrees, $j = 1, \ldots, 72$. (a) Object A with rotational angle $\theta = 15$; (b) object B with $\theta = 15$; (c) object C; (d) object D with $\theta = 240$; (e) object E with $\theta = 320$; (f) the test environment. When testing, the object is placed within a 20 cm x 20 cm square area, with an arbitrary rotational angle $\theta$ and an arbitrary displacement $d$. Since the position and the angle of the camera are fixed, it is imperative to deal with the image scaling issue caused by the parameters $(\theta, d)$.

B. Foreground extraction
In addition to the training data set, the background of the training data, $I_{bg\_training}$, is also recorded; this is the disc image without any object on it. In this paper, the target object is considered as the foreground, and it can be extracted pixel by pixel if the color difference of a channel between the input image and the background image is too large. The foreground extraction of the training data can be expressed as

$$B_{fg\_training}[m] = \begin{cases} 1, & \text{if } \left| I^{\lambda}_{training}[m] - I^{\lambda}_{bg\_training}[m] \right| > \tau \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\lambda$ = R, G, or B, and $\tau = 20$ is an empirical value. The function $B_{fg\_training}[m]$ denotes the binary version of the foreground of the training image, where the value 1 means a foreground object pixel and the value 0 means a background pixel. Similarly, the binary version of the foreground of the test image, $B_{fg\_test}[m]$, can also be obtained by (2).
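For concreteness, a minimal NumPy sketch of the background-subtraction rule in (2) is given below. A pixel is marked as foreground when the difference to the background exceeds tau = 20 in at least one RGB channel; whether the original uses an any-channel or all-channel rule is not fully clear from the text, so this reading, like the function name, is an assumption.

```python
import numpy as np

def extract_foreground(image: np.ndarray, background: np.ndarray,
                       tau: float = 20.0) -> np.ndarray:
    """Binary foreground mask in the spirit of Eq. (2): 1 where the colour
    difference to the background exceeds tau in at least one RGB channel."""
    diff = np.abs(image.astype(np.float32) - background.astype(np.float32))
    return (diff > tau).any(axis=-1).astype(np.uint8)
```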
C. Shadow detection and noise removing
In this subsection, for all training and test images, the binary foreground images ($B_{fg\_training}$ and $B_{fg\_test}$) are further modified by shadow removal and noise removal. First, a difficulty arises from shadows when the target objects are placed on the table or the training disc. In other words, after background subtraction, there are errors that come from failing to distinguish between the object itself and its shadow cast by a light source. Hence, the shadow detection method in [10] is adopted with modification. The chrominance information is used here. Since a shadow is caused by a lower quantity of illuminance, most of the chrominance pixel values remain within a range, i.e., the degree of decrease of the pixel value should be proportional across the RGB channels. Therefore, every binary foreground image from (2) is further modified by

$$B'_{fg}[m] = \begin{cases} 0, & \text{if } \left(\rho^{\lambda_1}[m] - \rho^{\lambda_2}[m]\right)^2 < \tau_3 \text{ and } \rho^{gray}[m] \in [\alpha, \beta] \\ B_{fg}[m], & \text{otherwise} \end{cases} \qquad (3)$$

where $\lambda_1, \lambda_2 \in \{R, G, B\}$, $\lambda_1 \neq \lambda_2$, and $\rho^{\lambda}$ denotes the pixel intensity ratio between an image and its corresponding background in channel $\lambda$, i.e.,

$$\rho^{x}[m] = \frac{I^{x}[m]}{I^{x}_{bg}[m] + 0.1}, \qquad (4)$$

where $x$ = R, G, B, or gray. In (4), the value 0.1 is purposely added to avoid a zero denominator. The meaning of (3) is to find the shadow pixels, as shown in Fig. 4(b), and remove them from the binary foreground found in (2). In addition, for both training and test data, the interval $[\alpha, \beta]$ takes into account how strong the light source is, as well as the reflectance issue. The stronger the illuminance or the reflectance is, the lower $\alpha$ and the higher $\beta$ should be chosen.
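A sketch of the ratio test in (3)-(4) is shown below; the paper does not give numeric values for the ratio threshold or for the interval [alpha, beta], so the constants here (and the function name) are purely illustrative assumptions.

```python
import numpy as np

def remove_shadow(fg_mask: np.ndarray, image: np.ndarray, background: np.ndarray,
                  alpha: float = 0.4, beta: float = 0.95,
                  ratio_thresh: float = 0.01) -> np.ndarray:
    """Relabel probable shadow pixels as background, following Eqs. (3)-(4).
    alpha, beta and ratio_thresh are illustrative values, not the paper's."""
    img = image.astype(np.float32)
    bg = background.astype(np.float32)
    w = np.array([0.213, 0.715, 0.072], dtype=np.float32)  # grayscale weights of Eq. (1)
    rho = img / (bg + 0.1)                                  # per-channel ratio, Eq. (4)
    rho_gray = (img @ w) / (bg @ w + 0.1)
    # A shadow dims all channels proportionally, so pairwise ratio differences stay small.
    pair_sq = np.stack([(rho[..., i] - rho[..., j]) ** 2
                        for i, j in ((0, 1), (0, 2), (1, 2))], axis=0)
    is_shadow = (pair_sq.max(axis=0) < ratio_thresh) \
                & (rho_gray >= alpha) & (rho_gray <= beta)
    cleaned = fg_mask.copy()
    cleaned[is_shadow] = 0
    return cleaned
```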

Figure 4. Shadow detection results in the training data case. (a) Target objects A-E (from left to right), all with rotational angle $\theta = 10$; (b) corresponding results of shadow detection, where shadow pixels are indicated in pink.

Before retrieving the RGB color values from the binary foreground image, the exterior noise (white isolated pixels) and the interior noise (black hole clusters) of the target object, as shown in Fig. 5(a), must be removed. In this paper, both types of noise are eliminated by the following steps:

(1) To eliminate the white isolated pixels, connected component analysis based on 8-connectivity is used. For $B'_{fg}$ from (3), the total number of pixels of each connected region is compared with a predefined threshold, and any region that is too small is discarded.

(2) To fill the black hole clusters, we deal with the background part first instead. That is, the background part of a binary image is considered as a whole, i.e., a single connected region, and the pixels outside of this region are all considered as object pixels. To achieve this, first the complement of $B'_{fg}$ is given by

$$\bar{B}'_{fg}[m] = \begin{cases} 1, & \text{if } B'_{fg}[m] = 0 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where, in $\bar{B}'_{fg}$, the value 1 now means a background pixel. The result of the connected component analysis in step one is used again. In $\bar{B}'_{fg}$, only the connected region that has the most pixels is kept, and thus the new binary image can be expressed by

$$\bar{B}''_{fg}[m] = \begin{cases} 1, & \text{if } \bar{B}'_{fg}[m] = 1 \text{ and } [m] \text{ is within the largest connected region} \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$


Taking the complement of $\bar{B}''_{fg}$, the final foreground $B''_{fg}$ is obtained, as shown in Fig. 5(b). The color version of the final foreground is then obtained by retrieving the RGB color values at the pixel locations where $B''_{fg}[m] = 1$, as shown in Fig. 5(c).

Figure 5. Illustration of noise removal: (a) cropped version of the binary foreground image $B'_{fg}$ from (3); examples of the exterior noise (white isolated pixels) and the interior noise (black hole clusters) are indicated by the blue and red arrow marks, respectively; (b) binary result after noise removal, i.e., $B''_{fg}$; and (c) color version of the foreground image, where the RGB color values are retrieved only at pixel locations with $B''_{fg}[m] = 1$.
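The two noise removal steps map naturally onto connected component analysis; below is a hedged OpenCV sketch in the spirit of (5)-(6). The minimum area threshold is an assumed value, and the function name is ours.

```python
import cv2
import numpy as np

def clean_mask(fg_mask: np.ndarray, min_area: int = 50) -> np.ndarray:
    """Step (1): drop small white components (8-connectivity); step (2): keep
    only the largest connected region of the complement and treat everything
    else as object, following Eqs. (5)-(6). min_area is an assumption."""
    # Step (1): remove exterior noise (small isolated foreground blobs).
    n, labels, stats, _ = cv2.connectedComponentsWithStats(
        fg_mask.astype(np.uint8), connectivity=8)
    kept = np.zeros_like(fg_mask, dtype=np.uint8)
    for k in range(1, n):                                  # label 0 is the background
        if stats[k, cv2.CC_STAT_AREA] >= min_area:
            kept[labels == k] = 1
    # Step (2): fill interior holes via the complement image.
    comp = (1 - kept).astype(np.uint8)                     # 1 now means background
    _, bg_labels, bg_stats, _ = cv2.connectedComponentsWithStats(comp, connectivity=8)
    largest_bg = 1 + int(np.argmax(bg_stats[1:, cv2.CC_STAT_AREA]))
    return (bg_labels != largest_bg).astype(np.uint8)      # final binary foreground
```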

D. Feature performance evaluation GUI


Feature selection, which is important for pattern recognition and machine learning, is the process of selecting a relevant subset of the original features [11]. Compared to the original features, the selected features help to construct a general model that is more interpretable, shortens training times, and predicts new cases more accurately. Due to the nature of robot PbD, it is desirable that a robot can learn from human demonstration by a fully automatic vision-based system. However, for a pick-and-place operation that does not restrict the target objects, this is hard since the objects may have arbitrary shapes, sizes, colors, and so on. In addition, when the target object is placed at different rotational angles and displacements, as is allowed in this paper, the recognition problem becomes more difficult.
As mentioned in Sec. 1, a user-friendly human-robot interaction interface is usually required, especially for users who are not experts. Motivated by the need to lower the level of difficulty, our strategy is to take advantage of interactive trainability through the proposed feature performance evaluation user interface, as shown in Fig. 6(a). In fact, the work in Subsections II.B and II.C is included in the proposed GUI. Given a training data set $S_{training} = \{I_{training,i}\}$, the foreground image of each training sample (i.e., $B''_{fg\_training,i}$) can be obtained as described above. Note that, when recognizing, the foreground image of a test image (i.e., $B''_{fg\_test}$) is obtained through the same process as well. This provides the fundamental basis for the following recognition problem, since the irrelevant information, such as the background part of an image, has been discarded. The proposed feature performance evaluation GUI further helps us to extract interpretable features from the foreground images and to decide whether a feature will be used to construct a predictive model or not. In other words, before building the model, the user can visually evaluate the performance of the selected features by looking into the feature sub-space, as shown in Figs. 6(b) and 6(c). Hence, the proposed GUI simplifies the feature selection process, and the users do not need to have any technical knowledge.
Currently, two types of features are considered in the proposed GUI. The first type is size information, defined as the total number of object pixels in $B''_{fg}$. The second type is color information: in each color channel, the standard deviation, which shows how much dispersion from the average exists, is used as a feature. However, every color space has its advantages and disadvantages. The RGB model is generally used for color display, but it is inappropriate for color analysis due to the high correlation among the R, G, and B components. In addition, the distance in RGB color space does not represent the perceptual difference on a uniform scale. In this paper, we seek to perform color analysis in color models where the intensity and chromatic components can be easily and independently controlled. As a result, the color image is also transformed from RGB to the YCbCr and HSV color spaces.
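As a concrete illustration, a feature vector of this kind (per-channel standard deviations over the foreground pixels in YCbCr and HSV, plus the object size) could be computed as sketched below. The function name and exact channel ordering are our assumptions, and OpenCV's Y-Cr-Cb channel convention is used for the chrominance channels.

```python
import cv2
import numpy as np

def extract_features(color_image: np.ndarray, fg_mask: np.ndarray) -> np.ndarray:
    """Seven features: std. dev. of the foreground pixels in each Y, Cr, Cb
    and H, S, V channel, plus the object size (foreground pixel count)."""
    ycrcb = cv2.cvtColor(color_image, cv2.COLOR_BGR2YCrCb)  # OpenCV order: Y, Cr, Cb
    hsv = cv2.cvtColor(color_image, cv2.COLOR_BGR2HSV)
    obj = fg_mask.astype(bool)
    feats = [float(ycrcb[..., c][obj].std()) for c in range(3)]
    feats += [float(hsv[..., c][obj].std()) for c in range(3)]
    feats.append(float(obj.sum()))                          # size feature
    return np.asarray(feats, dtype=np.float32)
```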
In this paper, we have created a software application in
Matlab version 7.8 where the built-in neural network toolbox
is used. The features that are used for neural network
construction are selected via the proposed GUI. The details of
our experiment and the experimental results will be shown in
Sect. 3.

Figure 6. The proposed feature performance evaluation GUI, which can be used before constructing a predictive model. (a) The user interface, where the user can select the feature types and visually evaluate the performance of the selected features by looking into the feature sub-space. The bottom-left plot of the GUI represents the distribution of all training samples. Five target objects are tested, whose training samples are indicated by different colors; for each object, there are 72 samples due to the different rotational angles. (b) The distribution of the 1D (size) feature sub-space; and (c) the distribution of the 2D (color) feature sub-space. As shown in (b) and (c), the distributions of different objects are far from each other, which implies that the selected features are appropriate for model construction.

E. Pose estimation
When testing, the target object is allowed to be placed arbitrarily within a square area. Since the position and the angle of the camera are fixed, the problem of image scaling needs to be considered before pose estimation. In this paper, we propose a simple scale-invariant pose estimation method that is based on image resizing by bilinear interpolation and takes the coverage ratio of the binary foreground into account. For each binary foreground in the training image set, the smallest bounding rectangle is calculated, as shown in Fig. 7. Given an input test image, we use an exhaustive search to find the most suitable pose estimate. That is, the test image is compared with all training images one after another by (1) extracting the binary foreground $B''_{fg}$ as shown in Fig. 5(b) and resizing its smallest bounding rectangle to the same dimensions (height and width) as each training image, and (2) searching for the largest coverage ratio among all training images.

Figure 7. Illustration of the smallest bounding rectangle in a binary foreground. Object A is placed at different rotational angles, where the red rectangle represents the corresponding smallest bounding rectangle. (a) $\theta = 5$; (b) $\theta = 10$; (c) $\theta = 15$; (d) $\theta = 20$.

Let $D = \{[m] \mid B''_{fg}[m] = 1\}$ be the pixel location set of the object region in a binary foreground image. After resizing the image by bilinear interpolation and rounding each pixel value off to the nearest integer, the resized binary foreground $\tilde{B}''_{fg}$ and its pixel location set of the object region $\tilde{D} = \{[m] \mid \tilde{B}''_{fg}[m] = 1\}$ are obtained. Then, the centroid $(C_x, C_y)$ of the object region is calculated by

$$C_x = \frac{\sum_{[m,n] \in \tilde{D}} m}{\sum_{[m,n] \in \tilde{D}} 1}, \qquad C_y = \frac{\sum_{[m,n] \in \tilde{D}} n}{\sum_{[m,n] \in \tilde{D}} 1}. \qquad (7)$$
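In code, the centroid of (7) is simply the mean of the foreground pixel coordinates; the short sketch below assumes the first array axis corresponds to $m$ and the second to $n$, matching the $[m, n]$ notation.

```python
import numpy as np

def centroid(mask: np.ndarray) -> tuple:
    """Centroid (C_x, C_y) of the object region, as in Eq. (7)."""
    ms, ns = np.nonzero(mask)            # coordinates [m, n] with mask[m, n] == 1
    return float(ms.mean()), float(ns.mean())
```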

In our method, the centroid of the input test image is first shifted to the same location as that of a training image. Then the coverage ratio of the binary foreground is defined by

$$\text{coverage ratio} = \frac{\mathrm{area}\!\left(B''_{fg\_test} \cap B''_{fg\_training}\right)}{\mathrm{area}\!\left(B''_{fg\_test}\right)}, \qquad (8)$$

where the function $\mathrm{area}(\cdot)$ represents the total number of pixels whose value equals 1.
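A hedged sketch of the resulting exhaustive search is given below. It crops both masks to their smallest bounding rectangles before resizing, which stands in for the centroid alignment of (7); the dictionary interface and helper name are our own choices, not the paper's.

```python
import cv2
import numpy as np

def estimate_pose(test_mask: np.ndarray, train_masks: dict) -> int:
    """Return the training rotation angle whose foreground gives the largest
    coverage ratio of Eq. (8). train_masks maps an angle (deg) to the binary
    foreground of that training sample of the recognised object."""
    ys, xs = np.nonzero(test_mask)
    test_crop = test_mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    best_angle, best_ratio = None, -1.0
    for angle, train_mask in train_masks.items():
        ty, tx = np.nonzero(train_mask)
        train_crop = train_mask[ty.min():ty.max() + 1, tx.min():tx.max() + 1]
        h, w = train_crop.shape
        # Bilinear resize of the test crop to the training crop's size, then re-binarise.
        resized = cv2.resize(test_crop.astype(np.uint8), (w, h),
                             interpolation=cv2.INTER_LINEAR)
        resized = (resized > 0).astype(np.uint8)
        ratio = np.logical_and(resized, train_crop).sum() / max(int(resized.sum()), 1)
        if ratio > best_ratio:
            best_angle, best_ratio = angle, ratio
    return best_angle
```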

III. EXPERIMENTAL RESULTS

The objective of this paper is to design a simple robot vision system that can recognize both different target objects and their rotational angles. As mentioned in Fig. 3(f), the target object is allowed to be placed within a 20 cm x 20 cm square area on the table, with an arbitrary rotational angle $\theta$ and an arbitrary displacement $d$. In this paper, each target object is tested 32 times, i.e., with different $(\theta, d)$ combinations; hence there are 160 test images in total. The overall recognition accuracy is defined as

$$\text{accuracy} = \frac{\text{total number of images correctly recognized}}{\text{total number of images used for testing}} \times 100\%. \qquad (9)$$
From the proposed GUI, 7 features $\{\sigma_Y, \sigma_{Cb}, \sigma_{Cr}, \sigma_H, \sigma_S, \sigma_V, \text{object size}\}$ are finally used for training, where the first six values are the standard deviations of the individual channels of the YCbCr and HSV color spaces, respectively. For testing our method, we created a software application in Matlab version 7.8 using the built-in neural network toolbox. The network type is feed-forward back-propagation, the training method is the Levenberg-Marquardt method with the momentum option, and the performance function is the mean square error (MSE). The proposed neural network has one hidden layer with 15 neurons. For the hidden layer we used the hyperbolic tangent sigmoid activation function, and for the output layer we used the linear transfer function. For object recognition, since these 5 target objects have quite different shapes and colors, we would expect a very high accuracy rate; indeed, the average accuracy rate is 99%, and only one image (object A) is recognized incorrectly.
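For readers without access to the Matlab toolbox, a rough open-source stand-in for this classifier is sketched below; scikit-learn offers no Levenberg-Marquardt solver, so lbfgs is used instead, and the feature matrix here is random placeholder data rather than the paper's measurements.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 7-dimensional feature vectors (see the extract_features
# sketch above), 72 samples for each of the 5 objects, labels 0..4.
rng = np.random.default_rng(0)
X = rng.normal(size=(360, 7))
y = np.repeat(np.arange(5), 72)

# One hidden layer with 15 tanh units, following the paper's network topology.
clf = MLPClassifier(hidden_layer_sizes=(15,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:1]))   # predicted object label for the first sample
```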


For testing the proposed method in pose estimation, in order to evaluate its performance separately, we assume that the result of the preceding object recognition step is correct. In other words, for each test image, only 72 candidate poses are considered. In this paper, the tolerance range is set to plus/minus 5 degrees. As a result, the overall pose recognition accuracy rate is 92.5%, where the individual accuracy rates are 96.88% (object A), 93.75% (object B), 93.75% (object C), 87.5% (object D), and 90.63% (object E). The accuracy rates of objects D and E are relatively low due to their smaller object volumes, which shows that the coverage ratio method is influenced by the object volume to some extent. In addition, the accuracy rate of object D is the lowest also because of its semi-symmetric shape; that is, the shape of object D is almost bilaterally symmetric, which increases the difficulty. Nevertheless, the overall accuracy rate of pose estimation is still higher than 92%.

IV. CONCLUSION

In this paper, we proposed a robot vision system that recognizes both different target objects and their poses for the robot pick-and-place operation. We not only designed the feature performance evaluation GUI, but also proposed a simple scale-invariant pose estimation method. Given a training image set, the user interface helps us to perform background subtraction, to extract features from the foreground images, and to decide whether a feature will be used to construct a predictive model or not. The proposed GUI simplifies the feature selection process for predictive model construction, and the users do not need to have any technical knowledge. The results of the object recognition experiment show that, for the 5 target objects used in this paper, the 7 selected features are sufficient to recognize them. However, for future use, additional interpretable features must be included in our GUI in order to recognize more different objects at the same time. From the results of the pose estimation experiment, we found that the proposed method is affected if the target object has the bilateral symmetry property, i.e., a pose of 5 degrees might be misclassified as 5 + 180 = 185 degrees. However, since our goal is the robot pick-and-place operation (selecting the grip angle of the robotic arm), the error cases that come from the bilateral symmetry property may not be a concern. Experimental results support the effectiveness of the proposed vision system, whose goal is to provide prior knowledge for the robot before human demonstration.
REFERENCES
[1] E. Garcia, M. A. Jimenez, and P. G. D. Santos, "The evolution of robotics research," IEEE Robotics & Automation Magazine, vol. 14, pp. 90-103, March 2007.
[2] M. Stoica, G. A. Calangiu, F. Sisak, and I. Sarkany, "A method proposed for training an artificial neural network used for industrial robot programming by demonstration," Proc. 12th International Conf. Optimization of Electrical and Electronic Equipment (OPTIM), pp. 831-836, May 2010.
[3] J. Aleotti, S. Caselli, and M. Reggiani, "Evaluation of virtual fixtures for a robot programming by demonstration interface," IEEE Trans. Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 35, pp. 536-545, July 2005.
[4] Y. Zhang, B. K. Chen, X. Liu, and Y. Sun, "Autonomous robotic pick-and-place of microobjects," IEEE Trans. Robotics, vol. 26, pp. 200-207, Feb 2010.
[5] J. Jin, S. Yuen, Y. Lee, C. Jun, Y. Kim, S. Lee, B. You, and N. Doh, "Minimal grasper: a practical robotic grasper with robust performance for pick-and-place tasks," IEEE Trans. Industrial Electronics, vol. 60, pp. 3796-3805, Sept 2013.
[6] W. Cai, T. Xiong, and Z. Yin, "Vision-based kinematic calibration of a 4-dof pick-and-place robot," Proc. 12th International Conf. Mechatronics and Automation (ICMA), pp. 87-91, Aug 2012.
[7] T. Cabre, M. Cairol, D. Calafell, M. Ribes, and J. Roca, "Project-based learning example: controlling an educational robotic arm with computer vision," IEEE Journal of Latin-American Learning Technologies (IEEE-RITA), pp. 135-142, Aug 2013.
[8] A. Pretto, S. Tonello, and E. Menegatti, "Flexible 3d localization of planar objects for industrial bin-picking with monocamera vision system," Proc. International Conf. Automation Science and Engineering (CASE), pp. 168-175, Aug 2013.
[9] C. Choi, Y. Taguchi, O. Tuzel, M. Liu, and S. Ramalingam, "Voting-based pose estimation for robotic assembly using a 3d sensor," Proc. IEEE International Conf. Robotics and Automation (ICRA), pp. 1724-1731, May 2012.
[10] S. Lin, Y. Chen, and S. Liu, "A vision-based parking lot management system," Proc. IEEE International Conf. Systems, Man and Cybernetics, pp. 2897-2902, Oct 2006.
[11] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowledge and Data Engineering, vol. 17, pp. 491-502, April 2005.
