
sensors

Article
An Adaptive Track Segmentation Algorithm for
a Railway Intrusion Detection System
Yang Wang 1,2, Liqiang Zhu 1,2,*, Zujun Yu 1,2 and Baoqing Guo 1,2
1 School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China;
12116331@bjtu.edu.cn (Y.W.); zjyu@bjtu.edu.cn (Z.Y.); bqguo@bjtu.edu.cn (B.G.)
2 Key Laboratory of Vehicle Advanced Manufacturing, Measuring and Control Technology (Beijing Jiaotong
University), Ministry of Education, Beijing 100044, China
* Correspondence: lqzhu@bjtu.edu.cn; Tel.: +86-10-51684151

Received: 6 May 2019; Accepted: 3 June 2019; Published: 6 June 2019 

Abstract: Video surveillance-based intrusion detection has been widely used in modern railway
systems. Objects inside the alarm region, or the track area, can be detected by image processing
algorithms. With the increasing number of surveillance cameras, manual labeling of alarm regions
for each camera has become time-consuming and is sometimes not feasible at all, especially for
pan-tilt-zoom (PTZ) cameras which may change their monitoring area at any time. To automatically
label the track area for all cameras, the video surveillance system requires an accurate track segmentation
algorithm with a small memory footprint and a short inference delay. In this paper, we propose
an adaptive segmentation algorithm to delineate the boundary of the track area with a very light
computational burden. The proposed algorithm includes three steps. Firstly, the image is segmented
into fragmented regions. To reduce the redundant calculation in the evaluation of the boundary
weight for generating the fragmented regions, an optimal set of Gaussian kernels with adaptive
directions for each specific scene is calculated using Hough transformation. Secondly, the fragmented
regions are combined into local areas by using a new clustering rule, based on the region’s boundary
weight and size. Finally, a classification network is used to recognize the track area among all local
areas. To achieve a fast and accurate classification, a simplified CNN network is designed by using
pre-trained convolution kernels and a loss function that can enhance the diversity of the feature
maps. Experimental results show that the proposed method finds an effective balance between the
segmentation precision, calculation time, and hardware cost of the system.

Keywords: railway intrusion detection; scene segmentation; scene recognition; adaptive feature extractor;
convolutional neural networks

1. Introduction
With a continuous increase in the public’s expectation for railway safety, railway intrusion
detection systems require more effective technology to detect objects intruding into the track area and
to provide real-time alarm information for the command center [1]. Railway intrusion behavior is
defined as an object intruding into the track area and endangering the safe operation of trains. Typical
intruding objects include rocks falling from a hill beside the railway line or a tunnel entrance, pedestrians,
and vehicles and their cargo staying in a railroad crossing area or falling from a bridge over the railway.
Depending on the detecting principle, railway intrusion detection systems can be divided into
two categories: the contact type and the non-contact type. A representative of the contact type is
the protective metal net installed along the line to block an object from intruding into the clearance,
and the system will send the alarm information when the physical deformation of the net is measured
by a dual-power sensor [2] or fiber grating sensor [3,4]. The systems based on the non-contact

measurement technology use infrared sensors [5] or laser scanners [6,7] to get the size and location
of the intruding object [8]. Video surveillance is also widely used as another kind of non-contact
intrusion detection system because of the large monitoring area, convenient installation, maintenance,
and good observable results [9]. As shown in Figure 1, we established an intrusion detection system
for the Shanghai–Hangzhou high-speed railway in China. The system contains data processing servers,
communication networks, and 1550 cameras, including both fixed and PTZ cameras.

Figure 1. Structure of the railway intrusion detection system.

The threat level of an intrusion behavior will be evaluated by the category, location, and moving
trajectory of the object with respect to the track area. The information of the intruding object can be
extracted by image processing algorithms, e.g., density-based spatial clustering of applications with
noise (DBSCAN) [10], fast background subtraction (FBS) [11], Kalman filtering (KF) [12], and principal
components analysis (PCA) [13]. DBSCAN uses extremum points of the scan sequence as core objects
of clustering, and the movement and distribution characteristics are used to judge whether the cluster
is a train or another foreground object. FBS projects the scene image into one dimension (x or y) to
locate the position of the foreground object by the change of the peak value. KF classifies the objects
acquired via image background subtraction with a support vector machine (SVM), and then uses the
Kalman-filter tracking algorithm to analyze the behavior and moving trend of the objects. PCA projects
the statistics of the scene images and the successive images into a transformation space and calculates
the Euclidean distance; a distance greater than a threshold is considered to indicate a moving object.
Most of the above-mentioned algorithms only focus on the foreground object, rather than the track area
in the background. Therefore, the position and boundary of the track area are still delineated manually
in advance, as shown in Figure 2.

Figure 2. Railway scene and the local areas, labeled manually. The image quality is susceptible to
external influences, such as the illumination, weather, and even the dust on the lens. (a) The red area
is the track area to be surveilled. The track area includes the rails, sleepers, subgrades, or high-speed
railway slabs. (b) Labeling the different areas of the railway scene manually with different colors,
including track area (red), sky (blue), catenary system (purple), green belt (green), and ancillary
buildings (yellow). The precision depends on the patience of the manual operator.

The precision of the track area boundary directly affects the reliability of intrusion detection.
With an increasing number of surveillance cameras along the railway line, especially as some PTZ
cameras will change their focal lengths and angles temporarily for different applications, manual
labeling has become time-consuming and laborious. Thus, for the efficiency of the railway intrusion
detection system, a scene segmentation algorithm is needed to recognize the track area and delineate
the boundary automatically. The algorithm will be applied to initialize surveillance areas after the
installation of all cameras, and to relearn them when the operator adjusts PTZ cameras. Meanwhile,
the practical engineering application has many requirements: the relevant image parsing algorithm
should not only have good segmentation precision and classification accuracy, but also be able to
process temporarily changing scenes quickly. In addition, the algorithm should have a small number
of parameters and be easily applicable to data processing servers with different hardware
configurations and even to the embedded surveillance equipment in the field.

Currently, there are two ways to parse a scene. The traditional way segments the scene image
into superpixels, ultrametric contour maps (UCM), or other fragmented segment regions [14,15],
and then combines them into candidates of objects or local areas based on Markov random fields (MRFs),
conditional random fields (CRFs), multiscale combinatorial grouping (MCG), or other rules [16–18].
These traditional methods will generate fragmented regions with precise boundaries and require
time-consuming iterative calculations to form a best candidate of an object or a local area. In addition,
category information of objects cannot be produced. The second way relies on deep neural networks,
e.g., fully convolutional networks (FCN) [19,20], to process the feature extraction, combination,
segmentation, and recognition at the same time. A FCN can achieve the segmentation and recognition
in a single process. One drawback of FCNs is that the boundary line generated is usually a smooth
curve, which will miss the corners of the track area. In addition, FCN has a big memory footprint and
needs a GPU to accelerate its large amount of computation.

In this paper, we propose an adaptive segmentation algorithm that can take advantage of both
methods while avoiding their shortcomings. Like the existing traditional methods, we extract the
texture distribution of the image to generate boundary points with different weights for segmenting
the image into small fragmented regions, and then the regions are combined into local areas with
precise boundaries; finally, we apply a specially designed convolutional neural network (CNN) for
the areas' classification without the need of a GPU. Our main contributions include:

• To accelerate the generation of small fragmented regions, we propose a method to find the optimal
set of Gaussian kernels with adaptive directions for each specific scene. By making full use of
the straight-line characteristics of the railway scene, a smaller number of adaptive directions are
calculated according to the maximum points in Hough transformation rather than being chosen
from a set of fixed angles in the traditional way. As a result, the calculation time for the boundary
extraction and fragmented region generation is cut in half;
• A new clustering rule based on the boundary weight and the size of the region is set up to
accelerate the combination of the regions into local areas. The number of regions is reduced in
the process of weak boundary point removal by filtration, and the smallest remaining region is
combined with its neighbor region, which shares the weakest boundary;
• We propose a specially designed CNN model to achieve fast classification of local areas without
the need of a GPU. The local areas are divided into two categories: the track area, which is used to
judge the intrusion behavior, and the rest areas, which are unrelated to the intrusion. The convolution
kernels are pre-trained, and a sparsity penalty term is added into the loss function to enhance the
diversity of the convolutional feature maps.

The rest of this paper is organized as follows. In Section 2, we review the related works on image
parsing algorithms. Section 3 explains the proposed fast image segmentation process. Section 4 explains
the proposed simplified CNN network structure and the optimization process. Section 5 presents the
experimental results and discusses them. The last section summarizes our conclusions.

2. Related Work

2.1. Image Parsing by Traditional Methods


To segment an image using the traditional methods, the first step is to calculate the correlation
between the adjacent pixels in the scene image, and then segment the image into fragmented regions
by a certain convergence criterion [21,22]. The superpixel algorithm, for example, converts the image
from the RGB color space to CIE-Lab color space to form a five-dimensional vector (brightness, color
A, color B, and position x, position y), and the vector distance between two pixels, representing their
similarity, is used to generate the small segment patches [14,23]. A spatial pyramid descriptor fuses
the gray, color, and edge gradient features into one feature vector for the SVM classifier to recognize a traffic
sign [24]. The image can also be converted into the YCbCr color space, and the local texture features in
different channels are matched with the artificially designed template to locate the position of the traffic
sign [25]. Therefore, converting the image from the RGB color space into another feature space can
obtain more dimensional information channels: brightness, texture, and other feature maps besides
RGB color.
To achieve the final segmentation, the fragmented regions need to be combined. The internal
correlations among the adjacent regions are calculated according to different rules, and the regions are
combined into local areas according to their correlation values. For example, the K-means clustering
rules are used in different practical engineering applications, such as object detection for the synthetic
aperture radar (SAR) image and the sea scene [26–28]. The MCG algorithm is another grouping strategy
using random forests to combine the multiscale regions into highly accurate object candidates.
MCG can process one image (pixel size 90 × 150) in 7 s, and the mean Intersection over Union (IU)
is about 80% [18,29,30]. The clustering rules influence the combination precision, which is also directly
proportional to the calculation time; as a result, MCG is suitable for the initial or post processing of
a fixed scene, not for real-time processing of temporarily changing scenes. Therefore, to accelerate the
whole scene segmentation process, we choose to improve the traditional methods in both generation
and combination of the fragmented regions while maintaining the segmentation precision.

2.2. Image Parsing by Deep Learning Methods


Deep learning methods have also been widely used in image parsing recently, e.g., various
convolutional networks, which have better robustness to image translation, rotation, scaling,
and distortion. Deep learning methods can be divided into three types: image classification [19], object
detection [31], and pixelwise prediction [20]; and the complexity of their network structures increases
from image-wise to pixel-wise.
For the pixelwise segmentation of a scene, the convolutional networks can be combined with
the superpixels, the random effect model and the texture segmentation to generate the pixelwise
labels [32], and also can be used as a classifier to classify the feature maps containing RGB and
depth information [33–35]. FCN can even process feature extraction, combination, segmentation,
and recognition at the same time, also achieving a pixelwise prediction [20].
Depending on the details of different FCN structures, the mean IU of FCN is about 80%, the accuracy
is about 90%, and the number of parameters is about 57 M to 134 M. The massive number of
parameters and computation need a GPU with big memory to handle the operation, leading to a high
cost for practical applications. Therefore, we choose to use the traditional methods to get the precise
boundary of the local area first, and then use a simplified CNN only to classify the local areas without
the need of GPU. However, the reduction of the network size causes low accuracy in the classification,
so extra care has to be taken in optimizing the network structure and the training process.

3. Railway Scene Segmentation


As shown in Figure 2b, a typical railway scene consists of different areas, including the track area, sky,
catenary system, green belt, and ancillary buildings. The precision of the track area boundary directly
affects the reliability of the judgement about whether the intrusion occurs or not. The track area is
defined as the clearance area including rails, sleepers, subgrades or high-speed railway slabs, as shown
in Figure 2a. To avoid manual labeling, a fast and precise railway scene segmentation algorithm
is proposed.
Figure 3 illustrates the outline of the proposed algorithm. We first calculate the feature distribution
in a small image patch (pixel size 15 × 15) representing the central pixel of the patch, then evaluate the
central pixel’s probability of being a boundary point, and finally use the boundary weights to segment
the image according to a fast combination rule. Unlike the traditional method, we use a smaller
set of adaptive Gaussian kernels to extract the pixel color (PC) distribution and pixel similarity (PS)
distribution of the image in different channels C and by different scales S. The Gaussian kernels are
rotated by a set of adaptive θs, calculated from Hough transformation. The detailed procedure of
boundary weight generation is described in the remainder of this section.

Figure 3. The procedure to segment an image into fragmented regions and combine them into local areas.

3.1. Generation of Fragmented Regions


Firstly, we convert the image into the CIE-Lab color space, getting 3 channels: brightness, color A,
and color B. Images in different channels will be scaled by s = (0.5, 1, 2). In each channel, the image
is convoluted with Gaussian kernels to get the color value distribution; each kernel has a specific
orientation angle θ. Define G(x, y, θ, c, s) as the convolution result at pixel P(x, y), with angle θ,
in channel c, by scale s. Then PC, the pixel's color distribution, can be obtained by

$PC(x, y, \theta) = \sum_{s}\sum_{c} \alpha_{c,s}\, G(x, y, \theta, c, s)$    (1)

where $\alpha_{c,s}$ is a weighting coefficient.


Secondly, define Similarity(i, j) as the maximum PC value of all pixels on the line $l_{i,j}$ connecting two
pixels i and j in a small image patch by Equation (2), representing the similarity between pixels i and j.

$\mathrm{Similarity}(i, j) = \exp\left(-\max\left\{ PC(x, y) \mid (x, y) \in l_{i,j} \right\}\right)$    (2)

Calculate the similarity of each pixel $i_{x,y}$ in the patch and the central pixel $j_{center}$, assign
Similarity($i_{x,y}$, $j_{center}$) to each element MS(x, y) of the Matrix of Similarity MS, and assemble MS
representing the similarity matrix between each pixel in the image patch and the central pixel.
Calculate the top t eigenvalues and eigenvectors of MS. Assign the eigenvectors to the central pixel
P(x, y), marked as e(x, y, t), forming a feature map E of the image, representing the similarity of the
adjacent points. Again, in each dimension of the feature map, convolute E(t) with a Gaussian kernel
of orientation θ to get the similarity distribution. Define g(x, y, θ, t, s) as the convolution result at
location E(x, y), with angle θ, in dimension t, by scale s. Then, the pixel's similarity value distribution
can be obtained as

$PS(x, y, \theta) = \sum_{s}\sum_{t} \beta_{c,s}\, g(x, y, \theta, t, s)$    (3)

where $\beta_{c,s}$ is a weighting coefficient.


Finally, B(x, y), the possibility of the pixel P(x, y) being a boundary point, can be estimated by

$B(x, y) = \sum_{\theta} PC(x, y, \theta) + \sum_{\theta} PS(x, y, \theta)$    (4)
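As an illustration of how the boundary weight could be computed, the following sketch evaluates the PC term of Equations (1) and (4). It is a minimal example under stated assumptions: the oriented response G is approximated by a directional derivative of a Gaussian-smoothed channel, the scales are reused as smoothing widths instead of image rescaling factors, the weighting coefficients are set to 1, and the PS term of Equations (2) and (3) is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_response(channel, theta, scale):
    """Stand-in for G(x, y, theta, c, s): directional derivative of a smoothed channel."""
    smoothed = gaussian_filter(channel, sigma=scale)
    gy, gx = np.gradient(smoothed)
    return np.abs(np.cos(theta) * gx + np.sin(theta) * gy)

def boundary_weight(lab_channels, thetas, scales=(0.5, 1, 2), alpha=1.0):
    """Equations (1) and (4), PC part only: sum oriented responses over channels,
    scales, and orientation angles (a single assumed alpha for all pairs)."""
    height, width = lab_channels[0].shape
    B = np.zeros((height, width))
    for theta in thetas:
        pc = np.zeros((height, width))          # PC(x, y, theta), Equation (1)
        for channel in lab_channels:            # brightness, color A, color B
            for s in scales:
                pc += alpha * oriented_response(channel, theta, s)
        B += pc                                  # Equation (4), PC term
    return B
```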

3.2. Finding the Optimal Set of Gaussian Kernels


It can be found that, in the process of estimating B(x, y), convolution operations (Equations (1) and
(3)) using Gaussian kernels with different orientation angles θ cost most of the computation, which can be
reduced if a smaller set of Gaussian kernels is used. The traditional UCM algorithms choose a fixed set
of θ = (θ1, θ2, θ3, ...) with 8 or 16 values uniformly distributed from 0 to π. Here we propose to utilize the
characteristics of the railway scene to find a much smaller set of useful orientation angles and thus a smaller
set of Gaussian kernels. Usually, in a railway scene, there is a clear vanishing point (VP), and the boundaries of
many local areas are lines passing through the VP. Therefore, if we can automatically adjust the candidate θ for
each specific scene to enhance the weights of the line boundary points of the relevant areas, then we will be
able to use a smaller set of θ to accelerate the process.
We propose to find the candidate θ by filtering the original image with a Canny kernel [36],
and then convert the obtained texture feature into the Hough coordinate system using
$\rho = x\cos\theta' + y\sin\theta', \quad -\frac{\pi}{2} < \theta' < \frac{\pi}{2}$    (5)
As shown in Figure 4a, each curve in the Hough coordinate system stands for one point in the
Cartesian coordinate system. If the curves (colorful curve lines in Figure 4a) have one intersection
point in the Hough coordinate system, then the corresponding points (blue point in Figure 4a) in the
Cartesian coordinate system are collinear.

Figure 4. Using Hough transformation to detect the most significant lines in the Hough coordinate
system. (a) The intersection point of a group of curves in the Hough coordinate system means there
are a group of collinear points in the Cartesian coordinate system. (b) The more the curves intersect in
the Hough coordinate system, the lighter the intersection point is, meaning that there are more collinear
points along this line in the Cartesian coordinate system. (c) The texture feature maps filtered by the
Canny filter. (d) The top four significant lines.

Let $H(\theta', \rho)$ be the number of curves intersecting at the point $(\theta', \rho)$, and find the point with
maximum $H(\theta', \rho)$; there, the largest number of points are collinear on the corresponding line in
the Cartesian coordinate system. The line can be expressed as

$y = -\frac{1}{\tan\theta'}\, x + \frac{\rho}{\sin\theta'} = kx + b$    (6)

To find a small set of four orientation angles, one can take the top four maxima of $H(\theta', \rho)$,
e.g., the points with the highest 'lightness' in Figure 4b: θ' = 68°, 52°, 0°, and −88°. Here we change
them to θ = 90° − θ' = 22°, 38°, 90°, and 178° in order to obtain a range of values from 0° to 180° (0–π).
Based on the selected set of orientation angles, the Gaussian kernels can be constructed correspondingly by
rotating the Gaussian. As shown in Figure 5, in the Cartesian coordinate system X-O-Y, the point P(x, y)
rotates around the point o(W/2, W/2) by an angle θ to P'(x', y'), which can be formulated as

$\begin{bmatrix} x' & y' & 1 \end{bmatrix} = \begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ -0.5W & 0.5W & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & 0 \\ 0.5W & 0.5W & 1 \end{bmatrix} = \begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ -0.5W(\cos\theta - \sin\theta - 1) & -0.5W(\sin\theta + \cos\theta - 1) & 1 \end{bmatrix}$    (7)

Figure 5. Calculating the rotation matrix of the Gaussian kernel. The rotation center is on the kernel center.

Figure 6 shows several Gaussian kernels rotated by the optimal set of θ = 22°, 38°, 90°, and 178°
obtained above and by θ = 112.5°, one of the eight uniformly-distributed values commonly used in
traditional UCM algorithms, respectively. The results show that the features of the horizontal catenary
bracket, the vertical catenary column, and the declining track are strengthened obviously in the first
four filters, contrasting with the feature extraction equality in the fifth filter. The universality of using
8 or 16 uniform values of the angle θ causes redundant calculation when applied to the railway scene.
Therefore, adjusting a smaller number of θ adaptively to filter the feature maps can accelerate the
boundary weighting to generate the fragmented regions.
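As an illustration of the adaptive angle selection and kernel rotation described above, the sketch below picks the strongest Hough angles and builds a rotated first-order Gaussian-derivative kernel for each of them. It is illustrative only: the Canny thresholds, the Hough accumulator resolution and vote threshold, the kernel width, and the fallback angles are assumptions, and the rotation is applied directly to the kernel coordinates rather than through the matrix product of Equation (7).

```python
import cv2
import numpy as np

def adaptive_thetas(gray, num_angles=4):
    """Pick orientation angles from the strongest lines in the Hough accumulator."""
    edges = cv2.Canny(gray, 50, 150)                               # assumed thresholds
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 60)              # (rho, theta') pairs
    if lines is None:
        return [np.deg2rad(a) for a in (0, 45, 90, 135)]           # fallback set
    # assumed: OpenCV returns the lines ordered by accumulator votes (strongest first)
    thetas_prime = [float(line[0][1]) for line in lines[:num_angles]]
    # convert the normal direction theta' to the kernel orientation theta = 90 deg - theta'
    return [(np.pi / 2 - tp) % np.pi for tp in thetas_prime]

def rotated_gaussian_derivative(size, theta, sigma=2.0):
    """First-order Gaussian-derivative kernel whose derivative axis is rotated by theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)                     # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    kernel = -xr / sigma ** 2 * g                                  # derivative along rotated x
    return kernel / np.abs(kernel).sum()
```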

Figure 6. Different kernels and the convolution results of the CIE-Lab color L channel. (a) First-order
derivative Gaussian kernels rotated by five angles. (b) Results of the Gaussian convolution.

3.3. Combination Rule

The fragmented regions generated by the adaptive boundary detection are shown in Figure 7a.
The higher the boundary weight is, the brighter the point is shown in the gray feature map, indicating
that the point is more likely to become a boundary point.

A clustering rule based on both the boundary weight and the region size is proposed to combine
the fragmented regions into local areas. The number of the regions will be reduced in the process of
weak boundary point removal by filtration. The smallest remaining region will be combined with its
neighbor region, with which it shares the weakest boundary. Repeat this iteration until the statistical
parameters meet the requirements. The process is as follows:

1. Let B(m) be the normalized value of the boundary point's weight B(x_m, y_m), where m = 1, 2, 3, ..., M,
   and M is the total number of boundary points:

   $B(m) = \mathrm{sigmoid}(B(x_m, y_m)) = \frac{1}{1 + e^{-B(x_m, y_m)}}$    (8)

2. The statistical distribution of the boundary point weights B(m) is shown in Figure 7b. There are
   many levels of boundary point weights. Choose the minimum level B as the threshold to delete the
   weak boundary points with B(m) ≤ B;
3. The fragmented regions will be reduced by reconnecting the breakpoints of the boundary line
   using expansion and corrosion operations, as shown in Figure 7c. The new regions are shown
   in Figure 7d;
4. The statistical distribution of the region size f(n) is shown in Figure 7e, where n = 1, 2, 3, ..., N, and N
   is the serial number of the regions. Choose the smallest region, trace along its boundary line, and find
   the neighbor region which shares the weakest boundary with it. Then combine them into a new
   region (a minimal sketch of this merging loop is given after the list). As shown in Figure 7d,
   regions number 1, 2, 3, and 4 are combined as one new region in Figure 7f;
5. Repeat Step 4 to reduce N until the area of the smallest region is larger than a threshold S, which is
   used to limit the minimum area of the remaining regions;
6. Compare the final N with another threshold Q, which limits the maximum quantity of the remaining
   regions. If N > Q, select the second minimum weight level B and go back to Step 2.
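The following sketch illustrates the merging loop of Steps 4 and 5 under stated assumptions: the fragmented regions are given as a label map, the boundary weights as a per-pixel array, the weak-boundary filtration of Steps 2 and 3 is omitted, and the neighbor bookkeeping is deliberately simple (np.roll wraps at the image border), so it is not the paper's implementation.

```python
import numpy as np

def combine_regions(labels, boundary, size_thr, max_regions):
    """Merge the smallest region with the neighbor sharing the weakest boundary until
    every region is larger than size_thr (in pixels) and at most max_regions remain."""
    labels = labels.copy()
    weight = 1.0 / (1.0 + np.exp(-boundary))                 # Equation (8)

    def shared_boundary(a, b):
        # mean normalized weight where region a touches region b (inf if not adjacent)
        near_b = (np.roll(labels == b, 1, 0) | np.roll(labels == b, -1, 0) |
                  np.roll(labels == b, 1, 1) | np.roll(labels == b, -1, 1))
        touch = (labels == a) & near_b
        return weight[touch].mean() if touch.any() else np.inf

    while True:
        ids, sizes = np.unique(labels, return_counts=True)
        if len(ids) <= max_regions and sizes.min() >= size_thr:
            break
        smallest = ids[np.argmin(sizes)]
        candidates = [(shared_boundary(smallest, b), b) for b in ids if b != smallest]
        if not candidates:
            break
        strength, target = min(candidates)
        if not np.isfinite(strength):
            break                                             # isolated region, stop
        labels[labels == smallest] = target                   # merge the two regions
    return labels
```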
Figure 7g is the original railway scene image, and Figure 7h is the result of our segmentation
algorithm. The railway scene only contains five categories of areas, and the shape of each area is usually
large and radial. Therefore, we set the minimum area threshold S to 10% of the whole image and the
maximum quantity threshold Q to 10, which will prevent the remaining regions from being too
fragmented. The remaining regions will be adjusted to a standard size of 64 × 64 with 3 RGB channels
and classified by the CNN in Section 4; the remaining regions with the same labels will then be
combined as one local area.

Figure 7. The procedures of combining the fragmented regions into local areas. According to the
adjustment and experiments, for the railway scene, the scene image is set to a pixel size of 90 × 150,
the number of adaptive θ is reduced to 4, the number of reserved areas Q is set to 10, and the smallest
fragmented area S is set to 10% of the total size of the image. (a) Boundary with weight. (b) Distribution
of boundary weight and quantity. (c) Delete the weak boundary. (d) Fragmented regions. (e) Distribution
of the region size and serial number. (f) Local areas after the fragmented regions are combined. (g) The
original railway scene image. (h) The result.
4. Local Area Recognition in Railway Scene

To automatically label the local areas in real time without the help of a GPU, we design a simplified
CNN with fewer layers and kernels. To compensate for the reduced accuracy, the convolution kernels
are pre-trained, and a sparsity penalty term is added into the loss function to enhance the diversity of
the feature maps.
4.1. Structure of Simplified CNN

Before designing and applying a simplified CNN, we first construct a dataset of local area images
for training it. As shown in Figure 8, there are mainly five basic categories of elements in a typical
railway scene, including track area, sky, catenary system, green belt, and ancillary buildings. To sample
the dataset, five solid-line rectangles are manually defined to cover the five different areas. We program
a simple extraction code to take image patches using the dotted-line box as samples with the same
category as the outer rectangle. We set up a group of constraint parameters to control the dotted box
to extract the patches at a random position, by a random scale, staying inside each rectangle. The image
patches are adjusted to a pixel size of 64 × 64 with 3 RGB channels to assemble our five-category dataset
of railway local areas. However, for the specific application of this paper, our target is focused on the
track area for judging intrusion behavior, so besides the 'track' label, we merge the other four elements
into one category labeled as 'others'. There are 9000 image patches in total, in which 5000 images are
used for training our net, 2000 images are used for cross-validation, and 2000 images are used for testing.
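A minimal sketch of the patch sampler described above, assuming each labeled area is given as an axis-aligned rectangle (x, y, w, h); the scale range and the number of samples per rectangle are illustrative assumptions.

```python
import random
import cv2

def sample_patches(image, rect, label, n_samples=50, scale_range=(0.3, 0.8)):
    """Extract randomly placed, randomly scaled square patches that stay inside the
    labeled rectangle, resized to 64 x 64 (RGB), each tagged with the rectangle's label."""
    x, y, w, h = rect
    patches = []
    for _ in range(n_samples):
        side = int(min(w, h) * random.uniform(*scale_range))   # random scale
        if side < 8:
            continue
        px = random.randint(x, x + w - side)                   # random position inside rect
        py = random.randint(y, y + h - side)
        patch = image[py:py + side, px:px + side]
        patches.append((cv2.resize(patch, (64, 64)), label))
    return patches
```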

Figure 8. Collecting samples of local areas for CNN training. (a) Solid-line rectangles are delineated
manually with labels, including the track area (red), sky (blue), catenary system (purple), green belt
(green), and ancillary buildings (yellow). The dotted-line boxes are extractor windows. (b) The dataset
containing two categories for training the CNN.

A simplified CNN structure is designed for fast recognition, which consists of an input layer, two
convolution layers C1 and C2, two mean pooling layers S1 and S2, and a logistic classification layer,
as shown in Figure 9.

Figure 9. Structure of the simplified CNN. The size of the input image is 64 × 64 pixels with 3 RGB
channels. The output is one of the two category labels.
As shown in Table 1, we conducted five experiments with different kernel quantities and sizes.
It can be seen that increasing the kernel size and quantity may increase the accuracy, but the accuracy
is still less than 80%. Although the railway scene is very simple, only containing several typical area
categories, the shapes, color, and texture features of the areas belonging to the same category are still
very complex and different. Therefore, the training process must be optimized to increase the accuracy.

Table 1. Experimental results of different CNN network structures.

Kernel Size | Kernel Quantity (C1) | Kernel Quantity (C2) | Calculation Time (s) | Accuracy
3 × 3       | 50                   | 10                   | 0.00372              | 72.25%
3 × 3       | 70                   | 10                   | 0.00495              | 73%
3 × 3       | 100                  | 10                   | 0.00689              | 75%
5 × 5       | 100                  | 10                   | 0.0125               | 76%
7 × 7       | 100                  | 10                   | 0.0217               | 76.5%

4.2. Optimization of the Simplified CNN

To increase the accuracy, the kernels are pre-trained to extract better low-level features.
The pre-training strategy is based on an autoencoder-decoder network, and the weights $W^{1}_{i,3\times3\times3}$
obtained after training the first layer are applied as the convolution kernels in the first convolution
layer C1, as shown in Figure 10 for the case with a kernel size of 3 × 3 and 3 RGB channels. During the
training, 3 × 3 patches in 3 RGB channels are randomly selected from random railway scene images,
as shown in Figure 11a. The result of the pre-trained kernels is shown in Figure 11b, where the patches
and the kernels are all in 3 RGB channels.

Figure 10. Structure of the autoencoder-decoder network. The hidden layer contains 70 hidden
neurons; W denotes the weight associated with the connection between neurons; and the network is
trained to produce output the same as its input.

Figure 11. Pre-trained convolution kernels using the autoencoder-decoder algorithm. (a) The image
patches are extracted from the left railway scene image for the kernel training. (b) The pre-trained
kernels used in convolution layer C1.

After pre-training, the input weights of each neuron in the hidden layer are used as the initial
weights of the kernels in the first convolution layer C1 in Figure 9. The rest of the CNN in Figure 9 is
randomly initialized and then trained by using a backpropagation algorithm (stochastic gradient
descent, SGD). To enhance the diversity of the feature maps, a sparsity penalty term is added into the
loss function J as

$J = \frac{1}{P}\sum_{p=1}^{P}\frac{1}{2}\left\| h(e_p) - l_p \right\|^{2} + \tau\sum_{f=1}^{10}\left[ \chi\lg\frac{\chi}{\eta_f} + (1-\chi)\lg\frac{1-\chi}{1-\eta_f} \right]$    (9)

where

$\eta_f = \frac{1}{P}\sum_{p=1}^{P}\sum_{u=1}^{29}\sum_{v=1}^{29} O^{(2)}_{f,e_p}(u, v)$    (10)

$e_p$ is the p-th input image, $l_p$ is the ground-truth label, there are totally P images in the dataset, $h(e_p)$ is
the output label, τ is the weight of the sparsity penalty term, χ is the sparsity parameter (a small
value close to 0, e.g., 0.05), $\eta_f$ is the average output of the f-th feature map in convolution layer C2
(averaged over the training dataset), and $O^{(2)}_{f,e_p}(u, v)$ is the value at position (u, v) in the f-th feature
map of the input $e_p$ in the second convolutional layer C2; the size of the feature map is 29 × 29 pixels.
In the process of backpropagation, the sparsity penalty term will suppress the average output of
all feature maps in the second convolutional layer C2 while enforcing the output of one feature map at
the same time, so as to enhance the diversity of the feature maps and improve the accuracy. The learning
rate is set to 0.1, the decay of the learning rate is 0.001 after each iteration, and the final value of J
should be less than 0.05.
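To make the network and its training objective concrete, the sketch below instantiates the structure of Figure 9 and the loss of Equations (9) and (10) in PyTorch. It is a minimal illustration under assumptions: the 2 × 2 mean pooling, the sigmoid activations, the width of the final logistic layer, and the value of τ are not stated explicitly in the text, the target is taken as a one-hot label vector, and η_f is computed as a per-pixel average for numerical stability.

```python
import torch
import torch.nn as nn

class SimplifiedCNN(nn.Module):
    """Sketch of Figure 9: C1 (70 kernels, 3x3), mean pooling, C2 (10 kernels, 3x3),
    mean pooling, and a logistic classification layer for two categories."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.c1 = nn.Conv2d(3, 70, kernel_size=3)          # initialized from pre-training
        self.s1 = nn.AvgPool2d(2)                          # mean pooling
        self.c2 = nn.Conv2d(70, 10, kernel_size=3)
        self.s2 = nn.AvgPool2d(2)
        self.fc = nn.Linear(10 * 14 * 14, num_classes)     # logistic layer (assumed width)

    def forward(self, x):                                  # x: (N, 3, 64, 64)
        x = self.s1(torch.sigmoid(self.c1(x)))             # -> (N, 70, 31, 31)
        c2_maps = torch.sigmoid(self.c2(x))                # -> (N, 10, 29, 29), see Eq. (10)
        x = self.s2(c2_maps)                               # -> (N, 10, 14, 14)
        return self.fc(x.flatten(1)), c2_maps

def loss_with_sparsity(pred, target, c2_maps, tau=0.1, chi=0.05):
    """Equation (9): squared error between the output and the one-hot label plus a
    KL-style sparsity penalty on the average activation of each C2 feature map."""
    recon = 0.5 * ((pred - target) ** 2).sum(dim=1).mean()
    eta = c2_maps.mean(dim=(0, 2, 3)).clamp(1e-6, 1 - 1e-6)   # eta_f, cf. Equation (10)
    kl = chi * torch.log(chi / eta) + (1 - chi) * torch.log((1 - chi) / (1 - eta))
    return recon + tau * kl.sum()
```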

4.3. Performance of the Simplified CNN


As shown in Table 2, the accuracies of the simplified CNNs with different structures are all
increased by using the proposed optimization method, compared with the results of the traditional
training method shown in Table 1; e.g., the simplified CNN with 70 kernels (3 × 3, 3 channels) in C1 and
10 kernels (3 × 3, 70 channels) in C2 is used for the proposed segmentation algorithm. The number of
network parameters is only 0.02912 M. After the railway scene is segmented and classified,
the regions with track labels can be combined together as the final track areas.

Table 2. Experiment results of different CNN network structures after the optimization.

Kernel Size | Kernel Quantity (C1) | Kernel Quantity (C2) | Accuracy
3 × 3       | 50                   | 10                   | 98%
3 × 3       | 70                   | 10                   | 98.5%
3 × 3       | 100                  | 10                   | 98.5%
5 × 5       | 100                  | 10                   | 98.75%
7 × 7       | 100                  | 10                   | 99.25%

5. Experiments and Results

5.1. Railway Scene Dataset


We collect images from 16 PTZ cameras at straight lines, curves, and bridges on the high-speed
railway from Shanghai to Hangzhou, China. For each camera, images are collected from 10 different
shooting angles and lenses, under different illumination conditions from 8:00 a.m. to 5:00 p.m.
Examples are shown in Figure 12a. There are totally 1760 scene images in the dataset, in which
1000 images are used in the training dataset, 400 images are used in the cross-validation dataset and
360 images are used in the test dataset. These datasets are used to generate the datasets for our
simplified CNN (Section 4.1) and the dataset for training the FCN for the comparison experiments.

Figure 12. Samples in the railway dataset. (a) Images from PTZ cameras under different conditions.
(b) Ground truth of the track area.
5.2. Modification of the Workflow for the Case of Small Track Portion

For cameras on line sections, the track area only takes up a small portion of the scene image, while
for the ones at tunnel entrances and bridges over the railway line, the track area usually takes up most
of the scene. As shown in Figure 12b, the red track area takes about 25–70% of the whole scene image
for different cameras. That means the complete-processing workflow (Sections 3 and 4) would waste
a lot of time calculating the boundaries between the 'others' areas (Figure 13b) rather than focusing on
the potential track area, shown as the red dotted-line rectangle in Figure 13a. In order to find the
potential track area and reduce the segmentation calculation further, we design a partial-scanning
workflow that locates the potential position of the track area before the segmentation and classification
by scanning over the railway scene roughly using the proposed CNN. As shown in Figure 13c, we firstly
divide the railway scene image into 6 × 10 cells (yellow cells); each cell and its peripheral zone (red
dotted-line rectangle) are resized to 64 × 64 pixels, and the classified label is defined as the representation
of the central cell (red area in Figure 13c); the proposed CNN is used to classify these cells, and the
output labels are used to identify the potential track area roughly, shown as the red area in Figure 13d;
a minimum enclosing dotted-line rectangle is used to adjust the potential track area into a regular shape,
as shown in Figure 13d.
partial-scanning 13d. reduces the segmentation area, but spends extra
workflow
scanningThe
The strategy strategy
time. Thus, of
of thethethe partial-scanning
overall processing
partial-scanning workflow
time depends
workflow reducesonthe
reduces thesegmentation
the proportion ofarea,
segmentation the but
area, track
but spends
area
spends toextra
the
extra
scanning
railway
scanning time.
scene,
time. Thus,
as
Thus, shown
the theoverall
overall
in Table processing
3, the numbers
processing time on
time depends
the lefton
depends on the
arethethe proportion
scene images
proportion ofoftheintrack
the Figure
track area
12,
areato thethe
from
to
railway
the
railway left toscene,
scene, as shown
theasright.
shown If the in Table
in track
Table area 3,takes
3, the the numbers
over more
numbers onon the
than
the left
areare
88.1%
left ofthe
the the scene
scenerailway images
images scene,ininFigure
Figure
the 12,from
performance
12, fromthe
the
of theleft to the right.
partial-scanning If the track
workflow area takes
would beover
worse more than thanthe88.1% of the
complete-processing
left to the right. If the track area takes over more than 88.1% of the railway scene, the performance railway scene,
workflow.the performance
These two of
of the partial-scanning
workflows can be chosen workflow
for would
different be worse
cameras: forthanthosethewith
complete-processing
short focus lens workflow.
and focus on These
the two
near
the partial-scanning workflow would be worse than the complete-processing workflow. These two
workflows can be chosen for different cameras: for those with
scene full of track area, the complete-processing workflow should be used; for the ones with long short focus lens and focus on the near
workflows can be chosen for different cameras: for those with short focus lens and focus on the near
scene lens
focus full and
of track
trackarea,area the
onlycomplete-processing
take a small part of the workflow
scene, the should be used; forworkflow
partial-scanning the ones shouldwith long be
scene full of track area, the complete-processing workflow should be used; for the ones with long focus
focus lens and track area only take a small part of the scene, the partial-scanning workflow should be
used.
lens and track area only take a small part of the scene, the partial-scanning workflow should be used.
used.

(a) (b)
(a) (b)

Figure 13. Cont.


Sensors 2019, 19, 2594 17 of 21
Sensors 2019, 19, x FOR PEER REVIEW 17 of 21

(c) (d)
Figure
Figure 13.13.Rough
Roughscanning
scanningover
over the
the scene
scene to
to find
find the
thepotential
potentialtrack
trackarea. (a)(a)
area. Segmentation
Segmentationresult of of
result
the whole railway scene images. (b) Different local areas. (c) Scanning the railway scene
the whole railway scene images. (b) Different local areas. (c) Scanning the railway scene image roughly image
roughly
using using proposed
proposed CNN. (d)CNN.
Area (d) Area
in red in redrectangle
dotted dotted rectangle is the potential
is the potential track track
area, area,
which which
will will
reduce
reduce segmentation calculation by three-quarters.
segmentation calculation by three-quarters.

Table
Table 3. 3.Calculation
Calculationtime
time of
of the
the comparison
comparison experiments
experimentswith
withdifferent workflow
different to segment
workflow the the
to segment
railway
railway scene.
scene.

Complete-Processing
Complete-Processing
Partial-Scanning
Partial-ScanningWorkflow
Workflow
Workflow
Workflow
Scan Segmentation
Segmentation andand
Proportion of Total Time
Time Scan Time Proportion Classification
Classification Time
Time
Total Time
(s) Track Area
of Track Area (s) (s) (s) (s)
(s) (s)
(s)
1 0.297
1 0.297 41.7% 41.7% 1.042
1.042 1.339
1.339 2.5 2.5
2 0.297
2 0.297 25% 25% 0.625
0.625 0.922
0.922 2.5 2.5
3 0.297
3 0.297 75% 75% 1.875
1.875 2.172
2.172 2.5 2.5
4 0.297
4 0.297 40% 40% 11 1.927
1.927 2.5 2.5
5 5
0.297 0.297 30% 30% 0.75
0.75 1.047
1.047 2.5 2.5

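To make the partial-scanning workflow concrete, the sketch below outlines its two ingredients: the rough 6 × 10 cell scan that locates the potential track area, and the timing rule that decides when partial scanning pays off. This is an illustrative outline only: classify_cell stands for the simplified CNN of Section 4, the size of the peripheral zone (margin) is an assumed parameter, and all names are hypothetical. The 0.297 s scan time and the 2.5 s complete-processing time are taken from Table 3, which gives the break-even proportion (2.5 − 0.297)/2.5 ≈ 88.1% quoted above.

import numpy as np
from PIL import Image

def rough_scan(scene, classify_cell, grid_rows=6, grid_cols=10, cell_size=64, margin=0.5):
    # Roughly locate the potential track area by classifying a coarse grid of cells.
    # scene: H x W x 3 uint8 image; classify_cell(crop) -> True if the crop is labeled 'track'.
    h, w = scene.shape[:2]
    ch, cw = h // grid_rows, w // grid_cols
    track_cells = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            # Cell plus its peripheral zone, clipped at the image border.
            y0, y1 = max(0, int((r - margin) * ch)), min(h, int((r + 1 + margin) * ch))
            x0, x1 = max(0, int((c - margin) * cw)), min(w, int((c + 1 + margin) * cw))
            crop = np.array(Image.fromarray(scene[y0:y1, x0:x1]).resize((cell_size, cell_size)))
            if classify_cell(crop):  # the crop's label represents its central cell
                track_cells.append((r, c))
    if not track_cells:
        return None
    rows = [r for r, _ in track_cells]
    cols = [c for _, c in track_cells]
    # Minimum enclosing rectangle of the 'track' cells, in pixel coordinates (x0, y0, x1, y1).
    return min(cols) * cw, min(rows) * ch, (max(cols) + 1) * cw, (max(rows) + 1) * ch

def choose_workflow(track_proportion, scan_time=0.297, complete_time=2.5):
    # Table 3 suggests the segmentation time grows roughly in proportion to the track area,
    # so partial scanning wins while scan_time + p * complete_time < complete_time,
    # i.e., for p below (2.5 - 0.297) / 2.5, roughly 88.1%.
    partial_time = scan_time + track_proportion * complete_time
    return "partial-scanning" if partial_time < complete_time else "complete-processing"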
5.3. Metrics

To evaluate the segmentation performance, three criteria are used. The first one is the intersection
over union (IU), defined in Equation (11), where L represents the ground truth and R represents the
segmentation result; the second one is the pixel accuracy (PA), defined in Equation (12), which evaluates
how much of the area that needs to be surveilled is actually segmented; and the extra pixel (EP) ratio,
defined in Equation (13), evaluates the portion of the segmented area that does not need to be surveilled.
PA reflects the missing part of the track area, which would cause a missed alarm, and EP reflects the
extra part of the track area, which would cause a false alarm.

IU = (L ∩ R) / (L ∪ R)  (11)

PA = (L ∩ R) / L  (12)

EP = (R − L ∩ R) / L  (13)
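For reference, Equations (11)–(13) can be computed directly on binary track-area masks; the following minimal NumPy sketch (function and variable names are illustrative, not from the paper) shows the three criteria on a toy example.

import numpy as np

def segmentation_metrics(label, result):
    # label: ground-truth track-area mask L; result: segmented mask R (same shape, boolean).
    L = label.astype(bool)
    R = result.astype(bool)
    inter = np.logical_and(L, R).sum()
    iu = inter / np.logical_or(L, R).sum()   # Equation (11)
    pa = inter / L.sum()                     # Equation (12)
    ep = (R.sum() - inter) / L.sum()         # Equation (13)
    return iu, pa, ep

# Toy 4 x 4 scene: a two-row ground-truth strip and a result shifted down by one row.
L = np.zeros((4, 4), dtype=bool); L[1:3, :] = True
R = np.zeros((4, 4), dtype=bool); R[2:4, :] = True
print(segmentation_metrics(L, R))  # (0.333..., 0.5, 0.5)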
5.4. Performance of the Proposed Segmentation Algorithm
The proposed algorithm is compared with MCG and FCN using images from the railway dataset,
and some examples are shown in Figure 14. In the experiment, the computation platform is equipped
with an Intel i5-6500 CPU and 8 GB DDR3 memory, without GPU, running MATLAB 2012, and images
in the dataset are resized to 90 × 150. The MCG method is the pre-trained demo from [17]. The FCN
network uses a standard VGG16 structure trained on the VOC2012 dataset for feature extraction,
and upsamples the outputs of the third, fourth, and seventh convolution layers.

Figure 14. Using different algorithms to detect the track area. (a) The original railway scenes. (b) Ground truth of track areas. (c) Results of the MCG algorithm. (d) Results of the FCN algorithm. (e) Results of our algorithm.
The missing part and the extra part of the segmented track area are shown in Figure 15. For the
MCG algorithm, CRFs were used to combine the fragmented regions into one unified area based on
texture, which caused the missing part (shown in Figure 15e) because of the texture difference between
the nearby track and the distant track. The performance of the FCN algorithm was slightly better than
its original results in [19] because the railway scene is monotonous and contains a small number of
categories, but not significantly better, because the shape and color textures of scene images sampled
under different illuminations, weather, and seasons are still complex. As shown in Figure 15f,i,
the smooth boundary line of the FCN algorithm was not suitable for our railway scene parsing because
of the concave and convex shapes it produces at the straight and sharp edges of the region, especially
near areas with an acute angle and a straight line. These concave and convex shapes caused both a
missing part and an extra part of the track area when compared with the ground truth, which would
trigger both missed alarms and false alarms. For the engineering application, our system would rather
release a false alarm than miss a true alarm.

Figure 15. Missing and extra areas of different methods compared with the ground truth. (a) Manual label of track areas. (b) Results of the MCG. (c) Results of the FCN. (d) Results of our method. (e) Missing part of MCG. (f) Missing part of FCN. (g) Missing part of our method. (h) Extra part of MCG. (i) Extra part of FCN. (j) Extra part of our method.

The performances of the three algorithms are shown in Table 4. It can be seen that the proposed
algorithm with four optimal Gaussian kernels achieves the highest PA score, which means that it
recovers the greatest portion of the surveillance area and is therefore preferred for applications.

Table 4. Experimental results of different algorithms.

Algorithm                                         Mean IU   Mean PA   Mean EP   Time (s)
MCG                                               72.05%    79.94%    10.63%    7
FCN                                               89.83%    91.26%    16.20%    41
Our Algorithm (four optimal Gaussian kernels)     81.94%    95.90%    18.17%    0.9–2.8
Our Algorithm (eight regular Gaussian kernels)    85.23%    93.85%    17.56%    1.1–4.4

6. Conclusions
The proposed algorithm uses an adaptive feature distribution extractor for railway track
segmentation by making full use of the strong linear characteristics of railway scenes and the
typical categories of their local areas. A good balance between segmentation precision, recognition
accuracy, calculation time, and the complexity of manual operation can be achieved. By using the
proposed algorithm, a railway intrusion detection system can automatically and accurately delimit
the boundaries of a surveillance scene in real time and greatly improve the efficiency of system
operation. Considering that, in China, there are over 29,000 km of high-speed railways and the average
density of cameras on high-speed railway lines is about 2.92 cameras/km, the proposed algorithm is
of great significance for improving this efficiency.
The proposed algorithm can also be applied to the surveillance systems of public places such as
airport aprons, highway pavement, and squares. These places share common characteristics: simple
structures full of straight lines, such as airplane runways and their functional areas, vehicles and traffic
lanes, or pedestrians and sidewalk lines. Before applying the method, however, the training dataset of
the simplified CNN has to be extended with the new categories appearing in such scenes; the proposed
algorithm can then segment the scene and label each local area.

Author Contributions: Conceptualization, Y.W., L.Z., and Z.Y.; Investigation, Y.W. and B.G.; Methodology,
Y.W. and L.Z.; Project administration, Z.Y.; Software, Y.W.; Validation, Y.W.; Writing—original draft, Y.W.;
Writing—review & editing, Y.W. and L.Z.
Funding: This research was funded by National Key Research and Development Program of China (2016YFB1200401).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Wang, Y.; Yu, Z.; Zhu, L.; Guo, B. Fast feature extraction algorithm for high-speed railway clearance intruding
objects based on CNN. J. Sci. Instrum. 2017, 38, 1267–1275.
2. Hou, G.; Yu, Z. Research on flexible protection technology of high speed railways. J. Railw. Stand. Des. 2006,
11, 16–18.
3. Cao, D.; Fang, H.; Wang, F.; Zhu, H.; Sun, M. A fiber bragg-grating-based miniature sensor for the fast
detection of soil moisture profiles in highway slopes and subgrades. Sensors 2018, 18, 4431. [CrossRef]
[PubMed]
4. Zhang, Y. Application study of fiber bragg-grating technology in disaster prevention of high-speed railway.
J. Railw. Signal. Commun. 2009, 45, 48–50.
5. Oh, S.; Kim, G.; Lee, H. A monitoring system with ubiquitous sensors for passenger safety in railway platform.
In Proceedings of the 7th International Conference on Power Electronics, Daegu, Korea, 22–26 October 2007;
pp. 289–294.

6. Wang, Y.; Shi, H.; Zhu, L.; Guo, B. Research of surveillance system for intruding the existing railway lines
clearance during Beijing-Shanghai high speed railway construction. In Proceedings of the 3rd International
Symposium on Test Automation and Instrumentation, Xiamen, China, 22–25 May 2010; pp. 218–223.
7. Luy, M.; Cam, E.; Ulamis, F.; Uzun, I.; Akin, S.I. Initial results of testing a multilayer laser scanner in a collision
avoidance system for light rail vehicles. Appl. Sci. 2018, 8, 475. [CrossRef]
8. Guo, B.; Yu, Z.; Zhang, N.; Zhu, L.; Gao, C. 3D point cloud segmentation, classification and recognition
algorithm of railway scene. Chin. J. Sci. Instrum. 2017, 38, 2103–2111.
9. Zhan, D.; Jing, D.; Wu, M.; Zhang, D.; Yu, L.; Chen, T. An accurate and efficient vision measurement approach
for railway catenary geometry parameters. IEEE Trans. Instrum. Meas. 2018, 67, 2841–2853. [CrossRef]
10. Guo, B.; Zhu, L.; Shi, H. Intrusion detection algorithm for railway clearance with rapid DBSCAN clustering.
J. Sci. Instrum. 2012, 33, 241–247.
11. Guo, B.; Yang, L.; Shi, H.; Wang, Y.; Xu, X. High-speed railway clearance intrusion detection algorithm with
fast background subtraction. J. Sci. Instrum. 2016, 37, 1371–1378.
12. Shi, H.; Chai, H.; Wang, Y. Study on railway embedded detection algorithm for railway intrusion based on
object recognition and tracking. J. China Railw. Soc. 2015, 37, 58–65.
13. Vazquez, J.; Mazo, M.; Lazaro, J.L.; Luna, C.A.; Urena, J.; Garcia, J.J.; Hierrezuelo, L. Detection of moving
objects in railway using vision. In Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy,
14–17 June 2004; pp. 872–875.
14. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to
state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [CrossRef] [PubMed]
15. Arbeláez, P. Boundary extraction in natural images using ultrametric contour maps. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, New York, NY, USA,
17–22 June 2006; p. 182.
16. Verbeek, J.; Triggs, B. Region classification with Markov field aspect models. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
17. Ladický, L.; Russell, C.; Kohli, P.; Torr, P. Associative hierarchical CRFs for object class image
segmentation. In Proceedings of the 12th International Conference on Computer Vision, Kyoto, Japan,
29 September–2 October 2009; pp. 739–746.
18. Arbeláez, P.; Pont-Tuset, J.; Barron, J.; Marques, F.; Malik, J. Multiscale combinatorial grouping. In Proceedings of
the 27th IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 328–335.
19. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L. Handwritten digit
recognition with a back-propagation network. In Neural Information Processing Systems; Morgan-Kaufmann:
San Francisco, CA, USA, 1990; pp. 396–404.
20. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans.
Pattern Anal. Mach. Intell. 2014, 39, 640–651. [CrossRef] [PubMed]
21. Ren, X.; Malik, J. Learning a classification model for segmentation. In Proceedings of the 9th International
Conference on Computer Vision, Nice, France, 13–16 October 2003; Volume 1, pp. 10–17.
22. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22,
888–905.
23. Zhu, Y.; Luo, K.; Ma, C.; Liu, Q.; Jin, B. Superpixel segmentation based synthetic classifications with clear
boundary information for a legged robot. Sensors 2018, 18, 2808. [CrossRef] [PubMed]
24. Liu, Y.; Chen, Y.; Zhang, S. Traffic sign recognition based on pyramid histogram fusion descriptor and
HIK-SVM. J. Transp. Syst. Eng. Inf. Technol. 2017, 17, 220–226.
25. Fang, Z.; Duan, J.; Zheng, B. Traffic signs recognition and tracking based on feature color and SNCC algorithm.
J. Transp. Syst. Eng. Inf. Technol. 2014, 14, 47–52.
26. Liu, K.; Ying, Z.; Cui, Y. SAR image target recognition based on unsupervised K-means feature and data
augmentation. J. Signal Process. 2017, 33, 452–458.
27. Zhang, X.; Fan, J.; Xu, J.; Shi, X. Image super-resolution algorithm via K-means clustering and support vector
data description. J. Image Graph. 2016, 21, 135–144.
28. Ma, G.; Tian, Y.; Li, X. Application of K-means clustering algorithm in color image segmentation of grouper
in seawater background. J. Comput. Appl. Softw. 2016, 33, 192–195.

29. Arbeláez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 898–916. [CrossRef] [PubMed]
30. Pont-Tuset, J.; Arbeláez, P.; Barron, J.; Marques, F.; Malik, J. Multiscale combinatorial grouping for image
segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 128–140.
[CrossRef] [PubMed]
31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single shot multibox
detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam,
The Netherlands, 8–16 October 2016; Volume 9905, pp. 21–37.
32. Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 35, 1915–1929. [CrossRef] [PubMed]
33. Couprie, C.; Farabet, C.; Najman, L.; LeCun, Y. Indoor semantic segmentation using depth information.
arXiv, 2013; arXiv:1301.3572.
34. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D Images for object detection
and segmentation. In Proceedings of the 13th European Conference on Computer Vision (ECCV 2014),
Zurich, Switzerland, 6–12 September 2014; Volume 8695, pp. 345–360.
35. Petrelli, A.; Pau, D.; Di Stefano, L. Analysis of compact features for RGB-D visual research. In Proceedings
of the 18th International Conference on Image Analysis & Processing, Genoa, Italy, 7–11 September 2015;
Volume 9280, pp. 14–24.
36. Canny, J.F. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8,
679–698.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).