
Robotics and Autonomous Systems 116 (2019) 80–97


A deep learning gated architecture for UGV navigation robust to sensor failures

Naman Patel, Anna Choromanska, Prashanth Krishnamurthy, Farshad Khorrami
Control/Robotics Research Laboratory (CRRL), Department of Electrical and Computer Engineering, NYU Tandon School of Engineering, 5 MetroTech Center, USA

highlights

• An end-to-end learning system for UGV indoor corridor tracking using camera and LiDAR.
• A deep learning framework for sensor fusion that is robust to sensor failure.
• Gating based dropout regularization to enable robustness to various data corruptions.
• Experimental demonstration of the efficacy of the proposed system on our in-house UGV.

article info

Article history:
Received 6 July 2018
Received in revised form 22 January 2019
Accepted 7 March 2019
Available online 18 March 2019

Keywords:
Robustness to sensor failures
Deep learning for autonomous navigation
Vision/LiDAR based navigation
Learning from demonstration
Sensor fusion
Autonomous vehicles

abstract

In this paper, we introduce a novel methodology for fusing sensors and improving robustness to sensor failures in end-to-end learning based autonomous navigation of ground vehicles in unknown environments. We propose the first learning based camera–LiDAR fusion methodology for autonomous indoor navigation. Specifically, we develop a multimodal end-to-end learning system, which maps raw depths and pixels from LiDAR and camera, respectively, to the steering commands. A novel gating based dropout regularization technique is introduced which effectively performs multimodal sensor fusion and reliably predicts steering commands even in the presence of various sensor failures. The robustness of our network architecture is demonstrated by experimentally evaluating its ability to autonomously navigate in the indoor corridor environment. Specifically, we show through various empirical results that our framework is robust to sensor failures, partial image occlusions, modifications of the camera image intensity, and the presence of noise in the camera or LiDAR range images. Furthermore, we show that some aspects of obstacle avoidance are implicitly learned (while not being specifically trained for it); these learned navigation capabilities are shown in ground vehicle navigation around static and dynamic obstacles.
© 2019 Published by Elsevier B.V.

1. Introduction

There have been significant advances in machine learning based approaches for robotic applications in recent years due to the rapid developments in deep learning techniques. Deep learning approaches have the ability to leverage large amounts of labeled and contextually rich data to give desired outputs. Some recent applications of deep learning that are relevant to this paper include autonomous car driving systems [1–12].

In this work, we address the sensor fusion problem (Fig. 1) in the context of developing end-to-end learning frameworks resilient to data corruption for self-driving vehicles equipped with multiple sensors (mainly camera and LiDAR, although the same methodology may be utilized for fusing a larger number of sensors). Our previous work [7] deals with the problem of deep learning based sensor fusion by modifying the training methodology to include cases wherein either of the two modalities was randomly shut off to deal with the possibility of sensor failure. This work is motivated by three primary objectives. The first objective is to design an architecture which inherently performs robust sensor fusion leveraging the complementary strengths of the two sensors. The second objective is to design a system framework which is robust to corrupted data (e.g., partial occlusion, noisy images, and varying intensity) without explicitly/separately being trained for this purpose, thus requiring less training data. The third primary objective is to study the possibility of obtaining better performance characteristics with the multimodal fused system than with either sensor modality separately (e.g., see Fig. 2).

Fig. 1. End-to-end modality fusion framework for learning autonomous navigation policy in indoor environments.

Fig. 2. Instances of undesirable behavior of the UGV when using only camera or LiDAR for autonomous navigation.

∗ Corresponding author.
E-mail addresses: naman.patel@nyu.edu (N. Patel), ac5455@nyu.edu (A. Choromanska), prashanth.krishnamurthy@nyu.edu (P. Krishnamurthy), khorrami@nyu.edu (F. Khorrami).

https://doi.org/10.1016/j.robot.2019.03.001
0921-8890/© 2019 Published by Elsevier B.V.
Our work focuses on the problem of navigation of an autonomous unmanned ground vehicle (UGV) using vision from a monocular camera and depth measurements from a LiDAR in an indoor environment with deep learning. A novel approach to modality fusion is presented to generate steering commands for autonomous navigation of a ground vehicle. The proposed methodology naturally extends to the setting with multiple sensors, where one or more sensors' data might be corrupted.

Previous approaches for navigating in an unknown environment are based on classical simultaneous localization and mapping (SLAM) based frameworks where a 3D map is obtained using photometric and depth information from sensors such as camera and LiDAR, and then path planning is performed on these reconstructed maps. The limitation of these approaches is that the reconstructed maps have to be accurate to navigate safely. Additionally, these approaches only use geometric information from the map, and nothing about the planning decision can be inferred before the environment is explicitly observed, even when there are obvious patterns. Thus, learning based approaches are a good alternative as these frameworks learn the planning decisions from past observations.

In this work, we propose a deep learning architecture for the sensor fusion problem that consists of two convolutional neural networks (CNNs), each processing one of the input modalities, which are then fused with a dropout regularization aided gating mechanism. The gating mechanism is realized as a fully-connected network that is trained to generate environment-appropriate scalar weights for LiDAR and camera using the CNN-generated feature vectors. The scalar weights are then passed through a dropout layer [13], which randomly makes the scalar weights zero. These scalar weights are then utilized to obtain the fused embedding including both modalities. The fused embedding is then passed through additional network layers to generate the steering command for the vehicle. The training of the proposed network does not rely on the introduction of corrupted data in the training batches (to mimic sensor failures) as introduced in our previous work [7]. The novel aspects of this paper are as follows:

• an end-to-end learning system for the problem of indoor corridor tracking with a ground vehicle registering camera and LiDAR data,
• a new deep learning architecture for sensor fusion that leads to a system for autonomous driving of ground vehicles indoors that is robust to the presence of partial data from a single modality,
• a gating based dropout regularization technique which effectively performs sensor fusion and makes the network robust to various data corruptions without retraining (thus reducing the required training data),
• experimental demonstration of the efficacy of the proposed system on our in-house developed ground vehicle that includes a real-time autopilot, single board computer with graphics processing unit (GPU), and integrated camera and LiDAR sensors.

The paper is organized as follows. Related literature is briefly summarized in Section 2. The problem formulation is presented in Section 3. The architectures for sensor fusion developed in this paper are discussed in Section 5, including an architecture based on a gating mechanism as well as two other architectures more similar to prior literature (although in a different context than in this paper). The training mechanisms are also discussed in Section 5. Empirical verification studies are presented in Section 6. Finally, concluding remarks are provided in Section 7.

2. Related work

Autonomous robot navigation in a variety of environments has been extensively studied in prior works [14–17]. Many traditional methods for autonomous navigation use geometric approaches such as SLAM for path planning in both indoor and outdoor environments using vision and depth based sensors, such as camera, stereo camera, and LiDAR [18–20]. Vision processing for indoor wall detection and corridor following has also been studied using optical flow [21], visual landmark [22], and object recognition based techniques [23,24]. Obstacle avoidance and navigation in uncertain environments using these traditional approaches have also been applied to aerial vehicles [25], humanoids [26,27], and unmanned sea surface vehicles [28].

These traditional approaches use the geometry of the environment without considering the semantic information about the landmarks. These approaches only take the observed states into consideration for autonomous navigation and do not use any information learned from similar environments. They might also fail to localize in environments such as long corridors with textureless walls, which are common in indoor environments. To resolve these issues, learning based approaches are being used for autonomous navigation. These learning based approaches use two different paradigms to learn from data, namely imitation learning and reinforcement learning.

An early work addressed end-to-end navigation of a car using a laser range finder and a camera [29] using a single-layer neural network to process 30 × 32 camera images and 8 × 32 LiDAR depth range images. Neural networks have been used for learning driving decision rules (by extracting key indicators from sensor data) for autonomous navigation using a camera [30,31]. End-to-end navigation of cars has been addressed using a stereo camera [2] as well as a single front-facing camera [3,4,32,33] (also, some
debugging tools were developed for these autonomous systems to understand the visual cues that the network uses to produce a steering command, e.g., [34]). Model predictive control based approaches which learn navigation from demonstration have also been proposed [35]. Self-supervised learning methods have also been introduced to learn autonomous navigation safely [1,9,36–38] using stereo and monocular cameras. Learning based sensor fusion techniques have also been proposed for navigation of a ground vehicle in an indoor [7,8,39] as well as in a simulation [40] environment. A conditioned imitation learning [12] based navigation framework, where the vehicle can be guided to move in specific directions, has also been introduced. These learning based approaches have also been used for autonomous aerial vehicle navigation [41–43]. It is to be noted that the particular problem considered in this paper (i.e., learning based indoor navigation using vision and LiDAR) has not been considered in prior literature. Furthermore, unlike the prior approaches discussed above, the primary focus in this paper is the development of a network architecture and regularization and training methodologies to achieve robustness to sensor noise/non-idealities including intermittent loss of sensor modalities, occlusions, lighting variations, and sensor noise.

Reinforcement learning techniques have also been utilized to teach a mobile robot to avoid obstacles and navigate through corridors using sensors such as a laser range finder [44]. Policy search using reinforcement learning through an auxiliary task, like estimating depth from monocular images to estimate steering directions, was introduced in [45]. Deep reinforcement learning (DRL) based approaches addressing the problem of integrating path planning and learning have been presented in simulated [46,47] as well as real-world environments [6]. DRL was used to develop efficient policies for socially aware motion planning [48], enabling the vehicle to navigate safely and efficiently in pedestrian-rich environments. Deep inverse reinforcement learning based approaches have also been proposed to learn cost functions for path planning [10,11,49], which can be used to refine manually handcrafted cost functions. Additionally, end-to-end memory based mapping and planning frameworks different from reactive approaches have been introduced recently [50–53].

In the deep learning literature, the fusion of different modalities has been studied for various other applications (i.e., in contexts other than the ground vehicle application considered here) in recent years, such as in [54] for indoor scene recognition and in [55] for object detection and segmentation using images and their corresponding depth map representations. A framework to embed point-cloud and natural language instruction combinations and manipulation trajectories in the same semantically meaningful space was introduced to perform end-to-end experiments on a PR2 robot in [56]. A recurrent neural network [57] was applied to implicitly learn the dependencies between RGB images and depth maps to perform semantic segmentation. In [58], an RGB image and its corresponding 3D point cloud are used as inputs for 3D object detection. RGB images, optical flow, and LiDAR range images are combined to form a six channel input to a deep neural network for object detection [59]. Various other multimodal learning frameworks have also been proposed for object detection, recognition, and segmentation [60–63].

Our proposed system introduces several novel aspects (outlined in the introduction) compared to the prior works summarized above. Specifically, we introduce a new gating mechanism based dropout regularization that enables modality fusion for robust end-to-end learning of autonomous corridor driving. The efficacy of the proposed approach is demonstrated through experimental studies on our UGV platform (Fig. 3).

Fig. 3. Our unmanned ground vehicle system with integrated LiDAR (VLP-16) and camera sensors (LI-OV5640).

3. Problem formulation

Our goal is to learn, in an end-to-end manner, autonomous navigation policies for a ground vehicle in an indoor environment using camera and LiDAR sensors. The visual RGB image gives information about the texture and color of objects in the nearby environment as well as the overall type of environment. The depth range image provides geometric information which complements the visual information provided by the camera. Within this context, we aim to explore effective deep neural network architectures and novel training techniques for fusion of the camera and LiDAR modalities to obtain policies robust to sensor failures and data corruption as well as to achieve performance superior to what is achieved by either sensor separately (e.g., Fig. 2). These policies are learned using data recorded under human teleoperation of the UGV as described in detail in Section 4.

The aforementioned approach is applicable when multiple cameras or LiDARs are utilized as well. It should be pointed out that a single modality system will have less robustness to sensor failures/corruptions than a multi-modal sensor system. Under a single sensor scenario (either a camera or a LiDAR), the control system is more likely to fail when partial or complete sensor failure is present. As illustrated in Fig. 2, each sensor separately can have limitations in environment perception (e.g., lack of resolution, blind spots, lighting conditions, or data corruption) causing poor performance or collision of the vehicle. Additionally, generating a training dataset even under no sensor failure conditions requires a reasonable amount of data. The NetGatedDropout architecture presented here alleviates these issues.

Classical methods for navigation [14] are based on the SLAM framework where sensory measurements from camera, LiDAR, etc. are fused using traditional filtering and estimation methods to obtain the vehicle pose, which is used for path planning. As shown in Fig. 4, maps based on geometry with no semantic information (in a textureless environment), using a framework such as ORB-SLAM2 [64], and maps based on vision from a camera, using a framework such as ORB-SLAM [65], do not reconstruct an accurate map of the environment. It is to be noted that the results shown in Fig. 4 are when all the sensors used by the framework are operating normally. The performance of these SLAM frameworks will further degrade significantly during sensor failures or sensor data corruptions (e.g., occlusions, sensor noise, etc.), to which our proposed framework is robust. Additionally, from Fig. 4, it is noticed that in challenging indoor environments with scarce
features, monocular SLAM frameworks such as ORB-SLAM are ineffective and have to be replaced by frameworks using multiple modalities. Thus, sensor fusion is a crucial component for autonomous navigation.

One approach to navigating in an unknown environment is to plan a path based on a constructed map of the environment and localize with respect to that map (i.e., simultaneous localization and mapping — SLAM). A controller ensures tracking of the desired path using the current position of the robot obtained from SLAM. However, in challenging environments where the number of features is small (e.g., a long textureless hallway) or varying (objects are moved around), typical SLAM algorithms do not perform well (Fig. 4). Moreover, the control loop is entirely based on perceived geometry. The decoupled state estimation and execution of the control loop makes the system less adaptive due to vulnerability to sensor failures and data corruption. Our work addresses these issues by forming the estimation and control problem as an end-to-end learning problem where both geometric and semantic cues are simultaneously utilized.

Fig. 4. Reconstructed ground vehicle trajectory using monocular ORB-SLAM (yellow) [65] and ORB-SLAM2 (black) [64] (with camera and LiDAR) tracking the center (blue) of a rectangular indoor corridor environment (anti-clockwise trajectory). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Experimental testbed

The proposed system is implemented on our in-house integrated UGV (Fig. 3) with differential drive and a Tegra TX1 onboard GPU for implementation of the deep networks. The vehicle is equipped with a Leopard Imaging LI-OV5640 camera running at 30 Hz (downsampled to 10 Hz with the same time stamp as the LiDAR), a Velodyne 'Puck' VLP-16 LiDAR running at 10 Hz, and an in-house designed autopilot that translates generated velocity and heading commands (differential speed to left and right wheels) to PWM signals driving the motor controllers commutating the actual command signals to the motors for the given speed and heading. The PWM signals for both speed and heading are generated by the autopilot with a period of 20 ms and an ON period varying from 1 ms to 2 ms. The training dataset (camera image, LiDAR range image, and steering commands) is collected from the UGV while a human driver operates it (teleoperation).

The front and rear wheels are all driven independently and the middle wheels are passive and carry encoders. Additional sensors are built into the platform (i.e., inertial navigation units and wheel encoders) but are not used in this paper. The outer-loop commands generated by the end-to-end network driving the motors are generated at a rate of 10 Hz. Both the velocity command, v (kept constant at 0.5 m/s), and the network steering command predictions, ω, lie in the interval [−100, 100] and are mapped to their respective PWM ON periods (in seconds) as shown in Eqs. (1) and (2):

t_on^v = 1.5 + 0.5 × (v/100),    (1)

t_on^ω = 1.5 + 0.5 × (ω/100).    (2)

The mapping from differential PWM command to angular velocity of the vehicle has its own dynamics due to power amplifiers, motor dynamics, and gear train and ground friction. Hence, the plots are in terms of the percentage of the differential PWM from its neutral value (0% corresponds to no turn, i.e., moving straight forward) given to the motors driving the wheels on the opposite sides of the vehicle. For example, the vehicle's angular velocities corresponding to differential PWM commands (steering commands) of 4%, 8%, and 16% in steady state are 0.15, 0.56, and 1.89 rad/s.

5. Proposed sensor fusion framework

In this section, our proposed framework and the corresponding deep learning based network architectures with their implementation details are presented.

5.1. Network architectures

Various network architectures are considered for the sensor fusion task. The primary architecture (Fig. 5), denoted as NetGatedDropout, is a gating based architecture aided by dropout regularization described further below. We also consider three other architectures (denoted as NetEmb, NetConEmb, and NetGated) that are more similar to prior literature; these networks are also described further below.

The architectures of NetEmb, NetConEmb, NetGated, and NetGatedDropout are described in Tables 1, 2, 3, and 4, respectively. In NetEmb (which shares the same first 20 layers as NetConEmb, NetGated, and NetGatedDropout), feature maps from the RGB image and LiDAR depth range image are extracted through a series of convolution layers. Next, the features extracted from the convolution layers in both parallel networks are embedded into a feature vector using a fully connected network. The intuition behind embedding features is that the features extracted from the camera image and depth range image will have the same dimension. This ensures that one modality does not have a greater effect on the result than the other due to unequal size. The number of parameters in NetEmb is 4,754,357. In NetConEmb, the convolution feature maps are passed into a fully connected network. As shown in Table 1, the network architecture consists of 8 convolution layers and 1 fully connected network for each modality and 2 fully connected networks for information fusion from the two modalities. Each convolution layer consists of 3 × 3 kernels which convolve through the input with a stride of 1 to generate feature maps which are then passed through a rectified linear unit (ReLU) non-linearity. The inputs are padded during convolution to preserve the spatial resolution. The feature maps are downsampled after every two convolution layers by a max-pooling operation with a window size and stride of 2 × 2. All hidden layers, including the fully connected layers, are equipped with ReLU non-linearities. NetConEmb has 9,243,573 parameters. The network learns its parameters by minimizing the Huber loss (δ = 1) between the predicted steering command and the command of the human driver.
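As a concrete illustration of Eqs. (1) and (2), the mapping from a command in [−100, 100] to its PWM ON period can be sketched as follows (a minimal sketch; the function name and the clamping of out-of-range commands are our own additions, not stated in the paper):

```python
def pwm_on_period(cmd):
    """Map a velocity or steering command in [-100, 100] to a PWM ON
    period per Eqs. (1)-(2): t_on = 1.5 + 0.5 * (cmd / 100).

    Out-of-range commands are clamped first (our assumption; the paper
    does not state how such values are handled).
    """
    cmd = max(-100.0, min(100.0, cmd))
    return 1.5 + 0.5 * (cmd / 100.0)

# A neutral command gives the mid-range ON period; the extremes give
# 1.0 and 2.0, spanning the stated ON-period range of the autopilot.
print(pwm_on_period(0))     # 1.5
print(pwm_on_period(-100))  # 1.0
print(pwm_on_period(100))   # 2.0
```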

Table 1
NetEmb: Deep learning based modality fusion architecture using embeddings. The left side of the table is for processing of the RGB image from the camera and
the right side of the table is for processing of the depth range image from the LiDAR. The feature vectors (of length 512) constructed from camera and LiDAR are
concatenated at layer 24.
Layer name Layer input (for Layer output Kernel Stride No. Layer name Layer input (for Layer output Kernel Stride No.
RGB image) size kernels LiDAR range size kernels
image)
1 Spatial 3 × 120 × 160 16 × 120 × 160 3 × 3 1 16 Spatial 1 × 900 × 16 16 × 900 × 16 3 × 3 1 16
convolution convolution
2 Rectified linear 16 × 120 × 160 16 × 120 × 160 – – – Rectified linear 16 × 900 × 16 16 × 900 × 16 – – –
unit unit
3 Spatial 16 × 120 × 160 16 × 120 × 160 3 × 3 1 16 Spatial 16 × 900 × 16 16 × 900 × 16 3 × 3 1 16
convolution convolution
4 Rectified linear 16 × 120 × 160 16 × 120 × 160 – – – Rectified linear 16 × 900 × 16 16 × 900 × 16 – – –
unit unit
5 Max pooling 16 × 120 × 160 16 × 60 × 80 2 × 2 2 – Max pooling 16 × 900 × 16 16 × 450 × 8 2 × 2 2 –
6 Spatial 16 × 60 × 80 32 × 60 × 80 3 × 3 1 32 Spatial 16 × 450 × 8 32 × 450 × 8 3 × 3 1 32
convolution convolution
7 Rectified linear 32 × 60 × 80 32 × 60 × 80 – – – Rectified linear 32 × 450 × 8 32 × 450 × 8 – – –
unit unit
8 Spatial 32 × 60 × 80 32 × 60 × 80 3 × 3 1 32 Spatial 32 × 450 × 8 32 × 450 × 8 3 × 3 1 32
convolution convolution
9 Rectified linear 32 × 60 × 80 32 × 60 × 80 – – – Rectified linear 32 × 450 × 8 32 × 450 × 8 – – –
unit unit
10 Max pooling 32 × 60 × 80 32 × 30 × 40 2 × 2 2 – Max pooling 32 × 450 × 8 32 × 225 × 4 2 × 2 2 –
11 Spatial 32 × 30 × 40 48 × 30 × 40 3 × 3 1 48 Spatial 32 × 225 × 4 48 × 225 × 4 3 × 3 1 48
convolution convolution
12 Rectified linear 48 × 30 × 40 48 × 30 × 40 – – – Rectified linear 48 × 225 × 4 48 × 225 × 4 – – –
unit unit
13 Spatial 48 × 30 × 40 48 × 30 × 40 3 × 3 1 48 Spatial 48 × 225 × 4 48 × 225 × 4 3 × 3 1 48
convolution convolution
14 Rectified linear 48 × 30 × 40 48 × 30 × 40 – – – Rectified linear 48 × 225 × 4 48 × 225 × 4 – – –
unit unit
15 Max pooling 48 × 30 × 40 48 × 15 × 20 2 × 2 2 – Max pooling 48 × 225 × 4 48 × 113 × 2 2 × 2 2 –
16 Spatial 48 × 15 × 20 64 × 15 × 20 3 × 3 1 64 Spatial 48 × 113 × 2 64 × 113 × 2 3 × 3 1 64
convolution convolution
17 Rectified linear 64 × 15 × 20 64 × 15 × 20 – – – Rectified linear 64 × 113 × 2 64 × 113 × 2 – – –
unit unit
18 Spatial 64 × 15 × 20 64 × 15 × 20 3 × 3 1 64 Spatial 64 × 113 × 2 64 × 113 × 2 3 × 3 1 64
convolution convolution
19 Rectified linear 64 × 15 × 20 64 × 15 × 20 – – – Rectified linear 64 × 113 × 2 64 × 113 × 2 – – –
unit unit
20 Max pooling 64 × 15 × 20 64 × 8 × 10 2 × 2 2 – Max pooling 64 × 113 × 2 64 × 57 × 1 2 × 2 2 –
21 Flatten 64 × 8 × 10 5120 – – – Flatten 64 × 57 × 1 3648 – – –
22 Fully connected 5120 512 – – – Fully connected 3648 512 – – –
23 Rectified linear 512 512 – – – Rectified linear 512 512 – – –
unit unit
24 Concatenate 512,512 1024 – – – – – – – – –
25 Fully connected 1024 32 – – – – – – – – –
26 Rectified linear 32 32 – – – – – – – – –
unit
27 Fully connected 32 10 – – – – – – – – –
28 Rectified linear 10 10 – – – – – – – – –
unit
29 Fully connected 10 1 – – – – – – – – –

Table 2
NetConEmb: Fusion architecture where the convolution feature maps are directly passed through
a fully connected network instead of first converting them into feature embeddings as done in
NetEmb. The first 20 layers and layers 25 to 29 are identical to NetEmb and the layers in bold are
the unique part of the network.
Layer name Layer input Layer output Layer name Layer input Layer output
1 .. 20 Same as Table 1
21 Flatten 64 × 8 × 10 5120 Flatten 64 × 57 × 1 3648
22 Concatenate 5120,3648 8768 – – –
23 Fully connected 8768 1024 – – –
24 Rectified linear unit 1024 1024 – – –
25 Fully connected 1024 32 – – –
26 Rectified linear unit 32 32 – – –
27 Fully connected 32 10 – – –
28 Rectified linear unit 10 10 – – –
29 Fully connected 10 1 – – –

Table 3
NetGated: Fusion architecture with gating mechanism based on computing scalar weights from the feature embeddings and then
constructing a combination of the feature embeddings based on the scalar weights. The first 24 layers are identical to NetEmb and
layers in bold are the unique part of the architecture.
Layer name Layer input Layer output Layer name Layer input Layer output
1 .. 20 Same as Table 1
21 Flatten 64 × 8 × 10 5120 Flatten 64 × 57 × 1 3648
22 Fully connected 5120 512 Fully connected 3648 512
23 Rectified linear unit 512 512 Rectified linear unit 512 512
24 Concatenate 512,512 1024 – – –
25 Fully connected 1024 64 – – –
26 Rectified linear unit 64 64 – – –
27 Fully connected 64 2 – – –
28 Split 2 1,1 – – –
29 Multiplication with output 23 1 512 Multiplication with output 23 1 512
30 Addition 512,512 512 – – –
31 Fully connected 512 32 – – –
32 Rectified linear unit 32 32 – – –
33 Fully connected 32 1 – – –

Table 4
NetGatedDropout: Fusion architecture with gating mechanism based on computing scalar weights from the feature embeddings and
then constructing a combination of the feature embeddings based on the scalar weights. This network architecture is the same as
NetGated except that one additional layer (layer 28 shown in bold) is introduced.
Layer name Layer input Layer output Layer name Layer input Layer output
1 .. 20 Same as Table 1
21 Flatten 64 × 8 × 10 5120 Flatten 64 × 57 × 1 3648
22 Fully connected 5120 512 Fully connected 3648 512
23 Rectified linear unit 512 512 Rectified linear unit 512 512
24 Concatenate 512,512 1024 – – –
25 Fully connected 1024 64 – – –
26 Rectified linear unit 64 64 – – –
27 Fully connected 64 2 – – –
28 Dropout (with p = 0.5) 2 2 – – –
29 Split 2 1,1 – – –
30 Multiplication with output 23 1 512 Multiplication with output 23 1 512
31 Addition 512, 512 512 – – –
32 Fully connected 512 32 – – –
33 Rectified linear unit 32 32 – – –
34 Fully connected 32 1 – – –

In NetGated, the feature embedding architecture is similar to generalized Bernouilli distribution) are 2N − 1 for N sensors as
NetEmb. The embeddings are passed through a gating network we do not include the configuration corresponding to when both
to fuse the information from both the modalities to generate the modalities are turned off.
steering command. The gating network takes the two embeddings
obtained from RGB image and range image as input and outputs the corresponding two weights. These weights are then used to perform a weighted sum of the embeddings. This weighted sum is then passed through two fully connected networks to obtain the steering command. Each of the considered network architectures is an end-to-end deep learning system that takes an RGB image and a LiDAR depth range image as input and fuses the modalities using a deep neural network to predict the appropriate steering command of the ground vehicle for autonomous navigation. NetGated has 4,802,945 parameters. In NetGatedDropout, we add an extra dropout layer to the two weights which randomly makes them zero during training. This essentially regularizes the network by making the network not dependent on necessarily having both the modalities to predict the steering command and thus results in the network learning to perform robust sensor fusion. The proposed dropout regularization is different from the usually utilized dropout. The usual dropout layer is based on the Bernoulli distribution, whereas the proposed dropout regularization is based on the generalized Bernoulli distribution, where a configuration is selected based on the given probability. The number of configurations (for

5.2. Implementation and training

The inputs to the networks are the normalized RGB image with a field of view of 72° and the LiDAR range image, which is cropped such that the front half with a field of view of 180° is visible. Both modalities are normalized by making each channel of the modality in the training dataset zero mean with a standard deviation of 1. At testing time, the mean and standard deviation calculated during training are used to normalize the input.

To train the networks, camera and LiDAR datasets were obtained by manually driving the vehicle (at constant speed) through the corridor environment, obtaining approximately the same amount of training data for straight motion, left turns, and right turns. The network was trained on a dataset of 14 456 images and the corresponding range images (around 24 min of data). It was trained using the Adagrad optimizer with a base learning rate of 0.01. Bias terms for all the layers in the networks are disabled. The network is implemented using the PyTorch framework [66] and trained using an NVIDIA Titan X Pascal GPU based workstation.
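The per-channel normalization described above can be sketched as follows (our own illustrative code, not the authors' implementation): statistics are computed once on the training set and reused at test time.

```python
def channel_stats(train_pixels):
    # per-channel mean and standard deviation computed on the training set
    n = len(train_pixels)
    mean = sum(train_pixels) / n
    var = sum((p - mean) ** 2 for p in train_pixels) / n
    return mean, var ** 0.5

def normalize(pixels, mean, std):
    # zero mean, unit standard deviation; at test time the *training*
    # statistics are reused, exactly as described in Section 5.2
    return [(p - mean) / std for p in pixels]
```
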
86 N. Patel, A. Choromanska, P. Krishnamurthy et al. / Robotics and Autonomous Systems 116 (2019) 80–97
Fig. 5. NetGatedDropout: Our proposed end-to-end learning based architecture for fusion of camera and LiDAR sensors.
Our end-to-end learning framework learns to predict the appropriate steering command by learning the weights of the network which minimize the Huber loss between the predicted steering commands and the recorded human steering commands. We use the Huber loss instead of the mean squared error since an instability due to the divergence of the gradients was noted with the mean squared error loss. The Huber loss [67], also used for bounding box regression, is given by

L(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2, & \text{for } \|y - f(x)\| \leq \delta \\ \delta \|y - f(x)\| - \frac{1}{2}\delta^2, & \text{otherwise.} \end{cases} \quad (3)

We follow the training procedure explained in our previous work [8] for NetEmb, NetConEmb, and NetGated. To train the network to utilize either sensor when available and also to be robust to the possibility of sensor failure, the training of these networks was performed in two stages. In the first stage of training, the network is trained with the corresponding LiDAR depth range images and camera RGB images for each time step as inputs. In the second stage, the training of the network is continued with corrupted data (i.e., with one modality shut down to mimic sensor failure). Specifically, the network is trained with 60% corrupted data for each epoch, out of which 50% of the data is with the camera shut off (i.e., zero values for all elements in the RGB image) and 50% is with the LiDAR shut off.

Due to the architecture of NetGatedDropout, which consists of the dropout layer, modality embeddings are randomly dropped (to zero), which is similar to randomly shutting down one of the networks. As a result, retraining is not required for the NetGatedDropout architecture and it learns to fuse modalities through end-to-end learning. The probability of dropping either weight is set to 50% for the dropout layer.

It is seen in Section 6 that the proposed network architecture provides robust performance under sensor failures and various data corruptions and implicitly learns to use the relevant information from both modalities to generate steering commands. We compare the NetGated and NetGatedDropout networks trained only on the original dataset and the NetGated network retrained with the corrupted dataset and show that the NetGatedDropout network provides superior performance to both the original NetGated network and the retrained NetGated network (retrained with the corrupted dataset) when one of the sensor modalities fails or is partially occluded or noisy. Also, the NetGatedDropout network retains the performance of the original/retrained NetGated networks when both sensors are present.

6. Experimental studies

In this section, experimental results are presented for the described architectures (NetEmb, NetConEmb, NetGated, NetGatedDropout), which are trained using both camera and LiDAR. We also compare the performance of the original NetGated, NetGated retrained with the training procedure described in the previous section, and NetGatedDropout for various conditions of the modalities (e.g., sensor data corruption).

6.1. Performance of different network architectures

To evaluate the performance of the original architectures (namely NetEmb, NetConEmb, and NetGated as described in Tables 1, 2, and 3, respectively), steering command predictions of each network were compared with the steering commands of a human operator. To evaluate the performance of the above fusion architectures impartially, all the networks have similar structure except for their respective fusion mechanisms. This evaluation was done using a different dataset in a different corridor environment (test dataset) than the one used for training. The results of each of the architectures compared to the human operator are shown in Fig. 6, where the steering commands given by a human operator (during teleoperation of the UGV) are denoted as the ground truth. Additionally, Fig. 7 also shows the error (between network steering command prediction and ground truth) frequency plots of the steering command to better interpret the performance of each architecture. The left and right plots of both figures show clockwise (right turns) and counterclockwise (left turns) navigations of a complete floor of a corridor environment.

As shown in Figs. 6 and 7, the utilization in NetEmb of an equal-size embedding (constructed using a fully connected layer) for each modality after the last convolution layer provides better performance than NetConEmb, as hypothesized in Section 5.1. The NetEmb architecture performs better when one of the modalities is switched off and also oscillates less compared to the NetConEmb architecture. As discussed in Section 5.1, for NetConEmb, the number of features for the camera after the last convolution layer is much larger than for the LiDAR. This causes the output to become more dependent on one modality, resulting in unbalanced fusion. As a result, the steering commands oscillate more, similar to the behavior of the camera-only network. Motivated by the observations above, fully connected layer based embeddings for each modality were also used in the NetGated architecture. An additional advantage of using an equal-size embedding for each modality is that it is then easier and more natural to fuse the embeddings by the learned gated weights by simply taking a weighted linear combination. As shown in Fig. 6, the NetGated architecture based network learns to move straight with fewer oscillations than even the human operator. The fusion of camera and LiDAR results in a smoother output than a LiDAR-only system as shown in Fig. 6.

Since a desirable characteristic of motion in the indoor corridor environment is that the ground vehicle should approximately track the center of the corridor and should not come too close to walls when turning, a useful metric for the performance of
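Eq. (3) can be written directly in code; the following is an illustrative sketch (ours, with δ left as a parameter), not the authors' training code.

```python
def huber_loss(y, y_pred, delta=1.0):
    # Huber loss: quadratic near zero, linear in the tails.
    # The bounded gradient in the tails avoids the divergence that was
    # observed with the mean squared error loss.
    r = abs(y - y_pred)
    if r <= delta:
        return 0.5 * r * r
    return delta * r - 0.5 * delta * delta
```
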
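The gating-based dropout used by NetGatedDropout can be sketched as below. This is a simplified stand-in for the actual layer: the exact sampling scheme over gate configurations is our assumption, chosen to match the description of a generalized-Bernoulli choice among configurations with a 50% drop probability.

```python
import random

def gated_fuse(cam_emb, lidar_emb, w_cam, w_lidar, training=True, p_drop=0.5):
    # With probability p_drop during training, one gate weight is zeroed,
    # drawn from a categorical (generalized Bernoulli) distribution over
    # configurations -- mimicking a camera or LiDAR failure so the network
    # cannot rely on always having both modalities.
    if training and random.random() < p_drop:
        if random.random() < 0.5:
            w_cam = 0.0     # simulate camera dropout
        else:
            w_lidar = 0.0   # simulate LiDAR dropout
    # weighted sum of the equal-size modality embeddings
    return [w_cam * c + w_lidar * l for c, l in zip(cam_emb, lidar_emb)]
```

At test time (`training=False`) the layer reduces to the plain learned weighted sum of the two embeddings.
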
Fig. 6. Steering command predictions of NetEmb, NetConEmb, and NetGated architecture when both camera and LiDAR are working.

Fig. 7. Error frequency plots for steering commands of NetEmb, NetConEmb, and NetGated architecture when both camera and LiDAR are working.
the system is the distance of the vehicle to the left side and right side walls/objects. The closest distances on the left and right sides vary quite significantly even for an ''ideal'' motion due to several objects such as trash cans, boxes, empty spaces, and open office doors at some locations. To remove such ''noise'' effects, an effective performance metric is the standard deviation (rather than the mean) of distances to the left side and right side walls/objects. In addition to the standard deviation, Fig. 8 shows the histogram plots of left and right distances (computed from LiDAR range images) for human teleoperation and the different architectures.

These standard deviations were recorded under fully autonomous mode (i.e., with the network providing the commands to the autopilot) with the different networks for both clockwise and counterclockwise directions. The measured standard deviations for a clockwise motion through the building corridor environment (one complete floor of the building) are shown in Table 5 and it is noted that the NetGated network architecture provides the best (lowest) standard deviation; a similar observation was also noted for a counterclockwise motion. NetGatedDropout performs similarly to NetGated in the absence of sensor noise/dropouts. The performance of NetGatedDropout is shown in Table 6.

For all the considered network architectures, it is noted that the system trained on a dataset with both camera and LiDAR data is not directly robust to the possibility of a sensor failing (i.e., only one sensor modality available and the other zeroed out). For example, NetGated places much more trust on the LiDAR input than on the camera input and does not provide any reasonable performance in the event of a LiDAR failure. Hence, in order to achieve robustness to sensor failure, we introduce the training strategy described in Section 5.2 to continue retraining of the network with corrupted data generated by synthetically turning off either of the two modalities. The performance of the retrained NetGated network (after retraining with this corrupted data based technique) is compared in Section 6.3 with the original trained NetGated network and the human operator.

Table 5
Standard deviation of minimum distances to the wall for a clockwise trajectory under fully autonomous mode.

  Network type           Network input     Left wall distance std (in m)  Right wall distance std (in m)
1 NetConEmb network      Camera and LiDAR  0.48                           0.43
2 NetEmb network         Camera and LiDAR  0.38                           0.35
3 NetGated network       Camera and LiDAR  0.32                           0.31
4 Human Teleoperation    Camera and LiDAR  0.34                           0.3

6.2. Saliency map visualization

A saliency analysis of the proposed framework was performed to determine which parts of the camera image and LiDAR range image were important for prediction. These salient features of the image can be visualized as a grayscale image (as shown in Figs. 9 and 10) with the same dimension as the input. For both the camera image and the LiDAR range image, saliency maps are shown in Figs. 9 and 10. The brightness of each pixel of the saliency map is directly proportional to its importance in determining the output of the network.

The saliency map for a particular input is computed through the gradient of the output with respect to the input, as it determines the change in output value with respect to the change in input. The saliency maps were computed based on the guided backpropagation method [68,69] where, during backpropagation for computing gradients, only positive gradients for positive activations are propagated. Thus, to compute the saliency map for each input, the ReLU activations are determined in the forward pass and positive gradients are determined during the backpropagation. Next, using these positive gradients and positive activations as switches in backpropagation, the gradients of the output with respect to the inputs are determined. These gradients
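The guided backpropagation rule at a ReLU described above can be sketched as a stand-alone function (our illustrative code, not the authors' implementation, which operates on full network layers):

```python
def guided_relu_backward(activations, upstream_grads):
    # Guided backpropagation [68,69] at a ReLU: a gradient is passed back
    # only where the forward activation was positive (ordinary ReLU
    # backprop) AND the upstream gradient itself is positive.
    return [g if (a > 0 and g > 0) else 0.0
            for a, g in zip(activations, upstream_grads)]
```

Applying this rule at every ReLU while backpropagating from the steering output down to the input pixels yields the saliency maps shown in Figs. 9 and 10.
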
Fig. 8. Histogram of distances of the ground vehicle from the left wall and right wall for trajectories of different networks and human teleoperation in the corridor environment.

Fig. 9. Saliency map visualization of activations of the NetGatedDropout architecture for various camera images. We have camera image inputs for the UGV in various scenarios on the top and their corresponding saliency maps in the bottom.
are visualized by normalizing them, subtracting the minimum gradient value and dividing by the maximum for each pixel.

In the top row of Fig. 9, we have the camera image inputs for the UGV in various scenarios and the corresponding activation saliency maps in the bottom. The most salient parts of the images are near the intersections of walls and floors and the edges of objects in the corridor environment such as trash cans and doors. The LiDAR range images (during straight motion, left turns, and right turns) and their corresponding saliency maps are shown in Fig. 10 in the left and right columns, respectively. As shown in the figure, the salient parts of the range image are around the walls and obstacles near the UGV. Thus, from Figs. 9 and 10 it can be inferred that the proposed framework intuitively attends to nearby walls, floor, and obstacles to decide the output.

6.3. Performance of networks during data corruption

We compare the performance of the original NetGated architecture, the NetGated architecture when retrained with corrupted data as explained in Section 5.2, and the NetGatedDropout architecture for various conditions of sensor failures as described below.

6.3.1. Either modality turned off
As shown in Figs. 11 and 12, when both camera and LiDAR are working, all the network architectures perform well and have very similar performance; but, when one of the modalities is shut off, the original NetGated architecture gives erroneous predictions. In Fig. 11, the predictions of the various architectural frameworks when only the camera is working, only the LiDAR is working, and when both sensor modalities are working are shown and compared with the ground truth. Fig. 12 shows the error frequency plots for the above conditions. We simulate the conditions of the camera or LiDAR being shut off by turning all the pixels of that modality to zero. The left and right plots of both figures show clockwise (right turns) and counterclockwise (left turns) navigations of a complete floor of a corridor environment.

The performance characteristics (in terms of trajectory traversed) of the retrained NetGated network and NetGatedDropout were also evaluated (under the possibilities of both camera and LiDAR available, only camera available, and only LiDAR available) using the distance standard deviation based metric as discussed above under fully autonomous operation of the UGV (Fig. 13). It is noted in Table 6 that both the NetGatedDropout and the retrained NetGated network achieve autonomous navigation through the corridor, although the camera-only and LiDAR-only situations provide lower performance (i.e., higher distance standard deviation) than the camera + LiDAR situation. We also observe that NetGatedDropout has lower standard deviation than the retrained NetGated network in all the cases.

We also compare the predictions of all three networks (NetGated, NetGatedDropout, and retrained NetGated) with the human teleoperation based ground truths under the conditions of only camera working, only LiDAR working, and both modalities working using the root mean squared error and a discretized accuracy metric. As discussed before, all three network architectures predict the duty cycle of the PWM signal, which in turn is converted to the steering command. To compare the ''correctness'' of the predicted outputs, we discretize the outputs as described below. We set a threshold of 10% to differentiate between straight movement, left turn, and right turn. A duty cycle between −10%
Fig. 10. Saliency map visualization of the LiDAR range image for the activations of the NetGatedDropout network. We have the LiDAR range image input for the UGV in three different scenarios on the left and their corresponding saliency maps on the right.

Fig. 11. Steering command predictions using the NetGated and NetGatedDropout architectures trained only with the camera + LiDAR dataset and the NetGated retrained with corrupted data under cases of only camera working (top), only LiDAR working (middle), and both camera and LiDAR working (bottom).
and 10% corresponds to the ground vehicle moving straight. If the duty cycle is greater than 10%, then it is equivalent to a right turn, and if the duty cycle is less than −10%, it is equivalent to a left turn. Thus, by using the above methodology, the predictions and ground truth are discretized into left/straight/right. The discretized accuracy metrics of all three networks on the test dataset are shown in Table 7. The root mean squared errors between the network predictions and the human teleoperation ground truth for the same trajectory are shown in Table 8. It is observed from the tables that NetGatedDropout performs better than both the retrained NetGated and the original NetGated architecture. The NetGated architecture fails to predict the steering command when either of the modalities is not present.

6.3.2. Partial occlusion
One of the most common instances of failures in vision and LiDAR-based navigation is when there are partial occlusions of the camera or LiDAR range image. Our framework takes into consideration the performance of the network when either of the modalities is partially occluded. We show that our proposed framework provides robustness to partial occlusion of the sensor data without ever being specifically trained for it.

For a camera image, various types of partial occlusions were considered and some of the worst-case scenarios are shown in the last two images of the bottom row of Fig. 14. To test our network, we selected a part of the camera image and made the values of the pixels in that part zero. The steering command predictions compared to the ground truth on a test dataset in a different corridor environment for the three networks (NetGated, retrained NetGated, and NetGatedDropout) are shown in the top row of
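The left/straight/right discretization and the resulting accuracy metric described above can be sketched as follows (our illustrative code; duty cycles are in percent):

```python
def discretize(duty_cycle, threshold=10.0):
    # duty cycle of the steering PWM signal, in percent:
    # within +/-threshold -> straight, above -> right, below -> left
    if duty_cycle > threshold:
        return "right"
    if duty_cycle < -threshold:
        return "left"
    return "straight"

def discretized_accuracy(predictions, ground_truth):
    # fraction of samples whose discretized label matches the
    # discretized human-teleoperation ground truth
    matches = sum(discretize(p) == discretize(g)
                  for p, g in zip(predictions, ground_truth))
    return matches / len(predictions)
```
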
Fig. 12. Error frequency plots for steering commands of the NetGated and NetGatedDropout architectures trained only with the camera + LiDAR dataset and the NetGated retrained with corrupted data under cases of only camera working (top), only LiDAR working (middle), and both camera and LiDAR working (bottom).

Fig. 13. Distances of the ground vehicle from the left wall and right wall (in m) for counterclockwise (left turns) navigations in the corridor environment for NetGatedDropout architecture.
Fig. 16. The top row of Fig. 17 shows the error frequency metric for the same. As shown in the figures, the NetGatedDropout architecture performs the best out of the three architectures.

The performance of the three architectures was also tested for a partially occluded LiDAR range image, and some of the test cases of occlusion are shown in the middle row of Fig. 15. Partial occlusion is simulated for a LiDAR range image during testing by making the pixels of a part of the range image zero. We observe from the graphs in the bottom row of Fig. 16 and the error frequency graphs in Fig. 17 that both NetGated and the retrained NetGated fail to give accurate steering command predictions on the test dataset and only NetGatedDropout is able to correctly predict the steering commands.

To empirically verify the results, we also compare the predictions of all three networks with the ground truth from human teleoperation using the discretized accuracy metric and the root mean squared error metric as explained in the previous subsection. The results are shown in Tables 7 and 8, from which we can ascertain that the NetGatedDropout architecture performs better than both the retrained NetGated and the original NetGated architecture when the camera image or the range image is partially occluded.

6.3.3. Camera image with varying image formations
As shown in the top row of Fig. 14, the pixel intensities of the camera image are varied by changing brightness, contrast, and saturation. The brightness of the image is varied by alpha blending the original camera image with an image with all its pixels
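The occlusion simulation described above (zeroing the pixels of a chosen region of the camera image or LiDAR range image) can be sketched, for an image stored as a list of rows, as:

```python
def occlude(image, row0, row1, col0, col1):
    # zero out the rectangular region [row0:row1) x [col0:col1) of the image,
    # mimicking a partial occlusion of the camera or LiDAR range image
    return [[0.0 if (row0 <= r < row1 and col0 <= c < col1) else v
             for c, v in enumerate(row)]
            for r, row in enumerate(image)]
```
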
Fig. 14. Examples of various types of data corruption for a camera image. In the top row, we have the original camera image, image with brightness changed, image with varying contrast, and image with varying saturation, respectively. In the bottom row, we have the image with additive random noise, multiplicative random noise, and partially occluded images in different directions, respectively.

Fig. 15. Examples of various types of data corruption for a LiDAR range image. In the top row, we have the original LiDAR range image. We have partially occluded LiDAR range image in various directions in the second row and image with additive random noise and multiplicative random noise in the bottom row.
Fig. 16. Steering command predictions using the different network architectures when camera images are partially occluded (top) and when LiDAR range images are partially occluded (bottom).
zero. The saturation of the image is changed by alpha blending the original camera image with its own grayscale image. The contrast of the image is modified by alpha blending the original camera image with an image whose pixel values are equal to the mean of its own grayscale image. The alpha used for alpha blending or alpha compositing is randomly generated between 0 and 1 for each camera image.
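The brightness, saturation, and contrast variations are all instances of alpha blending two images; a sketch of the operation (ours, on flat pixel lists) is:

```python
def alpha_blend(img_a, img_b, alpha):
    # pixel-wise convex combination: alpha=1 gives img_a, alpha=0 gives img_b.
    # brightness: blend with an all-zero image; saturation: blend with the
    # image's own grayscale; contrast: blend with a constant image whose
    # value is the mean of the grayscale image.
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(img_a, img_b)]
```

For example, blending with an all-zero image at alpha = 0.25 darkens the image to a quarter of its original intensity.
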
Fig. 17. Error frequency plots for steering commands of the different network architectures when camera images are partially occluded (top) and when LiDAR range images are partially occluded (bottom).

Fig. 18. Steering command predictions using the different network architectures when camera brightness is changed.

Fig. 19. Error frequency plots for steering commands of the different network architectures when camera brightness is changed.
The performance of each network when the brightness of the camera image is modified is shown in Fig. 18, with the error frequency plots in Fig. 19. As observed from the figures, the original NetGated network gives oscillating and inaccurate steering command predictions. The retrained NetGated performs better than the original NetGated but does not perform as well as the NetGatedDropout network. Similar results were observed when the contrast or saturation was varied.

6.3.4. Data corruption by random noise
Inaccurate predictions due to image noise are one of the prevalent issues for any vision-based method. The noise is usually caused by problems in the electronic circuitry of the sensors. We test the proposed network for various types of noise, namely additive random noise, multiplicative random noise, and salt and pepper noise.

For the camera image, we generate an image with noise amplitude of half the maximum intensity and add it to the original image, resulting in an image such as the first image in the bottom row of Fig. 14. We also test our network on images whose pixels are multiplied by random numbers generated between 0 and 1. An example of this multiplicative random noise is shown in the second image in the bottom row of Fig. 14. The other type of noise that we experimented with is salt and pepper noise, wherein randomly selected sets of pixels in the image are made black or white.
Fig. 20. Steering command predictions using the different network architectures when random noise is added to the camera images (top) and when random noise is added to LiDAR range images (bottom).

Fig. 21. Error frequency plots for steering commands of the different network architectures when random noise is added to the camera images (top) and when random noise is added to LiDAR range images (bottom).
Similar to the generation of the noisy camera images, three different types of noisy LiDAR images were generated: (1) by adding random noise with an amplitude of half the maximum range value to each pixel of the range image, (2) with multiplicative random noise, and (3) with salt and pepper noise. The random noise in the second and third cases is generated following the same procedures as for generating camera images with multiplicative random noise and with salt and pepper noise, respectively.

The performance of the three architectures (NetGated, retrained NetGated, and NetGatedDropout) when the camera and LiDAR images are corrupted with additive random noise is shown in Fig. 20 (top and bottom rows, respectively). The error frequency plots for the same are shown in the top and bottom rows of Fig. 21 for the camera and LiDAR range images, respectively. It is observed from both figures that the NetGatedDropout architecture performs the best out of all the considered architectures. This observation is empirically verified by comparing the predictions of all three networks with the ground truth using the discretized accuracy metric as explained in the previous subsection and the root mean squared error metric. The results are shown in Tables 7 and 8, from which we can validate that the NetGatedDropout architecture provides the best results in both the cases of the noisy camera images and the noisy LiDAR range images. Similar results were observed for the cases when the camera and LiDAR range images were corrupted with multiplicative random noise and salt and pepper noise.
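The three corruption types can be sketched as follows. This is our illustrative code: the exact noise distributions (e.g., whether the additive noise is uniform) and the salt-and-pepper fraction are assumptions on our part; `max_val` stands for the maximum intensity or range value of the modality.

```python
import random

def additive_noise(pixels, max_val):
    # add random noise with amplitude of half the maximum value
    return [p + random.uniform(0.0, 0.5 * max_val) for p in pixels]

def multiplicative_noise(pixels):
    # multiply each pixel by a random number in [0, 1)
    return [p * random.random() for p in pixels]

def salt_and_pepper(pixels, max_val, frac=0.1):
    # make a randomly selected fraction of pixels black (0) or white (max_val)
    out = list(pixels)
    for i in random.sample(range(len(out)), int(frac * len(out))):
        out[i] = random.choice([0.0, max_val])
    return out
```
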
Table 6
Standard deviation of minimum distances to the walls using the retrained NetGated and NetGatedDropout architectures when various modalities are turned off for a counterclockwise trajectory.

  Network type                        Network input     Left wall distance std (in m)  Right wall distance std (in m)
1 NetGated (Retrained) architecture   Camera            0.43                           0.39
2 NetGatedDropout architecture        Camera            0.33                           0.36
3 NetGated (Retrained) architecture   LiDAR             0.37                           0.32
4 NetGatedDropout architecture        LiDAR             0.31                           0.32
5 NetGated (Retrained) architecture   Camera and LiDAR  0.33                           0.32
6 NetGatedDropout architecture        Camera and LiDAR  0.31                           0.3

Fig. 22. Example trajectories of a UGV autonomously navigating in an indoor environment: left turn (top row), straight motion (middle row), right turn (bottom row). These pictures were taken from behind the UGV.
6.4. Autonomous indoor navigation of the ground vehicle
The NetGatedDropout architecture based framework is able to successfully navigate through the indoor corridor environment and is robust to sensor failures and various other data corruptions. Autonomous navigation through corridors is shown in Fig. 22. The ground vehicle is able to appropriately make turns at corners, enabling it to be equidistant from the walls after the turn. It is also able to navigate through narrower spaces (e.g., between trash cans) as shown in the middle two rows of Fig. 22.

Furthermore, the system implicitly learns to avoid static as well as dynamic obstacles as shown in Fig. 23 without ever being specifically trained for this purpose, i.e., the training dataset did not include any specific demonstrations of moving around obstacles. Also, the fused camera + LiDAR network performs better in several scenarios than the camera-only or LiDAR-only situations. While a LiDAR-only network can enable avoidance of obstacles such as humans, it does not typically detect small (low-profile) objects since these register only a few points in the LiDAR scan. In such situations, the camera image enables the fused network to avoid the obstacle. When approaching a visually featureless wall, a camera-only system cannot disambiguate between left and right turns, while the LiDAR enables the fused network to detect the appropriate turn. When passing an open door or other open spaces (such as a short corridor leading to a dead end), the LiDAR, being a more geometric sensor measuring distances to points, tends to make a LiDAR-only system move towards the open space. However, the visual features implicitly detected from the camera enable the fused network to completely ignore such an ''unintended'' open space and remain at the center of the corridor (Fig. 24).

7. Conclusion

A convolutional neural network based architecture was introduced for fusing vision and depth measurements from camera and LiDAR, respectively, for learning an autonomous navigation policy for a ground robot operating in an indoor environment. Multiple network architectures were considered, including a novel dropout regularization aided gating based network architecture. This architecture enables the ground vehicle navigation to be robust to the possibility of sensor failure or data corruption and to
Table 7
Comparison of accuracy metric values for the steering command predictions of NetGated, NetGated retrained by randomly switching off modalities, and NetGatedDropout with the steering command ground truth (UGV driven by human operator). All values in %.

Network type          Both modalities  Camera additive noise  LiDAR additive noise  Camera only  LiDAR only  Camera image partially occluded  LiDAR range image partially occluded
NetGated              92.25            88.83                  86.93                 8.81         1.76        72.57                            31.42
NetGated (Retrained)  92.24            88.85                  80.83                 91.84        88.09       90.09                            45.39
NetGatedDropout       93.65            93.40                  86.95                 95.234       90.235      94.22                            92.89
Table 8
Comparison of root mean squared error metric values for the steering command predictions of NetGated, NetGated retrained by randomly switching off modalities, and NetGatedDropout with the steering command ground truth (UGV driven by human operator). All values in PWM %.

Network type          Both modalities  Camera additive noise  LiDAR additive noise  Camera only  LiDAR only  Camera image partially occluded  LiDAR range image partially occluded
NetGated              4.71             5.28                   4.96                  14.41        25.04       10.16                            15.56
NetGated (Retrained)  4.88             5.40                   5.54                  5.35         8.24        7.02                             13.35
NetGatedDropout       4.40             4.63                   5.39                  4.67         5.85        4.88                             4.59
N. Patel, A. Choromanska, P. Krishnamurthy et al. / Robotics and Autonomous Systems 116 (2019) 80–97 95

[9] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, S. Levine, Self-supervised deep


reinforcement learning with generalized computation graphs for robot
navigation, in: Proceedings of the NIPS Workshop on Acting and Interacting
in the Real World: Challenges in Robot Learning, Long Beach, USA, 2017.
[10] M. Wulfmeier, D.Z. Wang, I. Posner, Watch this: Scalable cost-function
learning for path planning in urban environments, in: Proceedings of the
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
Daejeon, South Korea, 2016, pp. 2089–2095.
[11] M. Wulfmeier, D. Rao, D.Z. Wang, P. Ondruska, I. Posner, Large-scale cost
function learning for path planning using deep inverse reinforcement
learning, Int. J. Robot. Res. 36 (10) (2017) 1073–1087.
Fig. 23. Examples of trajectories of the UGV autonomously navigating in the [12] F. Codevilla, M. Müller, A. Dosovitskiy, A. López, V. Koltun, End-to-end
presence of static (top row) and dynamic obstacles (bottom row). These pictures driving via conditional imitation learning, CoRR, abs/1710.02410, 2017.
were taken from behind the UGV.
[13] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
Dropout: a simple way to prevent neural networks from overfitting, J.
Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[14] S. Thrun, W. Burgard, D. Fox, Probabilistic Robotics (Intelligent Robotics
and Autonomous Agents), The MIT Press, 2005.
[15] G.N. DeSouza, A.C. Kak, Vision for mobile robot navigation: A survey, IEEE
Trans. Pattern Anal. Mach. Intell. 24 (2) (2002) 237–267.
[16] B.D. Argall, S. Chernova, M. Veloso, B. Browning, A survey of robot learning
from demonstration, Robot. Auton. Syst. 57 (5) (2009) 469–483.
Fig. 24. Trajectory comparison of LiDAR-only network and fused network when [17] B. Paden, M. Čáp, S.Z. Yong, D. Yershov, E. Frazzoli, A survey of motion
passing through a corridor with misleading open spaces (e.g., open doors). planning and control techniques for self-driving urban vehicles, IEEE Trans.
Ground vehicle navigating with LiDAR-only network shows a tendency to deviate Intell. Veh. 1 (1) (2016) 33–55.
(from the center of the corridor) towards open doors or other open spaces as [18] C. Thorpe, M.H. Hebert, T. Kanade, S.A. Shafer, Vision and navigation for
compared to the fusion network. the Carnegie-Mellon Navlab, IEEE Trans. Pattern Anal. Mach. Intell. 10 (3)
(1988) 362–373.
[19] C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. Clark, J. Dolan,
D. Duggins, T. Galatali, C. Geyer, et al., Autonomous driving in urban
properly leverage the complementary strengths of the two sensors so as to achieve better performance than with either sensor separately. It was experimentally demonstrated that the proposed deep learning based system is able to fully autonomously navigate in the indoor environment with robustness to the failure of either the camera or the LiDAR and to various types of sensor data corruption.
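The gating idea behind this robustness can be illustrated with a minimal sketch. The code below is an assumption-laden simplification, not the paper's actual network (which operates on raw pixels and depths through convolutional branches): the function names, feature dimensions, and the single linear gate layer are all illustrative. It fuses a camera feature vector and a LiDAR feature vector through learned per-modality sigmoid gates; zeroing a gate column emulates a failed sensor, which is how gating-based dropout exposes the network to sensor failures during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(cam_feat, lidar_feat, w_gate, w_head, drop_mask=None):
    """Fuse per-modality feature vectors into a steering prediction.

    cam_feat, lidar_feat: (batch, d) features from the camera/LiDAR branches.
    w_gate: (2d, 2) weights producing one sigmoid gate per modality.
    w_head: (d, 1) weights mapping the fused features to a steering command.
    drop_mask: optional (batch, 2) binary mask; zeroing column 0 (camera) or
        column 1 (LiDAR) emulates that sensor failing, mimicking the
        gating-based dropout applied during training.
    """
    # One scalar gate per modality, computed from both feature vectors.
    gates = sigmoid(np.concatenate([cam_feat, lidar_feat], axis=1) @ w_gate)
    if drop_mask is not None:
        gates = gates * drop_mask
    # Weighted sum of the two modality features, then the steering head.
    fused = gates[:, :1] * cam_feat + gates[:, 1:] * lidar_feat
    return fused @ w_head
```

Randomly sampling `drop_mask` per training example forces the steering head to remain accurate when only one modality's features survive, which is the mechanism the experiments above probe by disabling a sensor at test time.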
Acknowledgment
This work was funded in part by ONR grant number N00014-15-12-182.
Naman Patel received his Master's degree in Electrical Engineering from NYU Tandon School of Engineering, Brooklyn, New York, in 2016, where he is currently working towards his Ph.D. degree with the Control/Robotics Research Laboratory (CRRL) headed by Professor Farshad Khorrami. His research interests include developing algorithms for autonomous unmanned ground and aerial vehicles, simultaneous localization and mapping systems, computer vision, and artificial intelligence.

Anna Choromanska received her Ph.D. degree from the Department of Electrical Engineering at Columbia University in the City of New York in 2014, and an M.Sc. degree with distinction from the Department of Electronics and Information Technology at Warsaw University of Technology. She did her post-doctoral studies in the Computer Science Department at the Courant Institute of Mathematical Sciences at NYU. She joined the Department of Electrical and Computer Engineering at NYU Tandon School of Engineering in Spring 2017 as an Assistant Professor. Prof. Choromanska's research interests focus on numerical optimization, deep learning, and large data analysis with applications to autonomous car driving. She is also working on learning from data streams, learning with expert advice, supervised and unsupervised online learning, clustering, and structured prediction. Prof. Choromanska has co-authored several international conference papers and refereed journal publications, as well as book chapters. She is also a contributor to the open source fast out-of-core learning system Vowpal Wabbit (aka VW). Prof. Choromanska has given over 50 invited and conference talks and serves as a book editor (MIT Press volume), organizer of machine learning events at top venues like NIPS, and a reviewer and area chair for several top machine learning conferences (e.g., ICML, NIPS, and AISTATS) and journals (e.g., Transactions on Pattern Analysis and Machine Intelligence and Machine Learning).

Prashanth Krishnamurthy received his B.Tech. degree in electrical engineering from Indian Institute of Technology, Chennai in 1999, and M.S. and Ph.D. degrees in electrical engineering from Polytechnic University (now NYU), Brooklyn, NY in 2002 and 2006, respectively. He is currently a Research Scientist and Adjunct Faculty with the Department of Electrical and Computer Engineering at NYU Tandon School of Engineering, NY, and a Senior Researcher with FarCo Technologies, NY. He has co-authored over 110 journal and conference papers in the broad areas of autonomous systems, robotics, and control systems. He has also co-authored the book "Modeling and Adaptive Nonlinear Control of Electric Motors" published by Springer Verlag in 2003. His research interests include robust and adaptive nonlinear control, resilient control, autonomous vehicles and robotic systems, path planning and obstacle avoidance, sensor data fusion, machine learning, real-time embedded systems, electromechanical systems modeling and control, cyber–physical systems and cyber-security, decentralized and large-scale systems, high-fidelity and hardware-in-the-loop simulation, real-time software implementations for robotic systems, and distributed multi-agent systems.

Farshad Khorrami received his Bachelor's degrees in Mathematics and Electrical Engineering in 1982 and 1984, respectively, from The Ohio State University. He also received his Master's degree in Mathematics and Ph.D. in Electrical Engineering in 1984 and 1988 from The Ohio State University. Dr. Khorrami is currently a professor in the Electrical & Computer Engineering Department at NYU, which he joined as an assistant professor in Sept. 1988. His research interests include adaptive and nonlinear controls, robotics and automation, unmanned vehicles (fixed-wing and rotary-wing aircraft as well as underwater vehicles and surface ships), resilient control for industrial control systems, cyber security for cyber–physical systems, large-scale systems and decentralized control, and real-time embedded instrumentation and control. Prof. Khorrami has published more than 240 refereed journal and conference papers in these areas and holds thirteen U.S. patents on novel smart micropositioners and actuators, control systems, and wireless sensors and actuators. His book "Modeling and Adaptive Nonlinear Control of Electric Motors" was published by Springer Verlag in 2003. He has developed and directed the Control/Robotics Research Laboratory at Polytechnic University (now NYU). His research has been supported by the Army Research Office, National Science Foundation, Office of Naval Research, DARPA, Air Force Research Laboratory, Sandia National Laboratory, Army Research Laboratory, NASA, and several corporations. Prof. Khorrami has served as general chair and conference organizing committee member of several international conferences.