
Title: CNN refinement based object recognition through optimized segmentation

Journal: Optik

Author: Hao Wu, Rongfang Bie, Junqi Guo, Xin Meng, Chenyun Zhang

Abstract:

As a classic technique, object recognition can identify objects in an image effectively, and it has been improved significantly by deep learning models. However, in the process of object recognition, a complicated background can have a negative effect on feature extraction, which directly reduces the quality of recognition. Although some methods have targeted these drawbacks, the quality of feature extraction is still unsatisfactory. To address this problem, we propose a CNN-refinement-based object recognition method with optimized segmentation that improves the quality of object recognition. On the one hand, the optimized segmentation method contributes to the feature extraction process; on the other hand, the CNN refinement method contributes to the final object recognition. Finally, a database with a large number of images was built, and adequate experiments on it verify our model's effectiveness and robustness.

Methodology:

We collected 112,902 images from the Internet and a personal database. Compared to traditional image databases such as ImageNet [44], LabelMe [45] and Caltech 256 [46], our database is richer: on the one hand, the images are more complicated, especially the backgrounds; on the other hand, the database includes images from as many categories as possible. In the experiments, GoogLeNet [41], VGG [47], SPP [48] and AlexNet [43] are used as baselines. First, we used our method and the baseline models to obtain recognition results under the same experimental environment. From the mAP values in Fig. 3, our method performs better than the other methods; compared to the traditional methods, it achieves higher-quality recognition results. In addition, we combined our optimized segmentation algorithm with the other methods: Fig. 4 shows that the optimized segmentation algorithm improves not only the CNN refinement model's recognition results but also those of the other baseline models.
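
The paper's own segmentation and refinement steps are not reproduced in these notes, but the general idea of suppressing a complicated background before CNN feature extraction can be sketched. In the minimal Python sketch below, the thresholding mask, the `cnn_classify` callable and all constants are illustrative assumptions, not the authors' optimized segmentation or CNN refinement.

```python
# Sketch: suppress a complicated background with a segmentation mask before
# CNN-based recognition. The mask here is a naive intensity threshold; the
# paper's "optimized segmentation" is not reproduced, and `cnn_classify`
# stands in for any pretrained CNN classifier (hypothetical helper).
import numpy as np

def naive_foreground_mask(gray: np.ndarray) -> np.ndarray:
    """Binary mask from a global threshold (illustrative only)."""
    threshold = gray.mean()            # crude stand-in for an optimized threshold
    return (gray > threshold).astype(np.uint8)

def mask_background(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out background pixels so they contribute little to CNN features."""
    return image * mask[..., None]     # broadcast the mask over the color channels

def recognize(image: np.ndarray, cnn_classify) -> str:
    gray = image.mean(axis=2)
    mask = naive_foreground_mask(gray)
    focused = mask_background(image, mask)
    return cnn_classify(focused)       # any classifier taking an HxWx3 array

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(224, 224, 3)).astype(np.float32)
    # Dummy classifier: returns a fixed label, only to make the sketch runnable.
    print(recognize(img, lambda x: "object"))
```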

Algorithms used:

• Optimized segmentation
• CNN refinement model

Results:

We proposed a CNN-refinement-based object recognition model. By combining optimized segmentation with the CNN refinement model, our method achieves satisfactory recognition results. More importantly, adequate experiments on a database containing a large number of complicated images show our method's effectiveness and robustness.
Title: Indoor object recognition in RGBD images with complex-valued neural networks for visually-
impaired people

Journal: Neurocomputing

Author: Rim Trabelsi, Issam Jabri, Farid Melgani, Fethi Smach, Nicola Conci, Ammar Bouallegue

Abstract:

We present a new multi-modal technique for assisting visually-impaired people in recognizing objects in public indoor environments. Unlike common methods, which address multi-class object recognition with a traditional single-label strategy, the comprehensive approach developed here allows samples to take more than one label at a time. We jointly use appearance and depth cues, specifically RGBD images, to overcome the issues of traditional vision systems using a new complex-valued representation.

Methodology:

We propose two methods to associate each input RGBD image with a set of labels corresponding to the object categories recognized at once.

• The first one, ML-CVNN, is formulated as a ranking strategy in which we use a fully complex-valued RBF network and extend it to solve multi-label problems with an adaptive clustering method.
• The second method, L-CVNNs, follows a problem-transformation strategy: instead of using a single network to formalize the classification problem as a ranking over the whole label set, we construct one CVNN per label and later aggregate the predicted labels into the resulting multi-label vector (a minimal sketch of this label-wise aggregation follows this list). Extensive experiments carried out on two newly collected multi-labeled RGBD datasets prove the efficiency of the proposed techniques.

Algorithms used:

Multi-label classification

RGBD object recognition

RGB ML-RVNN, Depth ML-RVNN, RGBD ML-RVNN, RGBD ML-CVNN

Results:

Performance of ML-CVNN on UNITN01 dataset against ML-RVNN using different modalities.

Metric              RGB ML-RVNN   Depth ML-RVNN   RGBD ML-RVNN   RGBD ML-CVNN
Hamming Loss        0.115         0.120           0.135          0.092
Ranking Loss        0.258         0.069           0.261          0.077
One Error           0.144         0.177           0.127          0.090
Coverage            5.478         6.147           5.858          4.980
Average Precision   0.745         0.685           0.780          0.872
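
For reference, the first metrics in the table can be computed directly from multi-label predictions; the toy numpy sketch below (not the paper's data) shows Hamming loss and one-error.

```python
# Sketch: computing two of the table's multi-label metrics on toy data.
# y_true is multi-hot ground truth, y_pred a hard prediction, y_score ranking scores.
import numpy as np

y_true  = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_pred  = np.array([[1, 0, 0, 0], [0, 1, 0, 1]])
y_score = np.array([[0.9, 0.1, 0.4, 0.2], [0.2, 0.8, 0.1, 0.6]])

# Hamming loss: fraction of label slots predicted wrongly.
hamming_loss = np.mean(y_true != y_pred)

# One-error: fraction of samples whose top-ranked label is not a true label.
top = np.argmax(y_score, axis=1)
one_error = np.mean(y_true[np.arange(len(y_true)), top] == 0)

print(hamming_loss, one_error)   # 0.25 0.0
```
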
Title: Let Blind People See: Real-Time Visual Recognition with Results Converted to 3D Audio

Author: Rui Jiang, Qian Lin, Shuhui Qu

Abstract:

This project transforms the visual world into the audio world, with the potential to inform blind people about objects as well as their spatial locations. Objects detected in the scene are represented by their names and converted to speech, and their spatial locations are encoded into 2-channel audio with the help of 3D binaural sound simulation. Our system is composed of several modules. Video is captured with a portable camera device (Ricoh Theta S, Microsoft Kinect, or GoPro) on the client side and streamed to the server for real-time image recognition with existing object detection models (YOLO). The 3D location of each object is estimated from the location and size of the bounding boxes produced by the detection algorithm. Then, a 3D sound generation application based on the Unity game engine renders the binaural sound with the locations encoded, and the sound is transmitted to the user through wireless earphones. Sound is played at an interval of a few seconds, or whenever the recognized object differs from the previous one, whichever comes first. The prototype was tested in a situation simulating a blind person being exposed to a new environment: with the help of the device, the user successfully found a chair 3-5 meters away, walked towards it and sat on it. Issues identified with the current prototype are detection failure when objects are too close or too far, and information overload when the system tries to notify the user of too many objects.

Methodology:

For object detection:

R-CNN [11] uses region proposal methods to generate possible bounding boxes in an image and then applies a ConvNet to classify each box; the results are post-processed to output finer boxes. Its slow test time, complex training pipeline and large storage requirements do not fit our application. Fast R-CNN [12] max-pools the proposed regions, shares the ConvNet computation across all proposals of an image and outputs the features of all regions at once. Building on Fast R-CNN, Faster R-CNN [13] inserts a region proposal network after the last ConvNet layer. Both methods reduce the computational time and improve accuracy.

For converting visuals into sound:

We use a plug-in for the Unity 3D game engine called 3DCeption to simulate the 3D sound. We developed a Unity-based program, "3D Sound Generator", which uses either a file watcher or a TCP socket to receive information about the sound clips to be played as well as their spatial coordinates. 3DCeption then renders the binaural sound effect with the help of the Head-Related Transfer Function (HRTF) to simulate the reflection of the sound on the human body (head, ears, etc.) and on obstacles (such as walls and floors).
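
A minimal sketch of the glue logic described above: estimate a coarse 3D position from a detection bounding box (azimuth from the horizontal offset, distance roughly inverse to the box height) and push it to the sound generator over a TCP socket. The JSON message schema, field-of-view and distance constants are assumptions for illustration, not the project's actual protocol.

```python
# Sketch: map a detection bounding box to a coarse 3D position and send it
# to a sound-rendering process over TCP. The message schema, field of view
# and distance constant are illustrative assumptions.
import json
import math
import socket

FOV_DEG = 90.0     # assumed horizontal field of view of the camera
DIST_K = 300.0     # assumed constant: distance ~ DIST_K / box height (pixels)

def bbox_to_position(cx, cy, w, h, img_w, img_h):
    """Rough azimuth (rad) from the horizontal offset, distance from box height."""
    azimuth = math.radians((cx / img_w - 0.5) * FOV_DEG)
    distance = DIST_K / max(h, 1.0)
    # Place the source on a horizontal circle around the listener.
    return distance * math.sin(azimuth), 0.0, distance * math.cos(azimuth)

def send_to_sound_engine(label, position, host="127.0.0.1", port=5005):
    """Ship the label and 3D position to a listening sound generator as JSON."""
    msg = json.dumps({"label": label, "position": position}).encode()
    with socket.create_connection((host, port), timeout=1.0) as s:
        s.sendall(msg)

if __name__ == "__main__":
    pos = bbox_to_position(cx=480, cy=250, w=120, h=200, img_w=640, img_h=480)
    print("chair at", pos)   # sending requires a sound engine listening on the port
```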

Algorithms used: You Only Look Once (YOLO) model, R-CNN, Faster R-CNN


Results:

The pipelines of these methods are still relatively complex and hard to optimize. YOLO efficiently provides relatively good object detection at extremely fast speed: the YOLO algorithm can process a single image frame at 4-60 frames/second depending on the image size sent to the engine, and it can correctly detect objects, such as a chair, within a range of about 2-5 m.
Title: A Bilingual Scene-to-Speech Mobile Based Application

Author: AbdelGhani Karkar, Mary Puthren, Somaya Al-Maadeed

Abstract:

Scene-to-Speech (STS) is the process of recognizing visual objects in a picture or a video and saying aloud a descriptive text that represents the scene. Recent advances in convolutional neural networks (CNNs), a class of deep feed-forward artificial neural networks, enable objects to be recognized on handheld mobile devices in real time. Several applications have been developed to recognize objects in scenes and speak their descriptions aloud; however, the Arabic language is not fully supported. In this paper, we propose a bilingual mobile application that captures video scenes and processes their content to recognize objects in real time. The application then speaks aloud, in English or Arabic, a description of the captured scene. The application can be extended to further support eLearning technologies and edutainment games. People with visual impairments (VI), such as people with low vision and totally blind people, can benefit from the application to learn about their surroundings. We conducted an elementary study of the mobile application's usage with people with VI, and they expressed their interest in using it in their daily lives.

Methodology:

We propose a mobile assistive application that supports the Arabic language and can be used by people with VI, including people with low vision and blind people. A practical deployment has been achieved by integrating TensorFlow and the Google TTS API to fulfil the application's requirements. The application uses the mobile device's camera to capture, in real time, the scene in front of the user. Using trained NN models, the captured video is processed to extract textual descriptions, or labels, characterizing the recognized objects in the visual scene. The obtained labels are then filtered, keeping only those with a confidence level above sixty percent, and the filtered labels are used to generate sentences that are spoken out to the user. The mobile application was developed in Java. An ontology has been incorporated to provide rich explanations of particular words, as in [35], [36], which strengthens the application's value as an educational tool [37]. The application runs on Android 5.0 or later. Figure 1 illustrates the components of our proposed mobile application.
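
A minimal sketch of the label-filtering step described above: keep only labels whose confidence exceeds the sixty-percent threshold and turn them into a sentence for the TTS engine. It is written in Python for consistency with the other sketches, whereas the actual application does this in Java with the Google TTS API; the sentence template is an assumption.

```python
# Sketch: filter recognition labels by a 0.6 confidence threshold and build a
# short sentence for a TTS engine. The template is illustrative; the described
# application performs this in Java with Google TTS on Android.
CONFIDENCE_THRESHOLD = 0.60

def filter_labels(detections):
    """Keep only labels whose confidence exceeds the 60% threshold."""
    return [label for label, conf in detections if conf > CONFIDENCE_THRESHOLD]

def build_sentence(labels):
    """Turn the filtered labels into a sentence to hand to the TTS engine."""
    if not labels:
        return "No objects recognized."
    return "I can see: " + ", ".join(labels) + "."

detections = [("chair", 0.82), ("table", 0.55), ("door", 0.71)]
print(build_sentence(filter_labels(detections)))   # -> "I can see: chair, door."
```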

Technologies used:

TensorFlow, Google TTS API, Language Mapper, CNN, R-CNN

Results:

The application relies on You Only Look Once (YOLO) [16] and other models targeting mobile devices, along with custom models we trained manually to cover visual objects not included in existing models, such as a Qatari cultural camp, Qatar National Bank (QNB) [38] credit and debit cards, and door knobs at Qatar University. The model that provides the highest accuracy is marked as the default model and is used first in the subsequent procedure for examining the content of the visual scenes.
Title: Object detection via deeply exploiting depth information

Authors: Saihui Hou, Zilei Wang, Feng Wu

Journal: Neurocomputing

Abstract:

This paper addresses the issue of how to more effectively coordinate depth with RGB in order to boost the performance of RGB-D object detection. In particular, we investigate two primary ideas under the CNN model: property derivation and property fusion. Firstly, we propose that depth can be utilized not only as a type of extra information besides RGB but also to derive more visual properties for comprehensively describing the objects of interest; a two-stage learning framework consisting of property derivation and fusion is then constructed, where the properties can be derived either from the provided color/depth or from their pairs (e.g. the geometry contour). Secondly, we explore how to fuse the different properties during feature learning, which under the CNN model boils down to deciding from which layer the properties should be fused together. The analysis shows that the different semantic properties should be learned separately and combined just before passing into the final classifier; such a detection scheme is in accordance with the mechanism of the primary visual cortex (V1) in the brain. We experimentally evaluate the proposed method on the challenging NYUD2 and SUN RGB-D datasets and achieve remarkable performance that outperforms the baselines on both.

Methodology:

Overview of our framework for RGB-D object detection: we learn multiple properties of the objects and fuse them effectively for detection. Various maps, including geometry contour, horizontal disparity, height above ground and angle with gravity, are first derived from the raw color and depth pairs. These maps, along with the RGB image, are fed into different CNNs to learn particular types of features, and the features are fused only at the highest level, i.e. they are not joined until they pass into the classifier. The region proposals for R-CNN [22] are generated using MCG [28] with the depth information, and an SVM is appended to predict the label and score for each proposal.
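
A minimal sketch of this highest-level (late) fusion: each derived property map gets its own small CNN stream, and the streams' features are concatenated only right before the classifier. PyTorch is used generically here; the layer sizes, channel counts and class count are arbitrary and not taken from the paper.

```python
# Sketch: late fusion of per-property feature streams. Each input map (RGB,
# disparity, height, angle, contour) has its own small CNN; features are only
# concatenated right before the classifier. All sizes are arbitrary.
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One small CNN per input property map."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

class LateFusionHead(nn.Module):
    def __init__(self, channels_per_stream=(3, 1, 1, 1, 1), num_classes=19):
        super().__init__()
        self.streams = nn.ModuleList([Stream(c) for c in channels_per_stream])
        self.classifier = nn.Linear(16 * len(channels_per_stream), num_classes)

    def forward(self, maps):
        feats = [s(m) for s, m in zip(self.streams, maps)]   # learned separately
        return self.classifier(torch.cat(feats, dim=1))      # fused at the end

maps = [torch.randn(2, c, 64, 64) for c in (3, 1, 1, 1, 1)]
print(LateFusionHead()(maps).shape)   # torch.Size([2, 19])
```
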
Results:
Title: Enhancing perception for the visually impaired with deep learning techniques and low-cost
wearable sensors

Author: Zuria Bauer, Alejandro Dominguez, Edmanuel Cruz, Francisco Gomez-Donoso, Sergio Orts-Escolano, Miguel Cazorla

Journal: Pattern Recognition Letters

Abstract:

As estimated by the World Health Organization, millions of people live with some form of vision impairment, and as a consequence some of them have mobility problems in outdoor environments. With the aim of helping them, we propose in this work a system capable of delivering the positions of potential obstacles in outdoor scenarios. Our approach is based on non-intrusive wearable devices and also focuses on being low-cost. First, a depth map of the scene is estimated from a color image, which provides 3D information about the environment. Then, an urban object detector is in charge of detecting the semantics of the objects in the scene. Finally, the three-dimensional and semantic data are summarized into a simpler representation of the potential obstacles in front of the user, and this information is transmitted to the user through spoken or haptic feedback. Our system runs at about 3.8 fps and achieved 87.99% mean accuracy in obstacle presence detection. Finally, we deployed our system in a pilot test involving an actual person with vision impairment, who validated the effectiveness of our proposal for improving their navigation capabilities outdoors.

Methodology:

Technologies used:

A state-of-the-art CNN; state-of-the-art tracking algorithms: KCF [10], MedianFlow [12] and MOSSE [1]; YOLO

Results:
Title: Semi-supervised 3D Object Recognition through CNN Labeling

Author: José Carlos Rangel, Jesus Martínez-Gómez, Cristina Romero-González, Ismael García-Varea, Miguel Cazorla

Journal: Applied Soft Computing

Abstract:

Despite the success of convolutional neural networks (CNNs) in object recognition and classification, there are still some open problems to address when applying these solutions to real-world problems. Specifically, CNNs struggle to generalize under challenging scenarios, such as recognizing the variability and heterogeneity of instances belonging to the same category. Some of these difficulties are directly related to the input information; 2D-based methods, for example, still show a lack of robustness against strong lighting variations. In this paper, we propose to merge techniques using both 2D and 3D information to overcome these problems. Specifically, we take advantage of the spatial information in the 3D data to segment objects in the image and build an object classifier, and of the classification capabilities of CNNs to label each object image for training in a semi-supervised way. As the experimental results demonstrate, our model can successfully generalize for categories with high intra-class variability and outperform the accuracy of a well-known CNN model.

Methodology:

Initially, objects are detected in 3D images (encoded as point clouds) by means of a clustering algorithm. Then, each cluster is projected into its corresponding perspective image and labeled using a CNN-based lexical annotation tool. At the same time, the 3D data of each cluster are processed to extract a 3D descriptor. The set of descriptor-label pairs defines the training dataset that will serve to build a classification model, which carries out the effective object recognition process.
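
A minimal sketch of that pipeline, with the clustering, projection, CNN labeling and 3D-descriptor steps passed in as callables; the dummy stand-ins in the demo exist only to make the sketch run end to end and are not the paper's components. A scikit-learn SVM stands in for the final classification model.

```python
# Sketch of the semi-supervised training loop described above. Clustering,
# projection, CNN labeling and 3D-descriptor extraction are placeholders
# supplied by the caller, not implementations of the paper's components.
from sklearn.svm import SVC

def build_training_set(point_clouds, cluster_fn, project_fn, cnn_label_fn, descriptor_fn):
    X, y = [], []
    for cloud in point_clouds:
        for cluster in cluster_fn(cloud):                 # detect object candidates in 3D
            label = cnn_label_fn(project_fn(cluster))     # semi-supervised label from the CNN
            X.append(descriptor_fn(cluster))              # 3D descriptor of the cluster
            y.append(label)
    return X, y

def train_object_classifier(X, y):
    return SVC(kernel="rbf").fit(X, y)                    # stand-in final classifier

if __name__ == "__main__":
    import numpy as np
    rng = np.random.default_rng(0)
    clouds = [rng.normal(size=(100, 3)) for _ in range(4)]
    # Dummy stand-ins, only to make the sketch executable end to end.
    X, y = build_training_set(
        clouds,
        cluster_fn=lambda c: [c[:60], c[60:]],
        project_fn=lambda cl: cl[:, :2],
        cnn_label_fn=lambda img: "chair" if img.shape[0] > 50 else "table",
        descriptor_fn=lambda cl: cl.mean(axis=0),
    )
    print(train_object_classifier(X, y).predict([X[0]]))
```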

Results:

Best-case scenario results: in this scenario, the accuracy obtained by our proposed classification model is 93.5%, whereas the accuracy obtained on the same test dataset with the CNN model was 57.1%. Hence, our method outperforms the results produced by the CNN classifier.

Our proposal obtained a mean accuracy of 84.55% (σ = 4.90) while the state-of-the-art classifier obtained a mean accuracy of 84.40% (σ = 7.70). These results show that, in general conditions, our system performs similarly to a state-of-the-art classifier, even though it is trained with some mislabeled samples due to the semi-supervised nature of the proposal, while under adequate conditions it can clearly overcome the drawbacks of current state-of-the-art object classification systems, as illustrated by our best-case scenario in Section 4.3.

Title: Fast indoor scene description for blind people with multiresolution random projections

Author: Farid Melgani, Yakoub Bazi, Naif Alajlan

Journal: J. Vis. Commun. Image R.

Abstract:

Object recognition is a substantial need for blind and visually impaired individuals. This paper proposes a new multi-object recognition framework that coarsely checks the presence of multiple objects in a portable camera-grabbed image at a considered indoor site. The outcome is a list of objects that likely appear in the indoor scene, and such a description is meant to raise the blind person's awareness of his/her surroundings. The method relies on a library containing (i) a set of images represented by means of the Random Projections (RP) technique and (ii) their respective lists of objects, both prepared offline. Given an online shot image, its RP representation is generated and matched to the RP patterns of the library images, and it inherits the objects of the closest library image. Extensive experiments returned promising recognition accuracies and processing times that meet real-time standards.

Methodology:

Random projections for image representation

The input x is a one-dimensional signal that is projected column-wise onto the random matrix P; hence, each column of the matrix represents an element of the projection basis. The same result is obtained if a two-dimensional representation of the input signal and the projection elements is used. In particular, the input signal meant for an RP-based representation is a portable camera-grabbed image, so the projection elements consist of a set of random matrices of the same size as the image. In more detail, if M filters are adopted, then M inner products (between the image and each filter) are computed, producing M scalars whose concatenation forms the final compact RP representation of the image. As stated earlier, the input image undergoes an inner product with the templates (filters) of the adopted random matrix (also referred to as the measurement matrix); accordingly, the choice of the matrix entries has to be defined. From the literature, it emerges that the most popular matrix configuration is the one presented in [18].
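
A minimal numpy sketch of the RP representation and library matching just described: M random matrices of the same size as the image act as filters, the M inner products form the compact descriptor, and an online image inherits the object list of the nearest library image. Gaussian matrix entries are used here for simplicity; the paper adopts the configuration of [18].

```python
# Sketch: Random Projection (RP) image representation and nearest-library-image
# matching. M random matrices the size of the image act as filters; the M inner
# products form the compact descriptor.
import numpy as np

rng = np.random.default_rng(0)
H, W, M = 120, 160, 64                          # image size and number of filters
filters = rng.standard_normal((M, H, W))        # the random measurement matrices

def rp_descriptor(image):
    """Concatenate the M inner products <image, filter_m> into one vector."""
    return np.tensordot(filters, image, axes=([1, 2], [0, 1]))   # shape (M,)

# Offline library: RP descriptors plus the object list of each reference image.
library_images = [rng.random((H, W)) for _ in range(5)]
library_objects = [["door", "chair"], ["table"], ["printer"], ["sofa"], ["bin"]]
library_desc = np.stack([rp_descriptor(img) for img in library_images])

def describe_scene(query_image):
    """Match the query's RP descriptor to the library and inherit its objects."""
    q = rp_descriptor(query_image)
    nearest = np.argmin(np.linalg.norm(library_desc - q, axis=1))
    return library_objects[nearest]

print(describe_scene(library_images[2]))        # -> ['printer']
```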

Results:
Title: Recovering the sight to blind people in indoor environments with smart technologies

Authors: Mohamed L. Mekhalfi, Farid Melgani, Abdallah Zeggada, Francesco G.B. De Natale, Mohammed A.-M. Salem, Alaa Khamis

Journal: Expert Systems With Applications

Abstract:

Assistive technologies for blind people are showing fast growth, providing useful tools to support daily activities and to improve social inclusion. Most of these technologies focus mainly on helping blind people navigate and avoid obstacles; other works emphasize providing assistance in recognizing surrounding objects. Very few of them, however, couple both aspects (i.e., navigation and recognition). With the aim of addressing these needs, we describe in this paper an innovative prototype that offers the capabilities to (i) move autonomously and (ii) recognize multiple objects in public indoor environments. It incorporates lightweight hardware components (camera, IMU, and laser sensors), all mounted on a reasonably-sized integrated device to be placed on the chest. It requires the indoor environment to be 'blind-friendly', i.e., prior information about it should be prepared and loaded in the system beforehand. Its algorithms are mainly based on advanced computer vision and machine learning approaches, and the interaction between the user and the system is performed through speech recognition and synthesis modules. The prototype offers the user the possibility to (i) walk across the site to reach a desired destination, avoiding static and mobile obstacles, and (ii) ask the system through vocal interaction to list the prominent objects in the user's field of view. We illustrate the performance of the proposed prototype through experiments conducted in a blind-friendly indoor space equipped at our Department premises.

Methodology:

Results:

From the current literature on blind assistance, it appears clearly that most works focus on navigation aid, while few others deal with the object recognition aspect. It also emerges that little attention has been paid to integrating both guidance and recognition capabilities within a single platform. In this context, we have proposed in this paper a prototype that serves both navigation assistance and object recognition. To the best of our knowledge, just one attempt of this kind has been introduced before, namely López-de-Ipiña et al. (2011), which uses RFID and QR codes for navigation and recognition, respectively; both RFID and QR codes, however, impose a distance constraint for correct operation of the system. We think that our contribution represents another important step toward the development of more comprehensive assistive technologies for blind people. It incorporates simple hardware components, all mounted on the chest of the blind user. The indoor environment needs to be 'blind-friendly', which means that prior information about it, in the form of a digital map of the site and a library of multi-labeled images, should be prepared and loaded in the system beforehand.
