MICROPROCESSOR BASED ASSISTIVE TEXT READER FOR VISUALLY IMPAIRED
A Major Project Report
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
Electrical and Electronics Engineering
by
A. BHAVANI REDDY (15071A0262)
B. N. V. J. A. K SAI SASANK (15071A0278)
M. SHRIYA REDDY (15071A0290)
N. L. B. K. ARVIND REDDY (15071A02A0)
VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING & TECHNOLOGY
AN AUTONOMOUS INSTITUTE
Bachupally, Nizampet (S.O), Hyderabad-500090
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
CERTIFICATE
This is to certify that A. Bhavani Reddy (15071A0262), B. N. V. J. A. K. Sai Sasank
(15071A0278), M. Shriya Reddy (15071A0290), and N. L. B. K. Arvind Reddy (15071A02A0) have
successfully completed their major project work, entitled “MICROPROCESSOR BASED ASSISTIVE
TEXT READER FOR VISUALLY IMPAIRED”, at the Department of Electrical and Electronics
Engineering, VNR VJIET, Hyderabad, in fulfillment of the requirements for the award of the
Bachelor of Technology degree during the academic year 2018-19.
This work was carried out under our supervision and has not been submitted to any other
university/institute for the award of any degree/diploma.
DECLARATION
ACKNOWLEDGEMENT
The success and final outcome of this project required a great deal of guidance and assistance
from many people, and we are extremely privileged to have received it throughout the completion
of our project. All that we have done was possible only because of such supervision and
assistance, and we will not forget to thank them.
We are extremely grateful to Dr. Poonam Upadhyay, Professor and HOD, Department of Electrical
and Electronics Engineering, VNRVJIET, Hyderabad.
We are extremely thankful to Ms. T. Hari Priya, Assistant Professor and Project Coordinator,
Department of Electrical and Electronics Engineering, for her support and coordination
throughout the project.
We are extremely thankful to Dr. Rashmi Kapoor, Assistant Professor and Project Guide,
Department of Electrical and Electronics Engineering, for her constant guidance, encouragement
and moral support throughout the project.
We would be failing in our duty if we did not gratefully acknowledge the authors of the
references and other literature referred to in this project.
Finally, we express our gratitude and appreciation to all the staff members and friends for all
the help and coordination extended in bringing out this project successfully on time.
ABSTRACT
Technological advances in recent years have had an adverse impact on the human visual
system in day-to-day life, and more and more people suffer from visual impairments. Existing
reading-assistance systems perform well only on document images with simple backgrounds and
standard fonts. To help people affected by visual impairment, we propose an assistive device
that eases their life by converting the text in front of them into speech from a captured image.
A text-to-speech converter transforms normal-language text into speech. However, this
conversion is not as easy for a machine as it is for a human, because extracting text from
colored images with varied backgrounds is a challenging problem in computer vision.
Text-to-speech conversion here is a method that detects and reads the English letters and
numbers present in an image using OCR and converts them into voice. Converting the captured
input image to the final speech output is achieved using deep learning. Such a system helps
visually impaired people interact effectively with product packages and computers through a
vocal interface.
CONTENTS
Abstract
List of Figures
List of Screenshots
List of Abbreviations
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Literature Survey
1.3 Problem formulation
1.4 Objective of the project
1.5 Scope of the project
CHAPTER 2: ANALYSIS AND DESIGN
2.1 Introduction
2.2 Existing system
2.3 Proposed system
2.4 Software requirement
2.5 Techniques used
2.6 Hardware requirement
2.7 Installation procedure for EAST Text Detector
CHAPTER 3: IMPLEMENTATION
3.1 Code flow
3.2 Screenshots
3.3 Results
CHAPTER 4: CONCLUSIONS AND SCOPE FOR FUTURE ENHANCEMENT
4.1 Conclusions
4.2 Scope for future enhancement
REFERENCES
LIST OF FIGURES
LIST OF SCREENSHOTS
CHAPTER 1: INTRODUCTION
1.1 Introduction
Technology in the modern day has progressed to such an extent that nearly every
complex problem has a well-defined solution. With this advancement, however, the adverse
effects on the human lifestyle have also increased. The repetitive use of smart devices and
the gradual increase in radiation exposure have an impact on the whole ecosystem.
Recent statistics show that many people face health issues due to excessive use of
digital devices. The reasons that drive a person to use these devices are many, not least
that smart devices are now available in most parts of the world. Technology helps us improve
efficiency and execute tasks faster, but its excessive use leads to several health
consequences.
One consequence of the excessive use of smart devices is the impact on the human
visual system. People feel stressed after prolonged use of mobile phones and computers, and
the blue light emitted by these devices can contribute to gradual vision loss in the long run.
Visual impairment is thus a major health challenge for modern users of smart devices.
A person identified with this problem either needs the support of another person to
manage daily chores and understand the surroundings, or is restricted to one place. Such a
person often has no aid to interact with the outside world or the freedom to move around.
It would be of great help if a specific portable tool were available that the person
could carry along to understand the things around them. The current project considers a
scenario in which a person suffering from a visual impairment needs a tool to carry around
and interpret information that cannot be seen. The objective of the current project is to
deliver a fully functional product with which the person can carry the tool, capture an
image, and receive a voice description of what the picture is about.
1.2 Literature Survey
Assistive technology, understood as technology designed for individuals with some
kind of impairment, is a vital field evolving at a fast pace, as it draws on many disciplines
and is driven largely by technology itself. Assistive technology for visually impaired and
blind people is concerned with the technologies, devices, services and software applications
that enable them to overcome physical, social and accessibility barriers and lead independent
and productive lives. Vision is an extremely vital sensory channel, and its loss affects the
performance of almost all daily living and instrumental activities, hampering an individual's
quality of life. A technology that provides ease, accessibility and safety, leading to an
improved quality of life, therefore has a very relevant social impact. This has driven novel
research across various disciplines, from cognitive intelligence and neuroprosthetics to
computer vision. Recent advances in computer vision, multisensory research and wearable
technology have aided the development of numerous assistive-technology solutions. This
chapter discusses the contributions made to assistive reading technology for blind or
visually impaired people, the methodologies adopted, and the related literature on text
detection, text recognition and speech synthesis.
Kwang In Kim and Keechul Jung, in “Texture-based approach for text detection in images
using support vector machines and continuously adaptive mean shift algorithm” [1], pose text
detection as a texture-classification problem and adopt a Support Vector Machine (SVM) as a
trainable classifier. The SVM analyzes textural properties and separates text from non-text
regions. The raw SVM output is not immediately ready for further post-processing, such as
OCR, owing to noisy and spurious regions, so CAMSHIFT is applied for text-chip generation,
setting a rectangular window for searching text. This work is based on traditional methods
predating deep learning, and the proposed classifier encounters problems when detecting very
small or low-contrast text.
Chucai Li et al., in “Portable Camera-Based Assistive Text and Product Label Reading
From Hand-Held Objects for Blind Persons” [2], developed a system to automatically localize
the text-containing regions of an image, i.e. the region of interest. They propose a unique
text-localization algorithm that trains gradient features of stroke orientations and
edge-pixel distributions in an AdaBoost model. The characters contained in the localized text
regions are then binarized and recognized using optical character recognition (OCR) software.
R. Lienhart and A. Wernicke, in “Localizing and segmenting text in images and
videos” [3], proposed a method to separate text regions from non-text regions using a complex
multi-layer feed-forward neural network trained to localize text at a fixed position and
scale. The network output at all scales and positions is then integrated into one text
saliency map, which serves as the starting point for candidate text regions.
Chucai Li and Ying Li Tian, in “Text string detection from natural scenes by
structure-based partition and grouping” [4], demonstrate a framework for text detection that
partitions the image using a gradient-based method, in which paths of pixel couples are
connected, together with a color-uniformity method; adjacent-character grouping is then
performed to group multiple text regions appropriately.
Viswanath Sivakumar et al., in “Rosetta: Large Scale System for Text Detection and
Recognition in Images”, proposed a Faster R-CNN model for text detection and a fully
convolutional model for text recognition trained with a CTC loss.
Max Jaderberg et al., in “Synthetic Data and Artificial Neural Networks for Natural
Scene Text Recognition” (Computer Vision and Pattern Recognition, Cornell University),
demonstrate a framework that recognizes text on the complete word image holistically, unlike
traditional character-based recognition systems. This was the first work to propose deep
neural network models trained solely on data generated by a synthetic text-generating engine.
Zhanzhan Cheng et al., in “Focusing attention towards text recognition in natural
images” (IEEE Xplore), develop an attention-based text-recognition model as an
encoder-decoder framework. In the encoding stage, an image is converted into a sequence of
feature vectors by a CNN or LSTM, and each feature vector corresponds to a region of the
input image, called an attention region. In the decoding stage, glimpse vectors are
synthesized to achieve alignment between attention regions and ground-truth labels; an RNN
then generates target characters based on the glimpse vectors.
1.3 Problem Formulation
Technology is astonishing: every new product in the market grabs everybody's
attention. Over the past decade there has been constant development in the field of computers
and smart gadgets, each generation improving on the last. Technology helps us stay connected
to those near us, execute tasks faster, and increase human efficiency; on the other hand, it
also has a negative impact on the human lifestyle.
People spend an average of 10-15 hours a day using technology, and children are
increasingly addicted to smartphones because of multiplayer games and other addictive
applications. This heavy use of technology has a direct impact on human health, especially
on the visual system.
Studies suggest that the blue light emitted by smart gadgets accelerates blindness.
It has become a common habit to use mobile phones at night in low light, and of all
frequencies, blue light has the strongest impact on the human eye. Researchers at the
University of Toledo reported that repeated exposure of the eyes to digital devices can cause
vital molecules in the retina to turn into cell killers. Macular degeneration is a major
cause of blindness in America and many other countries.
Computer Vision Syndrome, or Digital Eye Strain, describes the vision-related
problems caused by excessive use of digital devices. People experience vision problems with
prolonged use of smart gadgets, and the intensity of the visual impairment depends on the
duration of technology use.
The severity of vision problems also depends on each individual's constitution and
living habits. Nevertheless, prolonged use of these digital devices causes visual impairments
regardless of age, gender and location. Individuals suffering from vision problems cannot
keep up with the real world without help in understanding their surroundings; once identified
with the problem, they cannot go out into the fast-paced world and synchronize with it.
The problem statement of the current project therefore considers the sample space of
all individuals suffering from visual impairments who need a tool that helps them identify,
understand and comprehend the things around them.
1.4 Objective of the Project
The objective of the current project is to apply concepts of machine learning, deep
learning, neural networks and image processing to convert the text captured in an image into
sound, giving the user real-time feedback through a voice assistant.
Existing products convert the text captured from an image into machine-level language
and present a text output on a visual display screen. The current work extends this with a
voice aid that helps the user understand the things around them: the user can carry the
product to any location, capture the image in front of them using a button, and receive a
voice output describing it through a speaker interfaced with the Raspberry Pi.
The mechanical design and enclosure of the device are not taken up in the current
project, owing to user-safety concerns and the assembly constraints of the components; the
project instead focuses on the functionality of each component and demonstrates its usage in
a representative scenario. The final prototype displayed after project completion comprises
a setup in which the person can capture an image against any background and receive a
corresponding voice feedback.
1.5 Scope of the Project
The scope of the current project includes the assembly of three major components:
the Raspberry Pi, the camera and the speaker. The components are programmed in Python and
interfaced on the Raspbian operating system running on the Raspberry Pi.
OpenCV (Open Source Computer Vision), a set of libraries supporting real-time
computer vision, is installed on the Raspberry Pi to perform image processing on the
pictures captured with the camera.
The camera used in the project is the Pi Camera, which is interfaced with the
Raspberry Pi using specific commands. The speaker can be any standard audio output device,
such as headphones or earphones, through which the user listens to the voice output.
The capture of the image and the extraction of its text on the Raspberry Pi is done
using OCR technology. Optical Character Recognition (OCR) is a technique used to capture the
text in an image and present it as an understandable language.
The extraction of text from the image and its conversion to speech is achieved
through the TTS technique. Text-to-Speech is a process in which normal-language text is
converted into speech by a speech synthesizer, which in our case runs on the Raspberry Pi.
The assembly of all three components, together with the OpenCV software, makes the
prototype fully functional and helps the user understand the image in front of them.
CHAPTER 2: ANALYSIS AND DESIGN
2.1 Introduction
Technological advancements have been helping people with disabilities by making
their lives simpler, more independent and livelier, thanks largely to artificial intelligence
in combination with assistive and wearable technology.
Assistive technology aids disabled people who face difficulties performing daily
tasks by providing assistive or adaptive devices and software. These devices give users
greater independence and improved functionality by allowing them to perform activities they
were previously unable to conduct. In general, assistive technology enhances the functional
capability of users suffering from mobility, cognitive, visual, eating or hearing
impairments. Here, we present a device that helps visually impaired people by producing
speech output from pictures taken of the vicinity.
2.2 Existing system
The existing technologies for visually impaired people include the Smart Cane,
Sentri, OrCam MyEye 2.0 and Aira.
The Smart Cane can adjust its level of object detection and uses a sensor to keep
people in the surroundings informed. GPS is used to navigate the visually impaired user
through vibration, and audio feedback is also provided. It detects objects in real time using
the sensor in the cane and describes them, but it cannot read out text from a natural scene.
Sentri is a 360-degree headband that senses distances around the user and navigates
with vibration motors: the greater the vibration, the nearer the user is to an obstacle. It
leaves the hands free, but gives no information about text.
OrCam MyEye 2.0 is among the best assistive devices for visually impaired people. It
performs facial recognition, product recognition, color identification, reading of text
printed on any surface, and money identification, all controlled with simple gestures. Its
only disadvantage is its cost, around $4,500.
Aira provides instant access to information by connecting the user to one of its
trained agents. It consists of smart glasses with a built-in camera providing a 120-degree
view; the video quality is high enough for the agents to read text efficiently without
capturing an image. It costs $29 per month for 30 minutes.
2.3 Proposed system
We propose a system that enables a visually impaired person to understand the text
on signboards, banners and hoardings. The system captures an image of the surroundings using
the camera attached to the Raspberry Pi; the image is processed internally, and the speech
output is given through the speaker or earpiece connected to the board.
There are two main blocks: an image-processing block and a voice-processing block. In
the image-processing block, the image captured by the camera is converted to text. In the
voice-processing block, the output of the previous block, i.e. the text extracted from the
captured image, is converted to speech. For the user's convenience, the voice may be set to
a masculine or feminine tone.
● Once the image file has been converted to a text file, it is further converted into speech
using a text-to-speech synthesizer.
● There are two ways to do this: one is text-to-phoneme conversion, where the text is
compared with the words present in a pronunciation dictionary to produce the output; the
other is a learning-based speech-output approach.
● Deep-learning-based synthesizers use deep neural networks trained on recorded speech data;
some DNN-based speech synthesizers are achieving nearly the quality of the human voice.
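The dictionary-lookup route in the second bullet can be sketched in a few lines of Python. The miniature phoneme dictionary below is invented purely for illustration; a real system would use a full pronunciation lexicon (such as CMUdict) or a learned model.

```python
# Toy sketch of dictionary-based text-to-phoneme conversion. The phoneme
# dictionary here is a made-up miniature example, not a real lexicon.
PHONEME_DICT = {
    "no": ["N", "OW"],
    "smoking": ["S", "M", "OW", "K", "IH", "NG"],
}

def text_to_phonemes(text):
    """Look each word up in the dictionary; unknown words are spelled out."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PHONEME_DICT.get(word, list(word.upper())))
    return phonemes

print(text_to_phonemes("NO SMOKING"))
# ['N', 'OW', 'S', 'M', 'OW', 'K', 'IH', 'NG']
```

The learning-based route replaces the dictionary with a trained model that predicts the pronunciation, which is what the DNN synthesizers in the last bullet do.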
2.4 Software Requirement
The software which are needed to carry out this project are
● Python
● OpenCV
● Tesseract OCR
● Raspbian
Python is a high-level programming language where the instructions are executed freely and
directly. Due to its simple, compatible, cost-free, object oriented, easily understandable nature
python is preferred.
OpenCV (Open Source Computer Vision) is a library of programming functions that
includes various artificial neural networks and optimized algorithms; it is generally used
for detecting objects, faces, etc. Here, the OpenCV EAST text detector is used to detect text
in natural scenes and to acquire the region of interest.
Tesseract is one of the top OCR engines in terms of accuracy. Tesseract OCR extracts
dark text from a light background or light text from a dark background, where dark text means
black text, light text means white text, and similarly for the backgrounds. The image is
first converted to a grey image, and a binary image is obtained from the grey image. This
binary image is broken into pieces according to pitch. Fixed-pitch text (a monospaced font,
in which every character occupies the same horizontal width) is broken first. For
non-fixed-pitch text, in which the horizontal widths of the letters and the spaces between
them differ, the widths are measured, compared with threshold values, and the nearest-valued
letter or character is chosen. The letters thus identified are formed into words, and the
words are properly associated to form phrases or sentences.
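The grey-to-binary step described above can be illustrated with plain NumPy. This is a sketch only: a real pipeline would use OpenCV's cvtColor and threshold functions, and the threshold value of 128 here is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch of the grey-to-binary conversion: pixels darker than the
# threshold (i.e. likely dark text on a light background) become 1,
# everything else becomes 0.
def to_binary(gray, threshold=128):
    """Map a greyscale image to a binary image of text pixels."""
    return (np.asarray(gray) < threshold).astype(np.uint8)

gray = np.array([[250, 30, 240],
                 [20, 245, 25]])
print(to_binary(gray))
# [[0 1 0]
#  [1 0 1]]
```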
2.5 Techniques Used
Artificial intelligence is constantly bridging the gap between humans and machines,
helping a machine to see the world the way a human does. This is achieved with deep learning,
in which an output is predicted for a given input. Here we use a Convolutional Neural Network
(CNN), a deep-learning algorithm that takes an input image and assigns importance to
different characteristics of it, which helps in differentiating one image from another.
A series of convolution and pooling operations is performed, followed by fully
connected layers, to produce the output for a given input.
The convolution layer is the main building block of a CNN. Convolution means merging
two sets of data: a filter, also called a kernel, is convolved with the input image to
produce a feature map.
Fig. 2.5.2 An example showing input and filter    Fig. 2.5.3 Filter sliding over the input
Consider fig. 2.5.2, where the blue area is the input and the green area is the
filter. The filter slides over the input, element-wise matrix multiplication is performed,
and the sum gives the feature map shown in fig. 2.5.3. The example above is in 2D, but in
general the image is treated as a 3D matrix. Usually many filters slide over the input,
producing different feature maps that are combined into a single output of the convolution
layer. The stride tells us by how many positions the filter moves over the input; generally
the stride is 1, and a higher stride can be chosen for less overlap. In fig. 2.5.3 the
feature map is smaller than the input, so padding is used, bordering the input with zeros;
the dimensions of the input image and the feature map then match, and the possibility of the
image shrinking is eliminated.
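A minimal sketch of the convolution, stride and padding behaviour described above. This is a naive loop implementation for clarity, not the optimized routine a CNN library would use (and, as in CNN practice, it actually computes a cross-correlation):

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Naive 2D convolution: slide the kernel over the (optionally
    zero-padded) image, multiply element-wise and sum."""
    if pad:
        image = np.pad(image, pad)          # border the input with zeros
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)          # (3, 3): the feature map shrinks
print(conv2d(image, kernel, pad=1).shape)   # (5, 5): padding restores the size
print(conv2d(image, kernel, stride=2).shape)  # (2, 2): larger stride, less overlap
```

The shape arithmetic follows directly from the sliding-window picture: output size = (input + 2·pad − kernel) // stride + 1.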
Pooling is used to reduce the dimensions: the height and width of the feature map are
reduced while the depth is kept the same. There are two types of pooling:
● max pooling
● average pooling
Max pooling, the more commonly used type, takes the maximum value in each pooling
window. Average pooling takes all the values in a pooling window and computes their average,
as shown in fig. 2.5.4.
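The two pooling variants can be sketched as follows, assuming non-overlapping windows (stride equal to the window size), which is the common configuration:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: split x into size-by-size windows and
    reduce each window to its max or its mean."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h * size, :w * size].reshape(h, size, w, size).swapaxes(1, 2)
    if mode == "max":
        return windows.max(axis=(2, 3))
    return windows.mean(axis=(2, 3))

fmap = np.array([[1., 3., 2., 4.],
                 [5., 7., 6., 8.],
                 [9., 2., 1., 0.],
                 [4., 6., 3., 5.]])
print(pool2d(fmap, mode="max"))   # [[7. 8.] [9. 5.]]
print(pool2d(fmap, mode="avg"))   # [[4.   5.  ] [5.25 2.25]]
```

Either way, a 4×4 feature map is reduced to 2×2 while the depth (not shown in this 2D sketch) would be preserved.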
Fully connected layers are added after the convolution and pooling layers to complete
the CNN architecture. The output of the convolution and pooling layers is 3D, but the input
to a fully connected layer is 1D, so the output of the last pooling layer is flattened to
make it 1D. In fig. 2.5.1 the hidden layers are the convolution and pooling layers, and the
output layer is considered a fully connected layer.
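The flattening step can be seen directly in NumPy; the 5×5×16 shape below is only an illustrative example of a last-pooling-layer output:

```python
import numpy as np

pooled = np.zeros((5, 5, 16))   # example 3D output of the last pooling layer
flat = pooled.reshape(-1)       # flatten to 1D for the fully connected layer
print(flat.shape)               # (400,)
```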
Text detection is a tough task because:
● The light may be so bright that it saturates the image, or it may be insufficient.
● The surface bearing the text may be reflective, making the image hard to capture because
of reflection and refraction.
● The resolution of the camera may be below the standard value.
● Compared with a scanner, a camera has high sensor noise.
● The images may be blurred at times.
● The text may be at an angle, which makes it hard to detect.
● The text may lie on non-planar objects, which makes it difficult to recognize because the
bounding boxes are irregular in shape.
We use the EAST deep-learning text detector, which can localize text even when it is
reflective, at an angle or blurred. EAST stands for Efficient and Accurate Scene Text
detection pipeline.
2.6 Hardware requirement
● Raspberry Pi
● Power supply
● Camera
● Speaker
● Mouse/ Push button
● HDMI cable
The Raspberry Pi is a small, credit-card-sized computer. A mouse and keyboard can be
used to operate it when it is connected to a display. We chose the Raspberry Pi because it
supports Python, the language in which the code is written, and because it is low-cost and
portable. Here we use the Raspberry Pi 3 Model B+.
Specifications:
Raspberry PI 3B+:
Pi camera v2:
Power supply:
● PSU Current Capacity: 2.5 A
● Total Peripheral Current Draw from USB: 1.2 A
● Active Current Consumption from Bare-Board: 500 mA
Speaker:
Display Output:
2.7 Installation procedure for EAST Text Detector
Before running the code from the command window, certain software is necessary for
efficient running of the EAST Text Detector. It is as follows:
1. Python 2.7
2. OpenCV 3.4.x to 4.0.x
3. Tesseract OCR
As these are open-source software packages, they are free of cost and can be used on
various platforms such as Windows, macOS, Linux, etc.
The links for the download Pages are present below:
1. Python:
https://www.python.org/ftp/python/2.7.13/python-2.7.13.msi
2. OpenCV 4.0.1
https://opencv.org/opencv-4-0-0.html
3. Tesseract OCR
https://github.com/tdhintz/tesseract4win64
After installing the software above and adding it to the system path, we move on to
the next step, i.e. the installation of the Python modules. These are required to run the
preprocessed code present in their respective modules for an effective and efficient run
time of the program.
1. imutils
2. numpy
3. matplotlib
4. pytesseract
5. tesseract
6. pywin32
7. pyttsx
To install these modules, use the command
>> pip install *module name*
e.g. >> pip install imutils
After the module installation, proceed with the extraction of the downloaded OpenCV
archive, then follow the steps below:
1. Go to the opencv/build/python/2.7 folder.
2. Copy or move cv2.pyd to C:/Python27/lib/site-packages.
3. Open Python IDLE and type the following line in the Python terminal to import the
installed package:
>> import cv2
Some necessary files, such as the codes and the image folder (where the importing and
exporting of images is done), must be moved into the Python27 folder. After following the
procedure above, open the Command Prompt and type
cd C:/Python27
#This command changes directory from the user folder to the Python folder
#to access the files present in it.
Then type
python *File name*.py --east frozen_east_text_detection.pb --image *image folder name* /
*name of the image*.*extension*
The code then runs and displays the detected text, along with a speech output for
each individual word.
CHAPTER 3: IMPLEMENTATION
3.1 Code Flow
The code passes through the various stages shown in the flowchart; these represent
how the code operates at each level and the complexity involved in each stage.
In the first stage, the modules are imported: cv2, imutils, pytesseract, pyttsx, etc.
In the second stage, the numbers of rows and columns are obtained from the scores
volume, and the lists of bounding-box rectangles and confidence scores are initialized. The
rows and columns are then looped over to derive the potential bounding-box coordinates that
surround the text. The rotation angle is computed, along with the width and height of the
bounding box, and the results are returned as a tuple of rects and confidences: rects holds
the box geometry in a compact form, and each value in confidences corresponds to a rectangle
in rects.
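A simplified sketch of this decoding stage is shown below. It follows the loop structure described above but, for clarity, omits the rotation-angle handling of the real EAST decoder, and it runs on small synthetic score and geometry arrays rather than real network output:

```python
import numpy as np

def decode_predictions(scores, geometry, min_conf=0.5):
    """Loop over the score map, keep cells above min_conf, and turn the
    four geometry distances (top, right, bottom, left) into axis-aligned
    boxes. Rotation handling is omitted for clarity."""
    rects, confidences = [], []
    num_rows, num_cols = scores.shape
    for y in range(num_rows):
        for x in range(num_cols):
            if scores[y, x] < min_conf:
                continue
            # each cell of the output maps back to 4 pixels in the input
            ox, oy = x * 4.0, y * 4.0
            top, right, bottom, left = geometry[:, y, x]
            rects.append((int(ox - left), int(oy - top),
                          int(ox + right), int(oy + bottom)))
            confidences.append(float(scores[y, x]))
    return rects, confidences

scores = np.zeros((2, 2)); scores[1, 1] = 0.9
geometry = np.ones((4, 2, 2)) * 2.0
print(decode_predictions(scores, geometry))
# ([(2, 2, 6, 6)], [0.9])
```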
Moving on to the third stage, the following command-line arguments are required:
-i, --image: the path to the input image.
-east, --east: the path to the pre-trained EAST text detector.
-c, --min-confidence: the minimum probability of a detected text region.
-w, --width: the resized image width; our detector requires a multiple of 32.
-e, --height: same as the width, but for the height; again, a multiple of 32 is
required for the resized height.
-p, --padding: the (optional) amount of padding to add to each ROI border; try 5% or
10% (and so on) if you find that your OCR result is incorrect.
In the fourth stage, the output layers of the neural network are named: the output-
probabilities layer and the box-coordinates layer.
In the fifth, sixth and seventh stages, a blob (Binary Large OBject) is created and
passed through the neural network to obtain the scores and geometry. These are decoded by the
decode_predictions function, and then non-maximum suppression (NMS) is applied through the
imutils package, which effectively keeps the most likely text regions and eliminates the
other overlapping regions.
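Non-maximum suppression as applied in this stage can be sketched as the classic greedy procedure below. This is a simplified stand-in for the imutils implementation, and the overlap threshold is an illustrative choice:

```python
import numpy as np

def non_max_suppression(boxes, scores, overlap_thresh=0.4):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it beyond overlap_thresh, and repeat on the survivors."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # indices, best score first
    keep = []
    while len(order):
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                (boxes[order[1:], 3] - boxes[order[1:], 1]))
        order = order[1:][inter / area <= overlap_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(non_max_suppression(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

Here the second box overlaps the first heavily and is suppressed, while the distant third box survives.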
In the eighth and ninth stages, the results list that will contain our OCR bounding
boxes and text is initialized; we then begin looping over the boxes, where we:
scale the bounding boxes based on the previously computed ratios,
pad the bounding boxes,
and finally extract the padded ROI (region of interest).
In the remaining stages, up to the final one, Tesseract OCR is used to obtain the
OCR'd text. To apply Tesseract v4 we must supply a language, an OEM flag of 1, indicating
that we wish to use the LSTM neural-net model for OCR, and finally a PSM value of 7, which
implies that we treat the ROI as a single line of text; the OCR'd text is then appended to
the list of results.
Thus, the code flows from stage to stage as shown in the figure above and completes
after the user presses any key.
3.2 SCREENSHOTS
SS.3.3.3. Detection of NO SMOKING sign board
3.3 Results
References
[1] Kwang In Kim, Keechul Jung, “Texture-based approach for text detection in images using
support vector machines and continuously adaptive mean shift algorithm,” IEEE Transactions
on Pattern Analysis and Machine Intelligence.
[2] Chucai Li et al., “Portable Camera-Based Assistive Text and Product Label Reading from
Hand-Held Objects for Blind Persons,” IEEE Transactions on Mechatronics, June 2014.
[3] R. Lienhart and A. Wernicke, “Localizing and segmenting text in images and videos,”
IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, pp. 256-268,
2002.
[4] Chucai Li and Ying Li Tian, “Text string detection from natural scenes by structure-based
partition and grouping,” IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2594-2605,
Sep. 2011.
Appendix:
1. Code for Text Recognition
# USAGE
# python text_recognition.py --east frozen_east_text_detection.pb --image
images/example_01.jpg
# python text_recognition.py --east frozen_east_text_detection.pb --image
images/example_04.jpg --padding 0.05
# initialize the pyttsx text-to-speech engine
k = pyttsx.init()
# loop over the number of columns
for x in range(0, numCols):
# if our score does not have sufficient probability,
# ignore it
if scoresData[x] < args["min_confidence"]:
continue
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", type=str,
help="path to input image")
ap.add_argument("-east", "--east", type=str,
help="path to input EAST text detector")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
help="minimum probability required to inspect a region")
ap.add_argument("-w", "--width", type=int, default=320,
help="nearest multiple of 32 for resized width")
ap.add_argument("-e", "--height", type=int, default=320,
help="nearest multiple of 32 for resized height")
ap.add_argument("-p", "--padding", type=float, default=0.0,
help="amount of padding to add to each border of ROI")
args = vars(ap.parse_args())
# set the new width and height and then determine the ratio in change
# for both the width and height
(newW, newH) = (args["width"], args["height"])
rW = origW / float(newW)
rH = origH / float(newH)
# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
"feature_fusion/Conv_7/Sigmoid",
"feature_fusion/concat_3"]
# construct a blob from the image and then perform a forward pass of
# the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)
# wish to use the LSTM neural net model for OCR, and finally
# (3) a PSM value, in this case, 7 which implies that we are
# treating the ROI as a single line of text
config = ("-l eng --oem 1 --psm 7")
text = pytesseract.image_to_string(roi, config=config)
# add the bounding box coordinates and OCR'd text to the list
# of results
results.append(((startX, startY, endX, endY), text))
# strip out non-ASCII text so we can draw the text on the image
# using OpenCV, then draw the text and a bounding box surrounding
# the text region of the input image
text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
output = orig.copy()
cv2.rectangle(output, (startX, startY), (endX, endY),
(0, 0, 255), 2)
cv2.putText(output, text, (startX, startY - 20),
cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 3)
2. Code for live tracking of text in video
# USAGE
# python text_detection_video.py --east frozen_east_text_detection.pb
continue
ap.add_argument("-w", "--width", type=int, default=320,
help="resized image width (should be multiple of 32)")
ap.add_argument("-e", "--height", type=int, default=320,
help="resized image height (should be multiple of 32)")
args = vars(ap.parse_args())
# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
"feature_fusion/Conv_7/Sigmoid",
"feature_fusion/concat_3"]
# if a video path was not supplied, grab the reference to the web cam
if not args.get("video", False):
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
time.sleep(1.0)
frame = vs.read()
frame = frame[1] if args.get("video", False) else frame
# construct a blob from the frame and then perform a forward pass
# of the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(frame, 1.0, (newW, newH),
(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)
endY = int(endY * rH)