
MICROPROCESSOR BASED ASSISTIVE TEXT READER FOR

VISUALLY IMPAIRED
A Major Project Report
Submitted in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
Electrical and Electronics Engineering
by
A. BHAVANI REDDY (15071A0262)
B. N. V. J. A. K SAI SASANK (15071A0278)
M. SHRIYA REDDY (15071A0290)
N. L. B. K. ARVIND REDDY (15071A02A0)

Under the guidance of


Dr. Rashmi Kapoor
Assistant Professor (EEE Dept.)

Department of Electrical & Electronics Engineering


VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING & TECHNOLOGY
AN AUTONOMOUS INSTITUTE
Bachupally, Nizampet (S.O), Hyderabad
April, 2019.

VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING & TECHNOLOGY
AN AUTONOMOUS INSTITUTE
Bachupally, Nizampet (S.O), Hyderabad-500090
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

CERTIFICATE
This is to certify that A. Bhavani Reddy (15071A0262), B. N. V. J. A. K. Sai Sasank
(15071A0278), M. Shriya Reddy (15071A0290), N. L. B. K. Arvind Reddy (15071A02A0), have
successfully completed their major project work, entitled “MICROPROCESSOR BASED ASSISTIVE
TEXT READER FOR VISUALLY IMPAIRED”, at the Electrical and Electronics Engineering
Department of VNR VJIET, Hyderabad, in partial fulfillment of the requirements for the
award of the Bachelor of Technology degree during the academic year 2018-19.
This work was carried out under our supervision and has not been submitted to any other
university/institute for the award of any degree/diploma.

Signature of Project Guide: Signature of Head of the Department:


Dr. Rashmi Kapoor Dr. Poonam Upadhyay
Assistant Professor Professor
Department of Electrical and Department of Electrical and
Electronics Engineering Electronics Engineering
VNR VJIET VNR VJIET

VNR VIGNANA JYOTHI INSTITUTE OF ENGINEERING & TECHNOLOGY
AN AUTONOMOUS INSTITUTE
Bachupally, Nizampet (S.O), Hyderabad-500090
DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

DECLARATION

We hereby declare that the major project entitled “MICROPROCESSOR BASED
ASSISTIVE TEXT READER FOR VISUALLY IMPAIRED”, submitted in partial
fulfillment of the requirements for the award of Bachelor of Technology in Electrical and
Electronics Engineering at VNR Vignana Jyothi Institute of Engineering and Technology, an
autonomous institute, Hyderabad, is an authentic work and has not been submitted to any
other university or organization for the award of any degree/diploma.

A. Bhavani Reddy B. Sai Sasank M. Shriya Reddy N. Arvind Reddy

15071A0262 15071A0278 15071A0290 15071A02A0

VNR VJIET VNR VJIET VNR VJIET VNR VJIET

ACKNOWLEDGEMENT

The success and final outcome of this project required a lot of guidance and assistance from many
people and we are extremely privileged to have got this all along the completion of our project.
All that we have done is only due to such supervision and assistance and we would not forget to
thank them.

We are extremely grateful to Dr. Poonam Upadhyay, Professor and HOD, Department of Electrical
and Electronics Engineering, VNRVJIET, Hyderabad, for her support and encouragement.

We are extremely thankful to Ms. T. Hari Priya, Assistant Professor, Project coordinator,
Department of Electrical and Electronics Engineering, for her coordination and support
throughout the project.

We are extremely thankful to Dr. Rashmi Kapoor, Assistant Professor, Project guide, Department
of Electrical and Electronics Engineering, for constant guidance, encouragement and moral
support throughout the project.

We would be failing in our duty if we did not gratefully acknowledge the authors of the
references and other literature referred to in this project.

Finally, we express our gratitude and appreciation to all the staff members and friends for all the
help and coordination extended in bringing out this project successfully and on time.

ABSTRACT

The technological advances of recent years have had an impact on the human visual
system in day-to-day life. As this progressed, more and more people began suffering from visual
impairments. Existing reading-assistance systems perform well only on document images with
simple backgrounds and standard fonts. To help the people who suffer from such impairments, we
would like to build an assistive device that eases their life by converting the text in front of them
to speech from a captured input image. This text-to-speech converter turns normal language text
into speech. However, text-to-speech conversion is not as easy for a machine as it is for a human,
because text extraction from coloured images with varied backgrounds is a challenging task in
computer vision.

Text-to-speech conversion here is a method that scans and reads the English letters and
numbers present in the image using the OCR technique and changes them to voice. Converting the
captured input image to the final speech output is achieved using deep learning. This kind of system
helps visually impaired people to interact with product packages and computers effectively through
a vocal interface.

CONTENTS

Abstract
List of Figures
List of Screenshots
List of Abbreviations

CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Literature Survey
1.3 Problem formulation
1.4 Objective of the project
1.5 Scope of the project

CHAPTER 2: ANALYSIS AND DESIGN


2.1 Introduction
2.2 Existing System
2.3 Proposed System
2.4 Software Requirements
2.5 Techniques Used
2.6 Hardware requirements
2.7 Installation procedure for EAST Text Detector

CHAPTER 3: IMPLEMENTATION
3.1 Code flow
3.2 Screenshots
3.3 Results

CHAPTER 4: CONCLUSIONS AND FUTURE SCOPE
4.1 Conclusions
4.2 Scope for future enhancement

REFERENCES

LIST OF FIGURES

Figure Number    Description

1.3.1    Effect of Blue Screen on Human Eye
2.3.1    Block diagram
2.5.1    Convolutional Neural Network
2.5.2    An example showing input and filter
2.5.3    Filter sliding over the input
2.5.4    Types of pooling
2.5.5    Working of EAST detector
2.6.1    Hardware block diagram
3.1.1    Flow of code

LIST OF SCREENSHOTS

SS Number    Description

3.3.1    Detection of No entry sign board
3.3.2    Detection of Stop sign board
3.3.3    Detection of NO SMOKING sign board
3.3.4    Detection of CAR WASH sign board

CHAPTER 1: INTRODUCTION
1.1 Introduction

The field of technology in the modern day has progressed to such an extent that every
complex problem has a well-defined solution. With this advancement, however, the adverse
effects on the human lifestyle have also increased. The repetitive usage of smart devices and the
gradual increase in radiation have an impact on the whole ecosystem.

Recent statistics show that many people are facing health issues due to excessive usage
of digital devices. The reasons that drive a person to use these devices are many, not least that
these smart devices are available in most places in the world. Technology helps us improve
efficiency and execute things faster, but its excessive usage is leading to several health
consequences.

One such consequence of the excessive usage of smart devices is the impact on the human
visual sensory system. People feel stressed after prolonged usage of mobile phones and
computers, and the blue light emitted from these devices causes gradual blindness in the long
run. Visual impairment is a major health challenge that modern-day people are facing due to
the usage of smart devices.

A person identified with this problem either needs the support of another person to help
with the daily chores and to understand the surroundings, or is restricted to one place. The person
often has no aid to assist with the outside world and no freedom to roam around.

It would be of great help to such a person if a specific portable tool were available to carry
along and understand the things around them. The current project considers a scenario in which a
person suffering from a visual impairment needs a tool to carry around and correlate the
information which cannot be seen. The objective of the current project is to deliver a fully
functional product wherein the person can carry this tool, capture an image, and get a voice assist
on what the picture is about.

1.2 Literature Survey

Assistive technology, understood as technology designed for individuals with some kind
of impairment, is a vital field evolving at a fast pace, as it emerges from many disciplines and is
largely driven by technology itself. Assistive technology for visually impaired and blind people is
concerned with the different technologies, devices, services and software applications that enable
them to overcome physical, social and accessibility barriers and lead independent and productive
lives. Loss of vision, the eye being an extremely vital sensory organ, affects the performance of
almost all daily living activities and instrumental activities and hence hampers an individual’s
quality of life. Therefore, a technology that facilitates ease, accessibility and safety, leading to an
improved quality of life, has a very relevant social impact. This has driven novel research across
various disciplines, from cognitive intelligence and neuroprosthetics to computer vision. Recently,
advances in computer vision, multisensory research and wearable technology have aided the
development of numerous assistive-technology solutions. The different contributions made to
assistive technology that helps blind or visually impaired people read text, the methodologies
adopted, and the related literature on text detection, text recognition and speech synthesis are
presented in this chapter.

Kwang In Kim and Keechul Jung, in “Texture-based approach for text detection in images
using support vector machines and continuously adaptive mean shift algorithm” [1], pose text
detection as a texture-classification problem and adopt a Support Vector Machine (SVM) as a
trainable classifier. The SVM analyzes the textural properties and classifies text and non-text
regions. The results obtained from the SVM classifier are not immediately ready for further post-
processing, such as OCR, due to the presence of noisy regions and spurious areas, so CAMSHIFT
is applied for text-chip generation, which sets a rectangular window for searching text. This work
is based on the traditional methods that preceded deep learning, and the proposed classifier
encounters problems when detecting very small text or text with low contrast.

Chucai Yi et al., in “Portable Camera-Based Assistive Text and Product Label Reading
From Hand-Held Objects for Blind Persons” [2], developed a system to spontaneously localize the
text-containing regions of the image, i.e. the region of interest. They proposed a unique text-
localization algorithm that trains gradient features of the stroke orientations and the edge-pixel
distributions in an AdaBoost model. The characters contained in the localized text regions are then
binarized and recognized using optical character recognition (OCR) software.

R. Lienhart and A. Wernicke, in “Localizing and segmenting text in images and
videos” [3], proposed a method to separate text regions from non-text regions using a complex
multi-layer feed-forward neural network trained to localize text at a fixed position and scale. The
output of the network at all scales and positions is then integrated into one text saliency map, which
serves as the starting point for candidate text regions.

Chucai Yi and YingLi Tian, in “Text string detection from natural scenes by structure
based partition and grouping” [4], demonstrate a framework for text detection based on image
partition using a gradient-based method, in which paths of pixel couples are connected, together
with a colour-uniformity method; adjacent character grouping is then performed when there are
multiple text regions, to group them appropriately.

Viswanath Sivakumar et al., in “Rosetta: Large Scale System for Text Detection and
Recognition in Images”, propose a Faster R-CNN model for text detection and a fully convolutional
model for text recognition which uses a sequence-to-sequence CTC loss.

S. Venkateswarlu et al., in “Text to Speech Conversion”, present a text-to-speech
conversion system based on the Raspberry Pi. OCR is used for text recognition, and Festival, an
open-source package, is used for speech synthesis. In our work, however, we installed a Python
module for this instead of separate software, which simplified the task and reduced the
implementation time.

Max Jaderberg et al., in “Synthetic Data and Artificial Neural Networks for Natural Scene
Text Recognition” (Computer Vision and Pattern Recognition, Cornell University arXiv),
demonstrate a framework that performs text recognition on the complete word image holistically,
unlike traditional character-based recognition systems. This was among the first work on deep
neural network models trained solely on data generated by a synthetic text-generation engine.

Zhanzhan Cheng et al., in “Focusing attention towards text recognition in natural images”
(IEEE Xplore), develop an attention-based text recognition model as an encoder-decoder
framework. In the encoding stage, an image is converted into a sequence of feature vectors by a
CNN or LSTM, and each feature vector corresponds to a region of the input image, called an
attention region. In the decoding stage, glimpse vectors are synthesized to achieve alignment
between the attention regions and the ground-truth labels. An RNN is then used to generate the
target characters from the glimpse vectors.

1.3 Problem Formulation

Fig1.3.1 Effect of Blue Screen on Human Eye

Technology is an astonishing thing that grabs everybody’s attention whenever a new
product reaches the market. Over the past decade there has been constant development in the
field of computers and smart gadgets, improving on older generations. Technology helps us stay
connected to our near ones, execute things faster, and increase human efficiency. On the other
hand, it also has a negative impact on the human lifestyle.

People spend an average of 10-15 hours a day using technology. Kids, meanwhile, are
getting increasingly addicted to smartphones due to multiplayer games and other addictive
applications. As a result, the usage of technology has a direct impact on human health,
especially on the visual sensory system.

Studies say that the blue light emitted from smart gadgets accelerates blindness. It has
become a common habit to use mobile phones at night in low light. Of all the frequencies, blue
light has the highest impact on the human eye. The University of Toledo reported that repeated
exposure of the eyes to digital devices can cause vital molecules in the retina to turn into cell
killers. Macular degeneration is a major cause of blindness in America and in many developing
countries.

Computer Vision Syndrome, also called Digital Eye Strain, describes vision-related problems
caused by excessive usage of digital devices. People experience vision problems with prolonged
usage of smart gadgets, and the intensity of the visual impairment depends on the period of
technology usage.

The intensity of the vision problems also depends on the body type and living habits of each
individual. Nevertheless, prolonged usage of these digital devices causes visual impairments
regardless of age, gender and state. Individuals who suffer from vision problems cannot keep up
with the real world without help in understanding their surroundings. Once they are identified
with the problem, they cannot go out into the fast-paced world and synchronize with their
surroundings.

The problem statement of the current project considers the sample space of all individuals
suffering from visual impairments who are in need of a tool that helps them identify, understand
and comprehend the things around them.

1.4 Objective of the Project

The objective of the current project is to apply concepts of machine learning, deep
learning, neural networks and image processing to convert the text captured in an image into
sound that aids the user as a voice assist, giving real-time feedback.

Existing products capture the text from an image, convert it into machine-level language,
and give a text output on a visual display screen. This is extended here with a voice aid that helps
the user understand the things around them.

The user can carry the product to any location, capture the image in front using a button,
and get a voice output explaining the scene. This acts as an aid to the visually impaired by assisting
them with a voice output driven by a speaker interfaced with the Raspberry Pi.

The physical enclosure of the product is not designed in the current project, owing to the
safety concerns of the users and the assembly of all the components that are used. The current
project deals only with the functionality of each component and demonstrates its usage in a
considered scenario.

The final prototype demonstrated after project completion comprises a setup wherein the
person can capture an image against any background and get a corresponding voice feedback.

1.5 Scope of the Project

The scope of the current project includes the assembly of three major components: the
Raspberry Pi, a camera and a speaker. All the components are interfaced with the Raspberry Pi
using Python, running on the Raspbian operating system.

Fig: Final Assembly of Components

OpenCV is installed on the Raspberry Pi to perform the image processing. OpenCV (Open
Source Computer Vision) is a set of libraries including programs that support real-time computer
vision. The OpenCV installation on the Raspberry Pi carries out the processing of the image
captured with the camera.

The camera used in the project is the Pi Camera, which is interfaced with the Raspberry Pi
using specific commands. The speaker can be any standard audio output device, such as
headphones or earphones, which lets the user listen to the voice output.

The capture of the image and the extraction of its text on the Raspberry Pi are done using
OCR technology. Optical Character Recognition (OCR) is a technique used to capture the text in
an image and present it in an understandable form.

The conversion of the extracted text to speech is achieved through the TTS technique.
Text-to-Speech is a process in which normal language text is converted into speech. It is performed
by a speech synthesizer, which in our case runs on the Raspberry Pi.

The assembly of the three components, along with the OpenCV software, makes the
prototype fully functional and helps the user understand the image in front of them.

CHAPTER 2: ANALYSIS AND DESIGN
2.1 Introduction

Technological advancements have been helping people with disabilities by making
their lives simpler, more independent and lively. Much of the credit goes to artificial intelligence
in combination with assistive technology and wearable technology.

2.1.1 Assistive technology

Assistive technology aids disabled people who face difficulties in performing daily tasks
by providing assistive or adaptive devices or software. These devices give greater independence
and improve the functionality of users by allowing them to perform activities which they were
previously unable to conduct, through various methods of interacting with technology. In general,
assistive technology enhances the functional capability of users suffering from mobility
impairments, cognitive impairments, visual impairments, eating impairments or hearing
impairments. Here, we present a device which helps visually impaired people by giving a speech
output from pictures taken of the vicinity.

2.1.2 Wearable technology

Wearable technology is a class of electronic devices which may be incorporated in clothes or
used as accessories. They may or may not be wireless. They are powered by microprocessors and
are capable of sending and receiving data through the internet. This technology uses sensors to
track the user and give signals. Wearable technology came into existence with the growth of
mobile networks, starting with smart watches and Bluetooth headsets which shared data via the
internet. In the recent past, wearable technology has focused not only on consumer accessories
that make everything easy and simple but also on more practical applications which help the
impaired become more independent.

2.2 Existing system

The existing technologies for visually impaired people include the Smart Cane, Sentri,
OrCam MyEye 2.0 and Aira.

The Smart Cane can adjust its level of object detection. A sensor lets the people in the
surroundings be informed, and GPS is used to navigate the visually impaired person through
vibration; audio feedback is also provided. It detects objects in real time using the sensor in the
cane and describes them, but it cannot read out text from a natural scene.

Sentri is a 360-degree headband which senses the distances around the user and uses
vibration motors for navigation: the greater the vibration, the nearer the user is to the obstacle.
There is no need to use a hand, but it gives no information about text.

OrCam MyEye 2.0 is among the best assistive devices for visually impaired people. It
performs facial recognition, product recognition, colour identification, reading of text printed on
any surface, and money identification, and simple gestures are used to get the required output. Its
only disadvantage is its cost, around $4,500.

Aira provides instant access to information by connecting the user to one of its trained
agents. It consists of smart glasses with a built-in camera which provides a 120-degree view. Its
high video quality enables the agents to read text efficiently without needing to capture an image.
It costs $29 per month for 30 minutes.

To overcome the difficulties of the technologies mentioned above and to make the solution
economical and feasible, we propose the system discussed in the next section, 2.3.

2.3 Proposed system

We propose a system which enables a visually impaired person to understand text on sign
boards, banners and hoardings. The system captures an image of the surroundings using the camera
attached to the Raspberry Pi; the image is processed internally and the speech output is given
through the speaker or earpiece connected to it.

There are two main blocks: the image processing block and the voice processing block. In the
image processing block, the image captured by the camera is converted to text. In the voice
processing block, the output of the previous block, i.e. the text extracted from the captured image,
is converted to speech. For the convenience of the user, the voice may be changed between a
masculine and a feminine voice.

Fig.2.3.1: Block diagram

A. Image processing block

● The image is acquired using a camera; the camera used here is the Pi Camera v2 described in
section 2.6, interfaced with the Raspberry Pi.
● Text is separated from the image after processing, which involves determination of the region
of interest, conversion of the image to grey scale and then to a binary image; the binary image
is broken by pitch and compared with the threshold values.
● Here we obtain a .txt file from a .png or .jpg file.

B. Voice processing block

● Now that the image file has been converted to a text file, it is further converted into speech
using a text-to-speech synthesizer (a minimal sketch is given after this list).
● There are two ways to do this: one is text-to-phoneme conversion, where the text is compared
with the words present in a dictionary to give the output; the other is a learning-based speech
synthesis approach.
● Deep-learning-based synthesizers use Deep Neural Networks trained on recorded speech data.
A few DNN-based speech synthesizers are achieving nearly the quality of the human voice.
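As an illustration of driving a speech synthesizer from Python, a minimal sketch using the pyttsx module that the Appendix code also relies on; the sample sentence and the speaking rate are illustrative assumptions.

# Minimal text-to-speech sketch with pyttsx, the module used in the
# Appendix code (the sample sentence and rate are illustrative).
import pyttsx

engine = pyttsx.init()              # initialise the speech engine
engine.setProperty('rate', 150)     # speaking rate in words per minute (assumed value)
text = "No entry"                   # text handed over by the image processing block
engine.say(text)                    # queue the utterance
engine.runAndWait()                 # block until the speech has finished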

2.4 Software Requirement

The software which are needed to carry out this project are

● Python
● OpenCV
● Tesseract OCR
● Raspbian

Python is a high-level programming language in which instructions are executed directly by an
interpreter. Python is preferred due to its simple, compatible, cost-free, object-oriented and easily
understandable nature.

OpenCV (Open Source Computer Vision) consists of a library of various programming
functions. It contains various artificial neural network implementations and optimized algorithms
and is generally used in the detection of objects, faces, etc. Here, the OpenCV EAST text detector
is used to detect text in natural scenes; the region of interest is acquired using the EAST text
detector.

Tesseract is one of the top OCR engines in terms of accuracy. Text is extracted from dark
text on a light background or light text on a dark background using Tesseract OCR; here dark text
means text in black and light text means text in white, and similarly a dark background means
black and a light background means white. The image is first converted to a grey image, and a
binary image is then obtained from the grey image. This binary image is broken into pieces based
on pitch. Fixed pitch, also called a monospaced font, in which every character and letter occupies
the same horizontal width, is broken first. For non-fixed pitch, where the horizontal width
occupied by the letters and the spaces between them is not the same, the widths are measured and
compared with the threshold values and the nearest-valued letter or character is taken. Thus the
letters are identified and words are formed, and these words are properly associated to form
phrases or sentences. A minimal usage sketch follows.
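A minimal sketch of invoking Tesseract from Python through the pytesseract module used in the Appendix; the file name and the binarisation threshold are illustrative assumptions.

# Minimal OCR sketch with OpenCV and pytesseract (the file name and
# threshold value are illustrative assumptions).
import cv2
import pytesseract

image = cv2.imread("sign.jpg")                          # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)          # colour image -> grey image
_, binary = cv2.threshold(gray, 127, 255,
                          cv2.THRESH_BINARY)            # grey image -> binary image
text = pytesseract.image_to_string(binary,
                                   config="-l eng --psm 7")  # treat the region as one line
print(text)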

Raspbian is a computer operating system for Raspberry Pi.

2.5 Techniques Used

Artificial intelligence is constantly bridging the gap between humans and machines. It helps a
machine see the world as a human does. This can be done with deep learning, where the output is
predicted for a given input. Here we use a Convolutional Neural Network (CNN), a deep learning
algorithm which takes an input image and assigns importance to different characteristics in the
image, which helps in differentiating one from another.

fig.2.5.1 Convolutional Neural Network

A series of convolution and pooling operations is performed, followed by fully connected
layers, to get the output for a given input.
The convolution layer is the main building block of a CNN. Convolution means merging two sets
of data: here, the filter, also called a kernel, is convolved with the input image to get a feature map.

Fig.2.5.2 An example showing input and filter fig.2.5.3 Filter sliding over the input

Consider fig.2.5.2, where the blue area is the input and the green area is the filter. The filter
slides over the input, matrix multiplication is done element-wise, and the sum gives the feature
map, as shown in fig.2.5.3. The example above is in 2D, but in general the image is taken as a 3D
matrix. Usually many filters are slid over the input, resulting in different feature maps which are
combined together to get a single output from the convolution layer. The stride tells us by how
many values the filter slides over the input; generally the stride value is 1, and a higher stride
value can be taken if less overlap is wanted. In fig.2.5.3 we can observe that the feature map size
is not the same as the input, so padding is used by bordering the input with zeros. The dimensions
of the input image and the feature map then match, and the possibility of the image shrinking is
eliminated. A small numeric sketch of this operation is given below.
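A small numeric sketch of the sliding-filter operation described above, written with NumPy only; the input and kernel values are made up for illustration.

# 2D "valid" convolution (strictly, cross-correlation, as used in CNNs)
# implemented with NumPy; the input and kernel values are illustrative.
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # element-wise product, then sum
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)
print(conv2d(image, kernel))   # 3x3 feature map for a 5x5 input and a 3x3 filter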

Pooling is used to reduce the dimensions: the height and width of the feature map are
reduced while the depth is kept the same. There are two types of pooling:
● max pooling
● average pooling

Fig.2.5.4 Types of pooling

Max pooling, which is the most commonly used type of pooling, takes the maximum value from
the pooling window. Average pooling takes all the values from a pooling window and computes
their average, as shown in fig.2.5.4. A short sketch of both is given below.
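A short NumPy sketch of both pooling types with a 2x2 window and stride 2; the feature-map values are made up for illustration.

# Max and average pooling with a 2x2 window and stride 2, NumPy only
# (the feature-map values are illustrative).
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 3],
                 [4, 9, 5, 1]], dtype=float)
print(pool2d(fmap, mode="max"))       # [[6. 4.] [9. 8.]]
print(pool2d(fmap, mode="average"))   # [[3.75 2.25] [5.5 4.25]]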

Fully connected layers are added after the convolution and pooling layers to complete the
CNN architecture. The output from the convolution and pooling layers is 3D, but the input to a
fully connected layer is 1D, so the output from the last pooling layer is flattened to make it 1D.
In fig.2.5.1 the hidden layers are the convolution and pooling layers, whereas the output layer is a
fully connected layer.

Text detection is a tough task because:
● the light may be so bright that the image saturates, or the light may not be sufficient;
● the surface carrying the text may be reflective, which makes it tough to capture the image due
to reflection and refraction;
● the resolution of the camera may be below a standard value;
● compared to a scanner, a camera has high sensor noise;
● the images may be blurred at times;
● the text may be at an angle, which makes it hard to detect;
● the text may be on non-planar objects, which makes it difficult to recognize because the
bounding boxes are irregular in shape.

We use the EAST deep learning text detector, which can localize the text even when it is
reflective, at an angle or blurred. EAST stands for Efficient and Accurate Scene Text detection
pipeline.

fig.2.5.5 Working of EAST detector

2.6 Hardware requirement

The components required for the hardware implementation are as follows

● Raspberry Pi
● Power supply
● Camera
● Speaker
● Mouse/ Push button
● HDMI cable

The Raspberry Pi is a small computer the size of a credit card. A mouse and keyboard can be
used to operate it when it is connected to a display. We have chosen the Raspberry Pi as it supports
Python, the language in which the code is written; it is also low-cost and portable. Here, we use
the Raspberry Pi 3 Model B+.

Specifications:

Raspberry PI 3B+:

● SOC - Broadcom BCM2837B0


● CPU - 1.4 GHz
● Memory - 1GB
● Networking - Ethernet, 2.4/5 GHz wireless
● Storage - MicroSD slot
● GPIO - 40 pin GPIO
● Power Source - 5V
● Ports: HDMI, 3.5mm audio-video jack, 4 x USB, Ethernet, Camera Interface, Display
Interface

Pi camera v2:

● Sony IMX219 Sensor.


● 8 MP camera capable of taking pictures of 3280 x 2464 pixels.
● Capturing video at 1080p 30fps, 720p 60fps and 640 x 480p 90fps resolutions.
● Supports the latest version of Raspbian OS.
● Supports Raspberry Pi 1, Pi 2 and Pi 3 and Models A, B and B+.
● Applications of Pi Camera: CCTV security, auto motion detection, time lapse photography.

Power supply:
● PSU Current Capacity: 2.5 A
● Total Peripheral Current Draw from USB: 1.2 A
● Active Current Consumption from Bare-Board: 500 mA

Speaker:

● Earphones can also be used as audio output.


● Standard Speaker can be used with audio amplifier.
● Bluetooth speaker can also be used for wireless audio output.

Display Output:

● 1.3/1.4a HDMI cable


● Transfer speed of up to 10.2 Gbps

Fig.2.6.1 Hardware block diagram

2.7 Installation procedure for EAST Text Detector

Before running the code in the command window, the following software is necessary for the
EAST Text Detector to run efficiently:
1. Python 2.7
2. OpenCV 3.4.x to 4.0.x
3. Tesseract OCR

As these are open-source packages they are free of cost and can be used on various
platforms such as Windows, macOS, Linux, etc.
The links for the download Pages are present below:
1. Python:
https://www.python.org/ftp/python/2.7.13/python-2.7.13.msi
2. OpenCV 4.0.1
https://opencv.org/opencv-4-0-0.html
3. Tesseract OCR
https://github.com/tdhintz/tesseract4win64

We need to add the installation paths to the environment variables for the installation to go
smoothly. To access the environment variables,
go to Control Panel >> System and Security >> System >> Advanced System Settings >>
Environment Variables,
select PATH >> EDIT,
then add the following paths using the ADD button:
1. C:\Python27\Scripts
2. C:\Python27\Lib\site-packages
3. C:\Program Files\Tesseract-OCR

After installing the above software and adding the paths, we move on to the next
step, i.e. the installation of the required Python modules. These modules contain the prewritten
code needed for an effective and efficient run time of the program:
1. imutils
2. numpy
3. matplotlib
4. pytesseract
5. tesseract
6. pywin32
7. pyttsx

To install these modules, use the command
>> pip install *module name*
eg. >> pip install imutils

After the module installation, proceed with the extraction of the downloaded OpenCV and then
follow the steps below:
1. Go to the opencv/build/python/2.7 folder.
2. Copy or move cv2.pyd to C:/Python27/Lib/site-packages.
3. Open the Python IDLE terminal and type the following line to verify the installation:
>>import cv2

Some files need to be moved into the Python27 folder, such as the code files and the images
folder from which images are imported and to which results are exported. After following the
procedure above, open the Command Prompt and type
cd C:\Python27
#This command changes the directory from the user folder to the Python folder
#to access the files present in it.
Then type
python *File name*.py --east frozen_east_text_detection.pb --image *image folder name* /
*name of the image*.*extension*

The code then runs and displays the detected text along with a speech output for each
individual word.
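For example, using the file names from the usage notes at the top of the Appendix listing:

python text_recognition.py --east frozen_east_text_detection.pb --image images/example_01.jpg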

CHAPTER 3: IMPLEMENTATION
3.1 Code Flow

The code goes through the various stages shown in the flowchart (fig.3.1.1); these represent
how the code operates at each level and the complexity involved in each stage.

In the first stage, the required modules are imported, such as cv2, imutils, pytesseract,
pyttsx, etc.

In the second stage, the numbers of rows and columns are obtained from the scores volume
and the set of bounding-box rectangles and confidence scores is initialized. The rows and
columns are then looped over to derive potential bounding-box coordinates that surround the text,
and the rotation angle of each prediction is computed along with the dimensions of the bounding
box, i.e. its width and height. These are returned as a tuple of rects and confidences: the rects
values, derived from the geometry, are in a compact form, and the confidence values in the list
correspond to the rectangles in rects.

Moving on to the third stage, the following command line arguments are required:
-i --image            The path to the input image.
-east --east          The path to the pre-trained EAST text detector.
-c --min-confidence   The minimum probability required for the EAST detector to accept a
                      detected text region.
-w --width            The resized image width; our detector requires multiples of 32.
-e --height           Same as the width, but for the height; again, the detector requires a
                      multiple of 32 for the resized height.
-p --padding          The (optional) amount of padding to add to each ROI border; try 5% or
                      10% (and so on) if you find that your OCR result is incorrect.

In the fourth stage, the two output layer names of the neural network are defined, namely the
output-probabilities layer and the box-coordinates layer.

In the fifth, sixth and seventh stages, a BLOB (Binary Large OBject) is created from the
image and passed through the neural network to obtain the scores and geometry. The obtained
scores and geometry are decoded using the decode_predictions function, and then NMS, known as
Non-Maximum Suppression, is applied through the imutils package, which effectively keeps the
most likely text regions and eliminates the other, overlapping regions. The core of these stages is
sketched below.
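The core of these stages, condensed from the full listing in the Appendix; the model file and image path are assumed to be present in the working directory.

# Condensed sketch of stages five to seven, following the Appendix listing
# (the model file and image path are assumed to exist).
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")       # pre-trained EAST model
image = cv2.imread("images/example_01.jpg")                   # illustrative input image
image = cv2.resize(image, (320, 320))                         # dimensions are multiples of 32

# Binary Large OBject: mean-subtracted, channel-swapped 4D tensor
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
layerNames = ["feature_fusion/Conv_7/Sigmoid",                # text probabilities
              "feature_fusion/concat_3"]                      # box geometry
(scores, geometry) = net.forward(layerNames)
print(scores.shape, geometry.shape)
# decode_predictions() and non_max_suppression(), both shown in the
# Appendix, then turn scores and geometry into the final text boxes.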

Moving on to the eighth and ninth stages, the results list is initialized to contain our OCR
bounding boxes and text; then we begin looping over the boxes, where we:
● scale the bounding boxes based on the previously computed ratios,
● pad the bounding boxes (commonly known as padding), and
● finally, extract the padded ROI (Region of Interest).

fig.3.1.1 Flow of code

As we proceed through the remaining stages up to the final stage, we use Tesseract OCR to
obtain the OCR’d text. In order to apply Tesseract v4 to OCR the text we must supply a language,
an OEM flag of 1, indicating that we wish to use the LSTM neural net model for OCR, and finally
a PSM value, in this case 7, which implies that we are treating the ROI as a single line of text; the
OCR’d text is then appended to the list of results.

From there, looping over the results, we:

● print the OCR’d text to the command window,
● strip out the non-ASCII characters from the text, as OpenCV’s drawing functions do not
support non-ASCII characters,
● draw a bounding box surrounding the ROI and the text result above the ROI,
● convert the OCR’d text to speech by loading the pyttsx module, and
● display the output and wait for any key to be pressed.

Thus the code flows stage by stage as shown in the figure above and completes after the user
presses any key.

3.2 SCREENSHOTS

SS.3.3.1. Detection of No entry sign board

SS.3.3.2. Detection of Stop sign board

SS.3.3.3. Detection of NO SMOKING sign board

SS.3.3.4. Detection of CAR WASH sign board

3.3 Results

● The image is captured from the camera successfully


● The region of interest is found from the image acquired
● The image file is converted into a text file
● Padding is done depending on the requirement
● Finally, the text obtained is given out as speech output
● Detection of text in real time is also achieved (see the second Appendix listing)

References

[1] Kwang In Kim and Keechul Jung, “Texture-based approach for text detection in images using
support vector machines and continuously adaptive mean shift algorithm,” IEEE Transactions on
Pattern Analysis and Machine Intelligence.

[2] Chucai Yi et al., “Portable Camera-Based Assistive Text and Product Label Reading From
Hand-Held Objects for Blind Persons,” IEEE/ASME Transactions on Mechatronics, June 2014.

[3] R. Lienhart and A. Wernicke, “Localizing and segmenting text in images and videos,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, pp. 256-268, 2002.

[4] Chucai Yi and YingLi Tian, “Text string detection from natural scenes by structure-based
partition and grouping,” IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2594-2605,
Sep. 2011.

Appendix:
1. Code for Text Recognition

# USAGE
# python text_recognition.py --east frozen_east_text_detection.pb --image images/example_01.jpg
# python text_recognition.py --east frozen_east_text_detection.pb --image images/example_04.jpg --padding 0.05

# import the necessary packages


from imutils.object_detection import non_max_suppression
import numpy as np
import pytesseract
import argparse
import cv2
import pyttsx
import win32com.client

k = pyttsx.init()

def decode_predictions(scores, geometry):


# grab the number of rows and columns from the scores volume, then
# initialize our set of bounding box rectangles and corresponding
# confidence scores
(numRows, numCols) = scores.shape[2:4]
rects = []
confidences = []

# loop over the number of rows


for y in range(0, numRows):
# extract the scores (probabilities), followed by the
# geometrical data used to derive potential bounding box
# coordinates that surround text
scoresData = scores[0, 0, y]
xData0 = geometry[0, 0, y]
xData1 = geometry[0, 1, y]
xData2 = geometry[0, 2, y]
xData3 = geometry[0, 3, y]
anglesData = geometry[0, 4, y]

# loop over the number of columns
for x in range(0, numCols):
# if our score does not have sufficient probability,
# ignore it
if scoresData[x] < args["min_confidence"]:
continue

# compute the offset factor as our resulting feature


# maps will be 4x smaller than the input image
(offsetX, offsetY) = (x * 4.0, y * 4.0)

# extract the rotation angle for the prediction and


# then compute the sin and cosine
angle = anglesData[x]
cos = np.cos(angle)
sin = np.sin(angle)

# use the geometry volume to derive the width and height


# of the bounding box
h = xData0[x] + xData2[x]
w = xData1[x] + xData3[x]

# compute both the starting and ending (x, y)-coordinates


# for the text prediction bounding box
endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
startX = int(endX - w)
startY = int(endY - h)

# add the bounding box coordinates and probability score


# to our respective lists
rects.append((startX, startY, endX, endY))
confidences.append(scoresData[x])

# return a tuple of the bounding boxes and associated confidences


return (rects, confidences)

# construct the argument parser and parse the arguments


ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", type=str,

help="path to input image")
ap.add_argument("-east", "--east", type=str,
help="path to input EAST text detector")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
help="minimum probability required to inspect a region")
ap.add_argument("-w", "--width", type=int, default=320,
help="nearest multiple of 32 for resized width")
ap.add_argument("-e", "--height", type=int, default=320,
help="nearest multiple of 32 for resized height")
ap.add_argument("-p", "--padding", type=float, default=0.0,
help="amount of padding to add to each border of ROI")
args = vars(ap.parse_args())

# load the input image and grab the image dimensions


image = cv2.imread(args["image"])
orig = image.copy()
(origH, origW) = image.shape[:2]

# set the new width and height and then determine the ratio in change
# for both the width and height
(newW, newH) = (args["width"], args["height"])
rW = origW / float(newW)
rH = origH / float(newH)

# resize the image and grab the new image dimensions


image = cv2.resize(image, (newW, newH))
(H, W) = image.shape[:2]

# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
"feature_fusion/Conv_7/Sigmoid",
"feature_fusion/concat_3"]

# load the pre-trained EAST text detector


print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# construct a blob from the image and then perform a forward pass of

# the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(image, 1.0, (W, H),
(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)

# decode the predictions, then apply non-maxima suppression to


# suppress weak, overlapping bounding boxes
(rects, confidences) = decode_predictions(scores, geometry)
boxes = non_max_suppression(np.array(rects), probs=confidences)

# initialize the list of results


results = []

# loop over the bounding boxes


for (startX, startY, endX, endY) in boxes:
# scale the bounding box coordinates based on the respective
# ratios
startX = int(startX * rW)
startY = int(startY * rH)
endX = int(endX * rW)
endY = int(endY * rH)

# in order to obtain a better OCR of the text we can potentially


# apply a bit of padding surrounding the bounding box -- here we
# are computing the deltas in both the x and y directions
dX = int((endX - startX) * args["padding"])
dY = int((endY - startY) * args["padding"])

# apply padding to each side of the bounding box, respectively


startX = max(0, startX - dX)
startY = max(0, startY - dY)
endX = min(origW, endX + (dX * 2))
endY = min(origH, endY + (dY * 2))

# extract the actual padded ROI


roi = orig[startY:endY, startX:endX]

# in order to apply Tesseract v4 to OCR text we must supply


# (1) a language, (2) an OEM flag of 1, indicating that we
# wish to use the LSTM neural net model for OCR, and finally
# (3) a PSM value, in this case 7, which implies that we are
# treating the ROI as a single line of text
config = ("-l eng --oem 1 --psm 7")
text = pytesseract.image_to_string(roi, config=config)

# add the bounding box coordinates and OCR'd text to the list
# of results
results.append(((startX, startY, endX, endY), text))

# sort the results bounding box coordinates from top to bottom


results = sorted(results, key=lambda r:r[0][1])

# loop over the results


for ((startX, startY, endX, endY), text) in results:
# display the text OCR'd by Tesseract
print("OCR TEXT")
print("========")
print("{}\n".format(text))

# strip out non-ASCII text so we can draw the text on the image
# using OpenCV, then draw the text and a bounding box surrounding
# the text region of the input image
text = "".join([c if ord(c) < 128 else "" for c in text]).strip()
output = orig.copy()
cv2.rectangle(output, (startX, startY), (endX, endY),
(0, 0, 255), 2)
cv2.putText(output, text, (startX, startY - 20),
cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 0, 255), 3)

# show the output image


cv2.imshow("Text Detection", output)
k.say(format(text))
k.runAndWait()
cv2.waitKey(0)

2. Code for live tracking of text in video

# USAGE
# python text_detection_video.py --east frozen_east_text_detection.pb

# import the necessary packages


from imutils.video import VideoStream
from imutils.video import FPS
from imutils.object_detection import non_max_suppression
import numpy as np
import argparse
import imutils
import time
import cv2

def decode_predictions(scores, geometry):


# grab the number of rows and columns from the scores volume, then
# initialize our set of bounding box rectangles and corresponding
# confidence scores
(numRows, numCols) = scores.shape[2:4]
rects = []
confidences = []

# loop over the number of rows


for y in range(0, numRows):
# extract the scores (probabilities), followed by the
# geometrical data used to derive potential bounding box
# coordinates that surround text
scoresData = scores[0, 0, y]
xData0 = geometry[0, 0, y]
xData1 = geometry[0, 1, y]
xData2 = geometry[0, 2, y]
xData3 = geometry[0, 3, y]
anglesData = geometry[0, 4, y]

# loop over the number of columns


for x in range(0, numCols):
# if our score does not have sufficient probability,
# ignore it
if scoresData[x] < args["min_confidence"]:

continue

# compute the offset factor as our resulting feature


# maps will be 4x smaller than the input image
(offsetX, offsetY) = (x * 4.0, y * 4.0)

# extract the rotation angle for the prediction and


# then compute the sin and cosine
angle = anglesData[x]
cos = np.cos(angle)
sin = np.sin(angle)

# use the geometry volume to derive the width and height


# of the bounding box
h = xData0[x] + xData2[x]
w = xData1[x] + xData3[x]

# compute both the starting and ending (x, y)-coordinates


# for the text prediction bounding box
endX = int(offsetX + (cos * xData1[x]) + (sin * xData2[x]))
endY = int(offsetY - (sin * xData1[x]) + (cos * xData2[x]))
startX = int(endX - w)
startY = int(endY - h)

# add the bounding box coordinates and probability score


# to our respective lists
rects.append((startX, startY, endX, endY))
confidences.append(scoresData[x])

# return a tuple of the bounding boxes and associated confidences


return (rects, confidences)

# construct the argument parser and parse the arguments


ap = argparse.ArgumentParser()
ap.add_argument("-east", "--east", type=str, required=True,
help="path to input EAST text detector")
ap.add_argument("-v", "--video", type=str,
help="path to optinal input video file")
ap.add_argument("-c", "--min-confidence", type=float, default=0.5,
help="minimum probability required to inspect a region")

ap.add_argument("-w", "--width", type=int, default=320,
help="resized image width (should be multiple of 32)")
ap.add_argument("-e", "--height", type=int, default=320,
help="resized image height (should be multiple of 32)")
args = vars(ap.parse_args())

# initialize the original frame dimensions, new frame dimensions,


# and ratio between the dimensions
(W, H) = (None, None)
(newW, newH) = (args["width"], args["height"])
(rW, rH) = (None, None)

# define the two output layer names for the EAST detector model that
# we are interested -- the first is the output probabilities and the
# second can be used to derive the bounding box coordinates of text
layerNames = [
"feature_fusion/Conv_7/Sigmoid",
"feature_fusion/concat_3"]

# load the pre-trained EAST text detector


print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# if a video path was not supplied, grab the reference to the web cam
if not args.get("video", False):
print("[INFO] starting video stream...")
vs = VideoStream(src=0).start()
time.sleep(1.0)

# otherwise, grab a reference to the video file


else:
vs = cv2.VideoCapture(args["video"])

# start the FPS throughput estimator


fps = FPS().start()

# loop over frames from the video stream


while True:
# grab the current frame, then handle if we are using a
# VideoStream or VideoCapture object

frame = vs.read()
frame = frame[1] if args.get("video", False) else frame

# check to see if we have reached the end of the stream


if frame is None:
break

# resize the frame, maintaining the aspect ratio


frame = imutils.resize(frame, width=1000)
orig = frame.copy()

# if our frame dimensions are None, we still need to compute the


# ratio of old frame dimensions to new frame dimensions
if W is None or H is None:
(H, W) = frame.shape[:2]
rW = W / float(newW)
rH = H / float(newH)

# resize the frame, this time ignoring aspect ratio


frame = cv2.resize(frame, (newW, newH))

# construct a blob from the frame and then perform a forward pass
# of the model to obtain the two output layer sets
blob = cv2.dnn.blobFromImage(frame, 1.0, (newW, newH),
(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(layerNames)

# decode the predictions, then apply non-maxima suppression to


# suppress weak, overlapping bounding boxes
(rects, confidences) = decode_predictions(scores, geometry)
boxes = non_max_suppression(np.array(rects), probs=confidences)

# loop over the bounding boxes


for (startX, startY, endX, endY) in boxes:
# scale the bounding box coordinates based on the respective
# ratios
startX = int(startX * rW)
startY = int(startY * rH)
endX = int(endX * rW)

endY = int(endY * rH)

# draw the bounding box on the frame


cv2.rectangle(orig, (startX, startY), (endX, endY), (0, 255, 0), 2)

# update the FPS counter


fps.update()

# show the output frame


cv2.imshow("Text Detection", orig)
key = cv2.waitKey(1) & 0xFF

# if the `q` key was pressed, break from the loop


if key == ord("q"):
break

# stop the timer and display FPS information


fps.stop()
print("[INFO] elasped time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# if we are using a webcam, release the pointer


if not args.get("video", False):
vs.stop()

# otherwise, release the file pointer


else:
vs.release()

# close all windows


cv2.destroyAllWindows()
