1 List of Publications
Publication Name Year
Name
List of Abbreviations
GUI Graphical User Interface
RGB Red, Green, and Blue
SVM Support Vector Machine
HOG Histogram of Oriented Gradients
HMM Hidden Markov Model
Department of Computer Engineering
SYNOPSIS
3. Action Recognition
4. Motion Shape
5. Computer Vision
Relevant Objectives:
Motivation:
Hypothesis:
Fig. 1: System Architecture
Fig. 1 shows a detailed overview of the proposed system. The system takes a video as input, and a video is a collection of frames. We collect frames from the video and later use them for processing. Each frame is analyzed, and the action is identified based on that frame. For preprocessing, we remove blur from the image to improve accuracy. Feature extraction is done using the HOG algorithm, and the support vector machine is trained on the features collected by HOG. At run time, a web camera captures an image, HOG features are extracted from it, and those features are used to test the support vector machine. Based on its training, the support vector machine predicts the class label.
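The HOG-plus-SVM pipeline described above can be sketched as follows. This is a minimal illustration with synthetic stripe images standing in for gesture frames, not the project's actual code; the simplified HOG below also omits the block normalization used by the full algorithm.

```python
import numpy as np
from sklearn.svm import SVC

def hog_features(img, cell=8, bins=9):
    """Simplified HOG: per-cell orientation histograms weighted by
    gradient magnitude (block normalization omitted for brevity)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180       # unsigned orientation
    feats = []
    for i in range(0, img.shape[0] - cell + 1, cell):
        for j in range(0, img.shape[1] - cell + 1, cell):
            hist, _ = np.histogram(ang[i:i+cell, j:j+cell],
                                   bins=bins, range=(0, 180),
                                   weights=mag[i:i+cell, j:j+cell])
            feats.append(hist)
    v = np.concatenate(feats)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# synthetic "gestures": a vertical stripe (class 1) vs a horizontal one (class 0)
rng = np.random.default_rng(0)
def stripe(vertical):
    img = rng.normal(0.0, 0.05, (32, 32))
    if vertical:
        img[:, 12:20] += 1.0
    else:
        img[12:20, :] += 1.0
    return img

X = [hog_features(stripe(v)) for v in [True] * 20 + [False] * 20]
y = [1] * 20 + [0] * 20
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)   # RBF kernel, as in the text
pred = int(clf.predict([hog_features(stripe(True))])[0])
```

With 32x32 images and 8-pixel cells this yields 16 cells of 9 bins each, i.e. a 144-dimensional feature vector per frame.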
Name of at least two journals where papers (Sem-I and Sem-II) can be published:
In paper [3], silhouette extraction is used to easily recognize the background, which is then removed using a depth image. A motion history image is used to identify motion; it analyzes n consecutive frames. The system collects a feature vector, trains a support vector machine on it, and predicts the action.
In paper [4], human action recognition is done using data gloves and various other hardware. The HNN algorithm is used to extract features from the image, but this system has greater limitations under a changing environment.
According to Naidoo et al. [11], these systems are restricted to the use of refined and typically costly devices. Moreover, applications that use glove-based analysis encounter a wide range of problems, including reliability, accuracy, and electromagnetic noise. However, as Fudickar and Nurzynska [12] point out, a less researched yet very promising domain for further development is systems that use only a video camera for image capture (without any further markers). Most real-time approaches of this kind supported sign communication by recording, transferring, and presenting video streams. However, the need for a high-speed network connection, due to the large amount of data being sent, was the main barrier.
Plan of Dissertation Execution:
MONTH   WORK
0-2     Problem identification, problem analysis
2-3     Literature survey
3-6     Formulation of objectives and system architecture
6-8     Methodologies, data processing and analysis
8-10    Pre-test, results and discussion
10-12   Conclusion
Problem Statement:
The proposed system aims to generate a proper text description from real-time gestures of deaf people.
Solving Approach:
Unlike American Sign Language or British Sign Language, Indian Sign Language does not have a standard database available for use. We have therefore created our own video database and propose a system that recognizes sign-language gestures from a video stream of the signer. The proposed system consists of:
1. Hand Detection:
The system sets a specific ROI and then checks whether the given ROI contains a hand. If a hand is recognized, we locate the actual hand and crop that region, because the features are finally extracted from the hand itself instead of from the whole frame. This improves the accuracy of the system.
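The ROI check and crop can be sketched like this. It is a deliberately simplified illustration: a brightness threshold stands in for real hand detection (a real system would use skin-color or learned cues), and all names are made up for the example.

```python
import numpy as np

def crop_hand(frame, roi, thresh=0.5):
    """Crop the hand inside a fixed ROI by simple intensity thresholding.
    frame: 2-D grayscale array with values in [0, 1]; roi: (y0, y1, x0, x1).
    Returns the tight crop around the detected region, or None if the ROI
    contains no hand."""
    y0, y1, x0, x1 = roi
    patch = frame[y0:y1, x0:x1]
    mask = patch > thresh
    if not mask.any():
        return None                      # ROI contains no hand
    ys, xs = np.nonzero(mask)
    return patch[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

frame = np.zeros((120, 160))
frame[40:80, 60:100] = 1.0               # synthetic bright "hand"
hand = crop_hand(frame, (30, 90, 50, 110))
```

Cropping to the hand before feature extraction keeps background pixels out of the feature vector, which is exactly why the text claims an accuracy improvement.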
2) Preprocessing
This section covers preprocessing of the video, such as frame extraction, noise and blur elimination, and edge detection. Video holds a huge amount of data at different levels in terms of scenes, shots, and surroundings. Thus, to process a video, frames are first extracted from it; these frames are simply images that are used for further processing.
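Sampling frames at fixed intervals reduces the video to a manageable set of images; the index arithmetic can be expressed as below. The helper is illustrative (in practice a library such as OpenCV would read the actual frames).

```python
def frame_indices(total_frames, fps, interval_s=1.0):
    """Indices of the frames to keep when sampling a video every
    interval_s seconds (e.g. one frame per second)."""
    step = max(1, int(round(fps * interval_s)))
    return list(range(0, total_frames, step))

# a 4-second clip at 30 fps, sampled once per second
picked = frame_indices(120, 30)
```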
3) Feature Extraction
We extract features from the hand, using techniques such as the histogram. A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. To construct a histogram, the first step is to bin the range of values, that is, to divide the entire range into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable; they must be adjacent and are usually of equal size.
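A magnitude-weighted orientation histogram of the kind used by HOG can be built directly with NumPy's histogram function; the angles and magnitudes below are synthetic examples, not data from the system.

```python
import numpy as np

# five gradient orientations (degrees) with their gradient magnitudes
angles = np.array([5.0, 15.0, 95.0, 100.0, 170.0])
magnitudes = np.array([1.0, 2.0, 1.5, 0.5, 1.0])

# nine consecutive, non-overlapping, equal-width bins covering 0-180 degrees;
# each value's vote is weighted by its gradient magnitude
hist, edges = np.histogram(angles, bins=9, range=(0, 180), weights=magnitudes)
```

Each bin spans 20 degrees, so 5 and 15 fall in the first bin (total weight 3.0), 95 in the fifth, 100 in the sixth, and 170 in the last.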
4) Classification
Outcomes:
The English text description is generated efficiently from hand gesture recognition by SVM classification. The process uses gesture recognition, noise detection, and feature extraction methods to provide the output for the visual content.
Mathematical model:
Algorithm
1) Compute the score of the input vector: f(x) = sum_i (alpha_i * y_i * K(x_i, x)) + b.
2) Kernel function: radial basis function (RBF), K(x_i, x) = exp(-gamma * ||x_i - x||^2).
3) Class y = -1 when the output of the scoring function is negative.
4) Class y = +1 when the output of the scoring function is positive.
Parameters
x_i: i-th input vector
y_i: i-th class label
alpha_i: coefficient associated with the i-th training sample
b: scalar bias
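The scoring rule above translates directly into code. The sketch below uses a tiny hand-made one-dimensional example; the support vectors, coefficients, and bias are illustrative rather than learned.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_score(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Decision score f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def svm_predict(x, support_vectors, alphas, labels, b, gamma=0.5):
    """Class +1 when the score is positive, -1 otherwise."""
    return 1 if svm_score(x, support_vectors, alphas, labels, b, gamma) > 0 else -1

# two support vectors on the real line, one per class
svs = [np.array([0.0]), np.array([2.0])]
alphas, labels, bias = [1.0, 1.0], [-1, 1], 0.0
```

A point near the positive support vector gets a positive score (its kernel similarity to that vector dominates), so it is assigned class +1, exactly as steps 3 and 4 state.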
Let S be the whole system, consisting of
S = {LV, TD, TE, OP, φ}
where,
LV: Live Video
TD: Trained Dataset
TE: Text in English
OP: Output
φ: null/empty set
1. LV = {VF, φ}
where,
VF: Video Frames
2. TD = {DV, DE, φ}
where,
DV: Database of videos
DE: Database of English text
3. TE = {EA, EW, ES, φ}
where,
EA: English text Alphabets
EW: English text Words
ES: English text Sentences
References:
[2] N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, K. Saenko, and S. Guadarrama, "Generating Natural-Language Video Descriptions Using Text-Mined Knowledge," 2013.
[3] X. Sun, M. Chen, and A. Hauptmann, "Action Recognition via Local Descriptors and Holistic Features," in IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 58-65.
[4] A.-P. Ta, C. Wolf, G. Lavoue, A. Baskurt, and J. M. Jolion, "Pairwise Features for Human Action Recognition," in International Conference on Pattern Recognition, Istanbul, Turkey, 2010, pp. 3224-3227.
[5] Ayushi Gahlot, Purvi Agarwal, and Akshya Agarwal, "Skeleton based Human Action Recognition using Kinect," Recent Trends in Future Prospective in Engineering & Management Technology, 2016.
[6] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST) 2(3):27, 2011.
[7] M. De Marneffe, B. MacCartney, and C. Manning, "Generating typed dependency parses from phrase structure parses," in Proceedings of the International Conference on Language Resources and Evaluation (LREC), volume 6, pp. 449-454, 2006.
[8] D. Ding, F. Metze, S. Rawat, P. Schulam, S. Burger, E. Younessian, L. Bao, M. Christel, and A. Hauptmann, "Beyond audio and video retrieval: towards multimedia summarization," in Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, 2012.
[9] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in European Conference on Computer Vision (ECCV), pp. 15-29, 2010.
[10] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2008.
[11] S. Naidoo, C. W. Omlin, and M. Glaser, "Vision-Based Static Hand Gesture Recognition Using Support Vector Machines," in Proceedings of the Southern Africa Telecommunication Networks and Applications Conference, 2002.
[12] S. Fudickar and K. Nurzynska, "A User-Friendly Sign Language Chat," in Proceedings of the Conference ICL2007, Villach, Austria, 26-28 September 2007.
[13] I. Laptev and P. Perez, "Retrieving actions in movies," in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV), pp. 1-8, 2007.
[14] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2008.
[15] M. Lee, A. Hakeem, N. Haering, and S. Zhu, "SAVE: A framework for semantic annotation of visual events," in IEEE Computer Vision and Pattern Recognition Workshops (CVPR-W), pp. 1-8, 2008.
[16] S. Li, G. Kulkarni, T. Berg, A. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proceedings of the Fifteenth Conference on Computational Natural Language Learning (CoNLL), pp. 220-228, Association for Computational Linguistics (ACL), 2011.
Name of the student with sign Name of the guide with sign
Chapter 2
TECHNICAL KEYWORDS
Image Processing
To recognize sign-language gestures from a video stream, powerful image processing techniques are used, such as frame-differencing-based tracking, edge detection, wavelet transforms, and image fusion techniques to segment the shapes in our videos.
Machine Learning
Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on making predictions through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory, and application domains to the field. Machine learning is sometimes conflated with data mining, where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can also be unsupervised and be used to learn and establish baseline behavioral profiles for various entities.
Gesture Recognition
Gesture recognition is the most widely used modality among the communication modalities in human-computer interaction. Gesture recognition is a topic in computer science and language technology with the aim of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state, but they normally originate from the face or hand.
Video Processing
Video processing is a particular case of signal processing, which frequently employs video filters and where the input and output are video files or video streams. It includes processes such as scaling the video to reduce data and thereby save processing time, segmenting objects from a video sequence, etc.
Chapter 3
INTRODUCTION
based matching [5], template matching [6], and statistical modelling [7] algorithms are applied to categorize the input activity into its appropriate class.
Hierarchical approaches work on the concept of divide and conquer, in which any complex problem can be solved by dividing it into several sub-problems. The sub-activities are used to identify the main complex activity. These approaches are classified on the basis of the recognition methodologies they use: the statistical approach, the syntactic approach, and the description-based approach. Statistical approaches construct statistical state-based models, like the layered Hidden Markov Model (HMM), to represent and recognize high-level human activities [8]. Similarly, syntactic approaches use a grammar syntax, such as a stochastic context-free grammar (SCFG), to model sequential activities [9]. Description-based approaches represent human activities by describing the sub-events of the activities and their temporal, spatial, and logical structures [10]. Fig. 1.1 summarizes the hierarchical taxonomy of the approaches used in human activity recognition.
Ke et al. [11] used segmented spatio-temporal volumes to model human activities. Their system applied a hierarchical mean-shift to clusters of similarly colored voxels to obtain several segmented volumes. The motivation is to find the actor volume segments automatically and to measure their similarity to the action model. Recognition is done by searching for a subset of over-segmented spatio-temporal volumes that best matches the shape of the action model. Their system recognized simple actions, such as hand waving and boxing, from the KTH action database.
Laptev [12] recognized human actions by extracting sparse spatio-temporal interest points from videos. They extended the local feature detectors (Harris) commonly used for object recognition in order to detect interest points in the space-time volume. Motion patterns such as a change in the direction of an object, the splitting and merging of an image structure, and the collision/bouncing of objects are detected as a result. In their work, these features were used to distinguish a walking person from complex backgrounds.
Bobick and Davis [13] constructed a real-time action recognition system using template matching. Instead of maintaining the 3-dimensional space-time volume of each action, they represented each action with a template composed of two 2-dimensional images: a binary Motion Energy Image (MEI) and a scalar Motion History Image (MHI). The two images are constructed from a sequence of foreground images, which essentially are weighted 2-D projections of the original 3-D (XYT) volume. By applying a traditional template matching technique to a pair of (MEI, MHI), their system was able to recognize simple actions like sitting, arm waving, and crouching. However, the MHI method suffers from a serious drawback: when self-occluding or overwriting actions are encountered, it leads to severe recognition failure [14]. This failure results because, when a repetitive action is performed, the same pixel location is accessed multiple times, due to which the information previously stored in the pixel gets overwritten or deleted by the current action. In order to address this issue, we implemented a novel technique for creating motion history images that are capable of representing self-occluding and overwriting actions. Our methodology overcomes the limitation of representing repetitive activities and thus outshines the conventional MHI method.
3.3 Motivation
Design objectives and goals are:
Chapter 4
4.1.4 Approaches for Solving the Problem and Efficiency Issues
The captured image contains extra noise, so the accuracy will vary. We use a median filter to remove noise from the captured image. The system works in real time, so we use fast techniques to reduce the computational time.
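Median filtering replaces each pixel with the median of its neighborhood, which removes salt-and-pepper noise without blurring edges. The sketch below is a plain NumPy illustration; real code would typically call cv2.medianBlur instead.

```python
import numpy as np

def median_filter(img, k=3):
    """Median filter with a k x k window; edge pixels are handled by
    replicating the border ("edge" padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(padded[i:i+k, j:j+k])
    return out

noisy = np.ones((5, 5))
noisy[2, 2] = 100.0                      # a single salt-noise pixel
clean = median_filter(noisy)
```

The isolated outlier disappears because eight of the nine values in its window are 1.0, so the median is 1.0.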
4.1.5 Outcome
The outcome of this project is an application that deaf people can use to communicate.
Chapter 5
DISSERTATION PLAN
SOFTWARE REQUIREMENT AND SPECIFICATION
Project Scope
The scope of the project is to provide a platform for mute individuals to share their views with everyone: analyzing human motion from images and video, and developing an application, like Facebook, through which mute people can communicate with each other.
Figure 6.1: Block Diagram
A Methodology for Sign Language Video Translation into Textual Version in Hindi
6.3 Package Diagram
A package diagram is a UML structure diagram which shows packages and the dependencies between them. Model diagrams allow different views of a system to be shown, for example as a multi-layered (aka multi-tiered) application model. The package diagram in fig. 6.2 gives an overall representation of the layers used in the implementation. The user communicates with the system through the provided GUI and presentation logic. The business layer contains all the knowledge required from a business point of view, and it holds all the business entities, such as action identification. Afterwards, the data access layer is used to store the feature information collected in the business layer.
Figure 6.3: Deployment Diagram
(defined during architectural design) at an abstraction level closer to the actual code. In addition, it specifies an interface that may be used to access the functionality of all the software components.
Interaction Diagram
A collaboration diagram, also called a communication diagram or interaction diagram, is an illustration of the relationships and interactions among software objects in the Unified Modeling Language (UML). In this application, User and System are the two objects; the user interacts with the system by giving an action video as input to perform operations such as frame generation, as illustrated in fig. 6.4.
1. Human Interface
This application can be used by any user; through it, normal people can communicate with deaf people, and vice versa.
Chapter 7
7.1 Introduction
In developing any system, the design phase plays a vital role. The system design gives the central theme of the system that is going to be developed, so the system design and its documentation are very important for the developer to start the work.
7.2 Overview of System
A video is a collection of frames. We collect frames from the video and later use them for processing. Each frame is analyzed, and the action is identified based on that frame. For preprocessing, we remove blur from the image to improve the result. Feature extraction is done using the HOG algorithm, and the support vector machine is trained on the features collected by HOG. At run time, a web camera captures an image, HOG features are extracted from it, and those features are used to test the support vector machine. Based on its training, the support vector machine predicts the class label as output.
Fig. 7.1: Overview of system
The overall system performs preprocessing, feature extraction, classification, and hand gesture detection, and finally generates the text description. The proposed system consists of two major phases: training and testing.
1. Training Phase:
In the training module, the images extracted from the captured video are trained using SVM and then stored in the database with an assigned class label. Fig. 7.1 shows the overview of the system in the training section. All the trained images are used to extract features, which are further used for testing. First, the different frames are captured from the live video, since a video is nothing but a set of images. The training is then performed on the captured frames. After that, every image is processed with filtering techniques (noise removal, edge detection, or shape detection), and the Histogram of Oriented Gradients (HOG) algorithm is applied for feature extraction. The HOG algorithm describes the objects (hands) and motion shapes in the images through their intensity gradients and edges. A grayscale image is then generated and used as input. The output is a list of points on the image, each associated with a vector of low-level descriptors. These points are called key points, and their descriptors are invariant to rescaling, in-plane rotation, and noise addition, and in some cases to changes of illuminant. The gestures captured from the images are used to generate their exact meaning and are rendered in English. Thus, in the training section, the meaning of each gesture is inserted into the database.
2. Testing Phase:
This module tests the live video and produces the result in terms of frame segmentation. In this phase, a video is processed and divided into frames, and these frames are further processed by applying a purifying algorithm to remove noise from the images; the median blur technique is used to filter each image. The lower part of fig. 7.1 shows the testing phase. After the elimination of noise, the features of the images are extracted, and these features are matched against the training videos to recognize the text. The proposed system goes through the following steps to yield the desired result.
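The matching step of the testing phase (extract a feature from the incoming frame, classify it with the trained SVM, and map the predicted label to English text) can be tied together in a toy sketch. Everything here is illustrative: a trivial two-number feature stands in for HOG, the frames are synthetic, and the label-to-text dictionary stands in for the trained database.

```python
import numpy as np
from sklearn.svm import SVC

meanings = {0: "hello", 1: "thanks"}       # class label -> English text

def feature(frame):
    """Stand-in feature: brightness of one column and one row."""
    return [frame[:, 3].mean(), frame[12, :].mean()]

# build a toy training set: gesture 0 lights a column, gesture 1 lights a row
rng = np.random.default_rng(1)
frames, labels = [], []
for label in (0, 1):
    for _ in range(10):
        f = rng.normal(0.0, 0.05, (16, 16))
        if label == 0:
            f[:, 3] += 1.0
        else:
            f[12, :] += 1.0
        frames.append(feature(f))
        labels.append(label)
clf = SVC().fit(frames, labels)

# testing phase: a new frame of gesture 0 is classified and mapped to text
test_frame = rng.normal(0.0, 0.05, (16, 16))
test_frame[:, 3] += 1.0
text = meanings[int(clf.predict([feature(test_frame)])[0])]
```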
Fig. 7.2 shows the use cases of the user. In our project, the user provides input for training or testing. He/she then performs video frame extraction, image processing, and feature extraction, and, for testing, performs feature matching on the test video.
Figure 7.2: Use Case
7.3 ER Diagram
ER modeling is a data modeling technique used in software engineering to produce a conceptual data model of an information system. Diagrams created using this ER modeling technique are called Entity-Relationship Diagrams, ER diagrams, or ERDs. The ER diagram in fig. 7.3 shows that the user is an entity who operates the application and performs the operations shown in the diagram.
Figure 7.3: ER Diagram
Figure 7.4: DFD 0 Diagram
After feature extraction, these features are matched with the trained video features from the database to detect the action. Fig. 7.5 shows the DFD-2 diagram for this application.
Figure 7.2 Activity Diagram
Figure 7.8: Class Diagram
1. Availability
The system must always be available for action recognition.
1. Modifiability
1. Performance
The performance of this system depends upon the nature and size of the input data. Performance is measured in terms of accuracy. The error, which tells us how many instances are misclassified, can also be used as a performance measure.
1. Security
Security of the system is an important issue. We use login details to authenticate users. Security is achieved by applying several constraints.
1. Testability
The action database can be upgraded to add new actions. The system can then be tested with the actions present in the database using the SVM classifier.
1. Usability
The role of action recognition systems in society is to ensure that deaf people have equality of opportunity and full participation in society.
Data mining systems such as this product interact via stream data. The major components of the interface are the Java virtual machine and an integrated development environment (IDE). The system design takes into consideration how the system will look to the end user.
1. External Machine Interfaces
1. Human Interface
The human-machine interface, or user interface, is the part of the machine that handles the interaction between human and machine. Any user can use this application as an external entity by giving a dataset as input.
1. Validity Criteria
Chapter 8
This dissertation has introduced English text generation from sign language. The process uses gesture recognition, image preprocessing, and feature extraction. Each video is split into frames at one-second intervals, and filtering and shape detection techniques are applied to every frame. The HOG algorithm is used to mine the features, and these features are used to compare the test video with the training videos.
REFERENCES
[1] M. S. Ryoo and J. K. Aggarwal, "Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities," in International Conference on Computer Vision, Kyoto, Japan, 2009, pp. 1593-1600.
[2] X. Sun, M. Chen, and A. Hauptmann, "Action Recognition via Local Descriptors and Holistic Features," in IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 58-65.
[3] Ayushi Gahlot, Purvi Agarwal, and Akshya Agarwal, "Skeleton based Human Action Recognition using Kinect," Recent Trends in Future Prospective in Engineering & Management Technology, 2016.
[4] Di Wu and Ling Shao, "Silhouette Analysis-Based Action Recognition Via Exploiting Human Poses," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, Feb. 2013.
[5] Ting Liu, Mojtaba Seyedhosseini, and Tolga Tasdizen, "Image Segmentation Using Hierarchical Merge Tree," IEEE Transactions on Image Processing, vol. 25, no. 10, October 2016.
[6] Chengcheng Jia and Yun Fu, "Low-Rank Tensor Subspace Learning for RGB-D Action Recognition," IEEE Transactions on Image Processing.
[7] Ayushi Gahlot, Purvi Agarwal, and Akshya Agarwal, "Skeleton based Human Action Recognition using Kinect," Recent Trends in Future Prospective in Engineering & Management Technology, 2016.
[8] M. S. Ryoo and J. K. Aggarwal, "Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities," in International Conference on Computer Vision, Kyoto, Japan, 2009, pp. 1593-1600.
[9] S. Fudickar and K. Nurzynska, "A User-Friendly Sign Language Chat," in Proceedings of the Conference ICL2007, Villach, Austria, 26-28 September 2007.
[10] Amit Kumar and Ramesh Kagalkar, "Advanced Marathi Sign Language Recognition using Computer Vision," International Journal of Computer Applications (0975-8887), vol. 118, no. 13, May 2015.
[11] Ramesh M. Kagalkar and Nagaraja H. N., "New Methodology for Translation of Static Sign Symbol to Words in Kannada Language," International Journal of Computer Applications (0975-8887), vol. 121, no. 20, July 2015.
[12] Ramesh M. Kagalkar, Dr. Nagaraj H. N., and Dr. S. V. Gumaste, "A Novel Technical Approach for Implementing Static Hand Gesture Recognition," International Journal of Advanced Research in Computer and Communication Engineering (ISSN (Online) 2278-1021, ISSN (Print) 2319-5940), vol. 4, issue 7, July 2015.