
“Alexandru Ioan Cuza” University of Iasi

Faculty of Computer Science

DISSERTATION

SPEECH RECOGNITION AND ITS EDUCATION APPLICATIONS

PROPOSED BY: Lucian – Petru C. ȚUCĂ

SCIENTIFIC COORDINATOR: Assoc. Prof. PhD Adrian IFTENE
– 2017 –
Declaration regarding originality and observance of copyright

I hereby declare, on my own responsibility, that the thesis entitled "Speech recognition and its education applications" was written by me and has never been presented to another faculty or state higher-education institution in the country or abroad.
I also declare that all the sources used, both offline and online, are indicated in the paper, and that the rules for avoiding plagiarism have been observed:
- all fragments of text reproduced exactly, even in my own translation from another language, are placed in quotation marks and carry a precise reference to the source from which they were taken;
- rephrasings of texts written by other authors, using words from my own vocabulary, carry an exact reference to their origin;
- source code, images, etc. taken from open-source projects or similar sources are used in accordance with copyright law and carry precise references to their origin;
- wherever the ideas of other authors are summarized, the reference text from which they were taken is indicated.

Graduate
Lucian – Petru C. ȚUCĂ,

_________________________
(original signature)

Declaration of consent

I hereby declare, on my own responsibility, that the dissertation entitled "Speech recognition and its education applications", the source code of the programs, and the other content (graphics, multimedia, test data, etc.) accompanying this paper may be used within the Faculty of Computer Science of "Alexandru Ioan Cuza" University, Iasi.
I also agree that this state higher-education institution may use, modify, reproduce, or distribute, for non-commercial purposes, the computer programs (in executable and/or source form) that I created personally as part of this dissertation.

Graduate
Lucian – Petru C. ȚUCĂ,

_________________________
(original signature)

Contents
Contents 3

Abstract 5

Motivation 6

State of the art 8


Notions 8
Problem 10
Solution 10
Conclusions 10

Speech recognition 11
Google Cloud Speech API 11
Cloud Speech API characteristics 11
IBM Bluemix Speech to Text 12
Watson Speech to Text characteristics 13
Open Source alternative 14
Kaldi 15
Evaluation, results and comparison 16
Evaluation 17
Commands recognition 17
Google Cloud Speech API 18
IBM Bluemix Speech to Text 19
Kaldi 20
Costs 21
Google Cloud Speech API 21
IBM Bluemix Speech to Text 22
Kaldi 22
Conclusions 23

Voice Geometry Painter 24


Storyboard 25
Userflow 28
Sketches & Wireframe 29
Sketch 30
Wireframe 30
Speech interface 31
Commands 31
Questions, options, criteria (QOC) 31

Architecture 35
Speech logic 35
Command parser 36
Drawing context 36
Architecture UML 37
Implementation 38
Speech logic & Command parser 38
Commands 38
Google interim results 39
Best matching interim result 40
Drawing context 40
Conclusions 43

App evaluation 43

Conclusions and further work 43

Bibliography and References 43

Attachments 43
Annex 1 - Figure 18: Voice Geometry Painter map 44
Annex 2 - Figure 24: Architecture UML 45

Abstract
Nowadays, we find ourselves in an era in which education is being reformed while, on the other side, technology is becoming better, more capable, and more accessible than ever.1
The IoT (Internet of things) is already altering health care, security, utilities,
transportation, and household management. The devices themselves might be small, but they
bring about major changes in how we live, work, and educate our society; we must plan for
and question those changes.2
Winona Ryder on the Internet of Things, computer interaction in classrooms, and e-learning with smart devices:
“We must approach the Internet of Things from a place that doesn't reduce ourselves,
or reduce students, to mere algorithms. We must approach the IoT as a space of learning, not
as a way to monitor and regulate. Our best tools in this are ones that encourage compassion
more than obedience. The Internet is made of people, not things.”3
Anticipating all these phenomena, this project aims to provide a working prototype of an education-oriented app that will make teaching easier and learning more pleasant. This application will use speech recognition as one of its pillars and will address students of basic geometry.

1
"How IoT in Education is Changing the Way We Learn - Business Insider." 20 Dec. 2016, ​http://www.
businessinsider.com/internet-of-things-education-2016-9​. Accessed 16 Jan. 2017.
2
"The Internet of Things for Educators and Learners | EDUCAUSE." 8 Aug. 2016, ​http://er.educause.edu/
articles/2016/8/the-internet-of-things-for-educators-and-learners​. Accessed 16 Jan. 2017.
3
"Winona Ryder and the Internet of Things | EDUCAUSE." 27 Jun. 2016, ​https://er.educause.edu/articles/2016/
6/winona-ryder-and-the-internet-of-things​. Accessed 16 Jan. 2017.

Motivation
The Internet of Things is considered4 to be the next major breakthrough in technology, if not the current one5. It is reckoned that IoT devices, and the interaction with them, will become tremendously popular, available in most houses, office buildings6, and other types of institutions.
Given this popularity, poor knowledge and management of the Internet of Things can lead to major catastrophes, e.g.:

“The research firm Gartner has estimated that, by 2020, there will be 250 million
connected cars on the world’s roads, with many of them capable of driving themselves. There
are eight million traffic accidents each year and 1.3 million crash-related deaths; Cisco’s
Smart, Connected Vehicles division has posited that autonomous cars could eliminate as
many as 85% of head-on collisions. They could also help ease traffic, since they’ll be able to
communicate their positions to each other and therefore drive much closer together than
vehicles piloted by humans. Traffic experts call this “platooning”—packing more cars into
the same road space—and it could help save drivers at least some of the 90 billion hours they
currently spend stuck in jams each year, generating 220 million metric tonnes of
carbon-equivalent and wasting at least $1 trillion in fuel costs and lost productivity.” 7

Considering the above quotation, we thought about the negatives that can happen, besides the obvious advantages the system brings. If we imagine a society that has poor knowledge about IoT, all the improvements the system offers will be delayed by a reluctant population. Given the fact that time is so important, due to its irreversible character and the speed with which it seems to pass, this paper chose to approach a subject that hopes to educate by entering the education system itself.
With this in mind, we thought that no place would be better to start educating people about technology (IoT) than education itself.

This project aims to build a working prototype of an application that uses techniques from the IoT field in order to educate people about IoT and familiarize them with its presence, its techniques, and its major importance in the near future.

4
"The Internet of Things – Promise for the Future? An ... - CiteSeerX." ​http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.458.8816&rep=rep1&type=pdf​. Accessed 16 Jan. 2017.
5
"Eight ways the Internet of Things will change the ... - The Globe and Mail." ​http://www.theglobeandmail.
com/report-on-business/rob-magazine/the-future-is-smart/article24586994/​. Accessed 16 Jan. 2017.
6
"Sensing the future of the Internet of Things - PwC's Accelerator." ​https://www.pwcaccelerator.com/
pwcsaccelerator/docs/pwc-future-of-the-internet-of-things.pdf​. Accessed 16 Jan. 2017.
7
"Eight ways the Internet of Things will change the ... - The Globe and Mail." ​http://www.theglobeandmail
.com/report-on-business/rob-magazine/the-future-is-smart/article24586994/​. Accessed 16 Jan. 2017.

This paper will address voice recognition8, a pillar when it comes to the Internet of Things.9

8
"Voice Recognition for the Internet of Things - MIT Technology Review." 24 Oct. 2014, ​https://www.
technologyreview.com/s/531936/voice-recognition-for-the-internet-of-things/​. Accessed 16 Jan. 2017.
9
"The Role of Voice in the Internet of Things - Strategy Analytics." 19 Feb. 2016,
https://www.strategyanalytics
.com/strategy-analytics/blogs/iot/2016/02/19/the-role-of-voice-in-the-internet-of-things​. Accessed 16 Jan. 2017.

State of the art
Notions
1. What is the Internet of Things?
The Internet of Things is an emerging topic of technical, social, and economic
significance. Consumer products, durable goods, cars and trucks, industrial and utility
components, sensors, and other everyday objects are being combined with Internet
connectivity and powerful data analytic capabilities that promise to transform the way we
work, live, and play. Projections for the impact of IoT on the Internet and economy are
impressive, with some anticipating as many as 100 billion connected IoT devices and a
global economic impact of more than $11 trillion by 2025.10

2. What is ​Education?
Education is the act or process of imparting or acquiring general knowledge,
developing the powers of reasoning and judgment, and preparing oneself or others
intellectually for mature life.

3. What is ​Speech Recognition?


Speech recognition (SR) is the inter-disciplinary sub-field of computational linguistics
that develops methodologies and technologies that enable the recognition and translation of
spoken language into text by computers. It is also known as "automatic speech recognition"
(ASR), "computer speech recognition", or just "speech to text" (STT). It incorporates
knowledge and research in the linguistics, computer science, and electrical engineering
fields.11

4. What is Speech Recognition within the Internet of Things?


With the rapid spreading of smart devices, there has been a growing interest in the
concept of Internet of Things (IoT). While as a network of connected objects, IoT is created
by enabling machine to machine interactions, another important factor of IoT is the
human-machine interaction. By creating a connected life, people can interact with their
devices, appliances, vehicles, etc. As one of the most natural ways of communication, using
speech to interact with things adds value by enriching the user experience.12

10
"The Internet of Things (IoT): An Overview | Internet Society." 15 Oct. 2015,
https://www.internetsociety.org/ doc/iot-overview​. Accessed 16 Jan. 2017.
11
"Speech recognition - Wikipedia." ​https://en.wikipedia.org/wiki/Speech_recognition​. Accessed 16 Jan. 2017.
12
"Personalized speech recognition for Internet of Things - IEEE Xplore ...."
http://ieeexplore.ieee.org/document/7389082​. Accessed 16 Jan. 2017.

5. What is ​Speech Recognition within ​Education?
For language learning, speech recognition can be useful for learning a second
language. It can teach proper pronunciation, in addition to helping a person develop fluency
with their speaking skills.

Students who are blind or who have very low vision can benefit from using the
technology to convey words and then hear the computer recite them, as
well as use a computer by commanding with their voice, instead of having to look at the
screen and keyboard.

Students who are physically disabled, or who suffer from repetitive strain injury or
other injuries to the upper extremities, can be relieved of having to worry about
handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They
can also utilize speech recognition technology to freely enjoy searching the Internet or using
a computer at home without having to physically operate a mouse and keyboard.

Speech recognition can allow students with learning disabilities to become better
writers. By saying the words aloud, they can increase the fluidity of their writing, and be
alleviated of concerns regarding spelling, punctuation, and other mechanics of writing.

Use of voice recognition software, in conjunction with a digital audio recorder and a
personal computer running word-processing software, has proven positive for restoring
damaged short-term-memory capacity in individuals who have had a stroke or craniotomy.13

13
"Speech recognition - Wikipedia." ​https://en.wikipedia.org/wiki/Speech_recognition​. Accessed 16 Jan. 2017.

Problem
The problem that was identified, and that this project plans to address, can be
defined as follows:

An uneducated, unprepared, unaccustomed population that is faced with a major
breakthrough (IoT) affecting their daily life (home, commute, traffic, office …) will not
benefit from it entirely, and there are chances that it (IoT) will cause more harm than good.

Solution
Given the problem identified above, this paper plans to address it as follows:

Subliminally educate the young about the Internet of Things by having them come into
contact with IoT-specific technologies and practices within their own education system.

The proposed solution is an application that 3rd - 4th graders and their teachers can
use at school for geometry lessons. This application will have speech recognition as a pillar
and aims to educate youngsters about technology and IoT by having them encounter
specific techniques (speech recognition) at this early stage.
A second pillar that this application will be built upon is the user experience.
The human-computer interaction factor is tremendously important for how the application will
be perceived when used. It will be a difficult task, as everything that can be interacted with,
and everything the application shows, must be easy to use for both children and teachers, and
must please two age categories that represent two different personas (annexes 3 and 4).

Conclusions
The problem that has been identified tends to be critical for the future.
This paper aims to prove that it can be tackled, and that knowledge and education
can be instilled from an early age in order to build a better future.

“If you are planning for a year, sow rice; if you are planning for a
decade, plant trees; if you are planning for a lifetime, educate people.”
Chinese proverb

Speech recognition
Google Cloud Speech API

Figure 1:​ ​Google Cloud Speech API14

Google Cloud Speech API enables developers to convert audio to text by applying
powerful neural network models in an easy to use API. The API recognizes over 80
languages and variants, to support your global user base. You can transcribe the text of users
dictating to an application’s microphone, enable command-and-control through voice, or
transcribe audio files, among many other use cases.
Google manages to do this by applying large-scale neural network models targeted at
natural language processing.

Cloud Speech API characteristics


Google's new API contains some of the most interesting functionalities for cases when you
need an application programming interface linked to natural language processing and speech
recognition that obtains results in real time. This matters because a sufficiently high
processing speed is needed to be able to respond immediately.

● Automatic Speech Recognition (ASR): a deep-learning neural network is used

to recognize speech, provide speech-based search features, and transcribe speech.

14
Google Cloud Speech API: ​http://thenextweb.com/google/2016/03/23/now-can-power-app-googles-speech-
recognition-software/

● Streaming recognition: as the API processes and recognizes the user's speech, it
returns results in real time with no waiting times. This allows the application to offer all
speech processing functionalities.

● Buffered audio support: the API processes sound from the microphone of an
application or mobile device and packages it in various compression formats: FLAC, AMR,
PCMU and linear-16. This compression is necessary to subsequently process the sound.

● Speech recognition in over 80 languages. This characteristic offers a major


competitive advantage over other providers of similar services for external developers.

● Integrated API.

● Inappropriate content filtering.15
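To illustrate how the characteristics above surface in practice, the sketch below builds the JSON body of a synchronous recognize request. This is not code from this project: the helper name `build_recognize_request`, the default parameter values, and the field names (which follow the v1 REST documentation of the time) are assumptions that should be verified against the current Cloud Speech API reference.

```python
import base64
import json

# Hypothetical sketch (not this project's code): assemble the JSON body
# of a synchronous Cloud Speech "recognize" request. Field names follow
# the v1 REST docs; verify them against the current API reference.
def build_recognize_request(audio_bytes, language="en-US", sample_rate=16000):
    return {
        "config": {
            "encoding": "LINEAR16",          # uncompressed 16-bit PCM
            "sampleRateHertz": sample_rate,  # must match the recording
            "languageCode": language,        # one of the 80+ supported codes
        },
        "audio": {
            # short clips are sent inline, base64-encoded
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }

body = build_recognize_request(b"\x00\x01" * 1600)
print(json.dumps(body["config"], sort_keys=True))
```

Such a body would then be POSTed to the recognize endpoint together with credentials; streaming recognition uses a persistent connection instead of inline content.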

IBM Bluemix Speech to Text

Figure 2:​ ​IBM Bluemix Speech to Text16

Watson Speech to Text can be used anywhere there is a need to bridge the gap
between the spoken word and its written form. This easy-to-use service uses machine
intelligence to combine information about grammar and language structure with knowledge

15
"Google releases API to convert audio into text: characteristics for ...." 10 Jun. 2016, ​https://bbvaopen4u.com/
en/actualidad/google-releases-api-convert-audio-text-characteristics-developers​. Accessed 16 Jan. 2017.
16
IBM Bluemix Speech to Text: ​http://www.tricedesigns.com/2015/07/13/voice-driven-native-mobile-apps-
with-ibm-watson/

of the composition of an audio signal to generate an accurate transcription. It uses IBM's
speech recognition capabilities to convert speech in multiple languages into text. The
transcription of incoming audio is continuously sent back to the client with minimal delay,
and it is corrected as more speech is heard. Additionally, the service now includes the ability
to detect one or more keywords in the audio stream. The service is accessed via a WebSocket
connection or REST API.17

Watson Speech to Text characteristics


● Automatic Speech Recognition: deep neural networks (DNNs) trained using
new methods have been shown to outperform GMMs on a
variety of speech recognition benchmarks, sometimes by a large margin.18
● The Speech to Text service can be used anywhere voice-interactivity is
needed. In addition to transcribing audio in multiple languages, the service
provides the ability to detect the presence of specific keywords or key phrases
in the input stream. Common uses for the Speech to Text service include:
○ Interactions in mobile experiences
○ Transcribing media files
○ Call center transcriptions
○ Voice control of embedded systems
○ Converting sound to text to make data searchable19
● Integrated REST API:
○ The IBM® Speech to Text service provides an Application
Programming Interface (API) that lets you add speech transcription
capabilities to your applications. To transcribe the human voice
accurately, the service leverages machine intelligence to combine
information about grammar and language structure with knowledge of
the composition of the audio signal. The service continuously returns
and retroactively updates the transcription as more speech is heard.

17
"Speech to Text | IBM Watson Developer Cloud." ​https://www.ibm.com/watson/developercloud/speech-to
-text.html​. Accessed 16 Jan. 2017.
18
"Speech to Text Service Documentation | Watson Developer Cloud - IBM." ​https://www.ibm.com/watson/
developercloud/doc/speech-to-text/science.shtml​. Accessed 17 Jan. 2017.
19
"Speech to Text | IBM Watson Developer Cloud." ​https://www.ibm.com/watson/developercloud/speech-to
-text.html​. Accessed 17 Jan. 2017.

● Available Languages: English (US), English (UK), Japanese, Arabic (MSA,
Broadband model only), Mandarin, Portuguese (Brazil), Spanish, French
(Broadband model only)
● Speaker labels (beta): Recognizes different speakers from narrowband audio
in US English, Spanish, or Japanese. This feature provides a transcription that
labels each speaker's contributions to a multi-participant conversation.
● Keyword spotting (beta): Identifies spoken phrases from the audio that match
specified keyword strings with a user-defined level of confidence. This feature
is especially useful when individual words or topics from the input are more
important than the full transcription. For example, it can be used with a
customer support system to determine how to route or categorize a customer
request.
● Word alternatives (beta), confidence, and timestamps: Reports alternative
words that are acoustically similar to the words that it transcribes, confidence
levels for each of the words that it transcribes, and timestamps for the start and
end of each word.
● Maximum alternatives and interim results: Returns alternative and interim
transcription results. The former provide different possible hypotheses; the
latter represent interim hypotheses as the transcription progresses. In both
cases, the service indicates final results in which it has the greatest confidence.
● Profanity filtering: Censors profanity from US English transcriptions by
default. You can use the filtering to sanitize the service's output.
● Smart formatting (beta): Converts dates, times, numbers, phone numbers, and
currency values in final transcripts of US English audio into more readable,
conventional forms.
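The keyword spotting and formatting features listed above are exposed as options on the service's recognize call. The sketch below only assembles those options into a parameter dictionary; the helper name and default values are illustrative assumptions, and the parameter names (taken from the Watson Speech to Text documentation of the time) should be checked against the current API reference.

```python
# Hypothetical sketch (not this project's code): build the query
# parameters for a Watson Speech to Text /v1/recognize call, enabling
# the keyword spotting and smart formatting features described above.
def watson_recognize_params(keywords, keywords_threshold=0.5):
    return {
        "model": "en-US_BroadbandModel",           # language + bandwidth model
        "keywords": ",".join(keywords),            # keyword spotting (beta)
        "keywords_threshold": keywords_threshold,  # confidence cut-off, 0.0-1.0
        "max_alternatives": 3,                     # alternative hypotheses
        "smart_formatting": "true",                # readable dates/numbers (beta)
        "timestamps": "true",                      # per-word start/end times
    }

params = watson_recognize_params(["draw", "delete", "undo"])
```

The same options can be supplied over the WebSocket interface when interim results are needed.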

Open Source alternative


It is known that building a speech recognition engine is an incredibly difficult task.
Fortunately, quite a few speech recognition toolkits are available today, meant to be the
foundation on which a speech recognition engine is built.
The best part is that several free ones are of very high quality. Below, one of them is
presented and detailed as a possible alternative to Google Cloud Speech API and
IBM Bluemix Speech to Text. It is important to keep in mind its open-source orientation, its
raw character, and its lack of eye-candy features.
It is designed and developed to be a simple yet reliable speech recognition engine, no
more, no less.

Kaldi

Figure 3: ​Kaldi20

“Kaldi is a toolkit for speech recognition written in C++ and licensed under the
Apache License v2.0. Kaldi is intended for use by speech recognition researchers.”21

The flavor of Kaldi


“This section summarizes some of the more generic qualities of the Kaldi toolkit. To
some extent this describes the goals of the current developers, as much as it describes the
current status of the project.
● Kaldi emphasizes generic algorithms and universal recipes
○ "generic algorithms" means things like linear transforms, rather than
those that are specific to speech in some way. But it is not intended to
be too dogmatic about this, if more specific algorithms are useful.
○ Recipes that can be run on any data-set, rather than those that have to
be customized are preferred.
● Provably correct algorithms are preferred
○ The recipes have been designed in such a way that in principle they
should never fail in a catastrophic way. There has been an effort to
avoid recipes and algorithms that could possibly fail, even if they don't
fail in the "normal case" (one example: FST weight-pushing, which
normally helps but can crash or make things much worse in certain
cases).
● Kaldi code is thoroughly tested.
○ The goal is for all or nearly all the code to have corresponding test
routines.
● Simple cases are kept simple.
○ There is a danger when building a large speech toolkit that the code
can become a forest of rarely used alternatives. We are trying to avoid
this by structuring the toolkit in the following way. Each
command-line program generally works for a limited set of cases (e.g.

20
"kaldi-asr.org." ​http://kaldi-asr.org/​. Accessed 19 Jan. 2017.
21
"Kaldi: About the Kaldi project." ​http://kaldi-asr.org/doc/about.html​. Accessed 19 Jan. 2017.

a decoder might just work for GMMs). Thus, when you add a new type
of model, you create a new command-line decoder (that calls the same
underlying templated code).
● Kaldi code is easy to understand.
○ Even though the Kaldi toolkit as a whole may get very large, it is
aimed for each individual part of it to be understandable without too
much effort.
● Kaldi code is easy to reuse and refactor.
The toolkit is desired to be as loosely coupled as possible. In general
this means that any given header should need to #include as few other
header files as possible. The matrix library, in particular, only depends
on code in one other subdirectory so it can be used independently of
almost all the rest of Kaldi.”22
As mentioned before, the description of Kaldi above was chosen to show what
makes it one of the best open-source alternatives, rather than the features that would sell it.
Within the context of this paper, for open source, the principles behind how something is built
supersede the fancy, powerful features of the cloud giants.

Evaluation, results and comparison


“According to a report23 published by MarketsandMarkets, the speech
recognition market, currently at $3.73 billion, will grow to $9.97 billion by 2022. The voice
recognition market is expected to grow from $440 million today to $1.99 billion by 2022.
That’s a compound annual growth rate of 23.66 percent.24
The main reason the speech recognition market is growing is the widespread
adoption of mobile computer technology. Smartphones have become essential
in our lives (like it or not). Developers have picked speech recognition technology to make it
easier for us to interact with our smartphones and their applications.
Another important reason for growth in the speech recognition market is its rising
usage and popularity in the healthcare sector. Hospitals and doctor’s offices have been using
the likes of electronic health records (EHR) systems to keep track of customer records.
These devices and applications often feature speech recognition and text-to-speech
functionality. A health practitioner can simply ask for a record and then have it read aloud to
them. EHRs can also eliminate the need for typing in notes: practitioners just speak to their
devices, which transcribe all of their notes into text!
Translations, transcription, hands-free computing, and automated customer service are
some of the main uses of speech recognition.25

22
"Kaldi: About the Kaldi project." ​http://kaldi-asr.org/doc/about.html​. Accessed 19 Jan. 2017.
23
"MarketsandMarkets." ​http://www.marketsandmarkets.com/​. Accessed 19 Jan. 2017.
24
"Speech Market Projected To See Triple Growth ... - Text2Speech Blog." ​http://blog.neospeech.com/2016/06/
29/speech-market-growth/​. Accessed 19 Jan. 2017.
25
"Speech Market Projected To See Triple Growth ... - Text2Speech Blog." ​http://blog.neospeech.com/2016/06/
29/speech-market-growth/​. Accessed 19 Jan. 2017.

Much like how speech recognition and voice recognition are being used more on our
mobile devices, the report expects other consumer devices to integrate these technologies into
their applications. Examples include thermostats, mixers, ovens, and refrigerators.
When you add text-to-speech technology to the mix, developers are able to create a
complete hands-free experience for any device by using a natural language user interface. We
can talk to our devices and they can talk back to us. Speech technology is making it easier for
people to interact with technology.
Don’t expect the growth of the speech market to slow down anytime soon. Developers
are continuing to find new ways to use speech technology and enhance our experiences with
technology.”

Evaluation
In order to choose a speech recognition engine and use it, besides the confidence that
the market shares, it was decided to run some tests in order to analyse the results, compare
them and make some statistics.
The obtained results will have a critical role in choosing the engine for the
application.
The evaluation will consist in 2 tests:
1. Recognition accuracy for quiet environment;
2. Recognition accuracy for noisy environments.

The audio input provided to the engines contains the commands listed below.
These commands will also be used to create a demo of the prototype application
(detailed in the next chapter). They were chosen based on complexity,
combination of words, letters, and contextual meaning.

Commands recognition
1. Set title demo
2. Draw line CD 10 and 11 and 20 and 7
3. Show properties of CD
4. Close popup
5. Clear board
6. Draw rectangular triangle ABC 25 and 27 and 9 and 11
7. Show properties of ABC
8. Close popup
9. Delete ABC
10. Draw square ABCD 7 and 4 and 5
11. Set line color green
12. Draw circle F 11 and 24 and 9
13. Delete ABCD
14. Undo
15. Download
16. Share on dropbox
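To give an idea of how such commands can be handled programmatically, here is a minimal parsing sketch. The function `parse_command` and its output format are illustrative assumptions, not the application's actual parser: it splits a recognized sentence into an action, an optional shape label, and its numeric arguments.

```python
import re

# Illustrative sketch (not the app's real parser): extract an action,
# an optional shape label (an all-caps token such as "CD" or "ABCD"),
# and the numeric arguments from a recognized command sentence.
def parse_command(sentence):
    label_match = re.search(r"\b[A-Z]{1,4}\b", sentence)
    label = label_match.group(0) if label_match else None
    numbers = [int(t) for t in sentence.split() if t.isdigit()]
    # the action is what remains once the label, numbers and "and" are removed
    action = " ".join(
        t for t in sentence.lower().split()
        if not t.isdigit() and t != "and" and (label is None or t != label.lower())
    )
    return action, label, numbers

parse_command("Draw line CD 10 and 11 and 20 and 7")
# -> ("draw line", "CD", [10, 11, 20, 7])
```

A dispatcher could then map the action string onto a drawing operation and pass the label and numbers along as its arguments.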

The results will be evaluated in the following way:
● If the command sentence is 100% recognized (this includes '-'-separated
words, e.g. pop-up), the cell will be colored green and in the calculus it will be
considered 100% correct.
● If the command sentence is recognized but numbers are not displayed using
digits (e.g. ten instead of 10), or only one or two words are misrecognized, the
cell will be colored yellow and in the calculus it will be considered 50%
correct.
● If the command sentence is misrecognized, the cell will be colored red and in
the calculus it will be considered 0% correct.
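The three-color scheme translates directly into a weighted accuracy. The sketch below (the function and weight names are illustrative, not taken from this paper) computes the overall percentage from a list of per-command marks.

```python
# Sketch of the scoring rules above: green counts as 100% correct,
# yellow as 50% correct, and red as 0% correct.
WEIGHTS = {"green": 1.0, "yellow": 0.5, "red": 0.0}

def accuracy(marks):
    """Overall recognition accuracy, as a percentage, for one test run."""
    return 100.0 * sum(WEIGHTS[m] for m in marks) / len(marks)

# e.g. 16 all-green commands -> 100.0; 12 green and 4 red -> 75.0
```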

Google Cloud Speech API

Google Cloud Speech API quiet environment results:
1. set title demo
2. draw line CD 10 and 11 and 20 and 7
3. show properties of CD
4. close popup
5. clear board
6. draw rectangular triangle ABC 25 and 27 and 9 and 11
7. show properties of ABC
8. close popup
9. delete ABC
10. draw square ABCD 7 and 4 and 5
11. set line color green
12. draw circle F 11 and 24 and 9
13. delete ABCD
14. undo
15. download
16. share on dropbox

Google Cloud Speech API noisy environment results:
1. should I go demo
2. draw line CD 10 and 11 and 20 and 7
3. show properties of CD
4. close popup
5. clear board
6. draw a rectangular triangle ABC 25 and 27 and 9 and 11
7. show properties of a DC
8. close popup
9. delete ABC
10. brock Square ABCD 7 and 4 and 5
11. set line colour green
12. draw circle F11 and 24 and 9
13. delete ABCD
14. undo
15. download
16. share on dropbox

Table 1: Google Cloud Speech API results

Google Cloud Speech API provided amazing results on both environments.
1. Quiet environment: 100% accuracy;
2. Noisy environment: 87.5% accuracy.

IBM Bluemix Speech to Text

# IBM Bluemix Speech to Text ​quiet environment results

1 Sit tight co them all.

2 0 OO line she. Then and you live on end to 25.

3 Show properties both CD.

4 Close both Bob.

5 Do you boredom.

6 The road is dangerous triangle ABC 25 and 27 end line and eat less fun.

7 Show properties off 18 BC.

8 Hello ballpark.

9 The a BC.

10 The rule square ABC the. Said 1 and 4 and 5.

11 Said line Carla Greene.

12 The rule circle S.. You live on and 33.

13 Do you need ABC.

14 On new.

15 Download.

16 Cher on little box.


Table 2: ​IBM Bluemix Speech to Text results

IBM Bluemix Speech to Text provided very poor results:


1. Quiet environment: 6.25% accuracy;
2. Noisy environment: was skipped due to previous results.

Kaldi

Kaldi quiet environment results:
1. set title demo
2. draw line cd ten and eleven and twenty and seven
3. show properties of cd
4. close pop-up
5. clean board
6. draw rectangular triangle a busy twentyfive and twenty seven and nine and eleven
7. show properties of a busy
8. close pop-up
9. delete a busy
10. draw square abcd seven and four and five
11. set line color green
12. draw circle F11 and twentyfour and nine
13. delete abcd
14. undo
15. download
16. share on dropbox

Kaldi noisy environment results:
1. set title demo
2. draw line cd ten and eleven and twenty and seven
3. show properties of cd
4. close pop-up
5. clean board
6. draw rectangular triangle a busy twentyfive and twenty seven and nine and eleven
7. show properties of a busy
8. close pop-up
9. delete a busy
10. draw square abcd seven and four and five
11. set line color green
12. draw circle F11 and twentyfour and nine
13. delete a busy did
14. undo
15. download
16. share on dropbox

Table 3: Kaldi results26

Kaldi provided these results:

1. Quiet environment: 71.875% accuracy;
2. Noisy environment: 71.875% accuracy.

26
"GitHub - jcsilva/docker-kaldi-gstreamer-server: Dockerfile for kaldi ...." ​https://github.com/jcsilva/docker
-kaldi-gstreamer-server​. Accessed 25 Jan. 2017.

Kaldi's biggest strength is also its biggest weakness: being open source. Considering this paper's context, namely developing within an Internet of Things setting, having something that must be hosted, configured and maintained on premises may not be a great idea. In short, this brings hosting and maintenance costs. There is also no support available, and the latest major activity (on the important modules) was in summer 2016.
Another important thing to consider is the set of models that Kaldi is trained with. In order to get the best results one would need the best models. These models are not open source; they can be purchased or subscribed to. In the end this would bring even more costs to our deployment.
Interestingly, Kaldi managed to ignore the background noise and scored almost identically in both environments. This is an important thing to consider with regard to classrooms.
On the other hand, the fact that Kaldi ignored the noise cannot be considered a very big advantage, because a quality microphone together with intermediate noise-filtering software can achieve the same effect.

Costs
The price paid for the service is an important factor that has to be considered. Given the educational (school) context, a lower price can be a major advantage.
In order to compute a maximum total we will consider 4 hours of use, repeated 5 times a month, for a period of one year: 4 × 60 × 5 × 12 = 14,400 minutes.

Google Cloud Speech API

Monthly usage Price per 15 seconds**

0 - 60 minutes Free

61 - 1,000,000 minutes* $0.006


Table 4​: Google prices

* Monthly usage is capped at 1 million minutes per month

** This pricing is for applications on personal systems (e.g., phones, tablets, laptops,
desktops). Please contact us for approval and pricing to use Speech API on embedded devices
(e.g., cars, TVs, appliances, or speakers).27

Total​ = $345.6
Other advantages:
1. Unlimited professional support;
2. Periodic updates to both the models and the underlying algorithms.

27
"Speech API - Google Cloud Platform." ​https://cloud.google.com/speech/​. Accessed 25 Jan. 2017.

IBM Bluemix Speech to Text

The first thousand minutes per month are free. Additional minutes are $0.02 per minute.
The plan includes the ability to use wideband models for all supported languages. It also includes confidence scores per word, time offsets per word, and alternate hypotheses per phrase.28
Total ​= $288
Other advantages:
1. Possibility to negotiate educational fees;
2. Overall cheaper;
3. More free minutes per month.

Kaldi
As this is not in the cloud, several more things need to be taken care of:
1. Hosting = $3 per month for the first 12 months, $4 per month afterwards, totalling $36 for the first year.29
2. Kaldi installation, configuration, maintenance and upgrades must be done in house.
3. Linguistic Data Consortium 2016 Not-for-Profit Membership30 = $3,850.00

Total ​> $4000

Engine                        Yearly total   Major advantage                                    Major disadvantage

Google Cloud Speech API       $345.6         High quality recognition.                          Most expensive cloud choice.

IBM Bluemix Speech to Text    $288           Cheaper and easily customizable.                   Poor recognition with the standard model.

Kaldi                         $4000          Infinite possibilities with good tech knowledge.   Requires installation, maintenance and support.

Table 5​: Costs summary
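The yearly totals above can be reproduced with a short calculation. This is only a sketch: the free-tier minutes are intentionally not subtracted, matching the totals quoted in the text.

```javascript
// Reproduce the yearly totals from Table 5 (sketch; free-tier minutes are
// not subtracted, matching the figures quoted in the text).
var MINUTES_PER_YEAR = 4 * 60 * 5 * 12; // 4 hours, 5 times a month, 12 months

// Google bills per 15-second increment: 4 increments per minute.
var googleTotal = MINUTES_PER_YEAR * 4 * 0.006; // ~345.6 USD

// IBM bills per minute.
var ibmTotal = MINUTES_PER_YEAR * 0.02; // ~288 USD

console.log(MINUTES_PER_YEAR, googleTotal.toFixed(2), ibmTotal.toFixed(2));
```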

28
"Speech to Text | IBM Watson Developer Cloud." ​https://www.ibm.com/watson/developercloud/speech-to
-text.html​. Accessed 25 Jan. 2017.
29
"Web Hosting | Lightning Fast Hosting & One Click Setup | GoDaddy UK." h​ ttps://uk.godaddy.com/hosting/
web-hosting​. Accessed 25 Jan. 2017.
30
"LDC2016MNP - LDC Catalog - University of Pennsylvania." ​https://catalog.ldc.upenn.edu/LDC2016MNP​.
Accessed 25 Jan. 2017.

Conclusions
Given the costs calculated above, where Google Cloud Speech API was the most expensive cloud option but offered the best results, it was decided to use it as the backbone for the speech recognition application prototyped within this paper.
For this paper's demonstration purposes a free educational trial will be used.
When this decision was taken, other factors were also considered, mainly the cloud technologies offered by Google, e.g.:
1. Google Home31
2. App Engine32
3. Google Cloud33

31
"Google Home - Smart Speaker & Home Assistant - Google Store." ​https://store.google.com/product/
google_home​. Accessed 25 Jan. 2017.
32
"App Engine - Google Cloud Platform." ​https://cloud.google.com/appengine/​. Accessed 25 Jan. 2017.
33
"Google Cloud Platform." ​https://cloud.google.com/​. Accessed 25 Jan. 2017.

Voice Geometry Painter

Figure 5​: Voice Geometry Painter logo

Voice Geometry Painter is an application that allows people (especially Mathematics teachers and pupils, but not only) to draw geometric shapes by speaking certain commands.
The main purpose of the application is to help teach geometry. The implicit (secondary) goal is to familiarize teachers, pupils and others with concepts that co-exist with the IoT.
A secondary purpose of the application is to allow people who have motor disabilities to draw and manipulate geometric shapes, helping them study geometry. We think this could be a great help for mathematics teachers who have a motor disability and cannot draw/write at the whiteboard.
The interaction with the application is simple, based on short voice commands like draw <shape_name> <simple_parameters> for drawing, and even simpler commands for other operations, like clear board.

Storyboard

Figure 6​:​ ​Storyboard tile 1


Figure 7​: Storyboard tile 2

1. Our application design focuses on improving the quality and the quantity of the way geometry is taught in elementary and secondary schools. The application's most important features are represented by the following keywords: accessibility, flexibility, portability and precision.

2. The application focuses on drawing geometrical figures on a virtual table. This is achieved by vocal commands. The features that we mentioned before (accessibility, flexibility, portability and precision) make the application incredibly easy to adopt and use independently of culture, geographical location and computer experience.

Figure 8​: Storyboard tile 3 Figure 9​: Storyboard tile 4

3. Right from the start, the application amazes the students with its adaptive drawing speed and its interactivity.

4. The different colors, highlighting methods, infinite ink and instant-clean whiteboard allow the students to pay more attention to the lesson and comprehend much faster.

Figure 10​: Storyboard tile 5
Figure 11​: Storyboard tile 6

5. Starting from the logo, the application uses a very friendly, intuitive and school-oriented design.

6. The user can personalize the board in order to provide certain difficulty levels or display the drawings in different contexts.

Figure 12​: Storyboard tile 7


Figure 13​: Storyboard tile 8

7. By adaptability we refer to the application's feature that allows it to be used in any stage of the geometry class. It achieves this through a built-in set of toolboxes. The toolboxes contain powerful drawing instruments.

8. From the basics of points, lines and segments …

Figure 14​: Storyboard tile 9
Figure 15​: Storyboard tile 10

9. … to the beautiful world of 2D figures … 10. … and the huge space of 3D.

Figure 16​: Storyboard tile 11


Figure 17​: Storyboard tile 12

11. The application is very handy and useful for both the professor and the students. It doesn't focus only on teaching or only on evaluating, but on both.

12. After the teaching part, the application proves itself useful for evaluating. The teacher can block certain drawing instruments and ask the student to draw a certain figure using more basic figures.

Example:
Teacher: Please draw an​ ​equilateral​ ​triangle​.
Student: Draw segment AB, length 10cm.
Draw segment BC, length 10cm, ABC angle
60​°. Draw segment CA.
Teacher: Good job, you got an A+!

Userflow
Besides the storyboard illustrated above, which presents the usage of the app, a user-flow table has been created in order to present a possible use case and the benefits the app brings.

Where is the user? | What is the user looking to do? | User action | Notes

1. Home screen | Start a new whiteboard and use default instruments. | Voice command: "Open app" (externally). | He wants to add a title, a subtitle and, automatically, a date.

2. Whiteboard screen | Select a color for drawing. | Voice command: "Color blue". | He expects the UI to notify him that blue is in use (e.g. the tip of the pencil is blue).

3. Whiteboard screen | Draw and use previously created custom instruments. | Voice command: "Draw equilateral triangle ABC side length 10cm". |

4. Whiteboard screen | Give more details to the students. | Voice command: "Highlight angle ABC". | He expects the angle to be highlighted and its measure displayed.

5. Whiteboard screen | Print a copy and distribute it to each student. | Voice command: "Print 20 copies, black and white". | He expects to be notified with the status: out of service if the printer has no more paper or ink, or done.

6. Whiteboard screen | Save the lesson and make it publicly available on the school server. | Voice commands: "Save", "Share", "Quit". | 1. The user expects the application to save his progress. 2. The sketch should be available to a certain range of students, to be downloaded and reused at home. 3. The application quits and it's time for a break.

Table 6​: Voice Geometry Painter userflow

As just mentioned, the table above presents in detail the scenario in which a new lesson (or just a simple sketch) is started.

Besides this, the application obviously needs a map of its possible screens and of the navigation via the commands that are offered. This "treasure" map is presented in detail below as Figure 18: Voice Geometry Painter map (it can also be found attached at the end of the document).

Figure 18​: Voice Geometry Painter map

Sketches & Wireframe


Considering the above, and the fact that the goal/purpose has been illustrated through the storyboard, a specific use case through the user-flow and the app map, the next natural step in designing the application and preparing it for implementation from the UX and UI perspective was to build sketches and wireframes for the screens. In the figures below the main screen, the whiteboard screen, is illustrated both as a sketch and as a wireframe. The actual implementation followed the guidelines defined above.

Sketch

Figure 19​: Whiteboard sketch

Wireframe

Figure 20​: Whiteboard wireframe

Speech interface

Commands
The commands are structured like programming-language functions in order to be very accessible to the user. If this sounds complicated, the contrary will be proved in the following lines.
All commands start with the action. The user can pick one of the following actions:
1. draw
2. highlight
3. select
4. delete

The next part of the command is what the user will actually draw. This means the user has to add the instrument and its parameters. If we take a look at the userflow of adding a custom equilateral triangle, we can see that we have to add:

Instrument name​ = “Equilateral triangle”


Parameters
Param no. 1 = Point 1
Param no. 2 = Point 2
Param no. 3 = Point 3
Param no. 4 = Side length

That is it. You can now draw figures using a simple, natural format that respects the rule below:
<Action> <Instrument name> <Parameters>
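As a sketch of this rule, a minimal parser could split a recognized phrase into its three parts. Note that parseCommand and its return shape are illustrative here, not the application's actual API:

```javascript
// Minimal sketch of the <Action> <Instrument name> <Parameters> rule.
// parseCommand and its return shape are illustrative, not the app's API.
var ACTIONS = ['draw', 'highlight', 'select', 'delete'];

function parseCommand(text) {
  var tokens = text.trim().toLowerCase().split(/\s+/);
  var action = tokens[0];
  if (ACTIONS.indexOf(action) === -1) {
    return null; // the phrase does not start with a known action
  }
  return {
    action: action,
    instrument: tokens[1],       // e.g. "square"
    parameters: tokens.slice(2)  // e.g. ["abcd", "7", "and", "4", "and", "5"]
  };
}

var cmd = parseCommand('draw square ABCD 7 and 4 and 5');
// cmd.action === 'draw'; cmd.instrument === 'square'
```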

Questions, options, criteria (QOC)


“The main constituents of QOC are Questions identifying key design issues, Options
providing possible answers to the Questions, and Criteria for assessing and comparing the
Options.”34
The analysis below was done to obtain the results promised in the citation. These results were then considered when implementing the actual application.

34
"Questions, options, and criteria - ACM Digital Library." 1 Sep. 1991, ​http://dl.acm.org/citation.cfm
?id=1456155​. Accessed 28 Jan. 2017.

1. How do I login as a professor?
a. I use my school account, username and password.
i. I look for the login dialog and submit button
ii. I can either say or type the credentials
iii. I can use the “remember me” checkbox, which offers means of
recovery and retention.
b. I use the voice recognition service
i. I pronounce the keyword
ii. The application understands it and verifies it is my voice
iii. Others may hear it.

2. How do I teach a lesson using the app?


a. I say “new lesson” command
i. I have to add a title and a subtitle
ii. I must select what toolboxes (instruments) I will use
iii. I continue by adding figures with voice commands

3. How do I draw?
a. I use voice commands
i. I have to learn the default ones by clicking them in the toolbox area.
ii. I can customize them inside the settings menu.
iii. I must remember them
b. I use the UI and the mouse or the Interactive whiteboard
i. I draw using the mouse
ii. The application will correct the wrongly drawn figures

4. Why should I use this software instead of a basic paint app?


a. I can use voice commands
i. This means I can be away from the whiteboard/device.
ii. I can use my hands to point on the table instead of sitting down and
drawing.
iii. I attract much more attention from the students

5. How do I correct a wrongly drawn figure?


a. I delete and re-draw
i. I select it (e.g. Triangle ABC) and then delete it
ii. I draw it again
iii. This is time consuming
iv. There may be situations where I might delete more than I want to edit
v. I can use this for complex drawings
b. I adjust it
i. I select (e.g. angle A)
ii. I say the adjust command: adjust new measure 90 degrees
iii. I confirm
iv. I can use this for more simple figures

6. How do I select a figure?


a. I refer to it by voice

i. I might not be able to select a very complex drawing
b. I click on it
i. This can be faster
ii. I will use this for simple figures
c. I click on all figure’s parts
i. I have to be precise when I click on all the edges.
ii. I will use this for more complex drawings

7. What do I do if I need more space for drawing?


a. I can save the current whiteboard
i. I must go through all the export steps which is time consuming
ii. I must open another whiteboard which is again time consuming
b. I can use the pagination feature
i. This can help me better organize the lesson
ii. Depending on the size of the figures I may not be able to store
everything inside a single page
c. I can vertically extend it
i. This can result in a very long document
ii. This kind of document can bore the students.

8. How do I save/export/print/share a drawing?


a. I use the voice commands
i. I say the desired command
ii. I add the wanted parameters (path for saving, persons to share with)
iii. I am prompted to confirm the command
b. I use the UI
i. I select the share icon
ii. I add the wanted parameters (path for saving, persons to share with)
iii. I am prompted to confirm the command

9. How do I evaluate a student using the app?


a. Using voice
i. I disable the advanced toolboxes
ii. I provide him access to the microphone and whiteboard
iii. I let him login
iv. I record his performance
b. Using UI
i. I disable the advanced toolboxes
ii. I provide him access to the UI and keyboard & mouse
iii. I let him login
iv. I record his performance

10. What do I do after I evaluated him?


a. I can keep his drawing
i. I can investigate later
ii. I can keep a history
b. I can note him on sight

i. I add the grade
ii. The app automatically saves the grade for the lesson and date

11. How do I block the app to only recognize my voice?


a. I go to the settings menu and select the “Listen only to …” option:
i. I say the words that the app asks me to
ii. The app calibrates and notifies me
iii. This is very useful for noise or for students that may try pranks

12. How do I add a custom command?


a. I go to the settings menu and select “Custom commands”
i. I add a new one
ii. I draw using the mouse the desired figure using the available toolboxes
iii. I set the voice command I want for it.

13. How do I represent the drawing instruments set?


a. Using a panel containing icons (and labels)
i. This is the most intuitive, as all major design software comes this way by default.
b. Using a dropdown list
i. Could be useful for a minimalist design, but it would be hard to see everything a user could draw.
c. Using a menu
i. Very hard and not intuitive

14. Where do I position the drawing instruments?


a. Top left
i. Being the most frequently used features, they will be placed on this side, as the user's eye tends to focus here.
b. Top Right
c. Up
d. Bottom

15. Where do I position the info about the geometrical figures?


a. Bottom
i. As this might contain a lot of information, the occupied size can be very large. Adding it at the bottom allows us to add a scroll bar.
b. Right
i. The eye doesn't focus too much on that zone, so the user doesn't see everything he can draw.

16. How do I represent the miscellaneous controls?


a. Big Icons
i. Intuitive for all devices
b. Menu
i. It takes longer for the user to reach the desired action

17. Where do I position the miscellaneous controls?

a. A top bar above the whiteboard
i. Always visible. This gives the user control and makes him feel free to use them at any moment.
b. Right
i. The location is not very suggestive for the human eye and some actions
might not be noticed by the user.

Architecture
The app has been designed in a modular and extensible way. Its major components are:
1. Speech logic (speech recognition module and its framework Webtoolkit)
2. Command parser (command manager module and the commands themselves)
3. Drawing context (canvas module and its framework JSXGraph)

Speech logic

Figure 21​: Speech logic module35

The speech logic module has a key role: it has to recognize voice and provide meaningful text to the Command parser module. It achieves the raw speech-to-text conversion using WebkitSpeechRecognition36 .
The raw (interim) results are then processed with a state machine in order to get the best match among the known available commands.
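A minimal sketch of this wiring is shown below. webkitSpeechRecognition is the browser's Web Speech API entry point; extractTranscripts is an illustrative helper, not the application's actual code.

```javascript
// Sketch: wiring webkitSpeechRecognition (Web Speech API) so that every
// interim hypothesis reaches the command-matching logic.
// extractTranscripts is an illustrative helper, not the app's actual code.
function extractTranscripts(event) {
  var candidates = [];
  for (var i = event.resultIndex; i < event.results.length; i++) {
    for (var j = 0; j < event.results[i].length; j++) {
      candidates.push(event.results[i][j].transcript);
    }
  }
  return candidates;
}

// Browser-only part: webkitSpeechRecognition does not exist outside the browser.
if (typeof webkitSpeechRecognition !== 'undefined') {
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;      // keep listening across commands
  recognition.interimResults = true;  // deliver partial hypotheses too
  recognition.onresult = function (event) {
    console.log(extractTranscripts(event));
  };
  recognition.start();
}
```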

35
"Voice Clipart | Clipart Panda - Free Clipart Images." ​http://www.clipartpanda.com/categories/voice-clipart​.
Accessed 2 Feb. 2017.
36
"Web Speech API - Web APIs | MDN - Mozilla Developer Network." 30 Jun. 2016, ​https://developer.mozilla.
org/en-US/docs/Web/API/Web_Speech_API​. Accessed 2 Feb. 2017.

Command parser

Figure 22​: Command parser37

The command parser module knows about all the commands that the application can accept and run. It is responsible for knowing all of them, for delegating the work to the proper one, and for making sure things are executed correctly within the right context.
This module is also used by the other modules in order to retrieve information about specific commands. Everything goes through a chain of command.
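The chain-of-command dispatch can be sketched as follows (CommandManager is an illustrative name; the real module may differ). Each command constructor is tried in turn, and the first one whose REGEXP matches the recognized text is executed:

```javascript
// Sketch of the chain-of-command dispatch (CommandManager is illustrative).
function CommandManager(commandConstructors) {
  this.commandConstructors = commandConstructors;
}

CommandManager.prototype.dispatch = function (text, context) {
  for (var i = 0; i < this.commandConstructors.length; i++) {
    var Constructor = this.commandConstructors[i];
    if (Constructor.prototype.REGEXP.test(text)) {
      new Constructor(text).execute(context);
      return Constructor.prototype.NAME; // the command that handled the text
    }
  }
  return null; // nobody in the chain recognized the text
};
```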

Drawing context

Figure 23​: Whiteboard (drawing context)38

37
"Toy soldier clipart - ClipartFest." h​ ttps://clipartfest.com/categories/view/
2ee495a0c05ebf2b333ca93adaf1655e378375a2/toy-soldier-clipart.html​. Accessed 2 Feb. 2017.
38
"What is on your Whiteboard? | Information Street - Infusionsoft UK ...." 17 Jun. 2014,
http://www.informationstreet.com/marketing-tips/what-is-on-your-whiteboard/​. Accessed 2 Feb. 2017.

The drawing context module is responsible for displaying the information on the screen. The goal when designing it was to keep a plug-and-play architecture, in such a way that any library or framework could be used instead of the JSXGraph framework. What does this mean?
The particular logic is contained in the drawing context (e.g. the JSXGraph module), while the commands (which do not contain any drawing logic) know how to call the right operation in the context implementation.
What are the benefits the application gets from this? For example, multiple drawing contexts can be used. The user sees only the JavaScript JSXGraph rendering in the web page, but behind the scenes the same commands act on backend objects.

Architecture UML

Figure 24​: Architecture UML

Implementation

Speech logic & Command parser


The speech logic module has a critical role: it translates speech into text. Even though it sounds easy, using the service at its maximum power proved to be quite a challenge.
First of all, the connection to the Speech API and the Google service was implemented in speech.js. This file contains all the configuration needed to get the service running in a simple, unrefined way.
The Speech API always provided good results, but there was a need for a mechanism that would control the start and the end of a command, as well as adapt the recognized text to the general command structure.
This mechanism was implemented using a state machine. To give a better understanding of how the speech recognition works beyond a standard connection via the API, the command mechanism is detailed below.

Commands
The commands are represented as JavaScript objects which contain the following:
1. Name
2. Candidate input string
3. Regexp
4. Help

The most important of the above attributes is the Regexp. A command will be executed if the candidate input string matches the regexp. This way the regexp can extract the action (predicate), the name of the shape and its coordinates.

Using this approach, here is an example of how a square drawing command looks:

var SquareCommand = function(commandString) {
    this.commandString = commandString;
};

SquareCommand.prototype.execute = function(context) {
    var reResults = this.commandString.match(this.REGEXP);

    var pointA = reResults[1].toUpperCase();
    var pointB = reResults[2].toUpperCase();
    var pointC = reResults[3].toUpperCase();
    var pointD = reResults[4].toUpperCase();
    var Ax = parseFloat(reResults[5]);
    var Ay = parseFloat(reResults[6]);

    var size = parseFloat(reResults[7]);

    context.drawSquare(pointA, Ax, Ay, pointB, pointC, pointD, size);
};

SquareCommand.prototype.NAME = "Square";

// COORDINATE_DELIMITER and COORDINATE_DELIMITER_HELP are defined elsewhere
// in the application.
SquareCommand.prototype.REGEXP = new RegExp('PREDICATE\\ssquare\\s([a-zA-Z])([a-zA-Z])([a-zA-Z])([a-zA-Z])\\s(\\d+)\\s' +
    COORDINATE_DELIMITER + '\\s(\\d+)\\s' + COORDINATE_DELIMITER + '\\s(\\d+)', 'i');

// Command's help
SquareCommand.prototype.HELP = "square ABCD 10 [Ax] " + COORDINATE_DELIMITER_HELP + " 20 [Ay] " + COORDINATE_DELIMITER_HELP + " 10 [side]";

Directly applying regular expressions to freshly translated speech is a very long shot. So, during development, we considered how we could guide the Google Speech API (create a context for it) in order to get better results.
The answer consists of two parts:
1. Google interim results;
2. Deciding which is the best interim result and then continuing to query Google on that path.

Google interim results

This part was very easy to do thanks to the in-depth analysis performed when choosing the speech API. When translating speech to text, Google offers multiple results. The response contains the best-scoring candidate plus other translations which might prove to be the correct ones. While the user has not accepted a translation as a final result within a "timeout", the Speech API continues to provide interim results.
This enlarges the response tree, offering a broader path for building the command text.

Best matching interim result
This is where the state machine mechanism is used. When speech processing is started, all registered commands are retrieved and a state machine is built from them. This consists of two phases: building the tokens and building the transitions.
The tokens are kept in an array, while the transitions are kept as a map where the key is the source token and the value is a list of the target tokens.
Given this structure, an efficient way of combining the state machine with the Speech API interim results emerged. Every time a new interim result is provided, its last word/token is checked against the state machine to decide whether or not to continue on that branch of the results tree. This drastically improves the accuracy and the recognition time of the Speech API.
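The token/transition pruning described above can be sketched as follows. buildStateMachine and acceptsToken are illustrative names; the real implementation also extracts tokens from the command regexps:

```javascript
// Sketch of the state machine used to prune interim results.
// buildStateMachine / acceptsToken are illustrative names.
function buildStateMachine(commandPhrases) {
  var transitions = {}; // source token -> tokens that may follow it
  commandPhrases.forEach(function (phrase) {
    var tokens = phrase.split(/\s+/);
    for (var i = 0; i < tokens.length - 1; i++) {
      if (!transitions[tokens[i]]) {
        transitions[tokens[i]] = [];
      }
      if (transitions[tokens[i]].indexOf(tokens[i + 1]) === -1) {
        transitions[tokens[i]].push(tokens[i + 1]);
      }
    }
  });
  return transitions;
}

// Keep exploring an interim result only if its last transition is legal.
function acceptsToken(transitions, previousToken, lastToken) {
  var targets = transitions[previousToken];
  return !!targets && targets.indexOf(lastToken) !== -1;
}

var machine = buildStateMachine(['draw square', 'draw circle', 'clean board']);
// acceptsToken(machine, 'draw', 'circle')  -> true  (valid branch, keep it)
// acceptsToken(machine, 'clean', 'circle') -> false (prune this branch)
```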

Drawing context
As described in the architecture section, the drawing contexts have been designed in such a way that one or multiple contexts can be used. From the architecture perspective, the drawing contexts are the places where commands have their effects.
Every command has its own implementation for each drawing context. This was done in order to be able to record the actions and to build data from them in multiple ways, e.g.:
● A visual drawing context can be used to display the board on the screen
● A backend context can be used in order to build a logical model of the
drawings
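The multi-context idea above can be sketched as follows: the same command call is broadcast to every registered context, so a visual context and a backend (model-building) context stay in sync. LoggingContext and broadcast are illustrative names:

```javascript
// Sketch of running one command against several drawing contexts.
// LoggingContext and broadcast are illustrative names.
function LoggingContext() { this.log = []; }
LoggingContext.prototype.drawSquare = function (A, Ax, Ay, B, C, D, size) {
  this.log.push('square ' + A + B + C + D + ' side ' + size);
};

function broadcast(contexts, method, args) {
  contexts.forEach(function (context) {
    context[method].apply(context, args); // same call, every context
  });
}

var backend = new LoggingContext();
broadcast([backend], 'drawSquare', ['A', 0, 0, 'B', 'C', 'D', 10]);
// backend.log[0] === 'square ABCD side 10'
```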
For this prototype a visual drawing context was chosen and implemented using
JSXGraph.
“JSXGraph is a cross-browser JavaScript library for interactive geometry, function
plotting, charting, and data visualization in the web browser.
Features:
● Euclidean and projective Geometry
● Curve plotting
● Open source
● High-performance
● Small footprint
● No dependencies
● Multi-touch support
● Backward compatible down to IE 6
JSXGraph is implemented in pure JavaScript, does not rely on any other library, and
uses SVG, VML, or canvas. Special care has been taken to optimize the performance.”39
The drawing context implemented offers the following features:
1. board management (init, reset, repaint)
2. undo, redo
3. zoomIn, zoomOut

39
"JSXGraph - JSXGraph." ​http://jsxgraph.uni-bayreuth.de/​. Accessed 6 Feb. 2017.

4. select, deselect shape
5. color and width management
6. point, line, triangle, quadrilateral and circle drawing
7. shape deletion
8. shape properties detailing as popup
9. .png and Dropbox export
All the above actions can be used to obtain results like the following:

Figure 25: ​Triangle examples

Figure 26: ​Quadrilaterals example

Below a code example will be presented which demonstrates how a triangle is built:

/**
 * Draws a triangle
 *
 * @param A
 * @param Ax
 * @param Ay
 * @param B
 * @param Bx
 * @param By
 * @param C
 * @param Cx
 * @param Cy
 */
DrawingContext.prototype.drawTriangle = function (A, Ax, Ay, B, Bx, By, C, Cx, Cy) {
    var lineColor = this.getLineColor();
    var fillColor = this.getFillColor();
    var shapeFillOpacity = this.getFillOpacity();

    var lineWidth = this.getLineDrawingWidth();
    var pointWidth = this.getPointDrawingWidth();

    var pointA = this.board.create('point', [Ax, Ay], {
        name: A, size: pointWidth, fillColor: lineColor,
        strokeColor: lineColor
    });
    var pointB = this.board.create('point', [Bx, By], {
        name: B, size: pointWidth, fillColor: lineColor,
        strokeColor: lineColor
    });
    var pointC = this.board.create('point', [Cx, Cy], {
        name: C, size: pointWidth, fillColor: lineColor,
        strokeColor: lineColor
    });

    var triangle = this.board.createElement('polygon', [pointA, pointB, pointC], {
        borders: {
            strokeColor: lineColor,
            strokeWidth: lineWidth
        },
        hasInnerPoints: true,
        fillColor: fillColor,
        fillOpacity: shapeFillOpacity,
        name: A + B + C
    });

    pointA = new Point(A, Ax, Ay);
    pointB = new Point(B, Bx, By);
    pointC = new Point(C, Cx, Cy);

    var triangleShape = new Triangle(pointA, pointB, pointC);

    this.shapes[triangle.id] = triangle;
    this.shapeProperties[triangle.id] = triangleShape.getProperties();

    JXG.addEvent(triangle.rendNode, 'mousedown', this.selection(), triangle);
};

42
Conclusions
This chapter can end with the conclusion that the application contains a Command Parser which knows about all the registered commands and is able to execute any of them when one is recognized.
The recognition part is aided by a state machine which is also built based on the known commands. The tokens within the commands (predicate, name, attributes) are extracted and used as states, while the way they succeed one another builds the transitions.
Using the Google Speech API interim results (which broaden the results tree), the last token of the recognized phrase is filtered through the state machine in order to pick the better branch. This way, the broad results tree offered is shrunk within a quality context in order to obtain a better result.
All in all, the design, architecture and implementation chosen for this project proved to have great extensibility, and by using JSXGraph visually rich drawings could be created that add value when watching the board.

App evaluation
The built prototype proved to be a success and a demo can be found online, on YouTube: ​https://youtu.be/AiYyaDscnLI​.
Besides the above, a Usability Test has been conducted in order to gain feedback about the prototype. The tests were conducted and the results logged using the available template40 proposed by Stefan Negru.

Figure 27​: Usability testing 1

The objectives of this test were chosen in such a manner that all the areas were covered, starting from the speech recognition quality and continuing with simplicity/efficiency and display accuracy.
As the Tasks topic describes, the evaluation document proposed to the user the same commands that were used in the demo, so that a comparison could be made.

40
"Stefan Negru - Info." ​http://profs.info.uaic.ro/~stefan.negru/​. Accessed 7 Feb. 2017.

Figure 28: ​Usability testing 2

Annexes 3 and 4 present the personas that used the app in order to evaluate it and provide feedback.
Considering the investment in future education that this paper wants to bring, and summing up the usability results together with the analysis, we can conclude the following:

Pupils like the application and the way it works.

The eye-candy features that let them change colours and widths, clear the board and export then immediately print on paper proved to be a great choice. The export-and-print feature is a great bridge between the paper-based and e-learning processes. By implementing and using it we managed to convince them that this application and process should not be perceived as something forced, but rather as a great alternative.
The subjects also confirmed that picking the Google Speech API was a great choice. The various, imperfect accents, combined with the not-yet-complete knowledge of geometry (which causes mispronunciations), justified the higher price of this API.
Another confirmation point was reached when the children, driven by their ludic, playful and curious behaviour, tried to use non-academic words in order to push the application's limits. This was confirmed and appreciated by the teachers as well.
“Research shows that students learn by being actively engaged in relevant and
authentic activities—and technology makes this increasingly possible. Learners are also
becoming more adept at using social networks such as YouTube and Facebook to text
message; post videos, blogs, and images; and collaborate and socialize regardless of time or
place.
Furthermore, students are using software applications to either create or interact with
content—even content that previously was only broadcast. More and more, classrooms are
becoming “open” through voice, video, and text-based collaboration, and teachers now have
a wide range of multimodal resources at their disposal to enhance teaching.
Alongside a growing understanding of how the brain works and how learning takes
place, integrated technology solutions such as multimedia, games, and animation have played
a significant role in improving time to mastery and understanding.
As more people adopt new technologies for learning, they will thrive in the emerging
world of the Internet of Everything (IoE)—the networked connection of people, process,
data, and things—which is becoming the basis for the Internet of Learning Things.”41
“Higher education programs must ensure that the next generation of engineers
understands how to design and build technological systems that reflect our altered
expectations of openness and participation. In the area of computer science, the challenge is
in developing new forms of scalable education that accommodate large numbers of students
around the world, attract potential students with various interests, and deliver an innovative
curriculum that reflects the radical changes in computing technology.
In response, the Open University in the United Kingdom revamped its undergraduate
computer science curriculum and now offers an introductory course, My Digital Life,
designed around IoT concepts. My Digital Life places IoT at the core of the first-year
computing curriculum and primes students from the beginning for the coming changes in
society and technology. Rather than narrowly defining IoT as a technical subject, the course
helps students view IoT as a tool for understanding and interrogating their own world, and
recognizing their role in realizing IoT’s.42“
“The “Internet of School Things”14 is one of the first projects to explore this
approach. Announced August 2013, the project—which includes eight U.K. secondary
schools, grades 11 through 18—is designed to teach learners about the potential of connected
everyday devices, using them to bring other subjects to life by collecting data in the areas of
transportation, energy, weather, and health. The project is funded by DISTANCE, a
consortium of IT companies and universities. Learners are also taught how to build their own
products and sensors, easily bring them online, and monitor variables of their choosing.”43

41 "Education and the Internet of Everything - Cisco." http://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/education_internet.pdf. Accessed 7 Feb. 2017.
42 "Education and the Internet of Everything - Cisco." http://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/education_internet.pdf. Accessed 7 Feb. 2017.
From a psychological point of view, the best way to familiarize subjects with a
technology is to use children's natural learning strategies. The most important and most
rewarding of these is learning by doing.
Considering the above, integrating IoT-specific techniques into daily school activity
proves to be the simplest yet most effective method. This method should be presented as
a fun learning alternative.
By giving children, who have a non-linear and creative way of thinking, access to a
very powerful instrument (voice recognition within IoT), they can discover new use
cases and push the limits of how IoT is used.
One practical example, discovered during testing, was when children proposed using the
application in geography classes. They wanted maps instead of a blank background,
in order to calculate distances and highlight Italy's boot shape.
By using this equal-to-all visual application, several barriers are removed, allowing
children to express their creativity in a more direct way.
A good analogy is the comparison of Voice Geometry Painter with the Manual
labour class that the 1990s generation had. A math class held using Voice Geometry
Painter becomes a skill-building course that prepares children for today's society. IoT,
which older people consider the latest trend, can become for children what a pen is for an
adult, providing them a very solid base for future development.
The teacher-pupil interaction changes. A boring math class can become a
spectacular show, capable of capturing and maintaining children's attention. The proposed
method generates another type of interaction in the classroom. When the child is creating and
presenting his point of view, he manages to reach all the learning levels presented in the
learning pyramid:
- Lecture, because he is talking;
- Reading, because the command is translated and displayed as text;
- Audio-visual, because of how the application works;
- Demonstration, because he shows the others what he just built;
- Discussion group, because others can add to his command;
- Practice by doing, because of how the application works;
- Teach others, because he is using a tool while others are watching.
43 "Education and the Internet of Everything - Cisco." http://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/education_internet.pdf. Accessed 7 Feb. 2017.
Figure 30: Learning pyramid

Holding a class this way leads to faster and better information retention,
due to its reach across the learning levels.
The method also maintains children's creativity. The traditional learning
system forces the pupil to accumulate a set of facts, which sacrifices creativity. With
this approach, a balance between creativity and information is kept, preserving a high
level of creativity into adulthood, where it tends to be a goldmine.
By using Voice Geometry Painter, which has speech recognition as its main
module, itself a pillar of IoT, the pupil will tend to express himself in the way society
will work in the future. His ability to symbolize is developed, because Voice Geometry
Painter is capable of delivering a large amount of information through easy-to-remember
drawings. In other words, the child's capacity for working with abstract concepts is
drastically amplified.
Still regarding the future, Voice Geometry Painter adds a more realistic,
kinesthetic, and personal dimension to impersonal objects. This is a great plus
considering all the automation planned within IoT.
Another reason this method works is that it gently tricks the child's mind.
He is not told how something works and then asked to reproduce it; he is faced with
a great tool that tickles his curiosity.
He is not taught how a car is made, but is given one to drive in order to win
his excitement. Once his excitement is won, learning flows far better.
It can be concluded that using Voice Geometry Painter develops an
unconscious competence regarding speech recognition, the Internet of
Things, and how society will work in the future.
Conclusions and further work
As this paper draws to a close, it can be concluded that a problem which can impact our future
has been identified:
the lack of knowledge about the Internet of Things and the reluctance to use its specific
techniques in daily life.

To face this problem, this paper proposes a solution that aims to educate pupils,
primarily, about the Internet of Things. This is achieved by:
1. Building an application that is used in the education process;
2. Adding IoT-specific techniques and flavours to it.
The first step introduces the solution into the education environment, a place from which a
great impact on children's perspective can be made.
Given the premise above, it only remains to "inject" quality data and
perceptions about IoT by using a highly interactive technique: speech recognition.

The implemented solution, named Voice Geometry Painter, is a teaching aid, an
alternative to the white/black board. It allows drawing different geometric shapes using voice
commands.
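To illustrate the idea, the mapping from a recognized utterance to a drawing action can be sketched as a small command parser. The shape and colour vocabulary below is an illustrative assumption, not the application's actual grammar:

```python
import re
from typing import Optional

# Hypothetical vocabulary: the real Voice Geometry Painter command set may differ.
SHAPES = {"circle", "square", "triangle", "rectangle"}
COLOURS = {"red", "green", "blue", "black"}

def parse_command(transcript: str) -> Optional[dict]:
    """Map a recognized utterance such as 'draw a red circle' to a
    structured drawing command, or return None when no shape is found."""
    words = re.findall(r"[a-z]+", transcript.lower())
    shape = next((w for w in words if w in SHAPES), None)
    if shape is None:
        return None
    colour = next((w for w in words if w in COLOURS), "black")
    return {"action": "draw", "shape": shape, "colour": colour}

print(parse_command("Draw a red circle"))
# → {'action': 'draw', 'shape': 'circle', 'colour': 'red'}
```

Keeping the parser tolerant of filler words ("a", "please") matters here, since children phrase commands freely.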

After its implementation, the usability tests conducted and the psychological analysis
performed proved this to be a great initiative with a lot of potential.
There are multiple directions available for future work. The most important one
recognized after concluding this paper is represented by partnerships with schools.
These will grow the application's popularity together with the amount of feedback data.
Based on the feedback data, multiple improvements can be made, ranging
from better speech recognition to simpler commands.
Another critical mechanism discovered to be missing during testing was support for
custom commands. This feedback was expected, and the feature was in fact already
planned. The need that surfaced only confirms the users' appreciation of the app
and their will to discover more.
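One way the planned custom-command support could be structured is a registry that lets users bind their own spoken phrases to actions. The class and method names below are illustrative assumptions, not part of the current implementation:

```python
from typing import Callable, Dict

# Sketch of a user-extensible command registry (hypothetical design).
class CommandRegistry:
    def __init__(self) -> None:
        self._commands: Dict[str, Callable[[], str]] = {}

    def register(self, phrase: str, handler: Callable[[], str]) -> None:
        """Let a user bind a custom spoken phrase to an action."""
        self._commands[phrase.lower().strip()] = handler

    def dispatch(self, transcript: str) -> str:
        """Run the handler for a recognized phrase, if one was registered."""
        handler = self._commands.get(transcript.lower().strip())
        return handler() if handler else "unknown command"

registry = CommandRegistry()
registry.register("clear the board", lambda: "board cleared")
print(registry.dispatch("Clear the board"))  # prints: board cleared
```

A design like this keeps recognition and actions decoupled, so new commands can be added without touching the speech layer.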

Concluding this paper, it can be said that an important problem has been identified, a
valid solution has been found and developed, and its impact can already be seen.
Bibliography and References

Derwing, T., Munro, M. & Carbonaro, M. (2000). Does popular speech recognition software
work with ESL speech? TESOL Quarterly, 34, 592-603.
Higgins, E. & Raskind, M. (2000). Speaking to read: the effects of continuous vs. discrete
speech recognition systems on the reading and spelling of children with learning
disabilities. Journal of Special Education Technology, 15, 19-30.
Higgins, E. & Raskind, M. (1995). Compensatory effectiveness of speech recognition on the
written composition performance of postsecondary students with learning disabilities.
Learning Disabilities Quarterly, 18, 159-174.
MacArthur, C. & Cavalier, A. (2001). Dictation and speech recognition technology as
accommodations in large-scale assessments for students with learning disabilities.
Unpublished study data.
Temple, Cheryl L., George Mason University. The Uses of Speech Recognition Technology in
Education.44
Education and the Internet of Everything: How Ubiquitous Connectedness Can Help
Transform Pedagogy. Cisco.45
How the Internet of Things Is Transforming Education. Zebra Technologies.46
Credits to Marius Baciu and Marian Pînzariu, who helped me develop the Voice Geometry
Painter prototype application within the Human Computer Interaction course held by
Associate Professor, PhD Sabin-Corneliu Buraga, and who gave their consent for modifying,
extending, and using the codebase for this paper.
Credits to PhD candidate Tudor Ceobota for assisting me in developing a psychological
analysis of how the application will impact children, their future, and ours.
44 "The Uses of Speech Recognition Technology in Education - Mason Gmu." http://mason.gmu.edu/~ctemple/Portfolio/pages/themes/at/speech.pdf. Accessed 6 Feb. 2017.
45 "Education and the Internet of Everything - Cisco." http://www.cisco.com/c/dam/en_us/solutions/industries/docs/education/education_internet.pdf. Accessed 7 Feb. 2017.
46 "How the Internet of Things Is Transforming Education - Zatar." http://www.zatar.com/sites/default/files/content/resources/Zebra_Education-Profile.pdf. Accessed 7 Feb. 2017.
Annex 1 - Figure 18: Voice Geometry Painter map
Annex 2 - Figure 24: Architecture UML
Annex 3: Teacher persona
Annex 4: Pupil persona