
The Simon Handbook

Peter H. Grasch


Contents

1 Introduction
2 Overview
  2.1 Architecture
  2.2 Required Resources for a Working Simon Setup
    2.2.1 Scenarios
    2.2.2 Acoustic model
      2.2.2.1 Backends
      2.2.2.2 Types of base models
        2.2.2.2.1 Static base model
        2.2.2.2.2 Adapted base model
        2.2.2.2.3 User-generated model
      2.2.2.3 Requirements
      2.2.2.4 Where to get base models
      2.2.2.5 Phoneme set issues
3 Using Simon: Typical user
  3.1 First run wizard
    3.1.1 Scenarios
    3.1.2 Base models
    3.1.3 Server
    3.1.4 Sound configuration
    3.1.5 Volume calibration
  3.2 The Simon Main Window
    3.2.1 Main window: Scenarios
    3.2.2 Main window: Training
    3.2.3 Main window: Acoustic model
    3.2.4 Main window: Recognition
  3.3 Scenarios
    3.3.1 Import Scenario
    3.3.2 Delete Scenario
  3.4 Recordings
    3.4.1 Volume
      3.4.1.1 Simon Calibration
      3.4.1.2 Audacity Calibration
    3.4.2 Silence
    3.4.3 Content
    3.4.4 Microphone
    3.4.5 Sample Quality Assurance
  3.5 Contribute Samples
  3.6 Manage training data
    3.6.1 Modifying samples
    3.6.2 Clear training data
    3.6.3 Importing Training Samples
  3.7 Configuration
    3.7.1 General Configuration
    3.7.2 Recordings
      3.7.2.1 Device Configuration
      3.7.2.2 Voice Activity Detection
      3.7.2.3 Training settings
      3.7.2.4 Postprocessing
      3.7.2.5 Context
    3.7.3 Speech Model
      3.7.3.1 Base model
      3.7.3.2 Training data
      3.7.3.3 Language Profile
    3.7.4 Model Extensions
    3.7.5 Recognition
      3.7.5.1 Server
        3.7.5.1.1 General
        3.7.5.1.2 Network
      3.7.5.2 Synchronization and Model Backup
    3.7.6 Actions
      3.7.6.1 Recognition
      3.7.6.2 Dialog font
      3.7.6.3 Lists
    3.7.7 Text-to-speech
      3.7.7.1 Backends
      3.7.7.2 Recordings
      3.7.7.3 Webservice
    3.7.8 Social desktop
    3.7.9 Webcam configuration
    3.7.10 Advanced: Adjusting the recognition parameters manually
      3.7.10.1 Julius
4 Advanced: Creating new scenarios with Simon
  4.1 Introduction
  4.2 Speech recognition: background
    4.2.1 Language Model
      4.2.1.1 Vocabulary
        4.2.1.1.1 Active Dictionary
        4.2.1.1.2 Shadow Dictionary
        4.2.1.1.3 Language profile
      4.2.1.2 Grammar
    4.2.2 Acoustic Model
  4.3 Scenarios
    4.3.1 Scenario hierarchies
    4.3.2 Adding a new Scenario
    4.3.3 Edit Scenario
    4.3.4 Export Scenario
  4.4 Vocabulary
    4.4.1 Adding Words
      4.4.1.1 Defining the Word
        4.4.1.1.1 Manually Selecting a Category
        4.4.1.1.2 Manually Providing the Phonetic Transcription
      4.4.1.2 Training the Word
    4.4.2 Editing a word
    4.4.3 Removing a word
    4.4.4 Special Training
    4.4.5 Importing a Dictionary
      4.4.5.1 HADIFIX Dictionary
      4.4.5.2 HTK Dictionary
      4.4.5.3 PLS Dictionary
      4.4.5.4 SPHINX Dictionary
      4.4.5.5 Julius Dictionary
    4.4.6 Create language profile
  4.5 Grammar
    4.5.1 Import a Grammar
    4.5.2 Renaming Categories
    4.5.3 Merging Categories
  4.6 Training
    4.6.1 Storage Directories
    4.6.2 Adding Texts
      4.6.2.1 Add training texts
      4.6.2.2 Local text files
    4.6.3 On-The-Fly Training
  4.7 Context
    4.7.1 Scenario selection
    4.7.2 Sample groups
    4.7.3 Context conditions
      4.7.3.1 Active window
      4.7.3.2 D-Bus
      4.7.3.3 Face detection
      4.7.3.4 File content
      4.7.3.5 Lip detection
      4.7.3.6 Or condition association
      4.7.3.7 Process opened
  4.8 Commands
    4.8.1 Executable Commands
      4.8.1.1 Importing Programs
    4.8.2 Place Commands
      4.8.2.1 Importing Places
    4.8.3 Shortcut Commands
    4.8.4 Text-Macro Commands
    4.8.5 List Commands
      4.8.5.1 List Command Display
      4.8.5.2 Configuring list elements
    4.8.6 Composite Commands
    4.8.7 Desktop grid
    4.8.8 Input Number
    4.8.9 Dictation
    4.8.10 Artificial Intelligence
    4.8.11 Calculator
    4.8.12 Filter
    4.8.13 Pronunciation Training
    4.8.14 Keyboard
    4.8.15 Dialog
      4.8.15.1 Dialog design
      4.8.15.2 Dialog: Bound values
      4.8.15.3 Template options
      4.8.15.4 Avatars
      4.8.15.5 Output
    4.8.16 Akonadi
    4.8.17 D-Bus
    4.8.18 JSON
5 Questions and Answers
6 Credits and License
A Installation

List of Tables
2.1 Ways to an acoustic model
2.2 Base model requirements
3.1 Julius Configuration Files
4.1 Sample Vocabulary
4.2 Sample Vocabulary
4.3 Improved Sample Vocabulary
4.4 Sample Vocabulary
4.5 Improved Sample Vocabulary

Abstract

Simon is an open source speech recognition solution.


Chapter 1

Introduction
Simon is the main front end for the Simon open source speech recognition solution. It is a Simond client and provides a graphical user interface for managing the speech model and the commands. Moreover, Simon can execute all sorts of commands based on the input it receives from the server, Simond.

In contrast to existing commercial offerings, Simon provides a unique do-it-yourself approach to speech recognition. Rather than relying on predefined, pre-trained speech models, Simon does not ship with any model whatsoever. Instead, it provides an easy-to-use end-user interface to create language and acoustic models from scratch. Additionally, end users can easily download use cases created by other users and share their own.

The current release can be used to set up command-and-control solutions that are especially suitable for disabled people. However, because of the amount of training necessary, continuous, free dictation is neither supported nor reasonable with current versions of Simon.

Because of its architecture, the same version of Simon can be used with all languages and dialects. One can even mix languages within one model if necessary.


Chapter 2

Overview
2.1 Architecture

The main recognition architecture of Simon consists of three applications.

Simon: This is the main graphical interface. It acts as a client to the Simond server.
Simond: The recognition server.
KSimond: A graphical front-end for Simond.

These three components form a real client / server solution for the recognition. That means that there is one server (Simond) for one or more clients (Simon; this application). KSimond is just a front-end for Simond, which means it adds no functionality to the system but rather provides a way to interact with Simond graphically.

In addition to Simon, Simond and KSimond, other, more specialized applications are also part of this integrated Simon distribution.

Sam: Provides more in-depth control over your speech model and allows you to test the acoustic model.
SSC / SSCd: These two applications can be used to collect large amounts of speech samples from different persons more easily.
Afaras: This simple utility allows users to quickly check large corpora of speech data for erroneous samples.

Please refer to the individual handbooks of those applications for more details.


Simon is used to create and maintain a representation of your pronunciation and language. This representation is then sent to the server Simond, which compiles it into a usable speech model. Simon then records sound from the microphone and transmits it to the server, which runs the recognition on the received input stream. Simond sends the recognition result back to the client (Simon). Simon then uses this recognition result to execute commands like opening programs, following links, etc.

Simond identifies its connections with a user / password combination which is completely independent from the underlying operating system and its users. By default, a standard user is set up in both Simon and Simond so the typical use case of one Simond server per Simon client will work out of the box.

Every Simon client logs onto the server with a user / password combination which identifies a unique user and thus a unique speech model. Every user maintains their own speech model but may use it from different computers (different, physical Simon instances) simply by accessing the same Simond server. One Simond instance can of course also serve multiple users.

If you want to open up the server to the Internet or use multiple users on one server, you will have to configure Simond. Please see the Simond manual for details.
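The relationship between clients, users and speech models described above can be pictured with a small toy model. The following Python sketch is purely illustrative: the class and method names (ToySimond, ToySimon, synchronize, recognize) are hypothetical and do not correspond to the real Simon or Simond protocol, and "recognition" is reduced to looking up a phrase in the set of phrases a user has sent to the server.

```python
class ToySimond:
    """Hypothetical stand-in for the server: one speech model per user."""

    def __init__(self):
        self._passwords = {}  # user name -> password
        self._models = {}     # user name -> phrases known to that user's model

    def add_user(self, user, password):
        self._passwords[user] = password

    def _authenticate(self, user, password):
        # Accounts are independent of operating system users.
        if user not in self._passwords or self._passwords[user] != password:
            raise PermissionError("unknown user or wrong password")

    def synchronize(self, user, password, phrases):
        # The client sends its representation; "compiling" is just storing it.
        self._authenticate(user, password)
        self._models[user] = frozenset(phrases)

    def recognize(self, user, password, utterance):
        # Recognition always runs against the model of the logged-in user.
        self._authenticate(user, password)
        model = self._models.get(user, frozenset())
        return utterance if utterance in model else None


class ToySimon:
    """Hypothetical stand-in for one client instance."""

    def __init__(self, server, user, password):
        self.server = server
        self.user = user
        self.password = password

    def send_model(self, phrases):
        self.server.synchronize(self.user, self.password, phrases)

    def hear(self, utterance):
        return self.server.recognize(self.user, self.password, utterance)


if __name__ == "__main__":
    server = ToySimond()
    server.add_user("alice", "secret-a")
    server.add_user("bob", "secret-b")

    desktop = ToySimon(server, "alice", "secret-a")
    laptop = ToySimon(server, "alice", "secret-a")
    other = ToySimon(server, "bob", "secret-b")

    desktop.send_model(["open browser"])
    print(laptop.hear("open browser"))  # same user, different machine
    print(other.hear("open browser"))   # different user, different model
```

In this sketch, two client instances that log in as the same user share one model, while a second user on the same server gets a separate one, mirroring the behaviour described in the text.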

2.2 Required Resources for a Working Simon Setup

NOTE
For background information about speech models, please refer to the Speech Recognition: Background section.

To get Simon to recognize speech and react to it, you need to set up a speech model. Speech models describe how your voice sounds, which words exist, how they sound and which word combinations (sentences or structures) exist. A speech model basically consists of two parts:

Language model: Describes all existing words and which sentences are grammatically correct.
Acoustic model: Describes how words sound.

You need both of these components to get Simon to recognize your voice. In Simon, the language model will be created from your active scenarios and the acoustic model will be built either solely from your voice recordings (training) or with the help of a base model.

2.2.1 Scenarios

One scenario makes up one complete use case of Simon. To control Firefox, for example, the user just installs the Firefox scenario. In other words, scenarios tell Simon what words and phrases to listen for and what to do when they are recognized.

Because scenarios do not contain information about how these words and phrases actually sound, they can be shared and exchanged between different Simon users without problems. To accommodate this community-based repository pool, a category for Simon scenarios has been created on kde-files.org where the scenarios, which are just simple text files (XML format), can be exchanged easily.

In most cases, scenarios are tailored to work best with a specific base model to avoid issues with the phoneme set.

For information on how to use scenarios in Simon, please refer to the Scenario section in the Use Simon chapter.

2.2.2 Acoustic model

As mentioned above, you need an acoustic model to activate Simon. You can either create your own or use and even adapt a base model. Base models are already generated, most often speaker-independent, acoustic models that can be used with Simon.

The following table shows what is required, depending on your Simon configuration:

                        Base model    Model creation      Training
                        required      backend required    required
Static base model       Yes           No                  No
Adapted base model      Yes           Yes                 Yes
User-generated model    No            Yes                 Yes

Table 2.1: Ways to an acoustic model

2.2.2.1 Backends

Simon uses external software to build acoustic models and to recognize speech. Usually, these backends can be split into two distinct components: The model compiler or model generation backend used to create or adapt acoustic models and the recognizer used to recognize speech with the help of these models.


Not all operation modes of Simon require a model compiler backend. Please refer to the next section for details on when this is the case.

Two different backends are supported:

Julius / HTK: Models will be created with the HTK. Julius will be used as the recognizer. To use this backend, please make sure that you have an up-to-date version of both these tools installed.
CMU SPHINX: This backend, also often simply referred to as the SPHINX backend, uses the PocketSphinx recognizer and the SphinxTrain model generation backend. Please refer to the CMU SPHINX website for more details. The CMU SPHINX backend requires that Simon is built with the optional SPHINX support. If you have not compiled Simon from source, please refer to your distribution for more information.

If you are using base models, Simon will automatically select the appropriate backend for you. However, if you want to build your own models from scratch (user-generated model, see below) and have a certain preference, please refer to the Simond configuration for more information.

Base models created for one backend are not compatible with any other backend. Please refer to the compatibility matrix for details.

2.2.2.2 Types of base models

There are three types of base models:

Static base model
Adapted base model
User-generated model

For information on how to use base models in Simon, please refer to the Base Models section in the Use Simon chapter.

2.2.2.2.1 Static base model

Static base models simply use a pre-compiled acoustic model without modifying it. Any training data collected through Simon will not be used to improve the recognition accuracy. This type of model does not require the model creation backend to be installed.

2.2.2.2.2 Adapted base model

By adapting a pre-compiled acoustic model to your voice, you can improve accuracy. Collected training data will be compiled into an adaptation matrix which will then be applied to the selected base model. This type of model does require the model creation backend to be installed.

2.2.2.2.3 User-generated model

When using user-generated models, the user is responsible for training their own model. No base model will be used. The training data will be used to compile your own acoustic model, allowing you to create a system which directly reflects your voice. This type of model does require the model creation backend to be installed.

2.2.2.3 Requirements

To build, adapt or use acoustic models of different types, certain software needs to be installed.

                        CMU SPHINX                   Julius / HTK
Static base model       PocketSphinx                 Julius
Adapted base model      SphinxTrain, PocketSphinx    HTK, Julius
User-generated model    SphinxTrain, PocketSphinx    HTK, Julius

Table 2.2: Base model requirements

All four tools, HTK, Julius, PocketSphinx and SphinxTrain, can safely be installed at the same time. SPHINX support in Simon must be enabled during compile time and might not be available on your platform. Please refer to your distribution.
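The requirements in Table 2.2 can also be written down as a small lookup, which makes it easy to see which tools are still missing for a given setup. The following Python sketch is only an illustration of the table: the dictionary keys ("static", "adapted", "user-generated", "sphinx", "julius_htk") and the function name missing_tools are made up for this example and are not identifiers used by Simon.

```python
# Tools required per model type and backend, transcribed from Table 2.2.
# The key names are hypothetical labels chosen for this sketch.
REQUIRED_TOOLS = {
    "static": {
        "sphinx": {"PocketSphinx"},
        "julius_htk": {"Julius"},
    },
    "adapted": {
        "sphinx": {"SphinxTrain", "PocketSphinx"},
        "julius_htk": {"HTK", "Julius"},
    },
    "user-generated": {
        "sphinx": {"SphinxTrain", "PocketSphinx"},
        "julius_htk": {"HTK", "Julius"},
    },
}


def missing_tools(model_type, backend, installed):
    """Return the sorted list of required tools that are not installed."""
    required = REQUIRED_TOOLS[model_type][backend]
    return sorted(required - set(installed))


if __name__ == "__main__":
    # Example: the tools bundled with the Windows installer, which omits the HTK.
    bundled = ["Julius", "PocketSphinx", "SphinxTrain"]
    print(missing_tools("static", "julius_htk", bundled))
    print(missing_tools("adapted", "julius_htk", bundled))
```

With only Julius, PocketSphinx and SphinxTrain available, a static Julius / HTK base model needs nothing else, whereas adapting a model or generating one from scratch with that backend additionally requires the HTK.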

NOTE
The Simon Windows installer includes Julius, PocketSphinx and SphinxTrain but not the HTK. Please refer to the installation section for information on how to install it should you find the need for it.
2.2.2.4 Where to get base models

Simon base models are packaged as .sbm files. If you happen to have raw model files for your backend, you can package them into a compatible SBM container within Simon. Please refer to the speech model configuration for details.

Not all SBM models may work for you. Please refer to the model backends section for details.

For an up-to-date list of available base models, please refer to the list in our online wiki.

2.2.2.5 Phoneme set issues

In order for base models to work, both your scenarios and your base model need to use the same set of phonemes. In practice, this often just means that you need to match scenarios to your base model. The name of Simon base models will most likely start with a tag like [EN/VF/JHTK]. Try to download scenarios that start with the same tag.

You cannot use scenarios designed for a different phoneme set (a different base model). If Simon recognizes this error, it will try to disable the affected words by removing them from the created speech model. These words will be marked with a red background in the vocabulary of the scenario. To re-enable them, transcribe them with the proper phoneme set or use a user-generated model.


HINT
If you design a new scenario, it is therefore a good idea to use the dictionary that was used to create the base model as the shadow dictionary. This way, Simon will automatically suggest the correct phonemes when adding words.
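The phoneme set rule from this section can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not Simon's actual implementation: the function names are hypothetical, a transcription is assumed to be a space-separated string of phonemes, and the phoneme symbols in the example are invented. The first two helpers compare the leading tags of a base model name and a scenario name; the third separates words whose transcription only uses phonemes known to the base model from words that would have to be disabled.

```python
import re

# Matches a leading tag such as "[EN/VF/JHTK]" at the start of a name.
_TAG = re.compile(r"^\s*\[([^\]]+)\]")


def leading_tag(name):
    """Return the text inside a leading [...] tag, or None if there is none."""
    match = _TAG.match(name)
    return match.group(1) if match else None


def same_phoneme_set_tag(base_model_name, scenario_name):
    """True if both names start with the same tag."""
    tag = leading_tag(base_model_name)
    return tag is not None and tag == leading_tag(scenario_name)


def split_by_phoneme_set(vocabulary, model_phonemes):
    """Split a {word: transcription} mapping into usable and disabled words.

    A word is usable only if every phoneme of its transcription is part of
    the base model's phoneme set.
    """
    allowed = set(model_phonemes)
    usable, disabled = {}, {}
    for word, transcription in vocabulary.items():
        target = usable if set(transcription.split()) <= allowed else disabled
        target[word] = transcription
    return usable, disabled


if __name__ == "__main__":
    # Invented phoneme symbols, for illustration only.
    vocabulary = {"one": "w ah n", "zwei": "ts v aI"}
    usable, disabled = split_by_phoneme_set(vocabulary, {"w", "ah", "n"})
    print(usable)
    print(disabled)
```

Words that end up in the second mapping correspond to the entries that would be marked with a red background; re-transcribing them with phonemes from the base model's set moves them back into the usable group.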


Chapter 3

Using Simon: Typical user


The following sections will describe how to use Simon.

3.1 First run wizard

On the first start of Simon, this assistant will guide you through the initial configuration of Simon.

The configuration consists of five easy steps which are outlined below. You can skip each step and even the whole wizard if you want to; in that case, the system will be set up with default values. However, please note that without any configuration, there won't be any recognition.

3.1.1 Scenarios

In this step you can add or download scenarios.

To download scenarios from the online repository, select Open Download to open the download dialog pictured below.

Especially for new users, it is recommended to try some scenarios first to see how the system works before diving into configuring it exactly for your use case. After completing the assistant, you can change the scenario configuration using the scenario management dialog. If you are planning to use a base model, make sure that you download matching scenarios.

3.1.2 Base models

In this step you can set up Simon to use base models.

Again, you can download base models from an online repository through Open model Download.

To use a user-generated model, select Do not use a base model. After completing or aborting the first run wizard, you can change configuration options defined here in the Simon configuration.

3.1.3 Server

Internally, Simon is a server / client application. If you want to take advantage of a network based installation, you can provide the server address here.

The default configuration is sufficient for a normal installation and will assume that you use a local Simond server that will automatically be started and stopped with Simon. After completing or aborting the first run wizard, you can change configuration options defined here in the server configuration.

3.1.4 Sound configuration

Because Simon recognizes sound from one or more microphones, you have to tell Simon which devices you want to use for recognition and training.


Simon can use one or more input and output devices for different tasks. You can find more information about Simon's multiple device capabilities in the Simon sound configuration section. If you don't have at least one working input device for recognition, you will not be able to activate Simon. After completing or aborting the first run wizard, you can change configuration options defined here in the sound configuration.

3.1.5 Volume calibration

For Simon to work correctly, you need to configure your microphone's volume to a sensible level.


For more details on this, please see the general section about Volume Calibration.

3.2 The Simon Main Window

The Simon main window is split into four logical sections. On the top left, you can see the scenario section; to its right you find the training section; on the bottom left is the acoustic model and finally, to the right of that, the recognition section.

The Simon main window can be hidden at any time by clicking on the Simon logo in the system tray (usually next to the system clock in the task bar), which will minimize Simon to the tray. Click it again to show the main window again.

3.2.1 Main window: Scenarios

A list of scenarios shows the currently loaded scenarios. You can manage this selection by clicking Manage scenarios which will open the scenario management dialog. To modify a scenario, select it from the list and open it by pressing Open <name>.

3.2.2 Main window: Training

This section shows all training texts from all currently active scenarios. Selecting a training text will highlight the parent scenario in the scenario section. You can start to train the recognition by selecting a text and clicking on Start training.

Please note that, depending on your selected model type, training may or may not improve your recognition accuracy. The acoustic model section (see below) in the Simon main menu tells you if training will have an effect for your specific configuration. For background information on this subject, please refer to the base model section.

The gathered training corpus can be managed by selecting Manage training data, which will open the sample management dialog.

To help build a general, open speech corpus, please consider contributing your training corpus to the Voxforge project by selecting File > Contribute samples to bring up the sample upload assistant.

3.2.3 Main window: Acoustic model

Here, Simon shows information about the currently used base model and the active model. Select Configure acoustic model to configure the base model.

3.2.4 Main window: Recognition

This section shows information about the recognition status. If Simon is connected to the server, you can activate and deactivate the recognition by toggling the Activate button. If this control element is not available, make sure you are connected by selecting File > Connect from Simon's menu.

An integrated volume calibration widget monitors the configured recognition devices. The sound setup can be modified by selecting Configure audio to bring up the sound configuration.

3.3 Scenarios

This section describes how to import scenarios into and remove them from your Simon configuration. For general information about scenarios, please refer to the background chapter. If you want to create, edit or export scenarios, please refer to the advanced usage section.

To modify your scenario configuration, first open the scenario management dialog by pressing Manage scenarios in the Simon main window.

To activate or deactivate a scenario, you can use the arrow buttons between the two lists or simply double-click the scenario you want to load or unload. More information about individual scenarios can be found in the tooltips of the list items.

3.3.1 Import Scenario

Scenarios can be imported from a local file in Simon's XML scenario file format but can also be directly downloaded and imported from the Internet. When downloading scenarios, the list of scenarios is retrieved from the Simon Scenarios subsection of the OpenDesktop site KDE-files.org.

If you create a scenario that might be valuable for other Simon users, please consider uploading it to this online repository and help other Simon users.

3.3.2 Delete Scenario

To delete a scenario, select the scenario and click the Delete button. Because scenarios are synchronized with the recognition server, you can restore deleted scenarios through the model synchronization backup.

3.4 Recordings

If you are using user-generated or adapted models, Simon builds its acoustic model based on transcribed samples of the user's voice. Because of this, the recorded samples are of vital importance for the recognition performance.

3.4.1 Volume

It is important that you check your microphone volume before recording any samples.

3.4.1.1 Simon Calibration

The current version of Simon includes a simple way of ensuring that your volume is configured correctly.

By default, the volume calibration is displayed before starting any recording in Simon. To calibrate, simply read the text displayed. The calibration will monitor the current volume and tell you to either raise or lower the volume, but you have to do that manually in your system's audio mixer.

During calibration, try to talk normally. Don't yell but don't be overly quiet either. Take into account that you should generally use the same volume setting for all your training and for the recognition too. You might speak a little bit louder (unconsciously) when you are upset or at another time of the day, so try to raise your voice a little bit to anticipate this. It is much better to have slightly quieter samples than to start clipping.

In the Simon settings, both the text displayed and the levels considered correct can be changed. If you leave the text empty, the default text will be displayed. In the options, you can also deactivate the calibration completely. See the training section for more details.

3.4.1.2 Audacity Calibration

Alternatively, you can use an audio editing tool like the free Audacity to monitor the recording volume.

Too quiet:

Too loud:

Perfect volume:

3.4.2

Silence

To help Simon with the automatic segmentation it is recommended to leave about one or two seconds of silence on the recording before and after reading the prompted text. Current Simon versions include a graphical notice on when to speak during recording. The message will tell the user to wait for about half a second:

... before telling the user to speak:

This method of visual feedback proved especially valuable when recording with people who cannot read the prompted text for themselves and therefore need someone to tell them what they have to say. The colorful visual cue tells them when to start repeating what the facilitator said without the need of unreliable hand gestures.

3.4.3

Content

Generally, we recommend recording roughly the same sentences that Simon should recognize later. (Obviously, that does not apply to massive sample acquisitions where other properties like phonetic balance are more important.)

Care should be taken to avoid recordings like "One One One" made to quickly ramp up the recognition rate. Such recordings often decrease recognition performance because the pronunciation differs greatly from saying the word in isolation.

3.4.4

Microphone

For Simon to work well, a high quality microphone is recommended. However, even relatively cheap headsets (around 30 Euros) achieve very good results - orders of magnitude better than internal microphones. For maximum compatibility we recommend USB headsets as they usually support the necessary samplerate of 16 kHz, are very well supported by both Microsoft Windows and GNU/Linux and normally don't require special, proprietary drivers to operate.

3.4.5

Sample Quality Assurance

Simon will check each recording against certain criteria to ensure that the recorded samples are not erroneous or of poor quality. If Simon detects a problematic sample, it will warn the user to re-record the sample. Currently, Simon checks the following criteria:

Sample peak volume: If the volume is too loud and the microphone started to clip (see Clipping on Wikipedia), Simon will display a warning message urging the user to lower the microphone volume.

Signal to noise ratio (SNR): Simon will automatically determine the signal to noise ratio of each recording. If the ratio is below a configurable threshold, a warning message will be displayed. The default value of 2300 % means that for Simon to accept a sample as correctly recorded, the peak volume has to be 23 times louder than the noise baseline (lowest average over 50 ms). A low ratio is often the result of a very low quality microphone, high levels of ambient noise, or a low microphone gain coupled with a microphone boost option in the system mixer.

Figure: SNR warning message triggered by an empty sample. This information dialog is displayed when clicking on the More information button on the recording widget.
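The signal to noise criterion can be illustrated with a small sketch. It assumes that samples are plain integer amplitudes, that "volume" means absolute amplitude, and that the noise baseline is taken over non-overlapping 50 ms windows. None of these details are confirmed by the handbook, and the function names are made up for this example, so treat it as an approximation of the idea rather than Simon's actual implementation.

```python
def lowest_window_average(samples, sample_rate, window_ms=50):
    """Lowest mean absolute amplitude over non-overlapping windows."""
    if not samples:
        raise ValueError("empty recording")
    # Number of samples in one window; shrink it for very short recordings.
    window = min(len(samples), max(1, sample_rate * window_ms // 1000))
    averages = [
        sum(abs(s) for s in samples[start:start + window]) / window
        for start in range(0, len(samples) - window + 1, window)
    ]
    return min(averages)


def snr_percent(samples, sample_rate):
    """Peak volume relative to the noise baseline, in percent."""
    noise = lowest_window_average(samples, sample_rate)
    peak = max(abs(s) for s in samples)
    if noise == 0:
        return float("inf")  # digital silence as baseline: treat as unlimited
    return 100.0 * peak / noise


def passes_snr_check(samples, sample_rate, threshold_percent=2300):
    """True if the peak is at least threshold_percent of the baseline."""
    return snr_percent(samples, sample_rate) >= threshold_percent
```

With a threshold of 2300, a recording whose quietest 50 ms window averages 10 needs a peak of at least 230 to pass. A recording that contains only background noise has a ratio close to 100 % and fails, which matches the "empty sample" warning described above.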

3.5

Contribute Samples

The base models that can be used with Simon to augment or replace training are built from other people's speech samples. In order to create high quality base models, a large number of training samples is necessary. If you trained your local Simon installation, you gathered valuable voice samples that could improve the quality of the general model. Through Simon's Contribute Samples dialog you can upload those recordings to help the Voxforge project create high quality open source base models.

After connecting to the server, Simon will ask for some basic meta-information. This information obviously contains no personal information. Instead, it will later be used to group together samples of similar speaker groups to build more accurate acoustic models.

The duration of the upload process itself will depend on your internet connection. Generally speaking, this only transmits relatively little data because the audio samples collected by Simon are generally very small: around 0.1 MB per sample.

3.6

Manage training data

To view and modify your personal training corpus, you can access the training data management dialog by selecting Manage training data in the Simon main window or the training section of any opened scenario.

3.6.1

Modifying samples

To listen to or re-record a sample, select it from the list and select Open Sample.

In this dialog you can also modify the sample's group after it was recorded. If you remove the opened sample and do not re-record it, Simon will offer to remove it from the corpus.

3.6.2

Clear training data

After a confirmation dialog, this will remove all personal training data of the user.

3.6.3

Importing Training Samples

Using the import training data field you can import previously gathered training samples from previous Simon versions or manual training.

NOTE This feature is very specific. Please use it with caution and make sure that you know exactly what you are doing before you continue.

You can either provide a separate prompts file or let Simon extract the transcriptions from the filenames.

When using prompts based transcriptions, your prompts file (UTF-8) needs to contain lines of the following content: [filename] [content]. Filenames are without file extensions and the content has to be uppercase. For example: demo_2007_03_20 DEMO to import the file demo_2007_03_20.wav containing the spoken word Demo. Because prompts files do not contain a file extension, Simon will try wav, mp3, ogg and flac (in that order). If one of those matches, no other extension will be tested and only the first file will be imported (in contrast to file based transcription where all files would be imported).

When using file based transcriptions, a file called this_is_a_test.wav must contain This is a test and nothing else. Numbers and special characters (., -,...) in the filename are ignored and stripped. Files recorded by Simon 0.2 will follow this naming scheme so you can safely import them using the file name extraction method. Files generated by previous Simon versions should not be imported using this function but you can use the prompts based import for that.

Imported files and their transcription are then added to the training corpus. To import a folder containing training samples, just select the folder to import and, depending on your import type, also the prompts file.

The folder will be scanned recursively. This means that the given folder and all its subfolders will be searched for .wav, .flac, .mp3 and .ogg files. All files found will be imported. When importing the sound files, all configured post processing filters are applied. If you import anything other than WAV files, you are responsible for decoding them during the import process (for example through post processing filters) or the model creation will fail.
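The two transcription methods can be sketched as follows. The function names are hypothetical, and two details are assumptions made only for this example: underscores are treated as word separators, and file name based transcriptions are converted to uppercase to match the prompts format. The sketch only mirrors the rules described in this section and is not Simon's import code.

```python
import re
from pathlib import Path

# Extensions are tried in this order for prompts based imports.
EXTENSIONS = ("wav", "mp3", "ogg", "flac")


def parse_prompts(text):
    """Map each file name (without extension) to its transcription."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, content = line.partition(" ")
        entries[name] = content.strip()
    return entries


def find_sample(folder, name):
    """Return the first existing file for name, or None if there is none."""
    for extension in EXTENSIONS:
        candidate = Path(folder) / f"{name}.{extension}"
        if candidate.exists():
            return candidate
    return None


def transcription_from_filename(filename):
    """Derive a transcription from the file name alone."""
    stem = Path(filename).stem
    # Digits, underscores and special characters become word separators.
    words = re.sub(r"[\W\d_]+", " ", stem).split()
    return " ".join(words).upper()
```

For the examples in this section, `parse_prompts("demo_2007_03_20 DEMO")` yields the entry `demo_2007_03_20` with the content `DEMO`, and `transcription_from_filename("this_is_a_test.wav")` yields `THIS IS A TEST`.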

3.7

Configuration

Simon was designed with high configurability in mind. Because of this, there are plenty of parameters that can be fine-tuned to your specific requirements. You can access Simon's configuration dialog through the application's main menu: Settings → Configure Simon....

3.7.1

General Configuration

The general configuration page lists some basic settings. If you want to show the first run assistant again, deselect Disable configuration wizard.

Please note that the option to start Simon at login works on Microsoft Windows and when you are using KDE on Linux. Other desktop environments like Gnome, XFCE, etc. might require manually placing Simon in the session autostart (please refer to the respective manuals of your desktop environment). When the option to start Simon minimized is selected, Simon will minimize to the system tray immediately after starting. Deselecting the option to warn when there are problems with samples deactivates the sample quality assurance.

3.7.2

Recordings

Simon uses fairly sophisticated internal sound processing to enable complex multi-device setups.

3.7.2.1

Device Configuration

The sound device configuration allows you to choose which sound device(s) to use, configure them and define additional recording parameters. Use the Refresh devices button if you have plugged in additional sound devices since you started Simon.

Most of the time you will want to use 1 channel and 16kHz (which is also the default) because the recognition only works on mono input and works best at 16kHz (8kHz and 22kHz being other viable options). Some low-cost sound cards might not support this particular mode, in which case you can enable automatic Resampling in the device's advanced configuration.

NOTE Only change the channel and the samplerate if you really know what you are doing. Otherwise the recognition will most likely not work.

You can use Simon with more than one sound device at the same time. Use Add device to add a new device to the configuration and Remove device to remove it from your configuration. The first device in your sound setup cannot be removed.

For each device you can determine what you want the device to be used for: training or recognition (the latter only applies to input devices). If you use more than one device for training, you will create multiple sound files for each utterance. When using multiple devices for recognition, each one feeds a separate sound input stream to the server, resulting in recognition results for each stream.

If you use multiple output devices, the playback of the training samples will play on all configured audio devices. When using different sample rates for your input devices, the output will only play on matching output devices. If you, for example, have one input device configured to use 16kHz and the other to use 48kHz, the playback of samples generated by the first one will only play on 16kHz outputs, the other one only on 48kHz devices.

In the device's advanced configuration, you can also define the sample group tag of the produced training samples and set activation context conditions. If you set up this device to be used for recognition and (any of) its activation requirements are not met, the device will not record. This can be used to augment or even replace the traditional voice activity detection with context information. For example, add a face detection condition to the recording device's activation requirements to only enable the recognition when you're looking at the webcam.

3.7.2.2

Voice Activity Detection

The recognition is done on the Simond server. See the architecture section for more details. The sound stream is not continuous but is segmented by the Simon client. This is done by something called voice activity detection.
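Such a segmentation can be pictured as a small state machine over the input level. The sketch below is only an illustration under assumptions: it works on a list of per-frame levels with a fixed frame length, uses four thresholds named after the settings on this configuration page (cutoff level, head margin, tail margin, minimum sample length), measures the sample length without the trailing silence, and uses made-up default values. Simon's real implementation may differ in all of these points.

```python
def segment(levels, frame_ms=10, cutoff=0.1, head_margin_ms=30,
            tail_margin_ms=50, min_length_ms=60):
    """Return (start, end) frame indices of detected samples (end exclusive)."""
    segments = []
    start = None       # first frame of the current candidate, if any
    confirmed = False  # True once the head margin has been satisfied
    silence_ms = 0     # consecutive silence inside a confirmed sample
    for i, level in enumerate(levels):
        loud = level > cutoff
        if start is None:
            if not loud:
                continue
            start, confirmed, silence_ms = i, False, 0
        if not confirmed:
            if not loud:
                start = None  # dropped below the cutoff during the head margin
            elif (i - start + 1) * frame_ms >= head_margin_ms:
                confirmed = True
            continue
        if loud:
            silence_ms = 0
            continue
        silence_ms += frame_ms
        if silence_ms >= tail_margin_ms:
            end = i - silence_ms // frame_ms + 1  # first frame of the silence
            if (end - start) * frame_ms >= min_length_ms:
                segments.append((start, end))  # long enough to be recognized
            start = None
    return segments
```

With the defaults above, 100 ms of speech surrounded by silence produces one segment, while a 40 ms burst (a cough, for example) passes the head margin but is then skipped for being too short. A sample that is still running when the input ends is not reported, because in a live stream its tail margin has not elapsed yet.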

Here you can configure this segmentation through the following parameters:

Cutoff level: Everything below this level is considered silence (background noise).

Head margin: Simon caches the input for as long as the head margin before it starts to consider it a real sample. During this whole time the input level needs to be above the cutoff level.

Tail margin: After the recording went below the cutoff level, Simon will wait for as long as the tail margin before it considers the current recording a finished sample.

Skip samples shorter than: Samples that are shorter than this value (coughs, etc.) are not considered for recognition.

3.7.2.3

Training settings

When the option Default to power training is selected, Simon will, when training, automatically start and stop the recording when displaying and hiding (respectively) the recording prompt. This option only sets the default value; the user can change it at any time before beginning a training session.

The configurable font here refers to the text that is recorded to train the acoustic model (through explicit training or when adding a word). This option was introduced after we had worked with a few clients suffering from spastic disability. While we used the mouse to control Simon during the training, they had to read what was on the screen. At first this was very problematic as the regular font size is relatively small and they had trouble making out what to read. This is why we made the font and the font size of the recording prompt configurable.

Here you can also define the required signal to noise ratio for Simon to consider a training sample to be correct. See the Sample Quality Assurance section for more details.

On this configuration page you can also set the parameters for the volume calibration. It can be deactivated for both the add word dialog and the training wizard by unchecking the group box itself. The calibration itself uses the voice activity detection to score your sound configuration. The prompted text can be configured by entering text in the input field below. If the field is empty, a default text will be used.

3.7.2.4

Postprocessing

All recorded (training) and imported (through the import training data) samples can be processed using a series of postprocessing commands. Postprocessing chains are an advanced feature and shouldn't be needed by the average user.
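As a rough illustration of how such a series of commands could be applied, the sketch below expands command templates that contain the placeholders %1 (input file) and %2 (output file) and feeds the output of each step into the next one. The command names `denoise` and `normalize`, the intermediate file naming and the function names are hypothetical, and the sketch only plans the commands without running them. How Simon handles intermediate files internally is not described in this handbook.

```python
import re
import shlex


def build_command(template, input_file, output_file):
    """Expand %1 and %2 in a command template into an argument list."""
    replacements = {"1": input_file, "2": output_file}
    return [
        re.sub(r"%([12])", lambda match: replacements[match.group(1)], argument)
        for argument in shlex.split(template)
    ]


def plan_chain(templates, source, workdir="/tmp"):
    """Return the expanded commands and the name of the final output file."""
    commands = []
    current = source
    for step, template in enumerate(templates):
        target = f"{workdir}/step{step}.wav"  # hypothetical intermediate file
        commands.append(build_command(template, current, target))
        current = target
    return commands, current
```

For example, `build_command("process_audio %1 %2", "in.wav", "out.wav")` returns the argument list `process_audio`, `in.wav`, `out.wav`. In a chain of two filters, the second command receives the first command's output file as its input.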

The postprocessing commands can be seen as a chain of filters through which the recordings have to pass. Using these filters one could define commands to suppress background noise in the training data or normalize the recordings.

Given the program process_audio which takes the input and output files as its arguments (e.g.: process_audio in.wav out.wav), the postprocessing command would be: process_audio %1 %2. The two placeholders %1 and %2 will be replaced by the input filename and the output filename respectively.

The switch to apply filters to recordings recorded with Simon enables the postprocessing chains for samples recorded during the training (including the initial training while adding the word). If you don't select this switch, the postprocessing commands are only applied to imported samples (through the import training data wizard).

3.7.2.5

Context

Every sample recorded with Simon is assigned a sample group. When creating the acoustic model from the training samples Simon can take the current situation into account to only use a subset of all gathered training data.

For example, in a system where multiple, very different speakers use one shared setup, context conditions can be set up to automatically build separate models for both users depending on the current situation. The above screenshot, for example, shows a setup where, given that all samples of peter were tagged peters_samples and all samples of mathias were tagged mathias_samples (refer to the device configuration for more information on how to set up sample groups), the active acoustic model will only contain the current user's own samples as long as the file /home/bedahr/.username contains either peter or mathias. Another example use-case would be to switch to a more noise-resistant acoustic model when the user starts playing music.
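The effect of this example setup can be sketched in a few lines. The mapping and the function names are hypothetical and only restate the example: a sample group becomes active when the content of the user name file contains the matching user name. What Simon does when no condition matches is not described here; the sketch simply selects no samples in that case.

```python
# Hypothetical mapping taken from the example: user name -> sample group tag.
GROUP_FOR_USER = {"peter": "peters_samples", "mathias": "mathias_samples"}


def active_groups(username_file_content):
    """Sample groups whose 'file contains user name' condition is met."""
    return {
        group
        for user, group in GROUP_FOR_USER.items()
        if user in username_file_content
    }


def select_samples(samples, username_file_content):
    """Keep only the samples that belong to an active sample group."""
    groups = active_groups(username_file_content)
    return [name for name, group in samples if group in groups]
```

Given samples tagged with both groups, `select_samples` returns only the samples tagged peters_samples while the file contains peter, and only those tagged mathias_samples while it contains mathias.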

3.7.3

Speech Model

Here you can adjust the parameters of the speech model.

3.7.3.1

Base model

You can optionally use base models to limit / circumvent the training or to avoid installing a model creation backend. Please refer to the general base model section for more details about base models.

To use a user-generated model, select Do not use a base model. To use a static base model, select a base model and do not select Adapt base model using training samples. To instead use an adapted base model, check Adapt base model using training samples after selecting a base model.

Simon base models are packaged in .sbm files. To add base models to the selection, you can either import local models (Open model → Import), download them from an online repository (Open model → Download) or create new ones from raw files (Open model → Create from model files).

If you have raw model files produced by either supported model creation backend, you can package them into an SBM container for use with Simon.

You can also export your currently active model by selecting Export active model. The exported SBM file will contain your full acoustic model (ignoring the current context) that can be shared with other Simon users.

3.7.3.2

Training data

This section allows you to configure the training samples.

The samplerate set here is the target samplerate of the acoustic model. It has nothing to do with the recording samplerate and it is the responsibility of the user to ensure that the samples are actually made available in that format (usually by recording in that exact samplerate or by defining postprocessing commands that resample the files; see the sound configuration section for more details).

Usually either 16kHz or 8kHz models are built / used. 16kHz models will have higher accuracy than 8kHz models. Going higher than 16kHz is not recommended as it is very CPU-intensive and in practice probably won't result in higher recognition rates.

Moreover, the path to the training samples can be adjusted. However, be sure that the previously gathered training samples are also moved to the new location. If you use automatic synchronization, Simond would alternatively also provide Simon with the missing samples, but copying them manually is still recommended for performance reasons.

3.7.3.3

Language Profile

In the language profile section you can select a previously built or downloaded language profile to aid with the transcription of new words.

3.7.4

Model Extensions

Here you can configure the base URL that is going to be used for the automatic bomp import. The default points to the copy on the Simon listens server.

3.7.5

Recognition

Here you can configure the recognition and model synchronization with the Simond server.

3.7.5.1

Server

Using the server configuration you can set parameters of the connection to Simond.

3.7.5.1.1

General

The Simon main application connects to the Simond server (see the architecture section for more information).

To identify individual users of the system (one Simond server can of course serve multiple Simon clients), Simon and Simond use user accounts. Every user has their own speech model. The username / password combination given here is used to log in to Simond. If Simond does not know the username or the password is incorrect, the connection will fail. See the Simond manual on how to set up users for Simond.

The recognition itself - which is done by the server - might not be available at all times. For example, it would not be possible to start the recognition as long as the user does not have a compiled acoustic and language model, which has to be created first (during synchronization, when all the ingredients - vocabulary, grammar, training - are present). Using the option to start the recognition automatically once it is available, Simon will request to start the recognition when it receives the information that it is ready (all required components are available).

Using the Connect to server on startup option, Simon will automatically start the connection to the configured Simond servers after it has finished loading the user interface.

3.7.5.1.2

Network

Simon connects to Simond using TCP/IP.

As of Simon 0.4, encryption is not yet supported.

The timeout setting specifies how long Simon will wait for a first reply when contacting the hosts. If you are on a very slow network and/or use connect on start on a very slow machine, you may want to increase this value if you keep getting timeout errors and can resolve them by trying again repeatedly.

Simon can be configured to use more than one Simond. This is very useful if, for example, you are going to use Simon on a laptop which connects to a different server depending on where you are. You could, for example, add the server you use when you are home and the server used when you are at work. When connecting, Simon will try to connect to each of the servers (in order) until it finds one server that accepts the connection.

To add a server, just enter the host name or IP address and the port (separated by :) or use the dialog that appears when you select the blue arrow next to the input field.

3.7.5.2

Synchronization and Model Backup

Here you can configure the model synchronization and restore older versions of your speech model.

Simon creates the speech input files which are then compiled and used by the Simond server (see the architecture section for more details). The process of sending the speech input files, compiling them and receiving the compiled versions is called synchronization. Only after the speech model is synchronized do the changes take effect and a new restore point is set. This is why by default Simon will always synchronize the model with the server when it changes. This is called Automatic Synchronization and is the recommended setting.

However, if you want more control, you can instruct Simon to ask you before starting the synchronization after the model has changed, or to rely on manual synchronization altogether. When selecting the manual synchronization, you have to use the Actions → Synchronize menu item of the Simon main window every time you want to compile the speech model.

The Simon server will maintain a copy of the last five iterations of model files. This only includes the source files (the vocabulary, grammar, etc.) - not the compiled model. However, the compiled model will be regenerated from the restored source files automatically. After you have connected to the server, you can select one of the available models and restore it by clicking on Choose Model.

3.7.6

Actions

In the actions configuration you can configure the reactions to recognition results.

3.7.6.1

Recognition

Simon's recognition computes not only the most likely result but the top ten results. Each of the results is assigned a confidence score between 0 and 1 (where 1 means 100% sure). Using the Minimum confidence setting, you can set a minimum confidence for recognition results to be considered valid. If more than one recognition result is rated higher than the minimum confidence score, Simon will show a popup listing the most likely options for you to choose from.
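The decision described here can be sketched as a small filter over the scored results. The function name, the returned labels and the default threshold of 0.7 are assumptions made for this example and do not come from Simon.

```python
def decide(results, minimum_confidence=0.7):
    """Classify scored recognition results as reject, accept or ask.

    results is a list of (text, confidence) pairs, confidence in 0..1.
    """
    valid = sorted(
        (result for result in results if result[1] > minimum_confidence),
        key=lambda result: result[1],
        reverse=True,
    )
    if not valid:
        return "reject", []            # nothing is confident enough
    if len(valid) == 1:
        return "accept", [valid[0][0]]
    # Several plausible results: this is where the selection popup appears.
    return "ask", [text for text, _ in valid]
```

With a minimum confidence of 0.7, the results ("Computer Mail", 0.9) and ("Computer Internet", 0.4) lead to the first being accepted directly, whereas two results above 0.7 would be offered for selection, most likely first.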

This popup can be disabled using the Display selection popup for ambiguous results check box.

3.7.6.2

Dialog font

Many plugins of Simon have a graphical user interface. The fonts of these interfaces can be configured centrally here, independent of the system's font settings.

3.7.6.3

Lists

Here you can find the global list element configuration. This serves as a template for new scenarios but is also directly used for the popup for ambiguous recognition results.

3.7.7

Text-to-speech

Some parts of Simon, most notably the dialog command plugin, employ text-to-speech (or TTS) to read text aloud.

3.7.7.1

Backends

Multiple external TTS solutions can be used to allow Simon to talk. Multiple backends can be enabled at the same time and will be queried in the configured order until one is found that can synthesize the requested message. The following backends are available:

Recordings: Instead of using an engine to convert arbitrary text into speech, text snippets can be pre-recorded and will simply be played back.

Jovie: Uses the Jovie TTS system. This requires a valid Jovie set-up.

Webservice: The webservice backend can be used to talk to any TTS engine that has a web front-end that returns .wav files.

3.7.7.2

Recordings

Instead of using an external TTS engine, you can also record yourself or other speakers reading the texts aloud. Simon can then play back these pre-recorded snippets when they are requested of its text-to-speech engine. These recorded sound bites are organized into sets of different speakers which can also be imported and exported to share them with other Simon users.

3.7.7.3

Webservice

Through the webservice backend, Simon can use web-based TTS engines like MARY.

You can provide any URL. Simon will replace any instance of %1 within the configured URL with the text to synthesize. The backend expects the queried webservice to return a .wav file that will be streamed and output through Simon's sound layer - respecting the sound device configuration.
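The placeholder substitution can be sketched as follows. The URL in the usage note is a made-up example, and percent-encoding the text before inserting it is an assumption of this sketch. The handbook does not state whether Simon encodes the text.

```python
from urllib.parse import quote


def build_tts_url(template, text):
    """Replace every %1 in the URL template with the percent-encoded text."""
    return template.replace("%1", quote(text, safe=""))
```

For a hypothetical service configured as `http://tts.example/synthesize?text=%1`, the text "Hello world" would result in a request to `http://tts.example/synthesize?text=Hello%20world`.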

3.7.8

Social desktop

Scenarios can be uploaded and downloaded from within Simon. For this we use KDE's social desktop facilities and our own category for Simon scenarios on kde-files.org. If you already have an account on opendesktop.org, you can input the credentials here. If you don't, you can register directly in the configuration module. The registration is of course free of charge.

3.7.9

Webcam configuration

In the Webcam configuration, you can configure the frames per second (fps) and select the webcam to use when multiple webcams are connected to your system.

Frames per second is the rate at which the webcam will produce unique consecutive images called frames. For good performance, the fps value should be between 5 and 15.

3.7.10

Advanced: Adjusting the recognition parameters manually

Simon is targeted towards end-users. Its interface is designed to allow even users without any background in speech technology to design their own language and acoustic models by providing reasonable default values for simple uses. In special cases (severe speech impairments, for example), special configuration might be needed. This is why the raw configuration files for the recognition are also respected by Simon and can of course be modified to suit your needs.

3.7.10.1

Julius

There are basically two parts of the Julius configuration that can be adjusted:

adin.jconf: This is the Simon client's configuration of the sound stream sent from Simon to Simond. This file is directly read by the adinstreamer. Simon ships with a default adin.jconf without any special parameters. You can change this system wide configuration, which will affect all users if there are different user accounts on your machine who all use Simon. To change the configuration of just one of those users, copy the file to the user path (see below) and edit this copy.

julius.jconf: This is a configuration of the Simond server and directly influences the recognition. This file is parsed by libjulius and libsent directly. Simond ships with a default julius.jconf. Whenever a new user is added to the Simond database, Simond will automatically copy this system wide configuration to the new user. After that, the user is of course free to change it but it won't affect the other users. This way the template (the system wide configuration) can be changed without affecting existing users.

The path to the Julius configuration files will depend on your platform:

Table 3.1: Julius Configuration Files

adin.jconf (system)
    GNU/Linux: `kde4-config --prefix`/share/apps/simon/adin.jconf
    Microsoft Windows: (installation path)\share\apps\simon\adin.jconf

adin.jconf (user)
    GNU/Linux: ~/.kde/share/apps/simon/adin.jconf
    Microsoft Windows: %appdata%\.kde\share\apps\simon\adin.jconf

julius.jconf (template)
    GNU/Linux: `kde4-config --prefix`/share/apps/simond/default.jconf
    Microsoft Windows: (installation path)\share\apps\simond\default.jconf

julius.jconf (user)
    GNU/Linux: ~/.kde/share/apps/simond/models/(user)/active/julius.jconf
    Microsoft Windows: %appdata%\.kde\share\apps\simond\models\(user)\active\julius.jconf

Chapter 4

Advanced: Creating new scenarios with Simon


The following chapter is aimed at more experienced users who want to design their own scenarios. For general usage instructions, please refer to the chapter Using Simon: Typical user.

4.1

Introduction

To add a new scenario, you first create a new scenario shell by adding a new scenario object and then open it in the Simon main window. To instead modify an existing scenario, you just have to open it.

A Simon scenario contains the following components:

Vocabulary
Grammar
Training texts
Context
Commands

Before describing how to configure these elements in Simon, the next section provides background information that will help you understand the basic principles of speech modelling. This fundamental knowledge is necessary to design sensible scenarios.

4.2

Speech recognition: background

NOTE Before explaining exactly how you can create new scenarios with Simon, this section introduces some fundamental basics of speech recognition in general.

Speech recognition systems take voice input (often from a microphone) and try to translate it into written text. To do that, they rely on statistical representations of human voice. To put it in simple terms: The computer learns how words - or more correctly the sounds that make up those words - sound.

A speech model consists of two distinct parts:

Language Model
Acoustic Model

4.2.1

Language Model

The language model defines the vocabulary and the grammar you want to use.

4.2.1.1

Vocabulary

The vocabulary defines what words the recognition process should recognize. Every word you want to be able to use with Simon should be contained in your vocabulary. One entry in the vocabulary defines exactly one word. In contrast to the common use of the word "word", in Simon a word means one unique combination of the following:

Wordname: The written word itself
Category: Grammatical category; for example: Noun, Verb, etc.
Pronunciation: How the word is pronounced; Simon accepts any kind of phonetic notation as long as it does not use special characters or numbers

That means that plurals or even different cases are different words to Simon. This is an important design decision to allow more control when using a sophisticated grammar. In general, it is advisable to keep your vocabulary as sleek as possible. The more words, the higher the chance that Simon might misunderstand you.

Example vocabulary (please note that the categories here are deliberately set to Noun / Verb to aid understanding; please refer to the grammar section for why this might not be the best idea):

Table 4.1: Sample Vocabulary

Word      Category  Pronunciation
Computer  Noun      k ax m p y uw t er
Internet  Noun      ih n t er n eh t
Mail      Noun      m ey l
close     Verb      k l ow s
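The notion of a word as a unique combination of three properties can be modelled with a small immutable record. This is a sketch for illustration, not Simon's internal data structure. The class and function names are made up, and the plural entry "Computers" with its pronunciation is a hypothetical addition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    name: str             # the written word itself
    category: str         # grammatical category, e.g. "Noun"
    pronunciation: tuple  # phonemes, e.g. ("m", "ey", "l")


def make_word(name, category, pronunciation):
    """Create a Word, rejecting phonemes with digits or special characters."""
    phonemes = tuple(pronunciation.split())
    if not phonemes or not all(phoneme.isalpha() for phoneme in phonemes):
        raise ValueError(f"invalid pronunciation: {pronunciation!r}")
    return Word(name, category, phonemes)


vocabulary = {
    make_word("Computer", "Noun", "k ax m p y uw t er"),
    make_word("Internet", "Noun", "ih n t er n eh t"),
    make_word("Mail", "Noun", "m ey l"),
    make_word("close", "Verb", "k l ow s"),
}
```

Because all three fields take part in the comparison, adding `make_word("Computers", "Noun", "k ax m p y uw t er z")` creates a fifth entry, while adding "Computer" again with the same category and pronunciation leaves the set unchanged.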

4.2.1.1.1

Active Dictionary

The vocabulary used for the recognition is referred to as active dictionary or active vocabulary.

4.2.1.1.2

Shadow Dictionary

As said above, users should keep their vocabulary / dictionary as lean as possible. However, as a word in your vocabulary also has to have information about its pronunciation, it would also be good to have a large dictionary where you could look up the pronunciation and other characteristics of the words. Simon provides this functionality. We refer to this large reference dictionary as shadow dictionary. This shadow dictionary is not created by the user but can be imported from various sources. As Simon is a multi-language solution, we do not ship shadow dictionaries with Simon. However, it is very easy to import them yourself using the import dictionary wizard. This is described in the Import Dictionary section.

4.2.1.1.3

Language profile

In addition to a shadow dictionary, Simon can use a language profile to provide help with transcribing words. A language profile consists of rules describing how words are pronounced in the target language. It can be likened to the way that humans can often pronounce a word they never heard just because they know some implicit pronunciation rules of the language. Just as with humans, this process is not perfect but can provide a solid starting ground. This automatic deduction of a phoneme transcription from a written word is called grapheme to phoneme conversion. Simon requires the Sequitur G2P grapheme to phoneme converter to be installed and set up for language profiles to work. If you have selected a pre-built language profile or built your own, Simon will automatically transcribe new words with it when they are not found in your shadow dictionary.

4.2.1.2

Grammar

The grammar defines which combinations of words are correct. Let's look at an example: You want to use Simon to launch programs and close those windows when you are done. You would like to use the following commands:

Computer, Internet: to open a browser
Computer, Mail: to open a mail client
Computer, close: to close the current window

Following English grammar, your vocabulary would contain the following:

Table 4.2: Sample Vocabulary

Word      Category
Computer  Noun
Internet  Noun
Mail      Noun
close     Verb


To allow the sentences defined above, Simon would need the following grammar:

• Noun Noun for sentences like Computer Internet
• Noun Verb for sentences like Computer close

While this would work, it would also allow the combinations Computer Computer, Internet Computer, Internet Internet, etc., which are obviously bogus. To improve the recognition accuracy, we can try to create a grammar that better reflects what we are trying to do with Simon.

It is important to remember that you define your own language when using Simon. That means that you are not bound to the grammar rules of whatever language you want to use Simon with. For a simple command and control use case it is, for example, advisable to invent new grammatical rules that eliminate differences between commands which stem from grammatical information not relevant for this use case.

In the example above it is not relevant that close is a verb or that Computer and Internet are nouns. Instead, why not define them as something that better reflects what we want them to be:

Word       Category
Computer   Trigger
Internet   Command
Mail       Command
close      Command

Table 4.3: Improved Sample Vocabulary

Now we change the grammar to the following:

• Trigger Command

This still allows all three commands listed above. However, it also limits the possibilities to exactly those three sentences. Especially in larger models, a well-thought-out grammar and vocabulary can make a huge difference in recognition results.
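The effect of the two grammars can be checked by enumerating every sentence they allow. The following is only a minimal sketch of that idea, not Simon's implementation; the function name expand and the dictionary layout are assumptions made for this illustration.

```python
from itertools import product


def expand(grammar, vocabulary):
    """Return every sentence the grammar allows for the given vocabulary."""
    # Group the words by their category.
    by_category = {}
    for word, category in vocabulary.items():
        by_category.setdefault(category, []).append(word)

    sentences = []
    for structure in grammar:
        # One list of candidate words per category in the structure.
        slots = [by_category.get(category, []) for category in structure.split()]
        sentences.extend(" ".join(words) for words in product(*slots))
    return sentences


# Table 4.2: categories following English grammar.
english = {"Computer": "Noun", "Internet": "Noun", "Mail": "Noun", "close": "Verb"}
# Table 4.3: categories invented for the use case.
improved = {"Computer": "Trigger", "Internet": "Command",
            "Mail": "Command", "close": "Command"}

loose = expand(["Noun Noun", "Noun Verb"], english)
strict = expand(["Trigger Command"], improved)

print(len(loose))  # 12
print(strict)      # ['Computer Internet', 'Computer Mail', 'Computer close']
```

The first grammar accepts twelve sentences, including bogus ones such as Computer Computer, while the second accepts exactly the three intended commands.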

4.2.2 Acoustic Model

The acoustic model represents your pronunciation in a machine-readable format. Let's look at the following sample vocabulary:

Word       Category   Pronunciation
Computer   Noun       k ax m p y uw t er
Internet   Noun       ih n t er n eh t
Mail       Noun       m ey l
close      Verb       k l ow s

Table 4.4: Sample Vocabulary


The pronunciation of each word is composed of individual sounds which are separated by spaces. For example, the word close consists of the following sounds:

• k
• l
• ow
• s

The acoustic model uses the fact that spoken words are composed of sounds much like written words are composed of letters. Using this knowledge, we can segment words into sounds (represented by the pronunciation) and assemble them back when recognizing. These building blocks are called phonemes.

Because the acoustic model actually represents how you speak the phonemes of the words, training material is shared among all words that use the same phonemes. That means that if you add the word clothes to the language model, your acoustic model already has an idea how the clo part is going to sound, because close and clothes share the same phonemes (k, l, ow) at the beginning.

To train the acoustic model (in other words, to tell it how you pronounce the phonemes) you have to train words from your language model. That means that Simon displays a word which you read out loud. Because the word is listed in your vocabulary, Simon already knows what phonemes it contains and can thus learn from your pronunciation of the word.
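The sharing of phonemes between words can be illustrated with a few lines of Python. This is only a sketch: the helper shared_prefix is made up for this example, and the transcription of clothes is an assumption, since the handbook only states that it starts with k, l, ow.

```python
# Pronunciations from Table 4.4.
lexicon = {
    "Computer": "k ax m p y uw t er",
    "Internet": "ih n t er n eh t",
    "Mail": "m ey l",
    "close": "k l ow s",
}


def shared_prefix(a, b):
    """Return the phonemes two transcriptions have in common at the start."""
    common = []
    for x, y in zip(a.split(), b.split()):
        if x != y:
            break
        common.append(x)
    return common


# Assumed transcription for illustration only, not taken from a dictionary.
clothes = "k l ow dh z"

print(shared_prefix(lexicon["close"], clothes))  # ['k', 'l', 'ow']
```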

4.3 Scenarios

This section extends the previous one about basic scenario management and tells you how to create, edit and export scenarios.

4.3.1 Scenario hierarchies

You can create scenario hierarchies by dragging and dropping active scenarios on top of each other.

Scenario hierarchies serve two purposes:

• The context system respects scenario hierarchies: If the parent scenario gets deactivated, all child scenarios will become deactivated as well.
• If you attempt to export a scenario that has children, Simon will allow you to export them in a joint scenario package. This way, you can share multiple logically co-dependent scenarios (e.g. one Office scenario that contains sub-scenarios for Word, Excel, etc.).

4.3.2 Adding a new Scenario

To add a new scenario, select the Add button. A new dialog will be displayed.


When creating a new scenario, please give it a descriptive name. For the later upload to KDE files we would kindly ask you to follow a certain naming scheme, although this is of course not a requirement: [<language>/<base model>] <name>. If, for example, you create a scenario in English that works with the Voxforge base model and controls Mozilla Firefox, this becomes: [EN/VF] Firefox. If your scenario is not specifically tailored to one phoneme set (base model), just omit the second tag like this: [EN] Firefox.

The scenario version is just an incremental version number that makes it easier to distinguish between different revisions of a scenario.

If your scenario needs a specific feature of Simon (for example because you use a new plugin), you can define minimum and maximum version numbers of Simon here.

The license of your scenario can be set through the drop-down. You can of course also add an arbitrary license text directly in the input field.

You can then add your name (or alias) to the list of scenario authors. There you will also be asked for contact information. This field is purely provided as a convenient way to contact a scenario author for changes, problems, fan mail etc. If you don't feel comfortable providing your e-mail address, you can simply enter a dash (-) to indicate that you are not willing to divulge this information.
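The suggested naming scheme is simple enough to express as a small helper. The function below is purely illustrative (scenario_name is not part of Simon) and only shows how the two forms of the name are built.

```python
def scenario_name(name, language, base_model=None):
    """Build a name following the scheme [<language>/<base model>] <name>.

    The base model tag is omitted when the scenario is not tailored
    to one phoneme set.
    """
    tag = language if base_model is None else f"{language}/{base_model}"
    return f"[{tag}] {name}"


print(scenario_name("Firefox", "EN", "VF"))  # [EN/VF] Firefox
print(scenario_name("Firefox", "EN"))        # [EN] Firefox
```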

4.3.3 Edit Scenario

To edit scenarios, just select Edit from the Manage scenarios dialog. The dialog works exactly the same as the add scenario dialog.

4.3.4 Export Scenario

Scenarios can be exported to a local file in Simon's XML scenario file format and directly uploaded to the Simon Scenarios subsection of the OpenDesktop site KDE-files.org. To upload to OpenDesktop sites, you need an account on the site. Registration is very easy and of course free of charge.

Simon allows you to upload new content directly from within Simon (Export > Publish).


To use this functionality, simply enter your account credentials in the social desktop configuration in the Simon configuration.

4.4 Vocabulary

The vocabulary module defines the set of words of the scenario.


By default, the active vocabulary is shown. To display the shadow vocabulary, select the tab Shadow Vocabulary. Every word states its recognition rate, which at the moment is just a counter of how often the word has been recorded (alone or together with other words).

4.4.1 Adding Words

To add new words to the active vocabulary, use the add word wizard. Adding words to Simon is basically a two-step procedure:

• Defining the word
• Initial training

4.4.1.1 Defining the Word

Firstly, the user is asked which word he wants to add.

When the user proceeds to the next page, Simon automatically tries to find as much information about the word in the shadow dictionary as possible. If the word is listed in the shadow dictionary, Simon automatically fills out all the needed fields (Category and Pronunciation).


All suggestions from the shadow dictionary are listed in the table Similar words. By default only exact word matches are shown. However, this can be changed by checking the Include similar words check box below the suggestion table. Using similar words you can quickly deduce the correct pronunciation of the word you are actually trying to add. See below for details.

Of course this really depends on your shadow dictionary. If the shadow dictionary does not contain the word you are trying to add, the required fields have to be filled out manually.

Some dictionaries that can be imported with Simon (SPHINX, HTK) do not differentiate between upper and lower case. Suggestions based on those dictionaries will always be uppercase. You are of course free to change these suggestions to the correct case.

Some dictionaries that can be imported with Simon (SPHINX, PLS and HTK) provide no grammatical information at all. These will assign all the words to the category Unknown. You should change this to something appropriate when adding those words.

4.4.1.1.1 Manually Selecting a Category

The category of the word is defined as the grammatical category the word belongs to. This might be Noun, Verb or completely new categories like Command. For more information see the grammar section. The list contains all categories used in both your active and your shadow lexicon and in your grammar. You can add new categories to the drop-down menu by using the green plus sign next to it.

4.4.1.1.2 Manually Providing the Phonetic Transcription

The pronunciation is a bit trickier. Simon does not need a certain type of phonetics, so you are free to use any method as long as it uses only ASCII characters and no numbers. However, if you want to use a shadow dictionary to its full potential, you should use the same phonetics as the shadow dictionary.

If you do not know how to transcribe a word yourself, you can easily use your shadow dictionary to help you with the transcription - even if the word is not listed in it. Let's say we want to add the word Firefox (to launch Firefox), which is of course not listed in our shadow dictionary.

(For this example, the English Voxforge HTK lexicon available from Voxforge was imported as a shadow dictionary.) Firefox is not listed in our shadow dictionary, so we do not get any suggestion at all.

However, we know that Firefox sounds like fire and fox put together. So let's just open the vocabulary (you can keep the wizard open) by selecting Vocabulary from your Simon main toolbar. Switch to the shadow vocabulary by clicking on the tab Shadow Vocabulary. Use the Filter box above the list to search for Fire:

We can see that the word Fire is transcribed as f ay r. Now filter for fox instead of Fire and we can see that Fox is transcribed as f ao k s. We can assume that Firefox should be transcribed as f ay r f ao k s.

Using this approach of deducing the pronunciation from parts of the word has the distinct advantage that we not only get a high quality transcription but also automatically use the same phoneme set as the other words which were correctly pulled out of the shadow dictionary. We can now enter the pronunciation and change the category to something appropriate.
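The manual lookup above boils down to concatenating the transcriptions of the parts. A minimal sketch follows, assuming a shadow dictionary held as a plain Python dict with uppercase keys (as suggestions from HTK dictionaries are uppercase); the helper compose is hypothetical and not a Simon function.

```python
# Excerpt of the shadow dictionary with the two entries found above.
shadow = {"FIRE": "f ay r", "FOX": "f ao k s"}


def compose(parts, dictionary):
    """Join the transcriptions of the given word parts.

    Raises KeyError if a part is not listed in the dictionary.
    """
    return " ".join(dictionary[part.upper()] for part in parts)


print(compose(["fire", "fox"], shadow))  # f ay r f ao k s
```

The result is only a starting point: as with the manual approach, you should still check that the combined transcription matches how you actually pronounce the word.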

4.4.1.2 Training the Word

To complete the wizard we can now train the word twice. If you don't want to do this or, for example, use a static base model, you can skip these two pages.

Because you are about to record some training samples, Simon will display the volume calibration to make sure that your microphone is set up correctly. For more information please refer to the volume calibration section.

Simon will try to prompt you for real-world examples. To do that, Simon will automatically fetch grammar structures using the category of the word and substitute the generic categories with example words from your active lexicon. For example: You have the grammar structure Trigger Command and have the word Computer of the category Trigger in your vocabulary. You then add a new word Firefox of the category Command. Simon will now automatically prompt you for Computer Firefox as it is - according to your grammar - a valid sentence. If Simon is unable to find appropriate sentences using the word (e.g. no grammar, not enough words in your active lexicon, etc.) it will just prompt you for the word alone.

Although Simon ensures that the automatically generated examples are valid, you can always override its suggestions. Just switch to the Examples tab on the Define Word page.
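How such prompts could be derived from the grammar is sketched below. This is an illustration of the idea only and makes several assumptions: the function training_prompts, its parameters and the limit of two prompts (one per training page) are invented for this example, and every occurrence of the new word's category in a structure is filled with the new word.

```python
from itertools import product


def training_prompts(new_word, category, grammar, vocabulary, limit=2):
    """Return up to `limit` example sentences that contain new_word."""
    # Group the words of the active vocabulary by category.
    by_category = {}
    for word, cat in vocabulary.items():
        by_category.setdefault(cat, []).append(word)

    prompts = []
    for structure in grammar:
        slots = structure.split()
        if category not in slots:
            continue  # this structure cannot contain the new word
        choices = [[new_word] if slot == category else by_category.get(slot, [])
                   for slot in slots]
        for words in product(*choices):
            prompts.append(" ".join(words))
            if len(prompts) == limit:
                return prompts
    # Fall back to the word alone if no sentence could be built.
    return prompts or [new_word]


vocabulary = {"Computer": "Trigger"}
print(training_prompts("Firefox", "Command", ["Trigger Command"], vocabulary))
# ['Computer Firefox']
print(training_prompts("Firefox", "Command", [], vocabulary))
# ['Firefox']
```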


You are free to change those examples to anything you like. You can even go so far as to use words that are not yet in your active lexicon, as long as you add them before you synchronize the model, although this is not recommended. All that is left is to record the examples.

Make sure you follow the guidelines listed in the recording section.

4.4.2 Editing a word

To edit a word, simply select it from the vocabulary, and click on Edit.

There you can change name, category and pronunciation of the selected word.

4.4.3 Removing a word

To remove a word from your language model, select it in the vocabulary view and click on Remove.

The dialog offers four choices:

• Move the word to the Unused category.
  Because you (hopefully) don't use the category Unused in your grammar, the word will no longer be considered for recognition. In fact, it will be removed from the active vocabulary before compiling the model because no grammar sentence references it. If you want to use the category Unused in your grammar, you can of course use a different category for unused words. Just set the category through the Edit word dialog. To use the word again, just set the right category again. No data will be lost.

• Move the word to the shadow lexicon.
  This will remove the selected word from the active lexicon (and thus from the recognition) but will keep a copy in the shadow vocabulary. All the recordings containing the word will be preserved. To use the word again, add it again to the active vocabulary. When adding a new word with the same name, the values of the moved word will be suggested to you. Therefore, no data will be lost.

• Delete the word but keep the samples.
  Removes the word completely but keeps the associated samples. Whenever you add another word with the same word name, the samples will be re-associated. Be careful with this option, as the new word you add might be transcribed differently and this difference cannot be taken into account automatically (Simon will then try to force the new transcription on the old recordings during the model compilation). Do not use this option if the samples you recorded for this word were erroneous.

• Remove the word completely.
  Just remove the word. All the recordings containing the word will be removed too. This option leaves no trace of either the word itself or the associated samples. Because samples are global (not assigned to scenarios), even samples recorded in training sessions of other scenarios might be removed if they contain the word. Use this option carefully.

4.4.4 Special Training

Please see the special training section in the training section.

4.4.5 Importing a Dictionary

Simon provides the functionality to import large dictionaries as a reference. This reference dictionary is called the shadow dictionary. When you add a new word to the model, you have to define the following characteristics:

• Word name
• Category
• Phonetic definition

These characteristics are taken from the shadow dictionary if it contains the word in question. A large, high-quality shadow dictionary can thus help you to easily add new words to the model without keeping track of the phoneme set or - in many cases - even let you forget that the phonetic transcription is needed at all.


Since version 0.3 you can also import dictionaries directly into the active dictionary. This option is mostly there to make it easier to move to Simon from custom solutions and to encourage importing older models (for example one used with Simon 0.2). You will almost never want to import a very large dictionary as active dictionary.

You can find a list of available dictionaries that work with Simon on the Simon wiki.

Simon is able to import five different types of dictionaries:

• HADIFIX
• HTK
• PLS
• SPHINX
• Julius

4.4.5.1 HADIFIX Dictionary

Simon can import HADIFIX dictionaries. One example of a HADIFIX dictionary is the German HADIFIX BOMP. HADIFIX dictionaries provide both categories and pronunciation. Due to a special exemption in their license, the Simon listens team is proud to be able to offer the excellent HADIFIX BOMP for download directly from within Simon.


Using the automatic BOMP import you can, after providing your name and e-mail address for the team of the University of Bonn, directly download and import the dictionary from the Simon listens server.

4.4.5.2 HTK Dictionary

Simon can import HTK lexica. One example of an HTK lexicon is the English Voxforge dictionary. HTK dictionaries provide pronunciation information but no categories. All words will be assigned to the category Unknown.

4.4.5.3 PLS Dictionary

Simon can import PLS dictionaries. One example of a PLS dictionary is the German GPL dictionary from Voxforge. PLS dictionaries provide pronunciation information but no categories. All words will be assigned to the category Unknown.

4.4.5.4 SPHINX Dictionary

Simon can import SPHINX dictionaries. One example of a SPHINX dictionary is this dictionary for Mexican Spanish. SPHINX dictionaries provide pronunciation information but no categories. All words will be assigned to the category Unknown.

4.4.5.5 Julius Dictionary

Simon can import Julius vocabularies. One example of Julius vocabularies is the word lists of Simon 0.2. Julius dictionaries provide pronunciation information as well as category information.

4.4.6 Create language profile

Here, you can build a language profile from your shadow dictionary.

After selecting Create profile, Simon will analyze your current shadow dictionary and try to deduce the transcription rules from it. This is generally a very lengthy process and can, depending on the size of your shadow dictionary, take up to several hours. The created profile will be selected automatically after the process completes.

4.5 Grammar

Simon provides an easy-to-use, text-based interface to change the grammar. You can simply list all the allowed sentences (without any punctuation marks) as described above.


When you select a sentence on the left, the right pane will automatically show possible real sentences built from the words of your vocabulary. The example section will list at most 35 examples, so if more sentences than that match the selected grammar entry, the list might not be complete.

4.5.1 Import a Grammar

In addition to simply entering your desired grammar sentence by sentence, Simon is able to automatically deduce allowed grammar structures by reading plain text using the Import Grammar wizard.


Simon can read and import text files but also provides an input field if you want to simply type the text into Simon. Say we have a vocabulary like in the general section above:

Word       Category
Computer   Trigger
Internet   Command
Mail       Command
close      Command

Table 4.5: Improved Sample Vocabulary

We want Simon to recognize the sentence Computer Internet!. So we either enter the text using the Import text option or create a simple text file with the content Computer Internet! (any punctuation mark would work) and save it as simongrammar.txt to use the Import files option.


Simon will then read the entered text or all the given text files (in this case the only given text file is simongrammar.txt) and look up every single word in both the active and the shadow dictionary (the definition in the active dictionary takes precedence if the word is available in both). It will then replace the word with its category.

In our example this would mean that Simon would find the sentence Computer Internet. Simon would find out that Computer is of the category Trigger and Internet of the category Command. Because of this, Simon would learn that Trigger Command is a valid sentence and add it to its grammar.

The import automatically segments the input text by punctuation marks (., -, !, etc.) so any natural text should work. The importer will automatically merge duplicate sentence structures (even across different files) and add multiple sentences (all possible combinations) when a word has multiple categories assigned to it.

The import will ignore sentences where one or more words could not be found in the language model unless you tick the Also import unknown sentences check box, in which case those words are replaced with Unknown.
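The import steps described above can be sketched in a few lines of Python. This is a simplified illustration, not Simon's importer: the function import_grammar, the dictionary layout (word mapped to a list of categories) and the exact set of punctuation marks used for segmentation are assumptions.

```python
import re
from itertools import product


def import_grammar(text, active, shadow, import_unknown=False):
    """Deduce grammar structures from plain text.

    `active` and `shadow` map a word to the list of its categories.
    """
    structures = []
    # Segment the text at punctuation marks and line breaks.
    for sentence in re.split(r"[.\-!?\n]+", text):
        words = sentence.split()
        if not words:
            continue
        options = []
        for word in words:
            # The active dictionary takes precedence over the shadow dictionary.
            categories = active.get(word) or shadow.get(word)
            if not categories:
                if not import_unknown:
                    options = None  # skip the whole sentence
                    break
                categories = ["Unknown"]
            options.append(categories)
        if options is None:
            continue
        # Add every combination of categories, merging duplicates.
        for combination in product(*options):
            structure = " ".join(combination)
            if structure not in structures:
                structures.append(structure)
    return structures


active = {"Computer": ["Trigger"], "Internet": ["Command"]}
shadow = {"Mail": ["Command", "Noun"]}

print(import_grammar("Computer Internet! Computer Internet.", active, shadow))
# ['Trigger Command']
print(import_grammar("Computer Mail", active, shadow))
# ['Trigger Command', 'Trigger Noun']
```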

4.5.2 Renaming Categories

The rename category wizard allows you to rename categories in your active vocabulary, your shadow dictionary and the grammar.

4.5.3 Merging Categories

The merge category wizard allows you to merge two categories into one new category in your active vocabulary, your shadow dictionary and the grammar.


This functionality is especially useful if you want to simplify your grammar structures.

4.6 Training

Using the Training module, you can improve your acoustic model. The interface lists all installed training texts in a table with three columns:

• Name: A descriptive name for the text.
• Pages: The number of pages the text consists of. Each page represents one recording.
• Recognition Rate: Analogous to the vocabulary; represents how likely Simon will recognize the words (higher is better). The recognition rate of the training text is the average recognition rate of all the words in the text.

To improve the acoustic model - and thus the recognition rate - you have to record training texts. This gives Simon the two parts it needs:

• Samples of your speech
• Transcriptions of those samples

The active dictionary is used to transcribe the words that make up the text (mapping them from the actual word to its phonetic transcription), so every word contained in the training text you want to read (train) has to be contained in your active dictionary. Simon will warn you if this is not the case and provide you with the possibility to add all the missing words in one go.


The procedure is the same as when adding a single word, but the wizard will prompt you for details and recordings for all the missing words automatically. This procedure can be aborted at any time and Simon will provide both a way to add the already completely defined words and to undo all changes done so far. When you have added all the words you are prompted for (all the missing words), the changes to the active dictionary / vocabulary are saved and the training of the previously selected text starts automatically.

The training (reading) of the training text works exactly the same as the initial training when adding a new word. Make sure you follow the guidelines listed in the recording section.

4.6.1 Storage Directories

Training texts are stored in two different locations:

• Linux: ~/.kde/share/apps/simon/texts
  Windows: %appdata%\.kde\share\apps\simon\texts
  The texts of the current user. Can be deleted and added with Simon (see below).

• Linux: `kde4-config --prefix`/share/apps/simon/texts
  Windows: (install folder)\share\apps\simon\texts
  System-wide texts. They will appear on every user account using Simon on this machine and cannot be deleted from within Simon because of the permission restrictions on system-wide files. This folder can be used by system administrators to provide a common set of training texts for all the users on one system.

The XML files (one for each text) can just be moved from one location to the other, but this will most likely require admin privileges.

4.6.2 Adding Texts

The add texts wizard provides a simple way to add new training texts to Simon. When importing text les, Simon will automatically try to recognize individual sentences and split the text into appropriate pages (recordings). The algorithm treats text between normal punctuation (., !, ?, ..., ,...) and line breaks as sentences. Each sentence will be on its own page. Simon supports two different sources for new training texts.

4.6.2.1 Add training texts

Simply enter the training text in an input field.

4.6.2.2 Local text files

Simon can import normal text files to use them as training texts.
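The splitting of imported text into pages, as described at the beginning of section 4.6.2, might look roughly like the following sketch. It assumes that periods, exclamation marks, question marks, commas and line breaks end a sentence and that the punctuation itself is dropped; Simon's actual algorithm may differ in both respects, and the function name split_into_pages is invented for this example.

```python
import re


def split_into_pages(text):
    """Split a training text into pages, one sentence per page."""
    parts = re.split(r"[.!?,\n]+", text)
    # Drop surrounding whitespace and empty fragments.
    return [part.strip() for part in parts if part.strip()]


print(split_into_pages("Hello world. How are you?\nFine"))
# ['Hello world', 'How are you', 'Fine']
```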

4.6.3 On-The-Fly Training

In addition to training texts, Simon also allows you to train individual words or word combinations from your dictionary on-the-fly. This feature is located in the vocabulary menu of Simon.

Select the words to train from the vocabulary on the left and simply drag them to the selection list to the right (you could also select them in the table on the left and add them by clicking Add to Training). Start the training by selecting Train selected words. The training itself is exactly the same as if it were a pre-composed training text.


If there are more than 9 words to train, Simon will automatically split the text evenly across multiple pages. Of course you are free to add words from the shadow lexicon to the list of words to train, but Simon will prompt you to add the words before the training starts, just like it would if you trained a text that contains unknown words (see above).
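One plausible reading of "split evenly" is shown in the sketch below: use as few pages as possible while keeping each page at 9 words or fewer, and make the pages as equal in length as possible. Both the function paginate and this interpretation are assumptions; the handbook does not specify the exact rule.

```python
import math


def paginate(words, max_per_page=9):
    """Distribute words evenly over the smallest possible number of pages."""
    if not words:
        return []
    page_count = math.ceil(len(words) / max_per_page)
    base, extra = divmod(len(words), page_count)
    pages, start = [], 0
    for index in range(page_count):
        # The first `extra` pages take one additional word.
        size = base + (1 if index < extra else 0)
        pages.append(words[start:start + size])
        start += size
    return pages


words = [f"word{number}" for number in range(10)]
print([len(page) for page in paginate(words)])  # [5, 5]
```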

4.7 Context

Simon includes a context layer that lets Simon automatically adjust its configuration depending on its context. For example, you could set up Simon to only allow commands like New tab if Mozilla Firefox is running and is the currently active window. There are three major areas that contextual information can influence:

• Scenario selection
• Sample groups
• Active microphones

4.7.1 Scenario selection

Scenarios can specify to only be active during certain contextual situations. If these situations are not met, Simon will temporarily deactivate the affected scenario.


The local context conditions of this scenario are shown in the list of Activation Requirements and can be added, edited and deleted through the respective buttons. The context conditions respect a possible hierarchy of scenarios: The activation requirements of all direct or indirect parent scenarios also apply to the child scenario(s). This condition inheritance is shown on the right side. The Simon main window also shows a list of currently used scenarios. Scenarios that are deactivated because of their activation requirements (context conditions) are listed in light gray and italic. The screenshot below, for example, shows a temporarily deactivated Amarok scenario.

The same visual hints (gray, italic font for unmet activation criteria) also apply to the individual context conditions in the context menu.

4.7.2 Sample groups

Every sample recorded with Simon is assigned a sample group. Sample groups can be configured to only be used for the building of the acoustic models if certain contextual conditions are met. If this is not the case, all samples tagged with the deactivated sample group will be temporarily removed from the training corpus. For more information, an example use case and instructions on how to work with sample groups, please refer to the section on sample groups.

4.7.3 Context conditions

In Simon, context is monitored through a set of context condition plugins. In general, context conditions are combined through an and association. For example, if the activation of a resource is bound by two conditions A and B, it will only be activated if both A and B see their conditions met. To instead model alternatives (A or B or both), use an Or Condition Association.

All conditions can optionally be inverted. Inverting a condition means that it will evaluate to true if it would otherwise evaluate to false and vice versa.

4.7.3.1 Active window

True, if the title of the currently active foreground window matches the provided window title.

4.7.3.2 D-Bus

The D-Bus condition plugin allows you to monitor third-party applications that export state information on D-Bus. The monitored application needs to provide two things: a signal that notifies of changes and a method that returns the current state.


The screenshot above, for example, configures a D-Bus condition that will evaluate to true while the music player Tomahawk is playing and to false otherwise.

4.7.3.3 Face detection

The face detection condition will evaluate to true if Simon's vision layer has identified a person sitting in front of the configured webcam.

4.7.3.4 File content

This condition plugin will return true if the given file contains the provided content. The file will be monitored for changes.

4.7.3.5 Lip detection

The lip detection condition will evaluate to true if Simon's vision layer has identified a person who is sitting in front of the configured webcam and is speaking (lip movements).


The lip detection training will try to determine the optimal sensitivity of the detection by monitoring your lip movements. For better accuracy of the lip detection condition, stop the training when the sensitivity value on the slider becomes almost constant.

4.7.3.6 Or condition association

The or condition association allows you to configure a meta-condition that reports to be satisfied as soon as one or more of its child conditions evaluate to true. Or condition associations can have an arbitrary number of child conditions, which may themselves be or condition associations.

4.7.3.7 Process opened

Is satisfied if there is a running process with the provided executable name.
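The way conditions are combined in section 4.7.3 (and association by default, optional inversion, nested or condition associations) can be modelled with a small sketch. The class and function names below are made up for illustration and do not correspond to Simon's plugin API; the lambda checks stand in for real plugins such as Process opened or D-Bus.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Condition:
    check: Callable[[], bool]   # stand-in for a condition plugin
    inverted: bool = False

    def is_satisfied(self) -> bool:
        # An inverted condition flips the result of the check.
        return self.check() != self.inverted


@dataclass
class OrAssociation:
    children: List = field(default_factory=list)
    inverted: bool = False

    def is_satisfied(self) -> bool:
        # Satisfied as soon as one or more children are satisfied.
        return any(child.is_satisfied() for child in self.children) != self.inverted


def is_active(conditions) -> bool:
    """Top-level conditions are combined through an and association."""
    return all(condition.is_satisfied() for condition in conditions)


firefox_running = Condition(lambda: True)     # hypothetical state
tomahawk_playing = Condition(lambda: False)   # hypothetical state

print(is_active([firefox_running, tomahawk_playing]))                   # False
print(is_active([OrAssociation([firefox_running, tomahawk_playing])]))  # True
print(is_active([Condition(lambda: False, inverted=True)]))             # True
```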

4.8 Commands

When Simon is active and recognizes something, the recognition result is given to the loaded command plug-ins (in order) for processing.

The command system can be compared with a group of factory workers. Each one of them knows how to perform one task (e.g. Karl knows how to start a program and Joe knows how to open a folder, etc.). Whenever Simon recognizes something, it is given to Karl, who then checks if this instruction is meant for him. If he doesn't know what to do with it, it is handed over to Joe and so on. If none of the loaded plugins knows how to process the input, it is ignored. The order in which the recognition result is given to the individual commands (people) is configurable in the command options (Commands > Manage plugins).

Each plugin can be associated with a trigger. Using triggers, the responsibility of each plugin can easily be divided. Using the factory workers abstraction from above, it could be compared to stating the name of the person you want to process your request. So instead of "Open my home folder" you say "Joe, open my home folder" and Joe (the plugin responsible for opening folders) will instantly know that the request is meant for him.

In practice you could have commands like the executable command Firefox to open the popular browser and the place command Google to open the web search engine. If you assign the trigger Start to the executable plugin and the trigger Open to the place plugin, you would have to say "Start Firefox" (instead of just "Firefox" if you don't use a trigger for the executable plugin) and "Open Google" to open the search engine (instead of just "Google").

Triggers are of course not a requirement and you can easily use Simon without defining any plugin triggers (although many plugins come with a default trigger of Computer set, which you would have to remove). But even if you use just one trigger for all your commands (like Computer, to say "Computer, Firefox" and "Computer, Google"), it has the advantage of greatly limiting the number of false positives.

Simon's command dialog displays the complete phrase associated with a command in the upper right corner of the command configuration.

You can load multiple instances of one plugin even in one scenario. Each instance can of course also have a different plugin trigger.

Each command has a name (which will trigger its invocation), an icon and more fields depending on the type of the plugin (see below).

Some command plugins might provide a configuration of the plugin itself (not the commands it contains). These configuration pages will be plugged directly into the action configuration dialog (below the General menu item) when you load the associated plugin.

Plugins that provide a graphical user interface (like for example the input number command plugin) can be configured by configuring Voice commands. You can, for example, change the associated word that will trigger the button, but also change the displayed icon, etc. If you remove all voice interface commands from a graphical element, the element will be hidden automatically. Voice interface commands are added just like normal commands through the command configuration.
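The hand-over between plugins and the role of triggers can be sketched as follows. This is a simplified model, not Simon's plugin interface: the class Plugin, the function dispatch and the action strings are invented for this example, and trigger matching is reduced to a simple prefix check.

```python
class Plugin:
    def __init__(self, name, trigger, commands):
        self.name = name
        self.trigger = trigger      # e.g. "Start"; an empty string means no trigger
        self.commands = commands    # command name -> description of the action

    def process(self, result):
        """Return a description of the action, or None if not responsible."""
        if self.trigger:
            prefix = self.trigger + " "
            if not result.startswith(prefix):
                return None
            result = result[len(prefix):]
        action = self.commands.get(result)
        return None if action is None else f"{self.name}: {action}"


def dispatch(result, plugins):
    """Hand the recognition result to each plugin in order."""
    for plugin in plugins:
        outcome = plugin.process(result)
        if outcome is not None:
            return outcome
    return None  # no plugin knew what to do; the input is ignored


plugins = [
    Plugin("executable", "Start", {"Firefox": "launch the browser"}),
    Plugin("place", "Open", {"Google": "open the web search"}),
]

print(dispatch("Start Firefox", plugins))  # executable: launch the browser
print(dispatch("Open Google", plugins))    # place: open the web search
print(dispatch("Firefox", plugins))        # None
```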

To add a new interface command to a function, just select the action you want to associate with a command, click Create from Action template and adapt the resulting command to your needs. Some plugins (for example the desktop grid or the calculator) might also provide a menu item in the Actions menu.

Scenarios can optionally define one command that will immediately be run when the scenario is initialized. If you require more than one command to run automatically, consider the use of a composite command.


Command triggers can contain placeholders in the form of %<index>, which stands for any single word, or %%<index>, which stands for one or more words. For example, the recognition result "Next window" will be matched by the triggers "Next %1", "Next %%1" and "%%1", but not by the triggers "%1", "Next window %1" or "%%1 Next window".
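The matching rule can be illustrated with a small sketch. This is not Simon's implementation; it only assumes that %<index> stands for exactly one word, that %%<index> stands for one or more words, and that a trigger has to cover the whole recognition result. The function names are made up for this example.

```python
import re


def trigger_to_regex(trigger):
    """Translate a trigger with %N / %%N placeholders into a compiled regex."""
    parts = []
    for token in trigger.split():
        if re.fullmatch(r"%%\d+", token):
            parts.append(r"\S+(?:\s+\S+)*")  # one or more words
        elif re.fullmatch(r"%\d+", token):
            parts.append(r"\S+")             # exactly one word
        else:
            parts.append(re.escape(token))   # literal word
    return re.compile(r"\s+".join(parts))


def matches(trigger, result):
    """Return True if the whole recognition result is covered by the trigger."""
    return trigger_to_regex(trigger).fullmatch(result.strip()) is not None


for trigger in ["Next %1", "Next %%1", "%%1", "%1", "Next window %1", "%%1 Next window"]:
    print(trigger, "->", matches(trigger, "Next window"))
```

Under these assumptions the first three triggers match "Next window" and the last three do not, which is the behaviour described above.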

4.8.1 Executable Commands

Executable commands are associated with an executable file (Program) which is started when the command is invoked.

Arguments to the commands are supported. If either the path to the executable or any of the parameters contains spaces, it must be wrapped in quotes.
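Why the quotes matter can be seen with a minimal splitter. The sketch below is only an illustration of space-separated command lines with double-quoted parts; how Simon itself tokenises the Executable line is not documented here, and split_command_line is a hypothetical helper.

```python
def split_command_line(line):
    """Split on spaces, keeping double-quoted runs together (quotes removed)."""
    parts, current, in_quotes = [], "", False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes      # toggle quoted mode, drop the quote
        elif ch == " " and not in_quotes:
            if current:                    # a space ends the current part
                parts.append(current)
                current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


quoted = r'"C:\Program Files\Mozilla Firefox\firefox.exe" "C:\test file.html"'
unquoted = r'C:\Program Files\Mozilla Firefox\firefox.exe C:\test file.html'

print(split_command_line(quoted))    # program path and one argument
print(split_command_line(unquoted))  # path and argument fall apart at every space
```

With quotes the line splits into the program path and a single argument; without them every space starts a new part, so neither the program nor the file would be found.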


Given the executable file C:\Program Files\Mozilla Firefox\firefox.exe and the local HTML file C:\test file.html, the correct line for the Executable would be: "C:\Program Files\Mozilla Firefox\firefox.exe" "C:\test file.html".

The working folder defines where the process should be launched from. Given the working folder C:\folder, the command "C:\Program Files\Mozilla Firefox\firefox.exe" file.html would cause Firefox to search for the file C:\folder\file.html. The working folder usually does not need to be set and can be left blank most of the time.

4.8.1.1 Importing Programs

For even easier configuration Simon provides an import dialog which allows you to select programs directly from the KDE menu.

NOTE: This option is not available on Microsoft Windows.

The dialog will list all programs that have an entry in your KDE menu in their respective category. Sub-categories are not supported and are thus listed on the same level as top-level categories. Just select the program you wish to start with Simon and press Ok. The correct values for the executable and the working folder as well as an appropriate command name and description will automatically be filled out for you.

4.8.2 Place Commands

With place commands you can allow Simon to open any given URL. Because Simon just hands the address over to the platform's URL handler, special protocols like remote:/ (on Linux/KDE) or even KDE's Web-Shortcuts are supported. Instead of folders, files can also be set as the command's URL. In that case the file is opened with its associated application when the command is invoked.


To associate a specific URL with the command you can manually enter it in the URL field (select Manual first) or import it with the import place wizard.

4.8.2.1 Importing Places

The import place dialog allows you to easily create the correct URL for the command. To add a local folder, select Local Place and choose the folder or file with the file selector.

To add a remote URL (HTTP, FTP, etc.) choose Remote URL.


Please note that for URLs with authentication information the password will be stored in clear text.

4.8.3 Shortcut Commands

Using shortcut commands the user can associate commands with key combinations. The command will simulate keyboard input to trigger shortcuts like Ctrl-C or Alt-F4. The plugin can press, release or press and release the configured key combination.


To select the shortcut you wish to simulate just toggle the shortcut button and press the key combination on your keyboard. Simon will capture the shortcut and associate it with the command. Due to technical limitations there are several shortcuts on Microsoft Windows that cannot be captured by Simon (this includes e.g. Ctrl-Alt-Del and Alt-F4). These special shortcuts can be selected from a list below the aforementioned shortcut button.

NOTE: This selection box is not visible in the screenshot above as the list is only displayed in the Microsoft Windows version of Simon.

4.8.4 Text-Macro Commands

Using text-macro commands, the user can associate text with a command. When the command is invoked, the associated text will be written by simulating keystrokes.

4.8.5 List Commands

The list command is designed to combine multiple commands (all types of commands are supported) into one list. The user can then select the n-th entry by saying the associated number (1-9). This is very useful to limit the amount of training required and provides the possibility to keep the vocabulary to a minimum.


List commands are especially useful when using commands with difficult triggers or commands that can be grouped under a general theme. A typical example would be a command "Startmenu" to present a list of programs to launch. That way the specific executable commands can still retain very descriptive names (like "OpenOffice.org Writer 3.1") without the user having to include these words in the vocabulary and consider them in the grammar just to trigger them. Commands of different types can of course be mixed.

4.8.5.1 List Command Display

When invoked, the command will display the list centered on the screen. The list will automatically expand to accommodate its items.


The user can invoke the commands contained in the list by simply saying their associated number (in this example: "One" to launch Mozilla Firefox). While a list command is active (displayed), all input that is not directed at the list itself (other commands, etc.) will be rejected. The process can be canceled by pressing the Cancel button or by saying "Cancel". If there are more than 9 items, Simon will add Next and Back options to the list ("Zero" will be associated with Back and "Nine" with Next).
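The numbering of long lists can be sketched as follows. The handbook only states that "Zero" maps to Back and "Nine" to Next once a list has more than 9 items. That the remaining numbers 1 to 8 hold the items of the current page, and that both navigation entries are shown on every page, are assumptions made for this illustration. list_page is a hypothetical helper, not part of Simon.

```python
def list_page(items, page=0):
    """Return {spoken number: label} for one page of a list command."""
    if len(items) <= 9:
        # Short lists fit on one page and use the numbers 1-9 directly.
        return {n: label for n, label in enumerate(items, start=1)}
    per_page = 8                       # assumed: 1-8 hold items, 0 and 9 navigate
    start = page * per_page
    entries = {0: "Back"}
    for n, label in enumerate(items[start:start + per_page], start=1):
        entries[n] = label
    entries[9] = "Next"
    return entries


programs = ["Program %d" % i for i in range(1, 13)]   # 12 entries, needs paging
print(list_page(programs, page=0))
print(list_page(programs, page=1))
```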


4.8.5.2 Configuring list elements

By default the list command uses the following trigger words. To use list commands to their full potential, make sure that your language and acoustic model contain and allow for the following sentences: Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Cancel.

Of course you can also configure these words in your Simon configuration:

Commands > Manage plugins > General > Lists for the scenario-wide list configuration.
Settings > Configure Simon... > Actions > Lists for the global configuration. When creating a new scenario, the scenario configuration will be initialized with a copy of this list configuration.

List commands are also used internally by other plugins, for example the desktop grid. The configuration of the triggers also affects their displayed lists.

4.8.6 Composite Commands

Composite commands allow the user to group multiple commands into a sequence. When invoked, the commands will be executed in order. Delays between commands can be inserted.

Composite commands can also work as transparent wrappers by selecting Pass recognition result through to other commands. In that case, the recognition result will be treated as unprocessed even if the composite command was executed. For example, suppose you have a command to turn on the light in one scenario. In addition to turning on the light, you now want to add some kind of reporting to the activity by invoking a script through a program plugin. You could then set up a reporting scenario that contains a transparent composite command with the same trigger as the command to turn on the light, and make sure that this scenario is placed before the original one in the scenario list. You can then activate and deactivate the reporting simply by loading and unloading this scenario.


Using the composite command the user can compose complex macros. The screenshot above, for example, does the following:

1. Start Kopete (Executable Command)
2. Wait 2000 ms for Kopete to be started
3. Type "Mathias" (Text-Macro Command), which will select Mathias in my contact list
4. Press Enter (Shortcut Command)
5. Wait 1000 ms for the chat window to appear
6. Write "Hi!" (Text-Macro Command); the text associated with this command contains a newline at the end so that the message will be sent
7. Press Alt-F4 (Shortcut Command) to close the chat window
8. Press Alt-F4 (Shortcut Command) to close the Kopete main window
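Conceptually, such a macro is an ordered list of steps with optional delays. The following sketch models the Kopete example as plain data and replays it through a caller-supplied function. The step names, the tuple layout and run_composite are illustrative assumptions and do not reflect Simon's scenario file format.

```python
import time

# Each step is (kind, value); "delay" values are milliseconds.
KOPETE_MACRO = [
    ("executable", "kopete"),
    ("delay", 2000),
    ("text", "Mathias"),
    ("shortcut", "Enter"),
    ("delay", 1000),
    ("text", "Hi!\n"),          # trailing newline sends the message
    ("shortcut", "Alt+F4"),     # close the chat window
    ("shortcut", "Alt+F4"),     # close the main window
]


def run_composite(steps, execute, sleep=time.sleep):
    """Run the steps in order; delays pause, everything else goes to execute()."""
    for kind, value in steps:
        if kind == "delay":
            sleep(value / 1000.0)
        else:
            execute(kind, value)


# Dry run: print the steps instead of acting on them and skip the real waiting.
run_composite(KOPETE_MACRO,
              execute=lambda kind, value: print(kind, repr(value)),
              sleep=lambda seconds: print("wait", seconds, "s"))
```

Passing the sleep function in as a parameter keeps the sketch easy to try out without actually waiting three seconds.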

4.8.7 Desktop grid

The desktop grid allows the user to control the mouse by voice.


The desktop grid divides the screen into nine fields which are numbered from 1 to 9. Saying one of these numbers divides the selected field into nine fields again, numbered from 1 to 9 as before, and so on. This is repeated three times. After the fourth selection the desktop grid will be closed and Simon will click in the middle of the selected area. The exact click action is configurable but defaults to asking the user, in which case you will be presented with a list of possible click modes. When selecting Drag and Drop, the desktop grid will be displayed again to select the drop point.
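The repeated subdivision amounts to a short calculation. The sketch below assumes that the nine fields are numbered row by row starting at the top left (the numbering layout is not stated in the text) and that the screen origin is the top-left corner. grid_click_point is a hypothetical helper.

```python
def grid_click_point(width, height, selections):
    """Return the (x, y) centre of the field reached by the spoken numbers."""
    x, y, w, h = 0.0, 0.0, float(width), float(height)
    for number in selections:
        row, col = divmod(number - 1, 3)   # assumed layout: 1 2 3 / 4 5 6 / 7 8 9
        w, h = w / 3, h / 3                # each selection shrinks the area to 1/9
        x, y = x + col * w, y + row * h
    return (x + w / 2, y + h / 2)


# Four selections on an 810 x 810 area leave a 10 x 10 field.
print(grid_click_point(810, 810, [5, 5, 5, 5]))
print(grid_click_point(810, 810, [9, 1, 5, 3]))
```

After four selections the remaining field measures 1/81 of the screen width and height, which on common screen sizes is small enough to hit most buttons.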

While the desktop grid is active (displayed), all input that is not directed at the desktop grid itself (other commands, etc.) will be rejected. Say "Cancel" at any time to abort the process. The desktop grid plugin registers a configuration screen right in the command configuration when it is loaded.


The trigger that invokes the desktop grid is of course completely configurable. Moreover, the user can use real or fake transparency. If your graphical environment allows for compositing effects (desktop effects), you can safely use real transparency, which will make the desktop grid transparent. If your platform does not support compositing, Simon will simulate transparency by taking a screenshot of the screen before displaying the desktop grid and displaying that picture behind the desktop grid. If the desktop grid is configured to use real transparency and the system does not support compositing, it will display a solid gray background. However, nearly all up-to-date systems support compositing (real transparency). This includes:

Microsoft Windows 2000 or higher (XP, Vista, 7)
GNU/Linux using a composite manager like Compiz, KWin4, xcompmgr, etc.

By default the desktop grid uses numbers to select the individual fields. To use the desktop grid, make sure that your language and acoustic model contain and allow for the following sentences: One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Cancel.


To configure these triggers, just configure the commands associated with the plugin.

4.8.8 Input Number

Using the input number plugin the user can input large numbers easily. Using the Dictation or the Text-Macro plugin one could associate the numbers with their digits and use that as an input method. However, to input larger numbers there are two ways, both of which have significant disadvantages:

Adding the words "eleven", "twelve", etc.
While this seems like the most elegant solution, as it would enable the user to say "fivehundredseventytwo", it would be quite a problem to add all these words, let alone train them. What about "twothousandninehundredtwo"? Where to stop?

Spelling out the number using the individual digits
While this is not as elegant as stating the complete number, it is much more practical. However, many applications (like the mouseless browsing Firefox add-on) rely on the user to input large numbers without too much time passing between the individual keystrokes (mouseless browsing, for example, waits exactly 500 ms by default before it considers the input of the number complete). So if you want to enter 52 you would say "Five" (pause) "Two". Because of the needed pause, the application would consider the input complete after "Five".

The input number plugin, when triggered, presents a calculator-like interface for inputting a number. The input can be corrected by saying "Back". It features a decimal point accessible by saying "Comma". When saying "Ok" the number will be typed out. As all the voice input and the correction are handled by the plugin itself, the application that finally receives the input will only see a couple of milliseconds between the individual digits.


While the input number plugin is active (the user is currently inputting a number), all input that is not directed at the input number plugin (other commands, etc.) will be rejected. Say "Cancel" at any time to abort the process. As no command instances can be created for this plugin, it is not listed in the New Command dialog. However, the input number plugin registers a configuration screen right in the command configuration when it is loaded.

The trigger defines the word or phrase that will trigger the display of the interface. By default the input number plugin uses numbers to select the individual digits and a couple of control words. To use the input number plugin, make sure that your language and acoustic model contain and allow for the following sentences: Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine, Back, Comma, Ok, Cancel.

To configure these triggers, just configure the commands associated with the plugin.
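The way the plugin buffers the spoken words and only types the finished number can be sketched like this. It uses the default trigger words listed above. The character typed for "Comma" and the rule that only one decimal separator is accepted are assumptions for illustration, and input_number is not a function of Simon.

```python
DIGITS = {
    "Zero": "0", "One": "1", "Two": "2", "Three": "3", "Four": "4",
    "Five": "5", "Six": "6", "Seven": "7", "Eight": "8", "Nine": "9",
}


def input_number(words, decimal_point=","):
    """Collect spoken words; return the text to type on "Ok", None otherwise."""
    buffer = ""
    for word in words:
        if word in DIGITS:
            buffer += DIGITS[word]
        elif word == "Comma":
            if decimal_point not in buffer:    # assumed: at most one separator
                buffer += decimal_point
        elif word == "Back":
            buffer = buffer[:-1]               # correct the last entry
        elif word == "Ok":
            return buffer                      # typed out in one go
        elif word == "Cancel":
            return None
    return None                                # input never confirmed


print(input_number(["Five", "Three", "Back", "Two", "Ok"]))
```

Because nothing is sent before "Ok", pauses between the spoken digits never reach the receiving application.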

4.8.9 Dictation

The dictation plugin writes the recognition result it gets using simulated keystrokes. Assuming you didn't define a trigger for the dictation plugin, it will accept all recognition results and just write them out. The written input will be considered as processed input and thus not be relayed to other plugins. This means that if you loaded the dictation plugin and defined no trigger for it, all plugins below it in the Selected Plug-Ins list in the command configuration will never receive any input. As no command instances can be created for this plugin, it is not listed in the New Command dialog. The dictation plugin can be configured to append text after recognition results, for example to add a space after each recognized word.


4.8.10 Artificial Intelligence

The Artificial Intelligence is a just-for-fun plugin that emulates a human conversation. Using the text-to-speech system, the computer can talk with the user. The plugin uses AIML sets for the actual intelligence. Most AIML sets should be supported. The popular A. L. I. C. E. bot and a German version work and are shipped with the plugin.

The plugin registers a configuration screen in the command configuration menu where you can choose which AIML set to load.


Simon will look for AIML sets in the following folder:

GNU/Linux: `kde4-config --prefix`/share/apps/ai/aimls/
Microsoft Windows: [installation folder (C:\Program Files\simon 0.2\ by default)]\share\apps\ai\aimls\

To add a new set, just create a new folder with a descriptive name and copy the .aiml files into it. To adjust your bot's personality, have a look at the bot.xml and vars.xml files in the following folder:

GNU/Linux: `kde4-config --prefix`/share/apps/ai/util/
Microsoft Windows: [installation folder (C:\Program Files\simon 0.2\ by default)]\share\apps\ai\util\

As no command instances can be created for this plugin, it is not listed in the New Command dialog. It is recommended not to use any trigger for this plugin to provide a more natural feel for the conversation.

4.8.11 Calculator

The calculator plugin is a simple, voice-controlled calculator.

The calculator extends the Input Number plugin by providing additional features. When loading the plugin, a configuration screen is added to the plugin configuration.


There you can also configure the control mode of the calculator. Setting the mode to anything other than Full calculator will hide options from the displayed widget.

However, in contrast to removing all associated commands from the functions, the hidden controls will still react to the configured voice commands. When selecting Ok, the calculator will by default ask you what to do with the generated result. You can, for example, output the calculation, the result, both, etc. Instead of selecting this from the displayed list every time after pressing the Ok button, you can also set it in the configuration options.


4.8.12 Filter

Using the filter plugin, you can intercept recognition results and prevent them from being passed on to further command plugins. Using this plugin you can, for example, disable the recognition by voice. The filter command plugin registers a configuration screen in the command configuration where you can change what results should be filtered.

The pattern is a regular expression that will be evaluated each time the plugin receives a recognition result for processing.


The plugin also registers voice interface commands for activating and deactivating the filter. In total, the filter therefore has three states:

Inactive
The default state. All recognition results will be passed through.

Half-active (if Two stage activation is selected)
If the next command is the "Deactivate filter" command, the filter will enter the Inactive state. If, however, the next result is something else and Relay results in stage one of two stage activation is selected, this result will be passed on to other plugins. The filter will reset to Active afterwards.

Active
When activated, the filter will eat all results that match the configured pattern. By default this means every result that Simon recognizes will be accepted by the filter and therefore not relayed to any of the plugins following the filter plugin.

If Two stage activation is enabled and the filter plugin receives the command to directly enter the Inactive state, this command is ignored. In other words: if two stage activation is enabled, the filter can only be disabled by going through the intermediate stage.
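The three states can be written down as a small state machine. The sketch below is an interpretation of the description above and not Simon's code. The command phrases "Activate filter" and "Filter stage one" are placeholders (only "Deactivate filter" is named in the text), the default pattern ".*" is assumed to mean "match everything", and it is assumed that control commands themselves are never relayed.

```python
import re


class ResultFilter:
    ACTIVATE = "Activate filter"       # assumed phrase
    DEACTIVATE = "Deactivate filter"
    STAGE_ONE = "Filter stage one"     # hypothetical command entering Half-active

    def __init__(self, pattern=".*", two_stage=False, relay_in_stage_one=False):
        self.pattern = re.compile(pattern)
        self.two_stage = two_stage
        self.relay_in_stage_one = relay_in_stage_one
        self.state = "inactive"

    def swallows(self, result):
        """Return True if the result is eaten, False if it is passed on."""
        if self.state == "inactive":
            if result == self.ACTIVATE:
                self.state = "active"
                return True
            return False
        if self.state == "half-active":
            if result == self.DEACTIVATE:
                self.state = "inactive"
                return True
            self.state = "active"              # anything else resets to Active
            if self.relay_in_stage_one:
                return False
            return self.pattern.fullmatch(result) is not None
        # Active state
        if self.two_stage and result == self.STAGE_ONE:
            self.state = "half-active"
            return True
        if not self.two_stage and result == self.DEACTIVATE:
            self.state = "inactive"
            return True
        # With two stage activation a direct "Deactivate filter" is not obeyed;
        # it is treated like any other result and filtered by the pattern.
        return self.pattern.fullmatch(result) is not None


f = ResultFilter(two_stage=True, relay_in_stage_one=True)
for spoken in ["Firefox", "Activate filter", "Firefox", "Deactivate filter",
               "Filter stage one", "Deactivate filter", "Firefox"]:
    print(spoken, "->", "eaten" if f.swallows(spoken) else "passed on", "/", f.state)
```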

4.8.13 Pronunciation Training

The pronunciation training, when combined with a good static base model, can be a powerful tool to improve your pronunciation of a new language.

Essentially, the plugin will prompt you to say specific words. The recognition will then recognize your pronunciation of the word and compare it to your speech model, which should be a base model of native speakers for this to work correctly. Then Simon will display the recognition rate (how similar your version was to the stored base model). The closer to the native speaker, the higher the score.


The plugin adds an entry to your Commands menu to launch the pronunciation training dialog. The training itself consists of multiple pages. Each page contains one word fetched from your active vocabulary. The words are identified by a category which needs to be selected in the command configuration before starting the training.

4.8.14 Keyboard

The keyboard plugin displays a virtual, voice-controlled keyboard.

The keyboard consists of multiple tabs, each possibly containing many keys. The entirety of tabs and keys is collected in sets. You can select sets in the configuration but also create new ones from scratch in the keyboard command configuration.


Keys are usually mapped to single characters but can also hold long texts and even shortcuts. Because of this, keyboard sets can contain special keys like a "select all" key or a "Password" key (typing your password). Next to the tabs that hold the keys of your set, the keyboard may also show special keys like Ctrl, Shift, etc. Those keys are provided as voice interface commands and are displayed regardless of what tab of the set is currently active. As with all voice triggers, removing the associated command hides the buttons as well. Moreover, the keyboard provides a numpad that can be shown by selecting the appropriate option in the keyboard configuration.

Next to the number keys and the delete key for the number input field (Number backspace), the numpad provides two options on what to do with the entered number. When selecting Write number, the entered number will be written out using simulated key presses. Selecting Select number tries to find a key or tab in the currently active set that has this number as a trigger. This way you can control a complete keyboard just using numbers.

The keys on the numpad are configurable voice interface commands.

4.8.15 Dialog

The dialog plugin enables users to engage in a scripted dialog with Simon.

4.8.15.1 Dialog design

Simon treats dialogs as a succession of different states. Each state can have a text and several associated options.


Dialogs can have more than one text variant, one of which will be randomly picked when the dialog is displayed. This can help to make dialogs feel more natural by providing several alternative formulations. The texts can use bound values and template options.

Dialog options encapsulate the logic of the conversation. They are the active components of the dialog.

Similar to commands, dialog options have a name (trigger) that, when recognized while the dialog is active and in the option's parent state, will cause this option to activate. Alternatively, options can also be configured to trigger automatically after a set time period. This time is relative to when the state is entered.


Dialog options, when shown through the graphical output module, can show an arbitrary text (that will most likely be equivalent to the trigger but doesn't have to be) and, optionally, an icon. If the text-to-speech output module is used, the text (not the trigger) will be read aloud unless this is disabled by selecting the Silent option. Every state can also optionally have an avatar that will be displayed when using the graphical output module.

4.8.15.2 Dialog: Bound values

The text of dialog states can contain variables, so-called bound values, that will be filled in during runtime. For example, the dialog text "This is a $variable$" would replace $variable$ with the result of a bound value called "variable".
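The substitution itself can be pictured as a simple search and replace over $name$ markers. The sketch below is illustrative only: fill_bound_values is a made-up helper, values may be plain strings or callables evaluated at display time, and leaving unknown names untouched is an assumption rather than documented behaviour.

```python
import re


def fill_bound_values(text, bound_values):
    """Replace every $name$ in text with the value bound to that name."""
    def resolve(match):
        name = match.group(1)
        if name not in bound_values:
            return match.group(0)          # assumed: unknown names stay as-is
        value = bound_values[name]
        return str(value() if callable(value) else value)
    return re.sub(r"\$(\w+)\$", resolve, text)


print(fill_bound_values("This is a $variable$", {"variable": "test"}))
print(fill_bound_values("Hello $user$, it is $hour$ o'clock",
                        {"user": "Peter", "hour": lambda: 9}))
```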


There are four types of bound values:

Static
Static bound values will always be resolved to the same text. They are useful to provide configuration options to be filled in to personalize the dialog (e.g., the name of the user).

QtScript
QtScript bound values resolve to the result of the entered QtScript code.

Command arguments
If the dialog trigger command (the Simon command that initiates the dialog) uses placeholders, they can be accessed through command argument bound values. The Argument number refers to the index of the placeholder you want to access. For example, if your dialog is started with the command "Call %1", and "name" is a command argument bound value, then launching the dialog by recognizing "Call Peter" will turn the dialog text "Are you sure you want to call $name$?" into "Are you sure you want to call Peter?".

Plasma data engine
This type of bound value can readily access a wide array of high-level information through Plasma data engines.

4.8.15.3 Template options

Dialog texts can further be parametrized through template options.

These boolean values choose between different or optional text snippets.
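In the dialog text, a template option named "formal" is written as {{formal}}...{{endformal}} with an optional {{elseformal}} branch in between. A minimal sketch of how such markers could be expanded is shown here. It assumes that options are not nested, and apply_template_options is a hypothetical helper rather than Simon's implementation.

```python
import re


def apply_template_options(text, options):
    """Expand {{name}}...{{elsename}}...{{endname}} blocks for boolean options."""
    for name, enabled in options.items():
        n = re.escape(name)
        block = re.compile(
            r"\{\{" + n + r"\}\}(.*?)"              # text used when the option is true
            r"(?:\{\{else" + n + r"\}\}(.*?))?"     # optional else branch
            r"\{\{end" + n + r"\}\}",
            re.DOTALL,
        )

        def pick(match, enabled=enabled):
            if enabled:
                return match.group(1)
            return match.group(2) or ""             # no else branch: drop the block

        text = block.sub(pick, text)
    return text


TEXT = "Would you please {{formal}}be quiet{{elseformal}}shut up{{endformal}}"
print(apply_template_options(TEXT, {"formal": True}))
print(apply_template_options(TEXT, {"formal": False}))
```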


For example, the template option "formal" above would change the dialog text "Would you please {{formal}}be quiet{{elseformal}}shut up{{endformal}}" to "Would you please be quiet" or "Would you please shut up", depending on whether the template option is set to true or false. The else-path can be omitted if it is not required (e.g. "Would you {{formal}}please {{endformal}}be quiet").

4.8.15.4 Avatars

Every state can potentially show a different avatar. These images can range from the picture of a (simulated) speaker to an image of something topically appropriate.

To use an avatar, first add it here and later define where to use it in the dialog design section.

4.8.15.5 Output

Dialogs can be displayed graphically, use text-to-speech or combine both approaches.


The Separator to options will be spoken between the dialog text and the current state's options (if there are any). If there are no options for this state or all are configured to be silent, this will not be said. The option to listen to the whole announcement again is triggered by saying one of the configured Repeat on trigger phrases. Additionally, the text-to-speech output can optionally be configured to repeat the listing of the available options (including the configured separator) when the user says a command that does not match any of the available dialog options.

4.8.16 Akonadi

The Akonadi plugin allows Simon to plug into KDE's PIM infrastructure.


The plugin fulfills two major purposes:

Execute Simon commands at scheduled times
The Akonadi plugin can monitor a specific collection (calendar) and react to entries whose summary starts with a specific prefix. By default, this prefix is "[simon-command]", meaning that events of the form "[simon-command] <plugin name>//<command name>" will trigger the appropriate Simon command at the start time of the event. The names of the plugins and commands are equivalent to the ones shown in the command dialog and do not necessarily need to reference commands in the same scenario as the Akonadi plugin instance.

Show reminders for events in the given calendar
If configured to do so, the Akonadi plugin can show reminders for calendar events with a set alarm flag. These reminders will be shown through the Simon dialog engine.
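How an event summary maps to a command can be sketched as a small parser. This only illustrates the "<prefix> <plugin name>//<command name>" convention described above. parse_event_summary and the example plugin and command names are hypothetical, and the handling of surrounding whitespace is an assumption.

```python
def parse_event_summary(summary, prefix="[simon-command]"):
    """Return (plugin name, command name), or None for ordinary events."""
    if not summary.startswith(prefix):
        return None
    rest = summary[len(prefix):].strip()
    plugin, separator, command = rest.partition("//")
    if not separator or not plugin.strip() or not command.strip():
        return None                      # prefix present but no plugin//command
    return plugin.strip(), command.strip()


print(parse_event_summary("[simon-command] Executable//Firefox"))
print(parse_event_summary("Dentist appointment"))
```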

4.8.17 D-Bus

With the D-Bus command, Simon can call exported methods in 3rd party applications directly. The screenshot below, for example, calls the Pause method of the MPRIS interface of the Tomahawk music playing software.
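For comparison, the same kind of call can be made from a terminal with the dbus-send tool. The sketch below only builds the argument list and does not execute anything. It assumes MPRIS 2 naming (object path /org/mpris/MediaPlayer2, interface org.mpris.MediaPlayer2.Player). The bus name org.mpris.MediaPlayer2.tomahawk is a guess based on the MPRIS naming scheme and may differ from what the plugin configuration in the screenshot uses.

```python
def dbus_send_args(service, path, interface, method):
    """Build a dbus-send command line for a parameterless session-bus call."""
    return [
        "dbus-send", "--session", "--type=method_call", "--print-reply",
        "--dest=" + service, path, interface + "." + method,
    ]


PAUSE_TOMAHAWK = dbus_send_args(
    "org.mpris.MediaPlayer2.tomahawk",   # assumed bus name
    "/org/mpris/MediaPlayer2",
    "org.mpris.MediaPlayer2.Player",
    "Pause",
)
print(" ".join(PAUSE_TOMAHAWK))
```

On a system where dbus-send is installed and the player is running, the list could be passed to subprocess.run to try the call outside of Simon.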

4.8.18 JSON

Similar to the D-Bus command plugin, the JSON plugin allows Simon to contact 3rd party applications and invoke their functionality directly (instead of simulating user activity).


Chapter 5

Questions and Answers


In an effort to keep this section always up-to-date it is available at our online wiki.


Chapter 6

Credits and License


Simon

Program copyright 2006-2009 Peter Grasch peter.grasch@bedahr.org, Phillip Goriup, Tschernegg Susanne, Bettina Sturmann, Martin Gigerl

Documentation Copyright (c) 2009 Peter Grasch peter.grasch@bedahr.org

This documentation is licensed under the terms of the GNU Free Documentation License.

This program is licensed under the terms of the GNU General Public License.


Appendix A

Installation
Please see our wiki for install instructions.
