
Czech Technical University in Prague

Faculty of Electrical Engineering


Department of Computer Science and Engineering
Master's Thesis
An AIML Interpreter
Kim Sullivan
Supervisor: Ing. Miroslav Balík, Ph.D.
Study Programme: Electrical Engineering and Information Technology
Field of Study: Computer Science and Engineering
May 21, 2009
Acknowledgements
I would like to thank my supervisor, Ing. Balík, for his patience and for providing me with
advice and important materials to which I would otherwise not have had access, Dr. Richard
S. Wallace for creating AIML, and all the people on the alicebot mailing lists who have helped
me find the direction I wanted to pursue.
And last but not least, I would like to thank my mother, whom I can always depend on.
Without her never-ending support, I would never have managed to get so far.
Declaration
I declare that I have created this thesis on my own, and that I have listed all the literature
and publications used.
I hereby grant the Czech Technical University in Prague the right to utilize this school work
as described in Article 60 of Act No. 121/2000 on Copyright, Rights Related to Copyright
and on the Amendment of Certain Laws (Copyright Act).
In Prague, May 21, 2009 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstract
This work presents a new AIML interpreter written in the Java programming language. The
AIML language is used to write programs for conversation between humans and computers
using natural language. One of the main limiting factors for AIML-based bots is the limited
number of contexts they can handle, because the pattern matching algorithm described in
the AIML specification doesn't scale well with an increasing amount of contexts. Using
techniques from automata theory, a more compact and scalable character-based implementation
of the algorithm is described, implemented and compared to a word-based approach.
Abstrakt
Práce prezentuje nový interpret jazyka AIML napsaný v programovacím jazyce Java.
Jazyk AIML se používá pro psaní konverzačních programů mezi člověkem a počítačem v
přirozeném jazyce. Jedním z hlavních omezujících faktorů těchto programů je omezené
množství kontextů, které jsou schopny zpracovat, neboť algoritmus pro klasifikaci vzorů
popsaný ve specifikaci jazyka AIML se vzrůstajícím množstvím kontextů špatně škáluje.
Použitím postupů z teorie automatů byla popsána kompaktní a škálovatelná verze klasifikačního
algoritmu založená na porovnávání znaků, která byla následně srovnána s přístupem založeným
na porovnávání slov.
Contents
List of Figures xiii
List of Tables xv
1 Introduction 1
1.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 The AIML Language 3
2.1 Overview of the language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 The runtime environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Language features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.1 Variables and constants . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2 Subroutines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3.2.1 Subroutine parameters . . . . . . . . . . . . . . . . . . . . . 5
2.3.3 Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3.4 Other language features . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 The current specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 Limitations of the specification . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1.1 Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4.1.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5 Current implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.1 Chatterbean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.5.2 J-Alice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.3 Pandorabots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5.4 Program D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.5 Program E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5.6 Program N/AIMLPad . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.7 Program O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.8 Program P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.9 RebeccaAIML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.10 Other implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 The classification algorithm 13
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Finite state automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 A set-based description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Generalization and optimization . . . . . . . . . . . . . . . . . . . . . 25
4 Implementation 31
4.1 Data flow overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Loading the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3 Creating the trie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Compact pattern node types . . . . . . . . . . . . . . . . . . . . . . . 34
4.3.2 Naive pattern nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4.1 Compact pattern node matching . . . . . . . . . . . . . . . . . . . . . 36
4.4.2 Naive pattern node matching . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Testing 41
5.1 Shadowed categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6 Conclusion 47
6.1 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.1.1 Pattern to pattern matching . . . . . . . . . . . . . . . . . . . . . . . 48
6.1.2 Visualisation of AIML sets . . . . . . . . . . . . . . . . . . . . . . . . 48
Bibliography 49
A Category markup language syntax 53
B Template markup language syntax 55
C A list of abbreviations 57
D Contents of the CD 59
List of Figures
3.1 The SFOEWO to match a single pattern, $p_1 p_2 p_3 p_4$ . . . . . . . . . . . . . . 17
3.2 The SFFEWO automaton created to match a set of patterns from example
3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 The non-deterministic Mealy automaton created to match the set of patterns
from example 3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 An example of a transducer for a single pattern containing multiple wildcards 22
3.5 Example of a NFA that uses whole patterns as symbols . . . . . . . . . . . . 25
3.6 A NFA that matches sequences of patterns from Ex. 3.3.3 . . . . . . . . . . . 28
3.7 A compact NFA that matches sequences of patterns from Ex. 3.3.4 . . . . . . 29
List of Tables
5.1 Comparison of trees created from the AAA set . . . . . . . . . . . . . . . . . 43
5.2 Amount of overhead (in B) of wrapper level per node . . . . . . . . . . . . . . 43
5.3 Memory requirements depending on the used map (in MiB) . . . . . . . . . . 44
5.4 Properties of pattern sets used for benchmarking . . . . . . . . . . . . . . . . 44
5.5 Matching speed (in seconds) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 1
Introduction
AIML is a markup language created by Dr. Richard Wallace in the mid-1990s. Originally
an attempt at producing artificial intelligence, it is nowadays used for creating chat bots:
programs that are able to hold a conversation in natural language. One of the most famous
AI tests [Tur50] relies on computers being able to have a chat, and today's chat bots are
almost up to the task [loe]. Whether or not this really is a true measure of intelligence
is left to the reader's discretion.
The approach AIML takes is a lot like Weizenbaum's ELIZA [Wei66]. Pieces of information
are stored in so-called categories. Each category consists of a response template and a set
of conditions that give meaning to the template; these are called contexts. Each context
specifies the circumstances under which a category will be activated in response to an input.
But instead of simple keywords like the original ELIZA, AIML uses a simple pattern
language, consisting of words and two wildcard characters. At first, the only context was
tied to the user input, and was simply called the pattern. Gradually, the need for more
contexts became apparent, so a topic was added, as well as a history-sensitive context tied to
the last sentence of the bot's most recent response (called that¹). All of these were hardcoded
both into implementations and into the specification [aim05].
1.1 Terminology
AIML interpreter A piece of software that can read an AIML set, match a user input to
a category, process the contents of the category's template and return the result to
the user.
AIML set An AIML set is a set of AIML categories.
best match The best match is determined from all the matching categories by the classification
algorithm.
bot A live instance of an AIML set inside an interpreter. A user can converse with a bot.
¹ The rationale for this name was the following:
Bot: Hello, nice to meet you. The weather sure is nice!
User: That is true.
category A response template together with a set of contexts.
context A pattern or condition bound to a particular value (such as the user's input, the
bot's last reply or the current topic). All contexts of a category must match for the
category to be activated.
context order To remain deterministic, a strict order of evaluation is enforced. The context
order influences which category will match best.
match A single context matches if its pattern (or condition) matches the current value of
the context. A category matches, if all of its contexts match.
pattern This can either mean any string with wildcards used to specify a context, or
specifically the context tied to the user's input. To avoid ambiguity, only the first
meaning will be used in this work.
reachable A category is reachable, if it is a best match for at least one combination of
context values.
shadowed A category that is unreachable (that is, there exists no combination of context
values, such that this category would be a best match).
srai A recursive call to the interpreter.
topic The context bound to the topic variable.
that The context bound to the last sentence of the most recent reply of the bot.
1.2 Notation
The AIML syntax is (due to its nature, based on XML) a bit unwieldy. For this reason, a
simpler notation for category contexts will be used in this thesis.
A single context will be written as:
[context name]context pattern
A category with more contexts will be annotated as follows:
[input]HELLO WORLD[that][topic]
[a]1[b]2[c]3
Patterns will be written with ALL UPPERCASE LETTERS, whereas values will use Mixed case.
1.3 References
Many definitions in this thesis are based on definitions from literature. Definitions 3.1.1–3.1.3
are taken from [MHP05] but adapted for pattern matching instead of searching. Definitions
3.2.1–3.2.4 are taken verbatim.
Definitions 3.1.4–3.1.6 and 3.3.1 are my own definitions. Definition 3.2.5 (Lexicographical
order) is my own, but was written with the intent of conforming to lexicographical ordering
as commonly understood.
Algorithm 3.2 has been adapted from [MHP05]. Algorithm 3.1 is my own, based on
the SFOEDO algorithm. Algorithms 4.1 through 4.3 are my own (even if implementing
behaviour from [aim05]).
Chapter 2
The AIML Language
In this chapter, an overview of the AIML language will be given.
2.1 Overview of the language
While AIML generally uses its own terminology, it can easily be described in terms of
conventional programming languages. In this section, concepts like AIML predicates,
srai and stars will be explained.
Also, the AIML runtime environment will be described.
2.2 The runtime environment
The purpose of an AIML interpreter is to load an AIML program (called a bot) into
memory, and provide an interface for user interaction with the bot. Because a single AIML
interpreter might support running several bots concurrently (and handle multiple users both
concurrently and over an extended period of time), the runtime environment can be divided
into several levels (or scopes), each being a subset of the previous:
the interpreter scope,
the program (bot) scope,
the user scope,
the session scope, and
the category scope.
All code is run at the category scope, and has access to information from higher scopes.
2.3 Language features
2.3.1 Variables and constants
The AIML language specification defines two types of named values: variables and constants.
The language provides only a single (implicit) type, a string of characters. Variables and
constants differ significantly, both in scope and usability.
Constants are defined at bot level; they are also called bot properties for this reason.
They are read once upon startup of a bot, before any categories are read, and may not be
modified afterwards. For this reason, bot properties may be used in patterns (because they
can be resolved at load time) and provide a simple way of parametrizing a bot (for example,
a bot's name or its favorite food).
Variables exist in the user scope; from a programmer's standpoint (who works at
category level), they are global variables. For historical reasons, they are called predicates
in the specification. They may be persistent across sessions (and many implementations
support persistent storage of variables for each user). Variables do not need to be declared
(they can be used without prior declaration), but the exact behavior is implementation
dependent (the specification describes several mechanisms and places additional restrictions
on variables, which will be further discussed in 2.4). Variables can't be used in patterns, but
otherwise can be used everywhere an AIML element is allowed. Additionally, variables can
be used in conditions.
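To make the distinction concrete, here is a minimal sketch in standard AIML markup (the
property "name" is a common convention, while the variable name "talked-about" is invented
for this illustration):

    <category>
      <pattern>WHAT IS YOUR NAME</pattern>
      <template>
        <!-- bot: reads a read-only bot property (constant), resolved from the bot configuration -->
        My name is <bot name="name"/>.
        <!-- set inside think: silently stores a user-scoped variable (predicate) -->
        <think><set name="talked-about">my name</set></think>
      </template>
    </category>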
2.3.2 Subroutines
AIML is basically a procedural language that borrows some concepts from object-oriented
languages. The fundamental building blocks of AIML are called categories. Each category
represents a single executable piece of code, called a template.
Unlike traditional procedural languages, however, AIML employs an extremely late
binding approach to subroutine calls (called srais¹). Here, binding refers to the process of
resolving the names associated with each subroutine call.
Traditional procedural languages like C or Pascal utilize early binding (also sometimes
called static binding). Each subroutine is identified by its name (and in some languages, its
parameters). The subroutine to be called is determined statically at compile or link time.
Object-oriented languages (especially pure object-oriented languages like Smalltalk)
employ a different approach. Instead of calling subroutines directly, objects send each
other messages, and the appropriate subroutine is determined at run time (so-called late
binding). This allows different objects to respond to identical messages differently (this is
called polymorphism).² In AIML, there are no objects, so there are no concrete receivers
to send messages to. Instead, all message sends are actually message broadcasts. The late
binding algorithm in the interpreter (called the pattern matcher or classifier) searches through
all known categories to find the code to execute.
¹ The exact origin of this name is a bit unclear; however, it can be reasonably assumed that it means
something like "Apply symbolic reduction using artificial intelligence".
² Many object-oriented languages employ a hybrid between early and late binding, or support static methods
that are bound at compile time.
A pattern matching approach to binding and dispatch is not unique to AIML, however.
Pattern matching dispatch is at the core of pure functional and logic programming languages,
and is one of the reasons these languages are declarative. The Portland Pattern Repository
[c207] provides an overview of different ways pattern matching as a dispatch mechanism can
be implemented. The earliest known functional programming language implementing pattern
dispatch is HOPE [BMS80], but the foundations of pattern-based dispatch are much older;
according to [MM04], the first implementation that allowed matching structures was a
modified version of LISP [McB70].
Today, pattern matching is widely accepted and implemented in languages such as
Haskell [hkl], Erlang [erl] and of course Prolog [Hod99]. But even more recent languages like
Scala [Ode09] and F# [fsh] have discovered the power and expressiveness of patterns, even if
only as language constructs and not as a dispatch mechanism. A form of pattern matching
is also used to resolve knowledge frames in early expert systems like PROSPECTOR [MZ97].
Another difference of AIML is the fact that there are no simple category identifiers
(subroutine names, method selectors or Prolog predicates). Categories are identified by
their context, which basically describes (using a simple pattern language) which messages a
category understands, and the structure of these messages. In this sense, the line between a
subroutine identifier and its parameters is blurred.
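As a brief sketch of what such a pattern-dispatched call looks like in standard AIML markup
(the concrete greetings are invented for this illustration), a srai re-submits its content to the
classifier, which again searches all categories for the best match:

    <category>
      <pattern>HI THERE</pattern>
      <!-- srai: broadcast "HELLO" back to the pattern matcher and reuse whatever
           category turns out to be the best match for it -->
      <template><srai>HELLO</srai></template>
    </category>

    <category>
      <pattern>HELLO</pattern>
      <template>Hi! Nice to meet you.</template>
    </category>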
2.3.2.1 Subroutine parameters
During pattern matching, parts of the input can be bound to wildcards in patterns. After the
best match is selected, the bound words and characters are made accessible to the template
using a special star element (named after the most common wildcard).
Bound wildcards are read-only, and local to the category in which they have been bound.
An important part of AIML is recursion support: successive srai calls do not overwrite the
locally bound wildcards; the local context is saved on a stack.
Because subroutine parameters are bound to wildcards in patterns, they are treated as
positional (unnamed) parameters of each pattern. Also, each pattern has its own set of
wildcards. To access the value bound to a wildcard, both the context and the wildcard
number have to be provided.
The specification doesn't define a flexible access scheme to bound wildcards. There are
three separate elements for this. The wildcards from the input context are accessed using
the star element. For the other two predefined contexts, thatstar and topicstar are
provided respectively.
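A small sketch of how bound wildcards are accessed (standard AIML markup; the pattern
and reply are invented): the values captured by the wildcards of the input pattern are read
with star, and the index attribute selects among multiple wildcards:

    <category>
      <pattern>MY NAME IS * AND I LIKE *</pattern>
      <template>
        <!-- index selects the positional parameter: 1 = first wildcard, 2 = second -->
        Nice to meet you, <star index="1"/>. I also like <star index="2"/>.
      </template>
    </category>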
2.3.3 Conditions
The AIML language provides basic support for three conditional constructs: an if statement,
an if-elseif-. . . -else statement and a switch/case statement. The AIML language does
not provide any means for arbitrary expressions in conditions or for different comparison
operators. Instead, values can be defined as patterns that will be matched against the value
of a variable.
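As an illustration (a hedged sketch in standard AIML markup; the variable "mood" and the
replies are invented), the switch/case form lists several value patterns for one variable, with
an optional default branch:

    <category>
      <pattern>HOW ARE YOU FEELING</pattern>
      <template>
        <condition name="mood">
          <li value="HAPPY">I feel great today.</li>
          <li value="GRUMPY">Do you really want to know?</li>
          <!-- li without a value acts as the default (else) branch -->
          <li>I cannot quite tell.</li>
        </condition>
      </template>
    </category>

The if form is written as a single condition element carrying both name and value attributes,
and the if-elseif-else form as a condition whose li elements each carry their own name and
value pair.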
2.3.4 Other language features
Being a language for creating conversation programs, AIML has several special constructs
to help in that task:
random choice of alternatives,
substitution sets,
gossip (the ability to store and later reproduce strings),
sentence case operations,
input normalization, and
conversation history.
The random alternative helps make the bot less deterministic. Because an alternative
can contain arbitrary pieces of AIML code (and can contain subroutine calls), the possibilities
of creating a less predictable bot are nearly limitless. The random choice construct can also
be used (with some additional programming) to shuffle lists of words, which may then serve
as a driving force for the conversation.
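A short sketch of the random construct (standard AIML markup; the replies and the srai
target are invented): one list item is chosen each time the template is evaluated, and an item
may itself contain further AIML such as a srai call:

    <category>
      <pattern>HOW ARE YOU</pattern>
      <template>
        <random>
          <li>I am doing great.</li>
          <li>Fine, thanks for asking.</li>
          <li><srai>TELL ME SOMETHING INTERESTING</srai></li>
        </random>
      </template>
    </category>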
Substitution sets have been introduced to aid in the transformation of user input, most
often pronouns. This allows emulating the effect used by the ELIZA program, where user
statements are reproduced as questions ("I have a problem" → "You have a problem?").
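The ELIZA-style transformation mentioned above is usually written with the person element,
which runs the configured pronoun substitution set over the words bound to a wildcard (a
hedged sketch mirroring the example in the text; which exact pronoun pairs are swapped
depends on the configured substitution set):

    <category>
      <pattern>I *</pattern>
      <template>
        <!-- person applies the pronoun substitution set to the words
             captured by the wildcard before they are echoed back -->
        You <person/>?
      </template>
    </category>

For the input "I have a problem", this reproduces the user's statement as the question from
the example above.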
The gossip function was intended as a simple way for a bot to learn by itself. The
bot can store strings in a global file (for example, after asking a question and suspecting
an interesting response), and later reproduce them verbatim. This approach to learning is
unfortunately rather brittle (neither the stored gossip, nor the context where it is reproduced,
are guaranteed to make sense).
Sentence case operations aid in normalizing the case of words or sentences from
user input; much like substitution sets, this helps with throwing a user's words back at
him. AIML supports very simple capitalization schemas (ALL CAPS, lowercase, First
letter in a sentence, First Letter Of Each Word). This kind of rudimentary support helps in
most obvious cases, but fails for languages that use capitalization for other purposes (German
nouns), or languages that don't use capitalization at all (Japanese).
Input normalization is not really a feature of the language itself, but of the way user inputs
are handled. Normalization applies a set of substitutions to each input before it is matched
against the patterns. These help to eliminate common misspellings ("becasue" → "because")
and to eliminate characters that could confuse the matching algorithm (like angle brackets,
when using the algorithm described in the specification) or dots in common abbreviations
(which affect sentence splitting).
Conversation history is rarely used in templates itself, but can often come in handy
during the actual matching process, where it serves to put a user's reply in context with
the bot's preceding question. For this reason, an interpreter often keeps a linear history of
the last sentences in the conversation. There have been attempts to make conversation history
available to the bot in a more structured way (using multidimensional indexes), but they are
not very practical to use.
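In the pattern matching itself, the most recent bot sentence is exposed as the that context,
so a category can be tied to the bot's own preceding question (a hedged sketch in standard
AIML markup, with invented wording):

    <category>
      <pattern>YES</pattern>
      <!-- that: this category only matches if the bot's last sentence was this question -->
      <that>DO YOU LIKE MOVIES</that>
      <template>What is your favourite film?</template>
    </category>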
2.4 The current specification
The current formal specification of the AIML language [aim05] was created mostly
during 2001 by the AIML Architecture committee, led by the creator of AIML, Dr. Richard
S. Wallace. The goal of the arch-comm, as it came to be known, was to standardize and
codify the AIML language, based mostly on the existing implementations at that time.
The arch-comm had many highly intelligent, but sometimes opinionated, members (the
author of this thesis unfortunately not having been an exception) and many very heated
discussions took place. Fortunately, before everyone grew tired of these flame-wars, Noel
Bush was able to create a draft of the specification that codified most of the behaviour of
then-current implementations, which was agreed upon by the majority of committee members.
2.4.1 Limitations of the specification
While the usefulness and degree of specification of several features (such as multi-dimensional
indices in conversation history) might be questioned, there is one larger problem with
several implications: the specification does not allow an empty string as a pattern.
Because wildcards always match at least one word, this makes it effectively impossible
to match an empty string. But it was generally understood that a category that only
specified a single * for its [input] context was the catch-all category that matched
everything. It was also understood that if an explicit pattern for the [that] or [topic]
context wasn't defined, the category somehow didn't care about the values of these particular
contexts.
Making the [input] context optional like the other two and defining an unspecified input
context as "don't care what the value is" would have been possible, but no implementation
at that time was actually able to handle such a special case. All contexts were concatenated
prior to creating the classifier trie (and the input was preprocessed similarly, which will be
further elaborated upon in chapter 3), and if the pattern wasn't specified, the default was
used.
Instead of providing for a separate "don't care" pattern, or leaving it up to the implementors
of the specification to come up with a solution for unspecified patterns, the specification went
the other direction and simply explicitly disallowed empty strings as inputs to the matching
algorithm (a default pattern was always matched to a default input), reaffirming the use of
the * wildcard as the default pattern if none is explicitly specified.
This had implications on variables as well. Because conditions used patterns, and
patterns couldn't be empty, it became impossible to test whether a variable had already been
set before, and a work-around had to be provided by requiring even undefined variables to
always have a value.
If an interpreter wants to claim 100% conformance with the specification, it must also
always use the default pattern (even if it really means "I don't care"), prohibiting optimization
of sparse pattern sequences with only a few contexts defined.
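A sketch of what this means in practice (the greeting itself is invented): the category below
omits the optional that and topic contexts, and a conforming interpreter must still match it
as if both were the default * pattern, i.e. as [input]GOOD MORNING[that]*[topic]* in the
notation of section 1.2:

    <category>
      <pattern>GOOD MORNING</pattern>
      <!-- no <that> and no surrounding <topic>: the default pattern * is assumed for
           both contexts, even though the author merely means "don't care" -->
      <template>Good morning to you too.</template>
    </category>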
2.4.1.1 Constants
While constants may be used in most reasonable places inside a template, their use in
conditions is limited because of missing syntax. A pure XML syntax for expressions has
been deemed too complicated and verbose for a programmer, while a simple domain-specific
language for expressions (that could be used in XML attributes) would require implementing
another parser, thus raising the bar for implementing an interpreter.
2.4.1.2 Variables
The specification specifically forbids variables containing an empty string (the reasons for
this are explained in 2.4.1), and implementations are advised to use a predefined default value
(e.g. the string "undefined").
During the evolution of AIML, there were several kinds of special variables, distinguished by
the value returned by the assignment expression. The specification permits implementations
to emulate this behaviour for compatibility with old AIML programs (the details are left to
the implementation). The value of an assignment can be either the new value of the variable
(this is the default) or the name of the variable (this is called return-name-when-set).
Due to its history as a language for creating artificial intelligence, AIML variables are
officially called predicates in the specification. With the exception of the special properties
of AIML variables mentioned above, they have no further semantics (unlike predicates in Prolog
or predicate logic). For this reason, this thesis refers to AIML variables simply as variables.
2.5 Current implementations
The following section provides a cursory overview of existing implementations. The main
differences between them are the programming language used, whether the implementation
supports multiple bots and users, and whether the interpreter is designed as a library to be
used by other programs or as a standalone application.
With the exception of Program O and Program N, all implementations try to follow
the specification and the matching algorithm described therein as closely as possible. Many
implementations also allow easy extension of the set of elements used in templates.
2.5.1 Chatterbean
Chatterbean is a Java implementation of an AIML interpreter. This interpreter is designed
to work as a JavaBeans component. It sacrifices advanced features in order to provide a
clean and simple implementation. Notable is the reliance on automated testing (bot unit
testing using JUnit as well as acceptance testing). This interpreter can be extended in several
ways:
a custom global context class can be used, so that variables and constants can be stored
(for example) in a database,
new template elements can be added by simply placing the new classes in a package
(the interpreter loads them via reflection), and
the system tag is scriptable via BeanShell, which means the full expressive power of
Java is available in templates.
An interesting issue is how input preprocessing is handled. Chatterbean maintains a
mapping from the original input to the preprocessed input; this allows input substitution
and normalization to be reverted.
Homepage: http://chatterbean.bitoflife.cjb.net/
2.5.2 J-Alice
Unfortunately, not much information about this implementation has survived to this day,
and the project has been discontinued in favor of partnering with RebeccaAIML.
J-Alice is one of the few AIML interpreters that enhanced the context of categories.
It supported one additional context, called context, that was placed before the input
context and could be specified the same way the topic context is specified. [Roe04]
Homepage: http://j-alice.sourceforge.net/
2.5.3 Pandorabots
A web-based service where people can run their own AIML chatbot without the need to
maintain their own server. The source code of Pandorabots is not publicly available, but
the interpreter is written in a variant of Common Lisp. The web service runs on a Linux
machine. A single service is able to host over 20,000 bots and process between 5,000 and 20,000
interactions per hour. [pan03]
Due to the direct involvement of Dr. Richard Wallace with Pandorabots, the service
is reasonably compliant with the specification, but it provides additional features using an
"embrace and extend" approach [Wal].
Other features include:
multi-lingual bots,
Japanese word splitting (using the Lisp library Chasen),
text-to-speech and animated avatars (using Oddcast Inc.'s VHost™ platform),
HTML, Flash or AIM interfaces, and
web based bot administration and editing.
Homepage: http://www.pandorabots.com/
2.5.4 Program D
Program D is one of the most widely used standalone AIML interpreters. It has a long
history that reaches back to the original Java implementation from Richard Wallace that
was updated by Jon Baer. Since that time, Program D has several times changed homes,
maintainers (including the author of this thesis) and developers, and has been extensively
re-written and re-factored. It is currently maintained by Noel Bush, and the code was
moved from CVS to Launchpad at the end of 2008, with the intention to revive development
[Bus08].
Program D excels in enterprise and web-oriented features, while achieving a high degree
of conformance with the specification using an automated AIML test suite. Some notable
features are mentioned in the release notes [Bus06]:
Works as a library: the interpreter core can be used in a stand-alone manner.
AIML files can be validated using XML Schemas and Schematron.
Contains an automated AIML test suite that guarantees conformance with the specification.
Has the ability to run as a J2EE web application.
Supports configurable logging using Log4J.
Is able to run multiple bots and multiple concurrent users (both anonymous and
authenticated).
Is AJAX enabled.
Homepage: http://aitools.org/
2.5.5 Program E
The speciality of Program E is its implementation language. Program E is written in PHP
and utilizes MySQL; this allows it to be run on common web hosts (which usually don't
support running Java applications, as Program D requires).
Due to the limitations of the platform, Program E has to load and preprocess all categories
in advance and store them in the MySQL database. The context tree is also stored in the
database; while storing and searching trees in a database is not as efficient as an in-memory
implementation, it is the only way to ensure persistence in a stateless environment such as
PHP.
This program is currently not maintained.
Homepage: http://sourceforge.net/projects/programe/
2.5.6 Program N/AIMLPad
Program N is in several aspects the most advanced AIML interpreter ever created. It
doesn't focus on robustness, scalability, automated tests, validation or strict conformance
to the AIML specification. Instead, its author Gary Dubuque focuses on experimentation,
integration of semantic technologies and AIML authoring.
Apart from an AIML interpreter and a web and IRC server, Program N is bundled with a
Notepad-like environment to easily create AIML files. A powerful scripting language even
provides options to generate new categories automatically, and to easily program games
inside AIML.
Program N contains an embedded expert system and is able to access and use several
ontology and common-knowledge systems: OpenCyc, WordNet and ConceptNet.
Two ways of matching patterns are supported. One is a sequential approach: the
list of patterns is stored lexicographically (with special handling of wildcard characters) and
searched sequentially. While this approach is quite slow and doesn't give exact results, it
has the advantage that the pattern matching trie does not need to be kept in memory.
The official trie-based algorithm has also been implemented, and Program N was one of
the first (and to my best knowledge the only) interpreters that supported additional configurable
pattern contexts. A list of up to 256 variables that can be used as a priority context can
be defined (these work the same way as input, that and topic). Any variable not in this
list can also be used as a context. In that case, these variables will be matched in order of
last modification time (the more recently set variable takes precedence). Each template can
also be assigned a CyC guard, which acts like yet another context, but instead of pattern
matching uses a CyC condition that must evaluate to true.
Homepage: http://www.aimlpad.com/
2.5.7 Program O
Program O is a new implementation of an AIML bot written in PHP. It is still in its
early phases of development. The most notable feature is the pattern matching algorithm:
patterns for the three basic contexts are stored as a whole in a database table, and instead of
matching the patterns to an input, a regular expression is constructed for each input and the
database is then searched for pattern strings that match this regular expression. Choosing
the best pattern is not done using the default algorithm; rather, each word and wildcard
from the pattern contributes to the final score with varying weights.
Homepage: http://www.program-o.com/
2.5.8 Program P
Program P (nicknamed PascAlice) is a very simple implementation of an AIML interpreter,
written in Delphi. Its main strengths lie in its simple configuration and speed, and it can
therefore be easily used for testing or prototyping bots.
The main drawbacks of this implementation are that it doesn't support multiple
users or bots at the same time, has no server mode (it is a single-user GUI on Windows), no
support for server-side JavaScript and no output processing (AIML often contains HTML markup).
Program P complies very closely with the specification, both in terms of the matching
algorithm and the tag set.
This program is currently not maintained.
Homepage: http://alicebot.sweb.cz/
2.5.9 RebeccaAIML
According to its homepage:
RebeccaAIML is an enterprise cross platform open source AIML development
platform.
The core of RebeccaAIML is an interpreter written in C++ and backed by a Berkeley DB.
While a direct C++ API is available, the strength of RebeccaAIML lies in its network
protocol and bindings for many different languages (Java, Python, C++ or C#). The
interpreter can run as a service, and different front-ends (and administration tools) written
in different languages can connect to it.
In addition to an AIML interpreter, RebeccaAIML includes an Eclipse plugin. The plugin
provides an AIML file editor with syntax highlighting and code completion, and a console
to directly test the set of loaded AIML files.
Homepage: http://rebecca-aiml.sourceforge.net/
2.5.10 Other implementations
There are several other implementations. Some of them are only of historical interest, such
as the original Program A (written in SETL), Program B (the first implementation in
Java, precursor to Program D) or Program C (written in C++).
Others are forks of the above-mentioned programs. For example, AWAlice³ uses Program
P as a backend for an ActiveWorlds⁴-enabled bot. Program P is also used as the backend
for the Russian IM client Exilty⁵, for AIML TPC⁶ (a library for using AIML inside
DarkBasic) and for the French AIML editing environment Chat 4D⁷. Program W⁸ enhances
Program D with WordNet.
³ http://www.turtleflight.com/magine/mb.html
⁴ http://www.activeworlds.com/
⁵ http://sourceforge.net/projects/exilty-icq
⁶ http://www.whits-end.co.uk/deimos/tpc_aiml.html
⁷ http://www.toolbox.free.fr/TB/Chat4D.html
⁸ http://programw.sourceforge.net/
Chapter 3
The classification algorithm
3.1 Overview
The notational apparatus and terminology for describing pattern matching used in the
AIML classification algorithm is heavily based on computational linguistics, grammars and
automata. Many of these are described in the book Text Searching Algorithms [MHP05],
with one significant difference: the definitions and notions provided in [MHP05] are concerned
with searching problems. For this thesis, I have adapted the definitions used therein for
pattern matching problems. Both pattern searching and pattern matching problems are
in many ways parallel, and many algorithms can be used, either without change or with
minimal changes, for both kinds of problems.
An alphabet $\Sigma$ is a nonempty finite set of symbols.
A string over an alphabet is a finite sequence of symbols. The symbol $\varepsilon$ denotes an empty
string. Concatenation of two strings $x$ and $y$ is defined as the string $x \cdot y = xy$. The symbol
$\Sigma^*$ denotes the set of all strings over an alphabet $\Sigma$. The symbol $\Sigma^+$ denotes the set of all
non-empty strings over $\Sigma$. It holds that $\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$. The length $|x|$ of the string $x$ is the
number of symbols of $x$.
A repeated string will be written using integer exponents: $a^0 = \varepsilon$, $a^1 = a$, $a^2 = aa$, ...,
for $a \in \Sigma$ and $x^0 = \varepsilon$, $x^1 = x$, $x^2 = xx$, ..., for $x \in \Sigma^*$.
A prefix of a string $x$ is any string $y$ such that $x = yz$, $z \in \Sigma^*$. A proper prefix is any
prefix of $x$ which is not equal to $x$.
A don't care symbol $\circ$ is a special symbol that is equal to any symbol.
Definition 3.1.1. (Pattern matching problems)
Given a string $S = s_1 s_2 \ldots s_n$ and a pattern $P = p_1 p_2 \ldots p_n$, we can define two simple
pattern matching problems.
Exact string matching. Verify whether pattern $P$ is equal to the string $S$, that is, whether
$p_i = s_i$ for all $i = 1 \ldots n$.
Pattern matching with don't care symbols. Verify whether pattern $P$ containing don't
care symbols is equal to the string $S$. This means that $p_i = s_i \lor p_i = \circ$ for all $i = 1 \ldots n$.
If a pattern P matches a string S, we can write P = S.
Definition 3.1.2. (Matching a sequence of patterns) Given a string $S = s_1 s_2 \ldots s_n$ and
a sequence of pattern strings $P_1, P_2, \ldots, P_s$, the problem of matching the sequence to the
string means determining that $S$ starts with $P_1$, ends with $P_s$, and each occurrence of $P_i$ is
immediately followed by an occurrence of $P_{i+1}$, $1 \le i < s$.
In other words, that the pattern $P = P_1 P_2 \ldots P_s$ matches the string $S$.
Definition 3.1.3. (Matching a finite set of patterns) Given a string $S = s_1 s_2 \ldots s_n$ and a
set of patterns $\mathcal{P} = \{P_1, P_2, \ldots, P_n\}$, the problem of matching a set of patterns is verifying
whether there exists at least one pattern in the set that matches the string. Formally:
$S = \mathcal{P} \iff \exists P \in \mathcal{P} : P = S$.
Using the classification system described in [MH97] as a base, the original pattern
matching algorithm can be described as a SFFEWF problem. This means that:
patterns are Strings,
we always match the Full pattern,
there is a Finite number of patterns,
matching is Exact and doesn't employ any approximate measure,
patterns can contain Wildcards, and
each category is described by a finite Sequence of patterns.
Unfortunately, the third dimension, which describes the number of patterns, is a little
blurred. The 6D model proposes two options: a finite amount of patterns greater than
one, or an infinite amount of patterns. Describing an infinite number of patterns usually
means employing regular expressions to describe the set of patterns, which is not
the case for AIML; the set of patterns is finite. On the other hand, AIML does provide
mechanisms (via wildcards) to specify that a particular single pattern can match an infinite
set of input strings.
Because of its origins as a tool for processing natural language inputs, we not only want
to know if any pattern matches, but we also want to know exactly which patterns match,
and furthermore, which one of them matches best. For many string searching problems,
this distinction is not important, and the 6D model doesn't differentiate between pattern
matching problems defined by a single regular expression and a set of regular expressions
(they can be trivially converted into a single regular expression). To achieve an effect similar
to using regular expressions, AIML enhances the don't care symbol.
AIML combines the notion of the don't care symbol with the concept of the regular
expression Kleene star and introduces wildcards: special symbols that match one or
more symbols from the alphabet. Having been inspired by the rudimentary support for
wildcards in DOS, the wildcards are * (a star) and _ (an underscore). This sometimes
leads to confusion with regular expressions (where the Kleene star signifies repetition of the
preceding expression) and typical glob-style wildcards (where the star matches zero or more
characters¹). Wildcards create a new type of pattern matching problem, pattern matching
with wildcards.
Definition 3.1.4. (Pattern matching with AIML wildcards)
Given an alphabet $\Sigma$, a string
$S = s_1 s_2 \ldots s_n$, $s_i \in \Sigma$,
and a pattern
$P = w_0 p_1 w_1 p_2 w_2 \ldots p_m w_m$, $p_i \in \Sigma$, $w_i \in \{\varepsilon, *, \_\}$,
the problem of pattern matching with wildcards means to verify that all symbols $p_i$ occur in
the text, and that each occurrence of $p_i$ is either followed by an occurrence of $p_{i+1}$ immediately
if $w_i = \varepsilon$, or that there is at least one symbol between these occurrences if $w_i \ne \varepsilon$. More
formally:
$p_i = s_j \land p_{i+1} = s_{j+1}$ if $w_i = \varepsilon$
$p_i = s_j \land p_{i+1} = s_{j+k+1}$ if $w_i \ne \varepsilon$
$p_1 = s_1$ if $w_0 = \varepsilon$
$p_m = s_n$ if $w_m = \varepsilon$
$1 \le i \le m$, $1 \le j \le n$, $0 < k$
A sequence of patterns in this case means that they have to occur one after another.
Unfortunately, simple concatenation does not work if the resulting sequence is to be used as
a key; for example, the two sequences $P_{11} = p_1 p_2$, $P_{12} = p_3$ and $P_{21} = p_1$, $P_{22} = p_2 p_3$ are
indistinguishable, because they both produce the pattern $P = p_1 p_2 p_3$.
Because of this, special boundary markers need to be inserted in front of each pattern.
Given a sequence of patterns $P_1, P_2, \ldots, P_n$ and a set of symbols $C_i$, $i = 1 \ldots n$, $C_i \notin \Sigma$, we
first construct a sequence of bounded patterns $P'_i = C_i P_i$. These patterns can then be
unambiguously concatenated. The presence of boundary markers that are not part of the
original alphabet in the pattern sequence means that the resulting pattern sequence must
either be matched against a text that contains these boundary markers as well, or these
markers must be disregarded during matching. AIML chooses the former approach.
Definition 3.1.5. (Matching a sequence of AIML patterns)
Given a sequence of strings $S_1, S_2, \ldots, S_s$ and a sequence of pattern strings $P_1, P_2, \ldots, P_s$
(that may contain wildcards) and a set of boundary markers $C = \{C_1, C_2, \ldots, C_s\}$, matching
means determining that the pattern $P = C_1 P_1 C_2 P_2 \ldots C_s P_s$ matches the string
$S = C_1 S_1 C_2 S_2 \ldots C_s S_s$.
It is of course possible to match a finite set of sequences of AIML patterns, but special
care must be taken that all the pattern sequences remain compatible with the input sequence.
¹ The idea that * or *.* matches any file, and that therefore a single wildcard should also match any input,
has led to inconsistencies in the specification that were described earlier in chapter 2.4.
This means that all pattern sequences must contain the same number of patterns as there
are strings in the input sequence.
Until now, the problem of pattern matching has been examined as a decision problem. If
pattern matching is to be used to retrieve categories that are described by pattern sequences
(classify an input into a category), we need to know not only which patterns (or sequences)
match a given input (or input sequence), but also which pattern (sequence) describes a
particular input (sequence) best.
Definition 3.1.6. (AIML classification problem)
Given an input sequence $S = S_1, S_2, \ldots, S_s$ and a finite set of pattern sequences
$\mathcal{P} = \{P_1, P_2, \ldots, P_n\}$, where each pattern sequence $P_i$ consists of $s$ patterns for $1 \le i \le n$, the
problem of classifying the input as a single pattern sequence $P_x \in \mathcal{P}$ means finding the
pattern that matches the input and has the largest value of a measure function. More
formally:
let $\mathcal{P}' = \{P : P \in \mathcal{P}, P = S\}$ be the set of patterns that match the input sequence,
let $m(S, P)$ be a measure function that imposes a total ordering on the patterns in $\mathcal{P}'$,
then the input $S$ is classified as the pattern $P_x$: $m(S, P_x) = \max\{m(S, P) : P \in \mathcal{P}'\}$.
What exactly the measure function is and how it can be computed is explained in the
following chapters.
3.2 Finite state automata
Finite state automata are a convenient formalism to describe algorithms used for pattern
matching. In this chapter, I will describe how to construct a nondeterministic finite automaton
that can be used to determine the set of all patterns that match a particular input. After
describing the way inputs are classified using a deterministic simulation of the NFA, I will
extend the transitions of the NFA in such a way that it will be possible to pick the best
matching pattern from the final states. The initial definitions in this chapter are again taken
from [MHP05].
Definition 3.2.1. (Nondeterministic finite state automaton with $\varepsilon$-transitions)
A nondeterministic finite automaton (NFA) is a quintuple $M = (Q, \Sigma, \delta, q_0, F)$, where
$Q$ is a finite set of states,
$\Sigma$ is a finite input alphabet,
$\delta$ is a mapping from $Q \times (\Sigma \cup \{\varepsilon\})$ into the set of subsets of $Q$,
$q_0 \in Q$ is an initial state and
$F \subseteq Q$ is the set of final states.
Definition 3.2.2. (Configuration of FA)
Let $M = (Q, \Sigma, \delta, q_0, F)$ be a finite state automaton. A pair $(q, w) \in Q \times \Sigma^*$ is a
configuration of the finite state automaton $M$.
Definition 3.2.3. (Transition in NFA with $\varepsilon$-transitions)
Let $M = (Q, \Sigma, \delta, q_0, F)$ be a nondeterministic finite automaton with $\varepsilon$-transitions.
The relation $\vdash_M \subseteq (Q \times \Sigma^*) \times (Q \times \Sigma^*)$ will be called a transition in automaton $M$:
if $p \in \delta(q, a)$, $a \in \Sigma \cup \{\varepsilon\}$, then $(q, aw) \vdash_M (p, w)$ for each $w \in \Sigma^*$.
Definition 3.2.4. (Language accepted by NFA)
A string $w \in \Sigma^*$ is said to be accepted by a nondeterministic finite automaton
$M = (Q, \Sigma, \delta, q_0, F)$ if there exists a sequence of transitions $(q_0, w) \vdash^* (q, \varepsilon)$ for some $q \in F$.
The language $L(M) = \{w : w \in \Sigma^*, (q_0, w) \vdash^* (q, \varepsilon)$ for some $q \in F\}$ is then the language
accepted by the nondeterministic finite automaton $M$.
The algorithms for creating an automaton that matches a simple string pattern are well
described in the literature. I will therefore directly describe the construction of an automaton
accepting a pattern with wildcards. It differs from an NFA accepting a single pattern in
the fact that for each wildcard there exists an incoming transition for each symbol of the
alphabet, and also a self-loop in that state.
Algorithm 3.1: Construction of a SFOEWO automaton
Input: Pattern $P = p_1 p_2 \ldots p_m$, where $p_i \in \Sigma \cup \{*, \_\}$.
Output: SFOEWO automaton $M$.
Method: NFA $M = (\{q_0, q_1, \ldots, q_m\}, \Sigma, \delta, q_0, \{q_m\})$, where the mapping $\delta$ is
constructed in the following way:
1. $q_{i+1} \in \delta(q_i, p_{i+1})$ for $0 \le i < m$ and $p_{i+1} \in \Sigma$,
2. $q_{i+1} \in \delta(q_i, a)$ for $0 \le i < m$, $p_{i+1} \in \{*, \_\}$ and all $a \in \Sigma$,
3. $q_i \in \delta(q_i, a)$ for $0 < i \le m$, $p_i \in \{*, \_\}$ and all $a \in \Sigma$.
Example 3.2.1. An example of a SFOEWO automaton that matches the pattern
$P = p_1 p_2 p_3 p_4$ is shown in Fig. 3.1. States are not labeled, with the exception of states that
represent wildcards. These are labeled with the actual wildcard symbol, the letter S (which
stands for "star") and the number of the wildcard in the pattern.
Figure 3.1: The SFOEWO to match a single pattern, $p_1 p_2 p_3 p_4$
A universal algorithm for matching a finite set of patterns has also been described in
[MHP05]. First, individual automata for each pattern are constructed. These are then
combined to create an NFA that accepts the union of all languages accepted by the individual
automata. There are several ways this union can be constructed. One approach is to create
a new initial state with $\varepsilon$-transitions leading to the initial states of each of the original
automata. Another approach is to create a new automaton that is the result of simulating
each individual automaton in parallel [Mel03]. The latter approach has the advantage that
common prefixes share states and the resulting automaton is smaller. This is an important
trait that is also exploited by search tries [Fre60] and PATRICIA indexes [Mor68]. Algorithm
3.2 is a version of the universal ??F??? construction algorithm, which I have modified to
take this into account.
Algorithm 3.2: Construction of a ??F??? automaton with shared prefixes
Input: A set of patterns with a specification of the way of matching,
$\mathcal{P} = \{P_1(w_1), P_2(w_2), \ldots, P_r(w_r)\}$, where $P_i$ are patterns and $w_i$ are
specifications of the ways of matching them, for $1 \le i \le r$.
Output: The ??F??? automaton.
Method:
1. Construct an NFA for each pattern $P_i$, $1 \le i \le r$, taking into account the matching
specification $w_i$.
2. Create an NFA for a language which is the union of all input languages of the
automata constructed in step 1, in such a way that common prefixes share states. The
resulting automaton is the ??F??? automaton.
Example 3.2.2. An example of an automaton that matches the set of patterns
$P$ = { _BCD, R*TUV, R*T*, IJK, *, *TUV }
is shown in Fig. 3.2.
At this point, I won't go into detail with regards to the construction of a SFFEWF
automaton (that matches sequences of patterns with wildcards). According to definition
3.1.5, this problem can be solved by converting a sequence of patterns into a single pattern.
The only thing to note here is that, in addition to the number of patterns in each sequence
being the same as the number of strings in the input sequence, the set of boundary markers
must also be shared between all pattern sequences, as defined in def. 3.1.6.
Let's examine the question of how matching is performed. The AIML specification
explains the matching algorithm using a trie structure called the Graphmaster. An in-order
depth-first search with backtracking is performed. The order in which child nodes are
examined is as follows.
Figure 3.2: The SFFEWO automaton created to match a set of patterns from example 3.2.2
1. The underscore is tried first, with progressively shorter suffixes of the input string.
2. An exact word match is tried second.
3. The star is tried third; it is matched the same way as the underscore.
4. If there are no more words in the input and the current node in the trie contains a
reference to a template, the search terminates.
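As a hedged illustration of this order (the patterns and input are invented), consider matching
the input HELLO THERE against three word-level patterns; the trie branches are explored as

    1. _ THERE        (underscore branch, tried first)
    2. HELLO THERE    (exact words, tried second)
    3. HELLO *        (star branch, tried last)

The first branch that consumes the whole input and ends in a node holding a template wins,
so with only these three patterns loaded the exact HELLO THERE category can never be the
best match; it is shadowed in the sense of section 1.1.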
Due to the way the SFFEWO automaton has been created using algorithm 3.2, its
structure is nearly identical to that of a Graphmaster search trie. Also, the way the
Graphmaster trie is searched corresponds to a deterministic simulation of an NFA using a
depth-first algorithm with backtracking; when the order in which transitions are examined
corresponds to the priorities of wildcards, then the first accepting configuration $(q, \varepsilon)$ with
$q \in F$ that is reached is also the best match.
An enhancement of NFAs with transition priorities and special tags that serve the purpose
of remembering the position in the input string when a transition has been taken has been
proposed in [Lau00]. This paper also discusses determinization of such automata and their
application in regular expression matching. In my thesis I want to provide a more abstract
view of the whole process and its generalization for matching arbitrary patterns, while
keeping it compatible with the basic algorithm.
I have previously defined the AIML pattern classification problem as an optimization
problem, where each pattern is assigned a metric that defines its optimality. I will now
explain how to get this metric from the NFA, using a transducer.
A transducer is a finite automaton that translates an input string into an output string.
Two basic types of transducers are Moore transducers (where the symbols are output in
states) and Mealy transducers, which output a symbol on transitioning. Often both of these
automata are deterministic: there can always be only one valid transition, and they produce
a single output. A non-deterministic transducer can produce multiple outputs, one for each
final state. This means that the output needs to be carried by each active state (or
configuration), and the transition function needs to be modified accordingly.
How does a non-deterministic Mealy transducer need to be constructed (what should the
output be for each transition) so that we get the desired optimization function as the output?
The key lies in the way the automaton would be simulated using a depth-first search. Each
transition in a Mealy transducer adds a symbol to the output string. If we want these strings
to act as an optimization metric for the classification problem, all the output strings for a
given input string must be totally ordered. The easiest way to accomplish this is to use a
totally ordered alphabet and compare the output strings lexicographically.
Definition 3.2.5. (Lexicographical order of strings)
Given an ordered alphabet $\Sigma$ and two strings $A = a_1 a_2 \ldots a_n$, $B = b_1 b_2 \ldots b_n$, where
$A, B \in \Sigma^*$, we can say that
$a_1 a_2 \ldots a_n < b_1 b_2 \ldots b_n \iff (\exists m > 0)(\forall i < m)(a_i = b_i) \land (a_m < b_m)$.
In practice, this means that the transitions into a state created from an underscore wildcard
must output the biggest symbol of the alphabet; the larger the symbol, the higher the priority.
A transition on a normal pattern symbol must have a smaller output symbol, and transitions
into a star wildcard state must have an even smaller symbol. Self-loops in wildcard states are
a special case: the simulation is depth-first, and because self-loops don't go deeper but stay
in the same state, they must output an even smaller symbol (the lowest priority). Using an
alphabet consisting of numbers (each number represents one symbol), the output symbols
for transitions might be assigned as follows:
_ wildcard > exact match > * wildcard > self-loop
3 > 2 > 1 > 0
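As a small worked check (using two of the numeric output strings that appear in example
3.2.3 below), the lexicographic comparison proceeds symbol by symbol from the left:

$21222 > 21210$, since $2 = 2$, $1 = 1$, $2 = 2$ and then $2 > 1$ at the fourth position,

so the candidate that matched the fourth input symbol exactly (output 2) outranks the one
whose star wildcard consumed it (output 1).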
Let's also consider the issue of boundary markers used for pattern sequence matching.
Assigning them an output symbol is not strictly necessary, because they work by virtue of
always exactly matching the corresponding boundary markers in the input. But because
they always occur in a predefined sequence, it will be useful to assign them decreasing
output symbols; that is, the first boundary marker always has a bigger priority than the
second. Given a set $C = \{C_1, C_2, \ldots, C_n\}$ of boundary markers and using the same numerical
alphabet as for patterns, the transitions might be assigned as follows:
$C_1 > C_2 > \ldots > C_n$
$-1 > -2 > \ldots > -n$
Example 3.2.3. Using the same set of patterns as in 3.2.2, the transducer created from
a SFFEWO automaton can be seen in Fig. 3.3. Examples of inputs with the matching
patterns and their outputs:
ijk → IJK/222, */100
xyabcd → _BCD/300222, */100000
rstuv → R*TUV/21222, R*T*/21210, *TUV/10222, */10000
rxytubcd → _BCD/30000222, R*T*/21021000, */10000000
Figure 3.3: The non-deterministic Mealy automaton created to match the set of patterns
from example 3.2.2
The preceding example showed how the algorithm deals with multiple patterns that
match the same input. There was always a single pattern that had the highest rating. But
if a pattern contains multiple wildcards, it can happen that a single pattern matches several
times. The behaviour exhibited by the backtracking algorithm can be described as non-greedy
or "lazy": wildcards always match the shortest possible input. The following
example shows how the non-deterministic Mealy machine resolves this issue.
Example 3.2.4. Let the pattern be $P = {*}{*}{*}$ (three star wildcards). Figure 3.4 shows the
automaton created for the pattern $P$. The following are the outputs for the input aaaaa
(in order of descending priority):
aaaaa → ***/11100, ***/11010, ***/11001, ***/10110, ***/10101, ***/10011
It is clear that the first wildcard matches the first character, the second wildcard matches the
second character and the third wildcard matches the rest of the input.
Figure 3.4: An example of a transducer for a single pattern containing multiple wildcards
3.3 A set-based description
In the previous section, I have shown how to arrive at a measure function from the original
AIML pattern matching algorithm, as described in the specification. Each part of the output
string exactly corresponds to a substring of the input, and its value is generated by the states
and transitions of a single pattern constituent (either a wildcard or a symbol). In other
words, for each pattern constituent $p$ we have a relation $m_p$ that tells us if the constituent
$p$ matches a string $S$. We can also define an ordering function for each pattern constituent,
which assigns a non-empty output string to each string that matches the constituent.
The whole pattern matches a string if we can find a mapping between all parts of the
string and all pattern constituents. The order of a particular mapping is the concatenation
of the ordering functions of each pattern constituent. The order of the pattern is the
lexicographically maximal order over all mappings between parts of the string and pattern
constituents. For the whole system to be unambiguous, the following rules for the order
function have to be observed.
Rule 1. The order relation must be defined for each pattern constituent in such a way that if both constituents match the same string, then the order function is the same iff the constituents are the same.
Rule 2. Because patterns are a concatenation of pattern constituents, the order must also be defined in a way that remains unambiguous after concatenation. Specifically, this means that if P₁ = p₁p₂ matches a string S = uv so that mₚ₁(u) and mₚ₂(v), and P₂ = p₁ also matches the string, so that mₚ₁(uv), then

oₚ₁(u) ⋅ oₚ₂(v) ≠ oₚ₁(uv).
Rule 3. The ordering function must also be consistent. If a pattern constituent matches two strings, one of which is a proper prefix of the other, then the order of one string must also be a proper prefix of the other:

mₚ(u) ∧ mₚ(uv) ⟹ oₚ(uv) = oₚ(u) ⋅ x for some non-empty x.
Rule 4. The ordering function must be non-associative. This means that if an input S = uvw, v ∈ Σ⁺, u, w ∈ Σ* can be mapped to a pattern P = p₁p₂ in different ways, then these ways must have a different order:

oₚ₁(u) ⋅ oₚ₂(vw) ≠ oₚ₁(uv) ⋅ oₚ₂(w).
The above rules make it possible to add additional pattern constituents and integrate them with the existing pattern constituents, without breaking the original pattern matching behaviour.
Definition 3.3.1. (Basic AIML pattern constituents)
AIML defines 3 basic pattern constituents: symbols that match exactly, and two wildcards. Given an input alphabet Σ and an output alphabet Ω, they are as follows:

- A symbol σ ∈ Σ matches a string S = s iff s = σ. The order for each symbol is o_σ(s) = ⟨a⟩.
- An underscore wildcard _ matches any string S ∈ Σ⁺. The order for each string is o_(S) = ⟨_⟩⟨0⟩^(|S|−1).
- A star wildcard * matches any string S ∈ Σ⁺. The order for each string is o_*(S) = ⟨*⟩⟨0⟩^(|S|−1).

The order of symbols in Ω = {⟨_⟩, ⟨a⟩, ⟨*⟩, ⟨0⟩} is

⟨_⟩ > ⟨a⟩ > ⟨*⟩ > ⟨0⟩
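
The following minimal Java sketch (my own illustration, not code from the interpreter) spells out these three ordering functions; the decorated output symbols of the definition are represented by the plain characters '_', 'a', '*' and '0', and their ordering is enforced by an explicit ranking rather than by character codes:

public class BasicConstituentOrders {
  // Priority of the output symbols, lowest to highest: 0 < * < a < _.
  static final String RANKING = "0*a_";

  static String zeros(int n) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < n; i++) sb.append('0');
    return sb.toString();
  }

  // An exact symbol always contributes the single "exact match" output symbol.
  static String orderSymbol(char matched) { return "a"; }

  // An underscore wildcard matching a non-empty string S contributes _ 0^(|S|-1).
  static String orderUnderscore(String s) { return "_" + zeros(s.length() - 1); }

  // A star wildcard matching a non-empty string S contributes * 0^(|S|-1).
  static String orderStar(String s) { return "*" + zeros(s.length() - 1); }

  // Lexicographic comparison of two output strings under RANKING.
  static int compare(String a, String b) {
    for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
      int d = RANKING.indexOf(a.charAt(i)) - RANKING.indexOf(b.charAt(i));
      if (d != 0) return d;
    }
    return a.length() - b.length();
  }

  public static void main(String[] args) {
    // For the same matched string, an underscore wildcard outranks a star wildcard.
    System.out.println(compare(orderUnderscore("xyz"), orderStar("xyz")) > 0); // true
  }
}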
Example 3.3.1. Let's try to add a new wildcard ?, that matches zero or one symbols from the input (S ∈ Σ ∪ {ε}). Let it have the following ordering function:

o_?(S) = ⟨?⟩⟨0⟩^(|S|) for S ∈ Σ ∪ {ε}

and modify the output alphabet:

Ω' = Ω ∪ {⟨?⟩}
⟨_⟩ > ⟨a⟩ > ⟨?⟩ > ⟨*⟩ > ⟨0⟩

This satisfies rule 1, because the output always starts with a ⟨?⟩ symbol (which wasn't even present in the original alphabet).
Rule 2 is also satisfied. If it was not, we would have an input uv that matches a single ? wildcard, and moreover

o_?(uv) = o_?(u) ⋅ oₚ(v).

Because the ? wildcard matches only a single symbol or an empty string, it follows that uv ∈ Σ ∪ {ε} and that u = ε or v = ε. If uv = ε, then

o_?(ε) = o_?(ε) ⋅ oₚ(v)
⟨?⟩ = ⟨?⟩ ⋅ oₚ(v)

and we need to find a pattern constituent p that matches an empty string. The only such constituent is ?, but that would lead to a contradiction:

o_?(ε) = o_?(ε) ⋅ o_?(ε)
⟨?⟩ = ⟨?⟩ ⋅ ⟨?⟩.

If only u = ε and uv = v, then we have

o_?(v) = o_?(ε) ⋅ oₚ(v)
⟨?⟩⟨0⟩ = ⟨?⟩ ⋅ oₚ(v),

and the only way this could be true is if oₚ(v) = ⟨0⟩ (we don't have any ordering function that outputs ⟨0⟩). If v = ε and uv = u, then

o_?(u) = o_?(u) ⋅ oₚ(ε)
⟨?⟩⟨0⟩ = ⟨?⟩⟨0⟩ ⋅ oₚ(ε),

and we would need an ordering function that outputs an empty string.
Rule 3 is trivially true because o_?(v) = o_?(ε) ⋅ ⟨0⟩.
Showing that rule 4 is satisfied is similar to rule 3. The most critical case here is the pattern ?? that matches a single character x. I can easily show that

o_?(x) ⋅ o_?(ε) ≠ o_?(ε) ⋅ o_?(x)
⟨?⟩⟨0⟩ ⋅ ⟨?⟩ ≠ ⟨?⟩ ⋅ ⟨?⟩⟨0⟩
An often requested feature inspired by regular expressions is a pattern constituent that
matches a set of strings (or words). This is easy to implement for a single pattern, but the
following example shows that this pattern constituent is not without problems.
Example 3.3.2. Let's try to add a new pattern constituent that allows us to match a set of words. For example, the alternation (AA|AB) matches either the string aa or ab. For any string that a particular alternation matches, let its order (analogous to the other constituents) be

o_(S₁|S₂|…|Sₙ)(S) = ⟨|⟩⟨0⟩^(|S|−1) for S ∈ {S₁, S₂, …, Sₙ}

and the output alphabet

Ω'' = Ω ∪ {⟨|⟩}
⟨_⟩ > ⟨a⟩ > ⟨|⟩ > ⟨*⟩ > ⟨0⟩.

Unfortunately, such an order doesn't even satisfy the first rule. For example, two different alternations (A|B) and (A|C) both match the input a and both have the same order ⟨|⟩.
There are several possible solutions to this.
- Only allow a predefined set of alternations, each with a manually specified order.
- Assign each alternation a random order.
- Restrict the set of patterns and allow only non-overlapping alternations (the languages that are matched by each alternation must have an empty intersection).
- Incrementally construct the set of patterns and check each new pattern for conflicts with other patterns.

Each solution has its benefits and drawbacks. When adding such a pattern constituent, the developer has to consciously pick one option and weigh its pros and cons.
3.3.1 Generalization and optimization
I've shown how each individual pattern constituent works and how its order contributes to the order of a particular mapping between parts of the input and pattern constituents. The whole system can be generalized. Instead of looking at each constituent of the pattern separately, we can think of each pattern as a single complex pattern constituent. Each input string becomes a single complex input symbol, and the output string becomes a complex output symbol. This doesn't change the set of accepted inputs, nor their order. Going back to finite automata, this transforms the whole automaton into a single large branch. From the point of view of the deterministic simulation of the automaton, almost nothing has changed: when at a branch, the individual complex pattern constituents are still tried depth-first and in-order, with backtracking. How the order of the complex constituents is computed is irrelevant; the ordering function is a black box (unlike the original algorithm, where each branch had a pre-defined order). An example of such an automaton is shown in Fig. 3.5.
Figure 3.5: Example of an NFA that uses whole patterns as symbols
Such an abstraction of the internals of each branch allows us to visualise larger patterns and also pattern sequences. Some optimization possibilities will also become apparent. And while the ordering function has been carefully derived from the original pattern matching algorithm used in AIML, some patterns in the sequence can use a completely different ordering.
In AIML, all categories are represented by a sequence of 3 patterns, each belonging to a certain context (input, that or topic). In many cases, we don't care about the inputs from all 3 contexts; in these cases a default * pattern is used (the input sequence is modified accordingly, so it doesn't contain empty strings). The boundary markers correspond to the names of the contexts: [input], [that] and [topic].
Example 3.3.3. Examples of pattern sequences from the AAA set:

P₁ = [input]YES[that]DOES IT PAY WELL[topic]
P₂ = [input]YES[that] YOU A STUDENT[topic]
P₃ = [input]YES[that][topic]
P₄ = [input] [that]HOW MANY COINS DO YOU WISH TO BET 1 10[topic]BLACKJACK
P₅ = [input] ABOUT ALICE[that][topic]
P₆ = [input]SEVERAL WHO ARE [that][topic]
P₇ = [input]THE IS THE BEST [that][topic]
P₈ = [input]I DO NOT LIKE AT ALL[that][topic]
P₉ = [input]I DO NOT LIKE ANY [that][topic]
P₁₀ = [input]YOU ARE SERIOUSLY [that][topic]
P₁₁ = [input][that]MY NAME IS ALICE WHAT IS YOURS[topic]
P₁₂ = [input][that][topic]

The general structure of an automaton matching these patterns is shown in Fig. 3.6. Final states have been labeled with the pattern they belong to.
Of 45244 categories from the AAA set, only 56 categories explicitly use the topic context (and most of these were used for a simple wordplay game). The that context is used by 1389 (about 3% of all categories). But all categories must use the default pattern if the topic or that isn't explicitly defined. Many implementations strictly follow the specification and don't apply any optimizations. This means that even though we don't care about other contexts than the input, all contexts must be matched and bound to wildcards.
One simple optimization that can be applied is optimizing trailing wildcards. If the wildcard is the last constituent of a pattern (and there are no other patterns), the trailing wildcard can simply match the rest of the input instead of processing the input one symbol at a time. This is certainly a useful approach, but it still imposes a penalty that increases almost linearly when the number of contexts (the number of patterns in a sequence) increases.
Because we don't actually care about the value of undefined contexts, we can remove those patterns and their respective boundary markers from the sequence altogether. This will not affect the matching order, because boundary markers also have transition priorities, and matching the second pattern from a sequence can't start earlier than matching the first pattern. The problem with the structure of the input sequence not corresponding to the structure of the automaton can be solved by setting the current input when transitioning to a boundary marker.
What happens if we don't care about any pattern? In that case, the initial state of the automaton would also become a final state. This isn't acceptable and it is necessary
to introduce another boundary marker that signifies the end of a pattern sequence. The specification uses a [template] boundary marker, which has the lowest priority of all.
Example 3.3.4. A compact version of the automaton from example 3.3.3 is shown in Fig. 3.7. Boundary markers have been replaced with ε-transitions; instead, the input is changed in nodes marked Cₙ. Labeling of final states is the same as in Fig. 3.6.
Figure 3.6: An NFA that matches sequences of patterns from Ex. 3.3.3
Figure 3.7: A compact NFA that matches sequences of patterns from Ex. 3.3.4
Chapter 4
Implementation
The project is written in Java 5, and has been developed over a period of several years. One
of the main goals was to create a library of core AIML functions, so that it would be possible
to easily create an actual working AIML interpreter with a customizable feature set with
minimum overhead. The provided interpreter, demo.InterpreterDemo, tries to provide an
example of how to set up and use the different classes that make up an AIML interpreter.
The most important of the core classes is the pattern matching engine,
aiml.classifier.Classifier.
4.1 Data flow overview
This section provides an overview of the main interpreter loop, starting with input from the
user and ending in a response from the bot.
AIML interpretation starts with a user's input. The first stage of preprocessing applies input substitutions, splits the input into sentences, strips punctuation and trims whitespace from each sentence.
For each input sentence, the interpreter updates the environment (an instance of
aiml.environment.Environment) and starts the classification algorithm that is implemented
by the aiml.classifier.Classifier class.
To provide predictable behaviour for recursion, the classifier first creates a snapshot of the current context: the values from the environment that take part in matching. These values, along with information about the state of wildcards in each context, are stored in the
aiml.classifier.MatchState class.
Because the Classifier supports arbitrary contexts (that might have various different data sources), it is not advisable to hardcode context value retrieval into the environment itself and access it directly. Instead, a double dispatch approach is used. Each individual
context, represented by an instance of aiml.context.Context, knows how to retrieve its
associated data from the environment.
After creating the MatchState, control is passed to the root node of the context trie
and the actual matching can take place (this is discussed in detail in section 4.4). Upon successfully
matching a category, the resulting template script is contained in the MatchState which is
returned to the interpreter.
The interpreter then evaluates the script using the current environment. This turns out
to be quite easy, because the source code of the templates has already been converted to an abstract syntax tree by the parser.
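
The shape of this loop can be summarized by the following self-contained sketch. It is only an illustration of the data flow described above: the real interpreter works with aiml.environment.Environment, aiml.classifier.Classifier and aiml.classifier.MatchState, whose exact method signatures are not repeated here, so simple functional stubs stand in for them.

import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Sketch of the main interpreter loop: substitutions, sentence splitting,
// per-sentence classification and template evaluation.
public class InterpreterLoopSketch {
  public static void main(String[] args) {
    Function<String, String> substitute = s -> s.replace("what's", "what is"); // input substitutions
    Function<String, List<String>> splitSentences = s -> Arrays.asList(s.split("[.!?]\\s*"));
    Function<String, String> classifyAndEvaluate = sentence ->
        "(response to: " + sentence.trim().toUpperCase() + ")"; // stands in for Classifier + template evaluation

    String userInput = "what's your name? do you like music.";
    StringBuilder response = new StringBuilder();
    for (String sentence : splitSentences.apply(substitute.apply(userInput))) {
      if (sentence.isEmpty()) continue;
      // One classification pass per sentence, as described in section 4.1.
      response.append(classifyAndEvaluate.apply(sentence)).append(' ');
    }
    System.out.println(response.toString().trim());
  }
}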
4.2 Loading the data
Several different data formats are used to create a fully operational AIML interpreter. At the core lie the individual AIML files containing categories and response templates; the syntax of these files is based on the AIML specification, but with further enhancements and modifications.
A set of AIML files is loaded as a part of a Bot. The syntax of bot files is not part of the AIML specification, but it's adapted from the XML syntax used by Program D.
All files are processed using a top-down recursive descent parser, using a custom implementation of the Xml Pull Parser API [xpp06]. This API was chosen in the beginning of the implementation; since 2003, there exists a standard XML streaming API [jsr03] which is bundled with JDK 1.6 [sxp]. The implementation has not been updated to use the new standardized API.
The recursive descent parser that loads bot XML files is implemented as a part of the aiml.bot.Bot class. Parsing AIML files can be logically separated into parsing category markup (aiml.parser.AIMLParser) and parsing template markup.
The main role of the category markup parser is keeping track of the current pattern sequence: which contexts have an associated pattern or restriction placed on them. Contexts can be arbitrarily nested and grouped, but once a context is defined for the current group, it may not be overridden.
Due to the large number of template tags, and to provide the possibility to extend the system, templates are parsed dynamically. Every template element implements the aiml.script.Script interface that defines two methods: parse(), which returns the root node of the resulting abstract syntax tree, and evaluate(), which returns the result of evaluating the tree in the current environment. Element nodes are registered in the aiml.script.ElementParserFactory class (which also includes nodes that handle character data and a fallback element handler for unknown elements).
After the AIMLParser has finished parsing a category, the resulting pattern sequence and
script tree are added to the Classifier.
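
For readers unfamiliar with the pull-parsing style, the following sketch shows the basic event loop of the XmlPull API on a tiny AIML fragment. It is an illustration only (the thesis uses its own implementation of the API), and it assumes an XmlPull implementation such as XPP3 is on the classpath:

import java.io.StringReader;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

// Pull-based parsing: the caller drives the parser and reacts to events.
public class PullParseSketch {
  public static void main(String[] args) throws Exception {
    String aiml = "<aiml><category><pattern>HELLO</pattern>"
                + "<template>Hi there!</template></category></aiml>";
    XmlPullParser xpp = XmlPullParserFactory.newInstance().newPullParser();
    xpp.setInput(new StringReader(aiml));
    String element = null;
    for (int ev = xpp.getEventType(); ev != XmlPullParser.END_DOCUMENT; ev = xpp.next()) {
      if (ev == XmlPullParser.START_TAG) {
        element = xpp.getName();                 // remember which element we are in
      } else if (ev == XmlPullParser.TEXT && element != null && !xpp.isWhitespace()) {
        System.out.println(element + ": " + xpp.getText());
      } else if (ev == XmlPullParser.END_TAG) {
        element = null;
      }
    }
  }
}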
4.3 Creating the trie
There are several ways to represent a matching automaton or a trie data structure in memory. In this implementation, I have chosen to use an object based tree. Each node is represented by an instance of a node class, and contains references to its child nodes. Different nodes can be represented by different classes. There are several reasons for this:

- The trie is heterogeneous; different nodes are matched differently.
- It is easy to add new nodes implementing different matching strategies.
- It is easier to implement a single matching behaviour in a single node than to implement the matching behaviour of all different node types in one large method. Also, class polymorphism can be taken advantage of.
- The graph contains cycles (for wildcards), and it is easier to track state when taking advantage of recursion and the built-in call stack.
- Instead of using a generic data structure for storing child nodes, each node implementation can decide how to store its children.
At the highest level, the trie consists of nodes which are instances of the ContextNode class. They manage the overall sequence of patterns and implement skipping of contexts that are not defined. It is assumed (but not enforced in any way by the current implementation) that instances of context nodes implement the AIML matching algorithm as described in Chapter 3: they sequentially iterate over all child context nodes that match the value of the current context and return the first successful result. The exact algorithm by which a ContextNode iterates over child context nodes is left up to the implementations of ContextNode subclasses. In case matching a context fails, the ContextNode passes control to the following context.
There is a special terminal context node represented by the LeafContextNode class
that is automatically created whenever the pattern sequence reaches its end.
Each pattern in the sequence knows which context it belongs to. This information is used by the ContextNodes to maintain order: unless the pattern's context is equal, either a new context node has to be prepended, or added to the following context in line. New context nodes are never created directly. Instead, each context has a MatchingBehaviour with a factory method that provides a new instance of a ContextNode appropriate for the context.
For this thesis, I have implemented the PatternBehaviour class, which is used for matching AIML patterns with wildcards. When asked for a context node, it returns an instance of the PatternContextNode class. Internally, PatternContextNode contains a trie of PatternNodes. There can be many different types of nodes, each having a specific function and being able to handle a certain part of a pattern.
To be able to create pattern nodes dynamically, and to decouple the creation mechanism from the list of known node types, a special factory mechanism is used. Each of these basic node types must register an instance of a class implementing the Creatable interface in a PatternNodeFactory. When a new pattern node is needed, the pattern node factory goes through the known list of creatable nodes, and asks each Creatable in turn: do you know how to handle this part of the pattern? If the answer is yes, a new instance of an actual PatternNode is created, and the pattern node is then asked to add the pattern to itself. Otherwise the next Creatable is asked. If no Creatable that knows how to handle the current pattern constituent is found, an exception is thrown.
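
A minimal, self-contained sketch of this mechanism is shown below. The names Creatable, PatternNodeFactory and PatternNode are taken from the text, but the method names canHandle() and create() are assumptions made purely for this illustration:

import java.util.ArrayList;
import java.util.List;

// Sketch of the pattern node factory: ask each registered Creatable in turn.
public class FactorySketch {
  interface PatternNode { /* would expose add(pattern) and match(matchState) */ }

  interface Creatable {
    boolean canHandle(String pattern, int pos); // "do you know how to handle this part?"
    PatternNode create();
  }

  static class PatternNodeFactory {
    private final List<Creatable> creatables = new ArrayList<>();

    void register(Creatable c) { creatables.add(c); }

    PatternNode newNode(String pattern, int pos) {
      for (Creatable c : creatables) {
        if (c.canHandle(pattern, pos)) return c.create(); // first willing Creatable wins
      }
      throw new IllegalArgumentException("No node type can handle: " + pattern.substring(pos));
    }
  }

  public static void main(String[] args) {
    PatternNodeFactory factory = new PatternNodeFactory();
    factory.register(new Creatable() { // handles wildcards
      public boolean canHandle(String p, int i) { return p.charAt(i) == '*' || p.charAt(i) == '_'; }
      public PatternNode create() { return new PatternNode() {}; }
    });
    factory.register(new Creatable() { // handles everything else as literal text
      public boolean canHandle(String p, int i) { return true; }
      public PatternNode create() { return new PatternNode() {}; }
    });
    PatternNode node = factory.newNode("HELLO *", 6); // position 6 is the '*' wildcard
    System.out.println(node != null);                 // true - handled by the wildcard Creatable
  }
}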
A PatternNodeFactory is specific to a PatternBehaviour. Each context can not only be assigned a different behaviour, but each pattern behaviour can also have a different pattern node factory. Most of the time, users of the library don't care about the actual underlying implementation of the pattern node tree, and can just call the getDefaultBehaviour() method, or globally override it using setDefaultBehaviour().
I have implemented two concrete behaviour classes, both of which extend the PatternBehaviour. The first one, CompactPatternBehaviour, is the default behaviour. It uses an incrementally created compact automaton that implements character based matching using a host of specialized node classes.
For comparison with other implementations, I have also implemented a naive AIMLPatternBehaviour that tries to conform to the GraphMaster algorithm recommended by the AIML specification (and implemented, for example, by Program D). It uses a single heavyweight node implementation that is able to handle both wildcards and words, and also supplies a default pattern for contexts that we don't care about.
For evaluating memory requirements, the PatternBehaviour also contains a map factory method that returns an empty map instance, based on a prototypical instance.
4.3.1 Compact pattern node types
Implementing matching using an optimized compact character based automaton is one of the main goals of this thesis. The CompactPatternBehaviour registers several specialized node types in its pattern node factory. There are three basic node types that directly correspond to the parts of a pattern:

- StringNode, which represents a continuous part of a pattern that doesn't contain wildcards,
- WildcardNode, which represents a single wildcard, and
- EndOfStringNode, which represents the end of a pattern that matches the end of a string.
Each of these basic creatable nodes implements the simplest possible behaviour and structure, but, most importantly, doesn't support multiple branches (there is only a single next node). This is enough to add and match a single pattern. But once more patterns are added, the structure needs to be changed, edges split and branches created. To facilitate this, there are two more specific node types.
The first is a StringBranchNode. This is created instead of the original node every time a StringNode is asked to add a string that has a different prefix. The current implementation of the StringBranchNode uses a map that branches using the first character (by default, CompactPatternBehaviour supplies a HashMap).
The second special node is a BranchNode. This implements the heart of the AIML matching algorithm: the ordered traversal of different pattern constituents. It is created every time a string is added to a WildcardNode, or a wildcard to a StringNode. It doesn't match any characters from the input, but branches out to an underscore WildcardNode, an exact match node (either a StringNode or a StringBranchNode) and a star WildcardNode.
Adding patterns to PatternNodes is straightforward for the most part (either the pattern is the same, in which case we continue adding the rest of the pattern to the child node, or it is different, in which case we substitute a branch node). Algorithm 4.1 describes in more detail the process used to add a pattern to a StringNode.
Algorithm 4.1: Adding a pattern to an already existing StringNode
Input: A string pattern denoting an AIML pattern and an existing pattern tree with a StringNode at its root
Output: A new root of a pattern tree that matches the same patterns as the original pattern tree and the new pattern, and the leaf
begin
    if pattern is an empty string then
        prepend a new EndOfStringNode to the tree
        return the new tree
    end
    if pattern starts with a wildcard then
        prepend a new BranchNode to the tree
        add the pattern to the new tree
        return the new root
    end
    calculate the longest common prefix of this node's string and the pattern
    if the prefix is equal to this node's string then
        remove the prefix from the pattern
        add the rest of the pattern to this node's child node
        return this root
    else if there is no common prefix then
        prepend a new StringBranchNode to the tree
        add the pattern to the new tree
        return the new root
    else if there is a common prefix shorter than this node's string then
        remove the prefix from the current string
        prepend a new StringNode that matches the prefix
        return the new root
    end
end
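
The same case analysis can be expressed compactly in Java. The sketch below is self-contained and deliberately simplified: it only reports which case of Algorithm 4.1 applies, instead of building real node objects, and the method names are mine.

// Simplified illustration of the case analysis in Algorithm 4.1.
public class AddPatternSketch {
  static int commonPrefix(String a, String b) {
    int i = 0;
    while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
    return i;
  }

  // Decide how an existing StringNode holding 'nodeString' absorbs 'pattern'.
  static String addToStringNode(String nodeString, String pattern) {
    if (pattern.isEmpty()) return "prepend EndOfStringNode";
    if (pattern.charAt(0) == '*' || pattern.charAt(0) == '_') return "prepend BranchNode";
    int p = commonPrefix(nodeString, pattern);
    if (p == nodeString.length()) return "recurse into child with \"" + pattern.substring(p) + "\"";
    if (p == 0) return "prepend StringBranchNode";
    return "split: new StringNode for prefix \"" + nodeString.substring(0, p) + "\"";
  }

  public static void main(String[] args) {
    System.out.println(addToStringNode("HELLO", "HELP ME"));     // split: prefix "HEL"
    System.out.println(addToStringNode("HELLO", "HELLO THERE")); // recurse into child
    System.out.println(addToStringNode("HELLO", "* HELLO"));     // prepend BranchNode
  }
}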
4.3.2 Naive pattern nodes
The simple AIMLPatternBehaviour uses a single node type for everything, implemented by the AIMLNode class. This class is word based and contains a single map for all branches (both wildcards and words).
Every single node contains a map, and there is no compaction of nodes that have only a single child. This, together with the fact that unspecified contexts must use a default pattern, results in a large amount of nodes. Using a simple HashMap for each node regardless of the actual number of branches (which may very well be 0) quickly results in large memory consumption. Because of this, Program D uses special map wrappers. All wrappers have an initial capacity (ranging from 0 to 3) and defer creating an actual map instance until this initial capacity is exceeded.
By virtue of the GPL license under which both my implementation and Program D are published, I have taken these 4 classes (NonOptimalNodemaster, OneOptimalNodemaster, TwoOptimalNodemaster and ThreeOptimalNodemaster) from Program D, and adapted them so they could be used as a map for any branch nodes.
4.4 Classification
Classification refers to the process of simulating the NFA used to describe the set of pattern sequences, finding a best match and binding values from the input to wildcards.
All important information during matching of a single input sequence is maintained in the MatchState class, which is passed around as a parameter during the recursive depth-first search. Apart from encapsulating the inputs of different contexts, it also keeps track of the current position in the input string of the currently processed context (the depth) and maintains an array of wildcards. Wildcards are handled lazily, and until the actual value of a wildcard is requested (either during matching, or while evaluating the template), they only store a pair of indices into the input string. This avoids costly string manipulation during matching.
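
The following self-contained sketch illustrates the lazy wildcard idea; the class is mine and only mirrors the behaviour described above (the real bookkeeping lives inside MatchState):

// Only a pair of indices into the input is stored while matching; the substring
// is materialized the first time the wildcard value is actually requested.
public class LazyWildcard {
  private final String input;
  private final int start;
  private int end;          // grows as the wildcard is extended
  private String value;     // cached once materialized

  LazyWildcard(String input, int start) { this.input = input; this.start = start; this.end = start; }

  void grow() { end++; value = null; }   // extend by one character, invalidate cache

  String value() {
    if (value == null) value = input.substring(start, end); // costly work deferred until needed
    return value;
  }

  public static void main(String[] args) {
    LazyWildcard w = new LazyWildcard("HELLO THERE", 6);
    for (int i = 0; i < 5; i++) w.grow();   // no string copies while matching
    System.out.println(w.value());          // THERE
  }
}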
Each context node first updates the match state by telling it that a new context has been entered. After this, it tries to match itself to the current input (provided by the match state). In the case of a pattern context, this means trying to match its pattern node subtree. If the context node fails to match, it updates the match state again by telling it to leave the current context, and passes matching to the next higher context present. This continues until matching arrives at a LeafContextNode, in which case matching has been successful and the match state is updated with the result, or there are no more contexts to try.
Each PatternNode has its own function. The compact representation tries to make each node as simple as possible: each node is independent from other nodes. The only interaction occurs via the match state, where nodes update the depth accordingly.
4.4.1 Compact pattern node matching
Most compact pattern nodes are highly specialized, and after successfully matching, they either pass control to a child pattern node, or fail. The exceptions to this rule are nodes that can function as accepting states for a single pattern, and pass control to the next context node.
If a StringNode matches successfully, it first checks if it has reached the end of the input, in which case it passes matching to the next context. This avoids creating special EndOfString nodes in cases where we know we already are at the end.
WildcardNodes also check if they have matched the rest of the input. In addition, if a wildcard doesn't have any child pattern nodes, it automatically matches the rest of the input, instead of incrementally growing the wildcard and passing the result to an EndOfString node.
Normally, ends of a string are handled by one of the above node classes. The EndOfStringNode is therefore used rarely, as a child of nodes that don't support matching the end of a string (like branches, or for matching an empty pattern).
The following is a summary of compact pattern node matching behaviour.
BranchNode: A BranchNode doesn't advance the depth at all. Instead, it passes control to the pattern nodes stored in its underscore, string and star branches.
EndOfStringNode: If the whole input has already been matched, the EndOfStringNode succeeds with the next context. But in contrast to other nodes, if it fails to match the end of a string, matching can still continue with child pattern nodes.
StringBranchNode: A StringBranchNode takes a single character from the input string, and looks if it can find it in a map of child nodes. If it is successful, it advances the depth by 1, and passes control to the child node found in the map.
StringNode: This node tests a string to find out if it is a prefix of the remaining input. If successful, it advances the depth by the length of the string and passes control to its child pattern node or the next context node, depending on the rest of the input.
WildcardNode: The WildcardNode is used for both kinds of wildcards. It starts by telling the match state to create a new wildcard. It then grows the wildcard by 1 character, updates the depth accordingly and passes control to its child pattern node or the context node, depending on the depth. The wildcard is grown by 1 character until the end of input is reached (a minimal sketch of this loop follows below).
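
The sketch below is a stand-alone illustration of the wildcard loop; the names are mine and the continuation predicate stands in for matching the child pattern node or the next context node at a given depth:

// Grow the wildcard by one character at a time and try the rest of the match
// after each step; because shorter bindings are tried first, wildcards are lazy.
import java.util.function.BiPredicate;

public class WildcardLoopSketch {
  // 'rest' stands in for "match the child pattern node / next context from depth d".
  static boolean matchWildcard(String input, int depth, BiPredicate<String, Integer> rest) {
    for (int end = depth + 1; end <= input.length(); end++) {
      // wildcard currently bound to input.substring(depth, end)
      if (rest.test(input, end)) return true;   // shortest binding that works wins
    }
    return false;                               // no binding allowed the rest to match
  }

  public static void main(String[] args) {
    String input = "HOW ARE YOU";
    // Continuation: the remaining pattern is the literal " YOU" ending the input.
    boolean ok = matchWildcard(input, 0, (s, d) -> s.startsWith(" YOU", d) && d + 4 == s.length());
    System.out.println(ok); // true, wildcard bound to "HOW ARE"
  }
}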
The most time-consuming part of matching is processing the wildcard. Apart from optimizing trailing wildcards (which has been implemented), there are further optimization possibilities. One is keeping track of the minimal remaining height of the subtree. This would allow pruning the search in wildcard nodes as soon as the length of the remaining input is shorter than the minimum remaining height of the trie (the height is counted in terms of matched input symbols, not nodes).
A second possible optimization would be to employ look-ahead. Instead of blindly growing the wildcard by 1 character and hoping for the best, a wildcard node could ask its child if it matches a finite set of characters and advance accordingly. Because the AIML specification requires word-based matching, almost all existing categories use wildcards separated with spaces, and look-ahead could be tailored specifically for this case.
The reason that these two optimizations haven't been implemented is the fact that (unlike the trailing wildcard optimization) they are global optimizations that require cooperation between different node types and make them dependent on each other. It also complicates the node API, whose purpose was to be very simple and easily extensible.
4.4.2 Naive pattern node matching
In contrast to the multitude of compact pattern nodes, there is only a single AIMLNode. While creating the trie is simple, matching is more complicated because everything must be done in a single function, as shown in algorithms 4.2 and 4.3.
Having all functionality in a single function means that the system isn't extensible by simply adding a new node type and registering it in a pattern node factory. Instead, the matching function has to be modified accordingly, and the class recompiled.
Implementing the trailing wildcard optimization is a necessity, because when using the AIMLPatternBehaviour, every possible context has a pattern associated (if it isn't specified explicitly, a wildcard is used).
The advantage of using a word-based approach is the speed of wildcard matching. Even if inputs aren't preprocessed, skipping to the start of the next word is very quick; this is something that the character based implementation can't do easily.
Algorithm 4.2: Matching an AIMLNode
begin
    if we are at the end of the input then
        return the result of matching the next context
    end
    if the map contains the string "_" then
        if matching the _ wildcard is successful then
            return success
        end
    end
    get the next word from the input
    if the map contains the word then
        advance the depth by the size of the word
        get the child node from the map
        if matching the child node is successful then
            return success
        else
            restore the original depth
        end
    end
    if the map contains the string "*" then
        if matching the * wildcard is successful then
            return success
        end
    end
    return failure
end

Algorithm 4.3: Matching a wildcard in an AIMLNode
begin
    add a wildcard binding to the match state
    get the child node representing this wildcard
    if the child node has no children then
        grow the wildcard to the rest of the input
    else
        add the first word to the wildcard
    end
    update the depth
    if matching the child node is successful then
        return success
    end
    while there is still some input left do
        add the next word to the wildcard
        update the depth
        if matching the child node is successful then
            return success
        end
    end
    remove the wildcard binding
    restore the original depth
    return failure
end
4.5 Interpretation
The template script is represented by an abstract syntax tree in memory. After the tree has been created, it can be executed again and again without having to repeatedly parse the template. While this may seem like a non-issue, several implementations (like Program D, Program O or Program P) store the template code as a string and interpret it during parsing.
Evaluation of Script nodes is very straightforward: each script node implements an evaluate() method that has a MatchState as its parameter, through which it can get information about the whole system. A few basic syntactic structures can be identified in the template syntax, and the nodes correspond to these structures.
Empty elements. These elements never have any content and their result is defined solely by their name and attributes. All of them subclass the EmptyElement class. Examples are the BotElement, which returns a constant defined for a bot, and StarElement, which is used to retrieve the values bound to wildcards.
Simple elements. All of these SimpleElements can have any script as content. Many of them first evaluate the inner script, and process the result afterwards. Examples of these are the SetElement, used to set variables to a value, or the ThinkElement, used to evaluate a script but suppress its output.
Other simple elements first evaluate a condition and return the evaluated contents depending on the result. The If class checks to see if the value of a variable is equal to a string. The GetElement tries to return the value of a variable, but if that variable is not set, evaluates and returns the inner script.
Complex elements. Complex elements don't follow a uniform syntax. They contain elements with their own semantics (for example, case statements in the Switch node or random alternatives from RandomElement).
Text. Instead of providing a dedicated output element like <say>, AIML uses mixed content. Any characters that are not part of element markup are automatically output. This is handled by the TextElement class.
In addition to nodes that have a direct correspondence to XML elements, there are two helper node classes. The first one is a Block and is used to group other script nodes and evaluate them sequentially. There is no explicit block syntax in AIML; instead, any non-empty element denotes a block of mixed character and element content by virtue of its start and end tag.
The second is an EmptyScript, which does nothing. An empty script is an optimization that allows collapsing parts of the AST that have no effect on the output and no side effects. For example, raw text inside a think element, empty condition branches or empty transformation elements (trying to uppercase an empty string is pointless).
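
As an illustration of how such an AST evaluates, the following self-contained sketch mimics a Text element, a Block and a Think element. It is not the thesis code: the real aiml.script.Script interface also defines parse() and evaluates against a MatchState, for which a plain String stands in here.

// Illustrative miniature of the template AST: Script is reduced to a single
// evaluate() method and a String replaces the MatchState parameter.
public class ScriptAstSketch {
  interface Script { String evaluate(String matchState); }

  // Raw character data between tags (cf. TextElement).
  static Script text(String t) { return ms -> t; }

  // Groups child nodes and evaluates them sequentially (cf. Block).
  static Script block(Script... children) {
    return ms -> {
      StringBuilder sb = new StringBuilder();
      for (Script child : children) sb.append(child.evaluate(ms));
      return sb.toString();
    };
  }

  // Evaluates its content but suppresses the output (cf. ThinkElement).
  static Script think(Script inner) {
    return ms -> { inner.evaluate(ms); return ""; };
  }

  public static void main(String[] args) {
    Script template = block(text("Hello, "), think(text("ignored")), text("world."));
    System.out.println(template.evaluate("dummy match state")); // Hello, world.
  }
}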
Chapter 5
Testing
In this chapter, I will evaluate the matching algorithm.
5.1 Shadowed categories
The benchmarks are based on a test for shadowed categories. A shadowed category is a category identified by a sequence of patterns that can never be matched, because there is always a pattern with higher priority.
Example 5.1.1. A few example sequences that overshadow other sequences. Any input that matches the second pattern will also match the first pattern, but the first pattern will always have a higher priority.

[input] ELIZA   [input]ARE YOU A ELIZA
[input] YOU KNOW   [input]DO YOU KNOW ANY OTHER SONGS
[input]IS HE TOO   [input]IS HE YOUR FATHER TOO
[input]YOU A LOT   [input]YOU ARE ASKING A LOT
[input] [that] NATION   [input]I DO NOT KNOW[that] NATION ON EARTH
Testing for overshadowed categories is built on the simple premise that there has to exist at least one input that is matched to its pattern sequence. For patterns that don't contain wildcards, this is trivial: the input is the same as the pattern.
Wildcards match strings of one or more symbols from the input alphabet. To construct an input that matches such a wildcard, we only need to substitute all wildcards with strings that match it (but no other pattern). To prevent creating an input that accidentally matches a different pattern, these strings should not contain symbols from the input alphabet.
For exact results, we need to keep in mind that an input that matches strings made from wildcards also matches a single wildcard. In this case shorter matches are preferred (see Ex. 3.2.4). The easiest way to prevent this from happening is always creating a minimal input that matches a certain pattern, so that each wildcard matches only a single symbol.
On the other hand, matching wildcards involves a lot of backtracking and probably constitutes the bottleneck of the system, and the more symbols a wildcard has to match, the more work the algorithm has to do. For this reason, I have chosen to create longer than minimal inputs. While such an approach may result in false positives for patterns that differ only in the number of wildcards, I believe it more accurately reflects real world performance.
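
The input-generation step can be sketched as follows; the placeholder character and the word-count distribution are arbitrary choices for this illustration and do not reproduce the exact generator used in the experiments:

import java.util.Random;

// Replace every wildcard in a pattern by a run of placeholder words built from
// characters that do not occur in any pattern, so the generated input cannot
// accidentally match a different literal pattern.
public class ShadowTestInputSketch {
  static String generateInput(String pattern, Random rnd) {
    StringBuilder out = new StringBuilder();
    for (char c : pattern.toCharArray()) {
      if (c == '*' || c == '_') {
        int words = 1 + rnd.nextInt(3);               // longer-than-minimal binding
        for (int w = 0; w < words; w++) {
          if (w > 0) out.append(' ');
          out.append("\u00A7\u00A7\u00A7");           // placeholder symbols outside the pattern alphabet
        }
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(generateInput("DO YOU KNOW *", new Random(42)));
  }
}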
5.2 Experiments
The first experiment was designed to show memory consumption. The interpreter was tasked with loading the complete AAA AIML set (including templates) into memory. The AAA set represents a very typical AIML bot, and it can be assumed that many botmasters use the AAA pattern set as a base for developing their own bot.
For the purpose of these experiments, I have implemented a naive version of the matching algorithm that uses a single heavyweight node implementation to store branches and perform matching. This version of the algorithm stays true to the specification in several respects:

- Matching is performed on a word basis.
- Contexts that are not explicitly defined use a single *-wildcard as a default pattern.
The Java programming language offers 2 basic implementations of maps to store key-value pairs. The HashMap class uses a chained hash table which guarantees a near O(1) access time, but has a large overhead by allocating a number of entry slots in advance. Given a load factor of k, the overhead is always between 1/k and 2/k, and the minimum number of slots is 16. The other implementation is the TreeMap, backed by a red-black tree. It doesn't have such a large overhead, but has an O(log n) access time. A third implementation, LinkedHashMap, enhances the HashMap with a linked list of entries that allows iteration in order of insertion.
Because of the large overhead of hash maps, Program D implements 4 variants of optimized nodes, called Nodemappers. Each node has its own capacity n = 0…3 and allocates a LinkedHashMap after this capacity is exceeded. I wanted to make a comparison with the approach used in Program D possible. I took the node mapper classes from Program D and converted them into wrappers for a plain map. This made it possible to switch map implementations without having to create 8 different node implementations.
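
The idea behind these wrappers can be illustrated with a one-entry variant; this is not the Program D code, merely a sketch of the scheme (the real Nodemasters use capacities 0 to 3):

import java.util.HashMap;
import java.util.Map;

// The first key/value pair is stored in two fields; a real Map is allocated
// only when a second, different key arrives.
public class OneEntryMapWrapper<V> {
  private String key;            // inline storage for the single entry
  private V value;
  private Map<String, V> map;    // allocated lazily

  public void put(String k, V v) {
    if (map != null) { map.put(k, v); return; }
    if (key == null || key.equals(k)) { key = k; value = v; return; }
    map = new HashMap<>();       // capacity exceeded: spill both entries into a map
    map.put(key, value);
    map.put(k, v);
    key = null; value = null;
  }

  public V get(String k) {
    if (map != null) return map.get(k);
    return k.equals(key) ? value : null;
  }

  public static void main(String[] args) {
    OneEntryMapWrapper<Integer> w = new OneEntryMapWrapper<>();
    w.put("A", 1);               // no map allocated yet
    w.put("B", 2);               // now a HashMap is created
    System.out.println(w.get("A") + " " + w.get("B")); // 1 2
  }
}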
Unfortunately, these wrapper classes all have their own memory footprint (which is not simple to estimate, because Java doesn't define the size of an object reference and doesn't provide a sizeof operator). To compensate for this, I implemented a set of dummy nodes that could be used for estimating the overhead of the wrapper classes.
For evaluating matching performance, I first extracted the patterns from the AAA AIML set and stored them separately. I then created two additional sets. One is a subset of the original set and contains only patterns without wildcards. The other is the original set of patterns where all wildcards have been substituted by a single word.
Based on these pattern sets, I created a random set of inputs, where each wildcard was randomly transformed into a string of words of different length. The number of words and characters is based on a normal distribution N(1, 5²). Unspecified contexts were treated the same as wildcards, and a random input was generated for them.
5.3 Results
The structure of the matching tree depends on the implementation of nodes, but not on the internal implementation of the map. Table 5.1 shows a comparison of a naive word-based tree and my own compact character based automaton. The number of categories is the number of loaded pattern sequences. The number of self-loops corresponds to the amount of wildcards in all patterns. The number in the Maps column contains the amount of nodes that internally use a generic map, which is used to calculate the overhead of the wrapper.

Algorithm   Categories   Nodes    Maps     Loops
Naive       45234        381631   247149   108728
Compact     45234        133144   14710    19787

Table 5.1: Comparison of trees created from the AAA set
The compact representation is clearly better: it uses about a third of the nodes the naive implementation uses, even though it is character based. The large amount of self-loops for the naive implementation can be explained by the fact that most categories only define the [input] context, but the naive implementation has to use a default *-wildcard pattern for the two remaining contexts, [that] and [topic].
Table 5.3 summarizes the memory requirements of the automatons, with regard to the underlying map implementation. The Key column provides information about the type of keys used for the map. The implementation of wrappers is a little limited, because it allows only String keys. The Overhead column specifies the amount of branches that have space allocated for them but are unused (for example, the size of unused buckets in the hash table, or the own capacity of the wrappers after they start using the map). The MiB column shows how many MiB the runtime reported as in use, with corrections applied for using the wrapper classes (Tab. 5.2). For each map implementation, the results are ordered by decreasing memory consumption of the naive implementation. The # column shows which wrapper is the best for a particular implementation, and its overall rank.

Capacity   0      1      2      3
Overhead   24 B   40 B   48 B   56 B

Table 5.2: Amount of overhead (in B) of wrapper level per node
Map          Key      Wrap      Naive                      Compact
                                Overhead    MiB     #      Overhead   MiB    #
LinkedHash   Char     none      N/A         N/A            199570     26.4
             String   none      3839491     80.9           199570     28.1
             String   0         1781939     52.8           199570     26.8
             String   2         661699      32.8           97380      25.3
             String   1         485815      31.6           214280     26.9
             String   3         894463      31.4    5.     74830      24.0   2.
Hash         Char     none      N/A         N/A            198562     25.6
             String   none      3838963     71.2           198562     28.1
             String   0         1781411     45.5           198562     26.0
             String   3         893935      33.7           73822      24.6   3.
             String   2         661171      31.6           96372      24.7
             String   1         485287      31.4    2.     213272     26.3
Tree         Char     none      N/A         N/A            5892       26.0
             String   none      0           54.8           5892       27.6
             String   0         0           38.5           5892       24.9
             String   3         863375      31.2           36280      25.1
             String   2         618144      31.1           26538      23.9   1.
             String   1         375746      29.5    1.     20602      25.8

Table 5.3: Memory requirements depending on the used map (in MiB)

The results clearly show the advantages of a compact implementation. By only storing branches when necessary, memory consumption stays more or less the same, regardless of the underlying map or wrapper. The theoretical overhead doesn't seem to be correlated with actual memory usage.
For the naive implementation, the situation looks different. Automatically allocating a map for every node (as is the case without using a wrapper) has a severe negative effect on memory consumption. Using a specialised implementation that allocates a map only after its initial capacity is exceeded has clear benefits: about 46 to 61%, depending on the underlying map implementation. What comes as a surprise is the fact that (for the AAA set) the wrapper's own capacity doesn't make much difference, provided it's at least 1. The positive benefits of not allocating a map seem to be offset by the larger overhead of maintaining a small set of own slots. It is possible to try to determine the optimal capacity globally, but when using dedicated branch nodes, the results are always better.
The three sets of patterns and inputs used for testing the speed of matching are summarized
in Table 5.4. This table also contains the total number of nodes that have been traversed by
each implementation (either by entering the node normally, or via a self-loop).
Set            Patterns   Words     Characters   Nodes traversed (Naive)   Nodes traversed (Compact)
aaa            45234      559728    3642658      3552361                   4641103
aaa-wc-subst   45234      449367    3166426      839824                    772761
aaa-no-wc      16837      160615    1078038      297933                    269371

Table 5.4: Properties of pattern sets used for benchmarking
Using different map implementations had only marginal impact on matching performance. What proved to be a much more important aspect was pattern normalization, as is shown in Table 5.5 (times are in seconds).
The original AIML specification only specifies matching to be case-insensitive. But for many languages (including Czech), it is often beneficial to match without diacritical marks as well. For this to work, a string is first converted to Unicode NFD, where diacritical marks are represented by separate code points. The diacritical marks are then removed, the string converted to uppercase and converted back to NFC.
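
This pipeline can be expressed in a few lines with java.text.Normalizer (available since Java 6; the thesis implementation targets Java 5 and may therefore use a different API), for example:

import java.text.Normalizer;
import java.util.Locale;

// Decompose to NFD, strip combining diacritical marks, uppercase, recompose to NFC.
public class DiacriticsNormalizer {
  static String normalize(String s) {
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
    String stripped = decomposed.replaceAll("\\p{M}+", ""); // remove diacritical marks
    return Normalizer.normalize(stripped.toUpperCase(Locale.ROOT), Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    System.out.println(normalize("Příliš žluťoučký kůň")); // PRILIS ZLUTOUCKY KUN
  }
}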
Removing diacritics is a complicated and lengthy process, and in the original implementation each node performed its own normalization. When backtracking, this meant that each character from the input would be normalized many, many times. One possibility was normalizing all inputs before matching a particular pattern started. Another was normalizing on demand, the first time a node requested the input.
From the results it is clear that input normalization has dramatic effects on the time spent matching. Without preprocessing, matching took over three minutes. Preprocessing allowed the compact automaton to match in a time that is on par with the much simpler word-based implementation, despite being character based (which means about 6 times more input symbols).
Interesting to note is the fact that for the naive implementation, processing time for pattern sets without wildcards actually went up. Without any preprocessing, the heavyweight node implementation normalized only the first word of each input for don't-care patterns, and after failing to match it, it was able to skip the rest of the input because of the trailing wildcard optimization. Using preprocessing, the whole input is always normalized (and because uninteresting inputs can't be skipped, all inputs will be normalized, so lazy normalization doesn't have any positive effect).
Set            Normalization   Naive (s)   Compact (s)
aaa            every node      16.63       202.27
aaa            up-front         7.58         7.82
aaa            lazy             7.68         7.13
aaa-subst-wc   every node       3.83         8.35
aaa-subst-wc   up-front         6.23         6.05
aaa-subst-wc   lazy             6.90         2.38
aaa-no-wc      every node       1.17         2.67
aaa-no-wc      up-front         2.27         2.60
aaa-no-wc      lazy             2.32         1.07

Table 5.5: Matching speed (in seconds)
A sophisticated strategy that combines normalizing parts of the input on-demand with a
cache and a mapping between the original and the normalized input could probably provide
further speed-ups, but the author feels that it is outside the scope of this thesis.
Chapter 6
Conclusion
The main goal of this thesis was to implement an interpreter of the AIML programming language with character based pattern matching, using a compact trie with sparse pattern sequences, and compare it to the prevalent approach of using an ordinary word based trie with fixed pattern sequences.
First, we gave a brief overview of the AIML language and its specification, and the role of pattern matching in the dispatch of functions and methods.
We have then examined the problem of AIML pattern matching with wildcards. We have shown how this optimization problem can be defined and how it relates to other pattern matching problems. In order to solve this optimization problem in a way that complies with the AIML specification for pattern matching, we have applied principles from automata theory, and created a non-deterministic Mealy automaton for matching finite sets of sequences of patterns.
The output from the Mealy automaton assigns each pattern sequence a value which can be used to pick the best match. By examining the properties of these outputs, we derived a set of rules that make it possible to extend the patterns with more wildcards, but we also showed the limits of extension. We then described the principles on which pattern sequences can easily be abstracted and compacted, while still retaining the same pattern ordering.
The fourth chapter describes our implementation of an AIML interpreter. Insight gained from the above analysis allowed us to implement a flexible and extensible system for matching sparse sequences of abstract patterns. Two implementations for concrete AIML patterns were created. One is compact, uses several highly specialized node types and performs character based matching. The other uses a simpler, word based approach for comparison. Having both of these implementations use the same framework and be subject to the same overhead, it was possible to provide a direct comparison of their relative performance.
We have shown with experiments involving the Annotated A.L.I.C.E. AIML set that a character based approach is indeed viable. The resulting implementation consumes less memory and, given the right amount of preprocessing, is able to outperform a word based implementation.
What the experiments have also shown is that the system is very susceptible to the amount of preprocessing needed to ensure case and diacritic insensitive matching. The effects of badly implemented preprocessing far outweigh any positive speed gains resulting from a smaller and more compact representation.
6.1 Further research
Often, pattern matching problems are defined using simple and relatively small alphabets of symbols. In real world applications, programs often need to work with Unicode, which is a quite large and somewhat complex alphabet. Implementing case-insensitive string comparison using the ASCII alphabet is trivial and fast, but implementing similar insensitive matching when dealing with Unicode can quickly ruin the performance of the most carefully designed system (not to mention the fact that it is locale dependent). The Unicode standard anticipates the need for different levels of insensitivity by providing different collation levels in its Unicode Collation Algorithm [uca]. Unfortunately for AIML, the UCA is not guaranteed to be reversible, which means that the possibility of properly binding wildcards must be carefully evaluated.
With regard to the data structures used, it would be interesting to see parts of the trie represented by a PATRICIA index, instead of a combination of string nodes, branch nodes and HashMaps. Contrary to popular belief, which describes PATRICIA tries as compact tries with edges labelled with sequences of characters rather than with single characters [Wik09], true PATRICIA indexes as described in [Mor68] actually employ a lossy form of compression and store only offsets in edges, and are able to retrieve values based on prefixes of keys.
6.1.1 Pattern to pattern matching
From a practical standpoint of an AIML author, it would be advantageous to be able to search the set of patterns using patterns. Given a pattern P, pattern-to-pattern matching is the problem of finding the set of patterns Q such that the languages matched by P and Q have a non-empty intersection.
6.1.2 Visualisation of AIML sets
One large problem of AIML is the sheer volume of categories. With 40000+ categories per bot, it is very hard to keep track of everything and keep it consistent. Displaying the set of categories as a tree (similar to the one used for matching) is certainly a possibility, but this groups patterns only by a common prefix. Much more interesting would be to use the metrics defined by pattern order and display the patterns in a map (for example, using a statistically significant set of actual user inputs).
Bibliography

[aim05] Artificial Intelligence Markup Language - language specification (working draft). http://www.alicebot.org/TR/2005/WD-aiml/ Accessed May 20, 2009, 2005.

[BMS80] R. M. Burstall, D. B. MacQueen, and D. T. Sannella. Hope: An experimental applicative language. In LFP '80: Proceedings of the 1980 ACM conference on LISP and functional programming, pages 136-143, New York, NY, USA, 1980. ACM. http://homepages.inf.ed.ac.uk/dts/pub/hope.pdf Accessed May 20, 2009.

[Bus06] N. Bush. Program D Release Notes. http://files.aitools.org/programd/docs/release-notes.html Accessed May 20, 2009, Mar. 2006.

[Bus08] N. Bush. Program D on Launchpad (message to mailing list). http://www.nabble.com/Program-D-on-Launchpad-td21115511.html Accessed May 20, 2009, Dec. 2008.

[c207] Portland Pattern Repository's wiki: Pattern Matching. http://c2.com/cgi/wiki?PatternMatching Accessed May 20, 2009, May 2007.

[erl] Erlang Reference Manual: Pattern Matching. http://erlang.org/doc/reference_manual/patterns.html Accessed May 20, 2009.

[Fre60] E. Fredkin. Trie memory. Commun. ACM, 3(9):490-499, 1960.

[fsh] The F# 1.9.6 Draft Language Specification - Pattern Matching Expressions and Functions. http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/manual/spec2.aspx#_Toc207785630 Accessed May 20, 2009.

[hkl] A Gentle Introduction to Haskell: Patterns. http://www.haskell.org/tutorial/patterns.html Accessed May 20, 2009.

[Hod99] J. Hodgson. Project Contraintes Prolog Web Pages: Unification. http://pauillac.inria.fr/~deransar/prolog/unification.html Accessed May 20, 2009, Jan. 1999.

[jsr03] JSR 173: Streaming API for XML. http://jcp.org/en/jsr/detail?id=173 Accessed May 20, 2009, 2003.

[Lau00] V. Laurikari. NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions. String Processing and Information Retrieval, International Symposium on, 0:181, 2000.

[loe] Home Page of The Loebner Prize in Artificial Intelligence. http://www.loebner.net/Prizef/loebner-prize.html Accessed May 20, 2009.

[McB70] F. McBride. Computer aided manipulation of symbols. PhD thesis, Queen's University of Belfast, 1970.

[Mel03] B. Melichar. Jazyky a překlady. ČVUT, 2nd edition, 2003. In Czech.

[MH97] B. Melichar and J. Holub. 6D Classification of Pattern Matching Problems. In J. Holub, editor, Proceedings of the Prague Stringology Club Workshop '97, pages 24-32. Czech Technical University in Prague, Prague, July 1997.

[MHP05] B. Melichar, J. Holub, and T. Polcar. Text searching algorithms. http://www.stringology.org/athens/ Accessed May 20, 2009, Nov. 2005.

[MM04] C. McBride and J. McKinna. The view from the left. J. Funct. Program., 14(1):69-111, 2004.

[Mor68] D. R. Morrison. PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM, 15(4):514-534, 1968.

[MZ97] V. Mařík and Z. Zdráhal. Expertní systémy. In V. Mařík, O. Štěpánková, and J. Lažanský, editors, Umělá inteligence (2), chapter 1, pages 15-74. Academia, Praha, Czech Republic, 1997. In Czech.

[Ode09] M. Odersky. The Scala Language Specification 2.7. http://www.scala-lang.org/docu/files/ScalaReference.pdf Retrieved on May 20, 2009, Mar. 2009.

[pan03] Pandorabots - A Common Lisp-based Software Robot Hosting System. http://www.pandorabots.com/pandora/pics/pandorabotsinjapan.ppt Accessed May 20, 2009, May 2003.

[Roe04] J. Roewen. Re: [alicebot-developer] context tag (message to mailing list). http://list.alicebot.org/pipermail/alicebot-developer/2004-April/001767.html Accessed May 20, 2009, Apr. 2004.

[sxp] Package javax.xml.stream. http://java.sun.com/javase/6/docs/api/javax/xml/stream/package-summary.html Accessed May 20, 2009.

[Tur50] A. M. Turing. Computing machinery and intelligence. MIND, 59:433-460, Oct. 1950.

[uca] Unicode collation algorithm. http://unicode.org/reports/tr10/ Accessed May 20, 2009.

[Wal] R. S. Wallace. Pandorabots Embrace and Extend. http://www.alicebot.org/Embrace.html Accessed May 20, 2009.

[Wei66] J. Weizenbaum. ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM, 9(1):36-45, 1966.

[Wik09] Wikipedia. Radix tree - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Radix_tree&oldid=288105297 Accessed May 21, 2009, 2009.

[xpp06] XML Pull Parsing. http://www.xmlpull.org/, 2006.
Appendix A
Category markup language syntax
This section describes the markup used to describe categories. It uses a simple notation,
where + means one or more occurences, zero or more occurences, [elements] means zero
or more elements and [ is alternation between two elements.
aiml Attributes: version; Contents: [category [ topic [ contextgroup]; The root element of
every AIML le. The attribute version species the version or dialect of AIML used
in this le. Files that do not adhere to the standard specication should provide their
own version identier.
category Contents: [pattern] [that] context template; Groups a template and contexts
that apply only to this template.
context Attributes: name; Content: mixed; Specifies a context for the current category or context group.
contextgroup Content: context+ category+; The contextgroup element provides a way to
group a set of categories using a common context. This is a generalized version of the
topic element.
pattern Content: pattern; Specifies the pattern for the input context inside a category.
template Content: script; Contains the response template.
that Content: pattern; Specifies the pattern for the that context inside a category.
topic Attributes: name; Content: [category | contextgroup]; A simple way to specify a contextgroup with the topic context.
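To illustrate, the following sketch shows a small file using the elements above. The patterns, the context names (mood, topic) and the template texts are invented for illustration only and are not taken from the AIML sets shipped with the interpreter.

<aiml version="1.0">
  <category>
    <pattern>HELLO</pattern>
    <template>Hi there!</template>
  </category>
  <category>
    <pattern>YES</pattern>
    <that>DO YOU LIKE MOVIES</that>
    <context name="mood">HAPPY</context>
    <template>Great, so do I.</template>
  </category>
  <contextgroup>
    <context name="topic">MOVIES</context>
    <category>
      <pattern>I LIKE *</pattern>
      <template>Is that your favourite film?</template>
    </category>
  </contextgroup>
  <topic name="MUSIC">
    <category>
      <pattern>WHAT DO YOU LISTEN TO</pattern>
      <template>Mostly jazz.</template>
    </category>
  </topic>
</aiml>

The second category attaches an additional context directly to a single category, while the contextgroup and topic elements apply a common context to all categories they contain.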
Appendix B
Template markup language syntax
This section provides a simple reference to all implemented template markup tags. The
notation used is the same as in Appendix A.
bot Attributes: name; Content: Empty; Return the bot property specified in the name attribute.
condition Three different types of conditions (if, if-elif-else, switch/case) are differentiated by the use of different attributes and element content.
date Content: Empty; Return the current date.
formal Content: mixed; Evaluate the contents and return a string with each word capitalized.
gender Content: mixed; Evaluate the contents and apply gender substitution. Same as
subst with name set to gender.
get Attributes: name; Content: mixed; If the variable specified in the name attribute is set, return its contents. Otherwise, return the result of evaluating the contents.
id Attributes: none; Content: Empty; Return the current user ID.
input Return the original user input.
lowercase Content: mixed; Evaluate the contents and return a string with all letters
lowercase.
person Content: mixed; Evaluate the contents and convert from first to third person (and vice versa). Same as subst with name set to person.
person2 Content: mixed; Evaluate the contents and convert from first to second person (and vice versa). Same as subst with name set to person2.
random Content: list; Randomly evaluate and return the contents of one list item.
sentence Content: mixed; Evaluate the contents and return a string with the first letter of each sentence capitalized.
set Attributes: name; Content: mixed; Evaluate the content and set the value of the variable
specified in the name attribute to the result. If the contents are empty or evaluate to
an empty string, unset the variable.
size Attributes: none; Content: none; Return the number of known categories for the
current bot.
sr Attributes: none; Content: Empty; Perform a call to the classifier, using the contents of the first wildcard from the input context as the new value for the input context. Same as <srai><star/></srai>.
srai Attributes: none; Content: mixed; Evaluate the contents, use the result as a new value
for the input context and perform classication.
star Attributes: [context], [index]; Contents: Empty; Return the value that is currently bound to a wildcard. The attribute context specifies which context, and index specifies the number of the wildcard (one-based). If not specified, context defaults to input and index defaults to 1.
subst Attributes: name; Contents: mixed; Evaluate the contents and apply the substitution
list specified in the name attribute. See also the gender, person and person2 elements.
that Attributes: [index]; Access sentences from previous responses of the bot, in reverse
order of history. The attribute index is 1-based, where 1 represents the bot response
before the current user input. Optionally, a second index may be provided, where 1
represents the last sentence, 2 the second-to-last sentence. If not specified, the index
attribute defaults to 1,1.
thatstar Attributes: [index]; Contents: Empty; Access wildcards bound to the that context.
Index defaults to 1. Same as supplying that as a parameter for the context attribute
of the star element.
think Contents: mixed; Evaluate the contents, but return an empty string.
topicstar Attributes: [index]; Contents: Empty; Access wildcards bound to the topic
context. Index defaults to 1. Same as supplying topic as a parameter for the context
attribute of the star element.
uppercase Content: mixed; Evaluate the contents and return a string with all letters
uppercase.
version Content: Empty; Returns a string identifying the current version of the interpreter.
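As a combined illustration of several of these tags, the following template is a sketch only: the variable names mood, lastinput and name, the condition values and the response texts are invented, and the exact condition attribute forms (a name attribute on the condition element with value attributes on the list items) follow common AIML usage rather than a definition given in this appendix.

<template>
  <think><set name="lastinput"><star/></set></think>
  <condition name="mood">
    <li value="happy">Glad to hear it, <get name="name">my friend</get>.</li>
    <li value="sad">I am sorry to hear that.</li>
    <li><srai>DEFAULT MOOD REPLY</srai></li>
  </condition>
</template>

The list of li elements with value attributes corresponds to the switch/case form of condition; think hides the side effect of set from the response, get falls back to evaluating its contents (the literal text my friend) when the variable is not set, and star returns the text bound to the first wildcard of the input pattern.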
Appendix C
A list of abbreviations
AAA Annotated A.L.I.C.E. AIML set
AI Artificial Intelligence
AIM AOL Instant Messenger
AIML Artificial Intelligence Markup Language
AJAX Asynchronous Javascript And XML
API Application Programming Interface
ASCII American Standard Code for Information Interchange
AST Abstract Syntax Tree
DOS Microsoft Disk Operating System
GPL General Public License
GUI Graphical User Interface
HTML Hypertext Markup Language
IRC Internet Relay Chat
J2EE Java 2 Enterprise Edition
JDK Java Software Development Kit
MiB Mebibyte (1 MiB = 2^20 B)
NFA Nondeterministic Finite Automaton
NFC Unicode Normalization Form C (Canonical Decomposition, followed by Canonical
Composition)
NFD Unicode Normalization Form D (Canonical Decomposition)
PHP The PHP programming language
UCA Unicode Collation Algorithm
XML eXtensible Markup Language
Appendix D
Contents of the CD
/
|-- interpreter/ The directory containing the implementation
| |
| |-- aiml/ AIML Files
| | |-- aaa/ The AAA set
| | |-- cloze/ Implementation of a random cloze using standard AIML
| | |-- cz/ Simple Czech bot
| | |-- example/ Simple English bot
| | -- utils/ Miscellaneous AIML utility classes
| |
| |-- classes/ Compiled binaries
| |-- doc/ Generated javadoc
| |-- experiments/ Experimental data
| |-- lib/ Third party libraries
| |-- src/ Source files
| |-- test/ Source files for unit tests
| |-- tests/ data for unit tests
| |
| |-- aaa.xml startup file for the AAA set
| |-- bot.xml startup file for the simple English bot
| |-- czbot.xml startup file for the simple Czech bot
| |-- sentencerules.txt sentence splitting rules
| |-- start.bat batch file to run the interpreter
| -- test.xml
|
|-- text/ The diploma thesis
| |-- src/ Source files for the thesis
| | |-- graph/ Source files for automatically generated graphs
| -- aimlthesis.pdf The printable thesis
-- readme.txt Basic information