The length of the string |x| is the number of symbols of x. A repeated string will be written using integer exponents: a^0 = ε, a^1 = a, a^2 = aa, ..., for a ∈ Σ, and x^0 = ε, x^1 = x, x^2 = xx, ..., for x ∈ Σ*. A prefix of a string x is any string y such that x = yz, z ∈ Σ*.
To match a sequence of patterns against a sequence of strings, we first construct a sequence of bounded patterns, P′_i = C_i P_i. These patterns can then be
unambiguously concatenated. The presence of boundary markers that are not part of the original alphabet in the pattern sequence means that the resulting pattern sequence must either be matched against a text that also contains these boundary markers, or the markers must be disregarded during matching. AIML chooses the former approach.
Definition 3.1.5. (Matching a sequence of AIML patterns)
Given a sequence of strings S_1, S_2, ..., S_s, a sequence of pattern strings P_1, P_2, ..., P_s (that may contain wildcards) and a set of boundary markers C = {C_1, C_2, ..., C_s}, matching means determining that the pattern P = C_1 P_1 C_2 P_2 ... C_s P_s matches the string S = C_1 S_1 C_2 S_2 ... C_s S_s.
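As a sketch of this reduction (class and method names are my own, not from the thesis), the bounded pattern and the bounded input can both be produced by the same concatenation helper, after which matching a sequence is just matching two single strings:

```java
import java.util.Arrays;
import java.util.List;

public class BoundedConcat {

    // Builds C1 X1 C2 X2 ... Cs Xs from boundary markers C and parts X.
    // Applied to the patterns it yields P, applied to the input strings it
    // yields S, so a sequence match reduces to a single-string match
    // (definition 3.1.5). Illustrative helper, not the thesis code.
    static String bound(List<String> markers, List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            sb.append(markers.get(i)).append(parts.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> markers = Arrays.asList("[input]", "[that]", "[topic]");
        System.out.println(bound(markers, Arrays.asList("YES", "*", "*")));
    }
}
```

Because the markers are not part of the original alphabet, the concatenation is unambiguous: a marker can never be confused with pattern or input content.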
It is of course possible to match a finite set of sequences of AIML patterns, but special care must be taken that all the pattern sequences remain compatible with the input sequence.
¹ The idea that * or . matches any file, and therefore a single wildcard should also match any input, has led to inconsistencies in the specification that were described earlier in chapter 2.4.
16 CHAPTER 3. THE CLASSIFICATION ALGORITHM
This means that all pattern sequences must contain the same number of patterns as there
are strings in the input sequence.
Until now, the problem of pattern matching has been examined as a decision problem. If
pattern matching is to be used to retrieve categories that are described by pattern sequences
(classify an input into a category), we need to know not only which patterns (or sequences)
match a given input (or input sequence), but also which pattern (sequence) describes a
particular input (sequence) best.
Definition 3.1.6. (AIML classification problem)
Given an input sequence S = S_1, S_2, ..., S_s and a finite set of pattern sequences 𝒫 = {P_1, P_2, ..., P_n}, where each pattern sequence P_i consists of s patterns for 1 ≤ i ≤ n, the problem of classifying the input as a single pattern sequence P_x means finding the pattern sequence with the best value of a measure function m among those that also match the input. More formally:
let M = {P ∈ 𝒫 : P matches S},
then the input S is classified as the pattern P_x : m(S, P_x) = max{m(S, P) : P ∈ M}.
What exactly the measure function is and how it can be computed is explained in the following chapters.
3.2 Finite state automata
Finite state automata are a convenient formalism to describe algorithms used for pattern matching. In this chapter, I will describe how to construct a nondeterministic finite automaton that can be used to determine the set of all patterns that match a particular input. After describing the way inputs are classified using a deterministic simulation of the NFA, I will extend the transitions of the NFA in such a way that it will be possible to pick the best matching pattern from the final states. The initial definitions in this chapter are again taken from [MHP05].
Definition 3.2.1. (Nondeterministic finite state automaton with ε-transitions)
A nondeterministic finite automaton (NFA) is a quintuple M = (Q, Σ, δ, q_0, F), where
Q is a finite set of states,
Σ is a finite input alphabet,
δ is a mapping from Q × (Σ ∪ {ε}) into the set of subsets of Q,
q_0 ∈ Q is an initial state and
F ⊆ Q is the set of final states.
Definition 3.2.2. (Configuration of FA)
Let M = (Q, Σ, δ, q_0, F) be a finite state automaton. A pair (q, w) ∈ Q × Σ* is a configuration of the finite state automaton M.
Definition 3.2.3. (Transition in NFA with ε-transitions)
Let M = (Q, Σ, δ, q_0, F) be a nondeterministic finite automaton with ε-transitions. A relation ⊢_M ⊆ (Q × Σ*) × (Q × Σ*) such that (q, aw) ⊢_M (p, w) for p ∈ δ(q, a), a ∈ Σ ∪ {ε}, w ∈ Σ*, is called a transition in the automaton M.
Definition 3.2.4. (Language accepted by NFA)
String w ∈ Σ* is accepted by the NFA M if (q_0, w) ⊢*_M (q, ε) for some q ∈ F. The language accepted by M is the set L(M) = {w ∈ Σ* : (q_0, w) ⊢*_M (q, ε), q ∈ F}.

Figure 3.1: The SFOEWO automaton to match a single pattern, p_1 p_2 p_3 p_4
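Definitions 3.2.1 to 3.2.4 translate directly into a small simulator. The sketch below (my own illustration, not code from the thesis) stores δ as nested maps and decides acceptance by tracking the ε-closure of the set of active states:

```java
import java.util.*;

// A minimal NFA with epsilon-transitions, following the quintuple
// M = (Q, Sigma, delta, q0, F) of definition 3.2.1. States are integers.
public class Nfa {
    final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();
    final Map<Integer, Set<Integer>> epsilon = new HashMap<>();
    final int q0;
    final Set<Integer> finals;

    Nfa(int q0, Set<Integer> finals) { this.q0 = q0; this.finals = finals; }

    void addTransition(int from, char symbol, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(symbol, k -> new HashSet<>()).add(to);
    }

    void addEpsilon(int from, int to) {
        epsilon.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // Expand a set of states with everything reachable via epsilon-transitions.
    private Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> work = new ArrayDeque<>(states);
        Set<Integer> result = new HashSet<>(states);
        while (!work.isEmpty()) {
            for (int next : epsilon.getOrDefault(work.pop(), Set.of())) {
                if (result.add(next)) work.push(next);
            }
        }
        return result;
    }

    // Deterministic simulation: w is accepted iff a final state is active
    // after all of w has been consumed (definition 3.2.4).
    boolean accepts(String w) {
        Set<Integer> active = closure(Set.of(q0));
        for (char c : w.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int q : active) {
                next.addAll(delta.getOrDefault(q, Map.of()).getOrDefault(c, Set.of()));
            }
            active = closure(next);
        }
        active.retainAll(finals);
        return !active.isEmpty();
    }
}
```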
A universal algorithm for matching a finite set of patterns has also been described in [MHP05]. First, individual automata for each pattern are constructed. These are then combined to create an NFA that accepts the union of all languages accepted by the individual automata. There are several ways this union can be constructed. One approach is to create a new initial state with ε-transitions leading to the initial states of each of the original automata. Another approach is to create a new automaton that is the result of simulating each individual automaton in parallel [Mel03]. The latter approach has the advantage that common prefixes share states and the resulting automaton is smaller. This is an important trait that is also exploited by search tries [Fre60] and PATRICIA indexes [Mor68]. Algorithm 3.2 is a version of the universal SFFEWO construction algorithm, which I have modified to take this into account.
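The shared-prefix union of step 2 below is exactly what an incrementally built trie produces: inserting each pattern symbol by symbol reuses the states of every previously inserted pattern with the same prefix. A minimal sketch (illustrative names, not the thesis code):

```java
import java.util.*;

// A trie whose nodes double as automaton states; patterns with a common
// prefix automatically share states, as in step 2 of algorithm 3.2.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    boolean accepting;
}

class PatternTrie {
    final TrieNode root = new TrieNode();

    void add(String pattern) {
        TrieNode node = root;
        for (char c : pattern.toCharArray()) {
            // reuse an existing child state when the prefix is shared
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.accepting = true;
    }

    int stateCount() { return count(root); }

    private int count(TrieNode n) {
        int total = 1;
        for (TrieNode child : n.children.values()) total += count(child);
        return total;
    }
}
```

Inserting RTUV and then RT adds no new states for RT at all; it only marks an existing state as accepting.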
Algorithm 3.2: Construction of a SFFEWO automaton with shared prefixes
Input: A set of patterns with a specification of the way of matching P = {P_1(w_1), P_2(w_2), ..., P_r(w_r)}, where P_i are patterns and w_i are specifications of the ways of matching them, for 1 ≤ i ≤ r.
Output: The SFFEWO automaton.
Method:
1. Construct an NFA for each pattern P_i, 1 ≤ i ≤ r, taking into account the matching specification w_i.
2. Create an NFA for the language which is the union of all input languages of the automata constructed in step 1, in such a way that common prefixes share states. The resulting automaton is the SFFEWO automaton.
Example 3.2.2. An example of an automaton that matches the set of patterns P = {BCD, RTUV, RT, IJK, , TUV} is shown in Fig. 3.2.
At this point, I won't go into detail with regards to the construction of a SFFEWF automaton (one that matches sequences of patterns with wildcards). According to definition 3.1.5, this problem can be solved by converting a sequence of patterns into a single pattern. The only thing to note here is that, in addition to the number of patterns in each sequence being the same as the number of strings in the input sequence, the set of boundary markers must also be shared between all pattern sequences, as defined in def. 3.1.6.
Let's examine the question of how matching is performed. The AIML specification explains the matching algorithm using a trie structure called the Graphmaster. An in-order depth-first search with backtracking is performed. The order in which child nodes are examined is as follows.
Figure 3.2: The SFFEWO automaton created to match the set of patterns from example 3.2.2
1. The underscore wildcard is tried first, with progressively shorter suffixes of the input string.
2. An exact match is tried second.
3. The star wildcard is tried third; it is matched the same way as the underscore.
4. If there are no more words in the input and the current node in the trie contains a reference to a template, the search terminates.
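The order above determines which trie branch is tried first; the backtracking itself, for a single pattern, can be sketched word by word as follows (my own sketch, assuming both wildcards match one or more whole words, as in AIML):

```java
import java.util.List;

public class WildcardMatch {
    // Does a word-based pattern (literal words plus "_" and "*") match the
    // input words? A wildcard first consumes a single word and then
    // progressively more, backtracking over the split point, which mirrors
    // the depth-first search described above. Sketch only, not thesis code.
    static boolean matches(List<String> pattern, List<String> input) {
        return match(pattern, 0, input, 0);
    }

    private static boolean match(List<String> p, int pi, List<String> in, int ii) {
        if (pi == p.size()) return ii == in.size();
        String tok = p.get(pi);
        if (tok.equals("_") || tok.equals("*")) {
            // the wildcard must consume at least one word
            for (int k = ii + 1; k <= in.size(); k++) {
                if (match(p, pi + 1, in, k)) return true;
            }
            return false;
        }
        return ii < in.size() && tok.equals(in.get(ii)) && match(p, pi + 1, in, ii + 1);
    }
}
```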
Due to the way the SFFEWO automaton has been created using algorithm 3.2, its structure is nearly identical to that of a Graphmaster search trie. Also, the way the Graphmaster trie is searched corresponds to a deterministic simulation of an NFA using a depth-first algorithm with backtracking; when the order in which transitions are examined corresponds to the priorities of wildcards, the first accepting configuration (q, ε) with q ∈ F that is reached is also the best match.
An enhancement of NFAs with transition priorities and special tags that remember the position in the input string when a transition is taken has been proposed in [Lau00]. This paper also discusses the determinization of such automata and their application in regular expression matching. In my thesis I want to provide a more abstract view of the whole process and its generalization to matching arbitrary patterns, while keeping it compatible with the basic algorithm.
I have previously defined the AIML pattern classification problem as an optimization problem, where each pattern is assigned a metric that defines its optimality. I will now explain how to obtain this metric from the NFA, using a transducer.
A transducer is a finite automaton that translates an input string into an output string. Two basic types of transducers are Moore transducers (where the symbols are output in states) and Mealy transducers, which output a symbol on transitioning. Often both of these automata are deterministic: there can always be only one valid transition, and they produce a single output. A non-deterministic transducer can produce multiple outputs, one for each final state. This means that the output needs to be carried by each active state (or configuration), and the transition function needs to be modified accordingly.
How does a non-deterministic Mealy transducer need to be constructed (what should the output be for each transition) so that we get the desired optimization function as the output? The key lies in the way the automaton would be simulated using a depth-first search. Each transition in a Mealy transducer adds a symbol to the output string. If we want these strings to act as an optimization metric for the classification problem, all the output strings for a given input string must be totally ordered. The easiest way to accomplish this is to use a totally ordered alphabet and compare the output strings lexicographically.
Definition 3.2.5. (Lexicographical order of strings)
Given an ordered alphabet Σ and two strings A = a_1 a_2 ... a_n, B = b_1 b_2 ... b_n, where A, B ∈ Σ*, we say A < B if either A is a proper prefix of B, or there is an index i such that a_j = b_j for all j < i and a_i < b_i.

An exact match of a symbol a matches only the string s = a; its order is o_a(s) = ⟨a⟩.
An underscore wildcard matches any string S ∈ Σ+. The order for each string is
o__(S) = ⟨_⟩⟨0⟩^(|S|−1).
A star wildcard matches any string S ∈ Σ+. The order for each string is
o_*(S) = ⟨*⟩⟨0⟩^(|S|−1).
The order of symbols in Σ_o = {⟨_⟩, ⟨a⟩, ⟨*⟩, ⟨0⟩} is
⟨_⟩ > ⟨a⟩ > ⟨*⟩ > ⟨0⟩
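A sketch of these ordering functions (the encoding of each output symbol ⟨_⟩, ⟨a⟩, ⟨*⟩, ⟨0⟩ as a single character, and the class and method names, are my own):

```java
public class WildcardOrder {
    // Output symbols by priority, highest first: <_> > <a> > <*> > <0>.
    static final String PRIORITY = "_a*0";

    static int rank(char c) { return PRIORITY.indexOf(c); }

    // o__(S) = <_><0>^(|S|-1), o_*(S) = <*><0>^(|S|-1): the wildcard symbol
    // followed by one <0> per additional matched input symbol.
    static String orderOf(char wildcard, int matchedLength) {
        return wildcard + "0".repeat(matchedLength - 1);
    }

    // Lexicographic comparison under the custom symbol order; a negative
    // result means x is the better (higher-priority) output string.
    static int compare(String x, String y) {
        for (int i = 0; i < Math.min(x.length(), y.length()); i++) {
            int d = rank(x.charAt(i)) - rank(y.charAt(i));
            if (d != 0) return d;
        }
        return x.length() - y.length();
    }
}
```

For example, an underscore consuming three symbols beats a star consuming three symbols, because the comparison is decided by the first output symbol.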
Example 3.3.1. Let's try to add a new wildcard ?, that matches zero or one symbols from the input (S ∈ Σ ∪ {ε}). Let it have the following ordering function:
o_?(S) = ⟨?⟩⟨0⟩^(|S|) for S ∈ Σ ∪ {ε}
and modify the output alphabet:
Σ_o′ = Σ_o ∪ {⟨?⟩}
⟨_⟩ > ⟨a⟩ > ⟨*⟩ > ⟨?⟩ > ⟨0⟩
This satisfies rule 1, because the output always starts with a ⟨?⟩ symbol (which wasn't even present in the original alphabet).
Rule 2 is also satisfied. If it were not, we would have an input uv that matches a single ? wildcard, and moreover
o_?(uv) = o_?(u) o_p(v).
Because the wildcard matches only a single symbol or an empty string, it follows that uv ∈ Σ ∪ {ε} and that u = ε or v = ε. If uv = ε then
o_?(ε) = o_?(ε) o_p(v)
⟨?⟩ = ⟨?⟩ o_p(v)
and we need to find a pattern constituent p that matches an empty string. The only such constituent is ?, but that would lead to a contradiction:
o_?(ε) = o_?(ε) o_?(ε)
⟨?⟩ = ⟨?⟩⟨?⟩.
If only u = ε and uv = v, then we have
o_?(v) = o_?(ε) o_p(v)
⟨?⟩⟨0⟩ = ⟨?⟩ o_p(v),
and the only way this could be true is if o_p(v) = ⟨0⟩ (we don't have any ordering function that outputs ⟨0⟩). If v = ε and uv = u, then
o_?(u) = o_?(u) o_p(ε)
⟨?⟩⟨0⟩ = ⟨?⟩⟨0⟩ o_p(ε),
and we would need an ordering function that outputs an empty string.
Rule 3 is trivially true because o_?(v) = o_?(ε)⟨0⟩.
Showing that rule 4 is satisfied is similar to rule 3. The most critical case here is the pattern ?? that matches a single character x. I can easily show that
o_?(x) o_?(ε) ≠ o_?(ε) o_?(x)
⟨?⟩⟨0⟩⟨?⟩ ≠ ⟨?⟩⟨?⟩⟨0⟩.
An often requested feature inspired by regular expressions is a pattern constituent that
matches a set of strings (or words). This is easy to implement for a single pattern, but the
following example shows that this pattern constituent is not without problems.
Example 3.3.2. Let's try to add a new pattern constituent that allows us to match a set of words. For example, the alternation (AA|AB) matches either the string aa or the string ab. For any string that a particular alternation matches, let its order (analogous to the other constituents) be
o_(S_1|S_2|...|S_n)(S) = ⟨|⟩⟨0⟩^(|S|−1) for S ∈ {S_1, S_2, ..., S_n}
Σ_o′ = Σ_o ∪ {⟨|⟩}
⟨_⟩ > ⟨a⟩ > ⟨|⟩ > ⟨*⟩ > ⟨0⟩.
Unfortunately, such an order doesn't even satisfy the first rule. For example, two different alternations (A|B) and (A|C) both match the input a and both have the same order ⟨|⟩. There are several possible solutions to this:
• Only allow a predefined set of alternations, each with a manually specified order.
• Assign each alternation a random order.
• Restrict the set of patterns and allow only non-overlapping alternations (the languages that are matched by each alternation must have an empty intersection).
• Incrementally construct the set of patterns and check each new pattern for conflicts with other patterns.
3.3. A SET-BASED DESCRIPTION 25
Each solution has its benefits and drawbacks. When adding such a pattern constituent, the developer has to consciously pick one option and weigh its pros and cons.
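The last two options can be combined into a simple incremental conflict check: when a new alternation is registered, verify that the set of strings it matches is disjoint from every previously registered alternation. A hypothetical sketch (not part of the thesis implementation):

```java
import java.util.*;

// Registry that rejects a new alternation whose matched language overlaps
// an existing one, so every alternation keeps a distinguishable order.
public class AlternationRegistry {
    private final List<Set<String>> alternations = new ArrayList<>();

    void add(Set<String> alternatives) {
        for (Set<String> existing : alternations) {
            Set<String> overlap = new HashSet<>(existing);
            overlap.retainAll(alternatives);
            if (!overlap.isEmpty()) {
                throw new IllegalArgumentException("conflicts on " + overlap);
            }
        }
        alternations.add(alternatives);
    }
}
```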
3.3.1 Generalization and optimization
I've shown how each individual pattern constituent works and how its order contributes to the order of a particular mapping between parts of the input and pattern constituents. The whole system can be generalized. Instead of looking at each constituent of the pattern separately, we can think of each pattern as a single complex pattern constituent. Each input string becomes a single complex input symbol, and the output string becomes a complex output symbol. This changes neither the set of accepted inputs nor their order. Going back to finite automata, this transforms the whole automaton into a single large branch. From the point of view of the deterministic simulation of the automaton, almost nothing has changed: when at a branch, the individual complex pattern constituents are still tried depth-first and in-order, with backtracking. How the order of the complex constituents is computed is irrelevant; the ordering function is a black box (unlike the original algorithm, where each branch had a pre-defined order). An example of such an automaton is shown in Fig. 3.5.
Figure 3.5: Example of an NFA that uses whole patterns as symbols
Such an abstraction of the internals of each branch allows us to visualise larger patterns and also pattern sequences. Some optimization possibilities will also become apparent. And while the ordering function has been carefully derived from the original pattern matching algorithm used in AIML, some patterns in the sequence can use a completely different ordering.
In AIML, all categories are represented by a sequence of 3 patterns, each belonging to a certain context (input, that or topic). In many cases, we don't care about the inputs from all 3 contexts; in these cases a default pattern is used (the input sequence is modified accordingly, so it doesn't contain empty strings). The boundary markers correspond to the names of the contexts: [input], [that] and [topic].
Example 3.3.3. Examples of pattern sequences from the AAA set:
P_1 = [input]YES[that]DOES IT PAY WELL[topic]
P_2 = [input]YES[that] YOU A STUDENT[topic]
P_3 = [input]YES[that][topic]
P_4 = [input] [that]HOW MANY COINS DO YOU WISH TO BET 1 10[topic]BLACKJACK
P_5 = [input] ABOUT ALICE[that][topic]
P_6 = [input]SEVERAL WHO ARE [that][topic]
P_7 = [input]THE IS THE BEST [that][topic]
P_8 = [input]I DO NOT LIKE AT ALL[that][topic]
P_9 = [input]I DO NOT LIKE ANY [that][topic]
P_10 = [input]YOU ARE SERIOUSLY [that][topic]
P_11 = [input][that]MY NAME IS ALICE WHAT IS YOURS[topic]
P_12 = [input][that][topic]
The general structure of an automaton matching these patterns is shown in Fig. 3.6. Final states have been labeled with the pattern they belong to.
Of the 45244 categories in the AAA set, only 56 categories explicitly use the topic context (and most of these were used for a simple wordplay game). The that context is used by 1389 (about 3% of all categories). But all categories must use the default pattern if the topic or that isn't explicitly defined. Many implementations strictly follow the specification and don't apply any optimizations. This means that even though we don't care about contexts other than the input, all contexts must be matched and bound to wildcards.
One simple optimization that can be applied is optimizing trailing wildcards. If the wildcard is the last constituent of a pattern (and there are no other patterns), the trailing wildcard can simply match the rest of the input instead of processing the input one symbol at a time. This is certainly a useful approach, but it still imposes a penalty that increases almost linearly as the number of contexts (the number of patterns in a sequence) increases.
Because we don't actually care about the value of undefined contexts, we can remove those patterns and their respective boundary markers from the sequence altogether. This will not affect the matching order, because boundary markers also have transition priorities, and matching the second pattern from a sequence can't start earlier than matching the first pattern. The problem of the structure of the input sequence no longer corresponding to the structure of the automaton can be solved by setting the current input when transitioning to a boundary marker.
What happens if we don't care about any pattern? In that case, the initial state of the automaton would also become a final state. This isn't acceptable, and it is necessary to introduce another boundary marker that signifies the end of a pattern sequence. The specification uses a [template] boundary marker, which has the lowest priority of all.
Example 3.3.4. A compact version of the automaton from example 3.3.3 is shown in Fig. 3.7. Boundary markers have been replaced with ε-transitions; instead, the input is changed in the nodes marked C_n. Labeling of final states is the same as in Fig. 3.6.
Figure 3.6: An NFA that matches sequences of patterns from Ex. 3.3.3
Figure 3.7: A compact NFA that matches sequences of patterns from Ex. 3.3.4
Chapter 4
Implementation
The project is written in Java 5 and has been developed over a period of several years. One of the main goals was to create a library of core AIML functions, so that it would be possible to easily create an actual working AIML interpreter with a customizable feature set and minimum overhead. The provided interpreter, demo.InterpreterDemo, tries to provide an example of how to set up and use the different classes that make up an AIML interpreter. The most important of the core classes is the pattern matching engine, aiml.classifier.Classifier.
4.1 Data flow overview
This section provides an overview of the main interpreter loop, starting with input from the
user and ending in a response from the bot.
AIML interpretation starts with a user's input. The first stage of preprocessing applies input substitutions, splits the input into sentences, strips punctuation and trims whitespace from each sentence.
For each input sentence, the interpreter updates the environment (an instance of aiml.environment.Environment) and starts the classification algorithm, which is implemented by the aiml.classifier.Classifier class.
To provide predictable behaviour for recursion, the classifier first creates a snapshot of the current context: the values from the environment that take part in matching. These values, along with information about the state of wildcards in each context, are stored in the aiml.classifier.MatchState class.
Because the Classifier supports arbitrary contexts (that might have various different data sources), it is not advisable to hardcode context value retrieval into the environment itself and access it directly. Instead, a double dispatch approach is used. Each individual context, represented by an instance of aiml.context.Context, knows how to retrieve its associated data from the environment.
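The double dispatch idea can be sketched as follows; the class and method names are simplified stand-ins for the real aiml.context.Context and aiml.environment.Environment classes:

```java
// Each context subclass, not the caller, knows which environment accessor
// supplies its value, so the Classifier stays decoupled from data sources.
interface Environment {
    String getInput();
    String getThat();
}

abstract class Context {
    final String name;
    Context(String name) { this.name = name; }
    // first dispatch: caller picks the context; second dispatch: the
    // context picks the environment accessor
    abstract String getValue(Environment e);
}

class InputContext extends Context {
    InputContext() { super("input"); }
    String getValue(Environment e) { return e.getInput(); }
}

class ThatContext extends Context {
    ThatContext() { super("that"); }
    String getValue(Environment e) { return e.getThat(); }
}
```

Adding a new context (for example one backed by a predicate store) then only requires a new Context subclass; no matching code changes.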
After creating the MatchState, control is passed to the root node of the context trie and the actual matching can take place (this is discussed in detail in 4.4). Upon successfully matching a category, the resulting template script is contained in the MatchState, which is returned to the interpreter.
The interpreter then evaluates the script using the current environment. This turns out to be quite easy, because the source code of the templates has already been converted to an abstract syntax tree by the parser.
4.2 Loading the data
Several different data formats are used to create a fully operational AIML interpreter. At the core lie individual AIML files containing categories and response templates; the syntax of these files is based on the AIML specification, but with further enhancements and modifications.
A set of AIML files is loaded as a part of a Bot. The syntax of bot files is not part of the AIML specification, but it is adapted from the XML syntax used by Program D.
All files are processed using a top-down recursive descent parser, built on a custom implementation of the XML Pull Parser API [xpp06]. This API was chosen at the beginning of the implementation; since 2003, there exists a standard XML streaming API [jsr03], which is bundled with JDK 1.6 [sxp]. The implementation has not been updated to use the new standardized API.
The recursive descent parser that loads bot XML files is implemented as a part of the aiml.bot.Bot class. Parsing AIML files can be logically separated into parsing category markup (aiml.parser.AIMLParser) and parsing template markup.
The main role of the category markup parser is keeping track of the current pattern sequence, i.e. which contexts have an associated pattern or restriction placed on them. Contexts can be arbitrarily nested and grouped, but once a context is defined for the current group, it may not be overridden.
Due to the large number of template tags, and to provide the possibility to extend the system, templates are parsed dynamically. Every template element implements the aiml.script.Script interface, which defines two methods: parse(), which returns the root node of the resulting abstract syntax tree, and evaluate(), which returns the result of evaluating the tree in the current environment. Element nodes are registered in the aiml.script.ElementParserFactory class (which also includes nodes that handle character data and a fallback element handler for unknown elements).
After the AIMLParser has finished parsing a category, the resulting pattern sequence and script tree are added to the Classifier.
4.3 Creating the trie
There are several ways to represent a matching automaton or a trie data structure in memory. In this implementation, I have chosen to use an object based tree. Each node is represented by an instance of a node class, and contains references to its child nodes. Different nodes can be represented by different classes. There are several reasons for this:
• The trie is heterogeneous; different nodes are matched differently.
• It is easy to add new nodes implementing different matching strategies.
• It is easier to implement a single matching behaviour in a single node than to implement the matching behaviour of all different node types in one large method. Also, class polymorphism can be taken advantage of.
• The graph contains cycles (for wildcards), and it is easier to track state when taking advantage of recursion and the built-in call stack.
• Instead of using a generic data structure for storing child nodes, each node implementation can decide how to store its children.
At the highest level, the trie consists of nodes which are instances of the ContextNode class. They manage the overall sequence of patterns and implement skipping of contexts that are not defined. It is assumed (but not enforced in any way by the current implementation) that instances of context nodes implement the AIML matching algorithm as described in chapter 3: that they sequentially iterate over all child context nodes that match the value of the current context and return the first successful result. The exact algorithm by which a ContextNode iterates over child context nodes is left up to the implementations of ContextNode subclasses. In case matching a context fails, the ContextNode passes control to the following context.
There is a special terminal context node, represented by the LeafContextNode class, that is automatically created whenever the pattern sequence reaches its end.
Each pattern in the sequence knows which context it belongs to. This information is used by the ContextNodes to maintain order: unless the pattern's context is equal, either a new context node has to be prepended, or the pattern is added to the following context in line. New context nodes are never created directly. Instead, each context has a MatchingBehaviour that provides a factory method returning a new instance of a ContextNode appropriate for the context.
For this thesis, I have implemented the PatternBehaviour class, which is used for matching AIML patterns with wildcards. When asked for a context node, it returns an instance of the PatternContextNode class. Internally, PatternContextNode contains a trie of PatternNodes. There can be many different types of nodes, each having a specific function and being able to handle a certain part of a pattern.
To be able to create pattern nodes dynamically, and to decouple the creation mechanism from the list of known node types, a special factory mechanism is used. Each of these basic node types must register an instance of a class implementing the Creatable interface in a PatternNodeFactory. When a new pattern node is needed, the pattern node factory goes through the known list of creatable nodes and asks each Creatable in turn: do you know how to handle this part of the pattern? If the answer is yes, a new instance of an actual PatternNode is created, and the pattern node is then asked to add the pattern to itself. Otherwise the next Creatable is asked. If no Creatable that knows how to handle the current pattern constituent is found, an exception is thrown.
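The factory loop just described can be sketched like this (simplified stand-ins for the real Creatable and PatternNodeFactory types):

```java
import java.util.*;

// Each registered Creatable is asked in turn whether it understands the
// head of the pattern; the first one that does creates the node.
interface Creatable {
    boolean canHandle(String pattern);
    Object create(String pattern); // returns a new pattern node (simplified)
}

class PatternNodeFactory {
    private final List<Creatable> creatables = new ArrayList<>();

    void register(Creatable c) { creatables.add(c); }

    Object getInstance(String pattern) {
        for (Creatable c : creatables) {
            if (c.canHandle(pattern)) return c.create(pattern);
        }
        // no registered node type understands this pattern constituent
        throw new IllegalStateException("unhandled pattern: " + pattern);
    }
}
```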
A PatternNodeFactory is specific to a PatternBehaviour. Each context can not only be assigned a different behaviour, but each pattern behaviour can also have a different pattern node factory. Most of the time, users of the library don't care about the actual underlying implementation of the pattern node tree, and can just call the getDefaultBehaviour() method, or globally override it using setDefaultBehaviour().
I have implemented two concrete behaviour classes, both of which extend the PatternBehaviour. The first one, CompactPatternBehaviour, is the default behaviour. It uses an incrementally created compact automaton that implements character based matching using a host of specialized node classes.
For comparison with other implementations, I have also implemented a naive AIMLPatternBehaviour that tries to conform to the GraphMaster algorithm recommended by the AIML specification (and implemented, for example, by Program D). It uses a single heavyweight node implementation that is able to handle both wildcards and words, and also supplies a default pattern for contexts that we don't care about.
For evaluating memory requirements, the PatternBehaviour also contains a map factory method that returns an empty map instance, using a prototypical instance.
4.3.1 Compact pattern node types
Implementing matching using an optimized compact character based automaton is one of the main goals of this thesis. The CompactPatternBehaviour registers several specialized node types in its pattern node factory. There are three basic node types that directly correspond to the parts of a pattern:
• StringNode, which represents a continuous part of a pattern that doesn't contain wildcards,
• WildcardNode, which represents a single wildcard, and
• EndOfStringNode, which represents the end of a pattern that matches the end of a string.
Each of these basic creatable nodes implements the simplest possible behaviour and structure, but, most importantly, doesn't support multiple branches (there is only a single next node). This is enough to add and match a single pattern. But once more patterns are added, the structure needs to be changed, edges split and branches created. To facilitate this, there are two more specific node types.
The first is the StringBranchNode. This is created instead of the original node every time a StringNode is asked to add a string that has a different prefix. The current implementation of the StringBranchNode uses a map that branches using the first character (by default, CompactPatternBehaviour supplies a HashMap).
The second special node is the BranchNode. This implements the heart of the AIML matching algorithm: the ordered traversal of different pattern constituents. It is created every time a string is added to a WildcardNode or a wildcard to a StringNode. It doesn't match any characters from the input, but branches out to an underscore WildcardNode, an exact match node (either a StringNode or a StringBranchNode) and a star WildcardNode.
Adding patterns to PatternNodes is straightforward for the most part (either the pattern is the same, in which case we continue adding the rest of the pattern to the child node, or it is different, in which case we substitute a branch node). Algorithm 4.1 describes in more detail the process used to add a pattern to a StringNode.
Algorithm 4.1: Adding a pattern to an already existing StringNode
Input: A string pattern denoting an AIML pattern and an existing pattern tree with a StringNode at its root
Output: A new root of a pattern tree that matches the same patterns as the original pattern tree and the new pattern, and the leaf node for the new pattern
begin
if pattern is an empty string then
prepend a new EndOfStringNode to the tree
return the new tree
end
if pattern starts with a wildcard then
prepend a new BranchNode to the tree
add the pattern to the new tree
return the new root
end
calculate the longest common prefix of this node's string and the pattern
if the prefix is equal to this node's string then
remove the prefix from the pattern
add the rest of the pattern to this node's child node
return this root
else if there is no common prefix then
prepend a new StringBranchNode to the tree
add the pattern to the new tree
return the new root
else if there is a common prefix shorter than this node's string then
remove the prefix from the current string
prepend a new StringNode that matches the prefix
return the new root
end
end
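The prefix arithmetic at the heart of algorithm 4.1 can be sketched as follows (the empty-pattern and leading-wildcard cases are handled before this point; the names are my own, not the thesis classes):

```java
public class StringNodeAdd {
    // The three prefix cases of algorithm 4.1.
    enum Action { APPEND_TO_CHILD, STRING_BRANCH, SPLIT_NODE }

    static String commonPrefix(String a, String b) {
        int i = 0, n = Math.min(a.length(), b.length());
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return a.substring(0, i);
    }

    // Decide how a StringNode holding nodeString reacts to a new pattern:
    // pass the remainder to its child, branch on the first character, or
    // split itself at the shared prefix.
    static Action decide(String nodeString, String pattern) {
        String p = commonPrefix(nodeString, pattern);
        if (p.equals(nodeString)) return Action.APPEND_TO_CHILD;
        if (p.isEmpty()) return Action.STRING_BRANCH;
        return Action.SPLIT_NODE;
    }
}
```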
4.3.2 Naive pattern nodes
The simple AIMLPatternBehaviour uses a single node type for everything, implemented by the AIMLNode class. This class is word based and contains a single map for all branches (both wildcards and words).
Every single node contains a map, and there is no compaction of nodes that have only a single child. This, together with the fact that unspecified contexts must use a default pattern, results in a large number of nodes. Using a simple HashMap for each node regardless of the actual number of branches (which may very well be 0) quickly results in large memory consumption. Because of this, Program D uses special map wrappers. All wrappers have an initial capacity (ranging from zero to three) and defer creating an actual map instance until this initial capacity is exceeded.
By virtue of the GPL license under which both my implementation and Program D are published, I have taken these 4 classes (NonOptimalNodemaster, OneOptimalNodemaster, TwoOptimalNodemaster and ThreeOptimalNodemaster) from Program D and adapted them so they can be used as a map for any branch nodes.
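The idea behind these wrappers can be sketched with a one-entry variant (my own simplification, not the Program D code): the entry is kept in plain fields, and a real HashMap is allocated only once the inline capacity is exceeded.

```java
import java.util.HashMap;
import java.util.Map;

// A map that stores its first entry inline and defers allocating a HashMap
// until a second, different key is inserted.
class LazyMap<K, V> {
    private K key;               // inline storage for a single entry
    private V value;
    private Map<K, V> overflow;  // created only when capacity is exceeded

    V put(K k, V v) {
        if (overflow != null) return overflow.put(k, v);
        if (key == null || key.equals(k)) {
            V old = value;
            key = k;
            value = v;
            return old;
        }
        overflow = new HashMap<>();
        overflow.put(key, value);
        key = null;
        value = null;
        return overflow.put(k, v);
    }

    V get(K k) {
        if (overflow != null) return overflow.get(k);
        return k.equals(key) ? value : null;
    }
}
```

With most trie nodes having zero or one children, this avoids one HashMap allocation (and its backing array) per node in the common case.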
4.4 Classification
Classification refers to the process of simulating the NFA used to describe the set of pattern
sequences, finding a best match and binding values from the input to wildcards.
All important information during matching a single input sequence is maintained in the
MatchState class, which is passed around as a parameter during the recursive depth-first
search. Apart from encapsulating the inputs of different contexts, it also keeps track of
the current position in the input string of the currently processed context (the depth) and
maintains an array of wildcards. Wildcards are handled lazily: until the actual value of
a wildcard is requested (either during matching, or while evaluating the template), they only
store a pair of indices into the input string. This avoids costly string manipulation during
matching.
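The lazy wildcard representation can be sketched as follows; the class name and method names are illustrative and the real MatchState keeps an array of such bindings rather than individual objects.

```java
// During matching only two indices into the input are kept; the
// substring is cut out the first time the value is actually requested.
class LazyWildcard {
    private final String input;
    private final int start;
    private int end;                // half-open interval [start, end)
    private String value;           // materialized on demand

    LazyWildcard(String input, int start) {
        this.input = input;
        this.start = start;
        this.end = start;
    }

    void grow() {                   // the wildcard consumed one more input symbol
        end++;
        value = null;               // invalidate any previously cut-out value
    }

    String value() {                // the only place a substring is ever created
        if (value == null) value = input.substring(start, end);
        return value;
    }
}
```

During backtracking a wildcard may grow and shrink thousands of times; with this scheme each adjustment is a single integer update.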
Each context node first updates the match state by telling it that a new context has
been entered. After this, it tries to match itself to the current input (provided by the match
state). In the case of a pattern context, this means trying to match its pattern node subtree.
If the context node fails to match, it updates the match state again by telling it to leave
the current context, and passes matching to the next higher context present. This continues
until matching arrives at a LeafContextNode, in which case matching has been successful
and the match state is updated with the result, or there are no more contexts to try.
Each PatternNode has its own function. The compact representation tries to make each
node as simple as possible - each node is independent of the other nodes. The only interaction
occurs via the match state, where nodes update the depth accordingly.
4.4.1 Compact pattern node matching
Most compact pattern nodes are highly specialized, and after successfully matching, they
either pass control to a child pattern node, or fail. The exceptions to this rule are nodes that
can function as accepting states for a single pattern, and pass control to the next context
node.
If a StringNode matches successfully, it first checks if it has reached the end of the
input, in which case it passes matching to the next context. This avoids creating special
EndOfString nodes in cases where we know we are already at the end.
WildcardNodes also check if they have matched the rest of the input. In addition, if a
wildcard doesn't have any child pattern nodes, it automatically matches the rest of the input,
instead of incrementally growing the wildcard and passing the result to an EndOfString
node.
Normally, the end of a string is handled by one of the above node classes. The EndOfStringNode
is therefore used rarely, as a child of nodes that don't support matching the end of a string
(like branches), or for matching an empty pattern.
The following is a summary of compact pattern node matching behaviour.
BranchNode A BranchNode doesn't advance the depth at all. Instead, it passes control to
the pattern nodes stored in its underscore, string and star branches.
EndOfStringNode If the whole input has already been matched, the EndOfStringNode
succeeds with the next context. But in contrast to other nodes, if it fails to match the
end of a string, matching can still continue with child pattern nodes.
StringBranchNode A StringBranchNode takes a single character from the input string,
and looks whether it can find it in a map of child nodes. If it is successful, it advances
the depth by 1, and passes control to the child node found in the map.
StringNode This node tests a string to find out if it is a prefix of the remaining input. If
successful, it advances the depth by the length of the string and passes control to its
child pattern node or the next context node, depending on the rest of the input.
WildcardNode The WildcardNode is used for both kinds of wildcards. It starts by telling
the match state to create a new wildcard. It then grows the wildcard by 1 character,
updates the depth accordingly and passes control to its child pattern node or the
context node, depending on the depth. The wildcard is grown by 1 character until the
end of input is reached.
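The WildcardNode loop described above can be sketched in a simplified, stand-alone form. The names are mine, and the single child node is flattened to a literal string that must close out the input, which is a strong simplification of the real node structure.

```java
// Simplified wildcard matching: with no child the wildcard swallows the
// rest of the input; otherwise it grows one symbol at a time until the
// child literal fits exactly at the end of the input.
class WildcardMatch {
    // depth: current position in the input.
    // childLiteral: the string the single child must match, or null
    // when the wildcard has no child pattern node.
    static boolean match(String input, int depth, String childLiteral) {
        if (childLiteral == null)
            return depth < input.length();      // must consume at least one symbol
        for (int len = 1;
             depth + len + childLiteral.length() <= input.length(); len++) {
            if (input.startsWith(childLiteral, depth + len)
                    && depth + len + childLiteral.length() == input.length())
                return true;                    // child closes out the input
        }
        return false;
    }
}
```

The loop makes the backtracking cost visible: every failed child attempt grows the wildcard by one symbol and retries, which is exactly why the pruning and look-ahead optimizations discussed below are attractive.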
The most time-consuming part of matching is processing the wildcards. Apart from
optimizing trailing wildcards (which has been implemented), there are further optimization
possibilities. One is keeping track of the minimal remaining height of the subtree. This
would allow pruning the search in wildcard nodes as soon as the length of the remaining
input is shorter than the minimum remaining height of the trie (the height is counted in
terms of matched input symbols, not nodes).
A second possible optimization would be to employ look-ahead. Instead of blindly
growing the wildcard by 1 character and hoping for the best, a wildcard node could ask its
child if it matches a finite set of characters and advance accordingly. Because the AIML
specification requires word-based matching, almost all existing categories use wildcards
separated by spaces, and look-ahead could be tailored specifically for this case.
The reason these two optimizations haven't been implemented is that (unlike the trailing
wildcard optimization) they are global optimizations that require cooperation
between different node types and make them dependent on each other. They would also
complicate the node API, whose purpose is to be very simple and easily extensible.
4.4.2 Naive pattern node matching
In contrast to the multitude of compact pattern nodes, there is only a single AIMLNode.
While creating the trie is simple, matching is more complicated because everything must be
done in a single function, as shown in Algorithms 4.2 and 4.3.
Having all functionality in a single function means that the system isn't extensible by
simply adding a new node type and registering it in a pattern node factory. Instead, the
matching function has to be modified accordingly, and the class recompiled.
Implementing the trailing wildcard optimization is a necessity, because when using the
AIMLPatternBehaviour, every possible context has a pattern associated with it (if it isn't
specified explicitly, a wildcard is used).
The advantage of using a word-based approach is the speed of wildcard matching. Even
if inputs aren't preprocessed, skipping to the start of the next word is very quick; this is
something that the character-based implementation can't do easily.
Algorithm 4.2: Matching an AIMLNode
begin
if we are at the end of the input then
return the result of matching the next context
end
if the map contains the _ wildcard then
if matching the wildcard is successful then
return success
end
end
Get the next word from the input
if the map contains the word then
advance the depth by the size of the word
get the child node from the map
if matching the child node is successful then
return success
else
restore the original depth
end
end
if the map contains the * wildcard then
if matching the wildcard is successful then
return success
end
end
return failure
end
4.5 Interpretation
The template script is represented by an abstract syntax tree in memory. After the tree
has been created, it can be executed again and again without having to repeatedly parse
the template. While this may seem like a non-issue, several implementations (like Program
D, Program O or Program P) store the template code as a string and interpret it during
parsing.
Evaluation of Script nodes is very straightforward - each script node implements an
evaluate() method that has a MatchState as its parameter, through which it can get
information about the whole system. A few basic syntactic structures can be identified in
the template syntax, and the nodes correspond to these structures.
Empty elements. These elements never have any content and their result is defined solely
by their name and attributes. All of them subclass the EmptyElement class. Examples
Algorithm 4.3: Matching a wildcard in an AIMLNode
begin
Add a wildcard binding to the match state
Get the child node representing this wildcard
if child node has no children then
grow the wildcard to the rest of the input
else
add the first word to the wildcard
end
update the depth
if matching the child node is successful then
return success
end
while there is still some input left do
add the next word to the wildcard
update the depth
if matching the child node is successful then
return success
end
end
remove the wildcard binding
restore original depth
return failure
end
are the BotElement which returns a constant defined for a bot, and StarElement which
is used to retrieve the values bound to wildcards.
Simple elements. All of these SimpleElements can have any script as content. Many of
them first evaluate the inner script, and process the result afterwards. Examples of
these are the SetElement used to set variables to a value, or the ThinkElement used
to evaluate a script but suppress its output.
Other simple elements first evaluate a condition and return the evaluated contents
depending on the result. The If class checks whether the value of a variable is equal
to a string. The GetElement tries to return the value of a variable, but if that variable
is not set, evaluates and returns the inner script.
Complex elements. Complex elements don't follow a uniform syntax. They contain elements
with their own semantics (for example, case statements in the Switch node or random
alternatives in the RandomElement).
Text. Instead of providing a dedicated output element like <say>, AIML uses mixed content.
Any characters that are not part of element markup are automatically output. This is
handled by the TextElement class.
In addition to nodes that have a direct correspondence to XML elements, there are two
helper node classes. The first one is the Block, which is used to group other script nodes and
evaluate them sequentially. There is no explicit block syntax in AIML; instead, any non-
empty element denotes a block of mixed character and element content by virtue of its start
and end tags.
The second is the EmptyScript, which does nothing. An empty script is an optimization
that allows collapsing parts of the AST that have no effect on the output and no side effects.
For example, raw text inside a think element, empty condition branches or empty transformation
elements (trying to uppercase an empty string is pointless).
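The evaluate() scheme for script nodes can be sketched as follows. This is a hedged reconstruction: MatchState is stubbed out, and while the class names are modelled on the thesis (ScriptNode, TextElement, ThinkElement, Block), their actual signatures may differ.

```java
import java.util.Arrays;
import java.util.List;

// Stub standing in for the real match state (wildcard bindings, inputs,
// variables, ...).
class MatchState { }

// Every script node evaluates itself against the match state and
// returns its textual output.
interface ScriptNode {
    String evaluate(MatchState state);
}

class TextElement implements ScriptNode {       // mixed character content
    private final String text;
    TextElement(String text) { this.text = text; }
    public String evaluate(MatchState state) { return text; }
}

class ThinkElement implements ScriptNode {      // evaluate but suppress output
    private final ScriptNode inner;
    ThinkElement(ScriptNode inner) { this.inner = inner; }
    public String evaluate(MatchState state) {
        inner.evaluate(state);                  // run for side effects only
        return "";
    }
}

class Block implements ScriptNode {             // helper: sequential evaluation
    private final List<ScriptNode> children;
    Block(ScriptNode... children) { this.children = Arrays.asList(children); }
    public String evaluate(MatchState state) {
        StringBuilder out = new StringBuilder();
        for (ScriptNode child : children) out.append(child.evaluate(state));
        return out.toString();
    }
}
```

Because the AST is built once and evaluated many times, each response costs only a tree walk, with no re-parsing of the template.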
Chapter 5
Testing
In this chapter, I will evaluate the matching algorithm.
5.1 Shadowed categories
The benchmarks are based on a test for shadowed categories. A shadowed category is a
category identified by a sequence of patterns that can never be matched, because there is
always a pattern with higher priority.
Example 5.1.1. A few example sequences that overshadow other sequences. Any input
that matches the second pattern will also match the first pattern, but the first pattern will
always have a higher priority.
[input]* ELIZA                [input]ARE YOU A ELIZA
[input]* YOU KNOW *           [input]DO YOU KNOW ANY OTHER SONGS
[input]IS HE * TOO            [input]IS HE YOUR FATHER TOO
[input]YOU * A LOT            [input]YOU ARE ASKING A LOT
[input]* [that]* NATION *     [input]I DO NOT KNOW [that]* NATION ON EARTH
Testing for overshadowed categories is built on the simple premise that there has to exist
at least one input that is matched to its pattern sequence. For patterns that don't contain
wildcards, this is trivial: the input is the same as the pattern.
Wildcards match strings of one or more symbols from the input alphabet. To construct
an input that matches such a wildcard, we only need to substitute all wildcards with strings
that match them (but no other pattern). To prevent accidentally creating an input that
matches a different pattern, these strings should not contain symbols from the input alphabet.
For exact results, we need to keep in mind that an input that matches strings made from
wildcards also matches a single wildcard. In this case shorter matches are preferred (see Ex.
3.2.4). The easiest way to prevent this from happening is to always create a minimal input
that matches a certain pattern, so that each wildcard matches only a single symbol.
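The minimal-input construction can be sketched directly: every wildcard is replaced by a single placeholder symbol from outside the input alphabet. The helper name and the particular placeholder character are my assumptions, not the thesis code.

```java
// Substitute each wildcard in a pattern with one out-of-alphabet
// placeholder symbol, so the generated input matches its own pattern
// and each wildcard binds exactly one symbol.
class ShadowTest {
    static final char PLACEHOLDER = '\u0001';   // not in the input alphabet

    static String minimalInput(String pattern) {
        StringBuilder out = new StringBuilder();
        for (char c : pattern.toCharArray())
            out.append(c == '*' || c == '_' ? PLACEHOLDER : c);
        return out.toString();
    }
}
```

Feeding such an input to the matcher and checking which category wins is then enough to detect whether the originating pattern sequence is shadowed.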
On the other hand, matching wildcards involves a lot of backtracking and probably
constitutes the bottleneck of the system: the more symbols a wildcard has to match,
the more work the algorithm has to do. For this reason, I have chosen to create longer than
minimal inputs. While such an approach may result in false positives for patterns that differ
only in the number of wildcards, I believe it more accurately reflects real world performance.
5.2 Experiments
The first experiment was designed to show memory consumption. The interpreter was tasked
with loading the complete AAA AIML set (including templates) into memory. The AAA set
represents a very typical AIML bot, and it can be assumed that many botmasters use the
AAA pattern set as a base for developing their own bots.
For the purpose of these experiments, I have implemented a naive version of the
matching algorithm that uses a single heavyweight node implementation to store branches
and perform matching. This version of the algorithm stays true to the specification in several
respects:
Matching is performed on a word basis.
Contexts that are not explicitly defined use a single *-wildcard as a default pattern.
The Java programming language offers two basic implementations of maps to store key-
value pairs. The HashMap class uses a chained hash table which guarantees a near O(1) access
time, but has a large overhead caused by allocating a number of entry slots in advance. Given
a load factor of k, the overhead is always between 1/k and 2/k, and the minimum number
of slots is 16. The other implementation is the TreeMap, backed by a red-black tree. It
doesn't have such a large overhead, but has an O(log n) access time. A third implementation,
LinkedHashMap, enhances the HashMap with a linked list of entries that allows iteration in
order of insertion.
Because of the large overhead of hash maps, Program D implements four variants of
optimized nodes, called Nodemappers. Each node has its own capacity n = 0 . . . 3 and
allocates a LinkedHashMap after this capacity is exceeded. I wanted to make a comparison
with the approach used in Program D possible, so I took the node mapper classes from Program
D and converted them into wrappers for a plain map. This made it possible to switch map
implementations without having to create eight different node implementations.
Unfortunately, these wrapper classes all have their own memory footprint (which is not
simple to estimate, because Java doesn't define the size of an object reference and doesn't
provide a sizeof operator). To compensate for this, I implemented a set of dummy nodes
that could be used for estimating the overhead of the wrapper classes.
For evaluating matching performance, I first extracted the patterns from the AAA AIML
set and stored them separately. I then created two additional sets. One is a subset of the
original set and contains only patterns without wildcards. The other is the original set of
patterns where all wildcards have been substituted by a single word.
Based on these pattern sets, I created a random set of inputs, where each wildcard was
randomly transformed into a string of words of different length. The number of words and
characters is based on a normal distribution N(1, 5²). Unspecified contexts were treated the
same as wildcards, and a random input was generated for them.
5.3 Results
The structure of the matching tree depends on the implementation of nodes, but not on the
internal implementation of the map. Table 5.1 shows a comparison of a naive word-based
tree and my own compact character-based automaton. The number of categories is the
number of loaded pattern sequences. The number of self-loops corresponds to the number of
wildcards in all patterns. The Maps column contains the number of nodes that internally
use a generic map, which is used to calculate the overhead of the wrappers.
Algorithm Categories Nodes Maps Loops
Naive 45234 381631 247149 108728
Compact 45234 133144 14710 19787
Table 5.1: Comparison of trees created from the AAA set
The compact representation is clearly better - it uses about a third of the nodes the naive
implementation uses, even though it is character-based. The large number of self-loops in
the naive implementation can be explained by the fact that most categories only define the
[input] context, but the naive implementation has to use a default *-wildcard pattern for
the two remaining contexts, [that] and [topic].
Table 5.3 summarizes the memory requirements of the automata with regard to the
underlying map implementation. The Key column provides information about the type
of keys used for the map. The implementation of wrappers is a little limited, because it
allows only String keys. The Overhead column specifies the number of branches that have
space allocated for them, but are unused (for example, the size of unused buckets in the hash
table, or the own capacity of the wrappers after they start using the map). The MiB column
shows how many MiB the runtime reported as in use, with corrections applied for using
the wrapper classes (Table 5.2). For each map implementation, the results are ordered by
decreasing memory consumption of the naive implementation. The # column shows which
wrapper is the best for a particular implementation, and its overall rank.
Capacity     0     1     2     3
Overhead   24 B  40 B  48 B  56 B
Table 5.2: Overhead of the wrapper level per node
The results clearly show the advantages of a compact implementation. By only storing
branches when necessary, memory consumption stays more or less the same, regardless of
the underlying map or wrapper. The theoretical overhead doesn't seem to be correlated with
actual memory usage.
For the naive implementation, the situation looks different. Automatically allocating a
map for every node (as is the case without using a wrapper) has a severe negative effect on
memory consumption. Using a specialised implementation that allocates a map only after
its initial capacity is exceeded has clear benefits: about 46-61%, depending on the underlying
map implementation. What comes as a surprise is the fact that (for the AAA set), the
Map         Key     Wrap   Naive overhead  MiB    #    Compact overhead  MiB    #
LinkedHash  Char    none   N/A             N/A         199570            26.4
            String  none   3839491         80.9        199570            28.1
            String  0      1781939         52.8        199570            26.8
            String  2      661699          32.8        97380             25.3
            String  1      485815          31.6        214280            26.9
            String  3      894463          31.4   5.   74830             24.0   2.
Hash        Char    none   N/A             N/A         198562            25.6
            String  none   3838963         71.2        198562            28.1
            String  0      1781411         45.5        198562            26.0
            String  3      893935          33.7        73822             24.6   3.
            String  2      661171          31.6        96372             24.7
            String  1      485287          31.4   2.   213272            26.3
Tree        Char    none   N/A             N/A         5892              26.0
            String  none   0               54.8        5892              27.6
            String  0      0               38.5        5892              24.9
            String  3      863375          31.2        36280             25.1
            String  2      618144          31.1        26538             23.9   1.
            String  1      375746          29.5   1.   20602             25.8
Table 5.3: Memory requirements depending on the used map (in MiB)
wrapper's own capacity doesn't make much difference, provided it is at least 1. The positive
benefits of not allocating a map seem to be offset by the larger overhead of maintaining a
small set of own slots. It is possible to try to determine the optimal capacity globally, but
when using dedicated branch nodes, the results are always better.
The three sets of patterns and inputs used for testing the speed of matching are summarized
in Table 5.4. This table also contains the total number of nodes that were traversed by
each implementation (either by entering a node normally, or via a self-loop).
Set            Patterns   Words     Characters   Nodes traversed
                                                 Naive      Compact
aaa            45234      559728    3642658      3552361    4641103
aaa-wc-subst   45234      449367    3166426      839824     772761
aaa-no-wc      16837      160615    1078038      297933     269371
Table 5.4: Properties of pattern sets used for benchmarking
Using different map implementations had only a marginal impact on matching performance.
A much more important aspect proved to be pattern normalization, as shown
in Table 5.5 (times are in seconds).
The original AIML specification only requires matching to be case-insensitive. But for
many languages (including Czech), it is often beneficial to match without diacritical marks
as well. For this to work, a string is first converted to Unicode NFD, where diacritical marks
are represented by separate code points. The diacritical marks are then removed, and the
string is converted to uppercase and back to NFC.
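The normalization pipeline just described can be expressed with java.text.Normalizer; this is a sketch of the general technique, not necessarily the exact code used in the implementation.

```java
import java.text.Normalizer;

// Case- and diacritic-insensitive normalization: decompose to NFD,
// strip the combining marks, uppercase, and recompose to NFC.
class DiacriticNormalizer {
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");  // drop marks
        return Normalizer.normalize(stripped.toUpperCase(), Normalizer.Form.NFC);
    }
}
```

Each call performs a regex pass and two normalization passes over the string, which makes it obvious why repeating this per node during backtracking is so expensive.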
Removing diacritics is a complicated and lengthy process, and in the original implementation
each node performed its own normalization. When backtracking, this meant that each
character from the input would be normalized many, many times. One possibility was
normalizing all inputs before matching a particular pattern started. Another was normalizing
on demand, the first time a node requested the input.
From the results it is clear that input normalization has dramatic effects on the time
spent matching. Without preprocessing, matching took over three minutes. Preprocessing
allowed the compact automaton to match in a time that is on par with the much simpler
word-based implementation, despite being character-based (which means about 6× more
input symbols).
It is interesting to note that for the naive implementation, processing time for
pattern sets without wildcards actually went up. Without any preprocessing, the heavyweight
node implementation normalized only the first word of each input for don't-care patterns,
and after failing to match it, it was able to skip the rest of the input because of the trailing
wildcard optimization. With preprocessing, the whole input is always normalized (and
because uninteresting inputs can't be skipped, all inputs will be normalized, so lazy normalization
doesn't have any positive effect).
Set            Normalization   Time (s)
                               Naive    Compact
aaa            every node      16.63    202.27
               up-front        7.58     7.82
               lazy            7.68     7.13
aaa-subst-wc   every node      3.83     8.35
               up-front        6.23     6.05
               lazy            6.90     2.38
aaa-no-wc      every node      1.17     2.67
               up-front        2.27     2.60
               lazy            2.32     1.07
Table 5.5: Matching speed (in seconds)
A sophisticated strategy that combines normalizing parts of the input on-demand with a
cache and a mapping between the original and the normalized input could probably provide
further speed-ups, but the author feels that it is outside the scope of this thesis.
Chapter 6
Conclusion
The main goal of this thesis was to implement an interpreter of the AIML programming
language with character-based pattern matching using a compact trie with sparse pattern
sequences, and to compare it to the prevalent approach of using an ordinary word-based trie
with fixed pattern sequences.
First, we gave a brief overview of the AIML language and its specification, and of the role
of pattern matching in the dispatch of functions and methods.
We have then examined the problem of AIML pattern matching with wildcards. We
have shown how this optimization problem can be defined and how it relates to other
pattern matching problems. In order to solve this optimization problem in a way that
complies with the AIML specification for pattern matching, we have applied principles from
automata theory, and created a non-deterministic Mealy automaton for matching finite sets
of sequences of patterns.
The output of the Mealy automaton assigns each pattern sequence a value which can
be used to pick the best match. By examining the properties of these outputs, we derived
a set of rules that make it possible to extend the patterns with more wildcards, but we also
showed the limits of this extension. We then described the principles by which pattern sequences
can easily be abstracted and compacted, while still retaining the same pattern ordering.
The fourth chapter describes our implementation of an AIML interpreter. Insight gained
from the above analysis allowed us to implement a flexible and extensible system for matching
sparse sequences of abstract patterns. Two implementations for concrete AIML patterns were
created. One is compact, uses several highly specialized node types and performs character-
based matching. The other uses a simpler, word-based approach for comparison. With
both of these implementations using the same framework and subject to the same overhead,
it was possible to provide a direct comparison of their relative performance.
We have shown with experiments involving the Annotated A.L.I.C.E. AIML set that
a character-based approach is indeed viable. The resulting implementation consumes less
memory and, given the right amount of preprocessing, is able to outperform a word-based
implementation.
What the experiments have also shown is that the system is very susceptible to the
amount of preprocessing needed to ensure case- and diacritic-insensitive matching. The
effects of badly implemented preprocessing far outweigh any positive speed gains resulting
from a smaller and more compact representation.
6.1 Further research
Often, pattern matching problems are defined using simple and relatively small alphabets of
symbols. In real world applications, programs often need to work with Unicode, which
is a quite large and somewhat complex alphabet. Implementing case-insensitive string
comparison using the ASCII alphabet is trivial and fast, but implementing similar insensitive
matching when dealing with Unicode can quickly ruin the performance of the most carefully
designed system (not to mention the fact that it is locale dependent). The Unicode standard
anticipates the need for different levels of insensitivity by providing different collation levels
in its Unicode Collation Algorithm [uca]. Unfortunately for AIML, the UCA is not guaranteed
to be reversible, which means that the possibility of properly binding wildcards must be
carefully evaluated.
With regard to the data structures used, it would be interesting to see parts of the trie
represented by a PATRICIA index, instead of a combination of string nodes, branch nodes
and HashMaps. Contrary to the popular belief which describes PATRICIA tries as compact tries,
with edges labelled with sequences of characters rather than with single characters [Wik09],
true PATRICIA indexes as described in [Mor68] actually employ a lossy form of compression,
store only offsets in edges, and are able to retrieve values based on prefixes of keys.
6.1.1 Pattern to pattern matching
From the practical standpoint of an AIML author, it would be advantageous to be able to
search the set of patterns using patterns. Given a pattern P, pattern to pattern matching
is the problem of finding the set of all patterns Q such that the languages matched by P and
Q have a non-empty intersection.
6.1.2 Visualisation of AIML sets
One large problem of AIML is the sheer volume of categories. With 40000+ categories per
bot, it is very hard to keep track of everything and keep it consistent. Displaying the set
of categories as a tree (similar to the one used for matching) is certainly a possibility, but
this groups patterns only by a common prefix. Much more interesting would be to use the
metrics defined by pattern order and display the patterns in a map (for example, using a
statistically significant set of actual user inputs).
Bibliography
[aim05] Artificial Intelligence Markup Language - language specification (working draft).
http://www.alicebot.org/TR/2005/WD-aiml/ Accessed May 20, 2009, 2005.
[BMS80] R. M. Burstall, D. B. MacQueen, and D. T. Sannella. Hope: An experimental
applicative language. In LFP '80: Proceedings of the 1980 ACM conference on
LISP and functional programming, pages 136-143, New York, NY, USA, 1980.
ACM. http://homepages.inf.ed.ac.uk/dts/pub/hope.pdf Accessed May 20,
2009.
[Bus06] N. Bush. Program D Release Notes. http://files.aitools.org/programd/
docs/release-notes.html Accessed May 20, 2009, Mar. 2006.
[Bus08] N. Bush. Program D on Launchpad (Message to mailinglist). http://www.nabble.
com/Program-D-on-Launchpad-td21115511.html Accessed May 20, 2009, Dec.
2008.
[c207] Portland Pattern Repository's wiki: Pattern Matching. http://c2.com/cgi/
wiki?PatternMatching Accessed May 20, 2009, May 2007.
[erl] Erlang Reference Manual: Pattern Matching. http://erlang.org/doc/
reference_manual/patterns.html Accessed May 20, 2009.
[Fre60] E. Fredkin. Trie memory. Commun. ACM, 3(9):490-499, 1960.
[fsh] The F# 1.9.6 Draft Language Specication - Pattern Matching Expressions and
Functions. http://research.microsoft.com/en-us/um/cambridge/projects/
fsharp/manual/spec2.aspx#_Toc207785630 Accessed May 20, 2009.
[hkl] A Gentle Introduction to Haskell: Patterns. http://www.haskell.org/
tutorial/patterns.html Accessed May 20, 2009.
[Hod99] J. Hodgson. Project Contraintes Prolog Web Pages: Unification.
http://pauillac.inria.fr/~deransar/prolog/unification.html Accessed May 20,
2009, Jan. 1999.
[jsr03] JSR 173: Streaming API for XML. http://jcp.org/en/jsr/detail?id=173
Accessed May 20, 2009, 2003.
[Lau00] V. Laurikari. NFAs with Tagged Transitions, their Conversion to Deterministic
Automata and Application to Regular Expressions. String Processing and
Information Retrieval, International Symposium on, 0:181, 2000.
[loe] Home Page of The Loebner Prize in Artificial Intelligence. http://www.loebner.
net/Prizef/loebner-prize.html Accessed May 20, 2009.
[McB70] F. McBride. Computer aided manipulation of symbols. PhD thesis, Queen's
University of Belfast, 1970.
[Mel03] B. Melichar. Jazyky a překlady. ČVUT, 2nd edition, 2003. In Czech.
[MH97] B. Melichar and J. Holub. 6D Classification of Pattern Matching Problems. In
J. Holub, editor, Proceedings of the Prague Stringology Club Workshop '97, pages
24-32. Czech Technical University in Prague, Prague, July 1997.
[MHP05] B. Melichar, J. Holub, and T. Polcar. Text searching algorithms. http://www.
stringology.org/athens/ Accessed May 20, 2009, Nov. 2005.
[MM04] C. McBride and J. McKinna. The view from the left. J. Funct. Program.,
14(1):69-111, 2004.
[Mor68] D. R. Morrison. PATRICIA - Practical Algorithm To Retrieve Information Coded
in Alphanumeric. J. ACM, 15(4):514-534, 1968.
[MZ97] V. Mařík and Z. Zdráhal. Expertní systémy. In V. Mařík, O. Štěpánková, and
J. Lazanský, editors, Umělá inteligence (2), chapter 1, pages 15-74. Academia,
Praha, Czech Republic, 1997.
[Ode09] M. Odersky. The Scala Language Specication 2.7. http://www.scala-lang.
org/docu/files/ScalaReference.pdf Retrieved on May 20, 2009, Mar. 2009.
[pan03] Pandorabots - A Common Lisp-based Software Robot Hosting System. http://
www.pandorabots.com/pandora/pics/pandorabotsinjapan.ppt Accessed May
20, 2009, May 2003.
[Roe04] J. Roewen. Re: [alicebot-developer] context tag (message to mailing list).
http://list.alicebot.org/pipermail/alicebot-developer/2004-April/
001767.html Accessed May 20, 2009, Apr. 2004.
[sxp] Package javax.xml.stream. http://java.sun.com/javase/6/docs/api/javax/
xml/stream/package-summary.html Accessed May 20, 2009.
[Tur50] A. M. Turing. Computing machinery and intelligence. MIND, 59:433-460, Oct.
1950.
[uca] Unicode Collation Algorithm. http://unicode.org/reports/tr10/ Accessed May
20, 2009.
[Wal] R. S. Wallace. Pandorabots Embrace and Extend. http://www.alicebot.org/
Embrace.html Accessed May 20, 2009.
[Wei66] J. Weizenbaum. ELIZA - a computer program for the study of natural language
communication between man and machine. Commun. ACM, 9(1):36-45, 1966.
[Wik09] Wikipedia. Radix tree - Wikipedia, the free encyclopedia. http:
//en.wikipedia.org/w/index.php?title=Radix_tree&oldid=288105297
Accessed May 21, 2009, 2009.
[xpp06] XML Pull Parsing. http://www.xmlpull.org/, 2006.
Appendix A
Category markup language syntax
This section describes the markup used to describe categories. It uses a simple notation,
where + means one or more occurrences, * means zero or more occurrences, [elements]
means zero or more elements and | denotes alternation between two elements.
aiml Attributes: version; Contents: [category | topic | contextgroup]; The root element of
every AIML file. The attribute version specifies the version or dialect of AIML used
in this file. Files that do not adhere to the standard specification should provide their
own version identifier.
category Contents: [pattern] [that] context template; Groups a template and contexts
that apply only to this template.
context Attributes: name; Content: mixed; Specifies a context for the current category or
context group.
contextgroup Content: context+ category+; The contextgroup element provides a way to
group a set of categories using a common context. This is a generalized version of the
topic element.
pattern Content: pattern; Specifies the pattern for the input context inside a category.
template Content: script; Contains the response template.
that Content: pattern; Specifies the pattern for the that context inside a category.
topic Attributes: name; Content: [category | contextgroup]; A simple way to specify a
contextgroup with the topic context.
Appendix B
Template markup language syntax
This section provides a simple reference to all implemented template markup tags. The
notation used is the same as in Appendix A.
bot Attributes: name; Content: Empty; Return the bot property specified in the name
attribute.
condition Three different types of conditions (if, if-elif-else, switch/case) are differentiated
by the use of different attributes and element content.
date Content: Empty; Return the current date.
formal Content: mixed; Evaluate the contents and return a string with each word capitalized.
gender Content: mixed; Evaluate the contents and apply gender substitution. Same as
subst with name set to gender.
get Attributes: name; Content: mixed; If the variable specified in the name attribute is
set, return its contents. Otherwise return the result of evaluating the contents.
id Attributes: none; Content: Empty; Return the current user ID.
input Return the original user input.
lowercase Content: mixed; Evaluate the contents and return a string with all letters
lowercase.
person Content: mixed; Evaluate the contents and convert from first to third person (and
vice versa). Same as subst with name set to person.
person2 Content: mixed; Evaluate the contents and convert from first to second person
(and vice versa). Same as subst with name set to person2.
random Content: list; Randomly evaluate and return the contents of one list item.
sentence Content: mixed; Evaluate the contents and return a string with the first letter of
each sentence capitalized.
set Attributes: name; Content: mixed; Evaluate the content and set the value of the variable
specified in the name attribute to the result. If the contents are empty or evaluate to
an empty string, unset the variable.
size Attributes: none; Content: none; Return the number of known categories for the
current bot.
sr Attributes: none; Content: Empty; Perform a call to the classifier with the contents of
the first wildcard from the input context used as the value for the input context. Same
as <srai><star/></srai>.
srai Attributes: none; Content: mixed; Evaluate the contents, use the result as a new value
for the input context and perform classification.
star Attributes: [context], [index]; Contents: Empty; Return the value that is currently
bound to a wildcard. The attribute context specifies which context, and index specifies
the number of the wildcard (one-based). If not specified, context defaults to input
and index defaults to 1.
subst Attributes: name; Contents: mixed; Evaluate the contents and apply the substitution
list specified in the name attribute. See also gender, person and person2 elements.
that Attributes: [index]; Access sentences from previous responses of the bot, in reverse
order of history. The attribute index is 1-based, where 1 represents the bot response
before the current user input. Optionally, a second index may be provided, where 1
represents the last sentence, 2 the second-to-last sentence. If not specified, the index
attribute defaults to 1,1.
thatstar Attributes: [index]; Contents: Empty; Access wildcards bound to the that context.
Index defaults to 1. Same as supplying that as a parameter for the context attribute
of the star element.
think Contents: mixed; Evaluate the contents, but return an empty string.
topicstar Attributes: [index]; Contents: Empty; Access wildcards bound to the topic
context. Index defaults to 1. Same as supplying topic as a parameter for the context
attribute of the star element.
uppercase Content: mixed; Evaluate the contents and return a string with all letters
uppercase.
version Content: Empty; Returns a string identifying the current version of the interpreter.
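To illustrate how mixed template content is evaluated, here is a hypothetical Python sketch covering a small subset of the tags above (uppercase, lowercase, formal, random). It is a toy model, not the interpreter's actual implementation: a real evaluator would dispatch on all tags and thread interpreter state through the calls.

```python
import random
import xml.etree.ElementTree as ET

def evaluate(elem):
    """Recursively evaluate mixed content: text plus a few tags."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "uppercase":
            parts.append(evaluate(child).upper())
        elif child.tag == "lowercase":
            parts.append(evaluate(child).lower())
        elif child.tag == "formal":
            # Capitalize each word, as described for the formal tag.
            parts.append(evaluate(child).title())
        elif child.tag == "random":
            # Randomly evaluate one list item, as described for random.
            parts.append(evaluate(random.choice(child.findall("li"))))
        # Text following the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

tmpl = ET.fromstring(
    "<template>Hello, <formal>john doe</formal>! "
    "<uppercase>welcome</uppercase></template>")
print(evaluate(tmpl))  # Hello, John Doe! WELCOME
```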
Appendix C
A list of abbreviations
AAA Annotated A.L.I.C.E. AIML set
AI Artificial Intelligence
AIM AOL Instant Messenger
AIML Artificial Intelligence Markup Language
AJAX Asynchronous Javascript And XML
API Application Programming Interface
ASCII American Standard Code for Information Interchange
AST Abstract Syntax Tree
DOS Microsoft Disk Operating System
GPL General Public License
GUI Graphical User Interface
HTML Hypertext Markup Language
IRC Internet Relay Chat
J2EE Java 2 Enterprise Edition
JDK Java Software Development Kit
MiB Mebibyte (1 MiB = 2^20 B)
NFA Nondeterministic Finite Automaton
NFC Unicode Normalization Form C (Canonical Decomposition, followed by Canonical
Composition)
NFD Unicode Normalization Form D (Canonical Decomposition)
PHP The PHP programming language
UCA Unicode Collation Algorithm
XML eXtensible Markup Language
Appendix D
Contents of the CD
/
|-- interpreter/ The directory containing the implementation
| |
| |-- aiml/ AIML Files
| | |-- aaa/ The AAA set
| | |-- cloze/ Implementation of a random cloze using standard AIML
| | |-- cz/ Simple Czech bot
| | |-- example/ Simple English bot
| | -- utils/ Miscellaneous AIML utility classes
| |
| |-- classes/ Compiled binaries
| |-- doc/ Generated javadoc
| |-- experiments/ Experimental data
| |-- lib/ Third party libraries
| |-- src/ Source files
| |-- test/ Source files for unit tests
| |-- tests/ data for unit tests
| |
| |-- aaa.xml startup file for the AAA set
| |-- bot.xml startup file for the simple English bot
| |-- czbot.xml startup file for the simple Czech bot
| |-- sentencerules.txt sentence splitting rules
| |-- start.bat batch file to run the interpreter
| -- test.xml
|
|-- text/ The diploma thesis
| |-- src/ Source files for the thesis
| | |-- graph/ Source files for automatically generated graphs
| -- aimlthesis.pdf The printable thesis
-- readme.txt Basic information