
N-gram Markov Chain Epistemological Text Generation with Weighted Outputs


[Name]
[College]
[Email]@[College].edu

Abstract
N-gram Markov chains are stochastic processes in which the probability of the next step, given the entire history of the process thus far, is equal to the probability of the next step given only the last N steps.
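Formally, for a token sequence X_1, X_2, ..., this property reads

P(X_{t+1} = x | X_1, ..., X_t) = P(X_{t+1} = x | X_{t-N+1}, ..., X_t)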
English grammar can be modeled by stochastic processes, as shown by Shannon [1]. However, proper text generation requires a diverse source of information from which a Markov chain can generate sentences. While web crawlers exist that can access such diverse sources of information for cataloging, exhaustively logging every word ever said is impossible, so methods are needed for discrete, small sources of information.
Tri-gram Markov chains create more human-sounding sentences than uni-gram Markov chains. However, in a small data set, naturally occurring tri-grams are rare except in specific combinations. Consequently, exclusively choosing tri-gram Markov chains often results in simply repeating the entire source material rather than generating unique sentences. While uni-gram Markov chains can produce a more diverse range of sentences, they are often nonsensical. A proper selection mechanism is therefore needed for optimal Markov chain selection.

Stochatta, a web app which does exactly this, generates uni-gram, bi-gram, and tri-gram Markov chains and compares the rate of occurrence of each chain. It values higher-order Markov chains more highly than lower-order chains, but still considers lower-order chains with higher incidence more statistically relevant than higher-order chains with lower incidence. The incidence of each Markov chain from the source material is compared on logarithmic, weighted scales, meaning that uni-gram Markov chains, whose incidences can be as large as 10x that of the largest tri-gram, are valued proportionally against tri-grams, since the magnitude of the weighing functions becomes less significant at higher incidences.
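As a worked illustration, assuming the natural-logarithm weighing and the uni-gram divisor of 2 described under Markov Selection and Weighing below:

ln(20) ≈ 3.00 (tri-gram with incidence 20)
ln(200) / 2 ≈ 2.65 (uni-gram with incidence 200, 10x the tri-gram)

so the rarer tri-gram is still preferred, even against a ten-fold incidence advantage.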
The goal, then, is to create a text-generation tool whose output can pass for human-created quotations without directly copying actual human-created quotations from the source material. A test is given by the website which asks users to choose which selections are from a human source and which are generated by Stochatta. Ideally, the average accuracy of a human unassisted by a search engine should be 0.5, no better than chance.

Lexicography and Source Materials


The lexicography is handled by a PHP script which separates tokens into N-gram Markov chains. Certain types of punctuation are considered tokens in themselves, while punctuation which requires an opening and a closing (such as parentheticals or quotations) is ignored, since Stochatta cannot guarantee a closing punctuation mark later in the process. The lexicographical regular expression is, then,
/[A-Za-z-]+|\,|\.|\!|\;|\:|\?|\`|\~|\&/

which also matches words containing hyphens.
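A minimal sketch of this tokenization step in PHP (the function name is illustrative, not taken from the Stochatta source):

<?php
// Split source text into tokens: words (including hyphenated words)
// and the standalone punctuation marks listed in the expression above.
function tokenize(string $text): array
{
    preg_match_all('/[A-Za-z-]+|\,|\.|\!|\;|\:|\?|\`|\~|\&/', $text, $matches);
    return $matches[0];
}

// Parenthesized material loses its parentheses but keeps its words:
print_r(tokenize('Knowledge (a priori, at least) begins with experience.'));
// Knowledge, a, priori, ",", at, least, begins, with, experience, "."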


A queue ADT enqueues each word encountered and dequeues the oldest word when the length of the queue exceeds four members. When the queue is at four members, three new Markov chains are inserted into a MySQL database. Tri-gram chains include all four words, with the newest word being the predicted next word in the process given the previous three words in the queue. Bi-gram and uni-gram chains ignore the oldest word and the two oldest words in the queue, respectively, when inserting into the database. If a chain to be inserted already matches one in the database, the field which indicates incidence is simply incremented and no new entry is created.
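A sketch of this sliding-window insertion step, assuming a MySQL table with a unique key over (n, prefix, next_word); the table and column names are illustrative assumptions, as are the PDO connection $db and the $sourceText string:

<?php
// $queue holds at most four consecutive tokens; the newest token is
// the predicted next word for all three chain orders.
function insertChains(PDO $db, array $queue): void
{
    $next = $queue[3];
    $stmt = $db->prepare(
        'INSERT INTO chains (n, prefix, next_word, incidence)
         VALUES (?, ?, ?, 1)
         ON DUPLICATE KEY UPDATE incidence = incidence + 1'
    );
    $stmt->execute([3, implode(' ', array_slice($queue, 0, 3)), $next]); // tri-gram
    $stmt->execute([2, implode(' ', array_slice($queue, 1, 2)), $next]); // bi-gram: drop oldest
    $stmt->execute([1, $queue[2], $next]);                               // uni-gram: drop two oldest
}

$queue = [];
foreach (tokenize($sourceText) as $token) {
    $queue[] = $token;
    if (count($queue) > 4) {
        array_shift($queue); // dequeue the oldest token
    }
    if (count($queue) === 4) {
        insertChains($db, $queue);
    }
}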
Source material was chosen to be particularly vague. Epistemology books by the philosophers Immanuel Kant, Arthur Schopenhauer, and G. W. F. Hegel were chosen for their particularly abstract concepts. Because philosophy texts require several sentences or even paragraphs of context for a reader to derive meaning, it is easier to generate human-sounding philosophy than human-sounding storytelling. Epistemology, the philosophical study of human conceptions and knowledge, is particularly vague. This choice was inspired by the Sokal Affair, in which a physics professor submitted a nonsensical essay to a respected cultural-studies journal to point out the poor editing choices made by the journal as well as the nonsensical direction that post-modern philosophy has taken in the academic world [2]. Part of the goal of Stochatta, then, is to test the degree to which a human can be fooled by fake philosophy, and perhaps to create discussion on the legitimacy of certain philosophical movements.
The challenge with the source material is
that formatting marks and notes are also often
collected by the lexicographer. For example, a
source from Project Gutenberg includes the
licensing information at the beginning and end of
the material. This factor is mitigated by the low incidence of these chains, which gives them a low chance of selection.

Markov Selection and Weighing


To begin a sentence, a random tri-gram is selected with no weighing involved. The three tokens in the tri-gram are placed into a queue, and the database is then queried for tri-grams, bi-grams, and uni-grams which match the queue's tokens. The incidence of each entry is then weighed with a function respective to each N-gram: each incidence is weighed by the natural logarithm, after which bi-gram scores are divided by 1.5 and uni-gram scores are divided by 2. Whichever chain has the highest weighted relevancy is selected, its predicted word is enqueued for the next step in the Markov process, and the oldest member of the queue is dequeued.
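A minimal sketch of this weighing and selection step, assuming each candidate row carries its order n and its incidence count (names are again illustrative):

<?php
// Weigh an incidence count: natural logarithm, then divide bi-gram
// scores by 1.5 and uni-gram scores by 2.
function weigh(int $n, int $incidence): float
{
    $divisors = [1 => 2.0, 2 => 1.5, 3 => 1.0];
    return log($incidence) / $divisors[$n];
}

// Pick the candidate chain with the highest weighted relevancy.
function selectNext(array $candidates): array
{
    usort($candidates, fn(array $a, array $b) =>
        weigh($b['n'], $b['incidence']) <=> weigh($a['n'], $a['incidence']));
    return $candidates[0];
}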
Complete sentences are not common in Stochatta; text generation ends when the requested number of tokens is reached. Leading and trailing marks ([ and ]) are inserted to indicate that a selection would make more sense in the context of a larger, more complete selection.

Testing
To test the performance of Stochatta, part of the website contains a survey which asks participants to select which quotes are human-created and which are machine-generated. Participants are tested with 9 quotes, between 2 and 5 of which are machine-generated. Because querying the MySQL database results in a long execution time when generating sentences, all quotes, machine and human, are pre-generated. Human-created quotes are curated by hand and measured for appropriate length before being put into the database. The content of the human quotes is not considered, but all human quotes are selected to begin and end in the middle of a sentence to match this feature of the machine-generated quotes. Both human and machine-generated quotes are about 50 words long.
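A sketch of how one survey round could be assembled from the pre-generated pools (the pool arrays and function name are assumptions, not from the Stochatta source):

<?php
// Draw 9 quotes: between 2 and 5 machine-generated, the rest human,
// shuffled so that position reveals nothing about origin.
function buildSurvey(array $machinePool, array $humanPool): array
{
    $machineCount = random_int(2, 5);
    shuffle($machinePool);
    shuffle($humanPool);
    $quotes = array_merge(
        array_slice($machinePool, 0, $machineCount),
        array_slice($humanPool, 0, 9 - $machineCount)
    );
    shuffle($quotes);
    return $quotes;
}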

Sources cited

[1] Shannon, C. E. "A Mathematical Theory of Communication." ACM SIGMOBILE Mobile Computing and Communications Review 5, no. 1 (January 2001): 6-10. Accessed April 26, 2016.

[2] Sokal, A. http://www.physics.nyu.edu/faculty/sokal/index.html
