Abstract
N-gram Markov chains are stochastic
processes in which the probability of the next
step in the process, given the entire history of the
process thus far, is equal to the probability of the
next step in the process given the last N steps.
English grammar can be modeled by stochastic
processes, as shown by Shannon [1]. However,
proper text generation requires a diverse source
of information from which a Markov chain can
generate sentences. Web crawlers can access
such diverse sources of information for
cataloging, but exhaustively logging every word
ever said is impossible, so methods are needed
that work with small, discrete sources of
information.
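The N-gram Markov property defined at the start of the abstract can be written as a single equation. The symbols are assumptions introduced here for illustration: $X_t$ denotes the word at step $t$ and $x$ a candidate next word.

```latex
P(X_{t+1} = x \mid X_1, \dots, X_t)
  = P(X_{t+1} = x \mid X_{t-N+1}, \dots, X_t)
```

Under this definition, a uni-gram chain ($N = 1$) conditions on the previous word only, and a tri-gram chain ($N = 3$) on the previous three.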
Tri-gram Markov chains create more
human-sounding sentences than uni-gram
Markov chains. However, in a small set of data,
naturally occurring tri-grams are rare except in
specific combinations. Consequently, exclusively
choosing tri-gram Markov chains often results in
simply repeating the entire source material
rather than generating unique sentences. While
uni-gram Markov chains can produce a more
diverse range of sentences, those sentences are
often nonsensical. A proper selection mechanism
is therefore needed to choose between the two
kinds of Markov chain.
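One possible selection mechanism, sketched here in Python purely as an illustration, is to prefer the tri-gram chain and fall back to the uni-gram chain whenever the tri-gram context offers too few distinct continuations. The function names, the `min_choices` threshold, and the back-off rule itself are assumptions; the abstract does not say which mechanism Stochatta uses. Following the definition above, an N-gram chain here conditions on the last N words.

```python
import random
from collections import defaultdict


def build_chain(words, n):
    """Map each run of n consecutive words to the list of words that follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - n):
        chain[tuple(words[i:i + n])].append(words[i + n])
    return chain


def generate(words, length, min_choices=2, seed=None):
    """Generate up to `length` words, backing off from tri-gram to uni-gram.

    Hypothetical rule: use the tri-gram chain only when it offers at least
    `min_choices` distinct next words; otherwise use the uni-gram chain.
    Requires a source of more than three words.
    """
    rng = random.Random(seed)
    tri = build_chain(words, 3)
    uni = build_chain(words, 1)

    # Seed the output with three consecutive words from the source.
    start = rng.randrange(len(words) - 3)
    out = list(words[start:start + 3])

    while len(out) < length:
        followers = tri.get(tuple(out[-3:]), [])
        if len(set(followers)) < min_choices:
            # Tri-gram context would only replay the source; widen the choice.
            followers = uni.get((out[-1],), [])
        if not followers:
            break  # the last word never has a successor in the source
        out.append(rng.choice(followers))
    return out


source = "the cat sat on the mat and the cat ran to the mat".split()
sample = generate(source, length=10, seed=0)
print(" ".join(sample))
```

Raising `min_choices` pushes the generator toward the uni-gram chain (more varied, less coherent); lowering it to 1 reproduces pure tri-gram behavior wherever a tri-gram context exists.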
Testing
To test the performance of Stochatta, part
of the website contains a survey which asks
participants to identify which quotes are
human-created and which are machine-generated.
Participants are tested with 9 quotes, of which
between 2 and 5 are machine-generated. Because
generating sentences by querying the MySQL
database takes a long time, all quotes, machine
and human, are pre-generated. Human-generated
quotes are curated by hand and measured for
appropriate length before being put into the
database. Content of the human quotes is not
considered, but all human quotes are selected to
begin and end in the middle of a sentence to match
this factor in machine-generated quotes. Both
machine- and human-generated quotes are
about 50 words long each.
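The survey composition described above could be assembled along the following lines. This is a minimal sketch, not the site's actual code: the function name, the uniform draw of the machine-generated count, and the in-memory lists standing in for the pre-generated database rows are all assumptions.

```python
import random


def build_survey(human_quotes, machine_quotes, total=9,
                 min_machine=2, max_machine=5, seed=None):
    """Return `total` shuffled (quote, label) pairs for one participant."""
    rng = random.Random(seed)
    # Assumption: the number of machine quotes is drawn uniformly from 2..5.
    n_machine = rng.randint(min_machine, max_machine)
    picked = [(q, "machine") for q in rng.sample(machine_quotes, n_machine)]
    picked += [(q, "human") for q in rng.sample(human_quotes, total - n_machine)]
    rng.shuffle(picked)  # hide the origin of each quote from the participant
    return picked


# Placeholder quotes standing in for pre-generated database entries.
human = [f"human quote {i}" for i in range(10)]
machine = [f"machine quote {i}" for i in range(10)]
survey = build_survey(human, machine, seed=1)
```

Each pair keeps its label so that a participant's answers can be scored afterwards.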
Sources cited
[1] http://www.physics.nyu.edu/faculty/sokal/index.html