Paola Cotticelli-Kurras
and Federico Giusfredi
Formal Representation and the Digital Humanities
All rights for this book reserved. No part of this book may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without
the prior permission of the copyright owner.
HANNES A. FELLNER
AND BERNHARD KOLLER
Abstract
Since early 2011 the Linguistics Department at the University of Vienna
has hosted a project to create an electronic edition of all available
Tocharian manuscript fragments combined with a linguistic database (A
Comprehensive Edition of Tocharian Manuscripts [CEToM]:
www.univie.ac.at/tocharian). The present study demonstrates how the
CEToM database can effectively be employed in the study of Tocharian B
phonology, specifically its accentual system. Jasanoff (2015) argued that
the location of the stress accent is the result of sonority-based principles.
The eventual goal of our ongoing study is to test the predictions of
Jasanoff’s proposal on the entire Tocharian B nominal system. To that end
we present an early version of an algorithm implemented in Perl that
automatically determines the underlying accent of nominal stems.
1. Tocharian
Tocharian is a branch of the Indo-European language family.2 It consists
of two languages, designated Tocharian A and Tocharian B. The
Tocharian linguistic material was discovered around the turn of the 20th
century by Central Asian expeditions of the major political powers of the
time in today’s Xinjiang Uyghur Autonomous Region of the
1 We would like to thank the participants of the workshop Formal
Representation and Digital Humanities for useful comments and suggestions
on the present material, as well as Anna Pagé for providing comments and
corrections on a draft of this paper. The usual disclaimer applies. Research
for this paper has been supported by the Austrian Science Fund (FWF):
project number Y 492-G20.
2 See Fortson (2009) for a comprehensive overview and introduction to
Indo-European languages and linguistics.
80 On Sonority and Accent in Tocharian B1
The vast majority of Tocharian texts are Buddhist in nature and, because
their early date of attestation makes Tocharian the oldest extant
non-Indian Buddhist language, have a bearing on the question of the spread
of Buddhism along the ancient Silk Road to China. The Tocharian tradition,
though fragmentary, belongs to the oldest extant Buddhist literature. While
many texts are translations or adaptations of Sanskrit or Prakrit originals,
Buddhist drama enjoyed popularity among the Tocharians and this is
where Tocharian literature made original contributions to Buddhism.9
There is also some secular literature, such as medical and grammatical
texts, commercial exchanges, letters, and caravan passes.
3 See Fellner (2007) for an overview of the expeditions.
4 See Mallory (2015), with further references.
5 See Anthony and Ringe (2015), with further references.
6 See Sander (1968).
7 See Malzahn (2007b).
8 See Malzahn (2007c).
9 See Pinault (2016).
Hannes A. Fellner and Bernhard Koller 81
Despite some great scholarly achievements during the first decades after
the discovery of the Tocharian languages, until twenty years ago only very
few experts had access to the manuscripts. This virtual non-availability of
the Tocharian linguistic material stood in the way of thorough linguistic
research. Until recently, there were only a few text editions, covering
merely a fraction of the attested corpus. This resulted in incoherent and
non-exhaustive handbooks, which in turn made Tocharian the most
understudied major branch of Indo-European. It also made studies of
Tocharian history, culture, and religion based on the linguistic material
almost impossible for non-specialists.
2. CEToM
Since early 2011 the Linguistics Department at the University of Vienna
has hosted the CEToM project to create an electronic edition of all
available Tocharian material combined with a linguistic database. This
project was generously funded by the START Program of the Austrian
10 See, e.g., Adams (2013); Carling, Pinault, Winter (2009).
11 Pinault (2008).
12 Malzahn (2007a).
13 See, e.g., Ringe (1996); Peyrot (2008); Malzahn (2010).
14 Tocharian and Indo-European Studies, Museum Tusculanum Press,
established by the late Jörundur Hilmarsson (University of Iceland) and
currently edited by Birgit Anette Olsen (University of Copenhagen),
Georges-Jean Pinault (École Pratique des Hautes Études, Paris), Michaël
Peyrot (Leiden University), and Thomas Olander (University of Copenhagen).
15 See, e.g., the treatment of the verbal system in Jasanoff (2003: 144-214).
16 See, e.g., the contributions in Malzahn et al. (2015).
17 http://www.perl.org/.
18 http://www.w3.org/TR/html4/.
19 http://www.w3.org/Style/CSS/.
20 http://www.unicode.org/.
21 http://titus.fkidg1.uni-frankfurt.de/framee.htm?/index.htm.
22 http://idp.bl.uk/.
The parameters for a given text are: press mark(s); provenience (main find
spot, specific find spot, expedition code, collection); language and script
(language, linguistic stage, additional linguistic characteristics, script); text
contents (title of the work, passage, manuscript, text genre, text subgenre,
verse/prose, parallel text); object (manuscript, leaf number, material, form,
accessibility, size, completeness, number of lines, line distance); image;
manuscript remarks; transliteration; transcription; translation; philological
commentary; linguistic commentary; parallel text commentary; references;
editor; bibliography. The lexical database is a thesaurus and provides the
relevant morphological information on all parts of speech. To date,
CEToM contains entries for more than 10,000 manuscript fragments,
13,000 lexical items, and 1,200 bibliographical items.
In Tocharian B the location of the word accent can be inferred from vowel
alternations, specifically, the behavior of the central vowel phonemes /ə/
and /a/. These underlying segments show different realizations depending
on whether they are accented or not. The underlying segment /ə/ is
rendered by <a> [ ] if accented and by <ä> [ ] if unaccented. The
underlying segment /a/ is rendered by <ā> [a] if accented and by <a> [ ] if
unaccented. In the Tocharian version of the Brāhmī script the difference
between <ä> [ ], <a> [ ], and <ā> [a] is expressed by the use of different
characters.
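The two rendering rules can be written down as a small lookup table. The following Python sketch is only an illustration and not part of the CEToM code base; the names GRAPHEMES, render_central_vowel and possible_sources are hypothetical. It also makes explicit why <a> is ambiguous when seen in isolation.

```python
# Illustrative only: (underlying segment, accented?) -> transliterated grapheme.
GRAPHEMES = {
    ("ə", True): "a",
    ("ə", False): "ä",
    ("a", True): "ā",
    ("a", False): "a",
}

def render_central_vowel(segment, accented):
    """Return the grapheme for an underlying central vowel."""
    return GRAPHEMES[(segment, accented)]

def possible_sources(grapheme):
    """Return every (segment, accented) pair that yields this grapheme."""
    return [key for key, value in GRAPHEMES.items() if value == grapheme]

# <a> has two possible sources: accented /ə/ or unaccented /a/.
print(possible_sources("a"))
```

Because <ā> and <ä> each have exactly one source, they fix both the underlying vowel and its accent status, whereas <a> needs additional evidence from related forms.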
Table 1.
Table 2.
Table 3.
In the first column of table 3 the lexical accent has been retracted from its
underlying position in order to avoid accenting the word-final syllable. In
23 More precisely, the final /ə/ is deleted in prose contexts but can be
realized as [ ] or [o] in metrical contexts (for details see Malzahn 2012).
the second column the same stem is followed by an unaccented suffix and
therefore, the accent can remain in its original position. Note that the
forms in rows b) and c) involve the same underlying representations, the
only difference being that the final /ə/ in the genitive plural morpheme
/ntsə/ is realized as [o] in b) and deleted in c). Yet, both forms exhibit
accent retraction.
Table 4.
Table 5.
Table 6.
Apart from the nominal forms discussed above, Jasanoff built his
hypothesis about the development of the Tocharian B accent system
primarily on the verbal system, where underlyingly initial accent occurs
systematically in a number of categories. Since, according to him, Weight-
to-Stress was a purely phonological principle operating in Pre-Tocharian,
we would expect to find some reflexes of it within the nominal system as
well. The most obvious way such a reflex could manifest itself would be if
nominal stems with a full vowel in the first syllable and a schwa in the
second syllable bear initial accent more frequently than other types of
stems. Determining whether this prediction is actually borne out within the
attested corpus of Tocharian B requires parsing every single nominal
paradigm in order to determine the underlying accent
5. Approach
We employed an algorithm implemented in Perl 5 to automatically parse
the CEToM dictionary in order to determine the location of the accent in
nominal and adjectival forms. The goal is to be able to automatically
retrieve stems with the relevant structure (i.e. heavy in the first syllable,
light in the second) and initial accent, and compare their distribution with
other types of stems. The corpus we are using for this study comes from
two sources. 2706 nominal forms are taken from the CEToM dictionary,
which is still a work in progress and does not contain all of the forms
attested in the Tocharian B text corpus. Therefore, we increased the data
coverage by adding forms from the Tocharian B dictionary by Douglas
Adams (Adams 1999). The resulting 4562 individual forms are grouped
into 2024 families. We use the term family to refer to a set of
morphologically related forms. This includes forms belonging to the same
paradigm, as well as forms that are related by derivational processes. The
reason we need to operate in terms of families rather than individual stem
forms is that in many cases, the underlying accent location can only be
established by comparing multiple forms of the same stem. Take, for
example, the word for ‘river’ cake with the genitive ckentse.
Table 7.
The nominative cake shows that the underlying representation of the word
contains a schwa intervening between the first two consonants. Therefore,
in terms of underlying representations, ckentse is accented on the second
vowel, despite the fact that in both surface representations it is the first
syllable that is accented. In the case of the nominative this is due to
Marggraf’s accent retraction rule, which prohibits final syllables from
being accented on the surface, while in the case of the genitive the vowel
/ə/ undergoes deletion in an unaccented open syllable.
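The interaction of the two processes can be illustrated with a short Python sketch that derives a surface form from an underlying one. It is not the authors' Perl implementation. The function name surface_form, the representation of syllables as (onset, nucleus, coda) tuples, and the syllabification of the examples are all assumptions made for the illustration.

```python
def surface_form(syllables, accent):
    """Derive a surface string from underlying syllables.

    syllables: list of (onset, nucleus, coda) tuples (hypothetical format)
    accent:    index of the underlyingly accented syllable
    Returns (surface string, index of the surface-accented syllable).
    """
    last = len(syllables) - 1
    # Marggraf's rule: a final syllable may not bear the surface accent,
    # so a final accent is retracted by one syllable.
    if accent == last and last > 0:
        accent -= 1
    out = []
    for i, (onset, nucleus, coda) in enumerate(syllables):
        accented = (i == accent)
        if nucleus == "ə":
            if accented:
                vowel = "a"
            elif coda == "":
                vowel = ""   # unaccented /ə/ is deleted in an open syllable
            else:
                vowel = "ä"
        elif nucleus == "a":
            vowel = "ā" if accented else "a"
        else:
            vowel = nucleus  # non-central vowels are left unchanged
        out.append(onset + vowel + coda)
    return "".join(out), accent

# Both forms carry the underlying accent on the second syllable.
print(surface_form([("c", "ə", ""), ("k", "e", "")], 1))
print(surface_form([("c", "ə", ""), ("k", "e", "n"), ("ts", "e", "")], 1))
```

With these inputs the nominative surfaces as cake with retracted accent, while the genitive loses its first vowel and surfaces as ckentse.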
We will now give a brief overview of how lexical entries are structured
within the CEToM database.24 Each form is stored as a separate entry but
contains a reference to the nominative singular of the same paradigm, or
the form it is derived from. The following examples involve two forms of
the paradigm for ‘pain’. The nominative/oblique plural läklenta (second
entry) contains a reference to the nominative singular lakle (first entry),
which doubles as the lemma form of the paradigm as a whole. The
reference is contained in the field <w_lemma>, indicating that the two
forms belong to the same paradigm.
<entry>
<page_name> lakle </page_name>
<w_case> Nominative+Oblique </w_case>
<w_class> Noun </w_class>
<w_gender> Alternant </w_gender>
<w_language> TB </w_language>
<w_meaning> "suffering, pain" </w_meaning>
<w_noun_number> Singular </w_noun_number>
</entry>
<entry>
<w_lemma> lakle </w_lemma>
<page_name> läklenta </page_name>
<w_case> Nominative+Oblique </w_case>
<w_class> Noun </w_class>
<w_language> TB </w_language>
<w_noun_number> Plural </w_noun_number>
</entry>
24 Rather than presenting the CEToM-internal markup we are giving an XML
version thereof, as the reader is likely more familiar with this type of
markup. The internal structure of the entries is the same in either
representation, however.
<entry>
<w_family> lakle </w_family>
<page_name> läklessu </page_name>
<w_case> Nominative </w_case>
<w_class> Adjective </w_class>
<w_language> TB </w_language>
<w_noun_number> Singular </w_noun_number>
</entry>
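Entries of this shape can be grouped into families by following the <w_lemma> and <w_family> references. The Python sketch below is a hedged illustration rather than the project's actual code. It assumes that the entries are wrapped in a single root element (here called <entries>, a hypothetical name) and that an entry without either reference is itself the head of its family.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated sample data modelled on the entries shown above.
SAMPLE = """<entries>
<entry><page_name> lakle </page_name><w_class> Noun </w_class></entry>
<entry><w_lemma> lakle </w_lemma><page_name> läklenta </page_name></entry>
<entry><w_family> lakle </w_family><page_name> läklessu </page_name></entry>
</entries>"""

def group_families(xml_text):
    """Map each family head to the list of forms that refer to it."""
    families = defaultdict(list)
    for entry in ET.fromstring(xml_text).iter("entry"):
        form = entry.findtext("page_name").strip()
        # Inflected forms point to their lemma, derived forms to their
        # family; a form with neither reference heads its own family.
        head = entry.findtext("w_lemma") or entry.findtext("w_family") or form
        families[head.strip()].append(form)
    return dict(families)

print(group_families(SAMPLE))
```

The sketch does not resolve chains of references (a derived form whose base is itself listed under another lemma); a full implementation would have to follow such chains to the top of the family.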
Let us now look at how our algorithm uses the CEToM data in order to
determine the underlying accent of individual nominal forms. The
following sample derivation illustrates how the underlying accent of the
adjectival stem lāre ‘beloved’ is determined based on a set of related
forms25 from the same family.
Base: lāre → lVrV
Table 8.
25 In the interest of space we are only giving a subset of the actually
attested forms, which provide sufficient information to determine the
location of the accent.
The nominative singular masculine form lāre functions as the base form of
the family. This form is converted into a CV-template, in which all vowels
are removed and every consonant is followed by a potential vowel
position. The purpose of this template is to keep track of vowel
alternations (specifically between <ä>, <a>, <ā> and zero) and to establish
underlying representations of central vowels. The base form lāre yields the
template l V r V. This template is updated with every parsed form of the
family in order to record alternations between central vowels, which are
often required to establish the underlying representation of the vowel
position. Specifically, an alternation between <a> and <ä>/zero indicates
the underlying segment /ə/ while an alternation between <a> and <ā>
points to /a/. The algorithm cycles through each form of the family and
attempts to determine the underlying accent of the stem based on both
evidence internal to the form itself and information from the current
version of the template. The algorithm first parses the base form of the
family (Table 8) and determines that lāre has initial surface accent based
on the generalization that <ā> is always accented. Since <ā> can only
represent the underlying segment /a/, the template can be updated
accordingly (lVrV → larV). The items in the column labelled surface
accent represent an intermediary representation of the form containing a
hypothesis about the underlying representation of the central vowels and
the location of the surface accent based on the information gathered so far.
In the case of láre this is already the correct analysis. However, the
location of the underlying accent is obscured by Marggraf’s rule. That is,
even if the lexical accent was located on the second syllable, it would have
to be realized on the first syllable due to the ban on final accented surface
syllables. The algorithm then moves on to the next form (8.2). In isolation
the feminine plural form larona is ambiguous in that the accent could be
either on the first or the second syllable, since the vowel <a> either
represents an accented schwa or an unaccented /a/. The current form of the
template resolves this ambiguity and the underlying accent in this form is
correctly determined to be non-initial. The derived nominal larauñe works
exactly the same way. As noted above, the fact that larauñe is related to
lāre via derivation instead of inflection is irrelevant for determining the
accent of the stem.
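The template procedure described in this section can be approximated in a few lines of Python. This is a simplified sketch under stated assumptions, not the authors' Perl implementation. Every non-vowel character is treated as one consonant, forms are aligned with the base purely by consonant position, word-initial vowels are ignored, and the names parse, build_template and initial_accent are hypothetical.

```python
VOWELS = set("aāäeiou")

def parse(form):
    """Split a form into (consonant, following vowel string) pairs."""
    pairs = []
    for ch in form:
        if ch in VOWELS:
            if pairs:  # word-initial vowels are skipped in this sketch
                consonant, vowels = pairs[-1]
                pairs[-1] = (consonant, vowels + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def build_template(base, related_forms):
    """Return (consonant, vowel) slots; vowel is 'a', 'ə' or 'V' (unknown)."""
    slots = [(consonant, set()) for consonant, _ in parse(base)]
    for form in [base] + list(related_forms):
        for (_, seen), (_, vowels) in zip(slots, parse(form)):
            seen.add(vowels)
    template = []
    for consonant, seen in slots:
        if "ā" in seen:
            vowel = "a"      # <ā> can only represent /a/
        elif "a" in seen and ("ä" in seen or "" in seen):
            vowel = "ə"      # <a> alternating with <ä> or zero
        else:
            vowel = "V"
        template.append((consonant, vowel))
    return template

def initial_accent(form, template):
    """True/False if the first vowel slot is accented, None if unknown."""
    grapheme = parse(form)[0][1]
    underlying = template[0][1]
    if grapheme == "ā":
        return True
    if underlying == "a" and grapheme == "a":
        return False
    if underlying == "ə":
        return grapheme == "a"
    return None

template = build_template("lāre", ["larona"])
print(template, initial_accent("larona", template))
```

Note that initial_accent reports the surface accent. For the disyllabic base lāre the result is obscured by Marggraf’s rule, so it is the non-initial accent of the longer form larona that settles the underlying accent of the stem.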
Let us now return to the word for ‘river’, cake, to see how the algorithm
determines the location of schwas that have been deleted on the surface.
{cake.NOM/OBL.SG, ckentse.GEN.SG}
Table 9.
The sample consists again of a subset of the forms attested for the family.
As with lāre, the surface accent of cake must be located on the first
syllable due to Marggraf’s rule. From this it follows that <a> must represent
underlying /ə/, which is recorded in the template (cVkV → cəkV). Using
the updated template, the algorithm correctly restores the schwa following
the initial consonant in the genitive ckentse, despite it having undergone
deletion on the surface. Based on this alternation, the underlying accent
can be correctly determined to be non-initial.
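The restoration step can be sketched as follows. The Python function restore_schwa is a hypothetical helper, and the template format, a list of (consonant, underlying vowel) pairs with 'V' for an undetermined vowel, is an assumption of this illustration. The function re-inserts /ə/ after a template consonant only where the surface form shows no vowel in that position.

```python
VOWELS = set("aāäeiouə")

def restore_schwa(form, template):
    """Re-insert /ə/ after template consonants that surface without a vowel."""
    out = []
    slot = 0
    for i, ch in enumerate(form):
        out.append(ch)
        if ch in VOWELS or slot >= len(template):
            continue
        consonant, underlying = template[slot]
        slot += 1
        next_ch = form[i + 1] if i + 1 < len(form) else ""
        # Only restore where the template records /ə/ and nothing surfaces.
        if ch == consonant and underlying == "ə" and next_ch not in VOWELS:
            out.append("ə")
    return "".join(out)

template = [("c", "ə"), ("k", "V")]   # cVkV updated to cəkV
print(restore_schwa("ckentse", template))
```

Once the schwa is restored, the first vowel of the genitive is seen to be unaccented, since an accented /ə/ would surface as <a>, and the accent must therefore lie further to the right.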
The family of the noun kercapo ‘donkey’ illustrates two additional aspects
of the method employed.
Table 10.
more to this world’ (Adams 2013: 12) contains two instances of <ā>,
conflicting with the general rule that <ā> corresponds to accented /a/
(under the assumption that words only bear a single accent). Many other
cases do not have as obvious a tell as anāgāme that they deviate from the
general accent pattern but are, due to their status as loanwords, still
unreliable as evidence for inherited accentual properties. Before arriving at
any definitive results, it is therefore necessary to enrich the dictionary with
information regarding borrowing.
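Until such information is available, the one obvious tell mentioned above can at least be checked mechanically. The following Python sketch flags forms with more than one <ā>. The function names are hypothetical, and the check is only a heuristic that cannot replace explicit annotation of loanwords in the dictionary.

```python
import unicodedata

def has_multiple_long_a(form):
    """True if the form contains more than one <ā>."""
    # Normalize so that a + combining macron is counted like precomposed ā.
    normalized = unicodedata.normalize("NFC", form)
    return normalized.count("ā") > 1

def flag_candidates(forms):
    """Return the forms that conflict with the one-accent generalization."""
    return [form for form in forms if has_multiple_long_a(form)]

print(flag_candidates(["lāre", "anāgāme", "larona"]))
```

Forms flagged in this way could be excluded from the accent statistics or reviewed by hand; loanwords without a second <ā> would pass unnoticed.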
Bibliography
Adams D.Q., 1999, A Dictionary of Tocharian B, (Leiden Studies in Indo-
European 10), Rodopi, Amsterdam/Atlanta.
—. 2013, A Dictionary of Tocharian B, 2nd edition, (Leiden Studies in
Indo-European 10), Rodopi, Amsterdam/New York.
Anthony D.W. and Ringe D.A., 2015, “The Indo-European Homeland
from Linguistic and Archaeological Perspectives”, in Annual Review of
Linguistics 1, 199-219.
Carling G., Pinault G.-P. and Winter W., 2009, A Dictionary and
Thesaurus of Tocharian A. Volume 1: letters a-j. Harrassowitz,
Wiesbaden.
CEToM Comprehensive Edition of Tocharian Manuscripts:
univie.ac.at/tocharian.
Fellner H.A., 2007, “The Expeditions to Tocharistan”, in Malzahn M.
(ed.), Instrumenta Tocharica, Winter, Heidelberg, 13-36.
Fortson B.W., 2009, Indo-European Language and Culture: An
Introduction, 2nd edition, Blackwell, Malden, MA.
IDP International Dunhuang Project: idp.bl.uk.
Jasanoff J.H., 2003, Hittite and the Indo-European verb, Oxford
University Press, Oxford.
—. 2015, “The Tocharian B accent”, in Malzahn M. (ed.), Tocharian Texts