


This article was downloaded by: [Universidad De Concepcion] On: 03 May 2013, At: 12:48 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Quantitative Linguistics


Hapax Legomena and Language Typology

Ioan-Iovitz Popescu & Gabriel Altmann

Bucharest University

Lüdenscheid

Published online: 07 Oct 2008.

To cite this article: Ioan-Iovitz Popescu & Gabriel Altmann (2008): Hapax Legomena and Language Typology, Journal of Quantitative Linguistics, 15:4, 370-378 To link to this article:


Journal of Quantitative Linguistics 2008, Volume 15, Number 4, pp. 370–378. DOI: 10.1080/09296170802326699

Hapax Legomena and Language Typology*

Ioan-Iovitz Popescu¹ and Gabriel Altmann²

¹Bucharest University; ²Lüdenscheid


Counting word forms one obtains more hapax legomena in highly synthetic languages than in highly analytic ones. We propose an index of analytism based exclusively on mechanical counting of word forms in text.

Since the seminal work of V. Skalička (see esp. 2004–2006), language typology is no longer a classification of languages. Though many linguists are still engaged in describing, collecting and locating isolated language phenomena, typology can nowadays be seen as the study of relationships between language properties. In general, the more contingent two phenomena are, the stronger their relationship, but it can be expected that even distant phenomena like phonemics and syntax display some dependences. The consistent continuation of this kind of typology has become language synergetics (cf. Köhler, 1986, 2005), which tries to express all relationships formally and derive them from a common self-regulation scheme. Needless to say, all relationships in language are stochastic, and the resulting equations are either nonlinear regression functions or probability distributions. Many of them hold only for averages. However, up to now, textual phenomena have not been exploited for finding typological relationships, because studying texts in many languages at once is not as easy as using ready-made grammar textbooks from which elaborated phenomena can be selected. Perhaps the first

*Address correspondence to: Gabriel Altmann, Stüttinghauser Ringstr. 44, 58515 Lüdenscheid. E-mail:
0929-6174/08/15040370 © 2008 Taylor & Francis



attempt to use textual phenomena was made in Popescu et al. (2008), where a relationship was found between a function of the h-point (cf. Popescu, 2007) of the rank-frequency distribution of word forms and the degree of synthetism of a language. The study was performed on texts in 20 languages, and the relationship does not depend on text length.

In the present contribution we try to show a relationship between the hapax legomena of texts and the synthetism/analytism of a language. Logically, if a language is highly synthetic then, whatever the text length, not all forms of each word will occur more than once. A great number of forms will occur only once, thus forming the stock of hapax legomena occupying the last HL ranks. In highly analytic languages, by contrast, the number of forms is small, hence all words have a greater chance of being repeated. The situation would be quite different if we analysed lemmatized texts, which contain no morphology. Fitting a theoretical function (Figures 1a, b, c show only the bottom part of the data) shows that in highly synthetic languages with a long hapax legomena sequence the function tends to underestimate the frequencies, because it tries to capture the steep decrease in frequencies. In highly analytic languages (Figure 1c) the theoretical function overestimates the empirical frequencies (not only those of hapax legomena).

In the present work we try to give these considerations a quantitative expression, as illustrated in Figures 1a, b, c. For this purpose, let V denote the vocabulary of word forms and HL the number of ranks at which hapax legomena occur in the rank-frequency distribution, and let us suppose that the rank-frequency distribution follows Zipf's law. The validity of Zipf's law has been corroborated on innumerable data sets in many languages, and sporadic exceptions can be captured by one of the many generalizations of this law.
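The two counts V and HL can be obtained from a text by purely mechanical means. The following is a minimal sketch, assuming whitespace tokenization and lowercasing (illustrative choices that the paper does not specify); the function names and the toy sentence are hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return word-form frequencies in descending order (rank 1 first)."""
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)

def vocabulary_and_hapax(freqs):
    """V = number of distinct word forms; HL = number of forms occurring once."""
    V = len(freqs)
    HL = sum(1 for f in freqs if f == 1)
    return V, HL

# Hypothetical toy text; whitespace tokenization is an assumption.
text = "the cat saw the dog and the dog saw a bird"
freqs = rank_frequency(text.lower().split())
V, HL = vocabulary_and_hapax(freqs)
print(freqs, V, HL)  # [3, 2, 2, 1, 1, 1, 1] 7 4
```

Because the frequencies are sorted in descending order, the HL hapax legomena occupy the last HL ranks, here ranks 4 to 7.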
For the sake of simplicity we shall use the Zipf function $f(r) = c/r^{a}$, in which $a$ and $c$ are fitting parameters, thus getting rid of the necessity of normalization and truncation at the right-hand side. Now, if we fit this function to rank-frequency data, we may expect that with a good fit the function reaches $f(r) = 1$ (i.e. the level of hapax legomena) exactly in the middle of the hapax legomena (i.e. at $r = V - HL/2$), as shown in Figure 1a for a Bulgarian text. But iterative fitting of the Zipf function may yield different values. Therefore, in the following we will

Fig. 1a. Zipfian location of hapax legomena in a balanced language (here Bulgarian).

Fig. 1b. Over-Zipfian location of hapax legomena in a highly synthetic language (here Hungarian).

Fig. 1c. Under-Zipfian location of hapax legomena in a highly analytic language (here Hawaiian).

take the above central location as a reference for any other fitting and introduce the indicator

$$A = \frac{c}{(V - HL/2)^{a}} \qquad (1)$$

as an expression of the analytism of a language. The interval of this indicator is not yet known, but the relationship to analytism/synthetism is evident: the greater A, the stronger the analytism. In order to check our hypothesis we used the texts analysed in Popescu et al. (2008) and obtained the A-values presented in Table 1 for 100 texts from 20 languages and in Table 2 for the corresponding language averages. The individual extreme cases of Table 1 are illustrated in Figure 1b for a highly synthetic Hungarian text and in Figure 1c for a highly analytic Hawaiian text. Generally, the fitted Zipf curve is displaced rightwards for analytic languages and leftwards for synthetic languages with respect to the real rank-frequency distribution. The displacement depends on the degree of analytism/synthetism. Clearly, there is a one-to-one correspondence between the analytism indicator A and its crossing point



Table 1. Fitting Zipf's function $f(r) = c/r^{a}$ to data of 100 texts from 20 languages. The last column is the analytism indicator $A = c/(V - HL/2)^{a}$.

ID         V       a          c      R²    HL       A
B 01     400  0.6850    41.8602  0.9837   298  0.9507
B 02     201  0.5704    17.6950  0.8705   153  1.1292
B 03     285  0.5550    20.9975  0.8790   212  1.1798
B 04     286  0.6169    23.6917  0.9619   222  0.9790
B 05     238  0.6202    22.0499  0.9367   187  1.0090
Cz 01    638  0.7473    54.2844  0.9764   517  0.6416
Cz 02    543  0.7169    51.9648  0.9767   412  0.8013
Cz 03   1274  0.8028   175.4805  0.9832   964  0.8261
Cz 04    323  0.6228    23.3822  0.9537   241  0.8562
Cz 05    556  0.8722    77.1944  0.9715   445  0.4864
E 01     939  0.7657   145.9980  0.9620   662  1.0783
E 02    1017  0.7434   180.1325  0.9661   735  1.4610
E 03    1001  0.8179   254.7482  0.9752   620  1.2123
E 04    1232  0.8712   385.9532  0.9870   693  1.0449
E 05    1495  0.8009   319.1386  0.9822   971  1.2529
E 07    1597  0.7568   300.1258  0.9347  1075  1.5416
E 13    1659  0.8034   811.1689  0.9800   736  2.5688
G 05     332  0.6935    32.8211  0.9646   250  0.8129
G 09     379  0.6523    32.5565  0.9626   302  0.9431
G 10     301  0.6053    21.8114  0.9402   237  0.9331
G 11     297  0.5895    19.9677  0.9593   232  0.9320
G 12     169  0.6062    14.3627  0.9514   141  0.8888
G 14     129  0.5755    10.8110  0.9349   107  0.8977
G 17     124  0.5515    13.1021  0.9349    84  1.1531
H 01    1079  1.2268   214.2708  0.9600   844  0.0749
H 02     789  1.1865   122.0057  0.9365   638  0.0824
H 03     291  1.2114    44.9653  0.8864   259  0.0950
H 04     609  0.9549    74.8581  0.9451   509  0.2753
H 05     290  0.8168    30.9795  0.9093   250  0.4784
Hw 03    521  0.7932   329.6012  0.9489   255  2.8821
Hw 04    744  0.7633   678.1305  0.9154   347  5.3384
Hw 05    680  0.7267   592.6243  0.8742   302  6.2199
Hw 06   1039  0.7816  1081.7823  0.9352   500  5.8855
I 01    3667  0.7266   509.5979  0.9336  2514  1.7784
I 02    2203  0.7488   305.6487  0.9559  1604  1.3468
I 03     483  0.7895    56.8099  0.9523   382  0.6427
I 04    1237  0.7014   153.3448  0.9385   848  1.3948
I 05     512  0.6524    54.5840  0.9293   355  1.2306
In 01    221  0.5809    18.2346  0.9486   166  1.0420
In 02    209  0.5915    19.1717  0.9583   147  1.0509
In 03    194  0.5417    15.6229  0.9565   130  1.1233
In 04    213  0.4877    11.9156  0.9574   145  1.0683
In 05    188  0.5374    19.4218  0.8843   121  1.4347
Kn 003  1833  0.6072    66.4545  0.9775  1373  0.9223
Kn 004   720  0.5237    22.1001  0.9699   564  0.9144
Kn 005  2477  0.6621   124.5588  0.9105  1784  0.9480
Kn 006  2433  0.5809    95.9573  0.9522  1655  1.3181
Kn 011  2516  0.5786    77.0267  0.9666  1873  1.0862
Lk 01    174  0.6416    23.4838  0.9348   127  1.1474
Lk 02    479  0.7731   139.2126  0.9510   302  1.5798
Lk 03    272  0.7512    71.8668  0.9527   174  1.4240
Lk 04    116  0.6792    18.7509  0.9801    80  0.9901
Lt 01   2211  0.7935   109.3668  0.9078  1792  0.3666
Lt 02   2334  0.8047   160.3530  0.9335  1878  0.4729
Lt 03   2703  0.6366   109.5291  0.9832  2049  0.9695
Lt 04   1910  0.6505   129.2023  0.9463  1359  1.2627
Lt 05    909  0.5877    34.1056  0.9713   737  0.8449
Lt 06    609  0.5293    19.3370  0.9325   521  0.8726
M 01     398  0.7680   185.4091  0.9225   202  2.3386
M 02     277  0.8197   123.4636  0.9693   146  1.5787
M 03     277  0.7902   147.8281  0.9557   133  2.1571
M 04     326  0.8353   137.7184  0.9763   192  1.4664
M 05     514  0.7484   297.2460  0.9306   239  3.3897
Mq 01    289  0.8030   240.0615  0.9588    91  2.9102
Mq 02    150  0.7440    46.4870  0.9655    86  1.4370
Mq 03    301  0.9795   225.2046  0.9856   138  1.0853
Mr 001  1555  0.6293    78.3965  0.9815  1128  1.0210
Mr 018  1788  0.6685   128.5531  0.9863  1249  1.1470
Mr 026  2038  0.6224   101.6971  0.9633  1486  1.1758
Mr 027  1400  0.6166   120.0829  0.9456   846  1.7214
Mr 288  2079  0.6304   100.2890  0.9683  1534  1.0857
R 01     843  0.6720    73.6423  0.9571   606  1.0739
R 02    1179  0.7567   115.8007  0.9802   908  0.7930
R 03     719  0.7175    60.8094  0.9778   567  0.7771
R 04     729  0.6673    52.4236  0.9798   573  0.8993
R 05     567  0.6746    48.1009  0.9743   424  0.9157
R 06     432  0.6349    30.3691  0.9350   353  0.8995
Rt 01    223  0.8575   123.9533  0.9645   127  1.6008
Rt 02    214  0.7469    83.2271  0.9316   128  1.9726
Rt 03    207  0.7208    78.6409  0.9465    98  2.0454
Rt 04    181  0.7359    60.2092  0.9358   102  1.6749
Rt 05    197  0.6917    87.0541  0.9469    73  2.5959
Ru 01    422  0.6538    36.1404  0.9604   316  0.9437
Ru 02   1240  0.7713   138.5450  0.9915   946  0.8251
Ru 03   1792  0.7106   158.2659  0.9620  1365  1.0851
Ru 04   2536  0.7181   234.3457  0.9571  1850  1.1661
Ru 05   6073  0.7826   775.3826  0.9807  4395  1.2063
Sl 01    457  0.7467    44.1840  0.9760   364  0.6665
Sl 02    603  0.6846    68.9001  0.9823   423  1.1571
Sl 03    907  0.7685   115.2402  0.9604   651  0.8651
Sl 04   1102  0.9187   334.8100  0.9912   701  0.7633
Sl 05   2223  0.7232   240.2785  0.9490  1593  1.2572
Sm 01    267  0.8285   177.1858  0.9678   119  2.1315
Sm 02    222  0.7752   123.5355  0.9450    96  2.2641
Sm 03    140  0.6858    58.1896  0.8708    75  2.4320
Sm 04    153  0.7925    89.0771  0.9563    76  2.0738
Sm 05    124  0.7161    46.3093  0.9263    66  1.8312
T 01     611  0.7624   120.0367  0.8817   465  1.2995
T 02     720  0.7803   144.5780  0.8685   540  1.2297
T 03     645  0.7652   167.7334  0.8923   447  1.6447

B, Bulgarian; Cz, Czech; E, English; G, German; H, Hungarian; Hw, Hawaiian; I, Italian; In, Indonesian; Kn, Kannada; Lk, Lakota; Lt, Latin; M, Maori; Mq, Marquesan; Mr, Marathi; R, Romanian; Rt, Rarotongan; Ru, Russian; Sl, Slovenian; Sm, Samoan; T, Tagalog.

(real or virtual) between the fitted Zipf curve and the hapax legomena level $f(r) = 1$. Finally, it is worth mentioning that the data of Table 1 reveal a good linear dependence between the hapax legomena length $HL$ and the vocabulary $V$, namely $HL = 0.7256\,V - 18.6979$, as further illustrated in Figure 2. Hence, replacing $HL$ in (1), we get a good approximation for the analytism indicator in the form

$$A = \frac{2^{a}\, c}{(1.2744\,V + 18.6979)^{a}} \qquad (2)$$
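As a worked check of the indicator, the sketch below evaluates A at the midpoint of the hapax ranks, its approximation via the regression of HL on V, and the crossing point of the fitted curve with the level 1, using two rows copied from Table 1. The function names are hypothetical; the crossing point $r_0 = c^{1/a}$ follows from setting $c/r^{a} = 1$.

```python
def analytism(V, HL, a, c):
    """Indicator (1): fitted Zipf value at the midpoint of the hapax ranks."""
    return c / (V - HL / 2) ** a

def analytism_approx(V, a, c):
    """Approximation (2): HL replaced by the regression 0.7256*V - 18.6979."""
    return 2 ** a * c / (1.2744 * V + 18.6979) ** a

def crossing_point(a, c):
    """Rank r0 at which the fitted curve c / r**a reaches the hapax level 1."""
    return c ** (1 / a)

# Rows B 01 and H 01 of Table 1: (V, a, c, HL)
rows = {
    "B 01": (400, 0.6850, 41.8602, 298),
    "H 01": (1079, 1.2268, 214.2708, 844),
}
for text_id, (V, a, c, HL) in rows.items():
    A = analytism(V, HL, a, c)
    r0 = crossing_point(a, c)
    print(text_id, round(A, 4), round(analytism_approx(V, a, c), 4), round(r0, 1))
```

Since $A = \bigl(r_0 / (V - HL/2)\bigr)^{a}$, A exceeds 1 exactly when the crossing point lies to the right of the hapax midpoint, which is the rightward displacement described for analytic languages. The approximation depends on how well an individual text follows the regression line, so it need not agree closely with the exact value for every text.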

An independent measure of analytism/synthetism, using e.g. the Greenberg–Krupa indices (Greenberg, 1960; Krupa, 1965), could tell us whether the purely morphological definition of this property corresponds to our purely textual measure, which can be obtained mechanically without analysing the morphological structure of each word of a text.



Table 2. Mean analytism indicator A of 20 languages.

Language     Mean A   Number of texts
Hungarian    0.2012   5
Czech        0.7223   5
Latin        0.7982   6
Romanian     0.8931   6
German       0.9372   7
Slovenian    0.9418   5
Kannada      1.0378   5
Russian      1.0453   5
Bulgarian    1.0495   5
Indonesian   1.1438   5
Marathi      1.2302   5
Italian      1.2787   5
Lakota       1.2853   4
Tagalog      1.3913   3
English      1.4514   7
Marquesan    1.8108   3
Rarotongan   1.9779   5
Samoan       2.1465   5
Maori        2.1861   5
Hawaiian     5.0815   4
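The language means are plain arithmetic averages of the per-text A-values. As a small illustration, the sketch below recomputes three of them from the values listed in Table 1; the dictionary name is hypothetical and only three of the 20 languages are included.

```python
from statistics import mean

# A-values copied from Table 1 for three of the 20 languages.
table1_A = {
    "Hungarian": [0.0749, 0.0824, 0.0950, 0.2753, 0.4784],
    "Russian": [0.9437, 0.8251, 1.0851, 1.1661, 1.2063],
    "Hawaiian": [2.8821, 5.3384, 6.2199, 5.8855],
}

# Sort from most synthetic (small A) to most analytic (large A).
for language, values in sorted(table1_A.items(), key=lambda kv: mean(kv[1])):
    print(f"{language:10s} {mean(values):.4f} ({len(values)} texts)")
```

Up to rounding, the results agree with the corresponding entries of Table 2.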


Fig. 2. Illustrating the linear dependence between the hapax legomena length HL and the vocabulary V.
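A linear dependence of this kind can be estimated by ordinary least squares. The sketch below is a hedged illustration that uses only the five Bulgarian texts from Table 1, so its coefficients are not expected to match those reported for all 100 texts; the function name is hypothetical.

```python
def least_squares_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (V, HL) pairs of the Bulgarian texts B 01 to B 05 in Table 1.
V_values = [400, 201, 285, 286, 238]
HL_values = [298, 153, 212, 222, 187]
slope, intercept = least_squares_line(V_values, HL_values)
print(f"HL = {slope:.4f} * V + {intercept:.4f}")
```

Passing all 100 (V, HL) pairs of Table 1 to the same function would be the natural way to re-derive the reported regression line.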



Needless to say, one can perform the same procedure using the Zipf–Mandelbrot function or a number of other hyperbolic functions. We adhere to Occam's razor and restrict ourselves to the original Zipf function which, by yielding this morphological distinction without touching morphology, gains an important secondary corroboration.
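The whole procedure, from a rank-frequency list to the indicator A, can be sketched as follows. The paper speaks only of "iterative fitting"; the undamped Gauss-Newton least-squares iteration used here is an assumption about what that could look like, not the authors' documented method, and it may fail to converge on poorly behaved data. The function name and the synthetic frequency list are hypothetical.

```python
import math

def fit_zipf(freqs, max_iter=100, tol=1e-10):
    """Least-squares fit of f(r) = c / r**a by Gauss-Newton iteration.

    freqs[i] is the (positive) frequency at rank i + 1. Returns (a, c).
    """
    ranks = list(range(1, len(freqs) + 1))
    logs = [math.log(r) for r in ranks]
    # Starting values: straight-line fit in log-log coordinates.
    ys = [math.log(f) for f in freqs]
    mx, my = sum(logs) / len(logs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(logs, ys))
             / sum((x - mx) ** 2 for x in logs))
    a, c = -slope, math.exp(my - slope * mx)
    for _ in range(max_iter):
        s_cc = s_ca = s_aa = g_c = g_a = 0.0
        for r, log_r, f in zip(ranks, logs, freqs):
            p = r ** (-a)
            res = f - c * p            # residual at this rank
            j_c = p                    # derivative of the model w.r.t. c
            j_a = -c * p * log_r       # derivative of the model w.r.t. a
            s_cc += j_c * j_c
            s_ca += j_c * j_a
            s_aa += j_a * j_a
            g_c += j_c * res
            g_a += j_a * res
        det = s_cc * s_aa - s_ca * s_ca
        if det == 0.0:
            break
        # Solve the 2x2 normal equations for the parameter update.
        dc = (s_aa * g_c - s_ca * g_a) / det
        da = (s_cc * g_a - s_ca * g_c) / det
        c += dc
        a += da
        if abs(dc) + abs(da) < tol:
            break
    return a, c

# Synthetic, hypothetical rank-frequency data (not from any real text).
freqs = [max(1, round(60 / r ** 0.9)) for r in range(1, 101)]
a, c = fit_zipf(freqs)
V = len(freqs)
HL = sum(1 for f in freqs if f == 1)
A = c / (V - HL / 2) ** a
print(f"a = {a:.4f}, c = {c:.4f}, V = {V}, HL = {HL}, A = {A:.4f}")
```

Replacing `fit_zipf` with a fit of the Zipf–Mandelbrot function would leave the rest of the computation unchanged, with the fitted value at the hapax midpoint again serving as the indicator.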

References

Greenberg, J. H. (1960). A quantitative approach to the morphological typology of languages. International Journal of American Linguistics, 26, 178–194.
Köhler, R. (1986). Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Köhler, R. (ed.) (2005). Synergetic linguistics. In R. Köhler, G. Altmann & R. G. Piotrowski (Eds), Quantitative Linguistics. An International Handbook (pp. 760–775). Berlin: de Gruyter.
Krupa, V. (1965). On quantification of typology. Linguistics, 12, 31–36.
Popescu, I.-I. (2007). Text ranking by the weight of highly frequent words. In P. Grzybek & R. Köhler (Eds), Exact Methods in the Study of Language and Text (pp. 555–565). Berlin/New York: Mouton de Gruyter.
Popescu, I.-I., Vidya, M. N., Uhlířová, L., Pustet, R., Mehler, A., Mačutek, J., Krupa, V., Köhler, R., Jayaram, B. D., Grzybek, P., & Altmann, G. (2008). Word Frequency Studies. Berlin: Mouton de Gruyter (in press).
Skalička, V. (2004–2006). Souborné dílo I–III (edited by F. Čermák et al.). Praha: Nakladatelství Karolinum. [The work contains the Czech translations.]
