
Letter frequency

The frequency of letters in text has often been studied for use in cryptanalysis, and frequency analysis in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. Linotype machines assumed the letter order, from most to least common, to be etaoin shrdlu cmfwyp vbgkjq xz, based on the experience and custom of manual compositors.

Herbert S. Zim's Codes and Secret Writing lists the most common letter pairs as TH HE AN RE ER IN ON AT ND ST ES EN OF TE ED OR TI HI AS TO, and the most common doubled letters as LL EE SS OO TT FF RR NN PP CC.[1]

The top twelve letters comprise about 80% of the total usage. The top eight letters comprise about 65% of the total usage. Letter frequency as a function of rank can be fitted well by several rank functions, with the two-parameter Cocho/Beta rank function being the best.[2] Another rank function with no adjustable free parameter also fits the letter frequency distribution reasonably well[3] (the same function has been used to fit the amino acid frequency in protein sequences[4]). A spy using the VIC cipher or some other cipher based on a straddling checkerboard typically uses a mnemonic such as "a sin to err" (dropping the second "r") to remember the top eight characters.
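To illustrate why the eight most frequent letters matter for a straddling checkerboard, here is a minimal sketch in Python. It assumes the mnemonic letters a s i n t o e r fill the top row, that the two blank columns are 2 and 6, and that the remaining letters are placed alphabetically with two filler symbols ("." and "/"). The layout and the names build_checkerboard, encode and decode are illustrative assumptions, not the actual VIC cipher procedure.

```python
import string


def build_checkerboard(top_letters="asintoer", blanks=(2, 6)):
    """Map each symbol to a digit string (hypothetical layout)."""
    # Columns 0-9; the two blank columns of the top row double as the
    # row labels of the two lower rows.
    columns = [d for d in range(10) if d not in blanks]
    table = {}
    for digit, letter in zip(columns, top_letters):
        table[letter] = str(digit)  # frequent letters cost one digit
    # Remaining 18 letters plus two filler symbols fill 2 rows of 10 cells.
    rest = [c for c in string.ascii_lowercase if c not in top_letters]
    rest += [".", "/"]
    for i, symbol in enumerate(rest):
        row = blanks[i // 10]
        table[symbol] = f"{row}{i % 10}"  # rare letters cost two digits
    return table


def encode(text, table):
    """Encode text, silently dropping symbols that are not in the table."""
    return "".join(table[c] for c in text.lower() if c in table)


def decode(digits, table, blanks=(2, 6)):
    """Invert encode(); a digit equal to a row label starts a 2-digit code."""
    reverse = {code: symbol for symbol, code in table.items()}
    row_digits = {str(b) for b in blanks}
    out, i = [], 0
    while i < len(digits):
        width = 2 if digits[i] in row_digits else 1
        out.append(reverse[digits[i:i + width]])
        i += width
    return "".join(out)
```

Because the most frequent letters take a single digit and no single-digit code starts with a row label, the digit stream is shorter on average than a fixed two-digit code and can still be split unambiguously.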

Likewise, Modern International Morse code encodes the most frequent letters with the shortest symbols; arranging the Morse alphabet into groups of letters that require equal amounts of time to transmit, and then sorting these groups in increasing order, yields e it san hurdm wgvlfbk opxcz jyq. Similar ideas are used in modern data-compression techniques such as Huffman coding.
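The Morse grouping can be checked with a short Python sketch. It assumes the conventional timing in which a dot lasts one unit, a dash three units, and the gap between elements of a letter one unit; the names MORSE, duration and groups_by_duration are illustrative. Under this timing, J (.---) takes 13 units and so falls in the longest group together with Y and Q.

```python
from collections import defaultdict

# International Morse code for the 26 basic Latin letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}


def duration(code):
    """Transmission time in dot units (dot = 1, dash = 3, inner gap = 1)."""
    units = sum(1 if element == "." else 3 for element in code)
    return units + (len(code) - 1)


def groups_by_duration():
    """Group letters by transmission time, shortest group first."""
    groups = defaultdict(list)
    for letter, code in MORSE.items():
        groups[duration(code)].append(letter)
    return {d: sorted(groups[d]) for d in sorted(groups)}


if __name__ == "__main__":
    for units, letters in groups_by_duration().items():
        print(units, "".join(letters))
```

The letters inside each group are printed alphabetically here, so the order within a group differs from the conventional spelling even though the group membership is the same.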

Letter frequency was also used in other telegraph systems, for instance by Donald Murray in the Murray Code.

1 Introduction

Letter frequencies, like word frequencies, tend to vary, both by writer and by subject. One cannot write an essay about x-rays without using frequent Xs, and the essay will have an idiosyncratic letter frequency if the essay is about the frequent use of x-rays to treat zebras in Qatar. Different authors have habits which can be reflected in their use of letters. Hemingway's writing style, for example, is visibly different from Faulkner's. Letter, bigram, trigram, and word frequencies, word length, and sentence length can be calculated for specific authors, and used to prove or disprove authorship of texts, even for authors whose styles are not so divergent.

Accurate average letter frequencies can only be gleaned by analyzing a large amount of representative text. With the availability of modern computing and collections of large text corpora, such calculations are easily made. Examples can be drawn from a variety of sources (press reporting, religious texts, scientific texts and general fiction) and there are differences especially for general fiction with the position of 'h' and 'i', with H becoming more common.

Herbert S. Zim, in his classic introductory cryptography text Codes and Secret Writing, gives the English letter frequency sequence as ETAON RISHD LFCMU GYPWB VKJXQ Z, together with lists of the most common letter pairs and doubled letters.[1]

The use of letter frequencies and frequency analysis plays a fundamental role in cryptograms and several word puzzle games, including Hangman, Scrabble and the television game show Wheel of Fortune. One of the earliest descriptions in classical literature of applying the knowledge of English letter frequency to solving a cryptogram is found in E.A. Poe's famous story The Gold-Bug, where the method is successfully applied to decipher a message instructing on the whereabouts of a treasure hidden by Captain Kidd.[5]

Letter frequencies had a strong effect on the design of some keyboard layouts. The most-frequent letters are on the bottom row of the Blickensderfer typewriter, and the home row of the Dvorak Simplified Keyboard.

2 Relative frequencies of letters in the English language

Analysis of entries in the Concise Oxford dictionary is published by the compilers.[6] The table below is taken from Pavel Mička's website, which cites Robert Lewand's Cryptological Mathematics.[7]

This table differs slightly from others, such as Cornell University Math Explorers Project, which produced a table after measuring 40,000 words.[8]

In English, the space is slightly more frequent than the top letter (e)[9] and the non-alphabetic characters (digits, punctuation, etc.) collectively occupy the fourth position, between t and a.[10]
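As a rough illustration of how such relative frequencies can be computed from a text sample, the following Python sketch counts the letters a to z and, optionally, the space character. The function name relative_frequencies and the include_space flag are assumptions made for this example; a real study would run it over a large, representative corpus rather than a single sentence.

```python
from collections import Counter


def relative_frequencies(text, include_space=False):
    """Return {symbol: share of counted symbols}, most frequent first."""
    symbols = [
        c for c in text.lower()
        if "a" <= c <= "z" or (include_space and c == " ")
    ]
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() yields symbols in descending order of count.
    return {symbol: n / total for symbol, n in counts.most_common()}


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    for symbol, share in relative_frequencies(sample, include_space=True).items():
        print(repr(symbol), round(share, 3))
```

Digits and punctuation are ignored in this sketch; counting them as a single class would be a small extension.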


Often the frequency distribution of the first digit in each datum is significantly different from the overall frequency of all the digits in a set of numeric data; see Benford's law for details.
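For reference, Benford's law predicts that a leading digit d occurs with probability log10(1 + 1/d). The Python sketch below compares that expectation with observed leading digits; the function names and the sample handling are illustrative assumptions.

```python
import math
from collections import Counter


def benford_expected(digit):
    """Expected share of leading digit `digit` (1-9) under Benford's law."""
    return math.log10(1 + 1 / digit)


def leading_digit_frequencies(values):
    """Observed share of each leading digit 1-9 among non-zero values."""
    # Strip the sign, leading zeros and the decimal point, then take
    # the first remaining character as the leading digit.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts[d] / total for d in range(1, 10)} if total else {}


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, round(benford_expected(d), 3))
```

Under this formula the digit 1 is expected to lead about 30% of the time, far more than the 11% a uniform distribution over nine digits would give.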

[Figure: Relative frequencies of letters in text, with the letters in alphabetical order (a to z).]

[Figure: Relative frequencies ordered by frequency (e t a o i n s h r d l c u m w f g y p b v k j x q z).]

3 Relative frequencies of the first letters of a word in the English language

The frequency of the first letters of words or names is helpful in pre-assigning space in physical files and indexes.[11] Given 26 filing cabinet drawers, rather than a 1:1 assignment of one drawer to one letter of the alphabet, it is often useful to use a more equal-frequency-letter code by assigning several low-frequency letters to the same drawer (often one drawer is labeled VWXYZ), and to split up the most-frequent initial letters (S, A, and C) into several drawers (often six drawers: Aa-An, Ao-Az, Ca-Cj, Ck-Cz, Sa-Si, Sj-Sz). The same system is used in some multi-volume works such as some encyclopedias. The first letters of English words, from most to least common, are s a c m p r t b f g d h i n e l o w u v j k q y z x.[11]

Both the overall letter distribution and the word-initial letter distribution approximately match the Zipf distribution and even more closely match the Yule distribution.[12]

Analysis of a subset of Project Gutenberg text shows the following frequencies of letters at the starts of words:[13]

4 Relative frequencies of letters in other languages

* See Dotted and dotless I

The figure below illustrates the frequency distributions of the 26 most common Latin letters across some languages.

[Figure: Letter frequencies in 14 languages.]

Based on these tables, the 'etaoin shrdlu'-equivalent results for each language are as follows:

French: 'esait nruol' (Indo-European: Romance; traditionally, 'esartinulop' is used, in part for its ease of pronunciation[27])
Spanish: 'eaosr nidlt' (Indo-European: Romance)
Portuguese: 'aeosr idmtn' (Indo-European: Romance)
Italian: 'eaion lrtsc' (Indo-European: Romance)
Esperanto: 'aieon lsrtk' (artificial language influenced by Indo-European languages, mostly Romance and Germanic)
German: 'ensri atdhu' (Indo-European: Germanic)
Swedish: 'eanrt sildo' (Indo-European: Germanic)
Turkish: 'aeinr lkdm' (Altaic: Turkic)
Dutch: 'enati rodsl' (Indo-European: Germanic)[23]
Polish: 'aieon wrszc' (Indo-European: Slavic)
Danish: 'ernta idslo' (Indo-European: Germanic)
Icelandic: 'arnie stul' (Indo-European: Germanic)
Finnish: 'ainte slouk' (Uralic: Finnic)
Czech: 'aeoni tvsrl' (Indo-European: Slavic)

All these languages use a basically similar 25+ character alphabet.
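An 'etaoin shrdlu'-style string for any language can be derived from a text sample by ranking its letters. The Python sketch below is one way to do this; the function name frequency_order, the default of ten letters in groups of five, and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def frequency_order(text, top=10, group=5):
    """Return the `top` most frequent letters, spaced in blocks of `group`."""
    # str.isalpha() also accepts accented and other non-ASCII letters.
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    # Descending count; ties are broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))[:top]
    joined = "".join(ranked)
    return " ".join(joined[i:i + group] for i in range(0, len(joined), group))


if __name__ == "__main__":
    print(frequency_order("the quick brown fox jumps over the lazy dog", top=6))
```

On a short sample the ranking is dominated by chance; the strings listed above come from much larger bodies of text.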

See also

Corpus linguistics
ETAOIN SHRDLU
RSTLNE (Wheel of Fortune)
Frequency analysis (cryptanalysis)
Linotype machine
Most common words in English
Scrabble
Arabic Letter Frequency

References

[1] Zim, Herbert Spencer (1961). Codes & Secret Writing: Authorized Abridgement. Scholastic Book Services. OCLC 317853773.
[2] Li, Wentian; Miramontes, Pedro (2011). "Fitting ranked English and Spanish letter frequency distribution in US and Mexican presidential speeches". Journal of Quantitative Linguistics 18 (4): 359. doi:10.1080/09296174.2011.608606.
[3] Gusein-Zade, S.M. (1988). "Frequency distribution of letters in the Russian language". Probl. Peredachi Inf. 24 (4): 102-7.
[4] Gamow, George; Ycas, Martynas (1955). "Statistical correlation of protein and ribonucleic acid composition". Proc. Natl. Acad. Sci. 41 (12): 1011-19. doi:10.1073/pnas.41.12.1011. PMC 528190.
[5] Poe, Edgar Allan. The works of Edgar Allan Poe in five volumes. Project Gutenberg.
[6] "What is the frequency of the letters of the alphabet in English?". Oxford Dictionary. Oxford University Press. Retrieved 29 December 2012.
[7] Mička, Pavel. "Letter frequency (English)". Algoritmy.net.
[8] http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
[9] Statistical Distributions of English Text
[10] Lee, E. Stewart. Essays about Computer Security (PDF). University of Cambridge Computer Laboratory. p. 181.
[11] Herbert Marvin Ohlman. "Subject-Word Letter Frequencies with Applications to Superimposed Coding". Proceedings of the International Conference on Scientific Information (1959).
[12] Hemlata Pande and H. S. Dhami. Mathematical Modelling of Occurrence of Letters and Words Initials in Texts of Hindi Language.
[13] Calculated from Project Gutenberg Selections available from the NLTK Corpora
[14] "CorpusDeThomasTemp". Retrieved 2007-06-15.
[15] Beutelspacher, Albrecht (2005). Kryptologie (7 ed.). Wiesbaden: Vieweg. p. 10. ISBN 3-8348-0014-7.
[16] Pratt, Fletcher (1942). Secret and Urgent: the Story of Codes and Ciphers. Garden City, N.Y.: Blue Ribbon Books. pp. 254-5. OCLC 795065.
[17] "Frequência da ocorrência de letras no Português". Retrieved 2009-06-16.
[18] "La Oftecoj de la Esperantaj Literoj". Retrieved 2007-09-14.
[19] Singh, Simon; Galli, Stefano (1999). Codici e Segreti (in Italian). Milano: Rizzoli. ISBN 978-8-817-86213-4. OCLC 535461359.
[20] Serengil, S.I.; Akin, M. "Attacking Turkish Texts Encrypted by Homophonic Cipher". Proceedings of the 10th WSEAS International Conference on Electronics, Hardware, Wireless and Optical Communications, pp. 123-126, Cambridge, UK, February 20-22, 2011.
[21] Practical Cryptography. Retrieved 2013-10-30.
[22] Wstęp do kryptologii, counting [space] 17.2%, [dot point] 0.9%, [comma] 0.9% and [semicolon] 0.5%
[23] "Letterfrequenties". Genootschap OnzeTaal. Retrieved 2009-05-17.
[24] Practical Cryptography. Retrieved 2013-10-24.
[25] Practical Cryptography. Retrieved 2013-10-24.
[26] Practical Cryptography. Retrieved 2013-10-24.
[27] Perec, Georges; Alphabets; Éditions Galilée, 1976

Notes

Some useful tables for single-letter, digram, trigram, tetragram, and pentagram frequencies are based on 20,000 words and take into account word-length and letter-position combinations for words 3 to 7 letters in length. The references are as follows:
1. Mayzner, M.S.; Tresselt, M.E. (1965). Tables of single-letter and digram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements 1 (2): 13-32. OCLC 639975358.
2. Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. (1965). Tables of trigram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements 1 (3): 33-78.
3. Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. (1965). Tables of tetragram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements 1 (4): 79-143.
4. Mayzner, M.S.; Tresselt, M.E.; Wolin, B.R. (1965). Tables of pentagram frequency counts for various word-length and letter-position combinations. Psychonomic Monograph Supplements 1 (5): 144-190.

External links
A site with content of Cryptographical Mathematics by Robert Edward Lewand
Some examples of letter frequency rankings in some common languages
Java-Application for building letter frequencies out of a text file
JavaScript Heatmap Visualization showing letter frequencies of texts on different keyboard layouts
An updated version of Mayzner's work using Google books Ngrams data set by Peter Norvig


8 Text and image sources, contributors, and licenses

8.1 Text

Letter frequency Source: http://en.wikipedia.org/wiki/Letter%20frequency?oldid=636806871 Contributors: Frecklefoot, Eliasen,


ArnoLagrange, Ww, Taxman, Topbanana, AnonMoos, Donarreiskoer, Chealer, AlainV, Tomchiukc, R3m0t, Lowellian, Auric, Smb1001,
Seth Ilys, DavidCary, BenFrantzDale, Lee J Haywood, Timpo, Dissident, Gus Polly, Frencheigh, Matt Crypto, Urhixidur, Abdull, Thorwald, CALR, Discospinster, LoganCale, Andrejj, EmilJ, Nandhp, BrokenSegue, Water Bottle, RandomEE2, Stephan Leeds, Jdege, Richwales, Winterdragon, Tabletop, Eyreland, Geenius at Wrok, Zbxgscqf, Wars, Zarano, DevastatorIIC, Visor, Peter Grey, YurikBot, Jimp,
Jojo-schmitz, RussBot, Hellbus, Thesloth, Uni4dfx, Rufua, ReCover, LiquidFire, Tim Parenti, Livitup, GraemeL, Ordinary Person, Cmglee, That Guy, From That Show!, SmackBot, McGeddon, Speight, Kintetsubualo, Gilliam, LinguistAtLarge, RDBrown, Iwaterpolo,
Trekphiler, Argyriou, Kukini, JackLumber, Dejudicibus, RomanSpa, Sharcho, Novangelis, DagErlingSmrgrav, FakeTango, Onepairofpants, Mwhitlock, Gogo Dodo, Yonat, Gioto, Joe Schmedley, BranER, Arch dude, Moralist, JPDaigle, JMyrleFuller, Ariel., Tgeairn,
Leon math, AstroHurricane001, SimpsonDG, Pdcook, Prometheusg, Jshrubb, Philip Trueman, Melsaran, Finnrind, RubySS, Ori, Kleptog,
MinorContributor, Rubo77, Mangledorf, ObfuscatePenguin, Foxj, Quinxorin, S0mbre, DumZiBoT, Addbot, Wli625, Tide rolls, Jarble,
VengeancePrime, Doctorhook, AnomieBOT, Salisbury-99, Jim1138, Citation bot, Xqbot, Thehelpfulbot, Tktru, Coroboy, Ywmpq205,
Dramartistic, Mickm720, Pinethicket, A8UDI, Jschnur, Wolfehhgg, January, Jesse V., RjwilmsiBot, NerdyScienceDude, Mark mayzner,
AlanSiegrist, Indeluxe, BestKH, Alexlatham96, ClueBot NG, Tideat, Pejno Simono, Masssly, Widr, Helpful Pixie Bot, Kyoakoa, Glacialfox, Dexbot, SteenthIWbot, C5st4wr6ch, FallingGravity, Phinumu, Mjdav1, Robdark00, Vieque, Pokemonmaster34 and Anonymous: 128

8.2 Images

File:Ambox_important.svg Source: http://upload.wikimedia.org/wikipedia/commons/b/b4/Ambox_important.svg License: Public domain Contributors: Own work, based off of Image:Ambox scales.svg Original artist: Dsmurat
File:English_letter_frequency_(alphabetic).svg Source: http://upload.wikimedia.org/wikipedia/commons/d/d5/English_letter_frequency_%28alphabetic%29.svg License: Public domain Contributors: Own work; en:Letter frequency. Original artist: Nandhp
File:English_letter_frequency_(frequency).svg Source: http://upload.wikimedia.org/wikipedia/commons/b/b0/English_letter_frequency_%28frequency%29.svg License: Public domain Contributors: Own work; en:Letter frequency. Original artist: Nandhp
File:Text_document_with_red_question_mark.svg Source: http://upload.wikimedia.org/wikipedia/commons/a/a4/Text_document_with_red_question_mark.svg License: Public domain Contributors: Created by bdesham with Inkscape; based upon Text-x-generic.svg from the Tango project. Original artist: Benjamin D. Esham (bdesham)

8.3 Content license

Creative Commons Attribution-Share Alike 3.0
