
Lexicon structure

IMPACT

EE2

LEXICON STRUCTURE
Deliverable number

DStart : Month 1
EE2.1
Lexicon structure
Internal
10
5
12
3
INL
DN
LM ON
B
U
B
6
2
2

Deliverable name
Internal/external
Participant number
Participant short name
Estimated personmonths per participant
for this deliverable
Dissemination level1

Due Date: Month 6


Actual delivery: Month 7

7
UGOE

CO

Revisions
Version

Status

Date

1.1
1.2
1.3
1.3.1
1.3.2

Changes

Corrected error in Appendix E


Added structure for multiword NEs
Minor adaptations to the structure for NEs
Added NE part information (sect. 3.14)

Approvals

This document requires the following approvals


Version

Date of
approval

Name

Role in project

signature

Max Kaiser
Hildelies Balk

Sub-Project Leader
Coordinator

Distribution

This document was sent to:


Version

Date of sending Name

Role in project

July 1, 2008 Klaus Schulz


Uli Reffle
Barbara Pfeiffer

Participant in EE2
Participant in EE2
Participant in EE2

PU Public; PP Restricted to other programme participants (including the Commission Services); RE restricted
to a group specified by the consortium (including the Commission Services); CO Confidential only for
members of the consortium (including the Commission Services).


IMPACT Lexicon database structure

Lexicon structure
1. Introduction
2. Information attached to word forms
2.1. Database information for unlabeled word forms
2.2. Part of speech
2.3. Lemma
2.4. Paradigmatic relation between word form and lemma
2.5. Attestation
2.5.1. Attestations on the token level
2.5.2. Attestations on the text level
2.5.3. Verifying non-analyzed word forms
2.6. Derivations
2.7. Documents, corpora and workflow management
3. Information attached to lemmata
3.1. Lemma id
3.2. Modern lemma form
3.3. Lexical part of speech
3.4. Gender and other possible grammatical features
3.5. Named entity label
3.6. Inflectional class(es)
3.7. Language
3.8. Gloss
3.9. Multiword expressions
3.9.1. Multiword named entity lemmata
3.10. Morphological analysis
3.11. Unresolved ambiguity in lemma assignment
3.11.1. Portmanteau lemmata
3.11.2. Transcategorisation (conversion), sublemma and main lemma
3.12. Adding custom information on the lemma level
3.13. Additional structure for related entries in NE lexica
3.14. Named entity parts
4. Information on the document level
5. Auxiliary information for word form synthesis and analysis
5.1. Data to support the modelling of orthographic variation
5.2. Information about paradigmatic expansion
5.3. Database information for stems
6. Lexical source
6.1. Ambiguity information
7. Converting the database into LMF
7.1. Introduction
7.2. Mappings
7.2.1. On notation
7.2.2. Unlabelled word forms
7.2.3. Inflection (labelled word forms)
7.2.4. Composition
7.2.5. Spelling
7.2.6. Clitics
7.2.7. Portmanteau
7.2.8. Transcategorization
7.2.9. Multiword expressions
7.2.10. Multiword named entities
7.2.11. Attestations
7.3. Converting relational data to XML
8. References
Appendix A: Database schema
Appendix B: Filters for the export of relevant subsets from the lexicon
Appendix C: Script for converting relational data to LMF (XML): relDB2xml.pl
Appendix D: Structure Definition for the Dutch Lexicon


1. Introduction

IMPACT lexica are computational lexica which will be used in two ways: in OCR, to enhance word recognition, and in Enrichment, to enable variation-independent searches. The core database objects are word forms, lemmata and documents [2]. All other objects define some kind of relation between these.
In order to enable the OCR's spell-checking mechanism to assess the plausibility of the occurrence of a word in a certain text, it is not sufficient to convert existing lexica and dictionaries into a large word list. We also need to
1. Keep track of the sources from which we took the words (Lexical Source, cf. section 6)
2. List the actually encountered words in the language and record occurrences in actual texts, with frequency information (attestation, cf. section 2.5)
3. Record in what kind of texts these words occur (document properties, cf. section 4)
It is impossible to extract all possible word forms from the limited amount of available reliably transcribed historical text. Hence, we need mechanisms to extend the lexicon and to be able to assess the plausibility of hypothetical words without previous attestations, i.e. words we have not seen before. Supporting data for these mechanisms have to be present in the database, such as:
1. Unknown inflected forms of lemmata which already are in the database can be dealt with by means of the automatic expansion from the lemma to the full paradigm of word forms (paradigmatic expansion; the database information for this purpose is discussed in section 5)
2. New spellings of known words can be dealt with by developing a good model of the spelling conventions of the period at hand (cf. section 5.1 for the storage of orthographic variant patterns)
3. Previously unseen compounds can be dealt with by means of a good model of word formation (cf. section 3.10 for the associated database information)
In order to carry out word searches without having to worry about inflection and variation of word forms, Enrichment will use modern lemmata as variation-independent retrieval keys for the full spectrum of inflectional and orthographical variation.

The database structure is most conveniently discussed by dividing the information into a few main blocks:
1. Information attached to word forms, either unlabeled (i.e. not yet lemmatized or labeled with Part of Speech) or labeled (i.e. with lemma and possibly PoS), cf. section 2.
2. Information attached to lemmata (section 3)
3. Information about documents, parts of documents, document collections (section 4)
4. Auxiliary information needed for expansion and for plausibility-of-new-words prediction (section 5)
5. Lexical Source (section 6)

[2] A document is understood here as a sequence of words, together with the document metadata (section 4).


Status of information: external or internal, optional or mandatory
Part of the lexicon database information is intended to be delivered to other work packages; other information is present because it is useful in the lexicon building process.
We specify which information is really a deliverable part of the EE3 output.

A survey of the database fields can be found in the Database scheme (Appendix A). Appendix B briefly touches on the lexicon API in development. An XML interchange format is proposed in Appendix C.


2. Information attached to word forms

There are two distinct objects in the database on the word form level: unlabeled word forms (i.e. without linguistic information attached to them) and labeled ones (i.e. labeled with lemma and part of speech).

2.1. Database information for unlabeled word forms

Unlabeled word forms may be used in OCR. They only need to be attested in texts. Attestation information is the only kind of information we link to unlabeled forms (cf. section 2.5, attestation).

Status of attestation information for unlabeled word forms: mandatory, external (use in TR5)

2.2. Part of speech

Each labeled word form is linked to one or several lemmata and assigned a Part of Speech (part_of_speech) label. This grammatical tag [3] is more specific than the one assigned to the lemma (cf. 3.3), as it may include information about inflection, tense, number.
It is not yet clear how much detail needs to be included in IMPACT lexica. We might accept a certain level of underspecification, because clearly, the distinction between formally identical positions in the paradigm is beyond the scope of IMPACT. So instead of tagging loopt as a second or third person singular (which means a lot of effort has to be put into disambiguation), we may mark it simply as a finite verb ending with -t. [4]

Status of this information: Part of speech is not externally required, but hardly dispensable, as the relation between a lemma and its inflected forms cannot be defined without it.

2.3. Lemma
Field content: the ID of the relevant lemma object.

Status: mandatory

2.4. Paradigmatic relation between word form and lemma

It is essential that the lexicon makes explicit the paradigmatic relations between lemmata and their word forms.

[3] To refer to this grammatical tag as Part of Speech is an abuse of terminology.
[4] Cf. Bie (2004) for a discussion of this distinction between a morphological word (finite verb ending with -t) and a morphosyntactic word (third person singular).


Inflected forms
This information is not about the formal structure of the inflected form, but merely serves to interlink lemmata and inflected forms. This link is stored in objects of type AnalyzedWordform, which have a PoS property and link to the lemma on the one hand and to the word form on the other hand. See Figure 1 for a representation of the database structure and Table 1 for an example.

Figure 1: database model [5] for analyzed wordforms [6]

Table 1.
Table lemmata
lemma_id   modern_lemma   lemma_part_of_speech
L1         Marcher        VRB

Table analyzed_wordforms
analyzed_wordform_id   part_of_speech    lemma_id   wordform_id
A1                     VRB(fin,-erons)   L1         W00001

Table wordforms
wordform_id   wordform
W00001        marcherons
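
The linkage in Table 1 can be tried out with a small relational sketch. The following is a minimal illustration in Python using the standard sqlite3 module; it is not the project's actual schema (which is given in Appendix A). The column types, the constraints and the helper name wordforms_for_lemma are assumptions made for this example only.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns shown in Table 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemmata (
    lemma_id TEXT PRIMARY KEY,
    modern_lemma TEXT NOT NULL,
    lemma_part_of_speech TEXT NOT NULL
);
CREATE TABLE wordforms (
    wordform_id TEXT PRIMARY KEY,
    wordform TEXT NOT NULL
);
CREATE TABLE analyzed_wordforms (
    analyzed_wordform_id TEXT PRIMARY KEY,
    part_of_speech TEXT NOT NULL,
    lemma_id TEXT NOT NULL REFERENCES lemmata(lemma_id),
    wordform_id TEXT NOT NULL REFERENCES wordforms(wordform_id)
);
INSERT INTO lemmata VALUES ('L1', 'Marcher', 'VRB');
INSERT INTO wordforms VALUES ('W00001', 'marcherons');
INSERT INTO analyzed_wordforms VALUES ('A1', 'VRB(fin,-erons)', 'L1', 'W00001');
""")


def wordforms_for_lemma(modern_lemma):
    """Return (wordform, part_of_speech) pairs linked to a modern lemma."""
    rows = conn.execute(
        """
        SELECT w.wordform, a.part_of_speech
        FROM lemmata AS l
        JOIN analyzed_wordforms AS a ON a.lemma_id = l.lemma_id
        JOIN wordforms AS w ON w.wordform_id = a.wordform_id
        WHERE l.modern_lemma = ?
        ORDER BY w.wordform
        """,
        (modern_lemma,),
    )
    return rows.fetchall()


print(wordforms_for_lemma("Marcher"))
```

The join runs from the lemma through analyzed_wordforms to the word form, which is the direction Enrichment would need when a modern lemma is used as a retrieval key.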

[5] The diagrams in this document are in Crow's Foot notation. They have been generated from the database by MySQL Workbench 5.0.22. As a result, all relations are annotated as being of the 1:m type with both referencing and referenced table marked as mandatory. This means that some of the logical constraints are not accurately reflected in the diagrams.
[6] The structure changed with respect to the previous version. Instead of just a flat sequence, hierarchy is now possible. It is unlikely that we will use this very much, but we had to incorporate the possibility of having at least two levels because of clitic combinations occurring inside multiword expressions.
[7] We use the term clitic combination to refer to word forms like Dutch neemtse, which is a combination of a finite verb form (neemt = German nimmt) and an unstressed, phonetically reduced pronoun (ze). This phenomenon is much more frequent in historical (and dialectal) Dutch than in German. Clitics may be attached to other word classes like conjunctions, and more than one clitic can be attached to a single word (cf. indienmense ~ German indem man sie).

Clitic combinations [7]


Clitic combinations will be lemmatized by assigning an ordered sequence of lemmata. A word form like sboexs (= des Buches) will thus be lemmatized (HET, BOEK). In the database, the ordering will be reflected by assigning a sequence number to the lemma parts (see Figure 2 and Table 2). Each part will have its own part of speech. Thus, the complete Lemma-PoS assignment for sboexs will be

sboexs ~ {(1, HET/DAT, PRN), (2, BOEK, NOU(infl=s))}.

The sequence numbers are included to distinguish between words like kzag and zagk.
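
The role of the sequence numbers can be illustrated with a short Python sketch. The class LemmaPart, the function combination_key, and the lemma and PoS labels used for kzag and zagk are hypothetical; they only mirror the (number, lemma, PoS) triples of the notation above.

```python
from typing import NamedTuple


class LemmaPart(NamedTuple):
    part_number: int  # position of the part within the clitic combination
    lemma: str
    pos: str


def combination_key(parts):
    """Order the parts by sequence number and return the lemma sequence."""
    return tuple(p.lemma for p in sorted(parts, key=lambda p: p.part_number))


# Same two lemmata, opposite order (labels are assumed for illustration).
kzag = [LemmaPart(1, "IK", "PRN"), LemmaPart(2, "ZIEN", "VRB")]
zagk = [LemmaPart(2, "IK", "PRN"), LemmaPart(1, "ZIEN", "VRB")]

print(combination_key(kzag))
print(combination_key(zagk))
```

Without the part numbers both word forms would be described by the same unordered set of lemmata; with them, the two keys differ.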

Comment
This treatment of clitic combinations serves the following purposes: the lemma parts can be used as search keys, while the combination of all parts serves as a variation-independent key grouping different realizations of basically the same clitic combination.
A segmentation of the clitic combination as a sequence of word forms is not included in the database because this is, in many cases, problematic because of sandhi phenomena, cf. Middle Dutch dat = dat + het, Middle German deist = da + ist.
Clitic combinations are very common in Italian and Spanish (damelo = da + me + lo, "give me it"). They are also quite common in Middle German (deist = da + ist, enloufen = en (not) + laufen, etc.).

Figure 2. Multiple lemmata analysis.

Table 2: example data for an analyzed clitic combination

Table lemmata
Lemma_id   modern_lemma   lemma_part_of_speech
L1         Ik             PRN
L2         Zij            PRN
L3         Zien           VRB

Table analyzed_wordforms
Analyzed_wordform_id   Pos      part_number   Multiple_lemma_analysis_id   lemma_id   wordform_id
A1                     CLITIC   NULL          Mla_1                        NULL       W00001

Table multiple_lemmata_analyses
Multiple_lemmata_analysis_id   multiple_lemmata_analysis_part_id   Nr_of_parts   Part_number
Mla_1                          Mlap_1
Mla_1                          Mlap_2
Mla_1                          Mlap_3

Table multiple_lemmata_analyses_parts
Multiple_lemmata_analysis_part_id   multiple_lemmata_analysis_id   Part_nr   POS   Lemma_id
Mlap_1                              Mla_1                                    PRN   L1
Mlap_2                              Mla_1                                    VRB   L3
Mlap_3                              Mla_1                                    PRN   L2

Table wordforms
Wordform_id   Wordform
W1            ksachse

Status of this information: mandatory when applicable (when clitic combinations are prominent in the language, something has to be done about them). Use: internal and external (TR5 has to be able to deal with the clitic combinations as well).


2.5. Attestation

One of the most important tasks for the IMPACT lexicon building process is to keep track of the origin of word forms. An unstructured, ever-growing set of word forms, without information about the kind of text (in terms of period and subject matter) in which we can expect the words to occur, is usable neither in text recognition nor in enrichment. Hence, to each labeled or unlabeled word form, we link attestation objects, which are basically just verified occurrences of the words in documents. The attestations enable us to derive the relevant information about the domain of applicability of word forms from the properties of the documents they occur in.
When a word form is taken from a lexicon or dictionary, or originates from automatic analysis or expansion, we also keep track of its provenance. This is covered in the next section.

Besides the link to the relevant word form and a location in a document, the attestation objects contain the following information:
Verification (yes/no): Is the occurrence of a labeled word form checked manually by an expert?
Frequency in a document or document collection

Several distinct kinds of attestation may be relevant: we may just link a word form to a document, recording the frequency of occurrence (attestation at the text level), or we may link to an individual occurrence of the word (attestation at the token level) [8]. The latter kind of attestation is especially relevant to tagged corpora. In the lexicon building workflow, lemmata may first be assigned on the text level, where ambiguity is not completely resolved. At a later stage, ambiguity may be resolved by assigning lemmata on the token level.

[8] A type is a word form; a token is a particular instance (occurrence) of the type in a text.


Figure 2: database model for the attestation of word forms in documents [9]

2.5.1. Attestations on the token level
The representation of a lemmatized fragment in the database:

Everybody is loved by somebody?

Table lemmata
lemma_id   modern_lemma   lemma_part_of_speech
l1         EVERYBODY      PRN
l2         BE             VRB
l3         LOVE           VRB
l4         LOVED          ADJ
l5         BY             ADP
l6         SOMEBODY       PRN

Table wordforms
wordform_id   Wordform
wf1           Everybody
wf2           Is
wf3           Loved
wf4           By
wf5           Somebody

Table analyzed_wordforms
Analyzed_wordform_id   Part_of_speech   lemma_id   wordform_id
ana1                   PRN              l1         wf1
ana2                   VRB(3sg)         l2         wf2
ana3                   VRB(part)        l3         wf3
ana4                   ADJ              l4         wf3
ana5                   ADP              l5         wf4
ana6                   PRN              l6         wf5

Table token_attestations
Attestation_id   Quote   Analyzed_wordform_id   Document_id   onset   offset
1                NULL    ana1                   text1         0       8
2                NULL    ana2                   text1         9       11
3                NULL    ana3                   text1         12      17
4                NULL    ana4                   text1         12      17
5                NULL    ana5                   text1         18      20
6                NULL    ana6                   text1         21      29
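
In the example, two attestations (ana3 and ana4) share the same onset and offset, because the token "loved" has both a participle and an adjective analysis. A minimal Python sketch of how such unresolved ambiguity could be detected from the token_attestations rows follows; the function names are hypothetical and the rows are copied from the table above.

```python
from collections import defaultdict

# (analyzed_wordform_id, onset, offset) rows from the example table.
token_attestations = [
    ("ana1", 0, 8),
    ("ana2", 9, 11),
    ("ana3", 12, 17),
    ("ana4", 12, 17),
    ("ana5", 18, 20),
    ("ana6", 21, 29),
]


def analyses_by_span(rows):
    """Group analyzed word form ids by the (onset, offset) span they attest."""
    spans = defaultdict(list)
    for analyzed_wordform_id, onset, offset in rows:
        spans[(onset, offset)].append(analyzed_wordform_id)
    return dict(spans)


def ambiguous_spans(rows):
    """Return only the spans that carry more than one analysis."""
    return {
        span: ids for span, ids in analyses_by_span(rows).items() if len(ids) > 1
    }


print(ambiguous_spans(token_attestations))
```

A workflow tool could use a check of this kind to list the tokens that still need manual disambiguation.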

2.5.1.1. Token group attestations

There are several ways in which a group of graphical tokens can be linked to a single analysis of the group as a whole.
1. Two or more tokens are joined by a word form group; the group is analyzed as a whole. This is typically the case for more or less accidental split realizations of non-compound word forms like gelopen as ge lopen.
In these cases all group members get the same analysis as a token attestation and all group members are mentioned in the wordform_groups table. The group_attestations table (cf. 2. below) is not used in this case. The following tables are used:

2. The tokens are joined in a word form group, but the individual tokens have an analysis of their own. This is applicable to
1) Idiomatic expressions (not tackled as such in IMPACT)
2) Multiword named entities. E.g. Benedykta Chmielowskiego is analyzed as a compound word form for the NE lemma Benedykt Chmielowski; but we also do not wish to omit the information that Benedykta belongs to the lemma Benedykt, and Chmielowskiego belongs to Chmielowski.

The following structure is present in the database for this purpose: wordform_groups serves to link several tokens by a single group id. Group_attestations gives the possibility to link such a group of tokens as attestation data to analyzed_wordforms.

[9] We only give the diagram for attestations of labeled word forms. The diagram for attestations of unlabeled word forms is completely analogous.


To give an example, suppose we have the following short sentence:

To Jest Przez Xiedza Benedykta Chmielowskiego Dziekana Rohatynskiego, Firlejowskiego, Podkamienieckiego Pasterza.


Token            Raw token with punctuation   Character offset of start of token   Character offset of end of token
To               To
Jest             Jest
Przez            Przez                                                             13
Xiedza           Xiedza                       14                                   20
Benedykta        Benedykta                    21                                   30
Chmielowskiego   Chmielowskiego               31                                   45
Dziekana         Dziekana                     46                                   54
Rohatynskiego    Rohatynskiego,               55                                   68

lemmata
lemma_id   modern_lemma           lemma_pos
           to                     PRN
           byc                    VRB
           przez                  ADP
           Ksiadz                 NOU
           Benedykt               NOU
           Chmielowski            NOU
           Benedykt Chmielowski   NOU

wordforms
wordform_id   Word form
wf1           To
wf2           Jest
wf3           Przez
wf4           Xiedza
wf5           Benedykta
Wf6           Chmielowskiego
Wf7           Benedykta Chmielowskiego

analyzed_wordforms
Analyzed_word_form_id   Pos   Multiple_lemma_analysis_id   lemma_id   wordform_id
A1                      PRN   NULL                                    Wf1
A2                      VRB   NULL                                    Wf2
A3                      ADP   NULL                                    Wf3
A4                      NOU   NULL                                    Wf4
A5                      NOU   NULL                                    Wf5
A6                      NOU   NULL                                    Wf6
A7                      NOU   NULL                                    Wf7

wordform_groups
Wordform_group_id   document_id   onset   offset
                    text1         21      30
                    text1         31      45

group_attestations
Group_attestation_id   analyzed_wordform_id   Wordform_group_id
                       Ana7

token_attestations
attestation_id   analyzed_wordform_id   document_id   start_pos   end_pos
                 A1                     text1
                 A2                     text1
                 A3                     text1                     13
                 A4                     text1         14          20
                 A5                     text1         21          30
                 A6                     text1         31          45
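
The character offsets in the tables above can be recomputed from the example sentence. The following Python sketch is only an illustration: it assumes whitespace tokenization, end offsets that point just past the last character of the token, and trailing punctuation excluded from the token, which is consistent with the offsets that survive in the tables (for instance 14 and 20 for Xiedza). The function name token_offsets is hypothetical.

```python
import re

sentence = (
    "To Jest Przez Xiedza Benedykta Chmielowskiego Dziekana "
    "Rohatynskiego, Firlejowskiego, Podkamienieckiego Pasterza."
)


def token_offsets(text):
    """Yield (token, raw_token, start, end); end is exclusive and
    trailing punctuation is kept in raw_token but not in token."""
    for match in re.finditer(r"\S+", text):
        raw = match.group()
        token = raw.rstrip(".,;:!?")
        yield token, raw, match.start(), match.start() + len(token)


offsets = {tok: (start, end) for tok, _raw, start, end in token_offsets(sentence)}

# The two tokens that together attest the multiword NE Benedykt Chmielowski.
group = [offsets["Benedykta"], offsets["Chmielowskiego"]]

print(offsets["Xiedza"])
print(group)
```

The two spans collected in group correspond to the two rows of wordform_groups, which a group attestation then links to the analysis of the multiword named entity.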


2.5.2. Attestations on the text level

This type of attestation is linked to the occurrences of a word in a text, without specifying the location in the document. It is important in our workflow that partially disambiguated information can also be stored and used, i.e. several attestations may be linked to the same type or token.

Table text-level attestations
Attestation_id   Frequency   Verified   Analyzed_wordform_id   Document_id
Tla1             23          True       A100                   Text1

2.5.3. Verifying non-analyzed word forms
In some contexts, word forms can be attested for which no analysis is available. For this reason the table token_attestation_verifications is introduced in the database. Attestations of this type link directly to word forms.
In some cases the annotator might decide not to assign a lemma to the token. The token is then marked as verified. Verified tokens might be revisited at a later stage.

Status of attestation information: mandatory, external (for use in TR5), and for internal use in EE2 and EE3

2.6. Derivations

Figure 3. Derivations

Word forms can get a more elaborate analysis than just a part of speech and a gloss. A modern word form can be attached, and possibly also a set of patterns that describes how to get to the older word form from the modern one. E.g.:

theyle,<teile>,[(t_th,0),(ei_ey,1)],NOUN,teil

Here the part between angled brackets (<>) describes the modern word form, and the part between square brackets ([]) describes the patterns.

Table derivations
derivation_id            Identifier of the derivation.
normalized_form          The modern word form. Can be NULL.
pattern_application_id   Identifier of pattern application if applicable. Can be empty, in which case it is 0 (nil, not NULL).

Table pattern_applications
pattern_application_id   Identifier of the derivation. NOTE that this is NOT a primary key. Rather, it is used to group several patterns together. The unique key of this table is composed of all the fields together.
Position                 The position in the string that the pattern is applied to (0 and 1 in the example).
number_of_patterns       The number of patterns that go with this analysis (two in the example above). This number is in a way redundant, because it is always the same as the number of records sharing the same identifier. Storing the number here, however, makes some queries a lot faster and easier.
pattern_id               Identifier of the pattern associated.

Table patterns
pattern_id        Identifier of the pattern.
left_hand_side    The left-hand side of the pattern: what is left of the underscore, so t and ei in the example above.
right_hand_side   The right-hand side of the pattern: what is right of the underscore, so th and ey in the example above.

Please note that both patterns and modern word forms can be empty.
In other words

theyle,[(t_th,0),(ei_ey,1)],NOUN,teil

theyle,<teile>,NOUN,teil

are both valid analyses.

If there are patterns but no modern word form (as in the first example above), a row in the derivations table is created nonetheless to tie the patterns and the analyzed word form together. Its modern word form field will be left empty, however.
If there is no pattern but a modern word form is provided (as in the second example above), then there will just be a row in the derivations table and no corresponding pattern applications or patterns.
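
How the patterns relate the modern word form to the historical one can be sketched in a few lines of Python. This is an illustration under one explicit assumption: the position of a pattern is taken to be a character index into the modern word form, which fits the example (t at index 0 and ei at index 1 of teile). The function names parse_pattern and apply_patterns are hypothetical.

```python
def parse_pattern(spec, position):
    """Split a pattern such as 't_th' into (left_hand_side, right_hand_side, position)."""
    left_hand_side, right_hand_side = spec.split("_", 1)
    return left_hand_side, right_hand_side, position


def apply_patterns(modern_form, patterns):
    """Rewrite the modern word form into the historical word form.

    Assumption: each position indexes into the modern word form.
    Patterns are applied from right to left so that a replacement of a
    different length does not shift the positions still to be processed.
    """
    result = modern_form
    for lhs, rhs, pos in sorted(patterns, key=lambda p: p[2], reverse=True):
        if result[pos:pos + len(lhs)] != lhs:
            raise ValueError(f"{lhs!r} not found at position {pos} in {modern_form!r}")
        result = result[:pos] + rhs + result[pos + len(lhs):]
    return result


patterns = [parse_pattern("t_th", 0), parse_pattern("ei_ey", 1)]
print(apply_patterns("teile", patterns))
```

Applied to the modern form teile, the two patterns of the example yield the historical form theyle.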

2.7. Documents, corpora and workflow management
In order to record provenance details, the database is provided with the structure depicted in Figure 4.

Figure 4. Documents, corpora and workflow tables.

Documents can be organized in corpora. An important reason for this is the allocation of properties to a large number of documents at once.
The table type_frequencies contains the relations between word forms and documents. When a document is to be annotated, all of its word forms are added to the table wordforms (unless there already exists an entry for that word form). Simultaneously, the frequency of the word forms occurring in the document is registered in table type_frequencies.
The table dont_show can be used during the building of the lexicon. Certain word forms (e.g. frequent function words) should not be presented to the annotators over and over again during the process of attesting documents and corpora. It is possible to exclude certain word forms from attesting in a certain document, in a certain corpus, or in all documents and corpora.
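
The interplay of type_frequencies and dont_show can be sketched as follows. This is a simplified Python illustration, not the project's implementation: the class DontShow, the functions type_frequencies and is_hidden, the example tokens and the document id doc1 are all hypothetical; only the corpus id SG1873 is taken from the example data of the dont_show table.

```python
from collections import Counter
from typing import NamedTuple, Optional


class DontShow(NamedTuple):
    wordform: str
    document_id: Optional[str] = None  # hide in this document only
    corpus_id: Optional[str] = None    # hide in this corpus only
    at_all: bool = False               # hide everywhere


def type_frequencies(tokens):
    """Count how often each word form (type) occurs in one document."""
    return Counter(tokens)


def is_hidden(wordform, document_id, corpus_id, rules):
    """True if some rule excludes the word form from annotation here."""
    for rule in rules:
        if rule.wordform != wordform:
            continue
        if rule.at_all:
            return True
        if rule.document_id is not None and rule.document_id == document_id:
            return True
        if rule.corpus_id is not None and rule.corpus_id == corpus_id:
            return True
    return False


tokens = ["de", "kat", "en", "de", "hond"]
rules = [DontShow("de", corpus_id="SG1873")]

freqs = type_frequencies(tokens)
to_annotate = [w for w in freqs if not is_hidden(w, "doc1", "SG1873", rules)]
print(freqs["de"], to_annotate)
```

The frequency of a hidden word form is still recorded; the rule only keeps it out of the list presented to the annotator.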

Table dont_show
Wordform_id   Document_id   Corpus_id   At_all   User_id   Date
Wf201                       SG1873               1         15-01-2010

For administrative purposes we added a table users. Here we register staff members who are tasked with manual annotation and verification.

Table users
User_id   name
1         Jan van der Wiel

3. Information attached to lemmata

Lemmata are linked to word forms (cf. 2.4). In their turn, lemmata need several other information categories to fulfill their role in the lexicon, which will be described in this section.

Figure 5: basic lemma information

3.1. Lemma id
It goes without saying that each lemma is assigned a unique ID.

Status: mandatory

3.2. Modern lemma form
Recall that the modern lemma form is used as a variation-independent search key in Enrichment. The general rule is to assign a single modern lemma form. In some cases, it will be profitable to add more than one modern lemma form, because several variants survive in the modern language, with more or less equal status. A separate table stores these variants. Typical examples in Dutch: weer/weder, neer/neder.
There will be a separate document about the principles of assigning a modern lemma to historical word forms.

Status of this information: mandatory, both for internal and external use. Modern lemma variants are optional.


3.3. Lexical part of speech
A main part of speech is assigned to each lemma (e.g. NOUN, VERB, ADPOSITION, ...).
Part of speech is not by itself a deliverable of IMPACT, but the lexicon cannot be organized without it.
Part of speech distinguishes lemmata. Additional features (like gender, inflectional class) do not by themselves constitute a sufficient criterion to distinguish lemmata, since they are very much subject to historical variation (e.g.: at least 3815 nouns from the Dutch Woordenboek der Nederlandsche Taal have more than one possible gender). We do not specify which additional features may be used for all different languages. Instead, we provide a general mechanism for adding features (cf. 3.6 and 3.4).

Status of this information: mainly for internal use [10], but hardly dispensable as a means to organize the lexicon, so: mandatory.

3.4. Gender and other possible grammatical features

Gender information is important as an organizational principle in, for instance, German. In other languages, features like animate/non-animate may be relevant.
In languages with poor inflectional morphology, it is often possible to have several genders for a single lemma. Hence the suggested general feature assignment mechanism (cf. figure 1).

Example: gender

Table lemma_features
lemma_feature_id   Lemma_feature_name
1                  Gender
2                  Foreign_Language_Name

Table lemma_feature_values
lemma_feature_value_id   lemma_feature_value
1                        M
2                        V
3                        French
4                        German

Table Lemma_feature_assignments
assignment_id   feature_id   value_id   lemma_id
1               1            1          19289
2               1            2          19289
3               2            4          20001

Status: optional, internal, depending on the importance of these notions in the language at hand.
Remark: Within the NE context, this can be used to tag words as belonging to a foreign language (Koroka [SLOVENIAN]).
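
The general feature mechanism can be read as a three-way join between features, values and assignments. A minimal Python sketch using the example rows above follows; the in-memory dictionaries and the function name features_of are assumptions made for illustration, not part of the database design.

```python
# Example rows copied from the three tables above.
lemma_features = {1: "Gender", 2: "Foreign_Language_Name"}
lemma_feature_values = {1: "M", 2: "V", 3: "French", 4: "German"}

# (assignment_id, feature_id, value_id, lemma_id)
lemma_feature_assignments = [
    (1, 1, 1, 19289),
    (2, 1, 2, 19289),
    (3, 2, 4, 20001),
]


def features_of(lemma_id):
    """Collect all feature values assigned to a lemma, grouped by feature name."""
    result = {}
    for _assignment_id, feature_id, value_id, assigned_lemma in lemma_feature_assignments:
        if assigned_lemma == lemma_id:
            name = lemma_features[feature_id]
            result.setdefault(name, []).append(lemma_feature_values[value_id])
    return result


print(features_of(19289))
print(features_of(20001))
```

Lemma 19289 receives two gender values, which is the situation the mechanism is meant to allow: a feature is not limited to a single value per lemma.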

3.5. Namedentitylabel

For named entities (NE), either multiword or single, a classification label is added according to the

PartofspeechtaggingisnotadeliverableofIMPACT

10

19

Lexicon structure

IMPACT

EE2

schemechosenforIMPACT.TheproposedlabelsareNEPER(persons),NELOC(locations),NEORG
(organizations).

Statusofthisinformation:forinternalandexternaluse,mandatory.

3.6. Inflectionalclass(es)
Inflectionalclassesarenecessaryforthebasicgenerationofwordformsinthereverselemmatization
task.

Statusofthisinformation:forinternaluse,buthardlydispensableasameansoforganization.

3.7. Language
Whenatextcontainswordsfromanotherlanguage,theyshouldbemarkedaccordingly.

3.8. Gloss

Lemmata may have a short description of word meaning. This is especially relevant to be able to
distinguishbetweenhomographs.

Status:optional,internalandexternaluse.

3.9. Multiword expressions

The inclusion of multiword expressions (MWE) takes us to the boundary of syntax and
morphosyntax. A lot of recent research has been devoted to the position of multiword expressions in
the lexicon; much of this work is concerned with the syntactic treatment or the semantic interpretation
of idioms, which is decidedly out of scope for IMPACT.
Within IMPACT, MWE are likely to play a role for named entities and for constructions which can be
realized both as a single orthographic token and as several tokens (e.g. separable verbs and detached
word parts).

There are two distinct ways of adding multiword structure to the database. We can map a multiword
expression realized as a word form to a sequence of lemmata and PoS labels by using the structure
already present in the database for the storage of clitic combinations (cf. 2.4), and the constituent parts
of multiword lemmata are specified using a mechanism parallel to the way we treat morphological
analysis (3.10). Some typical cases:
1. Transparent: there is a clear 1-1 correspondence between the parts of the word form, separated
by whitespace, and the lemma parts. Karl der Grosse, Karls des Grossen.
Most naturally seen as a sequence of word forms, each with their own lemma and PoS. The
sequence has a higher-level PoS and lemma as well. Cf. also 2.5.1.1.
2. Non-transparent: zu rück: two typographic words but just one linguistic word form
(containing whitespace, no special treatment required in the lexical database).
3. Some combinations, like Middle Dutch al die wile dat (all the while that), admit of both points of
view. The fact that the combination occurs with different typographical segmentations (cf.
examples below) points to an analysis along the lines of the analysis of clitic combinations.
(Dictionary of Early Middle Dutch:
Ende al de wile dat soe drinct yet Sone drinc en twint selue niet, En.Cod. p. 486-487, r. 426,
Oost-Vlaanderen, 1290
Endealdiewiledatsighingenoliekoepen,soquamdiebrudegoem,endediegheretwaren,
ghingenmetheminterbrulocht, Diat. p. 222, r. 1216, Brabant-West, 1291-1300)

The equivalence class method (ECM, Odijk 2004) is quite similar to what we intend to do on the
lemma level. In order to arrive at a representation which can be used in different possible grammatical
theories, Odijk proposes to include the following information for each idiom:
1. Idiom pattern id (= our multiword_operation_id)
2. Idiom component list (= multiword_analysis)
3. Example sentence (we should get this from the attestations)
In order to deal with inflected forms of multiword expressions, pattern equivalence can be defined
such that equivalent multiword expressions have similar inflectional properties.

Status of multiword data: optional for the general lexicon; indispensable for the named entities
lexicon.

Figure 6: database model for multiword lemmata [11]

Table lemmata
lemma_id  modern_lemma    lemma_pos
L102      al-de-wijl-dat  CONJ
L501      Al              PRN
L502      De/die [12]     PRN
L503      Wijl            NOU
L504      Dat             CONJ

[11] For an explanation of the self-reference in the definition of the lemmata table, cf. section 3.11.1,
portmanteau lemmata
[12] The slash indicates alternatives

Table multiword_analyses
Multiword_analysis_id  Arity  Analyzed_lemma_id  multiword_operation_id
a102                   4      L102               M1

Table part_multiword_analysis
Part_multiword_analysis_id  Part_number  Part_lemma_id  Multiword_analysis_id
P1                          1            L501           a102
P2                          2            L502           a102
P3                          3            L503           a102
P4                          4            L504           a102
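
To show how the three tables fit together, the following sketch resolves a multiword analysis into its ordered part lemmata. It is only an illustration: the tables are held in plain Python structures rather than a database, the analysed lemma is taken to be L102 (the multiword lemma of the example), and the function `parts_of` is a hypothetical helper.

```python
# Example rows from the tables above, held in memory for illustration.
lemmata = {
    "L102": ("al-de-wijl-dat", "CONJ"),
    "L501": ("Al", "PRN"),
    "L502": ("De/die", "PRN"),
    "L503": ("Wijl", "NOU"),
    "L504": ("Dat", "CONJ"),
}
multiword_analyses = {
    "a102": {"arity": 4, "analyzed_lemma_id": "L102", "multiword_operation_id": "M1"},
}
# (part_multiword_analysis_id, part_number, part_lemma_id, multiword_analysis_id)
part_multiword_analysis = [
    ("P3", 3, "L503", "a102"),
    ("P1", 1, "L501", "a102"),
    ("P4", 4, "L504", "a102"),
    ("P2", 2, "L502", "a102"),
]


def parts_of(analysis_id):
    """Return the (modern_lemma, lemma_pos) pairs of an analysis, in part order."""
    analysis = multiword_analyses[analysis_id]
    rows = sorted(
        (row for row in part_multiword_analysis if row[3] == analysis_id),
        key=lambda row: row[1],  # order by part_number
    )
    if len(rows) != analysis["arity"]:
        raise ValueError("number of parts does not match the stored arity")
    return [lemmata[row[2]] for row in rows]
```

The arity check is a design choice of this sketch: it catches analyses whose part rows are incomplete before they are used for lookup.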

3.9.1. Multiword named entity lemmata

The inclusion of Named Entities (NEs) in the lexicon is crucial in the sense that, on the one hand, text
recognition is based on input from the lexicon, so we want to capture as many possibly occurring
tokens as possible, and on the other hand, names of persons, organizations and places are very likely
candidates for users' search queries; hence, normalizing them with respect to orthographical and
interlingual variation is desirable.
NEs can occur in the form of multiword expressions or as single tokens. In principle, the mapping
from multiword NEs to lemmas works the same way as with idiom parts, i.e., the entire complex
receives a Lemma ID, and the parts are mapped onto their corresponding lemmas, if available. For the
possible values of the property NE label cf. section 3.5. For the treatment of word forms and
attestations for multiword NEs, cf. 2.5.1.1.
Table lemmata
lemma_id  Modern_lemma     lemma_pos  ne_label
L202      Jan van de Wiel  NOU        NE_PER
L601      Jan              NOU        NE_PER
L602      Van              ADP
L603      De               PRN
L604      Wiel             NOU        NE_PER

Table multiword_analyses
Multiword_analysis_id  Arity  Analyzed_lemma_id  Multi_operation_id
A202                   4      L202               m1

Table part_multiword_analysis
Part_multiword_analysis_id  Part_number  Part_lemma_id  Multiword_analysis_id
P1                          1            L601           A202
P2                          2            L602           A202
P3                          3            L603           A202
P4                          4            L604           A202

Note that there is no distinction between single-word and multiword NEs, as both types are identified
as the same PoS category, and that the person name Wiel is not mapped to the noun wiel (wheel).
Status of NE information: mandatory, internal and external use


3.10. Morphological analysis

This section is about derivation and composition. The paradigmatic relation between lemma and word
forms is treated in section 2.4. Morphological analysis will be attached at the lemma level. The word
forms belonging to lemmas will inherit this analytical information.
Within IMPACT, morphological analysis is not a purpose of its own, but serves practical ends:
- to function in a spell checker that does not reject newly found productive compounds because of their
deviant forms
- analysis of existing compounds can be used to predict inflectional forms for compounds/word forms
which will be generated automatically (expansion).

Some remarks:
1) Morphological analysis can be specified in the form of a full hierarchical analysis, or a flat list of
components, or (partial analysis) one can just specify the head of the compound, which usually
determines its morphosyntactic properties. The proposed database structure is compatible with these
three possibilities. We want to stress the idea that different solutions are possible for different
languages.
To fulfill practical ends, we don't always need full-blown deep analyses. We only have to be able to say
which type of compound and which final parts of a compound are very frequent.
- A deep analysis can be obtained by storing, recursively, the analyses of the immediate
constituents (Braumeisterfleischpflanze is analyzed as a nominal compound of Braumeister +
Fleischpflanze; a deeper analysis can be stored if Braumeister is analyzed in its turn as
brauen + Meister and Fleischpflanze as Fleisch + Pflanze, etc.).
- An (arbitrarily long) flat analysis of a compound is also possible:
Braumeister + Fleisch + Pflanze. There is often no need to choose between different
bracketings of a compound.
- If the focus is on predicting the morphosyntactic properties of the compound, it is sufficient
to analyze this word as a nominal compound with last part Pflanze. It is NOT mandatory to
link to all parts of a compound.
2) Diminutives are assigned lemmata of their own but the relation to the base lemma is stored.
3) It is allowed to assign more than one analysis to one compound lemma.


Figure 7: database model for morphological analysis

Status of morphological analysis: internal + external, optional (in the sense that not all words must be
analyzed)
Internal use: use to predict paradigms of compounds and derivations
External use: use in OCR to help assess the probability of unknown words

Table 1: database examples for morphological analysis

Table lemmata
Id       Modern_lemma               Lemma_pos
L000001  Appelflap                  NOU
L000002  Appel                      NOU
L000003  Flap                       NOU
D1       Braumeisterfleischpflanze  NOU
D2       Braumeister                NOU
D3       Pflanze                    NOU
D4       Brauen                     VRB
D5       Meister                    NOU
D6       Fleisch                    NOU
D7       Fleischpflanze             NOU

Table morphological_analyses (the Arity column carries no values in this example)
Morphological_analysis_id  Arity  Analyzed_lemma_id  Morphological_operation_id
A1                                L000001            O1
A2                                D1                 O1
A3                                D2                 O2
A4                                D7                 O1
A5                                D1                 O3

Table morphological_operations
Morphological_operation_id  description        resulting_pos
O1                          NOU+NOU -> NOU     NOU
O2                          VRB+NOU -> NOU     NOU
O3                          .* + NOU -> NOU    NOU

Table part_morphological_analysis (the Part_number column carries no values in this example)
Part_morphological_analysis_id  Part_number  Part_lemma_id  Morphological_analysis_id
P1                                           L000002        A1
P2                                           L000003        A1
P3                                           D2             A2
P4                                           D7             A2
P5                                           D4             A3
P6                                           D5             A3
P7                                           D6             A4
P8                                           D3             A4
P9                                           D3             A5

Note: Analyses A2, A3, A4 constitute a hierarchical analysis ((Brau)(meister))((fleisch)(pflanze)); A5 is a flat
analysis (brau_meister_fleisch_pflanze) which only links to the head of the compound.
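
The recursive storage of immediate constituents can be illustrated as follows. This is a sketch under stated assumptions, not part of the deliverable: the hierarchical analyses A2, A3 and A4 are copied into Python dictionaries, parts are assumed to be stored in left-to-right order, and the helpers `bracket` and `head` are hypothetical names.

```python
# Lemmata and hierarchical analyses of the Braumeisterfleischpflanze example.
lemmata = {
    "D1": "Braumeisterfleischpflanze",
    "D2": "Braumeister",
    "D3": "Pflanze",
    "D4": "Brauen",
    "D5": "Meister",
    "D6": "Fleisch",
    "D7": "Fleischpflanze",
}
analyzed_lemma = {"A2": "D1", "A3": "D2", "A4": "D7"}  # analysis -> analysed lemma
parts = {"A2": ["D2", "D7"], "A3": ["D4", "D5"], "A4": ["D6", "D3"]}  # ordered parts


def analysis_of(lemma_id):
    """Return the id of the analysis stored for a lemma, or None for a simplex."""
    for analysis_id, target in analyzed_lemma.items():
        if target == lemma_id:
            return analysis_id
    return None


def bracket(lemma_id):
    """Expand a lemma into a bracketed string by following stored analyses."""
    analysis_id = analysis_of(lemma_id)
    if analysis_id is None:
        return lemmata[lemma_id]
    return "(" + "+".join(bracket(part) for part in parts[analysis_id]) + ")"


def head(lemma_id):
    """Follow the last part of each analysis down to the simplex head."""
    analysis_id = analysis_of(lemma_id)
    if analysis_id is None:
        return lemmata[lemma_id]
    return head(parts[analysis_id][-1])
```

Taking the last part as the head matches the right-headed compounds of the example; a language with other headedness would need a different rule.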

3.11. Unresolved ambiguity in lemma assignment

There are various ways of dealing with ambiguous word forms in the database. The basic mechanism
is always the same: different analyses are attached to a single word form. This makes it possible to
either leave it like that and not resolve the ambiguity at all or resolve it partially or resolve it
completely. This depends on the requirements for the task.
The two mechanisms described below mainly serve to distinguish ambiguities which need not be
resolved in IMPACT from other ambiguities which possibly do require a partial resolution.


3.11.1. Portmanteau lemmata

A portmanteau lemma is a lemma representing a group of homographs.
The purpose of portmanteau lemmata is to avoid choosing between two homographic lemmata (with
equal modern lemma form and PoS), but different in meaning, inflection class or gender.
Portmanteau lemmata will be implemented as ordinary lemmata, linked to the homographs.
Cf. heer1 (lord), heer2 (army) or bank1 (couch), bank2 (bank), Wetter1 (person who places bet), Wetter2
(weather).
Portmanteau lemmata can be used to avoid complete disambiguation in morphological analysis as
well: cf. tuinbank vs. handelsbank or heerbaan (heirbaan) vs. heerendas. A word form which belongs
unambiguously to the paradigm of one of the homographs can be assigned directly to the more
specific lemma, e.g. the old form har belongs only to heer1.
NB: portmanteau lemmata will not be used to group homographic lemmata with different PoS. Cf. the
discussion of transcategorization (conversion), section 3.11.2.
Portmanteau lemmata were introduced for practical reasons:
- Lemmatizing a word form like Dutch kip to 15 possible homographic lemmata is not very
attractive.
- How to update the ambiguous lemmatizations when another homograph is added to the lexical
database?
- How to add data from a full-form lexicon which need not have split the homographs in exactly the
same way?

Status: optional, internal

3.11.2. Transcategorisation (conversion), sublemma and main lemma
Transcategorization (or conversion) occurs when part of the paradigm of a lemma X with PoS A can be
seen as belonging to lemma Y with PoS B (e.g. participles, which can be seen to belong to both a verbal
and an adjectival paradigm). We call Y a sublemma corresponding to the main lemma X. In each
language, we have a (small) fixed set of productive transcategorization relations. This list will be
included in the database for the language.
While it might appear that including transcategorization information in the lexical database is
linguistic hairsplitting and not relevant to IMPACT, it must be realized that it provides us with a
principled way to avoid or defer decisions about lemma and PoS assignment to word forms like
geboren (geboren/ADJ or gebären/VRB), or to leave the choice up to the user.


Figure 8: Database objects relating to morphosyntactic conversion (transcategorization)

Table 2: database examples

Table Lemmata
Lemma_id  Modern_lemma  Lemma_part_of_speech
L1        Bakken        VRB
L2        Gebakken      ADJ

Table Transcategorisations
Transcategorization_id  Mainlemma_id  Sublemma_id  Transcategorizationtype_id
T1                      L1            L2           C1

Table Conversion_rules (list of transcategorizations present in the language)
Rule_id  main_pos                sub_pos       Transcategorisation_id
R1       VRB(part,past)          ADJ(infl=0)   C1
R2       VRB(part,past,infl=e)   ADJ(infl=e)   C1
R3       VRB(part,past,infl=en)  ADJ(infl=en)  C1

Table Transcategorisation_types
Transcategorizationtype_id  Description                                        main_pos  sub_pos
C1                          Conversion between past participle and adjective  VRB       ADJ

Use of this data:
1) Lexicon expansion: create sublemmata automatically for non-incidental transcategorizations
2) Postponing or omitting disambiguation: distinguish between genuine ambiguity (where for
instance two semantically and etymologically completely different lexemes may be involved)
and ambiguity resulting from different tagging principles

Status of this data: optional, internal


3.12. Adding custom information on the lemma level
If the database designer needs to store other lemma-related information, the recommended way is not
to change the tables which are part of the basic structure, but to add tables linking the information to
the relevant lemma IDs. If, for instance, it is desirable to add near-synonym information for retrieval
purposes, the preferred solution is not to add fields to the lemmata table, but to add a table linking to
it.

Example: retrieval links for near-synonyms or heads of compounds. Possible use: when searching for
lemma_id, also search for related_lemma_id.

Table lemmata
Lemma_id  Modern_lemma     Lemma_pos
L001      Zange            NOU
L002      Seitenschneider  NOU
L003      Sonnenblume      NOU
L004      Blume            NOU

Table retrievallinks
Lemma_id  Related_lemma_id
L001      L002
L004      L003

Status of this information: optional, mainly for external use in retrieval.
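
A retrieval component could use such a link table roughly as follows. This is a minimal sketch, assuming the links are followed in one direction only (from Lemma_id to Related_lemma_id) and only one step deep; `expand_query` is a hypothetical helper, not an IMPACT function.

```python
# Example rows of the lemmata and retrievallinks tables.
lemmata = {
    "L001": "Zange",
    "L002": "Seitenschneider",
    "L003": "Sonnenblume",
    "L004": "Blume",
}
retrievallinks = [("L001", "L002"), ("L004", "L003")]  # (lemma_id, related_lemma_id)


def expand_query(lemma_id):
    """Return the queried lemma id followed by all directly related lemma ids."""
    related = [target for source, target in retrievallinks if source == lemma_id]
    return [lemma_id] + related


def expand_query_as_lemmata(lemma_id):
    """Same expansion, returned as modern lemma forms."""
    return [lemmata[identifier] for identifier in expand_query(lemma_id)]
```

A search for Blume would then also retrieve Sonnenblume, while the lemmata table itself stays unchanged.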

3.13. Additional structure for related entries in NE lexica

In the general lexicon, variants are included as word forms belonging to the modern standard lemma.
This will also be the case for spelling variants of locations (Haerlem will be a word form with lemma
Haarlem, etc.).
For person names, however, we found it not feasible to distinguish between allographs of the same
name and etymologically related but different names. There are also variant relations like
interlingual variation which deserve special treatment.

We propose the following structure:


Examples:

Table lemmata
Lemma_id  Modern_lemma   Lemma_pos  NE_Label
L001      Kärnten        NOU        NELOC
L002      Carinthia      NOU        NELOC
L003      Koroška        NOU        NELOC
L004      Douwes Dekker  NOU        NEPER
L005      Multatuli      NOU        NEPER

Table ne_variant_relation_types
Ne_variant_relation_type_id  Ne_variant_relation_name  ne_variant_relation_description
1                            Interlingual_variant      Second is an other-language variant of first
2                            Pseudonym                 Second is a pseudonym used by first

Table ne_variant_relations
first_lemma_id  second_lemma_id  ne_variant_relation_type_id
L001            L002             1
L001            L003             1
L004            L005             2
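
For query normalization, the variant relations could be looked up as in the sketch below. It assumes that a relation may be followed in both directions when searching (the table itself stores an ordered first/second pair), and the function name `variants` is hypothetical.

```python
# Example rows of ne_variant_relations:
# (first_lemma_id, second_lemma_id, ne_variant_relation_type_id)
ne_variant_relations = [
    ("L001", "L002", 1),
    ("L001", "L003", 1),
    ("L004", "L005", 2),
]

INTERLINGUAL_VARIANT = 1
PSEUDONYM = 2


def variants(lemma_id, relation_type_id):
    """Return the sorted ids of lemmata linked to lemma_id by one relation type."""
    found = set()
    for first, second, relation_type in ne_variant_relations:
        if relation_type != relation_type_id:
            continue
        if first == lemma_id:
            found.add(second)
        elif second == lemma_id:
            found.add(first)
    return sorted(found)
```

With the example rows, a query for Kärnten (L001) can be widened to Carinthia and Koroška, and a query for Multatuli (L005) leads back to Douwes Dekker (L004).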

3.14. Named entity parts

These tables were added to allow parts of names to be marked as such.
Examples of NE part types for Dutch are:

Given name  Piet
Surname     Jansen
Title       dr., Jhr., baron
Particle    van, de, of, thoe, over, uyt
Suffix      junior, senior, sr., C.zn, A.zn, IIIe, Derde

Status of this information: optional, mainly for external use in retrieval.


4. Information on the document level

Information about the domain of application of words will be specified on the document level. By
linking the words to the documents they occur in, they will inherit this information.
The following are relevant on the document level.

- Elementary bibliographical data:
  - Author
  - Editor
  - Title
  - Date of publication
  - Publisher
  - Publishing location
  - If document is part of e.g. a magazine, or a collected work, ...: reference to this work
    and to pages and/or issue/volume in this magazine, collection ...
  - If document is in collection holder's catalogue: some ID or other type of link to the
    relevant item in the catalogue
- Text type, based on library metadata standards
- Number of words
- Date of text (can differ from date of publication, e.g. in case of editions)
- Region of origin of text (dialect/language variety)
- Character encoding (UTF-8)
- Primary language
- Presence of other languages, e.g. Latin, French, ...
- To start with: informal description of the type of spelling used in the document. In the course
  of the project, this can be extended by a more formal profile. (For instance, in Dutch there is a
  difference between text material in the late nineteenth century spelling of De Vries/Te Winkel and
  the spelling of the Groene Boekje 1954. Some authors, e.g. Multatuli, have their own spelling rules.
  This information is relevant.)
- Location (path).

Status of this information: mandatory


5. Auxiliary information for word form synthesis and analysis

Another kind of information concerns automatically generated or analyzed word forms.
Here, we keep track of:
- the inflection rule used
- the building element(s)
- the spelling patterns used to match the normalized spelling of the word form with the actual
spelling in documents

The following diagram summarizes the relation between historical word form and modern lemma
form, which is central in IMPACT lexica:

The horizontal axes correspond to models for inflectional morphology; the vertical axes correspond to
spelling variation as it will be modelled in IMPACT [13].

5.1. Data to support the modelling of orthographic variation

In order to be able to induce statistical models for historical spelling by machine learning algorithms,
some extra data, besides the relation of historical word form and modern lemma, must be developed.
Without the modern word form equivalents, it is difficult to separate inflection from orthographical
variation. The addition of this information is not entirely unproblematic. When there are
morphological (and phonological) differences, a historical word form in modern spelling may be a
somewhat artificial construct. In manual annotation of ground truth material, we expect it to be much
easier to choose a relevant lemma from a suggestion list of possible lemma assignments, than to
choose a plausible transcription for an historical word form. There are, however, many cases where
the differences between modern language and historical language are largely orthographic, and it is
indeed possible to have some standard representation of historical word forms in modern spelling.

[13] The noisy channel models used assign weights to multi-character substitutions, thus defining a
probabilistic model of orthographic variation.

The modern word form is useful, because a database of modern and historical word forms makes it
easy to induce a set of patterns relating historical and modern spelling by a machine learning
algorithm.
SUMMARIZING: It is of course not a problem to include this field in the database without being
obliged to manually verify its contents. It may be sufficient to fill this in for only a relatively small
number of word forms in a certain orthography in order to obtain the set of patterns needed to
describe this particular orthography.

Example: the manual verification of the lemma assignment zeggen to the historical word form seg(h)e is easy;
choosing a modern form (zeg or zegge or even zeggen) is much less straightforward.

Position in paradigm  Middle Dutch                                                      Modern
1e sg.ind.pres.       sech, seg(h), segg, secge, seche, seg(h)e, segg(h)e               Zeg
1e pl.ind.pres.       secg(h)en, segg(h)en, zegghen; directly followed by the           Zeggen
                      pers.pron. wi, final -n often missing: secghe, segg(h)e, zegghe
imp.sg.               sech, seg(h), seg(h)e                                             Zeg
imp.pl.               sagit (2x, Nederrijn), secget, sec(h)t, segg(h)et                 Zegt
1e sg.conj.pres.      segg, segh, sage, segge (it's not always certain that a           Zegge
                      (nonexistent) conjunctive is involved)
3e sg.conj.pres.      segg, sage, secghe, segghe


Figure 9. Derivations

Table Wordforms
Wordform_id  Wordform
W1           Klaerlick

Table Analyzed_wordforms
analyzed_wordform_id  Number_of_parts  Wordform_id
A1                    0                W1

Table Derivations
Derivation_id  Normalized_form  analyzed_form_id
D1             Klaarlijk        A1

Table Patterns
Pattern_id  Left_hand_side  Right_hand_side
P1          aa              ae
P2          ij              i
P3          k               ck

Table Pattern_applications
Pattern_application_id  Position  Pattern_id  Derivation_id
Pa1                     2         P1          D1
Pa2                     6         P2          D1
Pa3                     8         P3          D1


Status of this information: mandatory, external (for use within TR5)
The mandatory status of this information does not imply that it is completely manually verified. One
may choose to generate this information from other information. The normalized word form may be
chosen on the fly among the modern word forms of the lemma. The mapping, on the other hand,
between historical word form and modern lemma is part of the deliverable output of the lexicon
building process and the quality of this mapping has to be checked, and if necessary, manual
corrections must take place.
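
The derivation of Klaerlick from Klaarlijk can be replayed in a few lines. The sketch below rests on two assumptions that the tables do not state explicitly: Position is read as a zero-based character offset into the normalized form, and forms are compared in lower case. The function `apply_patterns` is a hypothetical helper.

```python
# Patterns (left-hand side in the normalized form, right-hand side in the
# historical spelling) and their applications in derivation D1.
patterns = {"P1": ("aa", "ae"), "P2": ("ij", "i"), "P3": ("k", "ck")}
pattern_applications = [(2, "P1"), (6, "P2"), (8, "P3")]  # (position, pattern_id)


def apply_patterns(normalized_form, applications, pattern_table):
    """Rewrite a normalized form into the historical spelling.

    Applications are processed from right to left so that a replacement of
    different length does not shift the offsets of the remaining ones.
    """
    result = normalized_form
    for position, pattern_id in sorted(applications, reverse=True):
        left, right = pattern_table[pattern_id]
        if not result.startswith(left, position):
            raise ValueError(f"pattern {pattern_id} does not match at offset {position}")
        result = result[:position] + right + result[position + len(left):]
    return result


historical_form = apply_patterns("klaarlijk", pattern_applications, patterns)
```

Storing the offset with each application makes the derivation unambiguous even when the same left-hand side occurs more than once in a word.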

5.2. Information about paradigmatic expansion

This is one way of keeping track of the process of expansion from lemmata to word forms.

Figure 10. Paradigmatic expansion

Table paradigms
Paradigm_id  Paradigm_name
P1           Regular verbal a-stems
P2           Regular verbal e-stems

Table paradigm_positions
paradigm_position_id  paradigm_position_name  paradigm_position  Paradigm_id
1                     1sgindpresactive        1                  P1
2                     2sgindpresactive        2                  P1

Table transformsets
Transformset_id  Inflection_process  Paradigm_position_name  Stem_type_id
R1               s/are$/o/ [14]      1sgpresactiveastems     ST1
R2               s/are$/as/          2sgpresactiveastems     ST1

Comment: patterns for inflection may be either simple substitution rules or full-fledged finite-state
transducers

Table wordform_transform_instances (defines the relation between inflectional patterns and word form instances)
Transform_instance_id  Transformset_id  Stem_id  Analyzed_wordform_id
R1                     R2               Amare    A1

Table Wordforms
Wordform_id  Wordform
W1           Amas

Table analyzed_wordforms
analyzed_wordform_id  pos                     part_number  number_of_parts  parent_analysis_id  lemma_id  wordform_id
A1                    VRB(pres,2,sg,ind,act)  NULL         NULL             NULL                L1

Table Analyzed_wordforms
analyzed_wordform_id  Wordform_id  Number_of_parts
A1                    W1           1

Table stems
Stem_id  Stem_form  Lemma_id  Stem_type_id
S1       Amare      L1        ST1

Table Stem_types
Stem_type_id  Stem_type_name
ST1           Lemmaform

Status of this information: optional, internal
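
The substitution rules stored in Inflection_process can be executed directly. The following is a minimal sketch that restates the two example rules with Python's `re` module instead of Perl; the dictionary layout and the function `inflect` are assumptions made for illustration.

```python
import re

# Transform sets of the example: the Perl rule s/are$/o/ becomes a
# (pattern, replacement) pair for re.sub.
transformsets = {
    "R1": (r"are$", "o"),   # 1st person singular present active
    "R2": (r"are$", "as"),  # 2nd person singular present active
}


def inflect(stem_form, transformset_id):
    """Generate a word form by applying one substitution rule to a stem form."""
    pattern, replacement = transformsets[transformset_id]
    return re.sub(pattern, replacement, stem_form)


def expand(stem_form):
    """Generate all word forms that the known transform sets yield for a stem."""
    return {rule_id: inflect(stem_form, rule_id) for rule_id in transformsets}
```

Keeping the rule id alongside each generated form is what allows the database to record which inflection rule produced a given word form.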

5.3. Database information for stems

It may not be very practical to derive the complete paradigm from a single base form (e.g. for strong
or irregular verbs).
For this reason, we add a possibility to specify a number of alternate stem forms for a given lemma.

Table lemmata
Lemma_id  Modern_lemma  Lemma_pos
L1        Binden        VRB

[14] This example uses Perl 5 regular expression syntax

Table stems
Stem_id  Stem_form  Lemma_id  Stem_type_id
S1       Bind       L1        ST1
S2       Band       L1        ST2
S3       Bund       L1        ST3

Table stem_types
Stem_type_id  Name
ST1           Present tense stem
ST2           Past tense stem
ST3           Past participle stem

Status of this information: optional, internal

6. Lexical source

For the existence of other words, no verified evidence in texts may have been found. It is still desirable
to keep track of where they come from: incorporated from some other lexicon, obtained by expansion
from lemmata in historical dictionaries, obtained by automatic (and not manually verified) analysis of
historical documents. In the case of named entities, the lexical source information may serve to
preserve the link to the persistent identifier in the library named authority data.
When information is incorporated from lexica or dictionaries, labeling from these sources may be
copied (or mapped, often a nontrivial task; subject matter labels may be useful; regional or temporal
labeling may also be present). Of course not all words in the source lexicon have identical date, text
type, etc. Hence in this case, the information is specified in the source information record for the word
form.


Figure 11. Lexical source

Table lemmata
lemma_id  Modern_lemma     lemma_pos  ne_label
L202      Jan van de Wiel  NOU        NE_PER

Table lexical_source_lemma
Lemma_source_id  Labels            Lemma_id  Foreign_id  Lexicon_id
Ls1              Physics, Science  A1        0000330x    Lex1


Example: (word index of Van der Sijs)

woonachtig* wonende 1279 [CG11,423]
woord* klank met eigen betekenis 776880 [CG111Utr.Doopbelofte] {2.5}
woordenboek* dictionaire [Toll.]
worcestersaus kruidige saus 1900 [Sanders 1995] <Engels {4.1.6}
This word list gives us dates of occurrence which can be useful. The information is linked on the
lemma level.

Status of lexical source information: optional, internal

6.1. Ambiguity information

Especially for named entities, the information that a word form is also part of the general lexicon or of
another part of the named entity lexicon can be useful. Hence, we added some structure to indicate
ambiguity of a word form. This ambiguity information may derive from another lexicon or from
manual inspection.


7. Converting the database into LMF
7.1. Introduction.
In the previous chapters we described the structure of the database that is used for building the
lexicon. The final form of the lexicon, however, will be in the Lexical Markup Framework (LMF: ISO
24613:2008), for this is the standard for sharing computational lexicons.
In this chapter we first describe the structures in LMF that correspond with those that have been
discussed above in the format of a relational database.
Second, we will describe the method to compile an LMF version in XML format from the relational
database. The scripts that are required for this process and the instructions are provided in Appendix
[?].

7.2. Mappings
7.2.1. On notation
The ISO standard uses UML diagrams to represent LMF models. We will do the same in this
document. For convenience we will describe the most essential elements of these diagrams. The boxes
in Figure 12 represent elements in the XML structure. Above the line is the name of the element, and
below the names of the attributes.

Figure 12. Notation in UML

The arrow in Diagram A indicates that Element 2 is an aggregate of Element 1. Implemented in XML,
this means that Element 2 is embedded in Element 1.
The arrow in Diagram B indicates that Element 1 and 2 are associated and that Element 1 can send
messages to Element 2. Implemented in XML this means that Element 1 contains a pointer to Element
2.
7.2.2. Unlabelled word forms.
Unlabelled word forms have no linguistic data attached to them.

Figure 13. Unlabelled word forms.

The lexicon element is the top node in our description. The lexical entry (LE) corresponds with what
in the previous chapter has been labelled lemma. The LMF element word form corresponds with
the notion analyzed word form of the previous chapters. And the LMF element form_representation,
finally, corresponds with the notion word form of the previous chapters.
In case of the unlabelled word forms, the embedding elements lexical_entry and word_form will
contain no linguistic information.
7.2.3. Inflection (labelled word forms).
Labelled word forms have linguistic information attached to them. Information about the available set
of features is provided at the level of the LE. The features and values of word forms point to the
relevant features that reside under the lexical entry.

Figure 14. Inflection.

Note that the usage of the term Lemma in LMF is different from that in the previous chapters. In LMF
it contains a marker for the LE; usually the stem or base of the word. In this document the element
lemma contains the form of the modern lemma.
7.2.4. Composition.
The set of morphological patterns is attached at the level of the lexicon. LEs can have several
analyses, which all point to different morphological patterns.


Figure 15. Composition.

7.2.5. Spelling.
It is possible to specify the normalized spelling of word forms in a different (older) spelling.

Figure 16. Normalized spelling.

The patterns describe how the written form is derived from the normalized form.

7.2.6. Clitics.
Clitic combinations have a lot in common with composition. Both are considered agglutinations.
Clitics, however, are analyzed at the level of the word form.


Figure 17. Clitics.

Clitics are represented as LEs that have an aggregated word form, and an ordered list of components.
Components are references to other LEs.
7.2.7. Portmanteau.
A portmanteau specifies a relation between homograph lemmas (implemented as LEs). It has been
implemented in LMF format as a lexical entry containing a list of members. Each member in the list
points to a lexical entry. The concept of a List of Members is derived from the List of Components
that is used for e.g. composition and MWEs. The main difference is that a List of Components is an
ordered set and a List of Members is not ordered.

Figure 18. Portmanteau.

7.2.8. Transcategorization.
Transcategorisations specify homonym word forms from different LEs that typically differ in
part-of-speech type. Since there is usually a limited list of transcategorisation types in a language, this list is
located at the lexicon level.


Figure 19. Transcategorization.

For transcategorisations we use Lexical Entries of the type Categorisation, which point to the
corresponding Transcategorisation Type and rule. Further, the Lexical Entry contains a List of
Components to specify the elements of the transcategorisation. The reason why we do not use a list of
members as in the case of Portmanteaus is that a List of Components is ordered.
7.2.9. Multiword expressions.
Multiword expressions are added to the lexicon as Lexical
Entries. The analysis of that LE points to a Multiword Expression Pattern (MWE Pattern). These MWE
Patterns describe an ordered list of nodes, and in the description field the grammatical relation of these
nodes.


Figure 20. Multiword expressions.

7.2.10. Multiword named entities.
Multiword expressions are added to the lexicon as Lexical
Entries. The analysis of that LE points to a Multiword Expression Pattern (MWE Pattern). These MWE
Patterns describe an ordered list of nodes, and in the description field the grammatical relation of these
nodes.

Figure 21. Multiword named entities.

7.2.11. Attestations.
LMF provides a structure for examples of use of an LE. This structure (context) is subsumed under the
sense part of the LE. This is not fit for our purpose since we want the description of the context to
clarify the provenance of word forms.
We, therefore, have to create a new extension with new categories for this purpose.

Figure 22. Attestations.


In paragraph 2.5 we described three types of attestations. The types text attestation and token
attestation are attached to the form representations of analyzed word forms. The attestations of
unanalyzed word forms are located at the same position, but occur in lexical entries with unlabelled
word forms (see par. 7.2.2).

7.3. Converting relational data to XML.
In the previous section we described the LMF format in XML we want to use for the final form of the
lexicon. In this section we present a method for converting the content of the relational database into
XML. In Appendix [?] you will find the Perl script (relDB2xml.pl) that can be used for this. The script
is to be used in combination with a structure definition for a certain lexicon (language). Appendix [?]
contains the specification for the Dutch lexicon (NL_Structure.pl).

The script relDB2xml.pl is run without arguments. All specific data for the conversion are in a
separate (Perl) file which contains the mapping of tables to XML. The file also contains all details on the
database that contains the relational data. The reference to this file is specified near the top of
the script relDB2xml.pl.
The mapping specification is laid down in an array structure. Note that this is Perl code, so the
syntax must be exactly right.

The array has an embedded structure that roughly corresponds with the resulting XML.
There are three kinds of substructures: for tables, fields and XML elements.

Structure for XML elements.
The arrays for binding contain the following elements:
- element name REQUIRED
- list of arrays for subelements REQUIRED

This structure introduces an XML element which will contain all further data from its subelements.

Structure for tables.
Every array for entering tables contains these elements:
- connection type (->) REQUIRED
- selection criterion REQUIRED
- name of resulting XML element (can be empty string) REQUIRED
- table name REQUIRED
- list of arrays with subelements OPTIONAL

The selection criterion is essentially the WHERE clause in an SQL SELECT statement. The name of the
resulting XML element is used to specify the resulting XML subtree.
If the name is an empty string, no new elements will be introduced at that level, which means that the
fields of all records that result from the query will be siblings. If a simple name is specified, a
subelement with that name is introduced for every record that results from the query, in which the
fields of that record are embedded. If a path is specified (e.g. element_a.element_b), extra levels of
subelements will be introduced for every record.


Note that there are two ways to introduce XML substructures: using the Structure for XML elements
(example A), or using a path specification in the Structure for tables (example B).
A:
    ["collection",
        ["->",
         "lemmata.lemma_id=lexical_source.lemma_id",
         "source", "lexical_source_lemma"]]
B:
    ["->", "lemmata.lemma_id=lexical_source.lemma_id", "collection.source",
     "lexical_source_lemma"]
These will result in different structures when more than one record is found by the query:

A:
    <collection>
        <source><.. content of record 1 ..></source>
        <source><.. content of record 2 ..></source>
    </collection>
B:
    <collection>
        <source><.. content of record 1 ..></source>
    </collection>
    <collection>
        <source><.. content of record 2 ..></source>
    </collection>
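
The difference between the two outputs can be reproduced with a small script. The sketch below is written in Python with the standard `xml.etree.ElementTree` module and does not reproduce relDB2xml.pl; the wrapping `lexical_entry` element, the record contents and the function names are hypothetical and serve only to show where the nesting differs.

```python
import xml.etree.ElementTree as ET

# Two hypothetical records returned by the query on lexical_source_lemma.
records = ["content of record 1", "content of record 2"]


def build_variant_a(rows):
    """Variant A: one enclosing element, one <source> per record inside it."""
    parent = ET.Element("lexical_entry")
    collection = ET.SubElement(parent, "collection")
    for row in rows:
        ET.SubElement(collection, "source").text = row
    return parent


def build_variant_b(rows):
    """Variant B: the path collection.source is instantiated once per record."""
    parent = ET.Element("lexical_entry")
    for row in rows:
        collection = ET.SubElement(parent, "collection")
        ET.SubElement(collection, "source").text = row
    return parent


xml_a = ET.tostring(build_variant_a(records), encoding="unicode")
xml_b = ET.tostring(build_variant_b(records), encoding="unicode")
```

Variant A groups all records under a single collection element, whereas variant B repeats the whole path for every record, which matters as soon as a consumer expects exactly one collection per lexical entry.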

Structure for fields.
Every array for adding fields contains these elements:
- connection type () REQUIRED
- element name REQUIRED
- field name REQUIRED

The element name is the name of the XML element which will hold the value of the field specified by
the field name. The element name cannot be an empty string. The element name can be a path, in
which case extra levels of XML elements will be introduced (analogous to the examples presented
above).
The field name specifies the field that contains the value that has to be inserted into the XML.



Appendix A: Database schema

Table alternate_modern_lemmata
Field | Type | Null | Key | Default | Extra
alternate_lemma_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
alternate_lemma | varchar(255) | YES | | NULL |
base_lemma_id | bigint(20) unsigned | YES | MUL | NULL |

Table analyzed_wordforms
Field | Type | Null | Key | Default | Extra
analyzed_wordform_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
part_of_speech | varchar(255) | NO | MUL | |
lemma_id | bigint(20) unsigned | NO | MUL | |
wordform_id | bigint(20) unsigned | NO | MUL | |
multiple_lemmata_analysis_id | bigint(20) unsigned | NO | | |
derivation_id | bigint(20) unsigned | NO | MUL | |
verified_by | bigint(20) unsigned | YES | | NULL |
verification_date | datetime | YES | | NULL |

Table conversion_rules
Field | Type | Null | Key | Default | Extra
rule_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
main_pos | varchar(255) | YES | | NULL |
sub_pos | varchar(255) | YES | | NULL |
transcategorization_id | bigint(20) unsigned | YES | MUL | NULL |

Table corpora
Field | Type | Null | Key | Default | Extra
corpus_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
name | varchar(255) | YES | | NULL |

Table corpusId_x_documentId
Field | Type | Null | Key | Default | Extra
corpus_id | bigint(20) unsigned | NO | PRI | |
document_id | bigint(20) unsigned | NO | PRI | |

Table derivations
Field | Type | Null | Key | Default | Extra
derivation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
normalized_form | varchar(255) | YES | MUL | NULL |
pattern_application_id | bigint(20) unsigned | NO | | |

Table documents
Field | Type | Null | Key | Default | Extra
document_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
persistent_id | varchar(255) | YES | MUL | NULL |
word_count | bigint(20) unsigned | YES | | NULL |
encoding | bigint(20) unsigned | YES | | NULL |
title | varchar(255) | YES | | NULL |
year_from | bigint(20) unsigned | YES | | NULL |
year_to | bigint(20) unsigned | YES | | NULL |
pub_year | bigint(20) unsigned | YES | | NULL |
author | varchar(255) | YES | | NULL |
editor | varchar(255) | YES | | NULL |
publisher | varchar(255) | YES | | NULL |
publishing_location | varchar(255) | YES | | NULL |
text_type | varchar(255) | YES | | NULL |
region | varchar(255) | YES | | NULL |
language | varchar(255) | YES | | NULL |
other_languages | varchar(255) | YES | | NULL |
spelling | varchar(255) | YES | | NULL |
parent_document | bigint(20) unsigned | YES | MUL | NULL |

Table dont_show
Field | Type | Null | Key | Default | Extra
wordform_id | bigint(20) unsigned | NO | PRI | 0 |
document_id | bigint(20) unsigned | NO | PRI | 0 |
corpus_id | bigint(20) unsigned | NO | PRI | 0 |
at_all | tinyint(3) unsigned | NO | PRI | |
user_id | bigint(20) unsigned | NO | | |
date | datetime | NO | | |

Table group_attestations
Field | Type | Null | Key | Default | Extra
group_attestation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
token_id | bigint(20) unsigned | YES | | NULL |
quote | text | YES | | NULL |
analyzed_wordform_id | bigint(20) unsigned | NO | MUL | |
derivation_id | bigint(20) unsigned | NO | | |
wordform_group_id | bigint(20) unsigned | NO | | |

Table inflection_classes
Field | Type | Null | Key | Default | Extra
inflection_class_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
inflection_class_name | varchar(255) | YES | | NULL |

Table languages
Field | Type | Null | Key | Default | Extra
language_id | tinyint(3) unsigned | NO | PRI | NULL | auto_increment
language | varchar(255) | NO | UNI | |

Table lemma_feature_assignments
Field | Type | Null | Key | Default | Extra
assignment_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
feature_id | bigint(20) unsigned | YES | MUL | NULL |
value_id | bigint(20) unsigned | YES | MUL | NULL |
lemma_id | bigint(20) unsigned | YES | MUL | NULL |

Table lemma_feature_values
Field | Type | Null | Key | Default | Extra
lemma_feature_value_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
lemma_feature_value | varchar(255) | YES | | NULL |

Table lemma_features
Field | Type | Null | Key | Default | Extra
lemma_feature_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
lemma_feature_name | varchar(255) | YES | | NULL |

Table lemma_inflection_class
Field | Type | Null | Key | Default | Extra
lemma_inflection_class_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
lemma_id | bigint(20) unsigned | YES | MUL | NULL |
inflection_class_id | bigint(20) unsigned | YES | MUL | NULL |

Table lemmata
Field | Type | Null | Key | Default | Extra
lemma_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
modern_lemma | varchar(255) | YES | MUL | NULL |
gloss | varchar(255) | YES | | NULL |
persistent_id | varchar(255) | YES | | NULL |
lemma_part_of_speech | varchar(255) | YES | | NULL |
ne_label | varchar(255) | YES | | NULL |
portmanteau_lemma_id | bigint(20) unsigned | YES | MUL | NULL |
language_id | tinyint(3) unsigned | YES | | NULL |

Table lexica
Field | Type | Null | Key | Default | Extra
lexicon_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
lexicon_name | varchar(255) | YES | | NULL |

Table lexical_source_lemma
Field | Type | Null | Key | Default | Extra
lemma_source_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
label | varchar(255) | YES | | NULL |
lemma_id | bigint(20) unsigned | YES | MUL | NULL |
foreign_id | varchar(255) | YES | | NULL |
lexicon_id | bigint(20) unsigned | YES | MUL | NULL |

Table lexical_source_wordform
Field | Type | Null | Key | Default | Extra
wordform_source_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
foreign_id | varchar(255) | YES | | NULL |
label | varchar(255) | YES | | NULL |
wordform_id | bigint(20) unsigned | YES | MUL | NULL |
lexicon_id | bigint(20) unsigned | YES | MUL | NULL |

Table morphological_analyses
Field | Type | Null | Key | Default | Extra
morphological_analysis_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
arity | bigint(20) unsigned | YES | | NULL |
analyzed_lemma_id | bigint(20) unsigned | YES | MUL | NULL |
morphological_operation_id | bigint(20) unsigned | YES | MUL | NULL |

Table morphological_operations
Field | Type | Null | Key | Default | Extra
morphological_operation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
description | varchar(255) | YES | | NULL |
resulting_part_of_speech | varchar(255) | YES | | NULL |

Table multiple_lemmata_analyses
Field | Type | Null | Key | Default | Extra
multiple_lemmata_analysis_id | bigint(20) unsigned | NO | PRI | |
multiple_lemmata_analysis_part_id | bigint(20) unsigned | NO | PRI | |
part_number | bigint(20) unsigned | NO | PRI | |
nr_of_parts | tinyint(3) unsigned | NO | PRI | |

Table multiple_lemmata_analysis_parts
Field | Type | Null | Key | Default | Extra
multiple_lemmata_analysis_part_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
part_of_speech | varchar(255) | NO | MUL | |
lemma_id | bigint(20) unsigned | NO | | |

Table multiword_analyses
Field | Type | Null | Key | Default | Extra
multiword_analysis_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
arity | bigint(20) unsigned | YES | | NULL |
analyzed_lemma_id | bigint(20) unsigned | YES | MUL | NULL |
multiword_operation_id | bigint(20) unsigned | YES | MUL | NULL |


Table multiword_operations
Field | Type | Null | Key | Default | Extra
multiword_operation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
description | varchar(255) | YES | | NULL |
resulting_pos | varchar(255) | YES | | NULL |

Table ne_variant_relation_types
Field | Type | Null | Key | Default | Extra
ne_variant_relation_type_id | int(32) | NO | PRI | NULL | auto_increment
ne_variant_relation_name | varchar(255) | YES | | NULL |
ne_variant_relation_desciption | text | YES | | NULL |

Table ne_variant_relations
Field | Type | Null | Key | Default | Extra
first_lemma_id | int(32) | YES | MUL | NULL |
second_lemma_id | int(32) | YES | | NULL |
ne_variant_relation_type_id | int(32) | YES | | NULL |

Table paradigm_positions
Field | Type | Null | Key | Default | Extra
paradigm_position_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
paradigm_position_name | varchar(255) | YES | | NULL |
paradigm_position | bigint(20) unsigned | YES | | NULL |
paradigm_id | bigint(20) unsigned | YES | MUL | NULL |
transformset_id | bigint(20) unsigned | YES | MUL | NULL |

Table paradigms
Field | Type | Null | Key | Default | Extra
paradigm_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
paradigm_name | varchar(255) | YES | | NULL |

Table part_morphological_analysis
Field | Type | Null | Key | Default | Extra
part_morphological_analysis_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
part_number | bigint(20) unsigned | YES | | NULL |
part_lemma_id | bigint(20) unsigned | YES | MUL | NULL |
morphological_analysis_id | bigint(20) unsigned | YES | MUL | NULL |

Table part_multiword_analysis
Field | Type | Null | Key | Default | Extra
part_multiword_analysis_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
part_number | bigint(20) unsigned | YES | | NULL |
part_lemma_id | bigint(20) unsigned | YES | MUL | NULL |
multiword_analysis_id | bigint(20) unsigned | YES | MUL | NULL |

Table pattern_applications
Field | Type | Null | Key | Default | Extra
pattern_application_id | bigint(20) unsigned | NO | MUL | |
position | bigint(20) unsigned | YES | | NULL |
pattern_id | bigint(20) unsigned | YES | | NULL |
number_of_patterns | bigint(20) unsigned | NO | | |

Table patterns
Field | Type | Null | Key | Default | Extra
pattern_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
left_hand_side | varchar(64) | YES | MUL | NULL |
right_hand_side | varchar(64) | YES | | NULL |

Table stem_types
Field | Type | Null | Key | Default | Extra
stem_type_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
stem_type_name | varchar(255) | YES | | NULL |

Table stems
Field | Type | Null | Key | Default | Extra
stem_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
stem_form | varchar(255) | YES | MUL | NULL |
lemma_id | bigint(20) unsigned | YES | MUL | NULL |
stem_type_id | bigint(20) unsigned | YES | | NULL |

Table text_attestation_verifications
Field | Type | Null | Key | Default | Extra
document_id | bigint(20) unsigned | NO | PRI | |
wordform_id | bigint(20) unsigned | NO | PRI | |
verification_date | datetime | NO | | |
verified_by | bigint(20) unsigned | NO | | |

Table text_attestations
Field | Type | Null | Key | Default | Extra
attestation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
frequency | bigint(20) unsigned | YES | | NULL |
analyzed_wordform_id | bigint(20) unsigned | NO | MUL | |
document_id | bigint(20) unsigned | NO | | |

Table token_attestation_verifications
Field | Type | Null | Key | Default | Extra
document_id | bigint(20) unsigned | NO | PRI | |
wordform_id | bigint(20) unsigned | NO | PRI | |
start_pos | bigint(20) unsigned | NO | PRI | |
end_pos | bigint(20) unsigned | NO | | |
verification_date | datetime | NO | | |
verified_by | bigint(20) unsigned | NO | | |

Table token_attestations
Field | Type | Null | Key | Default | Extra
attestation_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
token_id | bigint(20) unsigned | YES | | NULL |
quote | text | YES | | NULL |
analyzed_wordform_id | bigint(20) unsigned | NO | MUL | |
derivation_id | bigint(20) | NO | | |
document_id | bigint(20) unsigned | NO | | |
start_pos | bigint(20) unsigned | NO | | |
end_pos | bigint(20) unsigned | NO | | |

Table transcategorization_types
Field | Type | Null | Key | Default | Extra
transcategorizationtype_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
description | varchar(255) | YES | | NULL |
main_pos | varchar(255) | YES | | NULL |
sub_pos | varchar(255) | YES | | NULL |

Table transcategorizations
Field | Type | Null | Key | Default | Extra
transcategorization_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
mainlemma_id | bigint(20) unsigned | YES | MUL | NULL |
sublemma_id | bigint(20) unsigned | YES | MUL | NULL |
transcategorizationtype_id | bigint(20) unsigned | YES | MUL | NULL |

Table transformsets
Field | Type | Null | Key | Default | Extra
transformset_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
inflection_process | varchar(255) | YES | | NULL |
formal_tag | varchar(255) | YES | | NULL |
stem_type_id | bigint(20) unsigned | YES | MUL | NULL |

Table type_frequencies
Field | Type | Null | Key | Default | Extra
type_frequency_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
frequency | bigint(20) unsigned | NO | | |
wordform_id | bigint(20) unsigned | NO | MUL | |
document_id | bigint(20) unsigned | NO | | |

Table users
Field | Type | Null | Key | Default | Extra
user_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
name | varchar(255) | YES | UNI | NULL |

Table wordform_groups
Field | Type | Null | Key | Default | Extra
wordform_group_id | bigint(20) unsigned | NO | PRI | |
document_id | bigint(20) unsigned | NO | PRI | |
onset | bigint(20) unsigned | NO | PRI | |
offset | bigint(20) unsigned | NO | PRI | |

Table wordform_transform_instance
Field | Type | Null | Key | Default | Extra
transform_instance_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
transformset_id | bigint(20) unsigned | YES | MUL | NULL |
stem_id | bigint(20) unsigned | YES | MUL | NULL |
analyzed_wordform_id | bigint(20) unsigned | YES | MUL | NULL |

Table wordforms
Field | Type | Null | Key | Default | Extra
wordform_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment
wordform | varchar(255) | NO | UNI | |
wordform_lowercase | varchar(255) | NO | MUL | |
lastviewed_by | bigint(20) | YES | | NULL |
lastview_date | datetime | YES | | NULL |
has_analysis | bit(1) | YES | | NULL |


Appendix B: Filters for the export of relevant subsets from the lexicon

Filters for various applications will be developed as the workflow for lexicon development and
deployment progresses. They can be implemented either as SQL queries on the database or, for
instance, as XSLT queries on the XML export format.

A simple example: produce a word list with frequencies for all documents from 1749.

create view lemma_wordform_attestation as
select modern_lemma, lemmata.lemma_id, wordform, part_of_speech,
       documents.document_id, documents.year_from, documents.year_to,
       documents.title, text_attestations.frequency
from lemmata, analyzed_wordforms, wordforms, text_attestations, documents
where analyzed_wordforms.lemma_id = lemmata.lemma_id
  and wordforms.wordform_id = analyzed_wordforms.wordform_id
  and text_attestations.analyzed_wordform_id = analyzed_wordforms.analyzed_wordform_id
  and text_attestations.document_id = documents.document_id;

select wordform, sum(frequency) as frequency
from lemma_wordform_attestation
where year_from = 1749 and year_to = 1749
group by wordform;
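
What the second query computes can be illustrated in plain Perl, without a database. This is only a sketch: the rows are invented sample data standing in for the view lemma_wordform_attestation, and the helper name wordlist is hypothetical.

```perl
use strict;
use warnings;

# Invented sample rows standing in for the view lemma_wordform_attestation.
my @rows = (
    { wordform => "huys",  year_from => 1749, year_to => 1749, frequency => 3 },
    { wordform => "huys",  year_from => 1749, year_to => 1749, frequency => 2 },
    { wordform => "huis",  year_from => 1749, year_to => 1749, frequency => 5 },
    { wordform => "huijs", year_from => 1650, year_to => 1650, frequency => 7 },
);

# Plain-Perl counterpart of the query: keep the rows of one year and add up
# the frequencies per word form.
sub wordlist {
    my ($year, @rows) = @_;
    my %sum;
    foreach my $row (@rows) {
        next unless $row->{year_from} == $year && $row->{year_to} == $year;
        $sum{$row->{wordform}} += $row->{frequency};
    }
    return %sum;
}

my %frequency = wordlist (1749, @rows);
foreach my $wordform (sort keys %frequency) {
    printf "%s\t%d\n", $wordform, $frequency{$wordform};
}
```

The two rows for the same word form are merged into a single entry, and the row from another year is left out, which mirrors the where and group by clauses.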

Appendix C: Script for converting relational data to LMF (XML): relDB2xml.pl.

The script writes output to STDOUT. After the keyword require, the name of the file containing the
structure and further parameters has to be provided.

use strict;
use DBI;
use HTML::Entities;

# Every array for inserting tables contains these elements:
#   - connection type ("->") REQUIRED
#   - selection criteria REQUIRED
#   - element name (can be empty string) REQUIRED
#   - table name REQUIRED
#   - list of arrays with subelement OPTIONAL
#
# Every array for inserting fields contains these elements:
#   - connection type ("-") REQUIRED
#   - element name (cannot be an empty string) REQUIRED
#   - field name REQUIRED
#
# arrays for binding contain the following elements:
#   - element name REQUIRED
#   - list of arrays for subelements REQUIRED

# The leading "./" is needed from Perl 5.26 on, where the current directory
# is no longer searched by require.
require "./NL_Structure.pl";

open (LOG, sprintf ">%s.log", getParam ("output"))
    or die sprintf "Unable to open log file: %s\n", $!;

my $dbh = DBI->connect (sprintf ("DBI:mysql:database=%s;host=%s",
        getParam ("database"), getParam ("databasehost")),
    getParam ("user"), getParam ("password"));
if (!defined ($dbh)) {
    die sprintf "Unable to connect: %s\n", $DBI::errstr;
}

printf "%s\n", xmlHeader (getParam ("dtd"));
printf "%s", buildXml ("", @{getLmf()});
$dbh->disconnect;
close (LOG);

# Builds the XML for one structure array. $super is the record (hash
# reference) of the enclosing table, $type is the first element of the array.
sub buildXml {
    my ($super, $type, @rest) = @_;
    if (@rest) {
        if ($type eq "->") { # handle table
            my ($constraint, $tag, $table) = splice (@rest, 0, 3);
            my @table = queryAggregate ($dbh, $super, $constraint, $table);
            my ($result, $openTag, $closeTag) = ("", "", "");
            if ($tag ne "") {
                # a path such as element_a.element_b introduces one level
                # of subelements per component
                my @path = split (/\./, $tag);
                $openTag = "<" . join ("><", @path) . ">";
                $closeTag = "</" . join ("></", reverse @path) . ">";
            }
            foreach my $record (@table) {
                $result .= sprintf "%s%s%s\n", $openTag,
                    join ("", map {buildXml ($record, @{$_})} @rest), $closeTag;
            }
            return $result;
        }
        elsif ($type eq "-") { # handle field
            my ($name, $key) = @rest;
            my @path = split (/\./, $name);
            return sprintf "<%s>%s</%s>\n", join ("><", @path), $$super{$key},
                join ("></", reverse @path);
        }
        else { # binding element
            if ($type =~ s!^([^.]+)\.!!) {
                # path: emit the first component and recurse on the remainder
                return sprintf "<%s>\n%s</%s>\n", $1,
                    buildXml ($super, $type, @rest), $1;
            }
            else {
                return sprintf "<%s>\n%s</%s>\n", $type,
                    join ("", map {buildXml ($super, @{$_})} @rest), $type;
            }
        }
    }
}

# Returns the records (hash references) of a table. With a constraint of the
# form leftTable.leftKey=rightTable.rightKey, only the records of rightTable
# whose rightKey equals the leftKey value of the enclosing record are returned.
sub queryAggregate {
    my ($dbh, $super, $constraint, $table) = @_;
    my $sth = "";
    if ($constraint ne "") {
        my ($leftTable, $leftKey, $rightTable, $rightKey) = split (/[.=]/,
            $constraint);
        my $query = sprintf "select * from %s where %s='%s'", $rightTable,
            $rightKey, $$super{$leftKey};
        $sth = $dbh->prepare ($query);
    }
    else {
        my $query = sprintf "select * from %s", $table;
        $sth = $dbh->prepare ($query);
    }
    $sth->execute or printf LOG "%s\n", $sth->errstr;
    my @result = ();
    my $hashref = "";
    while ($hashref = $sth->fetchrow_hashref) {
        push (@result, $hashref);
    }
    return @result;
}

# Returns the XML declaration, followed by a DOCTYPE line if a DTD is given.
sub xmlHeader {
    my ($name) = @_;
    if ($name ne "") {
        return sprintf "<?xml version='1.0'?>\n<!DOCTYPE lexicon SYSTEM '%s'>\n", $name;
    }
    else {
        return "<?xml version='1.0'?>\n";
    }
}
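
The way queryAggregate turns a selection criterion into a query can be tried out without a database connection. The sketch below is a variant under stated assumptions rather than the script's own code: the helper name constraint_to_query is hypothetical, and instead of interpolating the key value into the SQL string it returns a statement with a ? placeholder, which DBI fills in when the value is passed to execute.

```perl
use strict;
use warnings;

# Hypothetical helper: splits "leftTable.leftKey=rightTable.rightKey" the same
# way queryAggregate does and returns the SQL text plus the name of the field
# in the enclosing record whose value has to be bound to the placeholder.
sub constraint_to_query {
    my ($constraint, $table) = @_;
    return ("select * from $table") if $constraint eq "";
    my ($leftTable, $leftKey, $rightTable, $rightKey) = split /[.=]/, $constraint;
    return ("select * from $rightTable where $rightKey = ?", $leftKey);
}

# Without a constraint the whole table is selected.
my ($all) = constraint_to_query ("", "lemmata");

# With a constraint only the matching records of the right-hand table are selected.
my ($sql, $bind_field) = constraint_to_query (
    "lemmata.lemma_id=lexical_source_lemma.lemma_id", "lexical_source_lemma");

print "$all\n$sql (bind: $bind_field)\n";
```

With a live database handle, the pair could be used as $sth = $dbh->prepare ($sql) followed by $sth->execute ($$super{$bind_field}), so that quoting of the value is left to the driver.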

Appendix D: Structure Definition for the Dutch Lexicon.

The file contains Perl code. Two data structures are specified: a hash with some details for connecting
to a relational database, and the nested array that defines the XML structure. The parameter output
is used to provide a name for the log file. The keyword dtd is optional.
The file further contains two small functions needed to pass the data to the main script. These should
not be changed.

use strict;

my %params =
    ("output" => "NL_Lexicon",
     "database" => "EE3",
     "databasehost" => "impactdb.inl.loc",
     "password" => "impact",
     "user" => "impact",
     "dtd" => "NL_Structure.dtd"
    );

my $lmf =
["lexicon",
  # rule section
  ["->", "", "lemma_feature", "lemma_features"],
  ["->", "", "lemma_feature_value", "lemma_feature_values"],
  ["->", "", "inflection_class", "inflection_classes"],
  ["->", "", "derivation_pattern", "patterns"],
  ["->", "", "transcategorization_type", "transcategorization_types",
    ["->",
     "transcategorization_types.transcategorizationtype_id=conversion_rules.transcategorization_id",
     "rule", "conversion_rules"]
  ],
  ["->", "", "mwe_pattern", "multiword_operations",
    ["-", "multiword_operation_id", "multiword_operation_id"],
    ["-", "description", "description"],
    ["-", "resulting_pos", "resulting_pos"],
  ],
  ["->",
   "",
   "morphological_pattern.transformation_set.process",
   "morphological_operations"],
  ["->",
   "",
   "morphological_pattern.transformation_set",
   "transformsets",
    ["->",
     "transformsets.stem_type_id=stem_types.stem_type_id",
     "transform_category", "stem_types"],
    ["->",
     "transformsets.paradigm_position_name=paradigm_positions.paradigm_position_name",
     "process", "paradigm_positions",
      ["->",
       "paradigm_positions.paradigm_id=paradigms.paradigm_id",
       "paradigm", "paradigms"],
    ],
  ],
  # lexical entries
  ["->", "", "lexical_entry", "multiword_analyses",
    ["-", "multiword_analysis_id"],
    ["-", "multiword_operation_id", "mwe_pattern"],
    ["-", "arity", "arity"],
    ["list_of_components",
      ["->",
       "multiword_analyses.multiword_analysis_id=part_multiword_analysis.multiword_analysis_id",
       "component", "part_multiword_analysis",
        ["-", "part_number", "part_number"],
        ["-", "lemma_id", "part_lemma_id"]
      ]
    ],
  ],
  ["->", "", "lexical_entry", "transcategorizations",
    ["-", "", "transcategorization_type"],
    ["list_of_components",
      ["-", "component.mainlemma_id", "mainlemma_id"],
      ["-", "component.sublemma_id", "sublemma_id"],
    ]
  ],
  ["->", "", "lexical_entry", "lemmata",
    ["-", "lemma_id", "lemma_id"],
    ["-", "modern_lemma", "modern_lemma"],
    ["-", "gloss", "gloss"],
    ["-", "POS", "lemma_part_of_speech"],
    ["-", "ne_label", "ne_label"],
    # ["-", "language_id", "language_id"],
    ["-", "portmanteau_lemma_id", "portmanteau_lemma_id"],
    ["->",
     "lemmata.lemma_id=alternate_modern_lemmata.base_lemma_id",
     "alternate_modern_lemma", "alternate_modern_lemmata",
      ["-", "alternate_lemma", "alternate_lemma"],
    ],
    ["->",
     "lemmata.lemma_id=lemma_inflection_class.lemma_id",
     "inflection_class", "lemma_inflection_class",
      ["-", "inflection_class_id", "inflection_class_id"],
    ],
    ["->",
     "lemmata.lemma_id=lexical_source_lemma.lemma_id",
     "source",
     "lexical_source_lemma",
      ["-", "label", "label"],
      ["-", "foreign_id", "foreign_id"],
      ["-", "lexicon_id", "lexicon_id"],
    ],
    ["->", "lemmata.lemma_id=stems.lemma_id", "stem", "stems",
      ["-", "stem_form", "stem_form"],
      ["-", "stem_id", "stem_id"],
      ["->",
       "stems.stem_type_id=stem_types.stem_type_id",
       "",
       "stem_types",
        ["-", "name", "stem_type_name"],
      ],
    ],
    ["->",
     "lemmata.lemma_id=lemma_feature_assignments.lemma_id",
     "feature", "lemma_feature_assignments",
      ["->",
       "lemma_feature_assignments.feature_id=lemma_features.lemma_feature_id", "",
       "lemma_features",
        ["-", "feature_id", "feature_id"],
        ["-", "name", "lemma_feature_name"],
      ],
      ["->",
       "lemma_feature_assignments.value_id=lemma_feature_values.lemma_feature_value_id",
       "value", "lemma_feature_values",
        ["-", "value_id", "lemma_feature_value_id"],
        ["-", "value", "lemma_feature_value"],
      ]
    ],
    ["->",
     "lemmata.lemma_id=morphological_analyses.analyzed_lemma_id",
     "analysis", "morphological_analyses",
      ["-", "morphological_operation_id", "morphological_operation_id"],
      ["list_of_components",
        ["->",
         "morphological_analyses.morphological_analysis_id=part_morphological_analysis.morphological_analysis_id",
         "component", "part_morphological_analysis",
          ["-", "number", "part_number"],
          ["-", "lemma_id", "part_lemma_id"],
        ]
      ]
    ],
    ["->",
     "lemmata.lemma_id=analyzed_wordforms.lemma_id",
     "wordform",
     "analyzed_wordforms",
      ["->", "analyzed_wordforms.derivation_id=derivations.derivation_id",
       "", "derivations",
        ["pattern",
          ["->",
           "derivations.derivation_id=pattern_applications.derivation_id",
           "",
           "pattern_applications",
            ["-", "position", ""],
            ["->",
             "pattern_applications.pattern_id=patterns.pattern_id",
             "",
             "pattern",
              ["-", "left_hand_side", "left_hand_side"],
              ["-", "right_hand_side", "right_hand_side"],
            ]
          ]
        ]
      ],
      ["->",
       "analyzed_wordforms.wordform_id=lexical_source_wordform.wordform_id",
       "source", "lexical_source_wordform"],
      ["form_representation",
        ["->",
         "analyzed_wordforms.wordform_id=wordforms.wordform_id",
         "",
         "wordforms",
          ["-", "wordform_id", "wordform_id"],
          ["-", "written_form", "wordform"],
        ],
        ["->",
         "wordforms.analyzed_wordform_id=text_attestations.analyzed_wordform_id",
         "attestation", "text_attestations",
          ["-", "id", "attestation_id"],
          ["-", "frequency", "frequency"],
          ["-", "document_id", "document_id"],
        ],
        ["->",
         "analyzed_wordforms.analyzed_wordform_id=token_attestations.analyzed_wordform_id",
         "attestation", "token_attestations",
          ["-", "id", "attestation_id"],
          ["-", "token_id", "token_id"],
          ["-", "quote", "quote"],
          ["-", "derivation_id", "derivation_id"],
          ["-", "document_id", "document_id"],
          ["-", "start_pos", "start_pos"],
          ["-", "end_pos", "end_pos"],
        ],
      ],
      ["->",
       "analyzed_wordforms.analyzed_wordform_id=wordform_transform_instance.analyzed_wordform_id",
       "", "wordform_transform_instance"]
    ]
  ]
];

sub getLmf {
    return $lmf;
}

sub getParam {
    my ($key) = @_;
    return $params{$key};
}
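
Because the definition is an ordinary nested Perl array, it can be inspected before it is handed to relDB2xml.pl. The following sketch lists the tables a definition draws on, which may help when checking a definition against the schema in Appendix A. The helper tables_in and the reduced structure are hypothetical and only follow the array conventions described above.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the table names used in a structure
# definition by walking the nested arrays.
sub tables_in {
    my ($node) = @_;
    my ($type, @rest) = @$node;
    return () if $type eq "-";                       # field array: no table
    if ($type eq "->") {                             # table array
        my ($constraint, $element, $table, @subs) = @rest;
        return ($table, map { tables_in ($_) } @subs);
    }
    return map { tables_in ($_) } @rest;             # binding element
}

# Reduced, invented structure in the format of the definition above.
my $structure =
    ["lexicon",
     ["->", "", "lexical_entry", "lemmata",
      ["-", "lemma_id", "lemma_id"],
      ["->", "lemmata.lemma_id=stems.lemma_id", "stem", "stems",
       ["-", "stem_form", "stem_form"]]]];

my @tables = tables_in ($structure);
print join (", ", @tables), "\n";
```

The walk distinguishes the three kinds of arrays by their first element, in the same way buildXml does.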
