D-EE2 1 - Lexicon - Structure - Impact - Version - 1.3.2 PDF

Lexicon structure
IMPACT
EE2
LEXICON STRUCTURE
Deliverable number
DStart : Month 1
EE2.1
Lexicon structure
Internal
10
5
12
3
INL
DN
LM ON
B
U
B
6
2
2
Deliverable name
Internal/external
Participant number
Participant short name
Estimated personmonths per participant
for this deliverable
Dissemination level1
Due Date: Month 6

Actual delivery: Month 7
7
UGOE
CO
Revisions
Version
Status
Date
1.1
1.2
1.3
1.3.1
1.3.2
Changes
Corrected error in Appendix E

Added structure for multiword NEs
Minor adaptations to the structure for NEs
Added NE part information (sect. 3.14)
Approvals
This document requires the following approvals

Version
Date of
approval
Name
Role in project
signature
Max Kaiser
Hildelies Balk
Sub-Project Leader
Coordinator
Distribution
This document was sent to:

Version
Date of sending Name
Role in project
July 1, 2008 Klaus Schulz

Uli Reffle
Barbara Pfeiffer
Participant in EE2
Participant in EE2
Participant in EE2
PU Public; PP Restricted to other programme participants (including the Commission Services); RE restricted
to a group specified by the consortium (including the Commission Services); CO Confidential only for
members of the consortium (including the Commission Services).
IMPACT Lexicon database structure
Lexicon structure
1. Introduction
2. Information attached to word forms
   2.1. Database information for unlabeled word forms
   2.2. Part of speech
   2.3. Lemma
   2.4. Paradigmatic relation between word form and lemma
   2.5. Attestation
      2.5.1. Attestations on the token level
      2.5.2. Attestations on the text level
      2.5.3. Verifying non-analyzed word forms
   2.6. Derivations
   2.7. Documents, corpora and workflow management
3. Information attached to lemmata
   3.1. Lemma id
   3.2. Modern lemma form
   3.3. Lexical part of speech
   3.4. Gender and other possible grammatical features
   3.5. Named entity label
   3.6. Inflectional class(es)
   3.7. Language
   3.8. Gloss
   3.9. Multiword expressions
      3.9.1. Multiword named entity lemmata
   3.10. Morphological analysis
   3.11. Unresolved ambiguity in lemma assignment
      3.11.1. Portmanteau lemmata
      3.11.2. Transcategorisation (conversion), sublemma and main lemma
   3.12. Adding custom information on the lemma level
   3.13. Additional structure for related entries in NE lexica
   3.14. Named entity parts
4. Information on the document level
5. Auxiliary information for word form synthesis and analysis
   5.1. Data to support the modelling of orthographic variation
   5.2. Information about paradigmatic expansion
   5.3. Database information for stems
6. Lexical source
   6.1. Ambiguity information
7. Converting the database into LMF
   7.1. Introduction
   7.2. Mappings
      7.2.1. On notation
      7.2.2. Unlabelled word forms
      7.2.3. Inflection (labelled word forms)
      7.2.4. Composition
      7.2.5. Spelling
      7.2.6. Clitics
      7.2.7. Portmanteau
      7.2.8. Transcategorization
      7.2.9. Multiword expressions
      7.2.10. Multiword named entities
      7.2.11. Attestations
   7.3. Converting relational data to XML
8. References
Appendix A: Database schema
Appendix B: Filters for the export of relevant subsets from the lexicon
Appendix C: Script for converting relational data to LMF (XML): relDB2xml.pl
Appendix D: Structure Definition for the Dutch Lexicon
1. Introduction
IMPACT lexica are computational lexica which will be used in two ways: in OCR, to enhance word
recognition, and in Enrichment, to enable variation-independent searches. The core database objects
are word forms, lemmata and documents [2]. All other objects define some kind of relation between
these.
In order to enable the OCR's spell-checking mechanism to assess the plausibility of the occurrence
of a word in a certain text, it is not sufficient to convert existing lexica and dictionaries into
a large word list. We also need to
1. Keep track of the sources from which we took the words (Lexical Source, cf. section 6)
2. List the actually encountered words in the language and record occurrences in actual texts,
   with frequency information (attestation, cf. section 2.5)
3. Record in what kind of texts these words occur (document properties, cf. section 4)
It is impossible to extract all possible word forms from the limited amount of available reliably
transcribed historical text. Hence, we need mechanisms to extend the lexicon and to be able to
assess the plausibility of hypothetical words without previous attestations, i.e. words we have not
seen before. Supporting data for these mechanisms have to be present in the database, such as:
1. Unknown inflected forms of lemmata which already are in the database can be dealt with by
   means of the automatic expansion from the lemma to the full paradigm of word forms
   (paradigmatic expansion; the database information for this purpose is discussed in section 5)
2. New spellings of known words can be dealt with by developing a good model of the spelling
   conventions of the period at hand (cf. section 5.1 for the storage of orthographic variant
   patterns)
3. Previously unseen compounds can be dealt with by means of a good model of word formation
   (cf. section 3.10 for the associated database information)
In order to effectuate word searches without having to worry about inflection and variation of
word forms, Enrichment will use modern lemmata as variation-independent retrieval keys for the
full spectrum of inflectional and orthographical variation.
The database structure is most conveniently discussed by dividing the information into a few main
blocks:
1. Information attached to word forms, either unlabeled (i.e. not yet lemmatized or labeled with
   Part of Speech) or labeled (i.e. with lemma and possibly PoS), cf. section 2.
2. Information attached to lemmata (section 3)
3. Information about documents, parts of documents, document collections (section 4)
4. Auxiliary information needed for expansion and for plausibility-of-new-words prediction
   (section 5)
5. Lexical Source (section 6)
[2] Document is understood here as a sequence of words, together with the document metadata (section 4)
Status of information: external or internal, optional or mandatory
Part of the lexicon database information is intended to be delivered to other work packages; other
information is present because it is useful in the lexicon building process.
We specify which information is really a deliverable part of the EE3 output.
A survey of the database fields can be found in the Database schema (Appendix A). Appendix B
briefly touches on the lexicon API in development. An XML interchange format is proposed in
Appendix C.
2. Information attached to word forms
There are two distinct objects in the database on the word form level: unlabeled word forms (i.e.
without linguistic information attached to them) and labeled ones (i.e. labeled with lemma and part
of speech).
2.1. Database information for unlabeled word forms
Unlabeled word forms may be used in OCR. They only need to be attested in texts. Attestation
information is the only kind of information we link to unlabeled forms (cf. section 2.5, attestation).
Status of attestation information for unlabeled word forms: Mandatory, external (use in TR5)
2.2. Part of speech
Each labeled word form is linked to one or several lemmata and assigned a Part of Speech
(part_of_speech) label. This grammatical tag [3] is more specific than the one assigned to the lemma
(cf. 3.3), as it may include information about inflection, tense, number.
It is not yet clear how much detail needs to be included in IMPACT lexica. We might accept a certain
level of underspecification, because clearly, the distinction between formally identical positions in
the paradigm is beyond the scope of IMPACT. So instead of tagging loopt as a second or third person
singular (which means a lot of effort has to be put into disambiguation), we may mark it simply as a
finite verb ending with -t. [4]
Status of this information: Part of speech is not externally required, but hardly dispensable, as the
relation between a lemma and its inflected forms cannot be defined without it.
2.3. Lemma
Field content: the ID of the relevant lemma object.
Status: mandatory
2.4. Paradigmatic relation between word form and lemma
It is essential that the lexicon makes explicit the paradigmatic relations between lemmata and their
word forms.
[3] To refer to this grammatical tag as Part of Speech is an abuse of terminology.
[4] Cf. Bie (2004) for a discussion of this distinction between a morphological word (finite verb
ending with -t) and a morphosyntactic word (third person singular).
Inflected forms
This information is not about the formal structure of the inflected form, but merely serves to
interlink lemmata and inflected forms. This link is stored in objects of type AnalyzedWordform, which
have a PoS property and link to the lemma on the one hand and to the wordform on the other hand. See
Figure 1 for a representation of the database structure and Table 1 for an example.
Figure 1: database model [5] for analyzed wordforms [6]
Table 1.

Table lemmata
lemma_id  modern_lemma  lemma_part_of_speech
L1        Marcher       VRB

Table analyzed_wordforms
analyzed_wordform_id  part_of_speech   lemma_id  wordform_id
A1                    VRB(fin,-erons)  L1        W00001

Table wordforms
wordform_id  wordform
W00001       marcherons
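
The three tables of Table 1 can be sketched as a small in-memory SQLite database. This is only an illustration: the table names, column names and example rows are taken from Table 1, while the column types, the constraints and the helper `analyses_for` are assumptions and not part of the IMPACT schema.

```python
import sqlite3

# Column types and constraints are assumptions; only the table names,
# column names and example rows come from Table 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemmata (
    lemma_id             TEXT PRIMARY KEY,
    modern_lemma         TEXT NOT NULL,
    lemma_part_of_speech TEXT NOT NULL
);
CREATE TABLE wordforms (
    wordform_id TEXT PRIMARY KEY,
    wordform    TEXT NOT NULL
);
CREATE TABLE analyzed_wordforms (
    analyzed_wordform_id TEXT PRIMARY KEY,
    part_of_speech       TEXT,
    lemma_id             TEXT REFERENCES lemmata(lemma_id),
    wordform_id          TEXT NOT NULL REFERENCES wordforms(wordform_id)
);
INSERT INTO lemmata VALUES ('L1', 'Marcher', 'VRB');
INSERT INTO wordforms VALUES ('W00001', 'marcherons');
INSERT INTO analyzed_wordforms VALUES ('A1', 'VRB(fin,-erons)', 'L1', 'W00001');
""")


def analyses_for(wordform):
    """Return (modern_lemma, part_of_speech) pairs for a surface word form."""
    query = """
        SELECT l.modern_lemma, a.part_of_speech
        FROM wordforms AS w
        JOIN analyzed_wordforms AS a ON a.wordform_id = w.wordform_id
        JOIN lemmata AS l ON l.lemma_id = a.lemma_id
        WHERE w.wordform = ?
        ORDER BY a.analyzed_wordform_id
    """
    return conn.execute(query, (wordform,)).fetchall()


print(analyses_for("marcherons"))
```

The join runs from the word form through analyzed_wordforms to the lemma, which is the direction Enrichment needs when it maps a historical word form to its modern lemma.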
Clitic combinations [7]
Clitic combinations will be lemmatized by assigning an ordered sequence of lemmata. A word form
like sboexs (= des Buches) will thus be lemmatized (HET, BOEK). In the database, the ordering will be
reflected by assigning a sequence number to the lemma parts (see Figure 2 and Table 2). Each part
will have its own part of speech. Thus, the complete Lemma-PoS assignment for sboexs will be
Sboexs ~ {(1, HET/DAT, PRN), (2, BOEK, NOU(infl=s))}.
The sequence numbers are included to distinguish between words like kzag and zagk.
Comment
This treatment of clitic combinations serves the following purposes: the lemma parts can be used as
search keys, while the combination of all parts serves as a variation-independent key grouping
different realizations of basically the same clitic combination.
A segmentation of the clitic combination as a sequence of word forms is not included in the database
because this is, in many cases, problematic because of sandhi phenomena, cf. Middle Dutch dat = dat +
het, Middle German deist = da + ist.
Clitic combinations are very common in Italian and Spanish (damelo = da + me + lo, "give me it").
They are also quite common in Middle German (deist = da + ist, enloufen = en (not) + laufen, etc.).
[5] The diagrams in this document are in Crow's Foot notation. They have been generated from the
database by MySQL Workbench 5.0.22. As a result, all relations are annotated as being of the 1:m type
with both referencing and referenced table marked as mandatory. This means that some of the logical
constraints are not accurately reflected in the diagrams.
[6] The structure changed with respect to the previous version. Instead of just a flat sequence,
hierarchy is now possible. It is unlikely that we will use this very much, but we had to incorporate
the possibility of having at least two levels because of clitic combinations occurring inside
multiword expressions.
[7] We use the term clitic combination to refer to word forms like Dutch neemtse, which is a
combination of a finite verb form (neemt = German nimmt) and an unstressed, phonetically reduced
pronoun (ze). This phenomenon is much more frequent in historical (and dialectal) Dutch than in
German. Clitics may be attached to other word classes like conjunctions, and more than one clitic can
be attached to a single word (cf. indienmense ~ German indem man sie).
Figure 2. Multiple lemmata analysis.
Table 2: example data for an analyzed clitic combination

Table lemmata
Lemma_id  modern_lemma  lemma_part_of_speech
L1        Ik            PRN
L2        Zij           PRN
L3        Zien          VRB

Table analyzed_wordforms
Analyzed_wordform_id  Pos     part_number  Multiple_lemma_analysis_id  lemma_id  wordform_id
A1                    CLITIC  NULL         Mla_1                       NULL      W00001

Table multiple_lemmata_analyses
Multiple_lemmata_analysis_id  multiple_lemmata_analysis_part_id  Nr_of_parts  Part_number
Mla_1                         Mlap_1
Mla_1                         Mlap_2
Mla_1                         Mlap_3

Table multiple_lemmata_analyses_parts
Multiple_lemmata_analysis_part_id  multiple_lemmata_analysis_id  Part_nr  POS  Lemma_id
Mlap_1                             Mla_1                                  PRN  L1
Mlap_2                             Mla_1                                  VRB  L3
Mlap_3                             Mla_1                                  PRN  L2

Table wordforms
Wordform_id  Wordform
W1           ksachse
Status of this information: mandatory when applicable (when clitic combinations are prominent in
the language, something has to be done about them). Use: internal and external (TR5 has to be able to
deal with the clitic combinations as well).
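
The role of the sequence numbers can be illustrated with a short Python sketch. It is not the database implementation: the class `LemmaPart` and both helper functions are hypothetical, the part numbers 1 to 3 for ksachse are an assumption (the example table does not show them), and the analyses given for kzag and zagk are likewise assumed for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LemmaPart:
    part_number: int  # position of the part within the clitic combination
    lemma: str        # modern lemma of the part
    pos: str          # part of speech of the part


def combination_key(parts):
    """Order-sensitive key: (lemma, pos) pairs sorted by part number."""
    ordered = sorted(parts, key=lambda part: part.part_number)
    return tuple((part.lemma, part.pos) for part in ordered)


def search_keys(parts):
    """Order-insensitive retrieval keys: the lemmata of the individual parts."""
    return {part.lemma for part in parts}


# ksachse, stored in arbitrary row order; part numbers are assumed.
ksachse = [
    LemmaPart(2, "Zien", "VRB"),
    LemmaPart(1, "Ik", "PRN"),
    LemmaPart(3, "Zij", "PRN"),
]
# Assumed analyses: the same two lemmata in opposite order.
kzag = [LemmaPart(1, "Ik", "PRN"), LemmaPart(2, "Zien", "VRB")]
zagk = [LemmaPart(1, "Zien", "VRB"), LemmaPart(2, "Ik", "PRN")]

print(combination_key(ksachse))
print(combination_key(kzag) != combination_key(zagk))
print(search_keys(kzag) == search_keys(zagk))
```

Both kzag and zagk are found when searching for either lemma, yet the ordered key keeps them apart as different combinations.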
2.5. Attestation
One of the most important tasks for the IMPACT lexicon building process is to keep track of the
origin of word forms. An unstructured, ever-growing set of word forms, without information about the
kind of text (in terms of period and subject matter) in which we can expect the words to occur, is
neither usable in text recognition nor in enrichment. Hence, to each labeled or unlabeled word form,
we link attestation objects which are basically just verified occurrences of the words in documents.
The attestations enable us to derive the relevant information about the domain of applicability of
word forms from the properties of the documents they occur in.
When a word form is taken from a lexicon or dictionary, or it originates from automatic analysis
expansion, we also keep track of its provenance. This is covered in the next section.
Besides the link to the relevant word form and a location in a document, the attestation objects
contain the following information:
   Verification (yes/no): Is the occurrence of a labeled word form checked manually by an expert?
   Frequency in a document or document collection
Several distinct kinds of attestation may be relevant: we may just link a word form to a document,
recording the frequency of occurrence (attestation at text level), or we may link to an individual
occurrence of the word (attestation at the token level) [8]. The latter kind of attestation is
especially relevant to tagged corpora. In the lexicon building workflow, lemmata may first be assigned
on the text level, and ambiguity is not completely resolved. At a later stage, ambiguity may be
resolved by assigning lemmata on the token level.
[8] A type is a word form, a token is a particular instance (occurrence) of the type in a text.
Figure 2: database model for the attestation of word forms in documents [9]
2.5.1. Attestations on the token level
The representation of a lemmatized fragment in the database:
Everybody is loved by somebody?

Table lemmata
lemma_id  modern_lemma  lemma_part_of_speech
l1        EVERYBODY     PRN
l2        BE            VRB
l3        LOVE          VRB
l4        LOVED         ADJ
l5        BY            ADP
l6        SOMEBODY      PRN

Table wordforms
wordform_id  Wordform
wf1          Everybody
wf2          Is
wf3          Loved
wf4          By
wf5          Somebody

Table analyzed_wordforms
analyzed_wordform_id  Part_of_speech  lemma_id  wordform_id
ana1                  PRN             l1        wf1
ana2                  VRB(3sg)        l2        wf2
ana3                  VRB(part)       l3        wf3
ana4                  ADJ             l4        wf3
ana5                  ADP             l5        wf4
ana6                  PRN             l6        wf5

Table token_attestations
Attestation_id  Quote  analyzed_wordform_id  Document_id  onset  offset
1               NULL   ana1                  text1        0      8
2               NULL   ana2                  text1        9      11
3               NULL   ana3                  text1        12     17
4               NULL   ana4                  text1        12     17
5               NULL   ana5                  text1        18     20
6               NULL   ana6                  text1        21     29
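
In the example, the token at 12-17 carries two attestations (ana3 and ana4), so its lemma assignment is still ambiguous. A minimal Python sketch of how such unresolved tokens could be found is given below; the rows are copied from the token_attestations example, and the function names are hypothetical.

```python
from collections import defaultdict

# (attestation_id, analyzed_wordform_id, document_id, onset, offset),
# copied from the token_attestations example.
token_attestations = [
    (1, "ana1", "text1", 0, 8),
    (2, "ana2", "text1", 9, 11),
    (3, "ana3", "text1", 12, 17),
    (4, "ana4", "text1", 12, 17),
    (5, "ana5", "text1", 18, 20),
    (6, "ana6", "text1", 21, 29),
]


def analyses_by_token(attestations):
    """Group analyzed word form ids by the token span they are attested on."""
    by_span = defaultdict(list)
    for _attestation_id, analysis, document, onset, offset in attestations:
        by_span[(document, onset, offset)].append(analysis)
    return dict(by_span)


def ambiguous_tokens(attestations):
    """Return the token spans that still carry more than one analysis."""
    return {
        span: analyses
        for span, analyses in analyses_by_token(attestations).items()
        if len(analyses) > 1
    }


print(ambiguous_tokens(token_attestations))
```

Resolving the ambiguity at a later stage then amounts to removing all but one attestation for the span in question.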
[9] We only give the diagram for attestations of labeled word forms. The diagram for attestations of
unlabeled word forms is completely analogous.
2.5.1.1. Token group attestations
There are several ways in which a group of graphical tokens can be linked to a single analysis of the
group as a whole.
1. Two or more tokens are joined by a wordform group; the group is analyzed as a whole. This
   is typically the case for more or less accidental split realizations of non-compound word
   forms like gelopen as ge lopen.
   In these cases all group members get the same analysis as a token attestation and all group
   members are mentioned in the wordform_groups table. The group_attestations table (cf. 2. below)
   is not used in this case. The following tables are used:
2. The tokens are joined in a wordform group, but the individual tokens have an analysis of their
   own. This is applicable to
   1) Idiomatic expressions (not tackled as such in IMPACT)
   2) Multiword named entities. E.g. Benedykta Chmielowskiego is analyzed as a compound
      word form for the NE lemma Benedykt Chmielowski; but we also do not wish to omit the
      information that Benedykta belongs to the lemma Benedykt, and Chmielowskiego belongs
      to Chmielowski.
The following structure is present in the database for this purpose: wordform_groups serves to link
several tokens by a single group id. Group_attestations gives the possibility to link such a group of
tokens as attestation data to analyzed_wordforms.
To give an example, suppose we have the following short sentence:
To Jest Przez Xiędza Benedykta Chmielowskiego Dziekana Rohatyńskiego, Firlejowskiego,
Podkamienieckiego Pasterza.
Token           Raw token with punctuation  Character offset of start of token  Character offset of end of token
To              To
Jest            Jest
Przez           Przez                                                           13
Xiędza          Xiędza                      14                                  20
Benedykta       Benedykta                   21                                  30
Chmielowskiego  Chmielowskiego              31                                  45
Dziekana        Dziekana                    46                                  54
Rohatyńskiego   Rohatyńskiego,              55                                  68

lemmata
lemma_id  modern_lemma          lemma_pos
          to                    PRN
          być                   VRB
          przez                 ADP
          Ksiądz                NOU
          Benedykt              NOU
          Chmielowski           NOU
          Benedykt Chmielowski  NOU

wordforms
wordform_id  Word form
wf1          To
wf2          Jest
wf3          Przez
wf4          Xiędza
wf5          Benedykta
Wf6          Chmielowskiego
Wf7          Benedykta Chmielowskiego

analyzed_wordforms
Analyzed_word_form_id  Pos  Multiple_lemma_analysis_id  lemma_id  wordform_id
A1                     PRN  NULL                                  Wf1
A2                     VRB  NULL                                  Wf2
A3                     ADP  NULL                                  Wf3
A4                     NOU  NULL                                  Wf4
A5                     NOU  NULL                                  Wf5
A6                     NOU  NULL                                  Wf6
A7                     NOU  NULL                                  Wf7

wordform_groups
Wordform_group_id  document_id  onset  offset
                   text1        21     30
                   text1        31     45

group_attestations
Group_attestation_id  Wordform_group_id  Analyzed_wordform_id
                                         Ana7

token_attestations
attestation_id  document_id  start_pos  end_pos
A1              text1
A2              text1
A3              text1                   13
A4              text1        14         20
A5              text1        21         30
A6              text1        31         45
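
The interplay of token attestations and group attestations in this example can be sketched as follows. The spans and the analysis ids A5, A6 and A7 come from the tables above; the group id "g1" and the group attestation id "ga1" are hypothetical, since the example tables do not show them, and `analyses_for_span` is an illustrative helper rather than part of the database.

```python
# Token-level analyses for the two name tokens (spans from the example).
token_attestations = {
    ("text1", 21, 30): "A5",  # Benedykta
    ("text1", 31, 45): "A6",  # Chmielowskiego
}

# Both tokens belong to one wordform group; "g1" is a hypothetical id.
wordform_groups = [
    ("g1", "text1", 21, 30),
    ("g1", "text1", 31, 45),
]

# The group as a whole attests analysis A7 (Benedykta Chmielowskiego);
# "ga1" is a hypothetical id.
group_attestations = [("ga1", "g1", "A7")]


def analyses_for_span(document, onset, offset):
    """Return (own analysis, group-level analyses) for one token span."""
    own = token_attestations.get((document, onset, offset))
    groups = {
        group_id
        for group_id, doc, start, end in wordform_groups
        if (doc, start, end) == (document, onset, offset)
    }
    group_level = [
        analysis
        for _attestation_id, group_id, analysis in group_attestations
        if group_id in groups
    ]
    return own, group_level


print(analyses_for_span("text1", 21, 30))
print(analyses_for_span("text1", 31, 45))
```

Each name token thus keeps its own lemma (Benedykt, Chmielowski) while also counting as evidence for the multiword NE lemma.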
2.5.2. Attestations on the text level
This type of attestation is linked to the occurrences of a word in text, without specifying the
location in the document. It is important in our workflow that partially disambiguated information
can also be stored and used, i.e. several attestations may be linked to the same type or token.

Table text level attestations.
Attestation_id  Frequency  Verified  Analyzed_wordform_id  Document_id
Tla1            23         True      A100                  Text1

2.5.3. Verifying non-analyzed word forms
In some contexts word forms can be attested for which no analysis is available. For this reason the
table token_attestation_verifications is introduced in the database. Attestations of this type link
directly to word forms.
In some cases the annotator might decide not to assign a lemma to the token. The token is then
marked as verified. Verified tokens might be revisited at a later stage.
Status of attestation information: mandatory, external (for use in TR5), and for internal use in EE2
and EE3
2.6. Derivations
Figure 3. Derivations
Word forms can get a more elaborate analysis than just a part of speech and a gloss. A modern word
form can be attached, and possibly also a set of patterns that describes how to get to the older word
form from the modern one. E.g.:
theyle,<teile>,[(t_th,0),(ei_ey,1)],NOUN,teil
Here the part between angled brackets (<>) describes the modern word form, and the part between
square brackets ([]) describes the patterns.
Table derivations
derivation_id           Identifier of the derivation.
normalized_form         The modern word form. Can be NULL.
pattern_application_id  Identifier of pattern application if applicable. Can be empty, in
                        which case it is 0 (nil, not NULL).

Table pattern_applications
pattern_application_id  Identifier of the derivation. NOTE that this is NOT a primary key.
                        Rather it is used to group several patterns together. The unique key
                        of this table is composed of all the fields together.
position                The position in the string that the pattern is applied to (0 and 1 in
                        the example).
number_of_patterns      The number of patterns that go with this analysis (two in the
                        example above). This number is in a way redundant, because it is
                        always the same as the number of records sharing the same
                        identifier. Storing the number here, however, makes some queries a
                        lot faster and easier.
pattern_id              Identifier of the pattern associated.

Table patterns
pattern_id              Identifier of the pattern.
left_hand_side          The left-hand side of the pattern: what is left of the underscore.
                        So t and ei in the example above.
right_hand_side         The right-hand side of the pattern: what is right of the underscore.
                        So th and ey in the example above.
Please note that both patterns and modern word forms can be empty.
In other words
theyle,[(t_th,0),(ei_ey,1)],NOUN,teil
theyle,<teile>,NOUN,teil
are both valid analyses.
If there are patterns but no modern word form (as in the first example above), a row in the
derivations table is created nonetheless to tie the patterns and the analyzed word form together. Its
modern word form field will be left empty, however.
If there is no pattern but a modern word form is provided (as in the second example above), then
there will just be a row in the derivations table and no corresponding pattern applications nor
patterns.
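
How a set of patterns leads from the modern word form to the older one can be sketched in Python. This is an illustration under two assumptions that the text does not state explicitly: positions are zero-based character offsets into the modern word form, and patterns within one analysis do not overlap. The function names are hypothetical.

```python
def parse_pattern(text, position):
    """Split 't_th' into ('t', 'th') and attach the position."""
    left_hand_side, right_hand_side = text.split("_", 1)
    return left_hand_side, right_hand_side, position


def apply_patterns(modern_form, patterns):
    """Rewrite a modern word form into the historical one.

    patterns is a list of (left_hand_side, right_hand_side, position).
    Positions are assumed to be zero-based offsets into modern_form.
    """
    result = modern_form
    # Work from right to left so that earlier positions stay valid even
    # when a replacement changes the length of the string.
    for lhs, rhs, position in sorted(patterns, key=lambda p: p[2], reverse=True):
        if result[position:position + len(lhs)] != lhs:
            raise ValueError(f"pattern {lhs}_{rhs} does not match at {position}")
        result = result[:position] + rhs + result[position + len(lhs):]
    return result


patterns = [parse_pattern("t_th", 0), parse_pattern("ei_ey", 1)]
print(apply_patterns("teile", patterns))
```

With the patterns of the example, the modern form teile is rewritten to theyle; an empty pattern list returns the modern form unchanged.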
2.7. Documents, corpora and workflow management
In order to record provenance details, the database is provided with the structure depicted in
Figure 4.
Figure 4. Documents, corpora and workflow tables.
Documents can be organized in corpora. An important reason for this is the allocation of properties
to a large number of documents at once.
The table type_frequencies contains the relations between word forms and documents. When a
document is to be annotated, all of its word forms are added to the table wordforms (unless there
already exists an entry for that word form). Simultaneously, the frequency of the word forms
occurring in the document is registered in table type_frequencies.
The table dont_show can be used during the building of the lexicon. Certain word forms (e.g.
frequent function words) should not be presented to the annotators over and over again during the
process of attesting documents and corpora. It is possible to exclude certain word forms from
attesting in a certain document, in a certain corpus, or in all documents and corpora.
Table dont_show
Wordform_id  Document_id  Corpus_id  At_all  User_id  Date
Wf201                     SG1873             1        15-01-2010
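
The registration of a document and the effect of dont_show can be sketched with plain Python dictionaries. This is not the actual workflow code: whitespace tokenization, the id scheme `wf1`, `wf2`, ... and both function names are assumptions made for the illustration, and the id scheme only works when starting from an empty wordforms mapping.

```python
from collections import Counter


def register_document(document_id, tokens, wordforms, type_frequencies):
    """Add unseen word forms and record per-document type frequencies.

    wordforms maps a word form to its id; type_frequencies maps
    (wordform_id, document_id) to a count.
    """
    for form, frequency in Counter(tokens).items():
        if form not in wordforms:
            # Hypothetical id scheme, for illustration only.
            wordforms[form] = f"wf{len(wordforms) + 1}"
        type_frequencies[(wordforms[form], document_id)] = frequency


def forms_to_present(document_id, corpus_id, wordforms, type_frequencies, dont_show):
    """List the word forms of a document that annotators should still see."""
    hidden = set()
    for row in dont_show:
        if (
            row.get("at_all")
            or row.get("document_id") == document_id
            or (corpus_id is not None and row.get("corpus_id") == corpus_id)
        ):
            hidden.add(row["wordform_id"])
    id_to_form = {wf_id: form for form, wf_id in wordforms.items()}
    return sorted(
        id_to_form[wf_id]
        for (wf_id, doc_id) in type_frequencies
        if doc_id == document_id and wf_id not in hidden
    )


wordforms, type_frequencies = {}, {}
register_document("doc1", "de kat en de hond".split(), wordforms, type_frequencies)
# Hide the function word "de" for the whole corpus, as in the dont_show example.
dont_show = [{"wordform_id": wordforms["de"], "corpus_id": "SG1873"}]
visible = forms_to_present("doc1", "SG1873", wordforms, type_frequencies, dont_show)
print(visible)
```

A dont_show row may thus apply to one document, to one corpus, or (with at_all set) everywhere.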
For administrative purposes we added a table users. Here we register staff members who are tasked
with manual annotation and verification.

Table users
User_id  name
1        Jan van der Wiel

3. Information attached to lemmata
Lemmata are linked to word forms (cf. 2.4). In their turn, lemmata need several other information
categories to fulfill their role in the lexicon, which will be described in this section.
Figure 5: basic lemma information
3.1. Lemma id
It goes without saying that each lemma is assigned a unique ID.
Status: mandatory
3.2. Modern lemma form
Recall that the modern lemma form is used as a variation-independent search key in Enrichment. The
general rule is to assign a single modern lemma form. In some cases, it will be profitable to add
more than one modern lemma form, because several variants survive in the modern language, with more
or less equal status. A separate table stores these variants. Typical examples in Dutch: Weer/weder,
neer/neder.
There will be a separate document about the principles of assigning a modern lemma to historical
word forms.
Status of this information: mandatory, both for internal and external use. Modern lemma variants are
optional.
3.3. Lexical part of speech
A main part of speech is assigned to each lemma (e.g. NOUN, VERB, ADPOSITION, ...).
Part of speech is not by itself a deliverable of IMPACT, but the lexicon cannot be organized without
it.
Part of speech distinguishes lemmata. Additional features (like gender, inflectional class) do not by
themselves constitute a sufficient criterion to distinguish lemmata, since they are very much subject
to historical variation (e.g.: at least 3815 nouns from the Dutch Woordenboek der Nederlandsche Taal
have more than one possible gender). We do not specify which additional features may be used for all
different languages. Instead, we provide a general mechanism for adding features (cf. 3.6 and 3.4).
Status of this information: mainly for internal use [10], but hardly dispensable as a means to
organize the lexicon, so: mandatory.
3.4. Gender and other possible grammatical features
Gender information is important as an organizational principle in, for instance, German. In other
languages, features like animate/non-animate may be relevant.
In languages with poor inflectional morphology, it is often possible to have several genders for a
single lemma. Hence the suggested general feature assignment mechanism (cf. figure 1).
Example: gender

Table lemma_features
lemma_feature_id  Lemma_feature_name
1                 Gender
2                 Foreign_Language_Name

Table lemma_feature_values
lemma_feature_value_id  lemma_feature_value
1                       M
2                       V
3                       French
4                       German

Table Lemma_feature_assignments
assignment_id  feature_id  value_id  lemma_id
1              1           1         19289
2              1           2         19289
3              2           4         20001

Status: optional, internal, depending on the importance of these notions in the language at hand.
Remark: Within the NE context, this can be used to tag words as belonging to a foreign language
(Koroška [SLOVENIAN]).
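
The general feature assignment mechanism can be illustrated with the example rows above. The Python sketch below resolves the assignments of a lemma into named features; the data are copied from the three example tables, and the function `features_of` is a hypothetical helper, not part of the lexicon API.

```python
# Rows copied from the example tables.
lemma_features = {1: "Gender", 2: "Foreign_Language_Name"}
lemma_feature_values = {1: "M", 2: "V", 3: "French", 4: "German"}
# (assignment_id, feature_id, value_id, lemma_id)
lemma_feature_assignments = [
    (1, 1, 1, 19289),
    (2, 1, 2, 19289),
    (3, 2, 4, 20001),
]


def features_of(lemma_id):
    """Return {feature name: [values]} for one lemma."""
    result = {}
    for _assignment_id, feature_id, value_id, assigned_lemma in lemma_feature_assignments:
        if assigned_lemma == lemma_id:
            name = lemma_features[feature_id]
            result.setdefault(name, []).append(lemma_feature_values[value_id])
    return result


print(features_of(19289))
print(features_of(20001))
```

Because every assignment is a separate row, lemma 19289 can carry two genders at once without any change to the table structure.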
3.5. Named entity label
For named entities (NE), either multiword or single, a classification label is added according to the
scheme chosen for IMPACT. The proposed labels are NEPER (persons), NELOC (locations), NEORG
(organizations).
Status of this information: for internal and external use, mandatory.
[10] Part of speech tagging is not a deliverable of IMPACT
3.6. Inflectional class(es)
Inflectional classes are necessary for the basic generation of word forms in the reverse
lemmatization task.
Status of this information: for internal use, but hardly dispensable as a means of organization.
3.7. Language
When a text contains words from another language, they should be marked accordingly.
3.8. Gloss
Lemmata may have a short description of word meaning. This is especially relevant to be able to
distinguish between homographs.
Status: optional, internal and external use.
3.9. Multiword expressions
The inclusion of multiword expressions (MWE) takes us to the boundary of syntax and
morphosyntax. A lot of recent research has been devoted to the position of multiword expressions in
the lexicon; much of this work is concerned with the syntactic treatment or the semantic
interpretation of idioms, which is decidedly out of scope for IMPACT.
Within IMPACT, MWE are likely to play a role for named entities and for constructions which can be
realized both as a single orthographic token and as several tokens (e.g. separable verbs and detached
word parts).
There are two distinct ways of adding multiword structure to the database. We can map a multiword
expression realized as a word form to a sequence of lemmata and PoS labels by using the structure
already present in the database for the storage of clitic combinations (cf. 2.4), and the constituent
parts of multiword lemmata are specified using a mechanism parallel to the way we treat morphological
analysis (3.10). Some typical cases:
1. Transparent: there is a clear [11] correspondence between the parts of the word form, separated
   by whitespace, and the lemma parts. Karl der Grosse, Karls des Grossen.
   Most naturally seen as a sequence of word forms, each with their own lemma and PoS. The
   sequence has a higher-level PoS and lemma as well. Cf. also 2.5.1.1.
2. Non-transparent: zu ruck: two typographic words but just one linguistic word form
   (containing whitespace, no special treatment required in the lexical database).
3. SomecombinationslikeMiddleDutchaldiewiledat(allthewhilethat):admitforbothpointsof
20
Lexicon structure
IMPACT
EE2
view. The fact that the combination occurs with different typographical segmentations (cf.
examples below) points to an analysis along the lines of the analysis of clitic combinations.
(DictionaryofEarlyMiddleDutch:
Ende al de wile dat soe drinct yet Sone drinc en twint selue niet, En.Cod. p. 486487, r. 426,
OostVlaanderen,1290
Endealdiewiledatsighingenoliekoepen,soquamdiebrudegoem,endediegheretwaren,
ghingenmetheminterbrulocht,Diat.p.222,r.1216,BrabantWest,12911300)
The equivalence class method (ECM , Odijk 2004) is quite similar to what we intend to do on the
lemmalevel.Inordertoarriveatarepresentationwhichcanbeusedindifferentpossiblegrammatical
theories,Odijkproposestoincludethefollowinginformationforeachidiom:
1. Idiompatternid(=ourmultiword_operation_id)
2. Idiomcomponentlist(=multiword_analysis)
3. Examplesentence(weshouldgetthisfromtheattestations)
In order to deal with inflected forms of multiword expressions, pattern equivalence can be defined
suchthatequivalentmultiwordexpressionshavesimilarinflectionalproperties.
Status of multiword data: optional for the general lexicon; indispensable for the named entities
lexicon.
Figure 6: database model for multiword lemmata [11]
Table lemmata
lemma_id  modern_lemma    lemma_pos
L102      al-de-wijl-dat  CONJ
L501      Al              PRN
L502      De/die [12]     PRN
L503      Wijl            NOU
L504      Dat             CONJ
Table multiword_analyses
Multiword_analysis_id  Arity  Analyzed_lemma_id  multiword_operation_id
a102                   4      L102               M1
Table part_multiword_analysis
Part_multiword_analysis_id  Part_number  Part_lemma_id  Multiword_analysis_id
P1                          1            L501           a102
P2                          2            L502           a102
P3                          3            L503           a102
P4                          4            L504           a102
[11] For an explanation of the self-reference in the definition of the lemmata table, cf. section 3.11.1,
portmanteau lemmata.
[12] The slash indicates alternatives.
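The way the three tables fit together can be illustrated with a small lookup. The following Python sketch is only an illustration under stated assumptions: the example rows are held in plain dictionaries and lists rather than in the relational database, the analyzed lemma of a102 is taken to be L102, and the function name multiword_parts is hypothetical.

```python
# Example rows from the tables above; the in-memory layout is an assumption.
lemmata = {
    "L102": ("al-de-wijl-dat", "CONJ"),
    "L501": ("Al", "PRN"),
    "L502": ("De/die", "PRN"),
    "L503": ("Wijl", "NOU"),
    "L504": ("Dat", "CONJ"),
}
multiword_analyses = {
    "a102": {"arity": 4, "analyzed_lemma_id": "L102", "operation_id": "M1"},
}
# (part id, part number, part lemma id, multiword analysis id)
part_multiword_analysis = [
    ("P1", 1, "L501", "a102"),
    ("P2", 2, "L502", "a102"),
    ("P3", 3, "L503", "a102"),
    ("P4", 4, "L504", "a102"),
]


def multiword_parts(analysis_id):
    """Return the (lemma, PoS) pairs of one multiword analysis, in part order."""
    parts = sorted(
        (number, lemma_id)
        for _, number, lemma_id, a_id in part_multiword_analysis
        if a_id == analysis_id
    )
    # The arity field should agree with the number of stored parts.
    if len(parts) != multiword_analyses[analysis_id]["arity"]:
        raise ValueError("arity does not match the number of parts")
    return [lemmata[lemma_id] for _, lemma_id in parts]


print(multiword_parts("a102"))
```

Sorting on the part number rather than relying on row order keeps the sketch independent of the order in which the parts happen to be stored.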
3.9.1. Multiword named entity lemmata
The inclusion of Named Entities (NEs) in the lexicon is crucial in the sense that, on the one hand, text
recognition is based on input from the lexicon, so we want to capture as many possibly occurring
tokens as possible, and on the other hand, names of persons, organizations and places are very likely
candidates for users' search queries; hence, normalizing them with respect to orthographical and
interlingual variation is desirable.
NEs can occur in the form of multiword expressions or as single tokens. In principle, the mapping
from multiword NEs to lemmas works the same way as with idiom parts, i.e., the entire complex
receives a Lemma ID, and the parts are mapped onto their corresponding lemmas, if available. For the
possible values of the property NE label cf. section 3.5. For the treatment of word forms and
attestations for multiword NEs, cf. 2.5.1.1.
Table lemmata
lemma_id  Modern_lemma     lemma_pos  ne_label
L202      Jan van de Wiel  NOU        NE_PER
L601      Jan              NOU        NE_PER
L602      Van              ADP
L603      De               PRN
L604      Wiel             NOU        NE_PER
Table multiword_analyses
Multiword_analysis_id  Arity  Analyzed_lemma_id  Multi_operation_id
A202                   4      L202               m1
Table part_multiword_analysis
Part_multiword_analysis_id  Part_number  Part_lemma_id  Multiword_analysis_id
P1                          1            L601           A202
P2                          2            L602           A202
P3                          3            L603           A202
P4                          4            L604           A202
Note that there is no distinction between single-word and multiword NEs, as both types are identified
as the same PoS category, and that the person name Wiel is not mapped to the noun wiel (wheel).
Status of NE information: mandatory, internal and external use
3.10. Morphological analysis
This section is about derivation and composition. The paradigmatic relation between lemma and word
forms is treated in section 2.4. Morphological analysis will be attached at the lemma level. The word
forms belonging to lemmas will inherit this analytical information.
Within IMPACT, morphological analysis is not a purpose of its own, but serves practical ends:
- to function in a spell checker that does not reject newly found productive compounds because of their
  deviant forms
- analysis of existing compounds can be used to predict inflectional forms for compounds/word forms
  which will be generated automatically (expansion).
Some remarks:
1) Morphological analysis can be specified in the form of a full hierarchical analysis, or a flat list of
components, or (partial analysis) one can just specify the head of the compound, which usually
determines its morphosyntactic properties. The proposed database structure is compatible with these
three possibilities. We want to stress the idea that different solutions are possible for different
languages.
To fulfill practical ends, we don't always need full-blown deep analyses. We only have to be able to say
which type of compound and which final parts of a compound are very frequent.
- A deep analysis can be obtained by storing, recursively, the analyses of the immediate
  constituents (Braumeisterfleischpflanze is analyzed as a nominal compound of Braumeister +
  Fleischpflanze; a deeper analysis can be stored if Braumeister is analyzed in its turn as
  brauen + Meister and Fleischpflanze as Fleisch + Pflanze, etc.).
- An (arbitrarily long) flat analysis of a compound is also possible:
  Braumeister + Fleisch + Pflanze. There is often no need to choose between different
  bracketings of a compound.
- If the focus is on predicting the morphosyntactic properties of the compound, it is sufficient
  to analyze this word as nominal compound with last part Pflanze. It is NOT mandatory to
  link to all parts of a compound.
2) Diminutives are assigned lemmata of their own but the relation to the base lemma is stored.
3) It is allowed to assign more than one analysis to one compound lemma.
Figure 7: database model for morphological analysis
Status of morphological analysis: internal + external, optional (in the sense that not all words must be
analyzed)
Internal use: use to predict paradigms of compounds and derivations
External use: use in OCR to help assess the probability of unknown words
Table 1: database examples for morphological analysis
Table lemmata
Id       Modern_lemma               Lemma_pos
L000001  Appelflap                  NOU
L000002  Appel                      NOU
L000003  Flap                       NOU
D1       Braumeisterfleischpflanze  NOU
D2       Braumeister                NOU
D3       Pflanze                    NOU
D4       Brauen                     VRB
D5       Meister                    NOU
D6       Fleisch                    NOU
D7       Fleischpflanze             NOU
Table morphological_analyses
Morphological_analysis_id  Arity  Analyzed_lemma_id  Morphological_operation_id
A1                                L000001            O1
A2                                D1                 O1
A3                                D2                 O2
A4                                D7                 O1
A5                                D1                 O3
Table morphological_operations
Morphological_operation_id  description       resulting_pos
O1                          NOU + NOU -> NOU  NOU
O2                          VRB + NOU -> NOU  NOU
O3                          .* + NOU -> NOU   NOU
Table part_morphological_analysis
Part_morphological_analysis_id  Part_number  Part_lemma_id  Morphological_analysis_id
P1                                           L000002        A1
P2                                           L000003        A1
P3                                           D2             A2
P4                                           D7             A2
P5                                           D4             A3
P6                                           D5             A3
P7                                           D6             A4
P8                                           D3             A4
P9                                           D3             A5
Note: Analyses A2, A3, A4 constitute a hierarchical analysis ((Brau)(meister))((fleisch)(pflanze)); A5 is a flat
analysis (brau_meister_fleisch_pflanze) which only links to the head of the compound.
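How a hierarchical analysis is recovered from the stored immediate constituents can be sketched as follows. This is a minimal Python illustration, not project code: it copies analyses A2, A3 and A4 into a dictionary, assumes for simplicity that each lemma has at most one stored analysis (the database itself allows several), and the names bracket and leaves are hypothetical.

```python
# analysis id -> (analyzed lemma id, ordered part lemma ids); rows A2-A4 above
analyses = {
    "A2": ("D1", ["D2", "D7"]),
    "A3": ("D2", ["D4", "D5"]),
    "A4": ("D7", ["D6", "D3"]),
}
lemma_form = {
    "D1": "Braumeisterfleischpflanze", "D2": "Braumeister", "D3": "Pflanze",
    "D4": "Brauen", "D5": "Meister", "D6": "Fleisch", "D7": "Fleischpflanze",
}
# Assumption: one analysis per lemma, so a plain dict suffices for the lookup.
by_lemma = {lemma_id: parts for lemma_id, parts in analyses.values()}


def bracket(lemma_id):
    """Render the stored analysis of a lemma as a bracketed string."""
    parts = by_lemma.get(lemma_id)
    if parts is None:  # no stored analysis: treat the lemma as a leaf
        return lemma_form[lemma_id]
    return "(" + " + ".join(bracket(p) for p in parts) + ")"


def leaves(lemma_id):
    """Flatten the hierarchical analysis into its unanalyzed components."""
    parts = by_lemma.get(lemma_id)
    if parts is None:
        return [lemma_form[lemma_id]]
    return [leaf for p in parts for leaf in leaves(p)]


print(bracket("D1"))
print(leaves("D1"))
```

Because only immediate constituents are stored, the same recursion yields a shallow analysis when the deeper records (here A3 and A4) are absent.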
3.11. Unresolved ambiguity in lemma assignment
There are various ways of dealing with ambiguous word forms in the database. The basic mechanism
is always the same: different analyses are attached to a single word form. This makes it possible to
either leave it like that and not resolve the ambiguity at all or resolve it partially or resolve it
completely. This depends on the requirements for the task.
The two mechanisms described below mainly serve to distinguish ambiguities which need not be
resolved in IMPACT from other ambiguities which possibly do require a partial resolution.
3.11.1. Portmanteau lemmata
A portmanteau lemma is a lemma representing a group of homographs.
The purpose of portmanteau lemmata is to avoid choosing between two homographic lemmata (with
equal modern lemma form and PoS), but different in meaning, inflection class or gender.
Portmanteau lemmata will be implemented as ordinary lemmata, linked to the homographs.
Cf. heer1 (lord), heer2 (army) or bank1 (couch), bank2 (bank), Wetter1 (person who places bet), Wetter2
(weather).
Portmanteau lemmata can be used to avoid complete disambiguation in morphological analysis as
well: cf. tuinbank vs. handelsbank or heerbaan (heirbaan) vs. heerendas. A word form which belongs
unambiguously to the paradigm of one of the homographs can be assigned directly to the more
specific lemma, e.g. the old form har belongs only to heer1.
NB: portmanteau lemmata will not be used to group homographic lemmata with different PoS. Cf. the
discussion of transcategorization (conversion), section 3.11.2.
Portmanteau lemmata were introduced for practical reasons:
- Lemmatizing a word form like Dutch kip to 15 possible homographic lemmata is not very
  attractive.
- How to update the ambiguous lemmatizations when another homograph is added to the lexical
  database?
- How to add data from a full-form lexicon which need not have split the homographs in exactly the
  same way?
Status: optional, internal
3.11.2. Transcategorisation (conversion), sublemma and main lemma
Transcategorization (or conversion) occurs when part of the paradigm of a lemma X with PoS A can be
seen as belonging to lemma Y with PoS B (e.g. participles, which can be seen as belonging to both a verbal
and an adjectival paradigm). We call Y a sublemma corresponding to the main lemma X. In each
language, we have a (small) fixed set of productive transcategorization relations. This list will be
included in the database for the language.
While it might appear that including transcategorization information in the lexical database is
linguistic hairsplitting and not relevant to IMPACT, it must be realized that it provides us with a
principled way to avoid or defer decisions about lemma and PoS assignment to word forms like
geboren (geboren/ADJ or gebären/VRB), or to leave the choice up to the user.
Figure 8: Database objects relating to morphosyntactic conversion (transcategorization)
Table 2: database examples
Table Lemmata
Lemma_id  Modern_lemma  Lemma_part_of_speech
L1        Bakken        VRB
L2        Gebakken      ADJ
Table Transcategorisations
Transcategorization_id  Mainlemma_id  Sublemma_id  Transcategorizationtype_id
T1                      L1            L2           C1
Table Conversion_rules (List of transcategorizations present in the language)
Rule_id  main_pos                sub_pos       Transcategorisation_id
R1       VRB(part,past)          ADJ(infl=0)   C1
R2       VRB(part,past,infl=e)   ADJ(infl=e)   C1
R3       VRB(part,past,infl=en)  ADJ(infl=en)  C1
Table Transcategorisation_types
Transcategorizationtype_id  Description                                       main_pos  sub_pos
C1                          Conversion between past participle and adjective  VRB       ADJ
Use of this data:
1) Lexicon expansion: create sublemmata automatically for non-incidental transcategorizations
2) Postponing or omitting disambiguation: distinguish between genuine ambiguity (where for
instance two semantically and etymologically completely different lexemes may be involved)
and ambiguity resulting from different tagging principles
Status of this data: optional, internal
3.12. Adding custom information on the lemma level
If the database designer needs to store other lemma-related information, the recommended way is not
to change the tables which are part of the basic structure, but to add tables linking the information to
the relevant lemma IDs. If, for instance, it is desirable to add near-synonym information for retrieval
purposes, the preferred solution is not to add fields to the lemmata table, but to add a table linking to
it.
Example: retrieval links for near-synonyms or heads of compounds. Possible use: when searching for
lemma_id, also search for related_lemma_id.
Table lemmata
Lemma_id  Modern_lemma     Lemma_pos
L001      Zange            NOU
L002      Seitenschneider  NOU
L003      Sonnenblume      NOU
L004      Blume            NOU
Table retrievallinks
Lemma_id  Related_lemma_id
L001      L002
L004      L003
Status of this information: optional, mainly for external use in retrieval.
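The "possible use" mentioned above amounts to a simple query expansion. The following Python fragment is a sketch under assumptions: the retrievallinks rows are held in a list of pairs, links are followed in the stored direction only, and the function name expand_query is hypothetical.

```python
lemmata = {
    "L001": "Zange", "L002": "Seitenschneider",
    "L003": "Sonnenblume", "L004": "Blume",
}
# (lemma_id, related_lemma_id) rows of the retrievallinks table
retrieval_links = [("L001", "L002"), ("L004", "L003")]


def expand_query(lemma_id):
    """Return the searched lemma followed by the lemmata linked to it."""
    related = [rel for src, rel in retrieval_links if src == lemma_id]
    return [lemmata[i] for i in [lemma_id] + related]


print(expand_query("L004"))
```

Keeping the links in a table of their own, as recommended in this section, means the lemmata table is untouched when such retrieval information is added or removed.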
3.13. Additional structure for related entries in NE lexica
In the general lexicon, variants are included as word forms belonging to the modern standard lemma.
This will also be the case for spelling variants of locations (Haerlem will be a word form with lemma
Haarlem, etc.).
For person names, however, we found it not feasible to distinguish between allographs of the same
name and etymologically related but different names. There are also variant relations like
interlingual variation which deserve special treatment.
We propose the following structure:
Examples:
Table lemmata
Lemma_id  Modern_lemma   Lemma_pos  NE_Label
L001      Kärnten        NOU        NELOC
L002      Carinthia      NOU        NELOC
L003      Koroška        NOU        NELOC
L004      Douwes Dekker  NOU        NEPER
L005      Multatuli      NOU        NEPER
Table ne_variant_relation_types
Ne_variant_relation_type_id  Ne_variant_relation_name  ne_variant_relation_description
1                            Interlingual_variant      Second is a variant of first in another language
2                            Pseudonym                 Second is a pseudonym used by first
Table ne_variant_relations
first_lemma_id  second_lemma_id  ne_variant_relation_type_id
L001            L002             1
L001            L003             1
L004            L005             2
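A lookup over these relations might look as follows. This Python sketch is illustrative only: the rows are copied from the example tables, following a relation from either side is an assumption made here for retrieval purposes (the tables themselves only record a first and a second lemma), and related_names is a hypothetical name.

```python
lemmata = {
    "L001": "Kärnten", "L002": "Carinthia", "L003": "Koroška",
    "L004": "Douwes Dekker", "L005": "Multatuli",
}
relation_types = {1: "Interlingual_variant", 2: "Pseudonym"}
# (first_lemma_id, second_lemma_id, ne_variant_relation_type_id)
ne_variant_relations = [("L001", "L002", 1), ("L001", "L003", 1), ("L004", "L005", 2)]


def related_names(lemma_id):
    """Return (name, relation name) pairs linked to lemma_id in either direction."""
    found = []
    for first, second, type_id in ne_variant_relations:
        if first == lemma_id:
            found.append((lemmata[second], relation_types[type_id]))
        elif second == lemma_id:
            found.append((lemmata[first], relation_types[type_id]))
    return found


print(related_names("L001"))
print(related_names("L005"))
```

Note that the sketch does not chain relations: a query for Carinthia finds Kärnten but not Koroška, since the two are only linked through L001.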
3.14. Named entity parts
These tables were added to allow parts of names to be marked as such.
Examples of NE part types for Dutch are:
Given name  Piet
Surname     Jansen
Title       dr., Jhr., baron
Particle    van, de, of, thoe, over, uyt
Suffix      junior, senior, sr., C.zn, A.zn, IIIe, Derde
Status of this information: optional, mainly for external use in retrieval.
4. Information on the document level
Information about the domain of application of words will be specified on the document level. By
linking the words to the documents they occur in, they will inherit this information.
The following are relevant on the document level.
- Elementary bibliographical data:
  - Author
  - Editor
  - Title
  - Date of publication
  - Publisher
  - Publishing location
  - If document is part of e.g. a magazine, or a collected work, ...: reference to this work
    and to pages and/or issue/volume in this magazine, collection, ...
  - If document is in collection holder's catalogue: some ID or other type of link to the
    relevant item in the catalogue
- Text type, based on library metadata standards
- Number of words
- Date of text (can differ from date of publication, e.g. in case of editions)
- Region of origin of text (dialect/language variety)
- Character encoding (UTF-8)
- Primary language
- Presence of other languages, e.g. Latin, French, ...
- To start with: informal description of the type of spelling used in the document. In the course
  of the project, this can be extended by a more formal profile. (E.g. Dutch: there is a difference
  between text material in the late nineteenth-century spelling of De Vries/Te Winkel and the
  spelling of Groene Boekje 1954. Some authors, e.g. Multatuli, have their own spelling rules. This
  information is relevant.)
- Location (path).
Status of this information: mandatory
5. Auxiliary information for word form synthesis and analysis
Another kind of information is that for automatically generated or analyzed word forms.
Here, we keep track of:
- the inflection rule used
- the building element(s)
- the spelling patterns used to match the normalized spelling of the word form with the actual
  spelling in documents
The following diagram summarizes the relation between historical word form and modern lemma
form, which is central in IMPACT lexica:
The horizontal axes correspond to models for inflectional morphology; the vertical axes correspond to
spelling variation as it will be modelled in IMPACT [13].
5.1. Data to support the modelling of orthographic variation
In order to be able to induce statistical models for historical spelling by machine learning algorithms,
some extra data, besides the relation of historical word form and modern lemma, must be developed.
Without the modern word form equivalents, it is difficult to separate inflection from orthographical
variation. The addition of this information is not entirely unproblematic. When there are
morphological (and phonological) differences, a historical word form in modern spelling may be a
somewhat artificial construct. In manual annotation of ground truth material, we expect it to be much
easier to choose a relevant lemma from a suggestion list of possible lemma assignments than to
choose a plausible transcription for a historical word form. There are, however, many cases where
the differences between modern language and historical language are largely orthographic, and it is
indeed possible to have some standard representation of historical word forms in modern spelling.
[13] The noisy channel models used assign weights to multi-character substitutions, thus defining a
probabilistic model of orthographic variation.
The modern word form is useful, because a database of modern and historical word forms makes it
easy to induce a set of patterns relating historical and modern spelling by a machine learning
algorithm.
SUMMARIZING: It is of course not a problem to include this field in the database without being
obliged to manually verify its contents. It may be sufficient to fill this in for only a relatively small
number of word forms in a certain orthography in order to obtain the set of patterns needed to
describe this particular orthography.
Example: the manual verification of the lemma assignment zeggen to the historical word form seg(h)e is easy;
choosing a modern form (zeg or zegge or even zeggen) is much less straightforward.
Position in paradigm  Modern  Middle Dutch
1e sg.ind.pres.       Zeg     sech, seg(h), segg, secge, seche, seg(h)e, segg(h)e
1e pl.ind.pres.       Zeggen  secg(h)en, segg(h)en, zegghen; directly followed by the pers.pron. wi,
                              final -n often missing: secghe, segg(h)e, zegghe
imp.sg.               Zeg     sech, seg(h), seg(h)e
imp.pl.               Zegt    sagit (2x, Nederrijn), secget, sec(h)t, segg(h)et
1e sg.conj.pres.      Zegge   segg, segh, sage, segge (it's not always certain that a (nonexistent)
                              conjunctive is involved)
3e sg.conj.pres.              segg, sage, secghe, segghe
Figure 9. Derivations
Table Wordforms
Wordform_id  Wordform
W1           Klaerlick
Table Analyzed_wordforms
Analyzed_wordform_id  Number_of_parts  Wordform_id
A1                    0                W1
Table Derivations
Derivation_id  Normalized_form  analyzed_form_id
D1             Klaarlijk        A1
Table Patterns
Pattern_id  Left_hand_side  Right_hand_side
P1          aa              ae
P2          ij              i
P3          k               ck
Table Pattern_applications
Pattern_application_id  Position  Pattern_id  Derivation_id
Pa1                     2         P1          D1
Pa2                     6         P2          D1
Pa3                     8         P3          D1
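The derivation example can be replayed mechanically: applying the three patterns at the recorded positions of the normalized form Klaarlijk yields the attested word form Klaerlick. The Python sketch below is an illustration under assumptions: positions are read as 0-based character offsets into the normalized form (the reading under which the example is consistent), pattern sides are taken in lower case, and apply_patterns is a hypothetical name.

```python
# pattern id -> (left-hand side in normalized spelling, right-hand side in historical spelling)
patterns = {"P1": ("aa", "ae"), "P2": ("ij", "i"), "P3": ("k", "ck")}
# (position, pattern id) for derivation D1; positions assumed to be 0-based offsets
pattern_applications = [(2, "P1"), (6, "P2"), (8, "P3")]


def apply_patterns(normalized, applications):
    """Rewrite the normalized form into the historical spelling, left to right."""
    pieces = []
    cursor = 0
    for position, pattern_id in sorted(applications):
        lhs, rhs = patterns[pattern_id]
        if not normalized.startswith(lhs, position):
            raise ValueError(f"{pattern_id} does not match at position {position}")
        pieces.append(normalized[cursor:position])  # unchanged stretch
        pieces.append(rhs)                          # substituted stretch
        cursor = position + len(lhs)
    pieces.append(normalized[cursor:])
    return "".join(pieces)


print(apply_patterns("Klaarlijk", pattern_applications))
```

Recording positions against the normalized form, rather than against the partially rewritten string, means the applications do not interfere with each other even when a substitution changes the length of the word.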
Status of this information: mandatory, external (for use within TR5)
The mandatory status of this information does not imply that it is completely manually verified. One
may choose to generate this information from other information. The normalized word form may be
chosen on the fly among the modern word forms of the lemma. The mapping, on the other hand,
between historical word form and modern lemma is part of the deliverable output of the lexicon
building process and the quality of this mapping has to be checked, and if necessary, manual
corrections must take place.
5.2. Information about paradigmatic expansion
This is one way of keeping track of the process of expansion from lemmata to word forms.
Figure 10. Paradigmatic expansion
Table paradigms
Paradigm_id  Paradigm_name
P1           Regular verbal a-stems
P2           Regular verbal e-stems
Table paradigm_positions
paradigm_position_id  paradigm_position_name  paradigm_position  Paradigm_id
1                     1sgindpresactive        1                  P1
2                     2sgindpresactive        2                  P1
Table transformsets
Transformset_id  Inflection_process  Paradigm_position_name  Stem_type_id
R1               s/are$/o/ [14]      1sgpresactiveastems     ST1
R2               s/are$/as/          1sgpresactiveastems     ST1
Comment: patterns for inflection may be either simple substitution rules or full-fledged finite-state
transducers
Table wordform_transform_instances (defines the relation between inflectional patterns and word form instances)
Transform_instance_id
Transformset_id
Stem_id
R1
R2
Amare
A1
Table Wordforms
Wordform_id  Wordform
W1           Amas
Table Analyzed_wordforms
analyzed_wordform_id  pos                     part_number  number_of_parts  parent_analysis_id  lemma_id  wordform_id
A1                    VRB(pres,2,sg,ind,act)  NULL         1                NULL                L1        W1
Table stems
Stem_id  Stem_form  Lemma_id  Stem_type_id
S1       Amare      L1        ST1
Table Stem_types
Stem_type_id  Stem_type_name
ST1           Lemmaform
Status of this information: optional, internal
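The inflection processes in the transformsets table are simple substitution rules on a stem form. As a rough illustration, the two rules can be replayed in Python with the standard re module; the dictionary layout and the function name expand are assumptions made for this sketch, and the rules are transcribed from the Perl-style s/are$/o/ and s/are$/as/ given above.

```python
import re

# transformset id -> (search pattern, replacement), transcribed from the table
transformsets = {
    "R1": (r"are$", "o"),
    "R2": (r"are$", "as"),
}


def expand(stem_form):
    """Apply every transformset to a stem form and return the generated word forms."""
    return {
        transformset_id: re.sub(pattern, replacement, stem_form)
        for transformset_id, (pattern, replacement) in transformsets.items()
    }


print(expand("amare"))
```

Storing the identifier of the transformset with each generated word form, as the wordform_transform_instances table does, makes it possible to regenerate or retract the forms when a rule turns out to be wrong.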
5.3. Database information for stems
It may not be very practical to derive the complete paradigm from a single base form (e.g. for strong
or irregular verbs).
For this reason, we add a possibility to specify a number of alternate stem forms for a given lemma.
Table lemmata
Lemma_id  Modern_lemma  Lemma_pos
L1        Binden        VRB
Table stems
Stem_id  Stem_form  Lemma_id  Stem_type_id
S1       Bind       L1        ST1
S2       Band       L1        ST2
S3       Bund       L1        ST3
Table stem_types
Stem_type_id  Name
ST1           Present tense stem
ST2           Past tense stem
ST3           Past participle stem
Status of this information: optional, internal
[14] This example uses Perl 5 regular expression syntax.
6. Lexical source
For the existence of other words, no verified evidence in texts may have been found. It is still desirable
to keep track of where they come from: incorporated from some other lexicon, obtained by expansion
from lemmata in historical dictionaries, obtained by automatic (and not manually verified) analysis of
historical documents. In the case of named entities, the lexical source information may serve to
preserve the link to the persistent identifier in the library name authority data.
When information is incorporated from lexica or dictionaries, labeling from these sources may be
copied (mapped, often a nontrivial task; subject matter labels may be useful; regional or temporal
labeling may also be present). Of course not all words in the source lexicon have identical date, text
type, etc. Hence in this case, the information is specified in the source information record for the word
form.
Figure 11. Lexical source
Table lemmata
lemma_id  Modern_lemma     lemma_pos  ne_label
L202      Jan van de Wiel  NOU        NE_PER
Table lexical_source_lemma
Lemma_source_id  Labels            Lemma_id  Foreign_id  Lexicon_id
Ls1              Physics, Science  A1        0000330x    Lex1
Example: (word index of Van der Sijs)
woonachtig* wonende 1279 [CG11, 423]
woord* klank met eigen betekenis 776-880 [CG111 Utr. Doopbelofte] {2.5}
woordenboek* dictionaire [Toll.]
worcestersaus kruidige saus 1900 [Sanders 1995] <Engels {4.1.6}
This word list gives us dates of occurrence which can be useful. The information is linked on the
lemma level.
Status of lexical source information: optional, internal
6.1. Ambiguity information
Especially for named entities, the information that a word form is also part of the general lexicon or of
another part of the named entity lexicon can be useful. Hence, we added some structure to indicate
ambiguity of a word form. This ambiguity information may derive from another lexicon or from
manual inspection.
7. Converting the database into LMF
7.1. Introduction.
In the previous chapters we described the structure of the database that is used for building the
lexicon. The final form of the lexicon, however, will be in the Lexical Markup Framework (LMF: ISO
24613:2008), for this is the standard for sharing computational lexicons.
In this chapter we first describe the structures in LMF that correspond with those that have been
discussed above in the format of a relational database.
Second, we will describe the method to compile an LMF version in XML format from the relational
database. The scripts that are required for this process and the instructions are provided in Appendix
[?].
7.2. Mappings
7.2.1. On notation
The ISO standard uses UML diagrams to represent LMF models. We will do the same in this
document. For convenience we will describe the most essential elements of these diagrams. The boxes
in Figure 12 represent elements in the XML structure. Above the line is the name of the element, and
below it the names of the attributes.
Figure 12. Notation in UML
The arrow in Diagram A indicates that Element 2 is an aggregate of Element 1. Implemented in XML,
this means that Element 2 is embedded in Element 1.
The arrow in Diagram B indicates that Element 1 and 2 are associated and that Element 1 can send
messages to Element 2. Implemented in XML this means that Element 1 contains a pointer to Element
2.
7.2.2. Unlabelled word forms.
Unlabelled word forms have no linguistic data attached to them.
Figure 13. Unlabelled word forms.
The lexicon element is the top node in our description. The lexical entry (LE) corresponds with what
in the previous chapters has been labelled lemma. The LMF element word form corresponds with
the notion analyzed word form of the previous chapters. And the LMF element form_representation,
finally, corresponds with the notion word form of the previous chapters.
In case of the unlabelled word forms, the embedding elements lexical_entry and word_form will
contain no linguistic information.
7.2.3. Inflection (labelled word forms).
Labelled word forms have linguistic information attached to them. Information about the available set
of features is provided at the level of the LE. The features and values of word forms point to the
relevant features that reside under the lexical entry.
Figure 14. Inflection.
Note that the usage of the term Lemma in LMF is different from that in the previous chapters. In LMF
it contains a marker for the LE; usually the stem or base of the word. In this document the element
lemma contains the form of the modern lemma.
7.2.4. Composition.
The set of morphological patterns is attached to the level of the lexicon. LEs can have several
analyses, which all point to different morphological patterns.
Figure 15. Composition.
7.2.5. Spelling.
It is possible to specify the normalized spelling of word forms in a different (older) spelling.
Figure 16. Normalized spelling.
The patterns describe how the written form is derived from the normalized form.
7.2.6. Clitics.
Clitic combinations have a lot in common with composition. Both are considered agglutinations.
Clitics, however, are analyzed at the level of the word form.
Figure 17. Clitics.
Clitics are represented as LEs that have an aggregated word form, and an ordered list of components.
Components are references to other LEs.
7.2.7. Portmanteau.
A portmanteau specifies a relation between homograph lemmas (implemented as LEs). It has been
implemented in LMF format as a lexical entry containing a list of members. Each member in the list
points to a lexical entry. The concept of a List of Members is derived from the List of Components
that is used for e.g. composition and MWEs. The main difference is that a List of Components is an
ordered set and a List of Members is not ordered.
Figure 18. Portmanteau.
7.2.8. Transcategorization.
Transcategorisations specify homonym word forms from different LEs that typically differ in part of
speech type. Since there is usually a limited list of transcategorisation types in a language, this list is
located at the lexicon level.
Figure 19. Transcategorization.
For transcategorisations we use Lexical Entries of the type Categorisation, that point to the
corresponding Transcategorisation Type and rule. Further, the Lexical Entry contains a List of
Components to specify the elements of the transcategorisation. The reason why we do not use a List of
Members as in the case of Portmanteaus is that a List of Components is ordered.
7.2.9. Multiword expressions.
Multiword expressions are added to the lexicon as Lexical
Entries. The analysis of that LE points to a Multiword Expression Pattern (MWE Pattern). These MWE
Patterns describe an ordered list of nodes, and in the description field the grammatical relation of these
nodes.
Figure 20. Multiword expressions.
7.2.10. Multiword named entities.
Multiword expressions are added to the lexicon as Lexical
Entries. The analysis of that LE points to a Multiword Expression Pattern (MWE Pattern). These MWE
Patterns describe an ordered list of nodes, and in the description field the grammatical relation of these
nodes.
Figure 21. Multiword named entities.
7.2.11. Attestations.
LMF provides a structure for examples of use of a LE. This structure (context) is subsumed under the
sense part of the LE. This is not fit for our purpose since we want the description of the context to
clarify the provenance of word forms.
We, therefore, have to create a new extension with new categories for this purpose.
Figure 22. Attestations.
In paragraph 2.5 we described three types of attestations. The types text attestation and token
attestation are attached to the form representations of analyzed word forms. The attestations of
unanalyzed word forms are located at the same position, but occur in lexical entries with unlabelled
word forms (see par. 7.2.2).
7.3. Converting relational data to XML.
In the previous section we described the LMF format in XML we want to use for the final form of the
lexicon. In this section we present a method for converting the content of the relational database into
XML. In Appendix [?] you will find the Perl script (relDB2xml.pl) that can be used for this. The script
is to be used in combination with a structure definition for a certain lexicon (language). Appendix [?]
contains the specification for the Dutch lexicon (NL_Structure.pl).
The script relDB2xml.pl is run without arguments. All specific data for the conversion are in a
separate (Perl) file which contains the mapping of tables to XML. The file also contains all details on the
database that contains the relational data. The reference to this file is specified near the top of
the script relDB2xml.pl.
The mapping specification is laid down in an array structure. Note that this is Perl code and that using
the right syntax is very important.
The array has an embedded structure that roughly corresponds with the resulting XML.
There are three kinds of substructures: for tables, fields and XML elements.
Structure for XML elements.
The arrays for binding contain the following elements:
- element name REQUIRED
- list of arrays for subelements REQUIRED
This structure introduces an XML element which will contain all further data from its subelements.
Structure for tables.
Every array for entering tables contains these elements:
- connection type (->) REQUIRED
- selection criterion REQUIRED
- name of resulting XML element (can be empty string) REQUIRED
- table name REQUIRED
- list of arrays with subelements OPTIONAL
The selection criterion is essentially the where clause in a SQL select statement. The name of the
resulting XML element is used to specify the resulting XML subtree.
If the name is an empty string, no new elements will be introduced at that level, which means that the
fields of all records that result from the query will be siblings. If a simple name is specified, a
subelement with that name is introduced for every record that results from the query, in which the
fields of that record are embedded. If a path is specified (e.g. element_a.element_b), extra levels of
subelements will be introduced for every record.
Note that there are two ways to introduce XML substructures: using the Structure for XML elements
(example A), or using a path specification in the Structure for tables (example B).
A:
["collection",
  ["->",
   "lemmata.lemma_id=lexical_source.lemma_id",
   "source", "lexical_source_lemma"]]
B:
["->", "lemmata.lemma_id=lexical_source.lemma_id", "collection.source",
 "lexical_source_lemma"]
These will result in different structures when more than one record is found in the query:
A: <collection>
     <source><.. content of record 1 ..></source>
     <source><.. content of record 2 ..></source>
   </collection>
B: <collection>
     <source><.. content of record 1 ..></source>
   </collection>
   <collection>
     <source><.. content of record 2 ..></source>
   </collection>
Structure for fields.
Every array for adding fields contains these elements:
- connection type () REQUIRED
- element name REQUIRED
- field name REQUIRED
The element name is the name of the XML element which will hold the value of the field specified by
the field name. The element name cannot be an empty string. The element name can be a path, in
which case extra levels of XML elements will be introduced (analogous to the examples presented
above).
The field name specifies the field that contains the value that has to be inserted into the XML.
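The difference between an element structure (example A) and a path in a table structure (example B) can be reproduced with a few lines of code. The Python sketch below is not the relDB2xml.pl script and makes no claim about its internals: it only mimics the described behaviour for already-fetched records, the functions wrap and render_table are hypothetical, and the field lexicon_id with the values Lex1 and Lex2 stands in for the content of two records.

```python
from xml.sax.saxutils import escape


def wrap(path, inner):
    """Wrap inner XML in one element per component of a dotted path."""
    for name in reversed(path.split(".")):
        inner = f"<{name}>{inner}</{name}>"
    return inner


def render_table(records, element_path):
    """Render each record's fields; wrap every record in element_path if given."""
    rendered = []
    for record in records:
        fields = "".join(
            f"<{key}>{escape(str(value))}</{key}>" for key, value in record.items()
        )
        # An empty name introduces no element: the fields become siblings.
        rendered.append(wrap(element_path, fields) if element_path else fields)
    return "".join(rendered)


# Two hypothetical records returned by the selection criterion
records = [{"lexicon_id": "Lex1"}, {"lexicon_id": "Lex2"}]

# Example A: one enclosing element, one <source> per record
example_a = wrap("collection", render_table(records, "source"))
# Example B: the path is applied per record, so <collection> is repeated
example_b = render_table(records, "collection.source")

print(example_a)
print(example_b)
```

The choice between A and B therefore decides whether the outer element groups all records or is repeated for each of them.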
8. References
D. Archer, A. Ernst-Gerlach, S. Kempken, Th. Pilz and P. Rayson (2006). The identification of spelling variants in
English and German historical texts: manual or automatic? In: Digital Humanities (proceedings), Paris, 2006,
pp. 3-5.
Bień, Janusz S. (2004). An Approach to Computational Morphology. In: Intelligent Information Processing and Web
Mining. Proceedings of the International IIS: IIPWM'04 Conference held in Zakopane, Poland, May 17-20,
2004. Springer, Berlin Heidelberg New York, pp. 181-199. ISBN 3-540-21331-7.
S. Cucerzan and D. Yarowsky. Bootstrapping a multilingual part-of-speech tagger in one person-day. In: Dan
Roth and Antal van den Bosch (eds.), Proceedings of CoNLL-2002, Taipei, Taiwan, 2002, pp. 132-138.
A. Ernst-Gerlach and N. Fuhr. Generating Search Term Variants for Text Collections with Historic Spellings. In:
ECIR, 2006, pp. 49-60.
G. Francopoulo, N. Bel, M. George, N. Calzolari, M. Monachini, M. Pet and C. Soria. Lexical Markup
Framework: ISO standard for semantic information in NLP lexicons. GLDV (Gesellschaft für linguistische
Datenverarbeitung), Tübingen, 2007.
G. Francopoulo, M. George, N. Calzolari, M. Monachini, N. Bel, M. Pet and C. Soria. Lexical Markup
Framework (LMF). LREC, Genoa, 2006.
N. Grégoire. Design and Implementation of a Lexicon of Dutch Multiword Expressions. In: N. Grégoire et al.
(eds), Proceedings of the ACL 2007 Workshop on A Broader Perspective on Multiword Expressions. Prague, 2007,
pp. 17-24.
A. Hauser, M. Heller, E. Leiss, K. U. Schulz and C. Wanzeck. Information Access to Historical Documents from
the Early New High German Period. In: IJCAI-2007 Workshop on Analytics for Noisy Unstructured Text Data,
Hyderabad, India, January 8, 2007, pp. 147-154.
V. Hoste, W. Daelemans and S. Gillis. Using rule-induction techniques to model pronunciation variation in
Dutch. In: Computer Speech and Language 18:1, pp. 1-24.
F. Masini. Multi-word expressions between syntax and the lexicon: The case of Italian verb-particle
constructions. In: SKY Journal of Linguistics 18 (2005), pp. 145-173.
J. E. J. M. Odijk. A Proposed Standard for the Lexical Representation of Idioms. In: Proceedings of Euralex. Lorient,
2004, pp. 153-163.
A. Rappoport and T. Levent-Levi. Induction of Cross-Language Affix and Letter Sequence Correspondence.
In: Proceedings, EACL 2006 Workshop on Cross-Language Knowledge Induction, April 2006, Trento, Italy.
E. S. Ristad and P. B. Yianilos. Learning string edit distance. In: Machine Learning: Proceedings of the Fourteenth
International Conference (San Francisco, July 8-11, 1997), D. Fisher, Ed., Morgan Kaufmann, 1997, pp. 287-295.
N. van der Sijs. Etymologie in het digitale tijdperk. Een chronologisch woordenboek als praktijkvoorbeeld. Leiden, 2001.
48
Lexicon structure
IMPACT
EE2
Appendix A: Database schema
Table alternate_modern_lemmata
Field
Type
alternate_lemma_id
bigint(20) unsigned
alternate_lemma
varchar(255)
base_lemma_id
bigint(20) unsigned
Table analyzed_wordforms
Field
part_of_speech
lemma_id
wordform_id
multiple_lemmata_analysis_id
derivation_id
verified_by
verification_date
Table conversion_rules
Field
rule_id
main_pos
sub_pos
transcategorization_id
Null Key
NO PRI
YES
YES MUL
Type
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
datetime
Type
bigint(20) unsigned
varchar(255)
varchar(255)
bigint(20) unsigned
Table corpora
Field
Type
corpus_id
bigint(20) unsigned
name
varchar(255)
Null
NO
YES
Null
NO
NO
NO
NO
NO
NO
YES
YES
Table corpusId_x_documentId
Field
Type
corpus_id
bigint(20) unsigned
document_id
bigint(20) unsigned
Table derivations
Field
derivation_id
normalized_form
Table documents
Field
Type
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
Type
MUL
NULL
NULL
Default
NULL
NULL
NULL
NULL
Default
NULL
NULL
Null
NO
NO
Extra
auto_increment
Key Default Extra

PRI NULL auto_increment
MUL
MUL
MUL
Null Key
NO PRI
YES
YES
YES MUL
Key
PRI
Default
NULL
NULL
NULL
Extra
auto_increment
Key
PRI
PRI
Default
Null Key Default

NO PRI
NULL
YES MUL NULL
NO
Null
Key
Extra
auto_increment
Default
Extra
Extra
auto_increment
Extra
document_id
persistent_id
word_count
encoding
title
year_from
year_to
pub_year
author
editor
publisher
publishing_location
text_type
region
language
other_languages
spelling
parent_document
Table dont_show
Field
wordform_id
document_id
corpus_id
at_all
user_id
date
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
bigint(20) unsigned
NO PRI
YES MUL
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES
YES MUL
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
tinyint(3) unsigned
bigint(20) unsigned
datetime
Null
NO
NO
NO
NO
NO
NO
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
NULL
Key
PRI
PRI
PRI
PRI
auto_increment
Default
0
0
0
Table group_attestations
Field
group_attestation_id
token_id
quote
derivation_id
wordform_group_id
Type
bigint(20) unsigned
bigint(20) unsigned
text
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Null Key
NO PRI
YES
YES
NO MUL
NO
NO
Table inflection_classes
Field
inflection_class_id
inflection_class_name
Type
bigint(20) unsigned
varchar(255)
Null Key Default

NO PRI NULL
YES
NULL
Table languages
Field
Type
Null
Key
Default
NULL
NULL
NULL
Default
Extra
Extra
auto_increment
Extra
auto_increment
Extra
language_id
language
tinyint(3) unsigned
varchar(255)
NO
NO
Table lemma_feature_assignments
Field
Type
assignment_id
bigint(20) unsigned
feature_id
bigint(20) unsigned
value_id
bigint(20) unsigned
lemma_id
bigint(20) unsigned
Table lemma_feature_values
Field
lemma_feature_value_id
lemma_feature_value
Table lemma_features
Field
lemma_feature_id
lemma_feature_name
Table lemmata
Field
lemma_id
modern_lemma
gloss
persistent_id
ne_label
portmanteau_lemma_id
language_id
Table lexica
Field
lexicon_id
lexicon_name
auto_increment
Default
NULL
NULL
NULL
NULL
Extra
auto_increment
Null Key Default

NO PRI NULL
YES
NULL
Null Key Default

NO PRI NULL
YES
NULL
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Null
NO
YES
YES
Type
bigint(20) unsigned
varchar(255)
varchar(255)
varchar(255)
varchar(255)
varchar(255)
bigint(20) unsigned
tinyint(3) unsigned
Type
bigint(20) unsigned
varchar(255)
NULL
Key
PRI
MUL
MUL
MUL
Type
bigint(20) unsigned
varchar(255)
Table lexical_source_lemma
Field
Type
lemma_source_id
bigint(20) unsigned
Null
NO
YES
YES
YES
Type
bigint(20) unsigned
varchar(255)
Table lemma_inflection_class
Field
lemma_inflection_class_id
lemma_id
inflection_class_id
PRI
UNI
Null
NO
YES
YES
YES
YES
YES
YES
YES
Null
NO
YES
Null
NO
Key
PRI
MUL
MUL
Extra
auto_increment
Default Extra
NULL auto_increment
NULL
NULL
Key Default
PRI
NULL
MUL NULL
NULL
NULL
NULL
NULL
MUL NULL
NULL
Key
PRI
Key
PRI
Extra
auto_increment
Default
NULL
NULL
Default
NULL
Extra
auto_increment
Extra
auto_increment
Extra
auto_increment
label
lemma_id
foreign_id
lexicon_id
varchar(255)
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
YES
YES MUL
YES
YES MUL
Table lexical_source_wordform
Field
Type
wordform_source_id
bigint(20) unsigned
foreign_id
varchar(255)
label
varchar(255)
wordform_id
bigint(20) unsigned
lexicon_id
bigint(20) unsigned
Table morphological_analyses
Field
morphological_analysis_id
arity
analyzed_lemma_id
morphological_operation_id
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Table morphological_operations
Field
morphological_operation_id
description
resulting_part_of_speech
Type
bigint(20) unsigned
varchar(255)
varchar(255)
Table multiple_lemmata_analyses
Field
multiple_lemmata_analysis_id
multiple_lemmata_analysis_part_id
part_number
nr_of_parts
Null
NO
YES
YES
YES
YES
Key
PRI
Default
NULL
NULL
NULL
MUL NULL
MUL NULL
Null
NO
YES
YES
YES
Key
PRI
Default Extra
NULL auto_increment
NULL
MUL NULL
MUL NULL
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
tinyint(3) unsigned
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Extra
auto_increment
Null Key Default Extra

NO PRI NULL auto_increment
YES
NULL
YES
NULL
Table multiple_lemmata_analysis_parts
Field
Type
multiple_lemmata_analysis_part_id
bigint(20) unsigned
part_of_speech
varchar(255)
lemma_id
bigint(20) unsigned
Table multiword_analyses
Field
multiword_analysis_id
arity
analyzed_lemma_id
NULL
NULL
NULL
NULL
Null
NO
YES
YES
YES
Null
NO
NO
NO
NO
Key Default
PRI
PRI
PRI
PRI
Extra
Null Key Default Extra

NO PRI NULL auto_increment
NO MUL
NO
Key
PRI
Default
NULL
NULL
MUL NULL
MUL NULL
Extra
auto_increment
Table multiword_operations
Field
description
resulting_pos
Type
bigint(20) unsigned
varchar(255)
varchar(255)
Table ne_variant_relation_types
Field
ne_variant_relation_name
ne_variant_relation_desciption
Type
int(32)
varchar(255)
text
Table ne_variant_relations
Field
first_lemma_id
second_lemma_id
Table paradigm_positions
Field
paradigm_position_id
paradigm_position_name
paradigm_position
paradigm_id
transformset_id
Table paradigms
Field
paradigm_id
paradigm_name
Type
int(32)
int(32)
int(32)
Type
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Type
bigint(20) unsigned
varchar(255)
Table part_morphological_analysis
Field
part_morphological_analysis_id
part_number
part_lemma_id
morphological_analysis_id
Table part_multiword_analysis
Field
part_multiword_analysis_id
part_number
part_lemma_id
multiword_analysis_id
Null Key Default

NO PRI NULL
YES
NULL
YES
NULL
Extra
auto_increment
Null Key Default

NO PRI NULL
YES
NULL
YES
NULL
Extra
auto_increment
Null
YES
YES
YES
Null
NO
YES
YES
YES
YES
Null
NO
YES
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Key
MUL
Key
PRI
Default
NULL
NULL
NULL
MUL NULL
MUL NULL
Key
PRI
Default
NULL
NULL
Null
NO
YES
YES
YES
Key
PRI
Null
NO
YES
YES
YES
Default
NULL
NULL
NULL
Extra
Extra
auto_increment
Extra
auto_increment
Default Extra
NULL auto_increment
NULL
MUL NULL
MUL NULL
Key
PRI
Default Extra
NULL auto_increment
NULL
MUL NULL
MUL NULL
Table pattern_applications
Field
position
pattern_id
number_of_patterns
Table patterns
Field
pattern_id
left_hand_side
right_hand_side
Type
bigint(20) unsigned
varchar(64)
varchar(64)
Table stem_types
Field
stem_type_id
stem_type_name
Table stems
Field
stem_id
stem_form
lemma_id
stem_type_id
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Type
bigint(20) unsigned
varchar(255)
Type
bigint(20) unsigned
varchar(255)
bigint(20) unsigned
bigint(20) unsigned
Table text_attestation_verifications
Field
Type
document_id
bigint(20) unsigned
wordform_id
bigint(20) unsigned
verification_date
datetime
verified_by
bigint(20) unsigned
Table text_attestations
Field
attestation_id
frequency
document_id
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Table token_attestation_verifications
Field
Type
document_id
bigint(20) unsigned
wordform_id
bigint(20) unsigned
start_pos
bigint(20) unsigned
Null
NO
YES
YES
NO
Null
NO
YES
YES
Key
PRI
MUL
Null
NO
YES
Null
NO
YES
YES
YES
Key
MUL
Default
Extra
NULL
NULL
Default
NULL
NULL
NULL
Extra
auto_increment
Key Default
PRI NULL
NULL
Extra
auto_increment
Key
PRI
MUL
MUL
Default
NULL
NULL
NULL
NULL
Null
NO
NO
NO
NO
Extra
auto_increment
Key
PRI
PRI
Null Key Default

NO PRI
NULL
YES
NULL
NO MUL
NO
Null
NO
NO
NO
Key
PRI
PRI
PRI
Default
Extra
Extra
auto_increment
Default
Extra
end_pos
verification_date
verified_by
bigint(20) unsigned
datetime
bigint(20) unsigned
Table token_attestations
Field
attestation_id
token_id
quote
derivation_id
document_id
start_pos
end_pos
Type
bigint(20) unsigned
bigint(20) unsigned
text
bigint(20) unsigned
bigint(20)
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
NO
NO
NO
Null Key
NO PRI
YES
YES
NO MUL
NO
NO
NO
NO
Table transcategorization_types
Field
Type
transcategorizationtype_id
bigint(20) unsigned
description
varchar(255)
main_pos
varchar(255)
sub_pos
varchar(255)
Table transcategorizations
Field
transcategorization_id
mainlemma_id
sublemma_id
transcategorizationtype_id
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Default
NULL
NULL
NULL
Extra
auto_increment
Null Key Default

NO PRI NULL
YES
NULL
YES
NULL
YES
NULL
Extra
auto_increment
Null
NO
YES
YES
YES
Key
PRI
MUL
MUL
MUL
Default Extra
NULL auto_increment
NULL
NULL
NULL
Table transformsets
Field
transformset_id
inflection_process
formal_tag
stem_type_id
Type
bigint(20) unsigned
varchar(255)
varchar(255)
bigint(20) unsigned
Null Key
NO PRI
YES
YES
YES MUL
Default
NULL
NULL
NULL
NULL
Extra
auto_increment
Table type_frequencies
Field
type_frequency_id
frequency
wordform_id
document_id
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Null Key
NO PRI
NO
NO MUL
NO
Default
NULL
Extra
auto_increment
Table users
Field
Type
Null
Key
Default
Extra
user_id
name
bigint(20) unsigned
varchar(255)
Table wordform_groups
Field
wordform_group_id
document_id
onset
offset
NO
YES
Type
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
bigint(20) unsigned
Table wordform_transform_instance
Field
Type
transform_instance_id
bigint(20) unsigned
transformset_id
bigint(20) unsigned
stem_id
bigint(20) unsigned
bigint(20) unsigned
Table wordforms
Field
wordform_id
wordform
wordform_lowercase
lastviewed_by
lastview_date
has_analysis
PRI
UNI
Type
bigint(20) unsigned
varchar(255)
varchar(255)
bigint(20)
datetime
bit(1)
NULL
NULL
Null
NO
NO
NO
NO
Null
NO
YES
YES
YES
Null
NO
NO
NO
YES
YES
YES
Key
PRI
MUL
MUL
MUL
auto_increment
Key
PRI
PRI
PRI
PRI
Default
NULL
NULL
NULL
NULL
Key
Default
PRI
NULL
UNI
MUL
NULL
NULL
NULL
Default
Extra
Extra
auto_increment
Extra
auto_increment
Appendix B: Filters for the export of relevant subsets from the lexicon
Filters for various applications will be developed as the workflow for lexicon development and
deployment progresses. They can be implemented either as SQL queries on the database or, for
instance, as XSLT queries on the XML export format.
A simple example: produce a word list with frequencies for all documents from 1749.
create view lemma_wordform_attestation as
  select modern_lemma, lemmata.lemma_id, wordform, pos,
         documents.document_id, documents.year_from, documents.year_to,
         documents.title, type_level_attestations.frequency
  from lemmata, analyzed_wordforms, wordforms, type_level_attestations, documents
  where analyzed_wordforms.lemma_id = lemmata.lemma_id
    and wordforms.wordform_id = analyzed_wordforms.wordform_id
    and type_level_attestations.analyzed_wordform_id = analyzed_wordforms.analyzed_wordform_id
    and type_level_attestations.document_id = documents.document_id;
select distinct wordform, sum(frequency) as frequency
  from lemma_wordform_attestation
  where year_from = 1749 and year_to = 1749
  group by wordform;
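To make the effect of the second query concrete, the following minimal Perl sketch performs the same selection and summation on a few in-memory rows shaped like the view lemma_wordform_attestation. The rows are made-up sample data and the function name wordlist_for_year is hypothetical; it is only an illustration of the grouping, not part of the lexicon tooling.

```perl
use strict;
use warnings;

# Made-up rows in the shape of the lemma_wordform_attestation view.
my @rows = (
    { wordform => "seggen",  year_from => 1749, year_to => 1749, frequency => 3 },
    { wordform => "seggen",  year_from => 1749, year_to => 1749, frequency => 2 },
    { wordform => "zegghen", year_from => 1749, year_to => 1749, frequency => 1 },
    { wordform => "zeggen",  year_from => 1800, year_to => 1810, frequency => 7 },
);

# Sum the frequencies per word form for documents dated exactly $year
# (the equivalent of "where year_from = ... and year_to = ... group by wordform").
sub wordlist_for_year {
    my ($year, @rows) = @_;
    my %freq;
    for my $row (@rows) {
        next unless $row->{year_from} == $year && $row->{year_to} == $year;
        $freq{ $row->{wordform} } += $row->{frequency};
    }
    return %freq;
}

my %wordlist = wordlist_for_year(1749, @rows);
printf "%s\t%d\n", $_, $wordlist{$_} for sort keys %wordlist;
```

Documents whose year range is wider than the requested year are skipped, exactly as in the SQL filter.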
Appendix C: Script for converting relational data to LMF (XML): relDB2xml.pl
The script writes its output to STDOUT. After the keyword require, the name of the file containing the
structure and further parameters has to be provided.
use strict;
use DBI;
use HTML::Entities;
# Every array for inserting tables contains these elements:
# - connection type ("->") REQUIRED
# - selection criteria REQUIRED
# - element name (can be empty string) REQUIRED
# - table name REQUIRED
# - list of arrays with subelements OPTIONAL
#
# Every array for inserting fields contains these elements:
# - connection type ("-") REQUIRED
# - element name (can be empty string) REQUIRED
# - field name REQUIRED
#
# Arrays for binding contain the following elements:
# - element name REQUIRED
# - list of arrays for subelements REQUIRED
require "NL_Structure.pl";
open (LOG, sprintf ">%s.log", getParam ("output"));
my $dbh = DBI->connect (sprintf ("DBI:mysql:database=%s;host=%s",
    getParam ("database"), getParam ("databasehost")),
  getParam ("user"), getParam ("password"));
if (!defined ($dbh)) {
  die sprintf "Unable to connect: %s\n", $DBI::errstr;
}
printf "%s\n", xmlHeader (getParam ("dtd"));
printf "%s", buildXml ("", @{getLmf()});
$dbh->disconnect;
close (LOG);
sub buildXml {
  my ($super, $type, @rest) = @_;
  if (@rest) {
    if ($type eq "->") { # handle table
      my ($constraint, $tag, $table) = splice (@rest, 0, 3);
      my @table = queryAggregate ($dbh, $super, $constraint, $table);
      my ($result, $openTag, $closeTag) = ("", "", "");
      if ($tag ne "") {
        $openTag = "<" . $tag . ">";
        $closeTag = "</" . $tag . ">";
      }
      foreach my $record (@table) {
        $result .= sprintf "%s%s%s\n", $openTag,
          join ("", map {buildXml ($record, @{$_})} @rest), $closeTag;
      }
      return $result;
    }
    elsif ($type eq "-") { # handle field
      my ($name, $key) = @rest;
      my @path = split (/\./, $name);
      return sprintf "<%s>%s</%s>\n", join ("><", @path), $$super{$key},
        join ("></", reverse @path);
    }
    else { # binding element
      if ($type =~ s!^([^.]+)\.!!) {
        return sprintf "<%s>\n%s</%s>\n", $1, buildXml ($super, $type, @rest), $1;
      }
      else {
        return sprintf "<%s>\n%s</%s>\n", $type,
          join ("", map {buildXml ($super, @{$_})} @rest), $type;
      }
    }
  }
}
sub queryAggregate {
  my ($dbh, $super, $constraint, $table) = @_;
  my $sth = "";
  if ($constraint ne "") {
    my ($leftTable, $leftKey, $rightTable, $rightKey) = split (/[.=]/, $constraint);
    my $query = sprintf "select * from %s where %s='%s'", $rightTable, $rightKey,
      $$super{$leftKey};
    $sth = $dbh->prepare ($query);
  }
  else {
    my $query = sprintf "select * from %s", $table;
    $sth = $dbh->prepare ($query);
  }
  $sth->execute or printf LOG "%s\n", $sth->errstr;
  my @result = ();
  my $hashref = "";
  while ($hashref = $sth->fetchrow_hashref) {
    push (@result, $hashref);
  }
  return @result;
}
sub xmlHeader {
  my ($name) = @_;
  if ($name ne "") {
    return sprintf "<?xml version='1.0'?>\n<!DOCTYPE lexicon SYSTEM '%s'>\n", $name;
  }
  else {
    return "<?xml version='1.0'?>\n";
  }
}
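The array conventions used by the script (table entries starting with "->", field entries starting with "-", and binding elements) can be tried out without a database. The following is a minimal sketch under stated assumptions, not the project's script: it replaces the DBI queries with a hypothetical in-memory hash %tables, uses made-up sample rows, and simplifies the output formatting (no dotted element paths, no newline after each field). The names select_rows and build_xml are introduced here for illustration only.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the database: table name => list of rows.
my %tables = (
    lemmata => [
        { lemma_id => 1, modern_lemma => "zeggen" },
    ],
    stems => [
        { stem_id => 10, lemma_id => 1, stem_form => "zeg" },
        { stem_id => 11, lemma_id => 1, stem_form => "zei" },
    ],
);

# Mirrors the selection logic of queryAggregate, but filters in-memory rows.
sub select_rows {
    my ($super, $constraint, $table) = @_;
    return @{ $tables{$table} } if $constraint eq "";
    my (undef, $leftKey, $rightTable, $rightKey) = split /[.=]/, $constraint;
    return grep { $_->{$rightKey} eq $super->{$leftKey} } @{ $tables{$rightTable} };
}

sub build_xml {
    my ($super, $type, @rest) = @_;
    if ($type eq "->") {                       # table entry
        my ($constraint, $tag, $table) = splice @rest, 0, 3;
        my $out = "";
        foreach my $record (select_rows($super, $constraint, $table)) {
            my $inner = join "", map { build_xml($record, @$_) } @rest;
            $out .= $tag ne "" ? "<$tag>$inner</$tag>\n" : $inner;
        }
        return $out;
    }
    if ($type eq "-") {                        # field entry
        my ($name, $key) = @rest;
        return "<$name>$super->{$key}</$name>";
    }
    # binding element: wraps its subelements without querying anything
    return "<$type>\n" . join("", map { build_xml($super, @$_) } @rest) . "</$type>\n";
}

my $structure =
    ["lexicon",
        ["->", "", "lexical_entry", "lemmata",
            ["-", "modern_lemma", "modern_lemma"],
            ["->", "lemmata.lemma_id=stems.lemma_id", "stem", "stems",
                ["-", "stem_form", "stem_form"],
            ],
        ],
    ];

my $xml = build_xml({}, @$structure);
print $xml;
```

Each nested "->" entry is evaluated once per record of its parent, which is why the constraint refers to a key of the enclosing table.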
Appendix D: Structure Definition for the Dutch Lexicon.
The file contains Perl code. Two data structures are specified: a hash with some details for connecting
to a relational database, and a nested array ($lmf) describing how tables and fields map to XML
elements. The parameter output is used to provide a name for the log file. The
keyword dtd is optional.
The file further contains two small functions needed to pass the data to the main script. These should
not be changed.
use strict;
my %params =
  ("output" => "NL_Lexicon",
   "database" => "EE3",
   "databasehost" => "impactdb.inl.loc",
   "password" => "impact",
   "user" => "impact",
   "dtd" => "NL_Structure.dtd"
  );
my $lmf =
  ["lexicon",
    # rule section
    ["->", "", "lemma_feature", "lemma_features"],
    ["->", "", "lemma_feature_value", "lemma_feature_values"],
    ["->", "", "inflection_class", "inflection_classes"],
    ["->", "", "derivation_pattern", "patterns"],
    ["->", "", "transcategorization_type", "transcategorization_types",
      ["->",
       "transcategorization_types.transcategorizationtype_id=conversion_rules.transcategorization_id",
       "rule", "conversion_rules"]
    ],
    ["->", "", "mwe_pattern", "multiword_operations",
      ["-", "multiword_operation_id", "multiword_operation_id"],
      ["-", "description", "description"],
      ["-", "resulting_pos", "resulting_pos"],
    ],
    ["->",
     "",
     "morphological_pattern.transformation_set.process",
     "morphological_operations"],
    ["->",
     "",
     "morphological_pattern.transformation_set",
     "transformsets",
      ["->",
       "transformsets.stem_type_id=stem_types.stem_type_id",
       "transform_category", "stem_types"],
      ["->",
       "transformsets.paradigm_position_name=paradigm_positions.paradigm_position_name",
       "process", "paradigm_positions",
        ["->",
         "paradigm_positions.paradigm_id=paradigms.paradigm_id",
         "paradigm", "paradigms"],
      ],
    ],
    # lexical entries
    ["->", "", "lexical_entry", "multiword_analyses",
      ["-", "multiword_analysis_id", "multiword_analysis_id"],
      ["-", "multiword_operation_id", "mwe_pattern"],
      ["-", "arity", "arity"],
      ["list_of_components",
        ["->",
         "multiword_analyses.multiword_analysis_id=part_multiword_analysis.multiword_analysis_id",
         "component", "part_multiword_analysis",
          ["-", "part_number", "part_number"],
          ["-", "lemma_id", "part_lemma_id"]
        ]
      ],
    ],
    ["->", "", "lexical_entry", "transcategorizations",
      ["-", "", "transcategorization_type"],
      ["-", "component.mainlemma_id", "mainlemma_id"],
      ["-", "component.sublemma_id", "sublemma_id"],
    ],
    ["->", "", "lexical_entry", "lemmata",
      ["-", "lemma_id", "lemma_id"],
      ["-", "modern_lemma", "modern_lemma"],
      ["-", "gloss", "gloss"],
      ["-", "POS", "lemma_part_of_speech"],
      ["-", "ne_label", "ne_label"],
      #
      ["-", "language_id", "language_id"],
      ["-", "portmanteau_lemma_id", "portmanteau_lemma_id"],
      ["->",
       "lemmata.lemma_id=alternate_modern_lemmata.base_lemma_id",
       "alternate_modern_lemma", "alternate_modern_lemmata",
        ["-", "alternate_lemma", "alternate_lemma"],
      ],
      ["->",
       "lemmata.lemma_id=lemma_inflection_class.lemma_id",
       "inflection_class", "lemma_inflection_class",
        ["-", "inflection_class_id", "inflection_class_id"],
      ],
      ["->",
       "lemmata.lemma_id=lexical_source_lemma.lemma_id",
       "source",
       "lexical_source_lemma",
        ["-", "label", "label"],
        ["-", "foreign_id", "foreign_id"],
        ["-", "lexicon_id", "lexicon_id"],
      ],
      ["->", "lemmata.lemma_id=stems.lemma_id", "stem", "stems",
        ["-", "stem_form", "stem_form"],
        ["-", "stem_id", "stem_id"],
        ["->",
         "stems.stem_type_id=stem_types.stem_type_id",
         "",
         "stem_types",
          ["-", "name", "stem_type_name"],
        ],
      ],
      ["->",
       "lemmata.lemma_id=lemma_feature_assignments.lemma_id",
       "feature", "lemma_feature_assignments",
        ["->",
         "lemma_feature_assignments.feature_id=lemma_features.lemma_feature_id", "",
         "lemma_features",
          ["-", "feature_id", "feature_id"],
          ["-", "name", "lemma_feature_name"],
        ],
        ["->",
         "lemma_feature_assignments.value_id=lemma_feature_values.lemma_feature_value_id",
         "value", "lemma_feature_values",
          ["-", "value_id", "lemma_feature_value_id"],
          ["-", "value", "lemma_feature_value"],
        ]
      ],
      ["->",
       "lemmata.lemma_id=morphological_analyses.analyzed_lemma_id",
       "analysis", "morphological_analyses",
        ["-", "morphological_operation_id", "morphological_operation_id"],
        ["->",
         "morphological_analyses.morphological_analysis_id=part_morphological_analysis.morphological_analysis_id",
         "component", "part_morphological_analysis",
          ["-", "number", "part_number"],
          ["-", "lemma_id", "part_lemma_id"],
        ]
      ],
      ["->",
       "lemmata.lemma_id=analyzed_wordforms.lemma_id",
       "wordform",
       "analyzed_wordforms",
        ["->", "analyzed_wordforms.derivation_id=derivations.derivation_id",
         "", "derivations",
          ["pattern",
            ["->",
             "derivations.derivation_id=pattern_applications.derivation_id",
             "",
             "pattern_applications",
              ["-", "position", "position"],
              ["->",
               "pattern_applications.pattern_id=patterns.pattern_id",
               "",
               "patterns",
                ["-", "left_hand_side", "left_hand_side"],
                ["-", "right_hand_side", "right_hand_side"],
              ]
            ]
          ]
        ],
        ["->",
         "analyzed_wordforms.wordform_id=lexical_source_wordform.wordform_id",
         "source", "lexical_source_wordform"],
        ["form_representation",
          ["->",
           "analyzed_wordforms.wordform_id=wordforms.wordform_id",
           "",
           "wordforms",
            ["-", "wordform_id", "wordform_id"],
            ["-", "written_form", "wordform"],
          ],
          ["->",
           "wordforms.analyzed_wordform_id=text_attestations.analyzed_wordform_id",
           "attestation", "text_attestations",
            ["-", "id", "attestation_id"],
            ["-", "frequency", "frequency"],
            ["-", "document_id", "document_id"],
          ],
          ["->",
           "analyzed_wordforms.analyzed_wordform_id=token_attestations.analyzed_wordform_id",
           "attestation", "token_attestations",
            ["-", "id", "attestation_id"],
            ["-", "token_id", "token_id"],
            ["-", "quote", "quote"],
            ["-", "derivation_id", "derivation_id"],
            ["-", "document_id", "document_id"],
            ["-", "start_pos", "start_pos"],
            ["-", "end_pos", "end_pos"],
          ],
        ],
        ["->",
         "analyzed_wordforms.analyzed_wordform_id=wordform_transform_instance.analyzed_wordform_id",
         "", "wordform_transform_instance"]
      ]
    ]
  ];
sub getLmf {
  return $lmf;
}
sub getParam {
  my ($key) = @_;
  return $params{$key};
}
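Because a structure definition is a deeply nested array, a missing element or a misplaced bracket is easy to overlook. The following is a minimal sketch of a consistency check, assuming the conventions described for relDB2xml.pl (table entries carry three scalars after "->", field entries carry two scalars after "-", binding elements carry at least one subelement). The function check_entry and the sample definition are hypothetical and are not part of the delivered scripts.

```perl
use strict;
use warnings;

# Returns a list of problem descriptions for one entry and all its subelements.
sub check_entry {
    my ($entry, $path) = @_;
    my ($type, @rest) = @$entry;
    my @scalars  = grep { !ref $_ } @rest;            # names, constraints, keys
    my @children = grep { ref $_ eq "ARRAY" } @rest;  # nested entries
    my @problems;
    if ($type eq "->") {
        push @problems, "$path: table entry expects constraint, element name, table name"
            unless @scalars == 3;
    }
    elsif ($type eq "-") {
        push @problems, "$path: field entry expects element name and field name"
            unless @scalars == 2 && !@children;
    }
    else {
        push @problems, "$path: binding element '$type' has no subelements"
            unless @children;
    }
    my $i = 0;
    foreach my $child (@children) {
        push @problems, check_entry($child, "$path/" . $i++);
    }
    return @problems;
}

# Hypothetical sample definition with one deliberately incomplete field entry.
my $definition =
    ["lexicon",
        ["->", "", "lexical_entry", "multiword_analyses",
            ["-", "multiword_analysis_id"],           # field name missing
            ["-", "arity", "arity"],
        ],
    ];

my @problems = check_entry($definition, "lexicon");
print "$_\n" for @problems;
```

The path in each message is built from the positions of the nested entries, so a reported problem can be located by counting subelements from the top of the definition.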