
White Paper Series

THE ROMANIAN LANGUAGE IN THE DIGITAL AGE
LIMBA ROMÂNĂ ÎN ERA DIGITALĂ

Diana Trandabăț [1, 2]
Elena Irimia [3]
Verginica Barbu Mititelu [3]
Dan Cristea [1, 2]
Dan Tufiș [3]

[1] University Alexandru Ioan Cuza of Iași
[2] Romanian Academy, Institute of Computer Science
[3] Romanian Academy, Research Institute for AI

Georg Rehm, Hans Uszkoreit (editors)
PREFACE

This white paper is part of a series that promotes knowledge about language technology and its potential. It addresses journalists, politicians, language communities, educators and others. The availability and use of language technology in Europe varies between languages. Consequently, the actions that are required to further support research and development of language technologies also differ. The required actions depend on many factors, such as the complexity of a given language and the size of its community. META-NET, a Network of Excellence funded by the European Commission, has conducted an analysis of current language resources and technologies in this white paper series (p. 89). The analysis focused on the 23 official European languages as well as other important national and regional languages in Europe. The results of this analysis suggest that there are tremendous deficits in technology support and significant research gaps for each language. The detailed expert analysis and assessment of the current situation given here will help maximise the impact of additional research. As of November 2011, META-NET consists of 54 research centres from 33 European countries (p. 85). META-NET is working with stakeholders from the economy (software companies, technology providers, users), government agencies, research organisations, non-governmental organisations, language communities and European universities. Together with these communities, META-NET is creating a common technology vision and strategic research agenda for multilingual Europe 2020.

META-NET: office@meta-net.eu, http://www.meta-net.eu

The authors of this document are grateful to the authors of the White Paper on German for permission to re-use selected language-independent materials from their document [1].

The development of this white paper has been funded by the Seventh Framework Programme and the ICT Policy Support Programme of the European Commission under the contracts T4ME (Grant Agreement 249 119), CESAR (Grant Agreement 271 022), METANET4U (Grant Agreement 270 893) and META-NORD (Grant Agreement 270 899).

CONTENTS

THE ROMANIAN LANGUAGE IN THE DIGITAL AGE

1 Executive Summary

2 Languages at Risk: a Challenge for Language Technology
2.1 Language Borders Hold back the European Information Society
2.2 Our Languages at Risk
2.3 Language Technology is a Key Enabling Technology
2.4 Opportunities for Language Technology
2.5 Challenges Facing Language Technology
2.6 Language Acquisition in Humans and Machines

3 The Romanian Language in the European Information Society
3.1 General Facts
3.2 Particularities of the Romanian Language
3.3 Recent Developments
3.4 Official Language Protection in Romania
3.5 Language in Education
3.6 International Aspects
3.7 Romanian on the Internet

4 Language Technology Support for Romanian
4.1 Application Architectures
4.2 Core Application Areas
4.3 Other Application Areas
4.4 Educational Programmes
4.5 National Projects and Initiatives
4.6 Availability of Tools and Resources
4.7 Cross-language Comparison
4.8 Conclusions

5 About META-NET

A References
B META-NET Members
C The META-NET White Paper Series


1 EXECUTIVE SUMMARY

During the last 60 years, Europe has become a distinct political and economic structure while preserving its cultural and linguistic diversity. This means that, from Portuguese to Polish and from Italian to Icelandic, everyday communication between European citizens, as well as communication in business and politics, inevitably runs into language barriers. The institutions of the European Union spend about one billion euros a year on maintaining their multilingualism policy, for example on translating texts and interpreting speeches. But does multilingualism have to be such a burden? Modern language technologies and linguistic research can contribute significantly to lowering these language borders. Combined with intelligent devices and applications, language technologies will in future be able to help Europeans communicate easily with one another and do business together, even if they do not speak the same language.

Language technologies build bridges for Europe's future.

Information technology is changing our everyday lives. We already use computers to write, edit, calculate and search for information and, more and more often, to read, listen to music, view photographs and watch films. We carry small computers in our pockets and use them to make phone calls, write e-mails, get information from the Internet and keep ourselves entertained, wherever we are. How is the Romanian language affected by this massive digitisation of information, knowledge and everyday communication? Will it change, or even disappear?

All our computers are linked in an ever denser and more powerful global network. The girl in Buenos Aires, the customs officer in Constanța and the engineer in Kathmandu can all chat with their friends on Facebook, but they are unlikely to meet one another in online communities and forums. If they want to find out how to treat ringing in the ears, they will probably look for an answer on Wikipedia, but even then they will not read the same article. When Europe's Internet users discuss the effects of the Fukushima nuclear accident on European energy policy in forums and chat rooms, they do so in separate language communities. Although the Internet connects people, there is still a clear separation according to the language each user speaks. Will it always be like this?

Language technologies: the key to the future.

In science fiction films, everyone speaks the same language. Could it be Romanian, even though we have had only one Romanian astronaut? Many of the world's 6,000 languages will not survive in a globalised digital information society. It is estimated that at least 2,000 languages are doomed to extinction in the coming decades. Others will continue to play an important role in families and in restricted areas, but not in academia or in business. What are the Romanian language's chances of survival?
Spoken by approximately 29 million people worldwide, Romanian is present not only in books, films and TV channels, but also in the digital information space. The Internet market in Romania is growing steadily. More and more Romanians have access to a computer at home and are also Internet users. The .ro domain currently accounts for 0.4% of existing web pages, comparable to the .eu domain. Romanian has a number of specific characteristics that contribute to the richness of the language but can also be a challenge for the computational processing of natural language.

The machine translation and speech processing tools currently available on the market still fall far short of the standards they are expected to reach. The dominant actors in the field are mainly private, profit-oriented enterprises based in North America. As early as the late 1970s, the European Union understood the importance of language technologies as a driver of European unity and began funding its first research projects, such as EUROTRA. At the same time, national projects were set up that generated valuable results but never led to concerted action at the European level. In contrast to this highly selective funding effort, other multilingual societies, such as India (with 22 official languages) and South Africa (with 11 official languages), have recently set up long-term national programmes for language research and technology development.

There are some concerns about the ever wider use of anglicisms, and some linguists even fear that Romanian will be smothered by English words and expressions. Our study nevertheless indicates that this concern is unfounded.

Much like the re-Latinisation process of the 19th century, which followed liberation from Greek and Ottoman domination, Romanian has over the last twenty years moved from totalitarian language (the "wooden language", one-way discourse, etc.) to open usage, in which new linguistic patterns have to adapt to the social and cultural transition. Thus, like many other languages, Romanian is undergoing a continuous process of internationalisation under the influence of Anglo-Saxon vocabulary.

Our main concern should not be the gradual anglicisation of Romanian, but its complete disappearance from major areas of our personal lives. The worry is not about fields such as science, aviation and the global financial markets, which really do need a worldwide lingua franca, but about the many areas of everyday life in which it is far more important to be close to a country's citizens than to international partners, for example domestic policy, administrative procedures, law and culture.

Information and communication technology is now preparing for the next revolution. After personal computers, networks, miniaturisation, multimedia and mobile devices, the next generation of technology will include software that understands not just spoken or written letters and sounds, but whole words and sentences, and that supports users because it speaks and understands their language. Forerunners of this development are the free online service Google Translate, which translates between 57 languages, Watson, the IBM supercomputer that was able to defeat the US champion in the game of Jeopardy, and Siri, Apple's mobile assistant for the iPhone, which can react to voice commands and answer questions in English, German, French and Japanese.

The next generation of information technology will master human language to such an extent that human users will be able to communicate using the technology in their own language. Devices will be able to find automatically, at a simple voice command from the user, the most important news and information from the digital store
of knowledge. Language-based technologies will be able to translate automatically or assist interpreters, summarise conversations and documents, and actively support users in learning.

The new information and communication technologies will enable industrial and service robots (currently under development in research laboratories) to understand exactly what their users want of them and then to report proudly on their achievements.

This level of performance means going far beyond simple character sets and lexicons, spell checkers and pronunciation rules. Technology must move past simplistic approaches and start modelling language in an all-encompassing way, taking both syntax and semantics into account in order to understand questions and generate complete and relevant answers.

For Romanian, research at universities and research institutes in Romania and the Republic of Moldova has led to the development of high-quality systems, as well as widely applicable models and theories. However, the scope of the resources and the range of tools are still very limited compared with those available for English, and they are not sufficient in quality or quantity to develop the technologies needed to support a truly multilingual knowledge society. The underdevelopment felt in the area of language resources (in both quantity and quality) greatly hampers efforts to develop language technologies and applications.

Language technologies help to unify Europe.

An unclear legal situation restricts the use of digital texts, such as those published online by newspapers, for empirical linguistic research and for language technologies, for example for building statistical language models. Together with politicians and policy makers, researchers should be able to help establish laws or regulations that allow them to use publicly available texts for language-related research and development.

There is also a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of insufficient or no funding. In addition, there is generally poor coordination with programmes in other EU countries and at the European Commission level (as happens, for example, with the PSP-ICT programmes, which involve Romanian universities among their participants but are not supported by the government with consistent co-funding). The need for large amounts of data and the extreme complexity of language technology systems make it vital to develop a new infrastructure and a more coherent organisation of research funding in the field of natural language technologies, if we are to have any hope of using the new generation of information and communication technologies in those areas of private or public life in which we speak Romanian.

In conclusion, we can consider that, for the time being, Romanian is not in danger. However, the whole situation could change dramatically once a new generation of technologies begins to master human language really effectively. Through improvements in machine translation, language technology will help overcome language barriers, but it will only be able to operate between those languages that have managed to survive in the digital world. If adequate language technology is available, it will be able to ensure the survival of the language; otherwise, even larger languages will come under severe pressure.

Judging by the experience gained so far,
today's hybrid language technologies, which combine deep processing with statistical methods, seem able to close the gap between European languages. As this white paper series shows, there are dramatic differences between the member states of the European Union in the availability of language solutions and in the state of research.

META-NET's long-term goal is to introduce high-quality language technology for all languages, in order to achieve political and economic unity through cultural diversity. Technology will help tear down existing barriers and build bridges between Europe's languages. This requires all stakeholders (in politics, research, business and society) to unite their efforts in the future.

This white paper series complements other strategic actions of the META-NET network of excellence (see the appendix for an overview). Up-to-date information, such as the latest version of the META-NET vision paper [2] or the Strategic Research Agenda, can be found on the META-NET website: http://www.meta-net.eu.
2 LANGUAGES AT RISK: A CHALLENGE FOR LANGUAGE TECHNOLOGY

We are witnessing a digital revolution that is having a dramatic impact on communication and society. Recent developments in digital information and communication technology are sometimes compared to Gutenberg's invention of the printing press. What can this analogy tell us about the future of the European information society in general, and about the future of our languages in particular?

The digital revolution is comparable to Gutenberg's invention of the printing press.

Gutenberg's invention was followed by real advances in communication and in the exchange of information, thanks to efforts such as the translation of religious texts into the language of the congregation. In the following centuries, cultural techniques were developed to improve language processing and the exchange of knowledge:

- The orthographic and grammatical standardisation of major languages enabled the rapid dissemination of new cultural and scientific ideas.
- The development of official languages made it possible for citizens to communicate within certain (often political) borders.
- The teaching and translation of foreign languages facilitated exchanges across languages.
- The creation of journalistic and bibliographic principles ensured the quality and availability of printed material.
- The creation of different types of media, such as newspapers, radio, television and books, satisfied the need for communication.

In the last 20 years, information technology has helped to automate and facilitate many processes:

- Desktop publishing software now replaces typewriting and typesetting.
- Microsoft PowerPoint replaces the overhead projector.
- E-mail services allow documents to be sent and received faster than by fax.
- Skype allows calls over the Internet and hosts virtual meetings.
- Audio and video encoding formats make it easy to exchange multimedia content.
- Search engines provide keyword-based access to an ever growing number of web pages.
- Online services such as Google Translate produce quick, if approximate, translations.
- Social media platforms such as Facebook, Twitter and Google+ facilitate collaboration and information sharing.
Although such tools and applications are helpful, they are not sufficient to bring about a sustainable, multilingual European information society in which information and goods can flow freely.

2.1 LANGUAGE BORDERS HOLD BACK THE EUROPEAN INFORMATION SOCIETY

We cannot know precisely what the future information society will look like. But there is a strong likelihood that the revolution in communication technology will bring people who speak different languages together in new ways. The need to communicate pushes people to learn new foreign languages, and it obliges developers to create new technology applications that ensure mutual understanding and access to shared knowledge. It is clear that the progress of society now demands a quality of communication different from that of a few years ago.

The global economy and information space confront us with more languages, more speakers and more content.

In a global economic and information space, we are confronted with more languages, more speakers and more content, and we have to interact quickly with new types of media. The current popularity of social media (Wikipedia, Facebook, Twitter and YouTube) is only the tip of the iceberg.

Today we can receive gigabytes of text from any corner of the planet in a few seconds, only to find that the text is in a language we do not understand. According to a recent report requested by the European Commission, 57% of Internet users in Europe purchase goods and services in languages other than their mother tongue (English is the most common foreign language, followed by French, German and Spanish). 55% of users read in a foreign language, while only 35% use another language to write e-mails or post comments on the web [3]. A few years ago, English was regarded as the lingua franca (working language) of the Internet, since the vast majority of content was written in it, but the situation has now changed drastically. The amount of online content in other, non-European languages (such as Asian or Middle Eastern ones) has exploded.

Surprisingly, the pronounced digital divide caused by language borders has not yet gained much attention in public discourse; yet it raises a very pressing question: which European languages will thrive in the virtual information and knowledge society, and which are doomed to disappear?

2.2 OUR LANGUAGES AT RISK

The printing press, although it contributed to an invaluable exchange of information in Europe, also led to the extinction of many European languages. Regional and minority languages were rarely printed, and languages such as Dalmatian or Cornish were transmitted only orally, which restricted their adoption, spread and use. Will the Internet have the same effect on our languages?

The approximately 80 languages spoken in Europe today are one of its richest and most important cultural assets, and also an important part of its unique social model [4]. While popular languages such as English or Spanish will certainly remain present in the emerging digital marketplace, many European languages could be cut off from digital communication and could become irrelevant for
the Internet society. Such a development would weaken Europe's position in the global market and would run counter to the strategic goal of ensuring equal participation for every European citizen regardless of language.

The wide variety of Europe's languages is one of its most important cultural assets and an essential part of its social success.

According to a recent UNESCO report on multilingualism, languages are an essential medium for the exercise of fundamental rights such as political expression, education and participation in society [5].

2.3 LANGUAGE TECHNOLOGY IS A KEY ENABLING TECHNOLOGY

In the past, financial investment in language preservation focused on language education and translation. For example, according to some estimates, the European market for translation, interpretation, software localisation and website globalisation was worth 8.4 billion euros in 2008 and is expected to grow by 10% per year [6]. Yet this figure covers only a small part of current and future needs in communication between citizens. The most compelling solution for ensuring the breadth and depth of language use in tomorrow's Europe is to use appropriate technology, just as we use technology for transport, energy and other needs.

Language technologies (which cover all forms of written text and spoken discourse) can help people collaborate, conduct business, share knowledge and take part in political and social debate, regardless of language barriers or computer skills. Language technology usually operates behind the scenes, inside complex systems that help us, for example, to:

- find information with an Internet search engine;
- check spelling and grammar in a word processor;
- view product recommendations in an online shop;
- listen to the instructions of a navigation system;
- translate web pages with an online service.

Language technology consists of a number of core applications that enable auxiliary processes within a larger application. The purpose of the white paper series produced within the META-NET project is to establish how advanced these technologies are for each of the European languages.

Europe needs robust and affordable language technologies adapted to all European languages.

To maintain its position at the forefront of global innovation, Europe needs language technologies that are adapted to all European languages and are robust, affordable and well integrated into complex software environments. An interactive, multimedia and multilingual user experience in the virtual environment is not possible without language technology.

2.4 OPPORTUNITIES FOR LANGUAGE TECHNOLOGY

In the world of print, the outstanding technological achievement was the rapid duplication of the image of a page of text using a printing press. People were left with
munca grea de a cuta, citi, traduce i rezuma cunotine. limbajului poate juca un rol important. Popularitatea
A trebuit s ateptm pn la Edison pentru a nregistra aplicaiilor de media social precum Twitter i Facebook
spoken language and, again, his technology simply made analog copies.

Digital language technology enables the development of applications such as machine translation, content generation, information processing and knowledge management for all European languages. It can also equip household appliances, machinery, vehicles, computers and robots with intuitive, language-based interfaces. Although many prototypes already exist, commercial and industrial applications are still at an early stage of development. Recent achievements in research and development have created a genuine avalanche of opportunities for applying language technology (LT). For example, machine translation (MT) offers reasonable accuracy in specific domains, and a number of experimental applications provide information and knowledge management, as well as content production, in many European languages.

Language technology helps overcome the handicap induced by European linguistic diversity.

As in most cases, the first language applications, such as voice interfaces and dialogue systems, were developed for highly specialized domains and often show limited performance. There are huge market opportunities in the education and entertainment sectors for integrating language technology into games, cultural heritage sites, edutainment offerings (education through entertainment), simulation environments or training programs. Mobile information services, computer-assisted language learning software, e-learning environments, self-assessment tools and plagiarism detection tools are just a few examples of application areas in which language technology

suggests yet another occasion where sophisticated language technologies are needed for monitoring posts, summarizing discussions, identifying opinion trends, detecting emotional responses, and discovering copyright infringements or cases of abuse.

Language technology represents a tremendous opportunity for the European Union, both economically and culturally. Multilingualism has become the rule in Europe. European companies, organizations and schools are likewise multinational and diverse. Citizens want to communicate across the language borders that persist in the European Common Market, and language technology can help overcome these barriers while supporting the free and open use of languages. Looking even further ahead, an innovative, multilingual European language technology could become a benchmark for our global partners and their multilingual communities. Language technology can be seen as a form of assistive technology that helps overcome the handicap induced by linguistic diversity and makes language communities more accessible. An active field of research is technology for rescue operations in disaster areas. In such high-risk environments, the accuracy of communication can be a matter of life and death. Intelligent robots with multilingual capabilities have the potential to save lives.

2.5 THE CHALLENGES FACING LANGUAGE TECHNOLOGY

Although language technology has developed considerably in recent years, the current pace of technological progress and innovation is too slow. Basic technologies that are widely used, such as correction options for

grammar and spelling in text editors, are usually monolingual and available for only a few languages.

Online machine translation services, although useful for quickly generating a reasonable approximation of a document's content, run into many difficulties when accurate and complete translations are required. Because of the complexity of human language, modelling our languages in software and testing them in the real world is a costly undertaking that requires sustained funding commitments. Europe must therefore maintain its pioneering role in facing the technological challenges raised by a multilingual community by inventing new methods to accelerate development. These could include both new directions in computational techniques and crowdsourcing (harnessing the knowledge of the crowd).

The current pace of technological progress is too slow.

2.6 LANGUAGE ACQUISITION IN HUMANS AND MACHINES

To illustrate how computers handle language, and to explain why language acquisition is such a difficult task, we take a brief look at how humans acquire their first and second languages, and then at how systems based on language technology work.

Humans acquire language skills in two different ways: by learning from examples and by learning the rules that underlie the language.

Humans acquire language skills in two distinct ways. Children learn a language by listening to interactions between parents, siblings or other family members. At around two years of age, children produce their first words or short phrases. This is possible because humans have a genetic predisposition to imitate and understand what they hear.

Learning a second language requires much greater cognitive effort when the child is not immersed in a language community of native speakers. At school age, foreign languages are usually acquired by learning their grammatical structure, vocabulary and spelling from books and educational materials that describe linguistic knowledge through abstract rules, tables or example texts. Learning a foreign language takes a lot of time and effort, and it becomes harder with age.

The two main types of LT systems acquire language capabilities in a manner similar to humans. Statistical (or data-driven) approaches obtain linguistic knowledge from a vast collection of concrete examples. While texts in a single language are sufficient for certain systems, such as spelling checkers, other applications, such as machine translation systems, need texts in two or more languages. Statistical machine learning algorithms learn patterns for correctly translating words, short phrases or even whole sentences.

The statistical approach can require millions of examples, and performance improves with the number of texts analyzed. This is one of the reasons why search engine providers are eager to collect as much written material as possible. Spelling correction in text editors and services such as Google Search and Google Translate rely on statistical approaches. The great advantage of statistics is that the machine learns quickly, in repeated cycles of

training, although the quality of learning can vary arbitrarily.

The second approach to language technology is the development of rule-based systems. Experts in linguistics, computational linguistics or computer science encode grammatical analyses (translation rules) and compile vocabulary lists (lexicons). Building a rule-based system takes a great deal of time and intense effort, as well as highly specialized experts. Some of the best-performing rule-based machine translation systems have been under constant development for more than twenty years. The advantage of these systems is that the experts have more detailed control over language processing. This makes it possible to correct mistakes in the software systematically and to give detailed feedback to the user, especially when rule-based systems are used for language learning. Because of financial constraints, rule-based language technology systems have so far been developed for only a few major languages.

Since the strengths and weaknesses of statistical and rule-based systems tend to be complementary, current research focuses on hybrid approaches that combine the two methodologies. So far, however, these approaches have been less successful in industrial applications than in research laboratories.

As we have seen in this chapter, many applications widely used in today's information society rely on language technology. Because of its multilingual community, this is particularly true of Europe's economic and information space. Although language technology has made considerable progress in recent years, there is still huge potential for improving the quality of language technology systems. In what follows, we describe the role of Romanian in the European information society and assess the current state of language technology research for Romanian.
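The contrast between the two approaches described in this chapter can be made concrete with a small sketch. The Python snippet below is illustrative only: the toy corpus, the function names (`train`, `edits1`, `correct_statistical`, `correct_rule_based`) and the two hand-written rules are assumptions made for this example, not part of any system discussed here. The statistical corrector learns word frequencies from example text and proposes the most frequent known word within one edit; the rule-based corrector applies rules written by hand.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyzăâîșț"

def train(corpus: str) -> Counter:
    """Data-driven step: count how often each word occurs in the examples."""
    return Counter(corpus.lower().split())

def edits1(word: str) -> set:
    """All strings one deletion, substitution or insertion away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts

def correct_statistical(word: str, counts: Counter) -> str:
    """Return the most frequent known word at edit distance <= 1."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=lambda w: counts[w]) if candidates else word

# Hand-written rules (hypothetical rule set): older spellings -> current ones.
RULES = {"sînt": "sunt", "cîine": "câine"}

def correct_rule_based(word: str) -> str:
    """Apply an expert-written rule if one exists; otherwise keep the word."""
    return RULES.get(word, word)

# Toy corpus, assumed for illustration; a real system would need far more text.
counts = train("limba română este o limbă romanică limba este vie")
print(correct_statistical("limbs", counts))
print(correct_rule_based("sînt"))
```

The statistical corrector improves only by seeing more text, while the rule-based one improves only when an expert adds a rule, which mirrors the trade-off between data volume and expert effort described above.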

3 THE ROMANIAN LANGUAGE IN THE EUROPEAN INFORMATION SOCIETY

3.1 GENERAL FACTS

Spoken by approximately 29,000,000 people [7], Romanian is the mother tongue of 25,000,000 speakers: around 21,500,000 in Romania [8] plus approx. 3,500,000 in the Republic of Moldova [9] (where the language is officially called Moldovan). In Romania's neighboring countries (Albania, Bulgaria, Croatia, Greece, Hungary, the Former Yugoslav Republic of Macedonia, Serbia, Ukraine) and in the immigrant communities of Australia, Canada, Israel, Latin America, Turkey, the USA and other European and Asian countries, there are approximately 4,000,000 more native speakers of Romanian [10].

Romanian is also an official language in the Autonomous Province of Vojvodina in Serbia, in the autonomous Mount Athos in Greece, in the European Union and in the Latin Union; it is recognized as a minority language in Ukraine.

Romanian has 4 dialects [11]: Daco-Romanian, Aromanian (spoken by approximately 600,000 people in Albania, Bulgaria, Greece and Macedonia), Istro-Romanian (15,000 speakers in 2 small areas of the Istrian Peninsula, Croatia) and Megleno-Romanian (around 5,000 speakers in Greece and Macedonia). Because of their small number of speakers, the last three dialects are included in the UNESCO Red Book of Endangered Languages.

There are 18 officially recognized ethnic minorities in Romania; according to the official results of the latest census (2002), the most numerous were Hungarians (1,431,807) and Roma (535,140), followed by Germans, Ukrainians, Lipovan Russians, Turks, Serbs, Croats, Slovenes, Tatars, Slovaks, Bulgarians, Jews, Czechs, Poles, Greeks, Armenians etc. For all minorities, official language policy in Romania guarantees their right to be protected as language communities and to use their mother tongue in private and public, cultural and social, economic and communication settings. Nevertheless, Article 13 of the Constitution states that the official language in Romania is Romanian. Moreover, Law no. 500 of 12 November 2004 stipulates that any text of public interest (whether oral or written) must be translated or adapted into Romanian [12].

3.2 PARTICULARITIES OF THE ROMANIAN LANGUAGE

Romanian is an Eastern Romance language that took shape far from its Western sisters. Elements of the popular Latin from which it evolved are better preserved in this geographically isolated language: the Latin morpho-syntactic structure was inherited, including features that other Romance languages have lost (such as declensions); some morphological elements were reinforced (the reflexive); and non-Romance elements were adopted (the vocative in -o).

Most of the Romanian vocabulary is of Latin origin, either inherited from Vulgar Latin or

borrowed through learned channels in the modern era. 60% of the core vocabulary (the words known and commonly used by all speakers) is inherited from Latin.

During the Roman colonization of Dacia (106–271 AD), the colonists imposed Latin as the official language. Nevertheless, comparative studies of Romanian and Albanian vocabulary attest to approximately 100 words preserved from the Thraco-Dacian substratum. These words denote fundamental concepts such as body parts, natural elements and food. They are still in use today, are very frequent, and have developed polysemy and rich word families.

During the migration of the Slavic tribes onto the territory of present-day Romania, Romanian underwent transformation at all levels: phonetics, vocabulary, morphology and syntax. Even so, morphology, which gives a language its essence, remained Latin in most respects. The Cyrillic alphabet was adopted in this period, mainly because of the influence of the church. Church Slavonic was the language of religious service in the Orthodox Church until the 18th century, when Romanian began a process of re-Latinization, modernization and Westernization. At that time, many words of other origins were replaced by Latin words, borrowed directly or indirectly through other Romance languages (French and Italian). French, as the language of culture over the last two centuries, and France, as the country where the Romanian aristocracy sent its children to study, account for the extremely large number of words of French origin in Romanian. More recently, English has taken the place of French, and Romanian has many Anglicisms, adapted fully, partially or not at all to its phonetic and morphological system.

Political, economic and social aspects of the history of the Romanian people explain the existence of words of various origins: Turkish, Greek, German, Hungarian, Bulgarian, Russian etc. New words have been created in Romanian mainly by suffixation, although recent studies reflect the growing importance that prefixation has acquired lately (more information in [13]).

Romanian has five letters with diacritics: ă, â, î, ș, ț. Two variants of the last two have been in circulation, one with a comma below the letter and one with a cedilla, but only the first variant is recommended today by the Romanian Standards Association (ASRO).

Many electronic texts are written without diacritics, but programs have been created to restore the diacritics automatically in such texts.

Romanian has a number of specific characteristics that contribute to the richness of the language but can also be a challenge for the computational processing of natural language. The inflectional system of Romanian is quite rich. Nouns, pronouns and adjectives have five cases and two numbers. Pronouns can have stressed or unstressed (clitic) forms, and nouns and adjectives can appear with or without the definite article. Verbs have two numbers, singular and plural, each with three persons, and five synthetic tenses plus the infinitive, gerund and participle. On average, a noun can have five forms, a personal pronoun six, an adjective six, and a verb more than thirty. Apart from morphological suffixes and endings, word inflection also involves phonetic alternations inside the root.

Romanian is a language with a rich inflectional system and various linguistic particularities:

it allows subject ellipsis and clitic doubling, and it allows negative concord and double negation.

Romanian is a language that allows the pronominal subject to remain unexpressed, as do most Romance languages:

Știe. ('He/She knows.')

The explanation lies in the rich inflectional system of verbs, which have different endings for different persons and numbers. Nevertheless, subject doubling is also possible in Romanian, when a personal pronoun doubles a lexical noun phrase:

Vine el tata imediat! ('Father, he is coming right away!')

The structure is characteristic of colloquial language and marks a particular illocutionary attitude of the speaker: threat, promise, verbal reassurance.

Romanian shares with certain Spanish dialects and a few Balkan languages a structure known as clitic doubling. Pronominal clitic doubling in Romanian uses unstressed dative pronouns, accusative pronouns, or both. For example, in the sentence

I_i l_j-am dat mamei_i pe Ion_j la telefon. ('I put Ion on the phone for mother.')

the noun mamei and the dative clitic i refer to the same person, and the accusative clitic l- and the accusative noun Ion are likewise coreferential. The presence of clitics in such constructions is obligatory, although they do not fill verbal valencies. When the nouns are absent, however, the pronouns take over the task of saturating the verbal valencies:

I l-am dat la telefon. ('I put him on the phone for her.')

Doubling is obligatory for proper names and for nouns with the definite article that function syntactically as direct or indirect objects.

Some linguistic characteristics of Romanian are real challenges for computational processing.

Romanian exhibits both negative concord (where the presence of one or more negative words requires a negative marker), common to several Romance languages such as Portuguese, Spanish or French, and double negation (similar to logical double negation, where two negations are equivalent to an affirmation), which some languages, such as English, accept only for certain stylistic effects. An example of negative concord is:

Nu am văzut pe nimeni niciodată aici. ('I have never seen anybody here.')

where the presence of the negative marker nu in the verb phrase makes the whole sentence negative and reinforces the negative words in it.

However, certain configurations of negative markers and negative words must be interpreted as double negation (that is, despite the negative form of the predicate verb, the utterance has affirmative content). For example, a negative main clause followed by a subordinate clause whose verb is in the negative form of the subjunctive is such a configuration with affirmative meaning:

Maria nu a vrut să nu spună nimic. ('Maria did not want to say nothing.')

is equivalent to:

Maria a vrut să spună ceva. ('Maria wanted to say something.')

Case is synthetic in Romanian: the noun changes its form to express case. Nevertheless, there are also three prepositions that mark case: pe for the accusative (conditioned by the animate, definite and specific features of the noun phrase), la for the dative and a for the genitive (both conditioned by the presence of a numeral in the noun phrase):

L-am văzut pe colegul meu. ('I saw my colleague.')
Am dat cărțile la trei dintre copii. ('I gave the books to three of the children.')
Cărțile a trei copii erau noi. ('The books of three children were new.')

3.3 RECENT DEVELOPMENTS

Similar to the re-Latinization process of the 19th century, after the liberation from Greek and Ottoman domination, Romanian has, over the last twenty years, moved from totalitarian language ("wooden language", one-way discourse etc.) to open usage, in which new linguistic patterns have to adapt to social and cultural transition. Like many other languages, Romanian is thus going through a continuous process of internationalization under the influence of Anglo-Saxon vocabulary.

In key areas such as political, administrative and economic sciences, the press, advertising, computer science etc., numerous words have been borrowed, or existing words have acquired new meanings, following the English model; the terminologies of new fields rely on borrowings from English, and the active vocabulary of educated people contains more and more Anglicisms; new intonation patterns can be observed, as well as a tendency to use the more familiar second person singular instead of the more formal second person plural.

In certain areas, Anglicisms have begun to replace Romanian vocabulary. One example is the use of English titles in job advertisements, especially for management positions, e.g., Human Resource Manager instead of Director de Resurse Umane. A strong tendency to overuse Anglicisms can be seen in advertising. Banks in Romania use promotional slogans such as "Cu cine faci banking?" ('Who do you do your banking with?') or "Prima modalitate de plată contactless" ('The first contactless payment method'), although banking and contactless are Anglicisms that have not entered the common vocabulary and with which most Romanians are not familiar.

The example above shows how important it is to raise the alarm about a development that risks excluding from the information society a large part of the population that is not familiar with English.

3.4 LANGUAGE CULTIVATION IN ROMANIA

The Romanian Academy, the country's highest cultural forum, counts the cultivation of the national language among its main objectives. The major goal of its linguistic institutes, the "Iorgu Iordan – Al. Rosetti" Institute of Linguistics in Bucharest, the "A. Philippide" Institute of Romanian Philology in Iași and the "Sextil Pușcariu" Institute of Linguistics and Literary History in Cluj-Napoca, was the creation and publication of the Thesaurus Dictionary of the Romanian Language (Dicționarul Tezaur al Limbii Române), a process that took almost a century. The older series, known as Dicționarul Academiei (DA), comprises 5 volumes with 3,146 pages and 44,890 lexical entries and was produced between 1913 and 1947. After an interruption, work resumed in the mid-1960s on a new series, known as Dicționarul Limbii Române (DLR). The last volume was published by Editura Academiei at the beginning of 2009. In total, DA and DLR comprise 33 volumes, more than 15,000 pages and around 175,000 entries. The dictionary was created in the traditional manner, with pencil and paper, using quotations collected from more than 2,500 volumes of written Romanian literature.

The "Iorgu Iordan – Al. Rosetti" Institute of Linguistics runs a research program aimed at language cultivation and produces normative dictionaries (Dicționar ortografic, ortoepic și morfologic al limbii române, Dicționarul împrumuturilor neadaptate, Dicționarul termenilor oficiali) and grammars (Gramatica limbii române, Dinamica limbii române actuale).

The "A. Philippide" Institute of Romanian Philology in Iași, through its specialized departments, carries out projects fundamental to Romanian culture in the fields of lexicography, dialectology, toponymy, ethnography and folklore. The Iași institute collaborated with the linguistic institutes in Bucharest and Cluj-Napoca on the creation and publication of the regional linguistic atlas (Atlasul lingvistic pe regiuni), a work of major importance for Romanian linguistics. On the basis of the regional atlases of Romania and of the Moldovan linguistic atlas, Atlasul lingvistic român pe regiuni. Sinteză is being compiled at the "Iorgu Iordan – Al. Rosetti" Institute of Linguistics.

Two other institutes concerned with the cultivation of the Romanian language operate within the Romanian Academy: the "G. Călinescu" Institute of Literary History and Theory and the "C. Brăiloiu" Institute of Ethnography and Folklore. The research directions of the "G. Călinescu" Institute include producing encyclopedias and fundamental works of synthesis in literary history and theory, preserving and developing the national literary heritage, and defining national cultural identity in a European context. The "Constantin Brăiloiu" Institute of Ethnography and Folklore is a multidisciplinary research body whose main task is to produce fundamental and advanced studies of traditional and contemporary, rural and urban folk culture, in the fields of folklore studies (literary folklore), ethnomusicology, ethnography and multimedia, non-conventional folklore archives.

Important works on Romanian etymology, studies of the old biblical language (such as Monumenta linguae Dacoromanorum – Biblia 1688), and indexes of the works of major writers (such as Eminescu's works) have been produced at the Alexandru Ioan Cuza University of Iași.

Law 500 of 12 November 2004 requires all written or oral texts in Romanian that serve the public interest to comply with academic norms.

The Romanian Language Institute (Institutul Limbii Române) was created to promote the learning of Romanian abroad, to support learners of Romanian, and to certify their knowledge of Romanian [14]. Abroad, there are more than 70 centers where Romanian is taught as a foreign language by teaching staff from Romanian universities.

Foreigners show a growing interest in studying Romanian: at the diplomatic level (representatives of the diplomatic missions of various countries), in business, and among tourists. Besides universities, which offer courses in Romanian as a foreign language, usually for foreign students in Romania, there are also numerous private companies whose offerings mainly target foreigners working in the economic sector. Summer courses in Romanian language and civilization for all levels are organized every year in various places around the country by the Romanian Cultural Foundation, as well as by several higher education institutions (such as the Alexandru Ioan Cuza University of Iași or the University of Bucharest).

Cultivating the language in a context of accelerated renewal is also a priority for the media. National radio and television channels broadcast programs in which more complicated aspects of the language are discussed with specialists and explained to the public.

3.5 LANGUAGE IN EDUCATION

According to the new national curriculum (2000), Romanian is taught for 4 to 5 compulsory hours per week in lower secondary school and 3 to 4 hours in high school. The prescriptive aspects

of language preservation are combined with communication, a competence-oriented behavior, with emphasis on the relationship between language and culture. Romanian language and literature is a compulsory subject in national examinations (at the end of lower secondary school and high school; the baccalaureate includes two Romanian tests, one oral and one written). Romanian language and literature can be studied as a major or minor specialization at more than 30 public and private universities in Romania.

3.6 INTERNATIONAL ASPECTS

Romania is internationally recognized for its literature; the major works of Eminescu (Romania's great national poet) have been translated into more than 60 languages. Other well-known names in Romanian literature are Mircea Eliade, the first historian to write a history of religions, Eugen Ionesco, one of the promoters of the Theatre of the Absurd, and Emil Cioran, known for his philosophy. A number of contemporary writers are also now being translated into foreign languages: Mircea Cărtărescu, Filip Florian, Radu Aldulescu etc.

At present, driven by the need for international dissemination, a large share of scientific publications in the LT field are written in English, including those devoted to LT research for Romanian, such as the proceedings of the conference organized by the Consortium for the Informatization of the Romanian Language (Consorțiul de Informatizare pentru Limba Română). The preferential use of English for communicating research results is characteristic of most fields of science and is less prominent in disciplines such as philosophy, linguistics, theology or law.

The Consortium for the Informatization of the Romanian Language (ConsILR) organizes an annual international conference devoted to research in language technology for Romanian.

The same situation is found in the business world. In many large international companies, English has become the lingua franca, both in written communication (e-mail and documents) and in oral communication, especially in multinational companies with foreign managers.

Language technology can address this challenge from a different angle by offering services such as machine translation or multilingual information retrieval over texts written in various foreign languages, thereby helping to reduce the personal and economic disadvantages faced by speakers without advanced knowledge of English.

Romanian minorities live in the neighboring countries and in the diaspora all over the world. Romania promotes policies for preserving the linguistic and cultural identity of Romanian communities. The Eudoxiu Hurmuzachi Center offers hundreds of scholarships in Romania every year for the Romanian minorities in neighboring countries. There are many school and academic exchanges, especially with the Republic of Moldova. The first franchise-style extensions of Romanian schools and universities appeared in the Republic of Moldova in 2000. There are various initiatives in diaspora communities through which interested people can study Romanian language and culture. For example, the Romanian language school in Kitchener (Canada) offers classes in Romanian language and culture for children and adolescents. Romanian Cultural Institutes exist in 19 cities around the world (including Bucharest, New York, Paris, London, Rome, Istanbul etc.), and all of them treat the promotion of Romanian language and civilization, through courses and cultural events of various kinds, as an important concern.

3.7 ROMANIAN ON THE INTERNET

The Internet market in Romania is growing continuously. In 2010, 44.2% of Romanians had access to a

computer at home, and 35.5% (i.e., 7,786,700 Romanians) were Internet users [15] (approximately 60% of them daily users), which places Romania 8th in a top 10 of Internet users in Europe [16]. More than 500,000 websites are registered under the .ro domain. Comparing these figures with those from 2000, when only 3.6% of the population (800,000 Romanians) used the Internet, we see an almost tenfold increase.

A 2007 study by the Latin Union [17] shows that, in line with the trend for the other neo-Latin languages, the presence of Romanian on the Internet grew between 1998 and 2007. The vigor of each language (that is, the presence of the studied languages in virtual space) was calculated by dividing the percentage of web pages in each language by the relative percentage of its speakers in the real world. Although this coefficient is considered low for Romanian (0.6 in 2007, compared with 4.44 for English, 2.24 for French and 2.93 for Italian), Romanian is the only language whose coefficient grew in the period 2005–2007 (before accession to the European Union).

The growing importance of the Internet is critical for language technology. The large amount of digital language data is a key resource for analyzing how natural language is used, in particular for collecting statistical information about linguistic patterns. And the Internet offers a wide range of application areas for language technology.

The most frequently used operation on the web is search, which involves automatic language processing at several levels, as we will show later. Internet search involves sophisticated language technologies that differ from one language to another. For Romanian, one example is the normalization of diacritics, but there are many others, which are detailed in the next section.

Web users and content providers can also use language technology in less obvious ways, for example by automatically translating the content of web pages from one language into another. Although manually translating web content is costly, relatively few language technologies have been developed and applied to the problem of website translation. This may be due to the complexity of the Romanian language, but also to the range of different technologies involved.

The next chapter offers a brief presentation of language technologies and core applications, together with an evaluation of the support currently available for language technology for Romanian.
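As a concrete illustration of the diacritic normalization mentioned above for search, the following Python sketch maps the cedilla variants of ș and ț to the comma-below forms and, as a second step, strips diacritics so that a query typed without them still matches. The function names (`normalize_diacritics`, `strip_diacritics`, `search_key`) and the sample word are assumptions made for illustration; actual search engines may handle this differently.

```python
import unicodedata

# Legacy cedilla code points -> recommended comma-below code points.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla  -> S with comma below
    "\u015F": "\u0219",  # s with cedilla  -> s with comma below
    "\u0162": "\u021A",  # T with cedilla  -> T with comma below
    "\u0163": "\u021B",  # t with cedilla  -> t with comma below
})

def normalize_diacritics(text: str) -> str:
    """Replace s/t with cedilla by s/t with comma below."""
    return text.translate(CEDILLA_TO_COMMA)

def strip_diacritics(text: str) -> str:
    """Remove all combining marks, for diacritic-insensitive matching."""
    decomposed = unicodedata.normalize("NFD", normalize_diacritics(text))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_key(text: str) -> str:
    """Hypothetical index key: lower-cased text without diacritics."""
    return strip_diacritics(text).lower()

cedilla_form = "\u015Ftiin\u0163\u0103"   # the word for 'science', typed with cedillas
comma_form = "\u0219tiin\u021B\u0103"     # the same word with comma-below letters
print(normalize_diacritics(cedilla_form) == comma_form)
print(search_key(cedilla_form), search_key("stiinta"))
```

With keys built this way, the cedilla spelling, the comma-below spelling and the spelling without diacritics all map to the same index entry, at the cost of merging words that differ only in their diacritics.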

4 LANGUAGE TECHNOLOGY SUPPORT FOR ROMANIAN

Language technologies are information technologies specialized for dealing with human language, whether spoken or written. While speech is the oldest and most natural mode of human communication, complex information and most of human knowledge is stored and transmitted in written texts. Speech and text technologies process and produce language in these two modes of realization. But speech and writing share many aspects, such as the lexicon, most of the grammar, and semantics. For this reason, a large part of language technology cannot be subsumed under either speech technology or text technology. Among these are the technologies that link language to knowledge. Figure 1 illustrates the language technology landscape.

When communicating, people combine language with other modes of communication and other information media. We combine speech with gestures and facial expressions. Electronic texts are combined with images and sounds. Films can contain language in both written and spoken form. Speech and text technologies therefore overlap and interact with many other technologies that facilitate multimodal communication and the processing of multimedia documents.

In what follows, we discuss the main application areas of language technology, such as language checking, web search, speech technologies and machine translation. These include applications and basic technologies such as:

- grammar checking
- authoring support systems
- computer-assisted language learning
- information retrieval
- information extraction
- text summarization
- question answering
- speech recognition
- speech synthesis.

Language technology is a research field in its own right, with an extensive specialist literature. The interested reader is invited to consult the field's fundamental textbooks, such as [18, 19, 20, 21, 22].

Before describing the application areas listed above, we briefly present the classic architecture of systems based on language technology.

4.1 APPLICATION ARCHITECTURES IN LANGUAGE TECHNOLOGY

Typical software applications for language processing consist of several components that reflect different aspects of the language and of the task they implement.
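The idea that a language processing application is assembled from several components can be sketched as a simple pipeline. The Python snippet below is a toy illustration under stated assumptions: the `Document` structure, the stage names (`preprocess`, `tokenize`, `count_words`) and the placeholder logic inside each stage are hypothetical and far simpler than any real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Document:
    """Carries the text and whatever each component adds to it."""
    text: str
    tokens: List[str] = field(default_factory=list)
    annotations: dict = field(default_factory=dict)

Stage = Callable[[Document], Document]

def preprocess(doc: Document) -> Document:
    # Clean-up placeholder: collapse runs of whitespace.
    doc.text = " ".join(doc.text.split())
    return doc

def tokenize(doc: Document) -> Document:
    # Naive tokenizer: split on spaces, strip surrounding punctuation.
    stripped = (t.strip(".,;:!?") for t in doc.text.split())
    doc.tokens = [t for t in stripped if t]
    return doc

def count_words(doc: Document) -> Document:
    # Task-specific placeholder module: report the number of tokens.
    doc.annotations["n_tokens"] = len(doc.tokens)
    return doc

def run_pipeline(text: str, stages: List[Stage]) -> Document:
    """Pass the document through each component in order."""
    doc = Document(text)
    for stage in stages:
        doc = stage(doc)
    return doc

result = run_pipeline("  Plouă   întruna de ieri. ", [preprocess, tokenize, count_words])
print(result.tokens, result.annotations)
```

Because every stage has the same signature, a component can be replaced or a new one inserted without changing the others, which is the main reason such systems are usually built as chains of modules.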

[Figure 1: Language technologies. Labels in the diagram: speech technologies; text technologies; multimedia & multimodal technologies; knowledge technologies; language technologies.]

Figure 2 shows the highly simplified architecture of a text processing system. The first three modules deal with the structure and meaning of the analyzed text:

1. Preprocessing: cleaning the data, removing formatting, identifying the language of the analyzed text, replacing incorrect diacritics with the recommended ones (for example, replacing a letter written with a cedilla by its comma-below counterpart).
2. Grammatical analysis: finding the verb and its arguments, modifiers etc.; recognizing the sentence structure.
3. Semantic analysis: disambiguation (in which sense are the words used in context?), resolution of anaphora and of referring expressions such as ea ('she'), mașina ('the car') etc.; representing the meaning of a sentence in a machine-readable way.

After the text has been analyzed, task-specific modules can perform various operations, such as automatic summarization of a text, database look-ups and many others.

Below, we illustrate the main application areas and highlight certain modules of the different architectures in each section. The architectures are highly simplified and idealized; they serve to illustrate the complexity of language technology applications in a generally understandable way. After introducing the main application areas, we give a brief overview of the situation in LT research and education, concluding with a list of funding programs. At the end of this section, we present an expert assessment of the main language technology tools and resources, based on criteria such as availability, maturity or quality. The overall situation for Romanian is presented as a table (Figure 8) on page 35, at the end of this chapter. The tools and resources set in bold in the text are listed in this table. Finally, Romanian is compared, in terms of language technology support, with the other European languages for which similar studies have been produced.

4.2 MAIN APPLICATION AREAS

This section focuses on the most important LT tools and resources and gives an overview of LT activities in Romania and the Republic of Moldova.

4.2.1 Language Checking

Anyone who has used a word-processing tool such as Microsoft Word has encountered a component

19
Text intrare Ieire

Preprocesare Analiz gramatical Analiz semantic Module specifice


problemei

2: Arhitectura tipic a aplicaiilor de procesare a textelor

that checks spelling, identifies typing errors and suggests corrections. The first spell checkers compared the list of words extracted from the text with a dictionary of correctly spelled words. Today, these programs are much more sophisticated. Using language-dependent algorithms for text analysis, spell checkers can now detect errors of morphology (for example, incorrect plural forms) and of syntax, such as a missing verb or a verb that does not agree with its subject in number and person (for example, ei *scrie o scrisoare, "they *writes a letter"). Nevertheless, most available spell checkers will find no error in the following example:

    Neam cumpărat un calculator care sa defectat dea doua zi: supt multe cuvinte se pune o linie roșă care nu pot cum sos cot.

A spell checker will probably only be able to correct the form roșă (an archaic form) to roșie (red). The other errors (neam, sa, dea, supt, sos, cot) require the context to be interpreted, because all these words belong to the Romanian language but are out of place in their respective contexts. In addition, style errors such as the anacoluthon in the last subordinate clause (the clause begins in a way that suggests a certain completion and continues with an abrupt change in its logical construction) can only be identified with in-depth knowledge of syntactic structures.
Correcting such errors often requires an analysis of the context, for example to decide whether a word must be written with or without a hyphen in Romanian, as in:

    Plouă întruna de ieri. (It has been raining continuously since yesterday.)
    Într-una din zile am să merg la Paris. (One of these days I will go to Paris.)

This requires either the formulation of language-specific grammars or grammar rules, encoded in software by experts, or the use of so-called statistical language models (see Figure 3). These models compute the probability that a word occurs in a particular context (i.e., the words before and after it). For example, într-una din zile is a much more probable word sequence than întruna din zile, and plouă întruna is more frequent than plouă într-una. A statistical language model can be created automatically from a large amount of (correct) language data, called a text corpus. Even so, there are cases where not even this helps:

    Plouă întruna din primele zile ale lui martie. (It has been raining continuously since the first days of March.)
    Ploua într-una din primele zile ale lui martie. (It was raining on one of the first days of March.)

In the pair of sentences about rain in March, the only discriminating element is the verb. In the first sentence it is in the present tense and has a durative sense; in the second it is in the past tense. Only morpho-syntactic annotation can discriminate in such cases.
So far, these approaches have mostly been developed and applied on English data. They cannot be transferred directly to Romanian, which has a richer morphology and language-specific constructions.

Figure 3: General architecture of a language checker (top: statistical, bottom: rule-based)
(Labels: input text, spell checking, grammar checking, correction proposals, statistical language model)

Language checking is not limited to word processing tools; it is also found in authoring support systems, the software platforms in which manuals and other types of technical documentation are written. As the number of technical products has grown, the amount of technical documentation has increased dramatically in recent decades. To avoid customer complaints about incorrect use and damage claims resulting from wrong or misunderstood instructions, companies have begun to focus more and more on the quality of their technical documentation, while at the same time targeting international markets (through translation or localization). Advances in natural language processing have led to the development of authoring support software, which helps the writer of technical documentation use vocabulary and syntactic structures that comply with certain industry rules and (corporate) terminology restrictions.
Today there are no Romanian companies or language service providers offering such products, although researchers in various natural language processing groups have developed language models tailored to the particularities of Romanian. At the Research Institute for Artificial Intelligence of the Romanian Academy (RACAI), language models for Romanian have been built from large corpora. Since most texts on the web are written without diacritics, RACAI has also developed a diacritics restoration application [23], which aims to indicate the correct diacritics of a word originally written without them. This application uses a large Romanian lexicon developed at the Institute and a 5-character window model to find the most probable diacritic interpretation of an unknown word. The method takes the context of a word into account during preprocessing through morpho-syntactic annotation, which is essential for choosing the correct word from the lexicon. For example, the word peste is changed to pește (fish) in:

    Am cumpărat peste. (I bought fish.)

but is kept as peste (over) in

    Era un pod peste râu. (There was a bridge over the river.)

This decision relies on an earlier morpho-syntactic annotation step, in which peste in the first
sentence is annotated with a noun tag, while the same word in the second sentence is annotated with a preposition tag.
In Romanian, at least 30% of the words in a sentence carry diacritics, with an average of 1.16 diacritical marks per word. Only about 12% of these words can be converted to their diacritic form immediately (because the form without diacritics is not a valid word in the Romanian dictionary). For the remaining words, the diacritics restoration program is needed.
Another important effort comes from the Institute of Mathematics and Computer Science of the Academy of Sciences of the Republic of Moldova, which has developed a collection of reusable linguistic resources for Romanian comprising about 1,000,000 inflected forms, with morphological information, definitions, synonyms, and Romanian-Russian and Romanian-English translations [24].
Besides language checkers and authoring support systems, language checking is also important in computer-assisted language learning, and it is used to automatically correct queries entered into web search engines: see Google's Ați vrut să scrieți (Did you mean) suggestions.

4.2.2 Web Search
Search on the web, in intranets or in digital libraries is probably the most widely used and yet the most underdeveloped language technology application today. The Google search engine, launched in 1998, handles 80% of all search queries worldwide [25]. Neither the search interface nor the presentation of the search results has changed significantly since the first version. In its current version, Google offers spelling correction for misspelled words and incorporates basic semantic search capabilities, which can improve search accuracy by analyzing the meaning of the query terms in context [26]. Google's success story shows that, given a huge amount of data and efficient techniques for indexing it, a mainly statistical approach can lead to satisfactory results.
However, for more sophisticated information requests it is essential to integrate deeper linguistic knowledge into text interpretation systems. Experiments using lexical resources such as machine-readable thesauri or ontological resources (for example WordNet or its Romanian equivalent, Romanian WordNet [27]) have shown that search results improve when synonyms of the search terms are used, for example energie atomică (atomic energy) and energie nucleară (nuclear energy), or even more loosely related terms.
The next generation of search engines will have to include much more sophisticated language technology, especially for queries that consist of a question or another type of sentence rather than a list of keywords. For example, for the query Dă-mi o listă cu toate companiile care au fost preluate de alte companii în ultimii cinci ani (Give me a list of all companies that were taken over by other companies in the last five years), LT-based systems must analyze the sentence both syntactically and semantically. The system must also have an index in order to retrieve the relevant documents quickly. A satisfactory answer requires syntactic analysis to identify the grammatical structure of the sentence, as well as semantic analysis (also called text interpretation) to establish that the user wants to find companies that were taken over, and not companies that took over other companies. Likewise, the expression în ultimii cinci ani (in the last five years) must be processed to determine which years it refers to,
taking the current year into account. Finally, the query must be matched against a huge amount of unstructured data in order to find the information the user is looking for. This process is known as information retrieval and involves searching for and ranking the relevant documents. In addition, generating a list of companies means the system must also recognize that a particular string of words in a document refers to a company name, a process called named entity recognition.
Even more demanding is the attempt to match a query against documents written in another language. For cross-lingual information retrieval, the query has to be translated automatically into all possible source languages and the retrieved information translated back into the original language.
The unprecedented growth in the amount of data available in non-textual formats calls for services that enable multimedia information retrieval, that is, search over image, audio and video data. For audio and video files, this involves a speech recognition module that converts the speech content into text or into a phonetic representation, which is then matched against the query.

Figure 4: Web search
(Labels: web pages, preprocessing, semantic processing, indexing, query, preprocessing, query analysis, matching & relevance, search results)

In Romania, natural-language-based search technologies are not yet targeted by industrial applications. Instead, open source technologies such as Lucene are often used by search-focused companies to provide their basic search infrastructure. However, the research groups at the Alexandru Ioan Cuza University of Iași (UAIC) and at RACAI have developed various modules that form the core of a semantic search tool, such as morpho-syntactic analyzers, syntactic parsers, semantic analyzers, named entity recognizers, indexing tools, multimedia information retrieval programs, etc. Their coverage and efficiency are, however, still rather limited.
At RACAI, for instance, a morpho-syntactic analyzer that identifies the dictionary form and the part of speech of the words in a text is available as a web service [28]. For example, if a user's web search query
contains the word evenimente (events), the root form (eveniment) can be used to perform the search [29].
Another module developed by researchers at UAIC and RACAI is a named entity recognizer, which can recognize names of persons, companies, organizations, events, etc. in texts. For example, for the sentence

    Maria și-a luat bilet la concertul trupei din vară de la Paris. (Maria bought a ticket for the band's summer concert in Paris.)

this system recognizes Maria as a person name, din vară (in the summer) as a temporal reference, and Paris as a place name.
A semantic analyzer developed at UAIC [30], available for Romanian, can identify the semantic roles played by different entities in a sentence. For example, for the sentence above, the system identifies Maria as the person performing the action and bilet la concertul trupei (a ticket for the band's concert) as the object that was bought. Similarly, in the example

    Maria și-a luat fără ezitare bilet pentru a-și vedea trupa preferată. (Maria bought a ticket without hesitation in order to see her favorite band.)

the system recognizes fără ezitare (without hesitation) as the manner in which the action was performed, and pentru a-și vedea trupa preferată (in order to see her favorite band) as the purpose for which the ticket was bought. This system was developed on the basis of a corpus annotated with semantic roles [31], created in an attempt to align Romanian with the semantic resources that exist for English.
Recently, a group of researchers at UAIC has started work on the automatic detection and annotation of images, with a view to developing a search tool for image collections [32]. The system is still at an early stage.

4.2.3 Speech Interaction
Voice-based interaction is the subject of a subfield of language technology: spoken language processing. Speech interaction technologies are the starting point for creating interfaces that allow users to interact with machines using spoken language rather than, for example, a graphical interface, a keyboard or a mouse. Today, voice user interfaces (VUI, Voice
User Interface) are used for fully or partially automated services that companies provide by telephone to customers, employees or partners. Business domains that rely heavily on voice user interfaces include banking, logistics, public transport and telecommunications. Other uses of speech interaction technology are the interfaces of car navigation systems and the use of spoken language as an alternative to graphical interfaces or touchscreens on smartphones.
Speech interaction comprises the following four technologies:

1. Automatic speech recognition, responsible for identifying the words in a sequence of sounds uttered by the user.
2. Analysis of the syntactic structure of the user's utterance and its interpretation according to the purposes of the system in which the technology is integrated.
3. Dialogue management, needed to determine the action to be taken given the user's request and the functionality of the system.
4. Speech synthesis (Text-to-Speech, TTS), used to transform the words of a text into sounds.

Figure 5: Spoken dialogue system
(Labels: speech input, signal processing, recognition, natural language understanding & dialogue, phonetics & intonation planning, speech synthesis, speech output)

One of the major challenges is to build an automatic speech recognition system that recognizes the words uttered by the user as precisely as possible. This requires either restricting the range of possible utterances to a limited set of keywords, or manually creating language models that cover a wide range of natural language utterances. Using machine learning techniques, language models can also be generated automatically from speech corpora, that is, large collections of audio files and their text transcriptions. Restricting utterances usually forces people to use voice interfaces in a rigid way and can reduce user acceptance; on the other hand, creating, tuning and maintaining language models that are as accurate as possible increases costs significantly. Interfaces that use language models and let users express their intent more flexibly (for example, by greeting the user with Cu ce vă pot ajuta?, "How can I help you?") tend to be better accepted by users.
For the output component of a voice user interface,
companies tend to use pre-recorded utterances of professional speakers. For static utterances, whose wording does not depend on the particular context of use or on the personal data of a given user, this solution leads to a pleasant user experience.
However, the more dynamic the content of an utterance, the more the user experience suffers, because the speech is produced simply by concatenating several audio files containing syllables and/or words. Current speech synthesis systems that use various optimization techniques prove superior with respect to the prosodic naturalness of dynamic utterances.
On the speech interaction market, the last decade has brought strong standardization of the interfaces between different technology components. There has also been strong market consolidation over the last ten years, particularly in automatic speech recognition and speech synthesis. The national markets of the G20 countries (that is, economically strong countries with a considerable population) are dominated by only 5 global players, with Nuance (USA) and Loquendo (Italy) being the most prominent in Europe. In 2011, Nuance announced the acquisition of Loquendo, an essential step in market consolidation.
Speech recognition and analysis is one of the least represented fields in Romania. On the Romanian speech synthesis market there are solutions sold by international companies (such as MBROLA or IVONA), but their results show reduced accuracy and fluency. Automotive equipment and telecommunications companies, such as Continental and Orange, have recently begun to allocate resources to departments specialized in speech processing, adapting existing solutions to their specific needs. On the other hand, research in this direction is carried out at the Politehnica University of Bucharest and at the Institute of Theoretical Computer Science of the Romanian Academy, Iași Branch. Most researchers focus on speech synthesis, while speech interpretation applications are less developed.
Looking beyond the current state of the technology, we expect significant changes due to the spread of smartphones as a new platform for managing customer relationships, alongside older channels such as telephone, Internet and email. This trend will also affect the way speech interaction technology is used. On the one hand, demand for telephony-based voice interfaces will decrease in the long run. On the other hand, spoken language as a convenient way of interacting with smartphones will gain significant importance. This trend is supported by the evident progress in the accuracy of speaker-independent speech recognition in dictation services, which are already offered as centralized services to smartphone users.

4.2.4 Machine Translation
The idea of using computers for translation came from A. D. Booth in 1946 and was followed by substantial funding for research in this area between 1950 and 1980. Nevertheless, machine translation (MT) has still not met the high expectations set in the early years of the field.
In its simplest form, MT simply replaces the words of one language with their equivalents in another language. This can be useful in domains with very restricted, formalized language, such as
weather reports. However, for a good translation of less standardized texts, larger text units (phrases, sentences or even whole passages) have to be matched to their closest counterparts in the target language. The major difficulty here is that human language is ambiguous, which raises challenges on several levels, for example word sense disambiguation at the lexical level (Jaguar can mean either a car or an animal) or the correct attachment of prepositional phrases at the syntactic level, as in:

    Polițistul a văzut omul cu telescopul. (The policeman saw the man with the telescope.)
    Polițistul a văzut omul cu arma. (The policeman saw the man with the gun.)

One way of approaching machine translation is based on linguistic rules. For translations between closely related languages, a direct translation may be feasible in cases like the examples above. More often, however, rule-based (or linguistic knowledge-driven) systems analyze the input text and create an intermediate, symbolic representation from which the target-language text is generated. The success of these methods depends to a large extent on the availability of extensive lexicons with morphological, syntactic and semantic information, and on large sets of grammar rules carefully designed by skilled linguists. This is a very long and costly process.
From the late 1980s, as computing power grew and became cheaper, statistical models for MT began to attract interest. The parameters of these statistical models are derived from the analysis of bilingual text corpora, called parallel corpora, such as the Europarl corpus, which contains the proceedings of the European Parliament in 21 European languages. Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign-language text by processing parallel versions and identifying plausible word patterns. However, unlike knowledge-driven systems, statistical (or data-driven) MT systems often generate ungrammatical output. The advantages of statistical MT systems are that they require little human effort and that they can cover particularities of the language, such as idiomatic expressions, which are usually not handled by knowledge-driven systems.
Because the strengths and weaknesses of data-driven and knowledge-driven MT systems are complementary, researchers nowadays almost unanimously use hybrid approaches that combine the two methodologies. Such a system uses both knowledge-driven and data-driven MT systems, together with a selection module that decides, for each sentence, which of the two translations is better. However, for sentences longer than, say, 12 words, neither translation will be perfect. A better solution is to combine, for each sentence, the sequences translated correctly by the different systems. This is a rather complex task, since it is not always obvious which parts of the alternatives correspond to each other, so an alignment is needed.
The quality of MT systems can still be improved considerably. The challenges include adapting language resources to different domains or users, and integrating the technology into existing platforms with translation memories or term bases. Another problem is that most current systems are English-centered, and translations from and into Romanian are not yet sufficiently accurate. This slows down the translation workflow and forces MT users to learn different lexicon coding tools for each system in order to improve the translations they get.

Figure 6: Machine translation (left: statistical, right: rule-based)
(Labels: source text, text analysis (formatting, morphology, syntax, etc.), statistical machine translation, translation rules, text generation, target text)

Evaluation campaigns are used to compare machine translation systems, different approaches and the status of different languages. Figure 7 (p. 29), presented within the European project Euromatrix+, shows the pairwise performance obtained for 22 of the 23 official languages of the European Union (Irish was not compared), measured with the BLEU score [33], where a higher score indicates a better translation. A human translator would achieve a score of around 80 points.
The best results (in green and blue) are achieved by languages that benefit from considerable research efforts in MT within coordinated programs and from the existence of substantial parallel corpora (e.g., English, French, Dutch, Spanish, German). The worst results (in red) are obtained by languages that do not benefit from similar efforts or that are structurally very different from other languages (e.g., Hungarian, Maltese, Finnish).
In the eyes of investors, machine translation is the most attractive field among language technologies. Companies such as Language Weaver work on translation from and into Romanian using various language technologies. The major online machine translation systems include Romanian both as a source and as a target language, but in most cases the translation is mediated through English. There is also a multitude of online dictionaries for Romanian.
Important research efforts have been, and continue to be, devoted to machine translation with Romanian as a source or target language. Results better than those of Google Translate have been reported for a data-driven translation experiment on the Romanian-English language pair [35]. At RACAI, different approaches have been experimented with for more than 5 years: example-based machine translation, statistical machine translation, extraction of translations from parallel corpora, etc.
Two PhD theses, accompanied by several scientific papers and supported by various national or international projects, such as STAR and ACCURAT, are dedicated to this field [36, 37].

4.3 OTHER APPLICATION AREAS
Building language technology applications involves a range of subtasks that do not always surface at the level of interaction with the user, but that provide significant functionality behind the scenes of the system. For this reason, they form important research areas that have become disciplines of computational linguistics in their own right.
Question Answering (QA) systems are an important research area, for which
Target language (columns); source language (rows)
     EN   BG   DE   CS   DA   EL   ES   ET   FI   FR   HU   IT   LT   LV   MT   NL   PL   PT   RO   SK   SL   SV
EN    - 40.5 46.8 52.6 50.0 41.0 55.2 34.8 38.6 50.1 37.2 50.4 39.6 43.4 39.8 52.3 49.2 55.0 49.0 44.7 50.7 52.0
BG 61.3    - 38.7 39.4 39.6 34.5 46.9 25.5 26.7 42.4 22.0 43.5 29.3 29.1 25.9 44.9 35.1 45.9 36.8 34.1 34.1 39.9
DE 53.6 26.3    - 35.4 43.1 32.8 47.1 26.7 29.5 39.4 27.6 42.7 27.6 30.3 19.8 50.2 30.2 44.1 30.7 29.4 31.4 41.2
CS 58.4 32.0 42.6    - 43.6 34.6 48.9 30.7 30.5 41.6 27.4 44.3 34.5 35.8 26.3 46.5 39.2 45.7 36.5 43.6 41.3 42.9
DA 57.6 28.7 44.1 35.7    - 34.3 47.5 27.8 31.6 41.3 24.2 43.8 29.7 32.9 21.1 48.5 34.3 45.4 33.9 33.0 36.2 47.2
EL 59.5 32.4 43.1 37.7 44.5    - 54.0 26.5 29.0 48.3 23.7 49.6 29.0 32.6 23.8 48.9 34.2 52.5 37.2 33.1 36.3 43.3
ES 60.0 31.1 42.7 37.5 44.4 39.4    - 25.4 28.5 51.3 24.0 51.7 26.8 30.5 24.6 48.8 33.9 57.3 38.1 31.7 33.9 43.7
ET 52.0 24.6 37.3 35.2 37.8 28.2 40.4    - 37.7 33.4 30.9 37.0 35.0 36.9 20.5 41.3 32.0 37.8 28.0 30.6 32.9 37.3
FI 49.3 23.2 36.0 32.0 37.9 27.2 39.7 34.9    - 29.5 27.2 36.6 30.5 32.5 19.4 40.6 28.8 37.5 26.5 27.3 28.2 37.6
FR 64.0 34.5 45.1 39.5 47.4 42.8 60.9 26.7 30.0    - 25.5 56.1 28.3 31.9 25.3 51.6 35.7 61.0 43.8 33.1 35.6 45.8
HU 48.0 24.7 34.3 30.0 33.0 25.5 34.1 29.6 29.4 30.7    - 33.5 29.6 31.9 18.1 36.1 29.8 34.2 25.7 25.6 28.2 30.5
IT 61.0 32.1 44.3 38.9 45.8 40.6 26.9 25.0 29.7 52.7 24.2    - 29.4 32.6 24.6 50.5 35.2 56.5 39.3 32.5 34.7 44.3
LT 51.8 27.6 33.9 37.0 36.8 26.5 21.1 34.2 32.0 34.4 28.5 36.8    - 40.1 22.2 38.1 31.6 31.6 29.3 31.8 35.3 35.3
LV 54.0 29.1 35.0 37.8 38.5 29.7  8.0 34.2 32.4 35.6 29.3 38.9 38.4    - 23.3 41.5 34.4 39.6 31.0 33.3 37.1 38.0
MT 72.1 32.2 37.2 37.9 38.9 33.7 48.7 26.9 25.8 42.4 22.4 43.7 30.2 33.2    - 44.0 37.1 45.9 38.9 35.8 40.0 41.6
NL 56.9 29.3 46.9 37.0 45.4 35.3 49.7 27.5 29.8 43.4 25.3 44.5 28.6 31.7 22.0    - 32.0 47.7 33.0 30.1 34.6 43.6
PL 60.8 31.5 40.2 44.2 42.1 34.2 46.2 29.2 29.0 40.0 24.5 43.2 33.2 35.6 27.9 44.8    - 44.1 38.2 38.2 39.8 42.1
PT 60.7 31.4 42.9 38.4 42.8 40.2 60.7 26.4 29.2 53.2 23.8 52.8 28.0 31.5 24.8 49.3 34.5    - 39.4 32.1 34.4 43.9
RO 60.8 33.1 38.5 37.8 40.3 35.6 50.4 24.6 26.2 46.5 25.0 44.8 28.4 29.9 28.7 43.0 35.8 48.5    - 31.5 35.1 39.4
SK 60.8 32.6 39.4 48.1 41.0 33.3 46.2 29.8 28.4 39.4 27.4 41.8 33.8 36.7 28.5 44.4 39.0 43.3 35.3    - 42.6 41.8
SL 61.0 33.1 37.9 43.5 42.6 34.0 47.0 31.1 28.8 38.2 25.7 42.3 34.6 37.3 30.0 45.9 38.2 44.1 35.8 38.9    - 42.7
SV 58.5 26.9 41.0 35.6 46.6 33.3 46.6 27.4 30.9 38.9 22.7 42.0 28.2 31.0 23.7 45.6 32.2 44.2 32.7 31.3 33.5    -

Figure 7: Machine translation between 22 EU languages (BLEU scores) [34]
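
The BLEU scores reported in Figure 7 compare machine output with reference translations. The sketch below shows the general idea on a single sentence pair: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration under stated assumptions (one reference, whitespace tokenization, no smoothing, hypothetical function names) and not the evaluation setup used for the figure, which scores whole test sets.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on a 0-100 scale.

    Returns 0.0 as soon as any n-gram order has no match (no smoothing).
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += math.log(matched / total)
    # The brevity penalty punishes candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(log_precision_sum / max_n)


REFERENCE = "the policeman saw the man with the telescope"
print(sentence_bleu(REFERENCE, REFERENCE))
print(sentence_bleu("the policeman saw the man", REFERENCE))
print(sentence_bleu("a dog barked loudly", REFERENCE))
```

In this sketch a candidate identical to the reference reaches the maximum score, a correct but truncated candidate is reduced by the brevity penalty, and a candidate sharing no words with the reference scores zero.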

annotated corpora have been built and scientific competitions have been launched. The idea is to move from keyword-based search (to which the system responds with a collection of potentially relevant documents) to a scenario in which the user asks a concrete question and the system provides a single answer. For example:

    Question: La ce vârstă a pășit Neil Armstrong pe lună? (At what age did Neil Armstrong step on the moon?)
    Answer: La 38 de ani. (At 38.)

Although this field is obviously related to web search, QA has become an umbrella term for research questions such as: what types of questions exist and how should they be handled, how can a set of documents that potentially contain the answer be analyzed and compared (for example, to detect conflicting answers), and how can a specific piece of information (the answer) be extracted from a document without ignoring its context.
This field is closely related to information extraction (IE), an area that has been extremely popular and influential in the statistical period of computational linguistics, since the early 1990s. IE systems identify pieces of information in classes of documents, for example the key persons in company takeovers as reported in newspapers. Another common scenario that has been studied is reports on terrorist incidents. In this case, the problem reduces to matching the text against a template that specifies the perpetrator, the target, the place and time of the incident, and its outcome. The main characteristic of IE systems is the filling of domain-specific templates, which makes IE an example of a behind-the-scenes technology that constitutes a
well-delimited research area, but one that requires the types of information of interest to be specified explicitly for each application domain.
Two borderline areas, which sometimes act as standalone applications and sometimes as behind-the-scenes components, are automatic summarization and text generation. Summarization essentially refers to shortening a long text and is offered as a feature in, for example, MS Word. One approach to automatic summarization is statistical: it identifies important words in the text (for example, words that are highly frequent in the text and less frequent in general language use) and then determines the sentences that contain these important words. These sentences are then marked in the document or extracted from it to form the summary. In this scenario, the summary is an extraction of sentences, and the text is reduced to a subset of its sentences.
A drawback of this approach is that it ignores the deictic expressions that may occur in the original text and that will be kept in the summary. If, because sentences have been removed, the antecedent of such a reference is no longer present, the resulting summary can become unintelligible. For example, for the text:

    Hercule, dintre toți copiii nelegitimi ai lui Zeus, părea să fie centrul mâniei Herei. Pe când el era doar un copil, ea a trimis un șarpe cu două capete să-l atace. (Of all of Zeus's illegitimate children, Hercules seemed to be the focus of Hera's anger. When he was only a child, she sent a two-headed serpent to attack him.)

the summary of this fragment, using the sentence elimination method, could be:

    Ea a trimis un șarpe cu două capete să-l atace. (She sent a two-headed serpent to attack him.)

introduced into the summary using a score that takes into account both the relevance of the sentence in the discourse and the coherence of the text, as obtained from anaphora resolution [38]. For the example summary above, anaphora resolution means identifying the relation between ea (she) and Hera and between -l (him) and Hercule. The summary thus becomes intelligible:

    Hera a trimis un șarpe cu două capete să-l atace pe Hercule. (Hera sent a two-headed serpent to attack Hercules.)

The automatic summarization system developed at UAIC has adopted this method and produces very good summaries for short texts [39]. This direction is being developed further at UAIC by introducing semantic information into automatic summarization [40].
An alternative method, to which much research is devoted, is the synthesis of new sentences, that is, building a summary from sentences that do not necessarily appear in the original text. This method requires a deeper understanding of the text (which is more costly in terms of computational resources and harder to achieve), but it can be applied successfully to longer texts. For example, for Romanian it is not relevant to compute the most frequent words (because these will be function words such as și, iar, dar, al, etc.), nor the discourse structure (which is far too intricate). In these cases, other methods can be applied, such as expanding a set of predefined flexible templates (based, for example, on identifying the discourse type
sau pe anumite informaii despre personajele principale,
ceea ce este destul de greu de neles dac nu exist nici timpul sau locul intrigii).
o explicaie despre cine este ea sau el (din cliticul -l se Un generator de text nu este, n majoritatea cazurilor,
nelege doar c exist o persoan atacat care este de o aplicaie de sine stttoare, ci este inclus ntr-o
genul masculin). platform soware mai larg, aa cum ntr-un sistem de
O modalitate de a spori coerena acestor rezumate management medical sunt colectate, stocate i procesate
este de a deriva iniial structura de discurs a textului informaii despre pacient, iar generarea rapoartelor este
i de a ghida selecia propoziiilor care urmeaz a doar o funcionalitate.

30
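The purely statistical, extractive approach to summarization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the UAIC system: the sentence splitter is naive, the `background` dictionary of general-language frequencies is made up for the example, and all function names are hypothetical. The sketch performs no anaphora resolution, so it shares the drawback discussed above: a selected sentence may contain pronouns whose antecedents were dropped.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased word tokens."""
    return re.findall(r"\w+", sentence.lower())


def summarize(text, background, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order.

    `background` is a hypothetical dict mapping a word to how common it is
    in general language use; words missing from it count as rare (0.0).
    """
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for toks in tokens for w in toks)

    def weight(word):
        # Frequent in this text but rare in general usage -> high weight.
        return counts[word] / (1.0 + background.get(word, 0.0))

    # Average word weight, so long sentences are not favoured automatically.
    scores = [sum(weight(w) for w in toks) / len(toks) if toks else 0.0
              for toks in tokens]
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(best[:n_sentences])]


# Made-up background frequencies, for illustration only.
BACKGROUND = {"a": 1000.0, "to": 1000.0, "the": 1000.0, "was": 500.0}
TEXT = ("Hera hated Hercules. Hera sent a serpent to attack Hercules. "
        "The weather was mild.")
print(summarize(TEXT, BACKGROUND, n_sentences=1))
```

Dividing each word count by a background frequency is one simple way to down-weight function words; a real system would more likely use a stop-word list or corpus-derived statistics for the target language.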
For Romanian, research in most text technology areas is less developed than for English.

Romanian, as a target language for research in all these areas, is less investigated than English, where question answering, information extraction and automatic summarization systems have, since the 1990s, been the subject of numerous competitions, such as those organized by DARPA/NIST in the United States or the CLEF campaigns in Europe. Nevertheless, teams of Romanian researchers from UAIC and RACAI have taken part in question answering competitions since 2006, with their own systems and very good results [41]. The main drawback is the small size of the annotated corpora and of the other resources needed to develop these areas. Automatic summarization systems that use only statistical methods are largely language-independent, so there are prototypes that can also be applied to Romanian. At UAIC, a summarization tool based on discourse structure and anaphora resolution is available for Romanian texts.

Adjacent areas in which Romanian researchers have been involved include computational lexicology, e-learning, and sentiment and opinion analysis. A consortium of three linguistic research institutes, two computer science research institutes and one university (UAIC) was recently involved in converting into electronic form the Thesaurus Dictionary of the Romanian Language (Dicționarul Tezaur al Limbii Române), which comprises 33 volumes written from 1913 to the present. The main objective was to transform the approx. 15,000 pages of the dictionary into a structured electronic format that allows complex searches as well as easier editing and updating [42].

Access to the lexicographic material of the language is also facilitated by semantic networks in the form of wordnets (word networks). The Romanian WordNet has been under development for more than 10 years and contains more than 57,000 synonym sets (synsets) covering approx. 60,000 words, distributed over four parts of speech: nouns, verbs, adjectives and adverbs. Each synset contains a set of words (each with an associated sense number) that are synonyms. The synsets are the nodes of the network, while the arcs are the semantic relations between synsets: hyponymy, hypernymy, meronymy, entailment, cause and others. The Romanian WordNet is aligned with Princeton WordNet [43] (the English version), the first and largest of the wordnets that exist for various languages. Synsets carry DOMAIN labels: each synset is labelled with the name of the domain in which it is used. Moreover, the Romanian WordNet is aligned with the largest freely available ontology, SUMO&MILO [44], and is used in various applications developed for Romanian: question answering systems, word sense disambiguation and machine translation.

An experimental application developed at the Human Language Engineering Laboratory of the Technical University of the Republic of Moldova, Chișinău, is a database of word associations for Romanian [45]. A key problem for researchers in cognitive linguistics is how the words of a language are associated. The database can be used in areas such as natural language processing, lexicography, etc.

Another area in which UAIC researchers have been involved is e-learning, through the incorporation of multilingual language technology tools and semantic web techniques to improve the retrieval of learning material. The technology developed facilitates personalized access to knowledge within learning management systems and supports collective data handling in content management.

The newest area of interest for language technology is sentiment and opinion analysis.
Thus, given a text, a program identifies whether it carries a positive or a negative emotional charge. Research in this area for Romanian started at RACAI with the use of SentiWordNet, a sentiment annotation of WordNet [46]. At UAIC, research in this direction has involved collaboration with the Intelligentics foundation in Cluj-Napoca to develop a system able to monitor the web and extract user opinions (from forums, blogs, social networks, etc.) about various products [47]. At the Human Language Engineering Laboratory of the Technical University of the Republic of Moldova, work on sentiment analysis led to the translation into Romanian and Russian of WordNet-Affect [48], which contains information about the emotional charge of words. WordNet-Affect was originally developed on the basis of the WordNet lexical resource, by assigning affective labels to the synsets of Princeton WordNet [49]. The words labelled as emotionally charged were subsequently classified into six categories: joy, fear, anger, sadness, disgust and surprise. WordNet-Affect is freely available for research [50].

4.4 EDUCATIONAL PROGRAMMES

Language technology is an interdisciplinary field that involves the expertise of linguists, computer scientists, statisticians and psycholinguists. So far it has not established a fixed place in the Romanian education system. Many universities in Romania and the Republic of Moldova have recently introduced courses in natural language processing and computational linguistics at undergraduate, master's and doctoral level. Since 2001, a master's programme in computational linguistics has been part of the curriculum of the Faculty of Computer Science of the Alexandru Ioan Cuza University of Iași. Nevertheless, a consolidated system of higher education in natural language processing and computational linguistics still has to be designed.

The most representative centres for the computational linguistics of Romanian are, in Romania, in Bucharest, Iași, Cluj-Napoca, Timișoara and Craiova and, in the Republic of Moldova, in Chișinău. Among the many research centres and universities working in language technology, we can mention the Research Institute for Artificial Intelligence of the Romanian Academy; the Institute of Theoretical Computer Science of the Romanian Academy, Iași Branch; the Department of Computer Science of the Alexandru Ioan Cuza University of Iași; the Faculty of Mathematics and Computer Science of the Babeș-Bolyai University of Cluj-Napoca; the Institute of Mathematics and Computer Science of the Academy of Sciences of the Republic of Moldova; the Human Language Engineering Laboratory within the Department of Applied Computer Science of the Faculty of Computers, Informatics and Microelectronics of the Technical University of the Republic of Moldova; and others. Some of these centres collaborate on national and international language technology projects.

Besides international conferences abroad, the common meeting points for most LT researchers are a series of international and national events held periodically in Romania, which bring together young and experienced researchers, linguists and computer scientists: the annual conferences of the Consortium for the Informatization of the Romanian Language, ConsILR [51]; the EUROLAN series of international summer schools; the SPED conferences (Speech Technology and Human-Computer Interaction); the KEPT conferences (Knowledge Engineering: Principles and Techniques); the ECIT conferences (European Conference on Intelligent Systems and Technologies); etc.

Computational linguistics is an interdisciplinary field and is studied either at faculties of
computer science or at faculties of humanities. This is a drawback for the LT field, because the study of computational linguistics is thereby oriented either towards the linguistic aspects or towards the engineering ones, and the overlap is only partial. Another drawback of this landscape is the minor involvement of Information and Communication Technology companies in LT research (although they have recently started to be present in education by offering internships).

4.5 NATIONAL PROJECTS AND EFFORTS

The companies that use and supply LT in Romania are certainly important (SOFTWIN, Continental, Microsoft Romania, etc.), but better collaboration is needed between them and the research institutes and universities, which are the most actively involved in research in this field. An important problem is the esoteric character of LT, which could be addressed through a good marketing strategy. The language industry is not a major employer in Romania; few Information and Communication Technology (ICT) companies already have LT departments.

Earlier national programmes had an initial momentum, but the lack of consistent or sufficiently attractive funding led to a loss of interest on the part of the large ICT companies and of the young researchers trained by universities and research institutes. One industry-education collaboration programme that has had a positive impact and good results in Romania in the LT field is the MSDN Academic Alliance, which gives students free access to various Microsoft technologies.

The main research laboratories active in LT in Romania are: the Research Institute for Artificial Intelligence of the Romanian Academy (RACAI), located in Bucharest; the Research Department of the Faculty of Computer Science of the Alexandru Ioan Cuza University of Iași (UAIC); the Institute of Theoretical Computer Science of the Romanian Academy, Iași Branch, which hosts the archive Sunetele Limbii Române (Sounds of the Romanian Language), an online repository of recorded Romanian speech sounds; and the Faculty of Electronics and Telecommunications of the Politehnica University of Bucharest, which has a team working on speech technology. As regards research programmes, UAIC and RACAI have been involved in several national and international research projects that aim to develop existing or new language technologies. Among these are the European projects ACCURAT (analysis and evaluation of comparable corpora for under-resourced areas of machine translation), See-ERA Net (machine translation systems for the Balkan languages), the FP7 project CLARIN (interoperable infrastructure of language resources for Romanian), BALKANET (building a network of wordnets for the Balkan languages), the FP6 project LT4eL (language technology for e-learning), the INTAS project RoLTech (a platform for Romanian language technology: resources, tools and interfaces), the Roric-Ling project, the ALEAR project (artificial language evolution for autonomous robots), the PSP-ICT projects METANET4U (enhancing the European multilingual infrastructure) and ATLAS (applied technology for content management systems that use natural language), etc. There have also been nationally funded projects such as STAR (machine translation system for Romanian), SIR-RESDEC (open-domain question answering system for Romanian and English), ROTEL (intelligent systems for the semantic web, based on the logic of ontologies and on LT), eDTLR (the Thesaurus Dictionary of the Romanian Language
in electronic format), among others.

The market for language technology can only be estimated, and it will almost certainly receive a boost from mobile platforms such as the Apple iPad and similar products, (educational) games, etc.

The projects carried out so far have led to the development of a wide range of technological and linguistic tools and resources for Romanian. The next section discusses the current state of technological support for Romanian.

4.6 STATE OF TOOLS AND RESOURCES FOR ROMANIAN

The following table gives an overview of the current state of language technology for Romanian. The assessment of existing technologies and resources is based on the estimates of several experts in the field, who used seven criteria, each scored from 0 (very poor) to 6 (very good).

                        Quantity  Availability  Quality  Coverage  Maturity  Sustainability  Adaptability
Language technology: tools, technologies, applications
Speech recognition      2         1             1.8      1.4       2         2               2
Speech synthesis        1         1             1.2      1.4       2         2               1
Text analysis           4         3.5           4        3.6       4.5       3.5             4
Text interpretation     3.3       3             3        3         3.6       4               4
Text generation         0         0             0        0         0         0               0
Machine translation     3         4             3.2      2.4       4         4               4
Language resources: resources, data, knowledge bases
Text corpora            2         2             2.4      2.4       3         2.5             3
Speech corpora          3         2             2.4      1.2       3         3               3
Parallel corpora        4         5             3.2      2.4       5         5               4
Lexical resources       4         3             3.6      3.2       5         4.5             4
Grammars                2         2             2.4      1.6       2         3               3

8: State of language technology support for Romanian

The main results for Romanian can be summarized as follows:

• There are areas that researchers have not yet addressed for Romanian: language generation, dialogue management systems and the construction of multimodal corpora.
• Although various parsing technologies are available for Romanian, there is not yet a reference corpus that can be reused for the automatic evaluation of parsers.
• Speech processing is currently much less developed than other LT areas as regards the availability of speech corpora and speech processing tools to the research community.
• While areas such as tokenization, sentence semantics or question answering have received significant attention, the same is not true of more complex areas such as semantic analysis or advanced discourse processing.
• Resources for Romanian are less well represented than tools, although they are essential for testing the tools that are created.
• With a few exceptions, such as web services for basic language processing, morphological analysis, question answering tools and machine translation systems, the existing systems for Romanian cannot be accessed without restrictions.
• Tools for Romanian have broad coverage in areas concerning sentence semantics and information retrieval, but limited coverage for the other problems.
• Among the existing LT tools for Romanian, the mature ones are freely available.
• While the tools are not necessarily actively maintained, the resources for Romanian are of good quality and are generally sustainable.
• Because most tools are based on language models or use machine learning techniques, adapting them is generally possible, which is not the case for resources.
• Many of these tools, resources and data formats do not comply with industry standards and cannot be integrated efficiently. A sustained programme is needed to standardize data formats and APIs.
• The scores that different experts gave to the same LT area were generally similar, especially for availability, which indicates that the existing tools and resources for Romanian are widely disseminated. Sometimes, however, for sustainability and coverage, the experts gave scores that differ by more than half of the total score. The main areas of disagreement were: the reference corpus, semantic corpora, grammars and ontological resources.
• The row on language models can be interpreted in different ways, because some experts scored it with models for written language in mind, while others gave lower scores thinking of models for spoken language.
• An unclear legal situation restricts the use of digital texts, such as those published online by newspapers, for empirical linguistic research and for language technology, for example for building statistical language models. Together with politicians and policy makers, researchers should try to establish laws or regulations that allow them to use publicly available texts for language-related research and development activities.

In conclusion, important results have already been obtained in a number of specific areas of language technology, and tools and resources with limited functionality exist. It is nevertheless clear that research efforts must continue in order to overcome the current limitations, especially in speech processing and text generation, but also to develop a representative corpus for Romanian.

4.7 CROSS-LANGUAGE COMPARISON

The current level of LT support varies considerably from one language community to another. To compare the situation across languages, this section presents an assessment based on two application areas
of LT (machine translation and speech processing), one underlying technology for LT-based applications (text analysis) and the basic resources needed to build LT applications. The languages were classified using a five-point scale:

1. excellent LT support
2. good support
3. moderate support
4. fragmentary support
5. weak or no support

The assessment of the two LT application areas, of the tools needed for text analysis and of the existing resources was based on the following criteria:

Speech processing: quality of existing speech recognition technologies, quality of existing speech synthesis technologies, coverage of domains, number and size of existing speech corpora, amount and variety of available speech-based applications.

Machine translation: quality of existing machine translation technologies, number of language pairs covered, coverage of linguistic phenomena and of different domains, quality and size of existing parallel corpora, amount and variety of available machine translation applications.

Text analysis: quality and coverage of existing text analysis technologies (morphology, syntax, semantics), coverage of linguistic phenomena and of different domains, amount and variety of available applications, quality and size of existing (annotated) text corpora, quality and coverage of existing lexical resources (e.g., WordNet) and grammars.

Resources: quality and size of existing text corpora, speech corpora and parallel corpora, quality and coverage of existing lexical resources and grammars.

Figures 9 to 12 show that, although systems and resources for LT applications have started to be developed for Romanian, they do not yet compare in quality and coverage with the resources and tools available for English, the language for which most systems are developed in nearly all areas. There are still plenty of gaps in the resources for English if we consider the applications that require the highest quality.

For speech processing, although internationally the current technologies are good enough to be integrated successfully into industrial applications such as spoken dialogue and dictation systems, Romanian is not represented in this area. Nevertheless, the current text analysis components and language resources already cover the linguistic phenomena of Romanian to a certain extent and are part of various applications that involve mostly shallow natural language processing, for example spelling correctors and authoring support systems.

Building more sophisticated applications, such as speech translation, requires resources and technologies that cover a wider range of linguistic aspects and allow a deeper, semantic analysis of the spoken utterance. By improving the quality and coverage of these resources and basic technologies, we will be able to open up new opportunities for tackling a wide range of advanced application areas, including high-quality machine translation.

4.8 CONCLUSIONS

In this series of white papers, a substantial initial effort has been made to assess language technology support for 30 European languages and to provide a high-level comparison across
these languages. By identifying the gaps, needs and deficits, the European language technology community and policy makers are now in a position to design a large-scale research and development programme aimed at building a truly multilingual, technology-enabled Europe.

We have seen that there are huge differences between Europe's languages. While some languages and application areas have good-quality software and resources, others (usually for the smaller languages) have major gaps. Many languages lack the basic technologies for text analysis and the essential resources for developing these technologies. Others have the basic tools and resources but are not yet able to invest in semantic processing. We therefore need a large-scale effort to reach the ambitious goal of providing high-quality machine translation between all European languages.

In the case of Romanian, we can be cautiously optimistic about the current state of language technology support. Research in universities and research institutes in Romania and the Republic of Moldova has led to the development of high-quality systems, as well as widely applicable models and theories. However, the scope of the resources and the range of tools are still very limited compared with the resources and tools available for English, and they are not sufficient in quality or quantity to develop the technologies needed to support a truly multilingual knowledge society.

The underdevelopment of language resources (in both quantity and quality) greatly hampers efforts to develop language technologies and applications. There is a major need for resources, from Romanian texts to annotated corpora in which particular linguistic phenomena are marked up by experts. Since the best source of texts is the electronic copies of publications, an awareness campaign aimed at publishers, with the goal of persuading them to donate part of their texts for research, is more than necessary [52].

On the other hand, we cannot simply transfer to Romanian the technologies developed and optimized for English. Parsing systems (syntactic and grammatical analysis of sentence structure) based on English usually give poor results when applied to Romanian texts, because of the specific characteristics of Romanian and its complexity.

Language generation and dialogue management systems are LT areas still in their infancy for Romanian, for which many technologies, applications and resources can still be developed. Speech technologies and speech corpora require special attention in order to bring Romanian up to the standards of the other European languages.

Our conclusion is that the only alternative is to make a substantial effort to create language resources for Romanian and to use them to advance research, innovation and development in language technology. The need for large amounts of data and the extreme complexity of language technology systems make it vital to develop a new infrastructure and a more coherent research organization in order to stimulate cooperation.

There is also a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of insufficient or no funding. In addition, there is a general lack of coordination with programmes in other EU countries and at European Commission level (as happens, for example, with the PSP-ICT programmes, in which Romanian universities also take part, but which are not supported by the government in securing co-funding consistently).
We can therefore conclude that there is an urgent need for a large-scale, coordinated initiative focused on overcoming the differences in the availability of language technology across European languages as a whole.

The long-term goal of META-NET is to introduce high-quality language technology for all languages, in order to achieve political and economic unity through cultural diversity. Technology will help tear down existing barriers and build bridges between Europe's languages. This requires all decision makers in politics, research, business and society to unite their efforts for the future.
Excellent support: none
Good support: English
Moderate support: Czech, Finnish, French, German, Italian, Dutch, Portuguese, Spanish
Fragmentary support: Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Irish, Hungarian, Norwegian, Polish, Swedish, Serbian, Slovak, Slovene
Weak or no support: Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian

9: Speech processing: state of language technology support for 30 European languages

Excellent support: none
Good support: English
Moderate support: French, Spanish
Fragmentary support: Catalan, German, Italian, Hungarian, Dutch, Polish, Romanian
Weak or no support: Basque, Bulgarian, Czech, Croatian, Danish, Estonian, Finnish, Galician, Greek, Irish, Icelandic, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Swedish, Serbian, Slovak, Slovene

10: Machine translation: state of language technology support for 30 European languages
Excellent support: none
Good support: English
Moderate support: French, German, Italian, Dutch, Spanish
Fragmentary support: Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Swedish, Slovak, Slovene
Weak or no support: Croatian, Estonian, Irish, Icelandic, Latvian, Lithuanian, Maltese, Serbian

11: Text analysis: state of language technology support for 30 European languages

Excellent support: none
Good support: English
Moderate support: Czech, French, German, Italian, Hungarian, Polish, Dutch, Spanish, Swedish
Fragmentary support: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
Weak or no support: Irish, Icelandic, Latvian, Lithuanian, Maltese

12: Resources for text and speech: state of support for 30 European languages
5

ABOUT META-NET

META-NET is a Network of Excellence partially funded by the European Commission. The network currently consists of 54 members from 33 European countries [53]. META-NET fosters the Multilingual Europe Technology Alliance (META), a growing community of language technology professionals and organizations in Europe. META-NET fosters the technological foundations for establishing and maintaining a truly multilingual European information society that:

• facilitates communication and cooperation across languages;
• ensures equal access to information and knowledge in any language;
• offers information technology functionality to European citizens.

This Network of Excellence supports the development of a Europe united in a single digital market and information space. META-NET stimulates and promotes multilingual technologies for all European languages. These technologies are used in machine translation, content production, information processing and knowledge management for a wide range of applications and domains. They also enable intuitive language-based interfaces to technologies ranging from household appliances, machinery and vehicles to computers and robots.

Launched on 1 February 2010, META-NET has already carried out several activities along its three lines of action: META-VISION, META-SHARE and META-RESEARCH.

META-VISION fosters a dynamic and influential community united around a shared vision and a common strategic research agenda. The main aim of this activity is to build a coherent and cohesive LT community in Europe through key people from different representative groups. This white paper series includes similar studies for 29 other languages. The shared technology vision was developed in three vision groups. A META Technology Council was set up to prepare the strategic research agenda on the basis of the vision, in close contact with the entire LT community.

META-SHARE creates a public, distributed infrastructure for exchanging and sharing resources. The network of digital repositories will contain language data, tools and web services documented with high-level metadata and organized in standardized categories. The resources can be accessed directly and allow uniform search. The available resources include free materials, with Open Source or restricted access, as well as resources available for a fee.

META-RESEARCH builds bridges to neighbouring technology fields. This activity seeks to apply recent discoveries and innovations from other fields in order to improve language technology. In particular, this line of action focuses on high-level research in machine translation, data collection, preparation of data sets and organization of resources for evaluation, compilation of inventories of tools and methods, and organization of workshops and training events for community members.

office@meta-net.eu http://www.meta-net.eu
1

EXECUTIVE SUMMARY

During the last 60 years, Europe has become a distinct political and economic structure. Culturally and linguistically it is rich and diverse. However, from Portuguese to Polish and Italian to Icelandic, everyday communication between Europe's citizens, within business and among politicians is inevitably confronted with language barriers. The EU's institutions spend about a billion euros a year on maintaining their policy of multilingualism, i.e., translating texts and interpreting spoken communication. Does this have to be such a burden? Language technology and linguistic research can make a significant contribution to removing the linguistic borders. Combined with intelligent devices and applications, language technology will help Europeans talk and do business together even if they do not speak a common language.

Language technology builds bridges.

Information technology changes our everyday lives. We typically use computers for writing, editing, calculating, and information searching, and increasingly for reading, listening to music, viewing photos and watching movies. We carry small computers in our pockets and use them to make phone calls, write emails, get information and entertain ourselves, wherever we are. How does this massive digitization of information, knowledge and everyday communication affect our language? Will our language change or even disappear?

All our computers are linked together into an increasingly dense and powerful global network. The girl in Buenos Aires, the customs officer in Constanța and the engineer in Kathmandu can all chat with their friends on Facebook, but they are unlikely ever to meet one another in online communities and forums. If they are worried about how to treat earache, they will all check Wikipedia to find out all about it, but even then they won't read the same article. When Europe's netizens discuss the effects of the Fukushima nuclear accident on European energy policy in forums and chat rooms, they do so in cleanly separated language communities. What the internet connects is still divided by the languages of its users. Will it always be like this?

In science fiction movies, everyone speaks the same language. Could it be Romanian, even though we only had one Romanian astronaut? Many of the world's 6,000 languages will not survive in a globalized digital information society. It is estimated that at least 2,000 languages are doomed to extinction in the decades ahead. Others will continue to play a role in families and neighbourhoods, but not in the wider business and academic world. What are the Romanian language's chances of survival?

Spoken by approx. 29,000,000 people worldwide, the Romanian language is not only present through books, films or TV stations, but also in the digital information space. The internet market in Romania is growing continuously. Ever more Romanians have a computer with an internet connection at home. The top-level domain .ro is used by 0.4% of all websites, similar to the .eu domain.

The Romanian language features a set of particularities that contribute to the richness of the language, but can also

be a challenge to the computational processing of Romanian.

The automated translation and speech processing tools currently available on the market fall short of the envisaged goals. The dominant actors in the field are primarily privately-owned for-profit enterprises based in Northern America. As early as the late 1970s, the EU realised the profound relevance of language technology as a driver of European unity, and began funding its first research projects, such as EUROTRA. At the same time, national projects were set up that generated valuable results, but never led to a concerted European effort. In contrast to these highly selective funding efforts, other multilingual societies such as India (22 official languages) and South Africa (11 official languages) have set up long-term national programmes for language research and technology development.

Language technology as a key for the future.

There are some complaints about the ever-increasing use of Anglicisms, and some linguists even fear that the Romanian language will become riddled with English words and expressions. But our study suggests that this is misguided.

Analogous to the re-latinisation phase in the 19th century after the liberation from Greek and Turkish domination, the Romanian language has passed over the last 20 years through a process of transformation from totalitarian usage (langue de bois, unidirectional discourse, etc.) to an open usage in which new linguistic patterns must adapt to the social and cultural transition. Therefore, similar to many other languages, Romanian is going through a continuous process of internationalisation under the influence of the Anglo-Saxon vocabulary.

Our main concern should not be the gradual Anglicisation of our language, but its complete disappearance from major areas of our personal lives. Not science, aviation and the global financial markets, which actually need a world-wide lingua franca. We mean the many areas of life in which it is far more important to be close to a country's citizens than to international partners: domestic policies, for example, administrative procedures, the law, culture and shopping.

Information and communication technology is now preparing for the next revolution. After personal computers, networks, miniaturisation, multimedia, mobile devices and cloud computing, the next generation of technology will feature software that understands not just spoken or written letters and sounds but entire words and sentences, and supports users far better because it speaks, knows and understands their language. Forerunners of such developments are the free online service Google Translate that translates between 57 languages, IBM's supercomputer Watson that was able to defeat the US champion in the game of Jeopardy, and Apple's mobile assistant Siri for the iPhone that can react to voice commands and answer questions in English, German, French and Japanese.

The next generation of information technology will master human language to such an extent that human users will be able to communicate using the technology in their own language. Devices will be able to automatically find the most important news and information from the world's digital knowledge store in reaction to easy-to-use voice commands. Language-enabled technology will be able to translate automatically or assist interpreters; summarise conversations and documents; and support users in learning scenarios.

The next generation of information and communication technologies will enable industrial and service robots (currently under development in research laboratories) to faithfully understand what their users want them to do and then proudly report on their achievements.

This level of performance means going way beyond simple character sets and lexicons, spell checkers and pronunciation rules. The technology must move on from simplistic approaches and start modelling language in an all-encompassing way, taking syntax as well as semantics into account to understand the drift of questions and generate rich and relevant answers.

In the case of the Romanian language, research in universities and academia in Romania and the Republic of Moldova has succeeded in designing some high-quality software, as well as widely applicable models and theories. However, the scope of the resources and the range of tools are still very limited when compared to English, and they are simply not sufficient in quality and quantity to develop the kind of technologies required to support a truly multilingual knowledge society. Moreover, it is nearly impossible to come up with sustainable and standardised solutions given the current relatively low level of linguistic resources.

A legally unclear situation restricts the usage of digital texts, such as those published online by newspapers, for empirical linguistics and language technology research, for example, to train statistical language models. Together with politicians and policy makers, researchers should try to establish laws or regulations that enable researchers to use publicly available texts for language-related R&D activities.

Finally, there is a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of sparse or zero funding. The need for large amounts of data and the extreme complexity of language technology systems make it vital to develop an infrastructure and a coherent research financing and organisation to spur greater sharing and cooperation.

Summing up, we can safely consider that for now, the Romanian language is not in danger. However, the whole situation could change dramatically when a new generation of technologies really starts to master human languages effectively. Through improvements in machine translation, language technology will help in overcoming language barriers, but it will only be able to operate between those languages that have managed to survive in the digital world. If there is adequate language technology available, then it will be able to ensure the survival of languages with very small populations of speakers. If not, even larger languages will come under severe pressure.

Language Technology helps unify Europe.

Drawing on the insights gained so far, today's hybrid language technology mixing deep processing with statistical methods should be able to bridge the gap between all European languages and beyond. But as this series of white papers shows, there is a dramatic difference between Europe's member states in terms of both the maturity of the research and the state of readiness with respect to language solutions.

META-NET's vision is high-quality language technology for all languages that supports political and economic unity through cultural diversity. This technology will help tear down existing barriers and build bridges between Europe's languages. This requires all stakeholders in politics, research, business, and society to unite their efforts for the future.

This white paper series complements the other strategic actions taken by META-NET (see the appendix for an overview). Up-to-date information such as the current version of the META-NET vision paper [2] or the Strategic Research Agenda (SRA) can be found on the META-NET website: http://www.meta-net.eu.

2 LANGUAGES AT RISK: A CHALLENGE FOR LANGUAGE TECHNOLOGY

We are witnesses to a digital revolution that is dramatically impacting communication and society. Recent developments in information and communication technology are sometimes compared to Gutenberg's invention of the printing press. What can this analogy tell us about the future of the European information society and our languages in particular?

The digital revolution is comparable to Gutenberg's invention of the printing press.

After Gutenberg's invention, real breakthroughs in communication were accomplished by efforts such as Luther's translation of the Bible into vernacular language. In subsequent centuries, cultural techniques have been developed to better handle language processing and knowledge exchange:

the orthographic and grammatical standardisation of major languages enabled the rapid dissemination of new scientific and intellectual ideas;
the development of official languages made it possible for citizens to communicate within certain (often political) boundaries;
the teaching and translation of languages enabled exchanges across languages;
the creation of editorial and bibliographic guidelines assured the quality of printed material;
the creation of different media like newspapers, radio, television, books, and other formats satisfied different communication needs.

In the past twenty years, information technology has helped to automate and facilitate many processes:

desktop publishing software has replaced typewriting and typesetting;
Microsoft PowerPoint has replaced overhead projector transparencies;
e-mail allows documents to be sent and received more quickly than using a fax machine;
Skype offers cheap internet phone calls and hosts virtual meetings;
audio and video encoding formats make it easy to exchange multimedia content;
web search engines provide keyword-based access;
online services like Google Translate produce quick, approximate translations;
social media platforms such as Facebook, Twitter and Google+ facilitate communication, collaboration, and information sharing.

Although these tools and applications are helpful, they are not yet capable of supporting a fully-sustainable, multilingual European society in which information and goods can flow freely.

2.1 LANGUAGE BORDERS HOLD BACK THE EUROPEAN INFORMATION SOCIETY

We cannot predict exactly what the future information society will look like. However, there is a strong likelihood that the revolution in communication technology is bringing together people who speak different languages in new ways. This is putting pressure both on individuals to learn new languages and especially on developers to create new technology applications to ensure mutual understanding and access to shareable knowledge. In the global economic and information space, there is increasing interaction between different languages, speakers and content thanks to new types of media. The current popularity of social media (Wikipedia, Facebook, Twitter, YouTube, and, recently, Google+) is only the tip of the iceberg.

The global economy and information space confronts us with different languages, speakers and content.

Today, we can transmit gigabytes of text around the world in a few seconds before we recognise that it is in a language that we do not understand. According to a recent report from the European Commission, 57% of internet users in Europe purchase goods and services in non-native languages; English is the most common foreign language followed by French, German and Spanish. 55% of users read content in a foreign language while 35% use another language to write e-mails or post comments on the Web [3]. A few years ago, English might have been the lingua franca of the Web (the vast majority of content on the Web was in English), but the situation has now drastically changed. The amount of online content in other European (as well as Asian and Middle Eastern) languages has exploded.

Surprisingly, this ubiquitous digital linguistic divide has not gained much public attention; yet, it raises a very pressing question: Which European languages will thrive in the networked information and knowledge society, and which are doomed to disappear?

2.2 OUR LANGUAGES AT RISK

While the printing press helped step up the exchange of information in Europe, it also led to the extinction of many European languages. Regional and minority languages were rarely printed and languages such as Cornish and Dalmatian were limited to oral forms of transmission, which in turn restricted their scope of use. Will the internet have the same impact on our modern languages?

Europe's approximately 80 languages are one of our richest and most important cultural assets, and a vital part of this unique social model [4]. While languages such as English and Spanish are likely to survive in the emerging digital marketplace, many European languages could become irrelevant in a networked society. This would weaken Europe's global standing, and run counter to the strategic goal of ensuring equal participation for every European citizen regardless of language.

According to a UNESCO report on multilingualism, languages are an essential medium for the enjoyment of fundamental rights, such as political expression, education and participation in society [5].

The variety of languages in Europe is one of its richest and most important cultural assets.

2.3 LANGUAGE TECHNOLOGY IS A KEY ENABLING TECHNOLOGY

In the past, investments in language preservation focussed primarily on language education and translation. According to one estimate, the European market for translation, interpretation, software localisation and website globalisation was €8.4 billion in 2008 and is expected to grow by 10% per annum [6]. Yet this figure covers just a small proportion of current and future needs in communicating between languages. The most compelling solution for ensuring the breadth and depth of language usage in Europe tomorrow is to use appropriate technology, just as we use technology to solve our transport and energy needs among others.

Language technology targeting all forms of written text and spoken discourse can help people to collaborate, conduct business, share knowledge and participate in social and political debate regardless of language barriers and computer skills. It often operates invisibly inside complex software systems to help us already today to:

find information with a search engine;
check spelling and grammar in a word processor;
view product recommendations in an online shop;
follow the spoken directions of a navigation system;
translate web pages via an online service.

Language technology consists of a number of core applications that enable processes within a larger application framework. The purpose of the META-NET language white papers is to focus on how ready these core enabling technologies are for each European language.

Europe needs robust and affordable language technology for all European languages.

To maintain our position in the frontline of global innovation, Europe will need language technology, tailored to all European languages, that is robust and affordable and can be tightly integrated within key software environments. Without language technology, we will not be able to achieve a really effective interactive, multimedia and multilingual user experience in the near future.

2.4 OPPORTUNITIES FOR LANGUAGE TECHNOLOGY

In the world of print, the technology breakthrough was the rapid duplication of an image of a text using a suitably powered printing press. Human beings had to do the hard work of looking up, assessing, translating, and summarising knowledge. We had to wait until Edison to record spoken language, and again his technology simply made analogue copies.

Language technology can now simplify and automate the processes of translation, content production, and knowledge management for all European languages. It can also empower intuitive speech-based interfaces for household electronics, machinery, vehicles, computers and robots. Real-world commercial and industrial applications are still in the early stages of development, yet R&D achievements are creating a genuine window of opportunity. For example, machine translation is already reasonably accurate in specific domains, and experimental applications provide multilingual information and knowledge management, as well as content production, in many European languages.

As with most technologies, the first language applications such as voice-based user interfaces and dialogue systems were developed for specialised domains, and often exhibit limited performance. However, there are huge market opportunities in the education and entertainment industries for integrating language technologies into games, edutainment packages, libraries, simulation environments and training programmes. Mobile information services, computer-assisted language learning software, eLearning environments, self-assessment tools and plagiarism detection software are just some of the application areas in which language technology can play an important role. The popularity of social media applications like Twitter and Facebook suggests a need for sophisticated language technologies that can monitor posts, summarise discussions, suggest opinion trends, detect emotional responses, identify copyright infringements or track misuse.

Language technology helps overcome the disability of linguistic diversity.

Language technology represents a tremendous opportunity for the European Union. It can help to address the complex issue of multilingualism in Europe: the fact that different languages coexist naturally in European businesses, organisations and schools. However, citizens need to communicate across the language borders of the European Common Market, and language technology can help overcome this final barrier, while supporting the free and open use of individual languages. Looking even further ahead, innovative European multilingual language technology will provide a benchmark for our global partners when they begin to support their own multilingual communities. Language technology can be seen as a form of assistive technology that helps overcome the disability of linguistic diversity and makes language communities more accessible to each other. Finally, one active field of research is the use of language technology for rescue operations in disaster areas, where performance can be a matter of life and death: Future intelligent robots with cross-lingual language capabilities have the potential to save lives.

2.5 CHALLENGES FACING LANGUAGE TECHNOLOGY

Although language technology has made considerable progress in the last few years, the current pace of technological progress and product innovation is too slow. Widely-used technologies such as the spelling and grammar correctors in word processors are typically monolingual, and are only available for a handful of languages. Online machine translation services, although useful for quickly generating a reasonable approximation of a document's contents, are fraught with difficulties when highly accurate and complete translations are required. Due to the complexity of human language, modelling our tongues in software and testing them in the real world is a long, costly business that requires sustained funding commitments. Europe must therefore maintain its pioneering role in facing the technological challenges of a multiple-language community by inventing new methods to accelerate development right across the map. These could include both computational advances and techniques such as crowdsourcing.

Technological progress needs to be accelerated.

2.6 LANGUAGE ACQUISITION IN HUMANS AND MACHINES

To illustrate how computers handle language and why it is difficult to program them to process different tongues, let's look briefly at the way humans acquire first and second languages, and then see how language technology systems work.

Humans acquire language skills in two different ways. Babies acquire a language by listening to the real interactions between their parents, siblings and other family members. From the age of about two, children produce

their first words and short phrases. This is only possible because humans have a genetic disposition to imitate and then rationalise what they hear.

Learning a second language at an older age requires more cognitive effort, largely because the child is not immersed in a language community of native speakers. At school, foreign languages are usually acquired by learning grammatical structure, vocabulary and spelling using drills that describe linguistic knowledge in terms of abstract rules, tables and examples.

Humans acquire language skills in two different ways: learning from examples and learning the underlying language rules.

Moving now to language technology, the two main types of systems acquire language capabilities in a similar manner. Statistical (or data-driven) approaches obtain linguistic knowledge from vast collections of concrete example texts. While it is sufficient to use text in a single language for training, e.g., a spell checker, parallel texts in two (or more) languages have to be available for training a machine translation system. The machine learning algorithm then learns patterns of how words, short phrases and complete sentences are translated.

This statistical approach usually requires millions of sentences to boost performance quality. This is one reason why search engine providers are eager to collect as much written material as possible. Spelling correction in word processors, and services such as Google Search and Google Translate, all rely on statistical approaches. The great advantage of statistics is that the machine learns quickly in a continuous series of training cycles, even though quality can vary randomly.

The second approach to language technology, and to machine translation in particular, is to build rule-based systems. Experts in the fields of linguistics, computational linguistics and computer science first have to encode grammatical analyses (translation rules) and compile vocabulary lists (lexicons). This is very time consuming and labour intensive. Some of the leading rule-based machine translation systems have been under constant development for more than 20 years. The great advantage of rule-based systems is that the experts have more detailed control over the language processing. This makes it possible to systematically correct mistakes in the software and give detailed feedback to the user, especially when rule-based systems are used for language learning. However, due to the high cost of this work, rule-based language technology has so far only been developed for a few major languages.

The two main types of language technology systems acquire language in a similar manner.

As the strengths and weaknesses of statistical and rule-based systems tend to be complementary, current research focuses on hybrid approaches that combine the two methodologies. However, these approaches have so far been less successful in industrial applications than in the research lab.

As we have seen in this chapter, many applications widely used in today's information society rely heavily on language technology, particularly in Europe's economic and information space. Although this technology has made considerable progress in the last few years, there is still huge potential to improve the quality of language technology systems. In the next section, we describe the role of Romanian in European information society and assess the current state of language technology for the Romanian language.

3 THE ROMANIAN LANGUAGE IN THE EUROPEAN INFORMATION SOCIETY

3.1 GENERAL FACTS

Spoken by over 29,000,000 speakers [7], Romanian is the mother tongue of approx. 25,000,000 speakers: around 21,500,000 speakers in Romania [8] plus approx. 3,500,000 speakers in the Republic of Moldova [9] (where the language is officially called Moldavian). The countries around Romania (Albania, Bulgaria, Croatia, Greece, Hungary, The Former Yugoslav Republic of Macedonia, Serbia, Ukraine) and communities of immigrants in Australia, Canada, Israel, Latin America, Turkey, USA and other European and Asian countries total around 4,000,000 Romanian native speakers [10].

Romanian is also an official language in the Autonomous Province of Vojvodina in Serbia, in the autonomous Mount Athos in Greece, in the European Union and in the Latin Union; it is a recognised minority language in Ukraine.

Romanian has four dialects [11]: Daco-Romanian, Aromanian (spoken by approximately 600,000 speakers in Albania, Bulgaria, Greece and Macedonia), Istro-Romanian (15,000 speakers in 2 small areas in the Istrian Peninsula, Croatia) and Megleno-Romanian (about 5,000 speakers in Greece and Macedonia). Because of their small number of speakers, these dialects are included in the UNESCO Red Book of Endangered Languages.

In Romania there are 18 officially recognised national (ethnic) minorities; in the last Census (2002), the most numerous were Hungarians (1,431,807) and Roma (535,140), followed by Germans, Ukrainians, Lippovan Russians, Turks, Serbs, Croats, Slovenes, Tatars, Slovaks, Bulgarians, Jews, Czechs, Poles, Greeks, Armenians, etc. For all these minorities, official language policies in Romania guarantee their rights to be protected as language communities and to use their own languages in private and public, culturally and socially, in the economy and in communication media. However, article 13 of the Constitution states that "In Romania, the official language is Romanian". Moreover, Law number 500 from 12th November, 2004 stipulates that any text (either oral or written) that serves the public interest must be translated or adapted into Romanian [12].

3.2 PARTICULARITIES OF THE ROMANIAN LANGUAGE

Developed at a distance from the other languages in the Romance family, Romanian is an eastern Romance language. Elements of the Vulgar Latin from which it descends are more faithfully preserved in this isolated language: it has inherited the Latin morpho-syntactic structure, preserved features that other Romance languages have lost (such as declensions), and incorporated some non-Romance features in its structure (-o vocatives).

The greater part of the Romanian vocabulary has a Latin origin, either inherited from Vulgar Latin or borrowed

from Latin in modern times. 60% of the fundamental vocabulary (i.e., the words that are known and currently used by all speakers of the language) is inherited from Latin.

During the Roman colonisation of Dacia (106-271 A.D.), the colonisers imposed Latin as the official language. However, comparative studies of Romanian and Albanian vocabularies reveal a set of around 100 words that have been preserved from the Thraco-Dacian substratum. These words designated fundamental concepts, like body parts, natural elements or food. They are still used today, are very frequent, and have rich polysemy and lexical families.

During the migration of Slavic tribes over the territory of present-day Romania, the language underwent a process of transformation in all its compartments: phonetics, vocabulary, morphology and syntax. However, morphology, the backbone of a language, remained Latin in most of its aspects. The Cyrillic alphabet was adopted in this period, especially due to the influence of the church. Old Slavonic was the liturgical language of the Romanian Orthodox Church until the late 18th century, when Romanian started a process of re-latinisation, modernisation and westernisation. It was then that many words of other origins were replaced by Latin words, borrowed directly or indirectly, via other Romance languages (French and Italian). French as a language of culture in the last 2 centuries and France as a place where the Romanian aristocracy sent their children to school explain the existence of extremely numerous words of this origin in Romanian. Lately, English has taken the place of French and Romanian has many Anglicisms, entirely, partially or not at all adapted to its phonetic and morphologic systems.

Political, economic and social aspects in the history of Romania explain the words of various other origins in this language: Turkish, Greek, German, Hungarian, Bulgarian, Russian etc. New words have been created in Romanian mostly through suffix derivation. However, recent studies reveal the importance that prefix derivation has gained lately (for more information see [13]).

Romanian has 5 letters using diacritics: ă, â, î, ș, ț. For the last 2, two variants have circulated: one with a comma under the letter, and another one with a cedilla. However, only the former is recommended nowadays by the Romanian National Standardisation Body (ASRO). Many electronic texts are not written with diacritics. In order to automatically introduce diacritics, programs have been created to recover them in such texts.

Romanian exhibits a number of specific characteristics that contribute to the richness of the language but can also be a challenge for the computational processing of a natural language. Romanian inflection is quite rich. For nouns, pronouns and adjectives there are five cases and two numbers. Pronouns can have stressed and unstressed forms, while nouns and adjectives can be definite or indefinite. For verbs there are two numbers, each with three persons, and five synthetic tenses, plus infinitive, gerund and participle forms. On average, a noun can have 5 forms, a personal pronoun about 6 forms, an adjective around 6 forms, while a verb has more than 30 forms. Besides morphologic suffixes and endings, phonetic alternations inside the root are also possible with inflected words.

Romanian is a highly inflected language, with various linguistic particularities: it is a pro-drop language, it allows clitic doubling, negative concord and double negation.
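
The two diacritic issues mentioned in this section (cedilla versus comma-below variants, and texts written without diacritics) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not one of the recovery programs referred to above: the function names normalize_romanian and strip_diacritics are hypothetical, and only the Unicode code points and the standard-library calls are taken as given.

```python
import unicodedata

# Legacy cedilla code points mapped to the recommended comma-below ones.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})


def normalize_romanian(text: str) -> str:
    """Replace cedilla variants of s and t with the comma-below variants."""
    return text.translate(CEDILLA_TO_COMMA)


def strip_diacritics(text: str) -> str:
    """Simulate text typed without diacritics by dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The same word typed with a cedilla and with a comma below now compares equal.
print(normalize_romanian("pe\u015Fte") == "pe\u0219te")

# Without diacritics, "pește" (fish) becomes identical to "peste" (over).
print(strip_diacritics("pe\u0219te") == "peste")
```

Normalising the variants is a simple character mapping. Recovering missing diacritics is harder, because distinct words such as "pește" and "peste" collapse into the same string, so a recovery program has to use the surrounding context to choose between them.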

Romanian is a subject pro-drop language, like most of its Romance sisters, that is, it allows the deletion of the subject:

(1) Știe.
    Knows-he/she/it
    He/She/It knows.

The explanation resides in the rich inflectional system of verbs, which have distinctive endings for different persons and numbers.

Nevertheless, subject doubling is also possible in Romanian when a personal pronoun doubles a lexical noun phrase:

(2) Vine el tata imediat!
    Comes he father-the immediately!
    Father will come immediately!

The structure is characteristic of colloquial language, marking a certain illocutionary attitude of the speaker: threat, promise, or reassurance.

Romanian has in common with some Spanish dialects and several Balkan languages a structure currently known as clitic doubling. Pronominal clitic doubling in Romanian may be realised with accusative clitics, with dative ones or with both. For example, in the sentence:

(3) I_i l_j-am dat mamei_i pe Ion_j la telefon.
    Dat. cl. Acc. masc. cl. have-I given to-mother PE John on phone.
    I gave John to my mother on the phone.

the noun mamei and the Dative clitic i refer to the same person, and the Accusative clitic l- and the Accusative noun Ion are also coreferential. The presence of clitics in such constructions is mandatory, although they do not saturate any verbal valences. However, when the nouns are not present, it is the task of these pronouns to saturate the verbal valences:

(4) I l-am dat la telefon.
    To her him have I given on phone.
    I gave him to her on the phone.

The clitic doubling phenomenon is obligatory with proper names and definite nouns functioning as direct or indirect objects.

Romanian displays both Negative Concord (when the presence of one or more negative words is conditioned by the occurrence of a negative marker on the matrix verb; this is also the case in other Romance languages: Portuguese, Spanish, French) and Double Negation (similar to double negation in logic, where two negations cancel each other and an affirmation results; this phenomenon is accepted in some languages such as English only for stylistic reasons). The presence of the negative marker nu (not) in the verbal phrase negates the sentence and licenses negative words in the respective sentence (negative concord):

(5) Nu am văzut pe nimeni niciodată aici.
    Not have I seen nobody never here.
    I have never seen anybody here.

However, certain configurations in which the negative markers and words occur trigger the double negation (that is, the sentence acquires a positive meaning). For instance, a negative main clause followed by a negative subjunctive clause is such a configuration with overall positive meaning:

(6) Maria nu a vrut să nu spună nimic.
    Maria not has wanted to not say nothing.
    Maria did not want to say nothing. or Maria wanted to say something.
Case is inflectional in Romanian. However, there are also three case marking prepositions: pe for Accusative (conditioned by the animacy, definiteness and specificity features of the nominal phrase), la for Dative and a for Genitive (both of them conditioned by the presence of numerals in the nominal phrase):

(7) L-am văzut pe colegul meu.
    Acc. masc. cl. have I seen PE colleague-the my.
    I have seen my colleague.

(8) Am dat cărțile la trei dintre ei.
    Have I given books-the to three of them.
    I gave the books to three of them.

(9) Cărțile a trei copii erau noi.
    Books-the of three children were new.
    The books of three children were new.

Certain linguistic characteristics of Romanian are challenges for computational processing.

3.3 RECENT DEVELOPMENTS

Analogous to the re-latinisation phase in the 19th century, after the liberation from Greek and Turkish domination, the Romanian language has been passing in the last 20 years through a process of transformation from totalitarian usage (langue de bois, unidirectional discourse, etc.) to an open usage in which new linguistic patterns must adapt to the social and cultural transition. Therefore, similar to many other languages, Romanian is going through a continuous process of internationalisation under the influence of the Anglo-Saxon vocabulary.

In essential domains like political, administrative and economic sciences, media, advertising, computers, etc. substantial loans and semantic extensions from English have occurred; terminologies in new fields are based on English loans, the active vocabulary of educated people contains more and more anglicisms, new intonation patterns can be observed (especially in media), as well as the use of the second person singular (informal) instead of the second person plural (formal).

In some areas, anglicisms have started to replace existing Romanian vocabulary. One example is the use of English titles in job advertisements, in particular for executive positions, e.g., Human Resource Manager instead of Director de Resurse Umane. A strong tendency to overuse anglicisms can also be detected in product advertisements. Banks in Romania use promotional slogans such as "Cu cine faci banking?" or "Prima modalitate de plată contactless", although banking and contactless are anglicisms that most Romanians are not used to. The example demonstrates the importance of raising awareness of a development that runs the risk of excluding large parts of the population from taking part in the information society, namely those who are not familiar with English.

3.4 OFFICIAL LANGUAGE PROTECTION IN ROMANIA

The Romanian Academy, Romania's highest cultural forum, has, as one of its main objectives, the cultivation of the national language. The major goal of its linguistic institutes was building and publishing Dicționarul Tezaur al Limbii Române (the Thesaurus Dictionary of the Romanian Language), a process which took almost one century. The old series, known as Dicționarul Academiei (The Dictionary of the Academy – DA), includes 5 volumes with 3,146 pages and 44,890 entries, and was developed between 1913 and 1947. After an interruption, the work was restarted in the middle of the seventh decade of the last century with the new series, known as Dicționarul Limbii Române (the Dictionary of the Romanian Language – DLR). The last volume was finally published by the Publishing House of the Romanian Academy at the beginning of 2009. In total, DA and DLR have 36 volumes, more than 15,000 pages and about 175,000 entries. The dictionary was created in the traditional pencil-and-paper way, with excerpts collected from more than 2,500 volumes of written Romanian literature.

The Institute of Linguistics Iorgu Iordan – Al. Rosetti has a research programme focusing on language cultivation. It elaborates normative dictionaries (Dicționarul împrumuturilor neadaptate – Dictionary of non-adapted words, Dicționarul termenilor oficiali – Dictionary of official terms, Dicționar ortografic, ortoepic și morfologic al limbii române – Orthographic, orthoepic and morphologic dictionary of Romanian) and grammars (Gramatica limbii române – Romanian Language Grammar, Dinamica limbii române actuale – The Dynamics of Contemporary Romanian).

The Institute of Romanian Philology A. Philippide of Iași, through its specialised departments, develops fundamental projects for Romanian culture in the areas of lexicography, dialectology, ethnography and folklore. The Institute has collaborated with the linguistic institutes from Bucharest and Cluj-Napoca to create and publish the Regional Linguistic Atlas, a work of major importance for Romanian linguistics. Based on the regional atlases and on the Atlas of the Moldavian language elaborated in the Republic of Moldova, the Institute of Linguistics Iorgu Iordan – Al. Rosetti is preparing the Romanian Linguistic Atlas. Synthesis.

Within the Romanian Academy, two other institutes deal with the protection of the Romanian language: the Institute of History and Literary Theory G. Călinescu and the Institute of Ethnography and Folklore C. Brăiloiu. The Institute of History and Literary Theory G. Călinescu has the following lines of research: development of encyclopaedias and fundamental syntheses of the history and literary theory, preservation and development of national literature and defining the national cultural identity in the European context. The Institute of Ethnography and Folklore Constantin Brăiloiu is a multidisciplinary research structure whose main task is to develop fundamental and advanced research on traditional and contemporary culture, in rural and urban areas, in the domains of folklore (folkloric literature), ethnomusicology, ethnography and unconventional multimedia archives.

Law 500 of 12th November 2004 states that all written or spoken texts in Romanian that serve the public interest must conform to the norms established by the Romanian Academy.

Institutul Limbii Române (The Institute of the Romanian Language) was created with the aims of promoting Romanian language learning abroad, supporting learners of Romanian and attesting their knowledge of Romanian [14]. There are over 70 international centres abroad where Romanian is taught as a foreign language by Romanian university teachers.

In Romania there is also an increasing interest in studying Romanian among foreigners, not only at the diplomatic level (by representatives of the diplomatic missions of different countries), but also among business people. Besides universities, which offer Romanian as a foreign language classes (usually for foreign students in Romania), there are numerous private firms with classes offered in general to foreigners involved in the economic sector. Romanian summer courses for all levels
are organised annually by the Romanian Cultural Foundation in various places in the country and by several higher education institutions (such as Alexandru Ioan Cuza University of Iași or the University of Bucharest).

Language cultivation in the context of accelerated innovation is also a priority for the media. The national radio and television channels have programmes in which tricky aspects of language are discussed with specialists and explained to the audience.

3.5 LANGUAGE IN EDUCATION

According to the New National Curriculum (2000), Romanian is taught for 4–5 compulsory classes per week in secondary school and for 3–4 compulsory classes in high school. Prescriptive aspects of language preservation are combined with communication as skilled behaviour, and the language-culture relation is emphasised. Romanian language and literature are compulsory subjects for national exams (graduation exam from secondary school and graduation exam from high school; the latter involves two kinds of examination: oral and written).

Romanian language and literature are studied as major and minor subjects in more than 30 state and private universities throughout Romania.

3.6 INTERNATIONAL ASPECTS

Romania is internationally known for its literature, the major works of Eminescu (the great national poet of Romania) being translated into more than 60 languages. Other known names of Romanian literature are: Mircea Eliade, the first to write a history of religions, Eugen Ionesco, one of the forerunners of the Theatre of the Absurd, or Emil Cioran, known for his philosophical system.

Nowadays, the large majority of the scientific publications in the LT field are written in English, although a Consortium for the Digitalisation of the Romanian Language (ConsILR) annually organises a scientific workshop dedicated to research in LT regarding the Romanian language, with the proceedings written in Romanian. The same situation also holds for other domains, possibly being less prominent for disciplines such as law, philosophy, linguistics or theology.

Similarly, this is true of the business world. In many large and internationally active companies, English has become the lingua franca, both in written (emails and documents) and oral communication (e.g., talks), especially in multinational companies with foreign management.

Language technology can address this challenge from a different perspective by offering services like Machine Translation or cross-lingual information retrieval for foreign language text and thus help diminish personal and economic disadvantages naturally faced by non-native speakers of English.

Romanian minorities live in neighbouring countries and in Diaspora communities all over the world. Romania promotes policies for the preservation of the language and cultural identity of the Romanian communities. The Eudoxiu Hurmuzachi Centre offers hundreds of scholarships a year in Romania for Romanian minorities from neighbouring countries. There are many school and academic exchanges, especially with the Republic of Moldova. The first Romanian school and university extensions through franchising appeared in the Republic of Moldova in 2000.

In different communities from the Diaspora, there are various initiatives through which those interested can study Romanian language and culture. For instance,
Romanian Language School in Kitchener, Canada, has classes of Romanian language and culture for children and teenagers.

Romanian Cultural Institutes are established in 19 cities all over the world (including Bucharest, New York, Paris, London, Rome, Istanbul, etc.) and they all have as an important concern the promotion of Romanian through language classes and cultural events of all types.

3.7 ROMANIAN ON THE INTERNET

The internet market in Romania is in continuous growth. In 2010, 44.2% of Romanians had access to a computer at home, and 35.5% (i.e., 7,786,700 Romanians) were internet users [15] (with almost 60% of them using the internet daily), which places Romania eighth among European countries by number of internet users [16]. Over 500,000 websites are registered in the .ro domain.

Compared to the data from 2000, when only 3.6% of the population (800,000) used the internet, this is an increase of almost 10 times.

A study of the Latin Union in 2007 [17] states that, similar to most of the Romance languages, Romanian increased its presence on the internet in the 1998–2007 period. Dividing the percentage of web pages in each language by the percentage of that language's speakers in the real world, they computed the vigour of each language (or the weighted presence of the studied languages in cyberspace). Although this coefficient is considered low for Romanian (0.62 in 2007, in comparison with 4.44 for English, 2.24 for French, 2.93 for Italian), Romanian is the only language whose vigour increased in the 2005–2007 period (previous to the European Union integration).

The growing importance of the internet is critical for language technology. The vast amount of digital language data is a key resource for analysing the usage of natural language, in particular, for collecting statistical information about patterns. And the internet offers a wide range of application areas for language technology.

The most commonly used web application is search, which involves the automatic processing of language on multiple levels, as will be shown in more detail later. Web search involves sophisticated language technology that differs for each language. For the Romanian language, for example, this involves matching s to ș and t to ț.

Internet users and providers of web content can also use language technology in less obvious ways, for example, by automatically translating web page contents from one language into another. Despite the high cost of manually translating this content, comparatively little language technology has been developed and applied to the issue of website translation in light of the supposed need. This may be due to the complexity of the Romanian language and to the range of different technologies involved in typical applications.

The next chapter gives an introduction to language technology and its core application areas, together with an evaluation of current language technology support for Romanian.
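The matching of s to ș and t to ț mentioned above for Romanian web search can be illustrated with a minimal sketch. It assumes a very simple strategy, folding both the query and the document to a diacritic-free, lower-case form before comparing terms; the function names `fold_diacritics` and `matches` are illustrative and do not come from any system described in this paper.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a lower-cased copy of `text` with diacritics removed.

    NFD normalisation splits letters such as ș, ț, ă, â, î (and the legacy
    cedilla variants ş, ţ) into a base letter plus a combining mark, which
    is then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def matches(query: str, document: str) -> bool:
    """True if every folded query term occurs among the folded document terms."""
    doc_terms = set(fold_diacritics(document).split())
    return all(term in doc_terms for term in fold_diacritics(query).split())


if __name__ == "__main__":
    # A query typed without diacritics still finds text written with them.
    print(matches("stiinta", "Știința limbii"))
```

Folding in this way trades precision for recall: a query for peste will also match pește, so a production system would probably combine folding with the context-sensitive analysis discussed in the next chapter.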
4 LANGUAGE TECHNOLOGY SUPPORT FOR ROMANIAN

Language technology is used to develop software systems designed to handle human language and is therefore often called human language technology. Human language comes in spoken and written forms. While speech is the oldest and, in terms of human evolution, the most natural form of language communication, complex information and most human knowledge is stored and transmitted through the written word. Speech and text technologies process or produce these different forms of language, using dictionaries, rules of grammar, and semantics. This means that language technology (LT) links language to various forms of knowledge, independently of the medium (speech or text) in which it is expressed. Figure 1 illustrates the LT landscape.

When we communicate, we combine language with other modes of communication and information media; for example, speaking can involve gestures and facial expressions. Digital texts link to pictures and sounds. Movies may contain language in spoken and written form. In other words, speech and text technologies overlap and interact with other multimodal communication and multimedia technologies.

In this section, we will discuss the main application areas of language technology, i.e., language checking, web search, speech interaction, and machine translation. These applications and basic technologies include

- spelling correction
- authoring support
- computer-assisted language learning
- information retrieval
- information extraction
- text summarisation
- question answering
- speech recognition
- speech synthesis

Language technology is an established area of research with an extensive set of introductory literature. The interested reader is referred to the following works: [18, 19, 20, 21, 22].

Before discussing the above application areas, we will briefly describe the architecture of a typical LT system.

4.1 APPLICATION ARCHITECTURES

Software applications for language processing typically consist of several components that mirror different aspects of language. While such applications tend to be very complex, figure 2 shows a highly simplified architecture of a typical text processing system. The first three modules handle the structure and meaning of the input text:

1. Pre-processing: cleans the data, analyses or removes formatting, detects the input languages, and so on.
2. Grammatical analysis: finds the verb, its objects, modifiers and other sentence elements; detects the sentence structure.
3. Semantic analysis: performs disambiguation (i.e., computes the appropriate meaning of words in a given context); resolves anaphora (i.e., which pronouns refer to which nouns in the sentence); represents the meaning of the sentence in a machine-readable way.

After analysing the text, task-specific modules can perform other operations, such as automatic summarisation and database look-ups.

[Figure 1: Language technologies (Speech Technologies, Text Technologies, Language Technologies, Knowledge Technologies, Multimedia & Multimodality Technologies)]

In the remainder of this section, we first introduce the core application areas for language technology, and follow this with a brief overview of the state of LT research and education today, and a description of past and present research programmes. Finally, we present an expert estimate of core LT tools and resources for Romanian in terms of various dimensions such as availability, maturity and quality. The general situation of LT for the Romanian language is summarised in figure 7 (p. 72) at the end of this chapter. This table lists all tools and resources that are boldfaced in the text. LT support for Romanian is also compared to other languages that are part of this series.

4.2 CORE APPLICATION AREAS

In this section, we focus on the most important LT tools and resources, and provide an overview of LT activities in Romania and the Republic of Moldova.

[Figure 2: A typical text processing architecture (Input Text → Pre-processing → Grammatical Analysis → Semantic Analysis → Task-specific Modules → Output)]
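The simplified architecture of figure 2 (pre-processing, grammatical analysis, semantic analysis, task-specific modules) can be sketched as a chain of functions that each enrich a shared document object. This is only an illustration under strong assumptions: the `Document` class and all stage functions are hypothetical, and the grammatical and semantic stages are placeholders standing in for real taggers, parsers and disambiguation components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    text: str
    tokens: List[str] = field(default_factory=list)
    annotations: Dict[str, object] = field(default_factory=dict)


def preprocess(doc: Document) -> Document:
    """Clean the input: collapse whitespace and split into tokens."""
    doc.text = " ".join(doc.text.split())
    doc.tokens = doc.text.split(" ") if doc.text else []
    return doc


def grammatical_analysis(doc: Document) -> Document:
    """Placeholder for a tagger/parser: only records capitalised tokens."""
    doc.annotations["capitalised"] = [t for t in doc.tokens if t[:1].isupper()]
    return doc


def semantic_analysis(doc: Document) -> Document:
    """Placeholder for disambiguation and anaphora resolution."""
    doc.annotations["meaning"] = None
    return doc


def token_count(doc: Document) -> Document:
    """A trivial task-specific module."""
    doc.annotations["token_count"] = len(doc.tokens)
    return doc


# The stages run in the order shown in figure 2.
PIPELINE: List[Callable[[Document], Document]] = [
    preprocess,
    grammatical_analysis,
    semantic_analysis,
    token_count,
]


def run(text: str) -> Document:
    doc = Document(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = run("  Maria   a cumparat  bilet ")
    print(result.tokens, result.annotations)
```

Keeping each stage as a function with the same signature makes it easy to swap a placeholder for a real component without touching the rest of the chain.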
4.2.1 Language Checking

Anyone who has used a word processor such as Microsoft Word knows that it has a spell checker that highlights spelling mistakes and proposes corrections. The first spelling correction programs compared a list of extracted words against a dictionary of correctly spelled words. Today these programs are far more sophisticated. Using language-dependent algorithms for grammatical analysis, they detect errors related to morphology (e.g., plural formation) as well as syntax-related errors, such as a missing verb or a verb-subject disagreement (e.g., she *write a letter). However, most spell checkers will not find any errors in the following text [54]:

    I have a spelling checker,
    It came with my PC.
    It plane lee marks four my revue
    Miss steaks aye can knot sea.

Handling these kinds of errors usually requires an analysis of the context, e.g., for deciding if a word needs to be written with or without a hyphen in Romanian, as in:

(10) Plouă întruna de ieri.
     It has been raining continuously since yesterday.

(11) Într-una din zile am să merg la Paris.
     One of these days I will go to Paris.

This type of analysis either needs to draw on language-specific grammars laboriously coded into the software by experts, or on a statistical language model (see Fig. 3). In this case, a model calculates the probability of a particular word as it occurs in a specific position (e.g., between the words that precede and follow it). For example, într-una din zile is a much more probable word sequence than într-una de ieri, and plouă întruna is more frequent than plouă într-una; therefore, in the second case, the spelling without a hyphen is recommended.

A statistical language model can be automatically created by using a large amount of (correct) language data (called a text corpus). However, there are cases when not even this is of any help:

(12) Plouă întruna din primele zile ale lui martie.
     It has been raining continuously since the first days of March.

(13) Ploua într-una din primele zile ale lui martie.
     It was raining on one of the first days of March.

The only discriminating element here is the verb. In the first sentence it is in the present tense, with a durative meaning. In the latter, it is in the past tense. Only the part-of-speech tag has discriminative value in such examples.

Up to now, these approaches have mostly been developed and evaluated on data from English. Neither approach can transfer straightforwardly to Romanian because the latter has richer inflection and many particular constructions.

Language checking is not limited to word processors; it is also used in authoring support systems, i.e., software environments in which manuals and other types of technical documentation for complex IT, healthcare, engineering and other products are written. To offset customer complaints about incorrect use and damage claims resulting from poorly understood instructions, companies are increasingly focusing on the quality of technical documentation while targeting the international market (via translation or localisation) at the same time. Advances in natural language processing have led to the development of authoring support software, which helps the writer of technical documentation to use vocabulary and sentence structures that are consistent with industry rules and (corporate) terminology restrictions.

Nowadays there are no Romanian companies or Language Service Providers offering products in this area,
although researchers in different natural language processing groups have developed language models tailored to the particularities of the Romanian language. At the Research Institute for Artificial Intelligence within the Romanian Academy (RACAI), language models for Romanian are created from large corpora. Because most of the Romanian texts on the Web are written with no diacritics, RACAI has also developed a diacritics recovery facility [23], intended to indicate the right diacritics form of a word initially written with no diacritics, using a large Romanian lexicon developed by their team and a character-based 5-gram model to find the most probable interpretation in terms of diacritic occurrences for an unknown word. The approach takes into account the context surrounding the word in a preliminary process of part-of-speech tagging, which is critical for choosing the right word form in the lexicon. For instance, the word peste is transformed into pește (fish) in the example below:

(14) Am cumparat peste.
     I have bought fish.

but it is kept as peste (over) in:

(15) Era un pod peste rau.
     There was a bridge over the river.

This decision is based on the previous step of part-of-speech tagging, in which peste in the first example is annotated with a noun tag and the same word in the second example is annotated with a preposition tag.

[Figure 3: Language checking (top: statistical; bottom: rule-based). Components: Input Text, Spelling Check, Grammar Check, Correction Proposals, Statistical Language Models]

In Romanian, at least 30% of the words in a sentence use diacritic signs, with an average of 1.16 diacritic signs per word. Only approx. 12% of these words can be immediately transformed into their diacritic version (since their non-diacritic form is not a valid word in the Romanian language dictionary). For the rest of the words, the diacritic recovery program is useful.

Another important step ahead is the collection of reusable linguistic resources for the Romanian language, containing about 1,000,000 inflected Romanian word forms, with morphological information, definitions, synonyms, Romanian–Russian and Romanian–English translations, offered by the Institute of Mathematics and Computer Science of the Academy of Sciences of the Republic of Moldova and freely accessible [24].

Besides spell checkers and authoring support, language checking is also important in the field of computer-assisted language learning. Language checking applications also automatically correct search engine queries, as found in Google's "Did you mean…" suggestions.

4.2.2 Web Search

Searching the Web, intranets or digital libraries is probably the most widely used, yet largely underdeveloped
language technology application today. The Google search engine, which started in 1998, now handles about 80% of all search queries [25]. The Google search interface and results page display have not significantly changed since the first version. However, in the current version, Google offers spelling correction for misspelled words and incorporates basic semantic search capabilities that can improve search accuracy by analysing the meaning of terms in a search query context [26]. The Google success story shows that a large volume of data and efficient indexing techniques can deliver satisfactory results using a statistical approach to language processing.

For more sophisticated information requests, it is essential to integrate deeper linguistic knowledge to facilitate text interpretation. Experiments using lexical resources such as machine-readable thesauri or ontological language resources (e.g., WordNet for English or the Romanian WordNet [27]) have demonstrated improvements in finding pages using synonyms of the original search terms, such as energie atomică [atomic energy] or energie nucleară [atomic power or nuclear energy], or even more loosely related terms.

The next generation of search engines will have to include much more sophisticated language technology, especially to deal with search queries consisting of a question or other sentence type rather than a list of keywords. For the query "Give me a list of all companies that were taken over by other companies in the last five years", a syntactic as well as semantic analysis is required. The system also needs to provide an index to quickly retrieve relevant documents. A satisfactory answer will require syntactic parsing to analyse the grammatical structure of the sentence and determine that the user wants companies that have been acquired, rather than companies that have acquired other companies. For the expression last five years, the system needs to determine the relevant range of years, taking into account the present year. The query then needs to be matched against a huge amount of unstructured data to find the pieces of information that are relevant to the user's request. This process is called information retrieval, and involves searching and ranking relevant documents. To generate a list of companies, the system also needs to recognise that a particular string of words in a document represents a company name, using a process called named entity recognition. A more demanding challenge is matching a query in one language with documents in another language. Cross-lingual information retrieval involves automatically translating the query into all possible source languages and then translating the results back into the user's target language.

Now that data is increasingly found in non-textual formats, there is a need for services that deliver multimedia information retrieval by searching images, audio files and video data. In the case of audio and video files, a speech recognition module must convert the speech content into text (or into a phonetic representation) that can then be matched against a user query.

In Romania, natural language-based search technologies are not considered for industrial applications yet. Instead, open source based technologies like Lucene are often used by search-focused companies to provide the basic search infrastructure. However, research groups from Alexandru Ioan Cuza University of Iași (UAIC) and RACAI have developed different modules that constitute the backbone of a semantic search tool, such as part-of-speech taggers, syntactic parsers, semantic parsers, named-entity recognisers, indexing tools, multimedia information retrieval, etc. However, their coverage and outreach are fairly limited so far.
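The combination of indexing, synonym lookup and ranking described in this section can be sketched in a few lines. The following is a toy illustration, not the method of Google, Lucene or the Romanian WordNet: the `SYNONYMS` table is a hypothetical two-entry stand-in for a real thesaurus, and ranking simply counts how many query terms (or their synonyms) a document contains.

```python
from collections import defaultdict

# Hypothetical mini-thesaurus; a real system would consult a lexical
# resource such as a wordnet instead.
SYNONYMS = {
    "atomică": {"nucleară"},
    "nucleară": {"atomică"},
}


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Rank documents by the number of query terms (or synonyms) they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        variants = {term} | SYNONYMS.get(term, set())
        hits = set()
        for variant in variants:
            hits |= index.get(variant, set())
        for doc_id in hits:
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    docs = {
        1: "energie nucleară",
        2: "energie solară",
        3: "centrală atomică",
    }
    print(search(build_index(docs), "energie atomică"))
```

In this sketch, document 1 ranks first for the query energie atomică even though it never contains the word atomică, which is the kind of improvement the thesaurus experiments mentioned above aim for.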
Web Pages

Pre-processing Semantic Processing Indexing

Matching
&
Relevance

Pre-processing Query Analysis

User Query Search Results

4: Web search

At RACAI, a part-of-speech tagger able to identify the lemma (dictionary form) and the part of speech of words in texts is available as a web service [28]. For instance, if the user's query for a web search contains evenimente (events), the root (or lemmatised form) of the word, i.e., eveniment (event), can be used for the search instead [29].

Another module, developed by researchers at both UAIC and RACAI, is a named-entity recogniser which, given a text containing persons, companies, organisations, events, etc. (all referred to as named entities), identifies these entities in the text. For the example:

(16) Maria și-a luat bilet la concertul trupei din vară de la Paris.
     Mary bought a ticket for the band's concert this summer in Paris.

this system recognises Maria as a female person, this summer as a temporal reference, and Paris as a place.

A semantic parser developed at UAIC [30] is also available for the Romanian language, being able to identify, in a given sentence, the different roles entities play. For instance, for the sentence above, the system identifies Maria as the doer of the action and a ticket for the band's concert as the good being purchased. Similarly, in the example below:

(17) Maria și-a luat fără ezitare bilet pentru a-și vedea trupa preferată.
     Mary bought a ticket without hesitation to see her favourite band.

without hesitation represents the manner in which Mary bought the ticket, and to see her favourite band represents the reason for the acquisition of her ticket. This system was developed on the basis of a corpus of annotated semantic roles [31], built in order to align the Romanian language to the semantic resources existing for English.

Recently, a group of researchers at UAIC have tackled automatic image detection and annotation, in order to develop a web search image tool [32]. However, this system is still in an incipient stage.
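The use of lemmas in web search mentioned above (evenimente reduced to eveniment) can be illustrated with a minimal sketch. The lemma lexicon, the documents and all function names below are hypothetical and chosen only for illustration; a real system would obtain lemmas from a tagger such as the RACAI web service rather than from a hand-written table.

```python
# Toy lemma lexicon: an assumption for illustration, not the output of any real tagger.
LEMMAS = {
    "evenimente": "eveniment",
    "evenimentele": "eveniment",
    "concerte": "concert",
    "concertul": "concert",
    "bilete": "bilet",
}


def lemmatise(token: str) -> str:
    """Return the dictionary form of a token, or the lower-cased token if unknown."""
    return LEMMAS.get(token.lower(), token.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lemma to the ids of the documents containing one of its forms."""
    index: dict[str, set[str]] = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(lemmatise(token), set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents that contain every lemma of the query."""
    lemmas = [lemmatise(token) for token in query.split()]
    if not lemmas:
        return set()
    return set.intersection(*(index.get(lemma, set()) for lemma in lemmas))


# Hypothetical mini-collection of documents.
DOCS = {
    "d1": "concertul trupei",
    "d2": "evenimente culturale",
    "d3": "eveniment sportiv",
}
INDEX = build_index(DOCS)
```

With this index, a query for the plural evenimente retrieves both the document containing the plural and the one containing the singular, which a purely string-based match would miss.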

4.2.3 Speech Interaction

Speech interaction is one of the many application areas that depend on speech technology, i.e., technologies for processing spoken language. Speech interaction technology is used to create interfaces that enable users to interact in spoken language instead of using a graphical display, keyboard and mouse. Today, these voice user interfaces (VUI) are used for partially or fully automated telephone services provided by companies to customers, employees or partners. Business domains that rely heavily on VUIs include banking, supply chain, public transportation, and telecommunications. Other uses of speech interaction technology include interfaces to car navigation systems and the use of spoken language as an alternative to the graphical or touchscreen interfaces in smartphones.

Speech interaction technology comprises four technologies:

1. Automatic speech recognition (ASR) determines which words are actually spoken in a given sequence of sounds uttered by a user.
2. Natural language understanding analyses the syntactic structure of a user's utterance and interprets it according to the system in question.
3. Dialogue management determines which action to take given the user's input and the system functionality.
4. Speech synthesis (text-to-speech or TTS) transforms the system's reply into sounds for the user.

One of the major challenges of ASR systems is to accurately recognise the words a user utters. This means restricting the range of possible user utterances to a limited set of keywords, or manually creating language models that cover a large range of natural language utterances. Using machine learning techniques, language models can also be generated automatically from speech corpora, i.e., large collections of speech audio files and text transcriptions. Restricting utterances usually forces people to use the voice user interface in a rigid way and can damage user acceptance; but the creation, tuning and maintenance of rich language models will significantly increase costs. VUIs that employ language models and initially allow users to express their intent more flexibly, prompted by a "How may I help you?" greeting, tend to be automated and are better accepted by users.

Companies tend to use utterances pre-recorded by professional speakers for generating the output of the voice user interface. For static utterances where the wording does not depend on particular contexts of use or personal user data, this can deliver a rich user experience. But more dynamic content in an utterance may suffer from unnatural intonation because different parts of audio files have simply been strung together. Through optimisation, today's TTS systems are getting better at producing natural-sounding dynamic utterances.

Interfaces in speech interaction have been considerably standardised during the last decade in terms of their various technological components. There has also been a strong market consolidation in speech recognition and speech synthesis. The national markets in the G20 countries (economically resilient countries with high populations) have been dominated by just five global players, with Nuance (USA) and Loquendo (Italy) being the most prominent players in Europe. In 2011, Nuance announced the acquisition of Loquendo, which represents a further step in market consolidation.

The speech recognition and analysis field is one of the less represented in Romania. On the Romanian TTS market, there are solutions commercialised by international companies (like MBROLA or IVONA), but with reduced accuracy and fluency. Car equipment and telecommunications companies, such as Continental and Orange, have recently started to allocate resources for specialised departments for speech processing, adapting existing solutions to their specific needs.
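The four components of speech interaction technology listed in this section (recognition, understanding, dialogue management, synthesis) can be pictured as a chain of functions. The sketch below is only an illustration under strong simplifying assumptions: the recogniser and the synthesiser are stand-ins that operate on text rather than audio, understanding is reduced to keyword spotting (the "limited set of keywords" option described above), and the intents, keywords and replies are all hypothetical.

```python
# Hypothetical keyword-to-intent table for a banking VUI (illustration only).
KEYWORDS = {"balance": "CHECK_BALANCE", "transfer": "MAKE_TRANSFER"}

# Hypothetical system replies for each intent.
REPLIES = {
    "CHECK_BALANCE": "Your balance will be read out next.",
    "MAKE_TRANSFER": "Which account would you like to transfer to?",
}


def recognise(transcript: str) -> list[str]:
    """Stand-in for ASR: assumes the audio has already been transcribed to text."""
    cleaned = transcript.lower().replace("?", "").replace(".", "").replace(",", "")
    return cleaned.split()


def understand(words: list[str]) -> str:
    """Keyword spotting: return the intent of the first known keyword."""
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    return "UNKNOWN"


def manage_dialogue(intent: str) -> str:
    """Choose the system's reply; fall back to a re-prompt for unknown intents."""
    return REPLIES.get(intent, "Sorry, I did not understand. How may I help you?")


def synthesise(reply: str) -> bytes:
    """Stand-in for TTS: returns UTF-8 bytes instead of an audio signal."""
    return reply.encode("utf-8")


def handle_turn(transcript: str) -> bytes:
    """Run one user turn through all four stages in order."""
    return synthesise(manage_dialogue(understand(recognise(transcript))))
```

Replacing the keyword table with a statistical language model would make the interface more flexible, at the cost of the additional creation and maintenance effort discussed above.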

[Figure 5: Speech-based dialogue system. Components shown: Speech Input, Signal Processing, Recognition, Natural Language Understanding & Dialogue, Phonetic Lookup & Intonation Planning, Speech Synthesis, Speech Output]

On the other hand, research in this direction is performed at the University of Bucharest and at the Institute for Computer Science within the Romanian Academy, Iași Branch. Most researchers focus on text-to-speech synthesis, while the speech interpretation area is not so well developed yet.

Speech interaction is the basis for interfaces that allow a user to interact with spoken language.

Looking ahead, there will be significant changes, due to the spread of smartphones as a new platform for managing customer relationships, in addition to landline phones, internet and e-mail. This will also affect the way in which speech interaction technology is used. In the long run, there will be fewer telephone-based VUIs, and spoken language apps will play a far more central role as a user-friendly input for smartphones. This will be largely driven by stepwise improvements in the accuracy of speaker-independent speech recognition via the speech dictation services already offered as centralised services to smartphone users.

4.2.4 Machine Translation

The idea of using digital computers to translate natural languages goes back to 1946 and was followed by substantial funding for research during the 1950s and again in the 1980s. Yet machine translation (MT) still cannot meet its initial promise of across-the-board automated translation.

The most basic approach to machine translation is the automatic replacement of the words in a text written in one natural language with the equivalent words of another language. This can be useful in subject domains that have a very restricted, formulaic language such as weather reports. However, in order to produce a good translation of less restricted texts, larger text units (phrases, sentences or even whole passages) need to be matched to their closest counterparts in the target language.

The major difficulty is that human language is ambiguous. Ambiguity creates challenges on multiple levels, such as word sense disambiguation at the lexical level (a jaguar is a brand of car or an animal) or the prepositional phrase attachment at the syntactic level, for example:

(18) Polițistul a văzut omul cu telescopul.
     The policeman saw the man with the telescope.

(19) Polițistul a văzut omul cu arma.
     The policeman saw the man with the gun.

One way to build an MT system is to use linguistic rules. For translations between closely related languages, a translation using direct substitution may be feasible in cases such as the above example. However, rule-based

[Figure 6: Machine translation (left: statistical; right: rule-based). Components shown: Source Text, Text Analysis (Formatting, Morphology, Syntax, etc.), Statistical Machine Translation, Translation Rules, Text Generation, Target Text]

(or linguistic knowledge-driven) systems often analyse the input text and create an intermediary symbolic representation from which the target language text can be generated.

The success of these methods is highly dependent on the availability of extensive lexicons with morphological, syntactic, and semantic information, and large sets of grammar rules carefully designed by skilled linguists. This is a very long and therefore costly process.

In the late 1980s, when computational power increased and became cheaper, interest in statistical models for machine translation began to grow. Statistical models are derived from analysing bilingual text corpora (parallel corpora), such as the Europarl parallel corpus, which contains the proceedings of the European Parliament in 21 European languages.

Given enough data, statistical MT works well enough to derive an approximate meaning of a foreign language text by processing parallel versions and finding plausible patterns of words. Unlike knowledge-driven systems, however, statistical (or data-driven) MT systems often generate ungrammatical output. Data-driven MT is advantageous because less human effort is required and it can also cover special particularities of the language (e.g., idiomatic expressions) that are often ignored in knowledge-driven systems.

The strengths and weaknesses of knowledge-driven and data-driven machine translation tend to be complementary, so that nowadays researchers focus on hybrid approaches that combine both methodologies. One such approach uses both knowledge-driven and data-driven systems, together with a selection module that decides on the best output for each sentence. However, results for sentences longer than, say, 12 words will often be far from perfect. A more effective solution is to combine the best parts of each sentence from multiple outputs; this can be fairly complex, as corresponding parts of multiple alternatives are not always obvious and need to be aligned.

There is still a huge potential for improving the quality of MT systems. The challenges involve adapting language resources to a given subject domain or user area, and integrating the technology into workflows that already have term bases and translation memories. Another problem is that most of the current systems are English-centred and only support a few languages from and into Romanian. This leads to friction in the translation workflow and forces MT users to learn different lexicon coding tools for different systems.

Evaluation campaigns help to compare the quality of MT systems, the different approaches and the status of the systems for different language pairs. Figure 7 (p. 29), which was prepared during the Euromatrix+ project, shows the pair-wise performances obtained for 22 of the 23 official EU languages (Irish was not compared). The results are ranked according to a BLEU

score, which indicates higher scores for better translations [33]. A human translator would normally achieve a score of around 80 points.

The best results (in green and blue) were achieved for languages that benefit from considerable research effort in coordinated programmes and the existence of many parallel corpora (e.g., English, French, Dutch, Spanish and German). The languages with poorer results are shown in red. These languages either lack such development efforts or are structurally very different from other languages (e.g., Hungarian, Maltese and Finnish).

The machine translation field is among the most attractive fields in language technologies in the eyes of industry. Thus, companies such as Language Weaver work on translating from/to Romanian using various linguistic techniques. The major online translation systems include Romanian as both source and target language, and a multitude of online dictionaries are available for Romanian.

At its basic level, Machine Translation simply substitutes words in one natural language with words in another language.

Important research efforts were and continue to be dedicated to Machine Translation with Romanian as a source or target language by Romanian researchers from different centres. Good results are reported for a Statistical Machine Translation experiment on the English-Romanian pair, in comparison with the contemporary performance of Google Translate for the same pair [35].

Moreover, RACAI already has 5 years of experience in experimenting with different MT approaches, such as Example-Based Machine Translation, Statistical Machine Translation, extracting Machine Translation data from comparable corpora, etc. Two PhD theses, accompanied by various papers and supported by different national or international projects like STAR and ACCURAT, are dedicated to this field [36, 37].

4.3 OTHER APPLICATION AREAS

Building language technology applications involves a range of subtasks that do not always surface at the level of interaction with the user, but they provide significant service functionalities behind the scenes of the system in question. They all form important research issues that have now evolved into individual sub-disciplines of computational linguistics.

Question answering, for example, is an active area of research for which annotated corpora have been built and scientific competitions have been initiated. The concept of question answering goes beyond keyword-based searches (in which the search engine responds by delivering a collection of potentially relevant documents) and enables users to ask a concrete question to which the system provides a single answer. For example:

Question: How old was Neil Armstrong when he stepped on the moon?
Answer: 38.

While question answering is obviously related to the core area of web search, it is nowadays an umbrella term for such research issues as which different types of questions exist, and how they should be handled; how a set of documents that potentially contain the answer can be analysed and compared (do they provide conflicting answers?); and how specific information (the answer) can be reliably extracted from a document without ignoring the context.

Question answering is in turn related to information extraction (IE), an area that was extremely popular and influential when computational linguistics took a statistical turn in the early 1990s. IE aims to identify specific pieces of information in specific classes of documents, such as the key players in company takeovers as

reported in newspaper stories. Another common scenario that has been studied is reports on terrorist incidents. The task here consists of mapping appropriate parts of the text to a template that specifies the perpetrator, target, time, location and results of the incident. Domain-specific template-filling is the central characteristic of IE, which makes it another example of a "behind the scenes" technology that forms a well-demarcated research area, which in practice needs to be embedded into a suitable application environment.

Text summarisation and text generation are two borderline areas that can act either as standalone applications or play a supporting role. Summarisation attempts to give the essentials of a long text in a short form, and is one of the features available in Microsoft Word. It mostly uses a statistical approach to identify the "important" words in a text (i.e., words that occur very frequently in the text in question but less frequently in general language use) and determine which sentences contain the most of these "important" words. These sentences are then extracted and put together to create the summary. In this very common commercial scenario, summarisation is simply a form of sentence extraction, and the text is reduced to a subset of its sentences.

A drawback of this approach is that it ignores the referential expressions that could occur in the initial text and be kept in the summary. Thus, due to sentence elimination, their antecedents may no longer be present, resulting in an incomprehensible text. For example, consider the following text to be summarised:

Hercules, of all of Zeus's illegitimate children, seemed to be the focus of Hera's anger. She sent a two-headed serpent to attack him when he was just an infant.

The summary of this very short fragment, using the sentence elimination method, could be:

She sent a two-headed serpent to attack him.

which is really incomprehensible if no explanation is provided of who she and him refer to. One way to increase the coherence of such summaries is to first derive the discourse structure of the text and to guide the selection of the sentences to be included in the summary by a score that considers both the relevance of the sentence in a discourse tree and the coherence of the text, as given by solving anaphoric references [38]. For the summary example above, solving anaphoric references means identifying she as Hera and him as Hercules. Thus, the provided summary becomes readable:

Hera sent a two-headed serpent to attack Hercules.

The UAIC summariser adopted this method, yielding good summaries for relatively short initial texts [39]. This direction is further developed at UAIC by introducing semantic information in the automatic process of summary building [40].

An alternative approach, for which some research has been carried out, is to generate brand new sentences that do not exist in the source text.

Language technology applications often provide significant service functionalities behind the scenes of larger software systems.

This requires deeper understanding of the text, which means that so far this approach is far less robust. This method can also be applied in the case of very large texts, such as a whole novel, where neither the determination of the most significant sentences based on occurrences of frequent words, nor building discourse structures could be of help. In these cases, other methods could be applied, mainly expanding a collection of predefined flexible summary patterns (based for instance on the genre of the novel, or on some data on the main characters of the novel, a time and place positioning, and a rather shallow sketch of the initiation of the action).
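The statistical sentence-extraction approach to summarisation described in this section can be sketched in a few lines. This is a simplified illustration, not the method of any particular product or of the UAIC summariser: as an assumption, a small hand-written stop-word list stands in for "words that are frequent in general language use", and all function names are hypothetical. The sketch deliberately does no anaphora resolution, so it can produce exactly the kind of dangling pronouns discussed above.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list approximates words that are frequent in general use.
STOPWORDS = {
    "a", "an", "the", "of", "to", "in", "and", "be", "was", "all",
    "he", "she", "him", "her", "when", "just", "seemed",
}


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenise(sentence: str) -> list[str]:
    """Lower-case the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())


def summarise(text: str, n: int = 1) -> str:
    """Keep the n sentences whose content words are most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(
        word for s in sentences for word in tokenise(s) if word not in STOPWORDS
    )
    # Score each sentence by the summed text frequency of its content words.
    scored = [
        (sum(freq[w] for w in tokenise(s) if w not in STOPWORDS), i)
        for i, s in enumerate(sentences)
    ]
    # Highest score first; ties are broken in favour of earlier sentences.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(i for _, i in best))
```

A discourse-aware summariser of the kind described above would add a second scoring component for coherence and would rewrite pronouns whose antecedents were dropped.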

On the whole, a text generator is rarely used as a standalone application but is embedded into a larger software environment, such as a clinical information system that collects, stores and processes patient data. Creating reports is just one of many applications for text summarisation.

For Romanian, research in most text technologies is much less developed than for English.

For the Romanian language, research in these text technologies is much less developed than for the English language. Question answering, information extraction, and summarisation have been the focus of numerous open competitions in the USA since the 1990s, primarily organised by the government-sponsored organisations DARPA and NIST. These competitions have significantly improved the state of the art, but their focus has mostly been on the English language. However, Romanian teams from UAIC and RACAI have participated since 2006 in question answering competitions, with good results [41]. The main remaining drawback is the small size of annotated corpora or other resources for these tasks. Summarisation systems, when using purely statistical methods, are often to a good extent language-independent, and thus prototypes are also available for Romanian. At UAIC, a summarisation tool based on discourse structure and anaphora resolution, developed for Romanian texts, is available.

Adjacent domains recently addressed by Romanian research teams include computational lexicology, e-learning, and sentiment/opinion analysis.

A consortium of five research institutes and one university (UAIC) has recently been involved in transforming the Thesaurus Dictionary of the Romanian Language (about 35 volumes, from 1913 onwards) into electronic form. The main objective was to transform the approx. 13,000 pages of the Dictionary into a structured electronic form, allowing complex searches, but also much easier editing and continuous updating [42].

More useful access to the lexicographic material of a language is facilitated by semantic networks in the form of wordnets. The Romanian WordNet has been under development for eight years and has more than 57,000 synsets in which almost 60,000 literals occur. They are distributed over four parts of speech: nouns, verbs, adjectives and adverbs. Each synset contains a set of words (with associated sense numbers) that are synonyms. The synsets are the nodes of the network, while its arcs are the semantic relations between synsets: hyponymy (the is-a relation), meronymy, entailment, cause, and others. The Romanian WordNet is aligned to the Princeton WordNet [43], the oldest and largest wordnet. The synsets have DOMAINS labels: each synset is labelled with the name of the domain in which it is used. Moreover, the Romanian WordNet is aligned to the largest freely available ontology, SUMO&MILO [44]. It is also used in various applications developed for Romanian: Question Answering, Word Sense Disambiguation, Machine Translation.

An application developed at the Human Language Engineering Laboratory is an experimental database of word associations for the Romanian vocabulary [45]. One essential task for cognitive scientists is to map out the rich networks of associations that exist between words. Such a network is of great importance for several fields such as natural language processing, computational linguistics, lexicography and others.

A different domain in which UAIC researchers have been involved is e-learning, by incorporating multilingual language technology tools and semantic web techniques for improving the retrieval of learning material. The developed technology facilitates personalised access to knowledge within learning management systems and supports cooperation in content management.
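Returning to the wordnet structure described earlier in this section (synsets as nodes, semantic relations as arcs, one domain label per synset), a minimal data-structure sketch might look as follows. The class, the three sample synsets, their identifiers and their domain labels are invented for illustration and are not taken from the Romanian WordNet; the sketch also stores the hypernym direction of the is-a relation, which is simply the inverse of hyponymy.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One node of the network: a set of synonymous literals with sense numbers."""
    synset_id: str
    pos: str                                  # noun, verb, adjective or adverb
    literals: list[tuple[str, int]]           # (word, sense number) pairs
    domain: str = ""                          # DOMAINS-style label
    relations: dict[str, list[str]] = field(default_factory=dict)  # relation name -> target ids


# Hypothetical toy network (illustration only, not real Romanian WordNet data).
NETWORK = {
    "s1": Synset("s1", "noun", [("animal", 1)], "biology"),
    "s2": Synset("s2", "noun", [("câine", 1)], "zoology", {"hypernym": ["s3"]}),
    "s3": Synset("s3", "noun", [("mamifer", 1)], "zoology", {"hypernym": ["s1"]}),
}


def hypernym_chain(network: dict[str, Synset], synset_id: str) -> list[str]:
    """Follow the first hypernym arc upwards until the top of the hierarchy."""
    chain: list[str] = []
    seen = {synset_id}
    current = synset_id
    while True:
        parents = network[current].relations.get("hypernym", [])
        if not parents or parents[0] in seen:  # stop at the root or on a cycle
            break
        current = parents[0]
        seen.add(current)
        chain.append(current)
    return chain
```

Alignment to another wordnet or to an ontology could be represented in the same way, as additional relation names whose targets are identifiers in the external resource.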

The newest domain of interest in the natural language processing field is sentiment/opinion analysis: given a text, the software identifies whether the text has a positive or negative emotional load. Research in this direction, for the Romanian language, started at RACAI with the use of SentiWordNet, a sentiment annotation of WordNet [46]. At UAIC, research in this direction involved collaboration with a private organisation, Intelligentics, in order to develop a system able to monitor the Web and extract users' opinions (forums, blogs, social networks, etc.) about different products [47]. At the Human Language Engineering Laboratory in the Republic of Moldova, the work on sentiment analysis led to the translation of WordNet-Affect [48], which contains information about the emotions that words convey, into Romanian and Russian. WordNet-Affect has been developed starting from the WordNet lexical knowledge base. Affective labels were manually assigned to WordNet synsets for nouns, adjectives, verbs and adverbs which convey affective meaning. Words labelled with the affective tag were further divided into six emotional categories: joy, fear, anger, sadness, disgust and surprise. WordNet-Affect is freely available for research purposes [50].

4.4 EDUCATIONAL PROGRAMMES

Language technology is a very interdisciplinary field that involves the combined expertise of linguists, computer scientists, mathematicians, philosophers, psycholinguists, and neuroscientists among others. As a result, it has not acquired a clear, independent existence in the Romanian faculty system. Many universities in Romania and in the Republic of Moldova recently introduced natural language processing and computational linguistics courses at bachelor, master and PhD level. In 2001, a master's programme in computational linguistics was initiated as part of the Faculty of Computer Science at the Alexandru Ioan Cuza University of Iași. Still, a consolidated higher education system in natural language processing and computational linguistics is yet to be established.

The most representative centres in computational linguistics dealing with the Romanian language are in Bucharest, Iași, Cluj, Timișoara and Craiova, in Romania, and Chișinău in the Republic of Moldova. Among the multitude of universities and research centres where teams work in this domain, we can mention the Romanian Academy Research Institute for Artificial Intelligence in Bucharest; the Romanian Academy Institute for Computer Science in Iași; the Department of Computer Science at the Alexandru Ioan Cuza University of Iași; the Faculty of Mathematics-Informatics of the Babeș-Bolyai University of Cluj-Napoca; the Institute of Mathematics and Computer Science, Academy of Sciences of the Republic of Moldova; the Human Language Engineering Laboratory within the Applied Informatics Department, Faculty of Computers, Informatics and Microelectronics at the Technical University of the Republic of Moldova, etc. Some of these centres work together in national and international projects in the LT domain.

The common meeting points of most researchers in the LT domain are, besides international conferences abroad, a series of international events that intend to bring together young students and mature professionals, linguists and computer scientists, which are held periodically in Romania: the ConsILR events (Consortium for the Digitalisation of Romanian Language) [51], the EUROLAN series of international summer schools, the SPED conferences (Speech Technology and Human-Computer Dialogue), the KEPT conferences (Knowledge Engineering: Principles and Techniques), ECIT (the European Conferences on Intelligent Systems and Technologies), etc.

Computational linguistics is an exotic topic and is located either in the computer science faculties or in the humanities, focusing therefore either on the linguistic aspects or on the engineering ones, with the research topics only partially overlapping. Another major drawback of this landscape is the minor involvement of ICT companies in LT research (although they have recently begun to be more present in educational life).

4.5 NATIONAL PROJECTS AND INITIATIVES

The industries using and providing LT in Romania are certainly important and vital (BitDefender, Continental, Nokia, etc.), but better cooperation between them and the research institutes and universities is necessary, as the latter are the most actively involved in research in this domain. An important issue is the esoteric character of LT, which could be addressed through a good marketing strategy. The language industry is not a significant employer in Romania, as rather few companies working in the Information and Communication Technology (ICT) domain have developed LT departments.

Previous national programmes have led to an initial impulse, but subsequent financial aid, either missing or not attractive enough, has led to a loss of interest from major ICT players and from young researchers trained by universities and the Academy. One of the programmes of collaboration between industry and education that has had a good impact and good results in Romania is the MSDN Academic Alliance, offering students free access to different Microsoft technologies.

The main research laboratories conducting activities in LT in Romania are RACAI in the Romanian Academy, Bucharest; the Department of Computer Science of the Alexandru Ioan Cuza University in Iași; and the Institute of Computer Science of the Romanian Academy, also in Iași, which hosts the Voiced Sounds of Romanian Language, an online repository of recorded Romanian voices. As for research programmes, UAIC and RACAI have been involved in several national or international research programmes, intended to develop existing or new language technologies. Among these, some European funded projects are worth mentioning: ACCURAT-RO (Analysis and evaluation of Comparable Corpora for Under Resourced Areas of machine Translation), See-ERANET (Machine Translation Systems for Balkan Languages), the FP7 project CLARIN (Interoperable Linguistic Resources Infrastructure for Romanian), BALKANET (which built a network of aligned wordnets for Balkan languages), the FP6 project LT4eL (Language Technology for e-Learning), the INTAS project RoLTech (Platform For Romanian Language Technology: Resources, Tools and Interfaces), Roric-Ling, the ALEAR project (Artificial Language Evolution for Autonomous Robots), the PSP-ICT projects METANET4U (Enhancing Multilingual European Infrastructure) and ATLAS (Applied Technologies for Content Management Systems Using Natural Language), etc. Some nationally funded projects also existed, such as: STAR (A System for Machine Translation for Romanian), SIR-RESDEC (Open Domain Question Answering System for Romanian and English), ROTEL (intelligent systems for the Semantic Web, based on the logic of ontologies and NLP), eDTLR (The Romanian Thesaurus Dictionary in electronic form), among others.

The market for language technologies can only be estimated and will most probably get a boost from mobile appliances, the Apple iPad and similar products, (educational) games, etc.

As we have seen, previous programmes have led to the development of a number of LT tools and resources for the Romanian language. The following section summarises the current state of LT support for Romanian.

4.6 AVAILABILITY OF TOOLS AND RESOURCES

Figure 7 provides a rating for language technology support for the Romanian language. This rating of existing tools and resources was generated by leading experts in the field who provided estimates based on a scale from 0 (very low) to 6 (very high) using seven criteria.

The key results for Romanian language technology can be summed up as follows:

- Even if, in general, all LT fields are covered, there are three fields that are not yet considered for the Romanian language by researchers: language generation, dialogue management systems, and multimodal corpora building.
- Although different parsing technologies are available for the Romanian language, no satisfactory reference treebank corpus, to be used as a benchmark when testing automated parses, exists yet.
- Speech processing is currently much less mature than LT for written text, both in terms of corpora and instruments.
- While relatively significant work can be seen in NLP fields such as tokenisation, sentence semantics or question answering systems, LT fields dealing with more complex phenomena, such as deep syntactic analysis or advanced discourse processing, still need more attention.
- Resources for the Romanian language are less represented than instruments, although they are essential for testing the designed tools.
- With some exceptions, such as the Web services for basic language processing, morphological analysis, question answering tools and machine translation systems, the existing tools for the Romanian language are neither completely freely available nor out-of-the-box systems.
- The LT tools for Romanian cover wide domains for the sentence semantics and information retrieval fields, while being relatively domain-restricted for the other tasks.
- Of the existing LT tools for Romanian, the mature ones are freely available.
- While the different tools are not necessarily further maintained, the few resources for Romanian have good quality and are mostly sustainable.
- Since most tools are based on language models or machine learning techniques, their adaptability is generally good, which is not the case for language resources.
- Many of these tools, resources and data formats do not meet industry standards and cannot be sustained effectively. A concerted programme is required to standardise data formats and APIs.
- The scores different experts gave to the same LT field were usually relatively similar, mostly on availability, which suggests that the existing instruments and resources for Romanian are widely disseminated. Sometimes however, concerning sustainability and coverage, the experts gave scores that differ by more than half the total score. The main areas of disagreement were: reference corpora, semantics corpora, grammars, and ontological resources.
- The row containing information about language models may be slightly debatable, since some experts gave scores considering the written language models, while others considered models for Romanian spoken language and gave low scores.
- A legally unclear situation restricts the usage of digital texts, such as those published online by newspapers, for empirical linguistics and language technology research, for example, to train statistical language models. Together with politicians and policy makers, researchers should try to establish laws

                      Quantity  Availability  Quality  Coverage  Maturity  Sustainability  Adaptability
Language Technology: Tools, Technologies and Applications
Speech Recognition    2         1             1.8      1.4       2         2               2
Speech Synthesis      1         1             1.2      1.4       2         2               1
Grammatical analysis  4         3.5           4        3.6       4.5       3.5             4
Semantic analysis     3.3       3             3        3         3.6       4               4
Text generation       0         0             0        0         0         0               0
Machine translation   3         4             3.2      2.4       4         4               4
Language Resources: Resources, Data and Knowledge Bases
Text corpora          2         2             2.4      2.4       3         2.5             3
Speech corpora        3         2             2.4      1.2       3         3               3
Parallel corpora      4         5             3.2      2.4       5         5               4
Lexical resources     4         3             3.6      3.2       5         4.5             4
Grammars              2         2             2.4      1.6       2         3               3

Figure 7: State of language technology support for Romanian

or regulations that enable researchers to use publicly available texts for language-related R&D activities.
In a number of specific areas of Romanian language research, we have software with limited functionality available today. Obviously, further research efforts are required to meet the current deficit in processing texts on a deeper semantic level and to address the lack of resources such as parallel corpora for machine translation.

4.7 CROSS-LANGUAGE COMPARISON

The current state of LT support varies considerably from one language community to another. In order to compare the situation between languages, this section presents an evaluation based on two sample application areas (machine translation and speech processing) and one underlying technology (text analysis), as well as basic resources needed for building LT applications.
The languages were categorised using the following five-point scale:
1. Excellent support
2. Good support
3. Moderate support
4. Fragmentary support
5. Weak or no support
Language Technology support was measured according to the following criteria:
Speech Processing: Quality of existing speech recognition technologies, quality of existing speech synthesis technologies, coverage of domains, number and size of existing speech corpora, amount and variety of available speech-based applications.

Machine Translation: Quality of existing MT technologies, number of language pairs covered, coverage of linguistic phenomena and domains, quality and size of existing parallel corpora, amount and variety of available MT applications.
Text Analysis: Quality and coverage of existing text analysis technologies (morphology, syntax, semantics), coverage of linguistic phenomena and domains, amount and variety of available applications, quality and size of existing (annotated) text corpora, quality and coverage of existing lexical resources (e.g., WordNet) and grammars.
Resources: Quality and size of existing text corpora, speech corpora and parallel corpora, quality and coverage of existing lexical resources and grammars.
Figures 8 to 11 show that LT resources and tools for Romanian have started to be developed, but do not reach the quality and coverage of comparable resources and tools for the English language, which is in the lead in almost all LT areas. Even so, there are still plenty of gaps in English language resources with regard to high quality applications.
For speech processing, although at international level current technologies perform well enough to be successfully integrated into a number of industrial applications such as spoken dialogue and dictation systems, the Romanian language lacks a good representation in this domain. However, text analysis components and language resources already cover the linguistic phenomena of Romanian to a certain extent and form part of many applications involving mostly shallow natural language processing, e.g. spelling correction and authoring support.
For building more sophisticated applications, such as machine translation, there is a clear need for resources and technologies that cover a wider range of linguistic aspects and enable a deep semantic analysis of the input text. By improving the quality and coverage of these basic resources and technologies, we shall be able to open up new opportunities for tackling a broader range of advanced application areas, including high-quality machine translation.

4.8 CONCLUSIONS

In this series of white papers, we have made an important effort by assessing the language technology support for 30 European languages, and by providing a high-level comparison across these languages. By identifying the gaps, needs and deficits, the European language technology community and its related stakeholders are now in a position to design a large scale research and development programme aimed at building truly multilingual, technology-enabled communication across Europe.
The results of this white paper series show that there is a dramatic difference in language technology support between the various European languages. While there are good quality software and resources available for some languages and application areas, others, usually smaller languages, have substantial gaps. Many languages lack basic technologies for text analysis and the essential resources. Others have basic tools and resources, but the implementation of, for example, semantic methods is still far away. Therefore a large-scale effort is needed to attain the ambitious goal of providing high-quality language technology support for all European languages, for example through high quality machine translation.
In the case of the Romanian language, we can be cautiously optimistic about the current state of language technology support. Research in universities and academia from Romania and the Republic of Moldova was successful in designing particular high quality software, as well as widely applicable models and theories. However, the scope of the resources and the range of tools are still very limited when compared to English, and they are simply not sufficient in quality and quantity to develop the kind of technologies required to support a truly multilingual knowledge society.

However, it is nearly impossible to come up with sustainable and standardised solutions given the current relatively low level of linguistic resources. There is a tremendous need for linguistic resources, from raw texts in Romanian to heavily annotated data, where particular linguistic phenomena are highlighted by markings contributed by experts. Since the best known sources of raw texts are electronic copies of printed publications, an awareness campaign addressing the publishing houses, in order to persuade them to donate part of their textual productions for research purposes, is very much necessary [52].
Technologies already developed and optimised for the English language cannot be simply transferred to handle Romanian. English-based systems for parsing (syntactic and grammatical analysis of sentence structure) typically perform far less well on Romanian texts, due to the specific characteristics of the Romanian language.
Language generation and dialogue management systems are LT fields where much research is still needed for the Romanian language. Speech technologies and corpora should also be closely considered in order to align Romanian to the standards of other European languages.
Our findings lead to the conclusion that the only way forward is to make a substantial effort to create language technology resources for Romanian, as a means to push forward research, innovation and development. The need for large amounts of data and the extreme complexity of language technology systems make it vital to develop an infrastructure and a coherent research organisation to spur greater sharing and cooperation.
Finally, there is a lack of continuity in research and development funding. Short-term coordinated programmes tend to alternate with periods of sparse or zero funding.
The long term goal of META-NET is to enable the creation of high-quality language technology for all languages. This requires all stakeholders in politics, research, business, and society to unite their efforts.
The resulting technology will help tear down existing barriers and build bridges between Europe's languages, paving the way for political and economic unity through cultural diversity.

Excellent support:    none
Good support:         English
Moderate support:     Czech, Dutch, Finnish, French, German, Italian, Portuguese, Spanish
Fragmentary support:  Basque, Bulgarian, Catalan, Danish, Estonian, Galician, Greek, Hungarian, Irish, Norwegian, Polish, Serbian, Slovak, Slovene, Swedish
Weak/no support:      Croatian, Icelandic, Latvian, Lithuanian, Maltese, Romanian

Figure 8: Speech processing: state of language technology support for 30 European languages

Excellent support:    none
Good support:         English
Moderate support:     French, Spanish
Fragmentary support:  Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
Weak/no support:      Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish

Figure 9: Machine translation: state of language technology support for 30 European languages

Excellent support:    none
Good support:         English
Moderate support:     Dutch, French, German, Italian, Spanish
Fragmentary support:  Basque, Bulgarian, Catalan, Czech, Danish, Finnish, Galician, Greek, Hungarian, Norwegian, Polish, Portuguese, Romanian, Slovak, Slovene, Swedish
Weak/no support:      Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

Figure 10: Text analysis: state of language technology support for 30 European languages

Excellent support:    none
Good support:         English
Moderate support:     Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
Fragmentary support:  Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
Weak/no support:      Icelandic, Irish, Latvian, Lithuanian, Maltese

Figure 11: Speech and text resources: state of support for 30 European languages

5 ABOUT META-NET

META-NET is a Network of Excellence partially funded by the European Commission. The network currently consists of 54 research centres in 33 European countries [53]. META-NET forges META, the Multilingual Europe Technology Alliance, a growing community of language technology professionals and organisations in Europe. META-NET fosters the technological foundations for a truly multilingual European information society that:
- makes communication and cooperation possible across languages;
- grants all Europeans equal access to information and knowledge regardless of their language;
- builds upon and advances functionalities of networked information technology.
The network supports a Europe that unites as a single digital market and information space. It stimulates and promotes multilingual technologies for all European languages. These technologies support automatic translation, content production, information processing and knowledge management for a wide variety of subject domains and applications. They also enable intuitive language-based interfaces to technology ranging from household electronics, machinery and vehicles to computers and robots. Launched on 1 February 2010, META-NET has already conducted various activities in its three lines of action: META-VISION, META-SHARE and META-RESEARCH.
META-VISION fosters a dynamic and influential stakeholder community that unites around a shared vision and a common strategic research agenda (SRA). The main focus of this activity is to build a coherent and cohesive LT community in Europe by bringing together representatives from highly fragmented and diverse groups of stakeholders. The present White Paper was prepared together with volumes for 29 other languages. The shared technology vision was developed in three sectorial Vision Groups. The META Technology Council was established in order to discuss and to prepare the SRA based on the vision in close interaction with the entire LT community.
META-SHARE creates an open, distributed facility for exchanging and sharing resources. The peer-to-peer network of repositories will contain language data, tools and web services that are documented with high-quality metadata and organised in standardised categories. The resources can be readily accessed and uniformly searched. The available resources include free, open source materials as well as restricted, commercially available, fee-based items.
META-RESEARCH builds bridges to related technology fields. This activity seeks to leverage advances in other fields and to capitalise on innovative research that can benefit language technology. In particular, the action line focuses on conducting leading-edge research in machine translation, collecting data, preparing data sets and organising language resources for evaluation purposes; compiling inventories of tools and methods; and organising workshops and training events for members of the community.

office@meta-net.eu - http://www.meta-net.eu

A REFERENCES

[1] Aljoscha Burchardt, Markus Egg, Kathrin Eichler, Brigitte Krenn, Jörn Kreutel, Annette Leßmöllmann, Georg Rehm, Manfred Stede, Hans Uszkoreit, and Martin Volk. Die Deutsche Sprache im Digitalen Zeitalter - The German Language in the Digital Age. META-NET White Paper Series. Georg Rehm and Hans Uszkoreit (Series Editors). Springer, 2012.

[2] Aljoscha Burchardt, Georg Rehm, and Felix Sasaki. The Future European Multilingual Information Society - Vision Paper for a Strategic Research Agenda, 2011. http://www.meta-net.eu/vision/reports/meta-net-vision-paper.pdf.

[3] Directorate-General Information Society & Media of the European Commission. User Language Preferences Online, 2011. http://ec.europa.eu/public_opinion/flash/fl_313_en.pdf.

[4] European Commission. Multilingualism: an Asset for Europe and a Shared Commitment, 2008. http://ec.europa.eu/languages/pdf/comm2008_en.pdf.

[5] Directorate-General of the UNESCO. Intersectoral Mid-term Strategy on Languages and Multilingualism, 2007. http://unesdoc.unesco.org/images/0015/001503/150335e.pdf.

[6] Directorate-General for Translation of the European Commission. Size of the Language Industry in the EU, 2009. http://ec.europa.eu/dgs/translation/publications/studies.

[7] Ioana Vintilă-Rădulescu. Limba română din perspectiva integrării în Uniunea Europeană (The Romanian language from the perspective of its integration in the European Union). http://www.unibuc.ro/ro/limba_romn_din_perspectiva_integrrii_europene.

[8] Institutul Național de Statistică (National Institute of Statistics). Anuar statistic 2009 (Statistical Yearbook 2009), 2009. http://www.insse.ro/cms/files/Anuarstatistic/02/02Populatie_en.pdf.

[9] Biroul Național de Statistică al Republicii Moldova (Bureau of Statistics of the Republic of Moldova). Bază de date statistică (Statistical database), 2011. http://statbank.statistica.md.

[10] Wikipedia. Romanian Diaspora, 2011. http://en.wikipedia.org/wiki/Romanian_diaspora.

[11] Marius Sala, editor. Enciclopedia limbii române (Encyclopaedia of the Romanian Language), 2006. 2nd edition.

[12] European Federation of National Institutions for Language. Legal framework, 2007. http://www.efnil.org/documents/language-legislation-version-2007/romania.

[13] Grigore Brâncuș. Vocabularul autohton al limbii române (The Autochthonous Vocabulary of the Romanian Language). Editura Științifică și Tehnică (Scientific and Technical Publishing House), 1983.

[14] Institutul Limbii Române (Institute for the Romanian Language). Lectorate de limba română (Romanian language programs abroad). http://www.ilr.ro/plr.php?lmb=1.

[15] Miniwatts Marketing Group. Romania - Internet Usage Stats and Market Report. http://www.internetworldstats.com/eu/ro.htm.

[16] Miniwatts Marketing Group. Internet Usage in the European Union. http://www.internetworldstats.com/stats9.htm.

[17] Uniunea Latină (Latin Union). Limbile și culturile pe internet (Languages and cultures on the Internet). http://dtil.unilat.org/LI/2007/index_ro.htm.

[18] Kai-Uwe Carstensen, Christian Ebert, Cornelia Ebert, Susanne Jekat, Hagen Langer, and Ralf Klabunde, editors. Computerlinguistik und Sprachtechnologie: Eine Einführung (Computational Linguistics and Language Technology: An Introduction). Spektrum Akademischer Verlag, 2009.

[19] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, 2nd edition, 2009.

[20] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[21] Language Technology World (LT World). http://www.lt-world.org/.

[22] Ronald Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Battista Varile, Annie Zaenen, and Antonio Zampolli, editors. Survey of the State of the Art in Human Language Technology. Cambridge University Press, 1998.

[23] Dan Tufiș and Alexandru Ceaușu. DIAC+: A Professional Diacritics Recovering System. In Proceedings of Language Resources and Evaluation Conference LREC 2008, 2008.

[24] Institutul de Matematică și Informatică, Academia de Științe a Moldovei, Chișinău (Institute of Mathematics and Computer Science, Academy of Sciences of the Republic of Moldova, Chisinau). Resurse refolosibile pentru tehnologia limbajului românesc (Reusable Resources for Romanian Language Technology). http://www.math.md/elrr/.

[25] Spiegel Online. Google zieht weiter davon (Google pulls further ahead), 2009. http://www.spiegel.de/netzwelt/web/0,1518,619398,00.html.

[26] Juan Carlos Perez. Google Rolls out Semantic Search Capabilities, 2009. http://www.pcworld.com/businesscenter/article/161869/google_rolls_out_semantic_search_capabilities.html.

[27] Dan Tufiș, Ion Radu, Luigi Bozianu, Alexandru Ceaușu, and Dan Ștefănescu. Romanian Wordnet: Current State, New Applications and Prospects. In Proceedings of 4th Global WordNet Conference, GWC-2008, pages 441-452, 2008.

[28] Research Institute for Artificial Intelligence. XML Web Services. www.racai.ro/WebServices.

[29] Dan Tufiș, Ion Radu, Alexandru Ceaușu, and Dan Ștefănescu. RACAI's Linguistic Web Services. In Proceedings of Language Resources and Evaluation Conference - LREC 2008, 2008.

[30] Diana Trandabăț. Mining Romanian Texts for semantic knowledge. In Proceedings of ISDA 2011, Cordoba, Spain, 2011.

[31] Diana Trandabăț. Towards automatic cross-lingual transfer of semantic annotation. In 6e Rencontres Jeunes Chercheurs en Recherche d'Information RJCRI-CORIA 2011, 2011.

[32] Adrian Iftene, Loredana Vamanu, and Cosmina Croitoru. UAIC at ImageCLEF 2009 Photo Annotation Task. In C. Peters et al. (Eds.): CLEF 2009, LNCS 6242, Part II (Multilingual Information Access Evaluation Vol. II Multimedia Experiments), pages 283-286, 2010.

[33] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of ACL, Philadelphia, PA, 2002.

[34] Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 462 Machine Translation Systems for Europe. In Proceedings of MT Summit XII, 2009.

[35] Daniel Marcu and Dragoș S. Munteanu. Statistical Machine Translation: An English-Romanian Experiment, 2005. Invited talk.

[36] Dan Tufiș and Alexandru Ceaușu. Factored Phrase-Based Statistical Machine Translation. In Proceedings of the 5th Conference Speech Technology and Human-Computer Dialogue SpeD 2009, 2009.

[37] Elena Irimia. EBMT experiments for the English-Romanian Language Pair. In International Joint Conference Intelligent Information Systems (IIS 2009), Kraków, Poland, 2009.

[38] Dan Cristea and Adrian Iftene. If you want your talk be fluent, think lazy! Grounding coherence properties of discourse, 2011. Invited talk.

[39] Dan Cristea, Oana Postolache, and Ionuț Pistol. Summarisation through Discourse Structure. In Computational Linguistics and Intelligent Text Processing, Proceedings of CICLing 2005, LNCS, vol. 3406, pages 632-644, 2005.

[40] Diana Trandabăț. Using semantic roles to improve summaries. In Proceedings of the 13th European Workshop on Natural Language Generation ENLG2011, pages 164-169, 2011.

[41] Adrian Iftene, Diana Trandabăț, Alex Moruz, Ionuț Pistol, Maria Husarciuc, and Dan Cristea. Question Answering on English and Romanian Languages. In Peters et al. (Eds.): CLEF 2009, LNCS 6241, Part I, pages 229-236, 2010.

[42] Dan Cristea. Steps towards an electronic version of the Thesaurus Dictionary of the Romanian language. In Proceedings of the IVth National Conference The Academic Days of the Academy of Technical Science of Romania, Agir Publishing House, 2009.

[43] Princeton University. WordNet, a lexical database for English. http://wordnet.princeton.edu/.

[44] Institutul de Cercetări pentru Inteligența Artificială (Research Institute for Artificial Intelligence). WordNet-ul românesc (Romanian WordNet browser). http://www.racai.ro/wnbrowser/.

[45] Laboratorul de Inginerie a Limbajului Uman, Departamentul de Informatică Aplicată, Facultatea de Informatică, Calculatoare și Microelectronică, Universitatea Tehnică a Republicii Moldova (Human Language Engineering Laboratory, Applied Informatics Department, Computers, Informatics and Microelectronics Faculty, Technical University of Moldova). Dicționarul semantic bazat pe asociații (Semantic dictionary based on associations). http://lilu.fcim.utm.md/asociere.

[46] Dan Tufiș and Dan Ștefănescu. Experiments with a Differential Semantics Annotation for Wordnet 3.0. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (ACL-WASSA2011), pages 19-27, Portland, Oregon, USA, 2011.

[47] Alex Lucian Gînscă, Emanuela Boroș, Adrian Iftene, Diana Trandabăț, Mihai Toader, Marius Corîci, Augusto Perez, and Dan Cristea. Sentimatrix - Multilingual Sentiment Analysis Service. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (ACL-WASSA2011), Portland, Oregon, USA, 2011.

[48] Marina Sokolova and Victoria Bobicev. Classification of Emotion Words in Russian and Romanian Languages. In Proceedings of RANLP-2009, Borovets, Bulgaria, 2009.

[49] Carlo Strapparava and Alessandro Valitutti. WordNet-Affect: an Affective Extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 1083-1086, Lisbon, Portugal, 2004.

[50] Fondazione Bruno Kessler HLT Research Unit. WordNet-Affect. http://wndomains.fbk.eu/wnaffect.html.

[51] Seria de Ateliere de lucru Instrumente și resurse lingvistice pentru limba română (Workshop series on Instruments and Tools for the Romanian Language Processing). Editura Universității A.I. Cuza Iași.

[52] Dan Cristea. Resurse lingvistice în flux continuu (Linguistic resources in a continuous flux, in Romanian). In Lucrările Atelierului de lucru Instrumente și resurse lingvistice pentru limba română 2010 (Proceedings of the Workshop Instruments and Tools for the Romanian Language Processing 2010), Bucharest, Romania, 2010.

[53] Georg Rehm and Hans Uszkoreit. Multilingual Europe: A challenge for language tech. MultiLingual, 22(3):51-52, April/May 2011.

[54] Jerrold H. Zar. Candidate for a Pullet Surprise. Journal of Irreproducible Results, page 13, 1994.

B META-NET MEMBERS

Austria         Zentrum für Translationswissenschaft, Universität Wien: Gerhard Budin
Belgium         Computational Linguistics and Psycholinguistics Research Centre, University of Antwerp: Walter Daelemans
                Centre for Processing Speech and Images, University of Leuven: Dirk van Compernolle
Bulgaria        Institute for Bulgarian Language, Bulgarian Academy of Sciences: Svetla Koeva
Czech Republic  Institute of Formal and Applied Linguistics, Charles University in Prague: Jan Hajič
Cyprus          Language Centre, School of Humanities: Jack Burston
Croatia         Institute of Linguistics, Faculty of Humanities and Social Science, University of Zagreb: Marko Tadić
Denmark         Centre for Language Technology, University of Copenhagen: Bolette Sandford Pedersen, Bente Maegaard
Estonia         Institute of Computer Science, University of Tartu: Tiit Roosmaa, Kadri Vider
Switzerland     Idiap Research Institute: Hervé Bourlard
Finland         Computational Cognitive Systems Research Group, Aalto University: Timo Honkela
                Department of Modern Languages, University of Helsinki: Kimmo Koskenniemi, Krister Lindén
France          Centre National de la Recherche Scientifique, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur and Institute for Multilingual and Multimedia Information: Joseph Mariani
                Evaluations and Language Resources Distribution Agency: Khalid Choukri
Germany         Language Technology Lab, DFKI: Hans Uszkoreit, Georg Rehm
                Human Language Technology and Pattern Recognition, RWTH Aachen University: Hermann Ney
                Department of Computational Linguistics, Saarland University: Manfred Pinkal
Greece          R.C. Athena, Institute for Language and Speech Processing: Stelios Piperidis
Ireland         School of Computing, Dublin City University: Josef van Genabith
Iceland         School of Humanities, University of Iceland: Eiríkur Rögnvaldsson
Italy           Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale Antonio Zampolli: Nicoletta Calzolari

                Human Language Technology Research Unit, Fondazione Bruno Kessler: Bernardo Magnini
Latvia          Tilde: Andrejs Vasiļjevs
                Institute of Mathematics and Computer Science, University of Latvia: Inguna Skadiņa
Lithuania       Institute of the Lithuanian Language: Jolanta Zabarskaitė
Luxembourg      Arax Ltd.: Vartkes Goetcherian
Malta           Department Intelligent Computer Systems, University of Malta: Mike Rosner
UK              School of Computer Science, University of Manchester: Sophia Ananiadou
                Institute for Language, Cognition and Computation, Center for Speech Technology Research, University of Edinburgh: Steve Renals
                Research Institute of Informatics and Language Processing, University of Wolverhampton: Ruslan Mitkov
Norway          Department of Linguistic, Literary and Aesthetic Studies, University of Bergen: Koenraad De Smedt
                Department of Informatics, Language Technology Group, University of Oslo: Stephan Oepen
Netherlands     Utrecht Institute of Linguistics, Utrecht University: Jan Odijk
                Computational Linguistics, University of Groningen: Gertjan van Noord
Poland          Institute of Computer Science, Polish Academy of Sciences: Adam Przepiórkowski, Maciej Ogrodniczuk
                University of Łódź: Barbara Lewandowska-Tomaszczyk, Piotr Pęzik
                Department of Computer Linguistics and Artificial Intelligence, Adam Mickiewicz University: Zygmunt Vetulani
Portugal        University of Lisbon: António Branco, Amália Mendes
                Spoken Language Systems Laboratory, Institute for Systems Engineering and Computers: Isabel Trancoso
Romania         Research Institute for Artificial Intelligence, Romanian Academy of Sciences: Dan Tufiș
                Faculty of Computer Science, University Alexandru Ioan Cuza of Iași: Dan Cristea
Serbia          University of Belgrade, Faculty of Mathematics: Duško Vitas, Cvetana Krstev, Ivan Obradović
                Pupin Institute: Sanja Vraneš
Slovakia        Ľudovít Štúr Institute of Linguistics, Slovak Academy of Sciences: Radovan Garabík
Slovenia        Jožef Stefan Institute: Marko Grobelnik
Spain           Barcelona Media: Toni Badia, Maite Melero

                Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra: Núria Bel
                Aholab Signal Processing Laboratory, University of the Basque Country: Inma Hernaez Rioja
                Center for Language and Speech Technologies and Applications, Universitat Politècnica de Catalunya: Asunción Moreno
                Department of Signal Processing and Communications, University of Vigo: Carmen García Mateo
Sweden          Department of Swedish, University of Gothenburg: Lars Borin
Hungary         Research Institute for Linguistics, Hungarian Academy of Sciences: Tamás Váradi
                Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics: Géza Németh, Gábor Olaszy

About 100 language technology experts, representatives of the countries and languages represented in META-NET, discussed and finalised the key results and messages of the White Paper Series at a META-NET meeting in Berlin, Germany, on October 21/22, 2011.

C THE META-NET WHITE PAPER SERIES

Basque             euskara
Bulgarian          български
Catalan            català
Czech              čeština
Croatian           hrvatski
Danish             dansk
German             Deutsch
English            English
Estonian           eesti
Finnish            suomi
French             français
Galician           galego
Greek              ελληνικά
Icelandic          íslenska
Irish              Gaeilge
Italian            italiano
Latvian            latviešu valoda
Lithuanian         lietuvių kalba
Hungarian          magyar
Maltese            Malti
Dutch              Nederlands
Norwegian Bokmål   bokmål
Norwegian Nynorsk  nynorsk
Polish             polski
Portuguese         português
Romanian           română
Serbian            српски
Slovak             slovenčina
Slovene            slovenščina
Spanish            español
Swedish            svenska


In everyday communication, Europe's citizens, business partners and politicians are inevitably confronted with language barriers. Language technology has the potential to overcome these barriers and to provide innovative interfaces to technologies and knowledge. This white paper presents the state of language technology support for the Romanian language. It is part of a series that analyzes the available language resources and technologies for 30 European languages. The analysis was carried out by META-NET, a Network of Excellence funded by the European Commission. META-NET consists of 54 research centres in 33 countries, which cooperate with stakeholders from economy, government agencies, research organisations, non-governmental organisations, language communities and European universities. META-NET's vision is high-quality language technology for all European languages.

We write a message on a mobile phone and are not even aware of the technology that anticipates which words we want to write. We are used to living in a world in which a GPS device can show us the way home, telling us when to turn left or right. Many technologies developed in recent years have very concrete implications for the everyday life of citizens of the European Union. Language technologies are a central element of the European Union, since languages themselves hold a central place in the way the EU functions.
Leonard Orban (former European Commissioner for Multilingualism)

www.meta-net.eu
