
Bilingual Terminology Extraction from TMX

A State-of-the-Art Overview

Chelo Vargas-Sierra, PhD


University of Alicante,
Spain

INDEX
Main points of this presentation

1st point: Key words. Overview of the terms involved in the process.
2nd point: Terminology and extractors. Terminology management, its timeline, BATE (approaches, state of the art).
3rd point: Evaluation. BATE under evaluation, measures for accuracy, quality in use model and tasks.
4th point: Results. Precision & recall, parameters & questionnaire.
KEY WORDS
Terms involved in the process

Parallel corpus: TMX.
Alignment levels: paragraph, sentence and word level.
ATE & BATE: automatic term extraction and bilingual automatic term extraction.
Precision/Recall: getting only terms (precision) and getting all the terms (recall).
Gold standard: exhaustive, manually created bilingual glossary.
Validation: term validation facility; which TCs (term candidates) are real terms?
Usability: software used to achieve the user's objectives with effectiveness, efficiency and satisfaction.
Quality in use model: ISO standard.
2. Terminology & Extractors

IMPORTANCE OF TERMINOLOGY
Translators were the first professionals to be aware of term-related issues

IDENTIFY: identify and interpret the terminology in the source text adequately.
FIND: find and use proper documentation and information resources.
RETRIEVE: retrieve and store terminological data.

TERMINOLOGY MANAGEMENT
In specialized translation

+40%: time spent solving terminological problems (Arntz 1993, Walker 1993).

Managing terminology (extracting, validating, importing, adding, editing, deleting, revising, updating, exporting, publishing) is a time-consuming process.

Terminology work happens "backstage", and customers or employers may not be fully aware of its benefits for QA.

Return on Investment (ROI) of terminology management: 90% reported by some corporate studies (Childress, 2007; Popiolek, 2015).

TERMINOLOGY MANAGEMENT
General model for project terminology creation (Popiolek, 2015: 351)

Extraction (monolingual extraction & validation):
- List of terms extracted from the ST
- List of terms to validate (accept or reject)

Translation (importing & looking for equivalents):
- List is added to a termbase
- List is translated and additional data are added

Approval:
- List approved by a person in charge of terminology
- When the client has requested it, there is an additional step for client approval

TIMELINE in Terminology Management
with bilingual extraction

Preparation: TMX import
Preparing the files and importing them into the BATE (a parsing sketch follows below).

Bilingual extraction
List of candidate term pairs extracted from the TMX.
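As a rough illustration of this preparation step (not the code of any of the tools evaluated), a minimal Python sketch that reads aligned English-Spanish segment pairs from a TMX 1.4 file could look as follows; the function name and language codes are assumptions:

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(path, src_lang="en", tgt_lang="es"):
    # Collect (source segment, target segment) pairs from the TMX body.
    # Inline markup inside <seg> is ignored in this sketch.
    pairs = []
    for tu in ET.parse(path).getroot().iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang.split("-")[0]] = seg.text.strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs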

Validation (& data entry)


- List of pair of terms to validate (accept
or reject terms and suggested
equivalents)
- Term by term and additional data are
added to a term base (Synchroterm)

Export/Import
- Export bilingual terms and additional
data in an available file format (.xls,
.txt, .TBX, …)
- Import output file to a TDB system
(to be integrated into a MT System)
13
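A minimal sketch of the export step, writing validated term pairs to a tab-separated .txt file that a terminology database system can import; the file name and column layout are illustrative assumptions, not any tool's actual format:

import csv

def export_term_pairs(term_pairs, path="terms_en_es.txt"):
    # term_pairs: iterable of (source_term, target_term, note) tuples
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["source_term", "target_term", "note"])
        writer.writerows(term_pairs)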

Approval
Person in charge of terminology
or client

Finish
Ready to use

Bilingual Automatic Term Extractors
Two approaches (Foo, 2012)

EXTRACT-ALIGN
1st step: monolingual terminology extraction in both languages.
2nd step: cross-linguistic matching, using word alignment or co-occurrence statistics to find equivalents (see the sketch below).

Commercial systems follow this approach.
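A toy sketch of the second step, pairing monolingually extracted candidates by a simple co-occurrence statistic (the Dice coefficient over aligned segments); the inputs, names and threshold are assumptions, not how any particular commercial system works:

from collections import Counter

def dice_pairs(segment_pairs, src_candidates, tgt_candidates, min_dice=0.3):
    # segment_pairs: (source_segment, target_segment) tuples from the TMX;
    # *_candidates: sets of term candidates already extracted monolingually.
    src_freq, tgt_freq, co_freq = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in segment_pairs:
        src_hits = {c for c in src_candidates if c in src_seg}
        tgt_hits = {c for c in tgt_candidates if c in tgt_seg}
        src_freq.update(src_hits)
        tgt_freq.update(tgt_hits)
        co_freq.update((s, t) for s in src_hits for t in tgt_hits)
    scored = []
    for (s, t), n in co_freq.items():
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        if dice >= min_dice:
            scored.append((s, t, dice))
    return sorted(scored, key=lambda x: x[2], reverse=True)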

Bilingual Automatic Term Extractors
Two approaches (Foo, 2012)

ALIGN-FILTER
1st step: word alignment on the parallel texts.
2nd step: rank the aligned units to select the most likely candidate pairs (statistics); see the sketch below.

TExSIS (Macken et al., 2013)
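And a correspondingly minimal sketch of the align-filter route, assuming the phrase-level alignment links already come from an external word aligner; the frequency threshold and stopword filter are illustrative:

from collections import Counter

def rank_aligned_units(aligned_units, min_freq=3, stopwords=frozenset()):
    # aligned_units: iterable of (source_phrase, target_phrase) alignment links
    aligned_units = list(aligned_units)
    pair_freq = Counter(aligned_units)
    src_freq = Counter(s for s, _ in aligned_units)
    ranked = []
    for (s, t), n in pair_freq.items():
        if n < min_freq or s.lower() in stopwords or t.lower() in stopwords:
            continue
        p_t_given_s = n / src_freq[s]  # how consistently s aligns to t
        ranked.append((s, t, n, p_t_given_s))
    # most consistent, most frequent alignments first
    return sorted(ranked, key=lambda x: (x[3], x[2]), reverse=True)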

Bilingual Automatic Term Extractors
Academic / in-house systems

90s
- English-French, TERMIGHT (Dagan & Church, 1994)
- English-French (Kupiec, 1993)
- English-Dutch (Eijk, 1993)
- English-French (Gaussier, 1995)
- English and Swedish (Ahrenberg et al., 1998)

2000-2009
- French-German (Blank, 2000)
- Japanese-English, MNH (Nakagawa & Mori, 2003)
- Spanish-Basque, Elexbi (Hernaiz et al., 2006), from a TMX
- Spanish-German, Autoterm (Haller, 2008)
- English-Spanish, Mutual Bilingual Term Extractor (Ha et al., 2008)
- French-English, French-Italian and French-Dutch (Lefever et al., 2009)

2010-2016
- French-Japanese (Morin et al., 2010, from ACABIT, Daille, 2003): not bilingual, but multilingual
- Slovene and English, Luiz (Vintar, 2010)
- English and Swedish, ITools suite (Foo & Merkel, 2010)
- English and German (Gojun et al., 2012)
- English, French, German, Spanish, TTC TermSuite (Daille, 2012)
- English-Spanish, TBXTools (Oliver & Vázquez, 2015) (under development)
- Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish: Sketch Engine (Baisa et al., 2015; Kovář et al., 2016)

Bilingual Automatic Term Extractors
Other BATE (free / commercial), contrasted with monolingual ATE

MONOLINGUAL ATE
- TermExtractor (Shimohata et al., 2001)
- MemoQ's built-in term extractor
- Déjà Vu Lexicon
- TermoStat Web: http://termostat.ling.umontreal.ca/
- Yate (IULA)
- TerMine: http://www.nactem.ac.uk/software/termine/
- TerminologyExtractor: https://goo.gl/yA2Cuf
- FiveFilters (web-based): http://fivefilters.org/term-extraction/
- Concordance programs: WordSmith Tools, AntConc (free), …

BILINGUAL
- Xerox Terminology Suite (2001)
- SDL Multiterm Extract
- Synchroterm
- CrossMining (Across)
- MultiTrans Term Extractor
- Okapi
- Similis™ (by Lingua et Machina™)
- Anchovy (by Swordfish)
- Araya Term Extractor
- PRoMT
- Analysis software: Sketch Engine (terminology extraction from TMX)
3. Evaluation
BATE UNDER EVALUATION

Sketch Engine
Similis
Synchroterm
Multiterm Extract

MAIN FEATURES
Tools compared: Multiterm Extract, SynchroTerm, Similis, SkE (Sketch Engine), Araya

Features examined for each tool:
- TMX import (Trados TMX)
- Extraction configuration
- Extraction scores
- Validation facility
- Term base indexation
- Export to TBX (or other formats: xls, txt, …)

MEASURES FOR ACCURACY

Term candidates (TCs) are classified against the gold standard:

              EXTRACTED   NON-EXTRACTED
TERMS             A             B
NO TERMS          C             D

PRECISION = A / (A + C)
RECALL    = A / (A + B)
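A two-line Python helper implementing these measures (illustrative only, not part of any evaluated tool):

def precision_recall(a, b, c):
    # a = extracted terms, b = terms the tool missed, c = extracted non-terms
    precision = a / (a + c)   # correct candidates / all extracted candidates
    recall = a / (a + b)      # correct candidates / all gold-standard terms
    return precision, recall

# e.g. with the Sketch Engine counts from the results table:
# precision_recall(283, 370, 717) ≈ (0.283, 0.433)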
QUALITY IN USE MODEL
Characteristics (ISO/IEC 25010:2011)

- Effectiveness: accuracy and completeness with which the user achieves objectives.
- Efficiency: resources expended in relation to the accuracy and completeness achieved.
- Satisfaction: degree to which user needs are satisfied when the software is used in a specified context of use.
- Freedom from risk: no risk for the security of users, the software, the context or the environment.
- Context coverage: degree to which the product covers the complete context of its usage; flexibility.

6 TASKS TO EVALUATE
when performing bilingual extraction

- TMX import: importing the source file
- Configuration: setting up the extraction project
- Extraction: performing the extraction to get a bilingual list
- Validation: selecting the real terms
- Record creation: creating and managing term entries
- Exportation: exporting the final result for later use in CAT systems
4. Results
PRECISION & RECALL IN %
[Bar chart: precision and recall (%) for Sketch Engine, Multiterm Extract (MTE), Synchroterm and Similis; the exact figures appear in the table below]

Contingency counts per tool (A = extracted terms, C = extracted non-terms, B = non-extracted terms; TCs = term candidates extracted; gold standard = 653 term pairs):

Tool        A     C      B     Gold std.   TCs     Precision (%)   Recall (%)
Sketch      283   717    370   653         1,000   28.30           43.34
MTE          97   813    556   653           910   10.66           14.85
SynchroT.   407   1,505  246   653         1,912   21.29           62.33
Similis     337   405    316   653           742   45.42           51.61
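As a quick check of these figures, for Similis: precision = 337 / (337 + 405) = 337 / 742 = 45.42%, and recall = 337 / (337 + 316) = 337 / 653 = 51.61%.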
PARAMETERS
Characteristics and sub-characteristics to be measured, with their metrics.
Each characteristic takes a value between 0 (minimum) and 5 (maximum).

EFFECTIVENESS = (EFE1 + EFE2 + EFE3) / 3
- EFE1. Degree of accuracy (precision of tasks & results) = (P1+P7+P13+P19+P25+P31)/6
- EFE2. Degree of completeness (tasks are accomplished and results are not missing) = (P2+P8+P14+P20+P26+P32)/6
- EFE3. Frequency of errors = (P3+P9+P15+P21+P27+P33)/6

EFFICIENCY = (EFI2 + EFI3 + EFI4) / 3
- EFI1. Time spent on the accomplishment of the task = TM1+TM2+TM3+TM4+TM5+TM6
- EFI2. Need to use additional sources (material, software, etc.) for the task = (P4+P10+P16+P22+P28+P34)/6
- EFI3. Productivity (effort exerted by the user to carry out the task) = (P5+P11+P17+P23+P29+P35)/6
- EFI4. Need to consult the software Help to perform the task = (P6+P12+P18+P24+P30+P36)/6

SATISFACTION = (P37 + P38 + P39) / 3
- SAT1. Usefulness
- SAT2. Trust
- SAT3. Pleasure

CONTEXT COVERAGE = (P40 + P41 + P42) / 3
- COB1. Context of use
- COB2. Flexibility
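As a concrete illustration of how the questionnaire feeds these metrics, the Python sketch below aggregates the 0-5 answers (P1-P42) into the four characteristics following the formulas above; the data structure and function names are assumptions, not the scripts actually used in the study:

def mean(ids, answers):
    # average of the selected questionnaire answers (each scored 0-5)
    return sum(answers[i] for i in ids) / len(ids)

def quality_in_use(answers):
    # answers: dict mapping "P1" ... "P42" to scores
    q = lambda start: [f"P{start + 6 * k}" for k in range(6)]  # one question per task
    effectiveness = (mean(q(1), answers) + mean(q(2), answers) + mean(q(3), answers)) / 3
    efficiency = (mean(q(4), answers) + mean(q(5), answers) + mean(q(6), answers)) / 3  # EFI2-EFI4
    satisfaction = mean(["P37", "P38", "P39"], answers)
    context_coverage = mean(["P40", "P41", "P42"], answers)
    return effectiveness, efficiency, satisfaction, context_coverage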
QUESTIONNAIRE
42 questions grouped by tasks

FINAL RESULTS FOR QUALITY IN USE
[Bar chart: effectiveness, efficiency, satisfaction, context coverage and total quality-in-use score for Sketch Engine, MTE, Synchroterm and Similis; total QIU scores ranged from 12.83 to 15.67]

RESULTS FOR EXTRACTION & VALIDATION
[Bar chart: scores for the extraction and validation tasks for Sketch Engine, MTE, Synchroterm and Similis]



CONCLUSIONS

• Managing terminology still takes a lot of time and effort, even in this increasingly computerized profession.
• Research on automatic terminology extraction has been around for more than 20 years, and significant enhancements concerning bilingual extraction and bilingual corpora exploitation have been introduced.
• I briefly described the BATE under evaluation and illustrated some results obtained for accuracy and with the QIU model.
• Results make it clear that much more work has to be done for BATE to be considered of real help to translators and terminologists, mainly due to poor accuracy results.
References
• Baisa, Vít, Barbora Ulipová, and Michal Cukr. 2015. "Bilingual Terminology Extraction in Sketch Engine." In 9th Workshop on Recent Advances in Slavonic Natural Language Processing (RASLAN 2015), 61–67.
• Childress, Mark D. 2007. "Terminology Work Saves More Time than It Cost." Multilingual, April/May: 43–46.
• Foo, Jody. 2012. Computational Terminology: Exploring Bilingual and Monolingual Term Extraction.
• Foo, Jody, and Magnus Merkel. 2010. "Computer Aided Term Bank Creation and Standardization: Building Standardized Term Banks through Automated Term Extraction and Advanced Editing Tools." In Terminology in Everyday Life, edited by Marcel Thelen and Frieda Steurs, 163–80. John Benjamins Publishing Company. doi:10.1075/tlrp.13.12foo.
• Kovář, Vojtěch, Vít Baisa, and Miloš Jakubíček. 2016. "Sketch Engine for Bilingual Lexicography." International Journal of Lexicography 29 (3): 339–52. doi:10.1093/ijl/ecw029.
• Macken, Lieve, Els Lefever, and Véronique Hoste. 2013. "TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-Based Alignment." Terminology 19 (2013): 1–30. doi:10.1075/term.19.1.01mac.
• Oliver, Antoni, and M. Vázquez. 2015. "TBXTools: A Free, Fast and Flexible Tool for Automatic Terminology Extraction." In Proceedings of Recent Advances in Natural Language Processing, 473–79.
• Popiolek, Monika. 2015. "Terminology Management within a Translation Quality Assurance Process." In Handbook of Terminology (Volume 1), edited by Hendrik J. Kockaert and Frieda Steurs, 341–59. John Benjamins Publishing Company. doi:10.1075/hot.1.ter6.
• Sauron, Véronique. 2002. "Tearing out the Terms: Evaluating Terms Extractors." In Translating and the Computer 24: Proceedings from the Aslib Conference, 21–22 November 2002.
• Vintar, Špela. 2010. "Bilingual Term Recognition Revisited: The Bag-of-Equivalents Term Alignment Approach and Its Evaluation." Terminology 16 (2010): 141–58. doi:10.1075/term.16.2.01vin.
Many thanks for your attention
Chelo Vargas-Sierra

University of Alicante, IULMA
Campus de San Vicente, Apdo. 99, 03080 Alicante
Direct line: +34 965903438
Fax: +34 965903800
chelo.vargas@ua.es
Social media: @chelovargas
