
Information Extraction

If information is power and riches, then it is not the amount that gives the
value, but access at the right time and in the most suitable form.
Plan
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free to download or via API)
5. The future
Example
• Having set off from Halifax (eastern Canada) on 1 July, the car, covered
with solar cells and shaped like a flying saucer, arrived at the end of
last week in Vancouver, in the west, beating the previous world record
of 4,000 kilometres, AFP reports.

• Nicolae Badea, president of the "Dinamo" Bucharest club, who was robbed
on the night of 16 July, was questioned yesterday for more than five
hours by prosecutors of the Prosecutor's Office attached to the
Bucharest Court of Appeal.

Example
The "Red Dogs" have promised coach Cornel Dinu a fine birthday present:
qualification for the next round. Dinamo wants a miracle. The result of
the first-leg match against Polonia Warsaw, 3-4, has considerably
reduced the Dinamo players' chances of qualifying for the third
preliminary round of the Champions League. Worse still, it has greatly
diminished the chance that the "red-and-whites" will play even in the
UEFA Cup.
Plan
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free to download or via API)
5. The future
Definition
• Information extraction (IE) is the task of automatically extracting
structured information from unstructured and/or semi-structured
machine-readable documents.

• In most cases this activity concerns the analysis of texts using
natural language processing (NLP) methods.

• More recently, information extraction has also come to include the
processing of multimedia documents, for example automatic annotation
and content extraction from images, audio and video.
NER - Named Entity Recognition

The decision by the independent MP Andrew
Wilkie to withdraw his support for the minority
Labor government sounded dramatic but it
should not further threaten its stability.
When, after the 2010 election, Wilkie, Rob
Oakeshott, Tony Windsor and the Greens
agreed to support Labor, they gave just two
guarantees: confidence and supply.
NER - Named Entity Recognition

Finding: The decision by the independent MP
Andrew Wilkie to withdraw his support
for the minority Labor government
sounded dramatic but it should not
further threaten its stability.
When, after the 2010 election, Wilkie,
Rob Oakeshott, Tony Windsor and
the Greens agreed to support Labor,
they gave just two guarantees:
confidence and supply.
NER - Named Entity Recognition

Classification: Person, Location, Date, Organization, etc.

The decision by the independent MP Andrew Wilkie to withdraw his
support for the minority Labor government sounded dramatic but it
should not further threaten its stability.
When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and
the Greens agreed to support Labor, they gave just two guarantees:
confidence and supply.
Example
Classification: Person, Location, Date, Organization, etc.

• Having set off from Halifax (eastern Canada) on 1 July, the car, covered
with solar cells and shaped like a flying saucer, arrived at the end of
last week in Vancouver, in the west, beating the previous world record
of 4,000 kilometres, AFP reports.

• Nicolae Badea, president of the "Dinamo" Bucharest club, who was robbed
on the night of 16 July, was questioned yesterday for more than five
hours by prosecutors of the Prosecutor's Office attached to the
Bucharest Court of Appeal.
Example
Classification: Person, Location, Date, Organization, etc.

The "Red Dogs" have promised coach Cornel Dinu a fine birthday present:
qualification for the next round. Dinamo wants a miracle. The result of
the first-leg match against Polonia Warsaw, 3-4, has considerably
reduced the Dinamo players' chances of qualifying for the third
preliminary round of the Champions League. Worse still, it has greatly
diminished the chance that the "red-and-whites" will play even in the
UEFA Cup.
Entity classes
• Person
• Organization
• Location
• Date
• Time
• Percentage
• Monetary amount

The contract for the design, construction and completion of the 3.3
"outstanding" kilometres of the București - Ploiești section of the
București - Brașov motorway has been signed; its value amounts to
129.181 million lei, excluding VAT, according to an announcement by the
Compania Națională de Autostrăzi și Drumuri Naționale din România
(CNADNR), posted on Friday evening on the Facebook page of the Ministry
of Transport (MT).
Saturday, 19 Dec 2015, 02:21
Entity classes
• Person
• Organization
• Location
• Date
• Time
• Percentage
• Monetary amount

AGERPRES special correspondent Florentina Peia reports: The Romanian
injured in the Tel Aviv attack has been discharged from hospital and
returned home on Friday; President Klaus Iohannis is to agree with him
on the day they will meet, representatives of the Presidential
Administration said.
Friday, 11 Mar 2016, 11:55
Entity classes
• Person
• Organization
• Location
• Date
• Time
• Percentage
• Monetary amount

Global sales of the Volkswagen brand fell by 4.7% in February, to
394,000 vehicles, amid declining demand in the USA, South America and
the Asia-Pacific region, EFE and Reuters report.
Friday, 11 Mar 2016, 11:48
MUC
Message Understanding Competition
Message Understanding Conference
Conference  Year  Text source       Topic (domain)
MUC-1       1987  Military reports  Fleet operations
MUC-2       1989  Military reports  Fleet operations
MUC-3       1991  News reports      Terrorist activities in Latin America
MUC-4       1992  News reports      Terrorist activities in Latin America
MUC-5       1993  News reports      Corporate joint ventures, microelectronic production
MUC-6       1995  News reports      Negotiation of labor disputes and corporate management succession
MUC-7       1997  News reports      Airplane crashes, and rocket/missile launches
MUC-7
The Named Entity task

Three subtasks:
1. Entity names
   1. Organizations
   2. Persons
   3. Locations
2. Temporal expressions
   1. Dates
   2. Times
3. Number expressions
   1. Monetary values
   2. Percentages

Examples
<ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>
<ENAMEX TYPE="LOCATION">North and South America</ENAMEX>
<TIMEX TYPE="DATE" ALT="1987">all of 1987</TIMEX>
<TIMEX TYPE="DATE">from 1990 through 1992</TIMEX>
<NUMEX TYPE="MONEY">10- and 20-dollar</NUMEX> bills
<NUMEX TYPE="MONEY">175 to 180 million Canadian dollars</NUMEX>
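The inline markup above is regular enough to be read back with a short script. The following is a minimal sketch, assuming only the three tag names and the TYPE attribute shown on the slide; the sample string is a hypothetical sentence assembled from the slide's own examples, and nested tags are not handled.

```python
import re

# Matches one MUC-style element such as <ENAMEX TYPE="PERSON">John Hime</ENAMEX>.
# Other attributes (for example ALT) are skipped by this sketch.
MUC_TAG = re.compile(
    r'<(ENAMEX|TIMEX|NUMEX)\s+TYPE="([A-Z]+)"[^>]*>(.*?)</\1>',
    re.DOTALL,
)

def extract_entities(marked_up):
    """Return (tag, type, text) triples with internal whitespace normalised."""
    return [
        (tag, etype, " ".join(text.split()))
        for tag, etype, text in MUC_TAG.findall(marked_up)
    ]

# Hypothetical sample built from the annotated fragments shown on the slides.
sample = (
    '<ENAMEX TYPE="ORGANIZATION">Mips</ENAMEX> Vice President '
    '<ENAMEX TYPE="PERSON">John Hime</ENAMEX> mentioned '
    '<NUMEX TYPE="MONEY">175 to 180 million\nCanadian dollars</NUMEX>.'
)

for triple in extract_entities(sample):
    print(triple)
```

The whitespace normalisation matters because annotated spans often wrap across lines, as in the examples above.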
Complex cases

"Mips Vice President John Hime"
<ENAMEX TYPE="ORGANIZATION">Mips</ENAMEX> Vice President <ENAMEX TYPE="PERSON">John Hime</ENAMEX>

"Treasury Secretary"
<ENAMEX TYPE="ORGANIZATION">Treasury</ENAMEX> Secretary

"the U.S. Vice President"
the <ENAMEX TYPE="LOCATION">U.S.</ENAMEX> Vice President

"Arthur Anderson Consulting"
<ENAMEX TYPE="ORGANIZATION">Arthur Anderson Consulting</ENAMEX>

"Boston Chicken Corp."
<ENAMEX TYPE="ORGANIZATION">Boston Chicken Corp.</ENAMEX>

"U.S. Fish and Wildlife Service"
<ENAMEX TYPE="ORGANIZATION">U.S. Fish and Wildlife Service</ENAMEX>

"Northern California"
<ENAMEX TYPE="LOCATION">Northern California</ENAMEX>

"West Texas"
<ENAMEX TYPE="LOCATION">West Texas</ENAMEX>

"California's Silicon Valley"
<ENAMEX TYPE="LOCATION">California</ENAMEX>'s <ENAMEX TYPE="LOCATION">Silicon Valley</ENAMEX>

"Canada's Parliament"
<ENAMEX TYPE="LOCATION">Canada</ENAMEX>'s <ENAMEX TYPE="ORGANIZATION">Parliament</ENAMEX>

"Big Blue" [alias for International Business Machines Corp.]
<ENAMEX TYPE="ORGANIZATION">Big Blue</ENAMEX>

"Big Board" [alias for New York Stock Exchange]
<ENAMEX TYPE="ORGANIZATION">Big Board</ENAMEX>

"Mr. Fix-It" [nickname for candidate for head of the CIA]
Mr. <ENAMEX TYPE="PERSON">Fix-It</ENAMEX>

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
MUC-7

Information Extraction task

1. Template Element

Example

Fred Kilian, the squadron commander of the F-14 pilot in the Nashville crash that
killed five people last week.

ENTITY
ENT_NAME: "Fred Kilian"
ENT_TYPE: PERSON
ENT_DESCRIPTOR: "NAVY SQUADRON COMMANDER"

LOCATION
LOCALE: "Berry Field"
LOCALE_TYPE: AIRPORT
COUNTRY: "United States"
MUC-7

Information Extraction task

2. Template Relation

Example

EMPLOYEE_OF =
PERSON:=
ENT_NAME: "JOHN O'NEIL"
ENT_TYPE: PERSON
ENT_CATEGORY: PER_CIV
ORGANIZATION: =
ENT_NAME: "N.Y. Times News Service"
ENT_TYPE: ORGANIZATION
ENT_CATEGORY: ORG_CO
Relation Extraction (RE)
– EmployeeOf(Steve Jobs, Apple)
a relation between a person and an organisation,
extracted from ‘Steve Jobs works for Apple’
– LocatedIn(Smith, New York)
a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in
New York’,
– SubsidiaryOf(TVN, ITI Holding)
a relation between two companies,
extracted from ‘Listed broadcaster TVN said its parent
company, ITI Holdings, is considering various options for
the potential sale.’

Jakub Piskorski and Roman Yangarber. Information Extraction: Past, Present and Future. Chapter 2.
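As a rough illustration of how such relations can be read off the surface text, the sketch below uses one hand-written pattern per relation. The patterns, the `NAME` expression and the function name are assumptions made for this example only; they are not taken from the cited chapter, and real systems rely on far richer linguistic analysis.

```python
import re

# A capitalised token sequence, used here as a crude stand-in for a named entity.
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"

# One illustrative surface pattern per relation (hypothetical, hand-written).
PATTERNS = {
    "EmployeeOf": re.compile(rf"({NAME}) works for ({NAME})"),
    "SubsidiaryOf": re.compile(rf"({NAME}) said its parent company, ({NAME}),"),
}

def extract_relations(sentence):
    """Return (relation, argument1, argument2) triples found in the sentence."""
    relations = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            relations.append((relation, match.group(1), match.group(2)))
    return relations

print(extract_relations("Steve Jobs works for Apple"))
print(extract_relations(
    "Listed broadcaster TVN said its parent company, ITI Holdings, "
    "is considering various options for the potential sale."
))
```

A LocatedIn pattern for the second example on the slide is deliberately left out: linking "Mr. Smith" to "New York" through "gave a talk at the conference in" needs more than a fixed string pattern.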
MUC-7

Information Extraction task

3. Scenario Template

This task was defined according to the domain of the texts. For MUC-7,
vehicle launch reports were analysed, with the following information:
1. the vehicle's payload
2. the launch date and site
3. the mission type
4. the function
MUC-7

Information Extraction task

3. Scenario Template

Bridgestone Sports Co. said Friday it had set up a joint venture in
Taiwan with a local concern and a Japanese trading house to produce
golf clubs to be supplied to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at
20 million new Taiwan dollars, will start production in January 1990
with production of 20,000 iron and "metal wood" clubs a month.

JOINT-VENTURE-1
• Relationship: TIE-UP
• Entities: "Bridgestone Sports Co.", "a local concern", "a Japanese trading house"
• Joint Ent: "Bridgestone Sports Taiwan Co."
• Activity: ACTIVITY-1
• Amount: NT$20 000 000

ACTIVITY-1
• Activity: PRODUCTION
• Company: "Bridgestone Sports Taiwan Co."
• Product: "iron and 'metal wood' clubs"
• Start date: DURING: January 1990
MUC-7
Coreference Task
Coreference detection

Examples

The sleepy boys and girls enjoy their breakfast.

<COREF ID="1" MIN="boys and girls">The sleepy boys and girls</COREF> enjoy
<COREF ID="2" REF="1" TYPE="IDENT">their</COREF> breakfast.

Edna Fribble and Sam Morton addressed the meeting yesterday. Ms. Fribble
discussed coreference, and Mr. Morton discussed unnamed entities.

<COREF ID="1">Edna Fribble</COREF> and <COREF ID="2">Sam
Morton</COREF> addressed the meeting yesterday. <COREF ID="3" REF="1"
TYPE="IDENT" MIN="Fribble">Ms. Fribble</COREF> discussed coreference,
and <COREF ID="4" REF="2" TYPE="IDENT" MIN="Morton">Mr.
Morton</COREF> discussed unnamed entities.
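The REF attribute links each mention to an earlier one, so coreference chains can be recovered by following those links back to a mention that has no REF. The sketch below assumes the flat, non-nested markup of the examples above; the function name and the returned structure are choices made for this illustration.

```python
import re
from collections import defaultdict

COREF = re.compile(r'<COREF\s+([^>]*)>(.*?)</COREF>', re.DOTALL)
ATTR = re.compile(r'(\w+)="([^"]*)"')

def coreference_chains(marked_up):
    """Group mentions into chains keyed by the ID of the first mention."""
    parent = {}  # mention ID -> ID given in its REF attribute
    text = {}    # mention ID -> mention string
    for attrs, mention in COREF.findall(marked_up):
        a = dict(ATTR.findall(attrs))
        text[a["ID"]] = " ".join(mention.split())
        if "REF" in a:
            parent[a["ID"]] = a["REF"]

    def root(mention_id):
        # Follow REF links until a mention without REF is reached.
        while mention_id in parent:
            mention_id = parent[mention_id]
        return mention_id

    chains = defaultdict(list)
    for mention_id in text:
        chains[root(mention_id)].append(text[mention_id])
    return dict(chains)

sample = (
    '<COREF ID="1">Edna Fribble</COREF> and <COREF ID="2">Sam\n'
    'Morton</COREF> addressed the meeting yesterday. <COREF ID="3" REF="1" '
    'TYPE="IDENT" MIN="Fribble">Ms. Fribble</COREF> discussed coreference, '
    'and <COREF ID="4" REF="2" TYPE="IDENT" MIN="Morton">Mr.\n'
    'Morton</COREF> discussed unnamed entities.'
)
print(coreference_chains(sample))
```

For the Fribble and Morton example this yields two chains, one per person.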
MUC-7

Coreference Task
Coreference detection

Example

The "Red Dogs" have promised coach Cornel Dinu a fine birthday present:
qualification for the next round. Dinamo wants a miracle. The result of
the first-leg match against Polonia Warsaw, 3-4, has considerably
reduced the Dinamo players' chances of qualifying for the third
preliminary round of the Champions League. Worse still, it has greatly
diminished the chance that the "red-and-whites" will play even in the
UEFA Cup.
Event Extraction (EE)
An email with the official announcement of a meeting:

COMPUTER SYSTEMS LABORATORY COLLOQUIUM
4:15PM, Wednesday, June 02, 2004
NEC Auditorium, Gates Computer Science Building B03
http://ee380.stanford.edu[1]
Topic: Automating CapCom….

An email with an informal invitation:

“hey i also wanted to invite you to the final software faire for cs194. It's
where we will be showing off our senior project. It's next wed from
like 12 -4 i think or something.”
Julie A. Black, Nisheeth Ranjan. Automated Event Extraction from Email
MUC-7 results

[Figure: MUC-7 results for the Coreference, Named Entities, Scenario Templates and Template Elements tasks]

Elaine Marsh, Dennis Perzanowski. MUC-7 Evaluation of IE Technology: Overview of Results, 1998
Plan
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free to download or via API)
5. The future
Methods used for information extraction
1. Deterministic
   1. Regular expressions
   2. Finite automata
   3. Use of additional knowledge
2. Statistical
   1. Machine learning methods
   2. Sequence-based statistical models
      1. Markov chains
      2. Conditional Random Fields
The presentation on regular expressions follows here

[a-zA-Z0-9]+@[a-z]+(\.[a-z]+)*\.([a-z]{2,})

olivier@mailbidon.com
olivier@mailbidon.ca
8@mailbidon.com
@mailbidon.com
olivier@mailbidon
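The pattern can be checked against the five candidate strings with a few lines of Python. This is a minimal sketch that assumes the whole string must match (hence `fullmatch`); the variable names are illustrative.

```python
import re

# The pattern from the slide, unchanged.
EMAIL = re.compile(r"[a-zA-Z0-9]+@[a-z]+(\.[a-z]+)*\.([a-z]{2,})")

candidates = [
    "olivier@mailbidon.com",
    "olivier@mailbidon.ca",
    "8@mailbidon.com",
    "@mailbidon.com",
    "olivier@mailbidon",
]

# fullmatch requires the entire candidate to match, not just a substring.
results = {c: EMAIL.fullmatch(c) is not None for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate:25} {'match' if ok else 'no match'}")
```

The first three strings match; the last two do not, because one lacks a local part before the @ and the other lacks a dot followed by a top-level domain.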
Example:
semi-automatic creation of regular expressions

Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan.
Regular Expression Learning for Information Extraction
Gazetteers
• A gazetteer is a geographical dictionary or directory used in
conjunction with a map or atlas.
Multilingual IE (NER)
• The highly multilingual media analysis application
Europe Media Monitor (EMM) makes extensive use of
name dictionaries, including not only large lists of
person, organisation and location names, but also many
spelling variants for the same named entity, both within
the same language and across languages.
• The lists were produced automatically by analysing over
100,000 news articles per day in over twenty languages.
• A large part of EMM’s vocabulary lists is made publicly
available for download as part of JRC-Names.
Ralf Steinberger, Guillaume Jacquet, Leonida Della Rocca.
Creation and use of multilingual named entity variant dictionaries
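Dictionary-based recognition of this kind can be sketched as a longest-match lookup over token sequences. The toy gazetteer below is invented for illustration; it is not the JRC-Names format, and a real system would load a much larger resource and handle spelling variants and inflection.

```python
# Toy gazetteer: lower-cased surface form -> (canonical name, entity type).
# All entries are made up for this example.
GAZETTEER = {
    "european commission": ("European Commission", "ORGANIZATION"),
    "ec": ("European Commission", "ORGANIZATION"),
    "vancouver": ("Vancouver", "LOCATION"),
    "halifax": ("Halifax", "LOCATION"),
}
MAX_LEN = max(len(key.split()) for key in GAZETTEER)

def lookup(text):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in GAZETTEER:
                found.append((" ".join(tokens[i:i + n]), *GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found

print(lookup("The European Commission met in Vancouver."))
```

Mapping both "ec" and "european commission" to one canonical name mirrors the idea of storing spelling variants of the same entity.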
Multilingual IE (NER)
• As developers of a highly multilingual named entity recognition (NER) system, we
face an evaluation resource bottleneck problem:
• 1. we need evaluation data in many languages,
• 2. the annotation should not be too time-consuming,
• 3. the evaluation results across languages should be comparable.
We solve the problem by automatically annotating the English version of a multi-parallel
corpus and by projecting the annotations into all the other language versions.
We used:
- a phrase-based statistical machine translation system,
- a lookup of known names from a multilingual name database,
- perfect string matching,
- perfect consonant signature matching,
- edit distance similarity.
The resulting annotated parallel corpus will be made available for reuse.

Ralf Steinberger, Guillaume Jacquet, Leonida Della Rocca.
Creation and use of multilingual named entity variant dictionaries
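Three of the matching steps listed above (perfect string matching, consonant signature matching, edit distance similarity) can be sketched as a cascade. The definition of the consonant signature used here (lower-case the name, keep only letters that are not a, e, i, o, u or y) and the distance threshold are assumptions for illustration; the slides do not specify them.

```python
def consonant_signature(name):
    """Lower-case the name and keep only consonant letters (assumed definition)."""
    return "".join(ch for ch in name.lower() if ch.isalpha() and ch not in "aeiouy")

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]

def match_kind(source_name, candidate, max_distance=2):
    """Return which step of the cascade accepts the pair, or None."""
    if source_name == candidate:
        return "exact"
    if consonant_signature(source_name) == consonant_signature(candidate):
        return "consonant signature"
    if edit_distance(source_name.lower(), candidate.lower()) <= max_distance:
        return "edit distance"
    return None

for pair in [("Vancouver", "Vancouver"), ("Mohammed", "Muhammad"),
             ("Warsaw", "Warsau"), ("Halifax", "Vancouver")]:
    print(pair, match_kind(*pair))
```

Ordering the checks from strictest to loosest keeps the cheap, high-precision tests first and falls back to fuzzy matching only when they fail.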
Multilingual IE (NER)
• Multi-word entities, such as organisation names, are
frequently written in many different ways. We have
previously automatically identified over one million
acronym pairs in 22 languages, consisting of their short
form (e.g. EC) and their corresponding long forms (e.g.
European Commission, European Union Commission).
• In order to evaluate the named entity variant clusters
automatically, with minimal human annotation effort we
experimented with Wikipedia redirection tables and we
show that this method produces reasonable results.
Using context
Domain: SPORT
Persons: player, coach, referee
Organizations: team, organizer
Events: match (?), championship

The "Red Dogs" have promised coach Cornel Dinu a fine birthday present:
qualification for the next round. Dinamo wants a miracle. The result of
the first-leg match against Polonia Warsaw, 3-4, has considerably
reduced the Dinamo players' chances of qualifying for the third
preliminary round of the Champions League. Worse still, it has greatly
diminished the chance that the "red-and-whites" will play even in the
UEFA Cup.
Using context
• This hiking route from Ceahlau Massif is located on the
north side of the mountain. The starting point is in Durau
Resort (800 m), continues to Fantanele Chalet (1220 m),
Panaghia Stone, Toaca Peak and Dochia Chalet (1750
m). The marking of the route is a red stripe.
• The necessary time to make this route is between 3h
and 3h and 30 min and has a length of 7,5 km. The level
difference is about 950 m. From Fantanele Chalet you
can also go to Duruitoarea Waterfall by following a
yellow triangle marked trail.
Using context
the girl cried. " MR. Morgan, it's the best
come out on"? "Mr. Flannagan is away
Her voice dropped. " MR. Flannagan has been away 
He shook his head. " MR. Manuel did that in the war. 
A Bay State supporter said, " MR. Hearst's fight 
construction of Chain Bridges. " MR. Palmer will attend to any 
 and said loudly, " MR. Chairman"! Cady Partlow's head 
in rolled-up shirt sleeves. " MR. Ferrell , chairman of the board 
 "That's right". " MR. Wycoff's car is waiting 
Lucius Beebe's book, " MR. Pullman's Elegant Palace Car", 
an affectionate smile. " MR. Skyros too smart 
laid it in the basket. " MR. Jack sets store by that". 
 I said, " MR. McKenzie, it is as authentic 
 who had ordered it? " MR. Gross, concerning the formation 
Garth hesitated. " MR. Hohlbein and I have noticed 
the much quoted remark: " MR. Green indeed writes 
of 1162 Sixth Avenue. " MR. Miller was in the shop", 
June 21, announced. " MR. Wolfe had been in 
 a politician. " MR. Jones, you may recall 
Using context
For the same concordance lines, we can use regular expressions:
Mr. [A-Z][a-z]+
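The title-plus-capitalised-word pattern can be tried on a few of the concordance lines. The sketch below makes two adjustments that are assumptions of this example rather than part of the slide: the dot is escaped so that it matches a literal period, and the upper-case form "MR." seen in the concordance is accepted as well.

```python
import re

# "Mr. [A-Z][a-z]+" with an escaped dot and an optional upper-case title.
TITLE_NAME = re.compile(r"\b(?:Mr|MR)\. ([A-Z][a-z]+)")

lines = [
    'the girl cried. " MR. Morgan, it\'s the best',
    'come out on"? "Mr. Flannagan is away',
    'and said loudly, " MR. Chairman"! Cady Partlow\'s head',
    'I said, " MR. McKenzie, it is as authentic',
]

names = [m.group(1) for line in lines for m in TITLE_NAME.finditer(line)]
print(names)
```

On these four lines the pattern returns Morgan, Flannagan, Chairman and Mc. The last two show its limits: "Chairman" is a form of address rather than a person name, and "McKenzie" is cut short because `[a-z]+` stops at the second capital letter.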
Plan
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free to download or via API)
5. The future
Populating knowledge bases
TextToLogic 
a semi-automatic IE system.
Applications
http://cogcomp.cs.illinois.edu/page/demo_view/ner

Applications
http://nlp.stanford.edu/software/CRF-NER.shtml

• Stanford NER is a Java implementation of a Named Entity Recognizer.
Included with the download are good named entity recognizers for
English, particularly for the 3 classes (PERSON, ORGANIZATION,
LOCATION).
• The software provides a general implementation of linear chain
Conditional Random Field (CRF) sequence models.
Applications
http://nlp.stanford.edu/software/CRF-NER.shtml

• This hiking route from Ceahlau Massif is located on the north side of the
mountain. The starting point is in Durau Resort (800 m), continues to
Fantanele Chalet (1220 m), Panaghia Stone, Toaca Peak and Dochia Chalet
(1750 m). The marking of the route is a red stripe. The necessary time to
make this route is between 3h and 3h and 30 min and has a length of 7,5 km.
The level difference is about 950 m. From Fantanele Chalet you can also go
to Duruitoarea Waterfall by following a yellow triangle marked trail.

Potential tags:
  ORGANIZATION
  LOCATION
  PERSON
• https://www.quora.com/Natural-Language-Processing/What-APIs-and-libraries-can-extract-dates-times-places-and-other-logistical-information-from-unstructured-text
Applications
http://services.gate.ac.uk/annie/

Named Entity
Recognition with
ANNIE
GATE annotation tool
The 2 Minute Guide to GATE
1. Take one large pile of text (documents, emails, tweets, patents,
papers, transcripts, blogs, comments, acts of parliament, and so
on and so forth) -- call this your corpus.
2. Pick a structured description of interesting things in the text (a
telephone directory, or chemical taxonomy, or something from
the Linked Data cloud) -- call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of
annotations of the corpus (1.) relative to the ontology (2.).
4. Use GATE Developer to build a semantic annotation pipeline to
do the annotation job automatically and measure performance
against the gold standard.
5. Take the pipeline from 4. and apply it to your text pile using GATE
Cloud (or embed it in your own systems using GATE Embedded).
6. Use GATE Mimir to store the annotations relative to the ontology
in a multiparadigm index server. (For techies: this sits in the
backroom as a RESTful web service.)
7. Use Ontotext KIM to add semantic search, knowledge facet
search, ontology browsing, entity popularity graphing, time series
graphing, annotation structure search and (last but not least)
boolean full text search. (More techy stuff: mash up these types of
search with your existing UIs.)
Plan
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free to download or via API)
5. The future

http://social-book-search.humanities.uva.nl/#/mining
  
The European Commission’s Joint Research Centre (JRC) in Ispra,
Italy, is looking for a trainee to support the JRC’s Europe Media
Monitor (EMM) team with a variety of Language Technology-related
tasks.
EMM gathers and analyses reports from traditional and social media in
dozens of languages by clustering related news items; categorising
them; extracting information such as:
entities (persons, organisations, locations),
events (who did what to whom, where and when),
quotations by and about people;
identifying sentiment;
linking related news clusters over time and across languages.

Methods used are mostly hybrid: machine learning tools are used to
gather evidence, learn vocabulary and rules, but the results are
usually controlled and optimised through human intervention.
The public EMM applications can be accessed at the URLs
http://emm.newsbrief.eu/overview.html 
and http://emm.newsexplorer.eu.
We are looking to fill a traineeship position in the field of
‘Multilingual Text Analysis’.
 If you are interested, please follow the instructions provided at
the URL listed below.

 URL of call:          http://recruitment.jrc.ec.europa.eu/
Call number: 2016-IPR-G-000-6713
Title of call: Multilingual Text Analysis
Deadline: 21 March 2016 (Brussels time)
Starting date: 1 June 2016
Duration: 5 months
Eligibility requirements: https://ec.europa.eu/jrc/en/working-with-us/jobs/temporary-positions/jrc-trainees
 
 
The successful trainee will carry out any
of the following tasks:
 
a)    use third-party software to carry out a terminology use study,
which includes comparing occurrences of terms and their variants in
English, French and Spanish;
b)    gather existing definitions of important terms from the internet;
c)    improve the JRC’s existing entity-oriented sentiment analysis tools,
then analyse large quantities of sentiment data and its change over
time with the purpose of identifying opinion change patterns and
trends;
d)    contribute to the semi-automatic classification of entities in JRC’s
multilingual entity database;
e)    contribute to improving the recognition of multilingual organisation
names;
f)    annotation of linguistic data and/or evaluation of automatic text
analysis results;
g)    contribute to writing a scientific publication.
Qualifications:

Essential:
• A degree (or an almost completed degree) in computational linguistics,
computer science or related areas;
• Programming skills;
• Good command of oral and written English (level B2).

Advantage:
• Knowledge of further foreign languages;
• Proven advanced programming skills, especially in Java;
• Good knowledge of Language Technology-related tools and methods;
• The proven ability to work independently and as part of a team.
