Information Extraction
If information is power and riches, then it is not the amount that gives the
value, but access at the right time and in the most suitable form.
Outline
1. Why? Need and importance
2. What exactly? Description and details
3. How? Methods and tools
4. Applications. Software (free downloadable or API)
5. The future
Example
• Having set out from Halifax (eastern Canada) on 1 July,
the car, covered in solar cells and shaped like
a flying saucer, arrived in Vancouver, in the west,
at the end of last week,
beating the previous world record of 4,000
kilometres, AFP reports.
"Treasury Secretary"
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
Complex cases
"Northern California"
<ENAMEX TYPE="LOCATION">Northern California</ENAMEX>
"West Texas"
<ENAMEX TYPE="LOCATION">West Texas</ENAMEX>
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
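The ENAMEX markup above can be read back into (text, type) pairs with a few lines of code. This is a minimal sketch, assuming the tags are never nested and always carry a single TYPE attribute, as in the two examples shown; the function name `extract_entities` is made up for this illustration.

```python
import re

# Matches one non-nested ENAMEX element, e.g.
# <ENAMEX TYPE="LOCATION">West Texas</ENAMEX>
ENAMEX = re.compile(r'<ENAMEX TYPE="([^"]+)">(.*?)</ENAMEX>')

def extract_entities(tagged_text):
    """Return (entity text, entity type) pairs found in MUC-style markup."""
    return [(m.group(2), m.group(1)) for m in ENAMEX.finditer(tagged_text)]

sample = (
    '<ENAMEX TYPE="LOCATION">Northern California</ENAMEX> and '
    '<ENAMEX TYPE="LOCATION">West Texas</ENAMEX>'
)
print(extract_entities(sample))
```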
"Canada's Parliament"
Message Understanding Conference
MUC-7
Fred Kilian, the squadron commander of the F-14 pilot in the Nashville crash that
killed five people last week.
ENTITY
ENT_NAME: "Fred Kilian"
ENT_TYPE: PERSON
ENT_DESCRIPTOR: "NAVY SQUADRON COMMANDER"
LOCATION
LOCALE: "Berry Field"
LOCALE_TYPE: AIRPORT
COUNTRY: "United States"
Examples
EMPLOYEE_OF :=
    PERSON :=
        ENT_NAME: "JOHN O'NEIL"
        ENT_TYPE: PERSON
        ENT_CATEGORY: PER_CIV
    ORGANIZATION :=
        ENT_NAME: "N.Y. Times News Service"
        ENT_TYPE: ORGANIZATION
        ENT_CATEGORY: ORG_CO
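A filled template such as EMPLOYEE_OF is essentially a nested record. The sketch below shows one possible in-memory representation; the class and field names are assumptions chosen to mirror the slot names above, not part of the MUC-7 specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    # Field names mirror the ENT_NAME / ENT_TYPE / ENT_CATEGORY slots.
    ent_name: str
    ent_type: str
    ent_category: str

@dataclass(frozen=True)
class EmployeeOf:
    # One relation instance linking a person to an organization.
    person: Entity
    organization: Entity

relation = EmployeeOf(
    person=Entity("JOHN O'NEIL", "PERSON", "PER_CIV"),
    organization=Entity("N.Y. Times News Service", "ORGANIZATION", "ORG_CO"),
)
print(relation.person.ent_name, "->", relation.organization.ent_name)
```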
Relation Extraction (RE)
– EmployeeOf(Steve Jobs, Apple)
a relation between a person and an organisation,
extracted from ‘Steve Jobs works for Apple’
– LocatedIn(Smith, New York)
a relation between a person and location,
extracted from ‘Mr. Smith gave a talk at the conference in
New York’,
– SubsidiaryOf(TVN, ITI Holding)
a relation between two companies,
extracted from ‘Listed broadcaster TVN said its parent
company, ITI Holdings, is considering various options for
the potential sale.’
Jakub Piskorski and Roman Yangarber. Information Extraction: Past, Present and Future. Chapter 2.
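The first relation above can be illustrated with a single hand-written pattern. This is only a sketch, assuming that names are sequences of capitalized words and that the trigger phrase is literally "works for"; real relation extractors rely on entity recognition and much richer patterns or learned models.

```python
import re

# Hypothetical pattern: <Capitalized name> works for <Capitalized name>
WORKS_FOR = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) works for "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)

def extract_employee_of(text):
    """Return EmployeeOf(person, organisation) triples matched in text."""
    return [
        ("EmployeeOf", m.group("person"), m.group("org"))
        for m in WORKS_FOR.finditer(text)
    ]

print(extract_employee_of("Steve Jobs works for Apple"))
```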
Examples
<COREF ID="1" MIN="boys and girls">The sleepy boys and girls</COREF> enjoy
<COREF ID="2" REF="1" TYPE="IDENT">their</COREF> breakfast.
Edna Fribble and Sam Morton addressed the meeting yesterday. Ms. Fribble
discussed coreference, and Mr. Morton discussed unnamed entities.
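The COREF markup links each mention to an earlier one through the REF attribute, so coreference chains can be rebuilt by following those links. A minimal sketch, assuming non-nested tags, attributes in the order ID then REF as in the example, and no circular references; the function name `coref_chains` is made up for this illustration.

```python
import re

# ID is mandatory, REF is optional; other attributes (MIN, TYPE) are skipped.
COREF = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?[^>]*>(.*?)</COREF>')

def coref_chains(tagged_text):
    """Group mention texts into chains by following REF links to their root."""
    text_of, ref_of = {}, {}
    for mention_id, ref, text in COREF.findall(tagged_text):
        text_of[mention_id] = text
        if ref:
            ref_of[mention_id] = ref

    def root(mention_id):
        while mention_id in ref_of:
            mention_id = ref_of[mention_id]
        return mention_id

    chains = {}
    for mention_id, text in text_of.items():
        chains.setdefault(root(mention_id), []).append(text)
    return list(chains.values())

sample = (
    '<COREF ID="1" MIN="boys and girls">The sleepy boys and girls</COREF> enjoy '
    '<COREF ID="2" REF="1" TYPE="IDENT">their</COREF> breakfast.'
)
print(coref_chains(sample))
```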
Coreference Task
Examples
The "Red Dogs" have promised coach Cornel Dinu a fine
birthday present: qualification for the next round
Dinamo wants a miracle
The result of the first-leg match against Polonia Warsaw, 3-4,
has markedly reduced the Dinamo players' chances of qualifying
for the third preliminary round of the Champions League.
Worse still, it has considerably reduced
the chance that the "red-and-whites" will play even in the UEFA Cup.
Event Extraction (EE)
An email containing the official announcement of a meeting:
“hey i also wanted to invite you to the final software faire for cs194. It's
where we will be showing off our senior project. It's next wed from
like 12 -4 i think or something.”
Julie A. Black, Nisheeth Ranjan. Automated Event Extraction from Email
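To make the task concrete, the sketch below pulls a day expression and an hour range out of the email text with two regular expressions. It is an illustration under strong assumptions (English weekday names, an "H - H" hour range, no am/pm resolution) and is not the method of the cited paper; the function name `extract_event` is hypothetical.

```python
import re

# Optional "next" followed by a weekday name or its three-letter abbreviation.
DAY = re.compile(
    r"\b(?:next\s+)?(?:mon(?:day)?|tue(?:sday)?|wed(?:nesday)?|thu(?:rsday)?"
    r"|fri(?:day)?|sat(?:urday)?|sun(?:day)?)\b",
    re.IGNORECASE,
)
# Two one- or two-digit numbers separated by a hyphen, e.g. "12 -4".
HOUR_RANGE = re.compile(r"\b(\d{1,2})\s*-\s*(\d{1,2})\b")

def extract_event(text):
    """Return the first day expression and hour range found, or None for each."""
    day = DAY.search(text)
    hours = HOUR_RANGE.search(text)
    return {
        "day": day.group(0) if day else None,
        "start_hour": int(hours.group(1)) if hours else None,
        "end_hour": int(hours.group(2)) if hours else None,
    }

email = (
    "hey i also wanted to invite you to the final software faire for cs194. "
    "It's where we will be showing off our senior project. "
    "It's next wed from like 12 -4 i think or something."
)
print(extract_event(email))
```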
MUC-7 results
[Figure: results by task: Coreference, Named Entities, Scenario Templates, Template Elements]
Elaine Marsh, Dennis Perzanowski. MUC-7 EVALUATION OF IE TECHNOLOGY: Overview of Results, 1998
Methods used for
information extraction
1. Deterministic
   1. Regular expressions
   2. Finite automata
   3. Use of additional knowledge
2. Statistical
   1. Machine learning methods
   2. Sequence-based statistical models
      1. Markov chain
      2. Conditional Random Fields
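As an illustration of the sequence-based statistical models listed above, here is a tiny hidden Markov model tagger decoded with the Viterbi algorithm. It is a sketch only: the two tags (O, PER), the two observation features (capitalized or not) and all probabilities are hand-set assumptions, whereas a real system would estimate them from annotated data.

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most probable state sequence for the observations."""
    best = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best previous state from which to reach s.
            prev = max(states, key=lambda p: best[-1][p] * trans[p][s])
            scores[s] = best[-1][prev] * trans[prev][s] * emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical, hand-set model parameters.
STATES = ["O", "PER"]
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.8, "PER": 0.2}, "PER": {"O": 0.4, "PER": 0.6}}
EMIT = {"O": {"cap": 0.1, "low": 0.9}, "PER": {"cap": 0.9, "low": 0.1}}

tokens = ["yesterday", "Fred", "Kilian", "spoke"]
features = ["cap" if t[0].isupper() else "low" for t in tokens]
tags = viterbi(features, STATES, START, TRANS, EMIT)
print(list(zip(tokens, tags)))
```

A Conditional Random Field is decoded in a similar way, but it scores whole label sequences with weighted features instead of multiplying per-step probabilities.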
The presentation on
regular expressions follows here
[a-zA-Z0-9]+@[a-z]+(\.[a-z]+)*\.([a-z]{2,})
olivier@mailbidon.com
olivier@mailbidon.ca
8@mailbidon.com
@mailbidon.com
olivier@mailbidon
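The expression above can be checked against the five sample strings directly. The sketch below uses Python's `re.fullmatch`, so a string counts as a match only if the whole string fits the pattern; the variable names are chosen for this illustration.

```python
import re

# The expression from the slide, unchanged.
EMAIL = re.compile(r"[a-zA-Z0-9]+@[a-z]+(\.[a-z]+)*\.([a-z]{2,})")

samples = [
    "olivier@mailbidon.com",
    "olivier@mailbidon.ca",
    "8@mailbidon.com",
    "@mailbidon.com",      # no local part before "@"
    "olivier@mailbidon",   # no top-level domain
]

results = {s: EMAIL.fullmatch(s) is not None for s in samples}
for address, ok in results.items():
    print(f"{address:25} {'match' if ok else 'no match'}")
```

The first three strings match and the last two do not. Note that the pattern is deliberately simple: it rejects valid addresses containing dots, hyphens or uppercase letters in some positions.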
Examples:
semi-automatic creation of regular
expressions
Potential tags:
ORGANIZATION
LOCATION
PERSON
• https://www.quora.com/Natural-Language-Processing/What-APIs-and-libraries-can-extract-dates-times-places-and-other-logistical-information-from-unstructured-text
Applications
http://services.gate.ac.uk/annie/
Named Entity Recognition with ANNIE
GATE annotation tool
The 2 Minute Guide to GATE
1. Take one large pile of text (documents, emails, tweets, patents,
papers, transcripts, blogs, comments, acts of parliament, and so
on and so forth) -- call this your corpus.
2. Pick a structured description of interesting things in the text (a
telephone directory, or chemical taxonomy, or something from
the Linked Data cloud) -- call this your ontology.
3. Use GATE Teamware to mark up a gold standard example set of
annotations of the corpus (1.) relative to the ontology (2.).
4. Use GATE Developer to build a semantic annotation pipeline to
do the annotation job automatically and measure performance
against the gold standard.
5. Take the pipeline from 4. and apply it to your text pile using GATE
Cloud (or embed it in your own systems using GATE Embedded).
6. Use GATE Mimir to store the annotations relative to the ontology
in a multiparadigm index server. (For techies: this sits in the
backroom as a RESTful web service.)
7. Use Ontotext KIM to add semantic search, knowledge facet
search, ontology browsing, entity popularity graphing, time series
graphing, annotation structure search and (last but not least)
boolean full text search. (More techy stuff: mash up these types of
search with your existing UIs.)
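Measuring performance against a gold standard usually means computing precision, recall and F-measure over the annotations. The sketch below does this in plain Python for exact-match (text, type) pairs; it does not use the GATE API, and the gold and predicted sets are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores for two collections of (text, type) annotations."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical annotations for one document.
gold = {
    ("Fred Kilian", "PERSON"),
    ("Berry Field", "LOCATION"),
    ("Nashville", "LOCATION"),
}
predicted = {
    ("Fred Kilian", "PERSON"),
    ("Nashville", "ORGANIZATION"),  # right span, wrong type: counted as an error
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```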
http://social-book-search.humanities.uva.nl/#/mining
The European Commission’s Joint Research Centre (JRC) in Ispra,
Italy, is looking for a trainee to support the JRC’s Europe Media
Monitor (EMM) team with a variety of Language Technology-related
tasks.
EMM gathers and analyses reports from traditional and social media in
dozens of languages by clustering related news items; categorising
them; extracting information such as:
entities (persons, organisations, locations),
events (who did what to whom, where and when),
quotations by and about people;
identifying sentiment;
linking related news clusters over time and across languages.
Methods used are mostly hybrid: machine learning tools are used to
gather evidence, learn vocabulary and rules, but the results are
usually controlled and optimised through human intervention.
The public EMM applications can be accessed at the URLs
http://emm.newsbrief.eu/overview.html
and http://emm.newsexplorer.eu.
We are looking to fill a traineeship position in the field of
‘Multilingual Text Analysis’.
If you are interested, please follow the instructions provided at
the URL listed below.
URL of call: http://recruitment.jrc.ec.europa.eu/
Call number: 2016-IPR-G-000-6713
Title of call: Multilingual Text Analysis
Deadline: 21 March 2016 (Brussels time)
Starting date: 1 June 2016
Duration: 5 months
Eligibility requirements: https://ec.europa.eu/jrc/en/working-with-us/jobs/temporary-positions/jrc-trainees
The successful trainee will carry out any
of the following tasks:
a) use third-party software to carry out a terminology use study,
which includes comparing occurrences of terms and their variants in
English, French and Spanish;
b) gather existing definitions of important terms from the internet;
c) improve the JRC’s existing entity-oriented sentiment analysis tools,
then analyse large quantities of sentiment data and its change over
time with the purpose of identifying opinion change patterns and
trends;
d) contribute to the semi-automatic classification of entities in JRC’s
multilingual entity database;
e) contribute to improving the recognition of multilingual organisation
names;
f) annotation of linguistic data and/or evaluation of automatic text
analysis results;
g) contribute to writing a scientific publication.
Qualifications:
Essential:
· A degree (or an almost completed degree) in computational linguistics,
computer science or related areas;
· Programming skills;
· Good command of oral and written English (level B2).
Advantage:
· Knowledge of further foreign languages;
· Proven advanced programming skills, especially in Java;
· Good knowledge of Language Technology-related tools and methods;
· The proven ability to work independently and as part of a team.