
AMADEUS

Architectures Machines And Devices for Efficient Ubiquitous Systems

AMADEUS TUTORIAL

Combining Question Answering and Information Retrieval

Suresh Manandhar and Thimal Jayasooriya


University of York

UDA – Ubiquitous Digital Agents

Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York
climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics

Beyond Search Engines
• Want shallow understanding of Natural Language to
support a range of applications:
– Document clustering
– Topic based searching
– Question based searching
– Interfacing with DB backends
– Linking multiple related documents

• Intelligent searching to aid
– scientific discovery, document organisation, automatic
extraction of knowledge from textual data

Ingredients for building QA systems
• Spelling correction/checking

• PoS tagging

• Shallow Parsing

• Deep parsing and Logical form generation

• Lexical reasoning

• Matching

• … other kinds of reasoning

Part of Speech (POS) tagging
• PoS tagging is the pre-step to syntactic analysis
• Given “I went to the bank”, output:
I-pronoun went-verb-past to-prep the-det bank-noun-sg

• Notice that “bank” can be a verb or a noun

• “I” can be a numeral or a pronoun


• And there can be unknown words:
I went to Pokhara

• Task of PoS tagger is to assign the “correct” PoS tags

POS tagging
• current taggers employ HMMs trained on large corpora

• trigram-based p(t0 | t-2 t-1) taggers are the current state-of-the-art, accuracy > 96%

• Q-A systems: lots of unknown words or known words used unusually, e.g. “what i want is ... ?”

• low tagging accuracy on unknown words: best systems ~ 84% to 88%

• far less tagged data on questions
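
The trigram model on this slide conditions each tag on the two tags before it. A minimal sketch of how p(t0 | t-2 t-1) could be estimated by relative-frequency counts is shown here. The tiny tagged corpus, the tag names and the `<s>` padding symbol are illustrative assumptions, not material from the tutorial; a real tagger would add smoothing, word emission probabilities and Viterbi decoding.

```python
from collections import Counter

# Toy tagged corpus (hypothetical data, for illustration only).
TAGGED = [
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("the", "det"), ("bank", "noun")],
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("Pokhara", "noun")],
    [("they", "pron"), ("bank", "verb"), ("with", "prep"), ("the", "det"), ("firm", "noun")],
]

def trigram_model(corpus):
    """Return a function p(t0, t_2, t_1) estimated by relative frequency."""
    trigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        # Pad with two start symbols so the first real tag has a full context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            trigrams[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bigrams[(tags[i - 2], tags[i - 1])] += 1

    def p(t0, t_2, t_1):
        context = bigrams[(t_2, t_1)]
        # Unseen contexts get probability 0.0 here; smoothing would change this.
        return trigrams[(t_2, t_1, t0)] / context if context else 0.0

    return p

p = trigram_model(TAGGED)
print(p("det", "verb", "prep"))   # probability of a determiner after verb + prep
print(p("noun", "verb", "prep"))  # probability of a noun after verb + prep
```

With so little data, any tag sequence that never occurred gets probability zero, which is one way to see why unknown words and unusual question constructions hurt accuracy.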

Parsing and Grammar formalisms
• constructing canonical logical forms from sentences
• a good grammatical formalism will allow mapping:
“John bought a toy” / “A toy was bought by John”
…to roughly the same semantic representation

exists(Y): toy(Y) & buy(‘John’,Y)

• common syntactic phenomena include:
– relative clauses – The man that Mary liked went home.
– co-ordinations – Bill and Mary got married.
– question constructions – What book did John buy?
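
To make the active/passive example on this slide concrete, here is a toy sketch that maps both sentence shapes to the same logical form. It assumes two hand-written regular-expression patterns and a two-entry lemma table (all hypothetical); it is not a grammar formalism and handles nothing beyond these two patterns.

```python
import re

# Tiny lemma table (hypothetical; a real system would use a morphological analyser).
LEMMA = {"bought": "buy", "liked": "like"}

# "<Agent> <verb> a <noun>" and "A <noun> was <verb> by <Agent>".
ACTIVE = re.compile(r"^(?P<agent>[A-Z]\w*) (?P<verb>\w+) an? (?P<noun>\w+)$")
PASSIVE = re.compile(r"^An? (?P<noun>\w+) was (?P<verb>\w+) by (?P<agent>[A-Z]\w*)$")

def logical_form(sentence):
    """Map a simple active or passive sentence to one canonical form."""
    text = sentence.strip().rstrip(".")
    match = PASSIVE.match(text) or ACTIVE.match(text)
    if match is None or match["verb"] not in LEMMA:
        return None  # outside the two toy patterns
    pred = LEMMA[match["verb"]]
    return f"exists(Y): {match['noun']}(Y) & {pred}('{match['agent']}',Y)"

print(logical_form("John bought a toy"))
print(logical_form("A toy was bought by John"))
```

Both calls produce the same string, which is the property a good formalism should give for free; relative clauses, co-ordination and questions need real syntactic analysis rather than patterns like these.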

Parsing… Pure CFGs not sufficient
• most automatically extracted grammars employ pure
CFGs

• need information on moved phrases to generate semantics

• dependency information is crucial

• most machine learnt grammars focus on raw crossing bracket rates

Parsing Issues: PP attachment ambiguity
• prepositional phrase attachment ambiguity
e.g. “I drank scotch on ice”

• better to leave PPs unattached rather than guessing wrong

• combine both position-based and meaning-based matching

• hybrid representations that combine logical form, syntactic structure and string/position based representations needed

Recollect: Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York
climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers

• Search engines are not so good at reasoning with syntax and semantics

Matching: Getting the right answer
Matching answers with questions
• “Susan Armstrong the 28 yr old rep from New York
climbed the 8800m high Mt Everest”

Reasoning using logical form and lexical relations:
– Meaning representation: X = Mt_Everest & tall(X, ?Y)
– “How tall …” – asking for a numeric measure
– “tall” is related to “height/high”
– “the 8800m high Mt Everest”:
high(X,Y) & Y=8800m & X = Mt_Everest
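
The reasoning steps on this slide can be sketched as a small matcher: the question asks for a numeric measure of an entity under the predicate "tall", and a candidate sentence answers it if it attaches a measure to that entity through a related word such as "high". The relation table and the regular expression below are hand-written assumptions for this one example; they stand in for the logical forms and lexicon a real system would use.

```python
import re

# Lexical relations linking the question predicate to answer predicates
# (hand-written here; in practice these would come from a lexical resource).
RELATED = {"tall": {"tall", "high", "height"}}

CANDIDATES = [
    "Mt Everest was first climbed by Hillary",
    "Mt Everest is part of the Himalayan range",
    "Susan Armstrong the 28 yr old rep from New York "
    "climbed the 8800m high Mt Everest",
]

def extract_measures(sentence, entity):
    """Find '<number><unit> <adjective> <entity>', e.g. '8800m high Mt Everest'."""
    pattern = re.compile(
        r"(\d+(?:\.\d+)?\s?[a-zA-Z]+)\s+(\w+)\s+" + re.escape(entity)
    )
    return [(adj.lower(), value) for value, adj in pattern.findall(sentence)]

def answer(question_pred, entity, candidates):
    """Return the first measure whose predicate is related to the question's."""
    related = RELATED.get(question_pred, {question_pred})
    for sentence in candidates:
        for adj, value in extract_measures(sentence, entity):
            if adj in related:
                return value
    return None

print(answer("tall", "Mt Everest", CANDIDATES))
```

The first two candidates mention Mt Everest but carry no measure, so only the third can satisfy tall(X, ?Y); the "28 yr" in that sentence is ignored because it is not attached to the entity.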

Matching – Lexical Relations
• Lexical relations are semantic relations between
words:
– Synonym : (human – person)
– Antonym: (tall – short)
– Hyponym: (BMW – car)
– Meronym: (door – house)
– Entailment: (fire – smoke)
– ….. plus many more

• Matching algorithm computes the semantic distance between the Question and the Answer.

Matching and reasoning
• reasoning crucial to Q-A

• WordNet provides:
– hypernym, synonym, antonym, meronym, etc.

• (but) common relations required for Q-A tasks missing:
– noun-adjective (benefit-beneficial)
– verb-noun (punish-punishment)
– entailment (penalty-punishment)
– telic (hammer-break)

• limited current research on learning of semantic relations

Search engines

Current state of the art

– Google uses backlinks to determine the most relevant pages

– Most of the other search engines use keyword scanning techniques

– Online directories, such as DMOZ, use human editors to sort and rank content

Beyond Search Engines
Essential ingredients

– A better syntactic understanding of document contents

– More efficient means of grouping “similar” or related elements together

– A better understanding of “relevance” to the user … A better query interface?

The ideal situation
• Error-free disambiguation of natural language in documents

• Categorization of documents by subject and intent of the author, rather than by scanned keywords

• The opportunity to clarify the information needs of individual users, to match more closely what they want

Possible dimensions for queries …
Who is the First Lord of the Treasury of the United Kingdom?

– First Lords of the Treasury
– Prime Ministers
– Head of State in the UK

Are Document Dimensions an answer?
In data warehousing terminology, a dimension is “a structure that categorizes data in order to enable end users to answer questions”

Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973

Mothè experimented with document metadata in dimensional space (2001 and 2003)

Roelleke used the accessibility dimension to determine relevance within a document

How are dimensions created?
• Cleanse and tokenize source data

• Shallow parse the source data to resolve some syntactic ambiguity

• Extract a series of unique terms, words or phrases

• Determine the similarity between individual terms

• Organize similar terms into dimensions, groups of semantically related elements

Analysing source documents
The source data was the sample newswire articles from
TREC-11, 3 gigabytes of XML formatted data, consisting
of around 20,000 articles

• Stripping XML formatting

• Detecting sentence boundaries in articles

• POS tagging individual sentences

• Named Entity and Coreference annotation
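
The first two preprocessing steps listed here (stripping XML and detecting sentence boundaries) might look roughly like the sketch below. The `DOC`/`TEXT`/`P` element names and the sample article are assumptions made for illustration; the actual markup of the TREC data may differ. The sentence splitter is deliberately naive and would break on abbreviations such as "Mt. Everest".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article markup; the real element names may differ.
SAMPLE = """<DOC><HEADLINE>Obituary</HEADLINE>
<TEXT><P>Kenneth Joseph Lenihan died May 25 at his home in Manhattan. He was a
research sociologist.</P><P>He helped refine methods used in criminology.</P></TEXT></DOC>"""

def strip_xml(xml_string):
    """Drop all tags and return each <P> paragraph as a plain string."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of nested elements too; split/join
    # collapses line breaks and repeated spaces.
    return [" ".join("".join(p.itertext()).split()) for p in root.iter("P")]

def split_sentences(paragraph):
    """Naive boundary detection: split after . ! ? followed by a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)

sentences = [s for p in strip_xml(SAMPLE) for s in split_sentences(p)]
for s in sentences:
    print(s)
```

Each resulting sentence would then be passed on to the PoS tagger and the named entity and coreference annotators.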

NE annotation
Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.

– Kenneth Joseph Lenihan: Named Entity (Person name)
– New York: Named Entity (Location name)
– May 25: Named Entity (Temporal entity)
– Manhattan: Named Entity (Location name)

Semantic distance
• Semantic distance uses the concept of relatedness, or
the semantic similarity between two lexical concepts

• Grouping synonyms together seems intuitive, e.g. humans ≈ people ≈ beings

• But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.

• Different semantic distance algorithms for WordNet quantify “relatedness” in different ways (Budanitsky 2001)
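
One simple way to quantify relatedness is edge counting: treat every lexical relation as a link and take the length of the shortest path between two words. The sketch below uses a small hand-built graph (hypothetical, not real WordNet data) and is only one of the many possible measures; it is not necessarily the one used in this work.

```python
from collections import deque

# Hand-built toy relation graph (hypothetical; not real WordNet data).
EDGES = [
    ("human", "person", "synonym"),
    ("person", "being", "synonym"),
    ("BMW", "car", "hyponym"),
    ("car", "vehicle", "hyponym"),
    ("door", "house", "meronym"),
    ("tall", "short", "antonym"),
]

def build_graph(edges):
    """Undirected adjacency sets; the relation label is ignored for distance."""
    graph = {}
    for a, b, _relation in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def semantic_distance(graph, source, target):
    """Number of relation links on the shortest path, or None if unconnected."""
    if source not in graph or target not in graph:
        return None
    seen, queue = {source}, deque([(source, 0)])
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append((neighbour, dist + 1))
    return None

GRAPH = build_graph(EDGES)
print(semantic_distance(GRAPH, "human", "being"))
print(semantic_distance(GRAPH, "tall", "short"))
```

Under this measure an antonym pair such as tall/short is a single link apart, which illustrates how antonyms can count as semantically close.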

Semantic distance – continued
Dimensions are found by setting an inclusion distance, an
experimentally derived figure for semantic distance

The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset

All terms which are within the specified inclusion distance are grouped in the same dimension

Terms within a dimension serve as a starting point for searching related concepts
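
The grouping rule described on this slide can be sketched as a simple threshold filter. The distance values below are invented for illustration, and `build_dimension` is a hypothetical helper name; in practice the distances would come from a semantic distance algorithm and the threshold from experiments.

```python
# Hypothetical distances from the seed term "duck" (verb); the numbers are invented.
DISTANCES = {
    "avoid": 1.0, "elude": 1.0, "sidestep": 1.5, "crouch": 1.5,
    "straighten": 2.0, "quibble": 2.5, "swim": 4.0, "bank": 6.0,
}

def build_dimension(seed, distances, inclusion_distance):
    """Collect every term whose distance from the seed is within the threshold."""
    members = [term for term, d in distances.items() if d <= inclusion_distance]
    return {"seed": seed, "terms": sorted(members)}

narrow = build_dimension("duck:verb", DISTANCES, 1.0)
wide = build_dimension("duck:verb", DISTANCES, 2.0)
print(narrow["terms"])
print(wide["terms"])
```

Raising the inclusion distance widens the dimension, so more terms (and therefore more documents) are pulled into any search that starts from it.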

Dimension example

#dimension(duck:verb)

Synonyms: avoid, move, crouch, elude, sidestep
Antonyms: straighten, unbend
Troponyms: quibble
Hypernyms: avoid

Issues with dimensions
• Search space explosion (at least three times as many documents are returned)

• Stored semantic knowledge is not sufficiently granular:
– “duck” has 85 different entries in Roget’s thesaurus; 47
verb definitions, 21 noun definitions and 18 uses as an
adjective.
– However, the term database stores only the part of speech
tag. Thus, all 47 uses of “duck” as a verb are clumped
together

• WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms

The way forward
• Adding natural language processing techniques to search
is the answer
– Processing capabilities allow NLP techniques to be included
without significant degradation of speed
– Richer lexicons and language resources are being developed
– People are continually asking harder questions of available
information resources; keyword searches no longer satisfy
end users!

• IBM’s WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search

Thanks for listening

