QA IR Tutorial

AMADEUS TUTORIAL
Architectures Machines And Devices for Efficient Ubiquitous Systems
Combining Question Answering and Information Retrieval
Suresh Manandhar and Thimal Jayasooriya

University of York
UDA – Ubiquitous Digital Agents
Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers
• Search engines are not so good at reasoning with syntax and semantics

Beyond Search Engines
• Want shallow understanding of Natural Language to
support a range of applications:
– Document clustering
– Topic based searching
– Question based searching
– Interfacing with DB backends
– Linking multiple related documents
• Intelligent searching to aid
– scientific discovery, document organisation, automatic extraction of knowledge from textual data

Ingredients for building QA systems
• Spelling correction/checking
• PoS tagging
• Shallow Parsing
• Deep parsing and Logical form generation
• Lexical reasoning
• Matching
• … other kinds of reasoning

Part of Speech (POS) tagging
• PoS tagging is the pre-step to syntactic analysis
• Given “I went to the bank”, output:
I-pronoun went-verb-past to-prep the-det bank-noun-sg
• Notice “bank” can be a verb or a noun
• “I” can be a numeral or a pronoun
• And there can be unknown words:
I went to Pokhara
• Task of the PoS tagger is to assign the “correct” PoS tags

POS tagging
• current taggers employ HMMs trained on large corpora
• trigram-based p(t_0 | t_-2, t_-1) taggers are the current state-of-the-art, with accuracy > 96%
• Q-A systems: lots of unknown words, or known words used unusually, e.g. “what i want is ... ?”
• low tagging accuracy on unknown words: best systems reach ~84% to 88%
• far less tagged data on questions
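The trigram model on this slide conditions each tag on the two tags before it. A minimal sketch of estimating p(t_0 | t_-2, t_-1) by relative frequency is given here; the two-sentence corpus, the tag names and the function name `trigram_transitions` are illustrative assumptions, and a real tagger would add smoothing and word emission probabilities.

```python
from collections import Counter

# Toy hand-tagged corpus (hypothetical data, for illustration only).
TAGGED = [
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("the", "det"), ("bank", "noun")],
    [("I", "pron"), ("went", "verb"), ("to", "prep"), ("Pokhara", "noun")],
]

def trigram_transitions(sentences):
    """Return relative-frequency estimates of p(t0 | t-2, t-1),
    keyed by the tag triple (t-2, t-1, t0)."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        # Pad with two start symbols so the first real tag has a full history.
        tags = ["<s>", "<s>"] + [tag for _, tag in sent]
        for a, b, c in zip(tags, tags[1:], tags[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1  # counts each history that has a following tag
    return {k: n / bi[k[:2]] for k, n in tri.items()}

probs = trigram_transitions(TAGGED)
print(probs[("verb", "prep", "det")])   # 0.5
print(probs[("verb", "prep", "noun")])  # 0.5
```

With so little data, half of the probability mass after "verb prep" goes to each continuation, which hints at why sparse tagged data on questions hurts accuracy.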

Parsing and Grammar formalisms
• constructing canonical logical forms from sentences
• a good grammatical formalism will allow mapping:
“John bought a toy” / “A toy was bought by John”
…to roughly the same semantic representation
exists(Y): toy(Y) & buy(‘John’,Y)
• common syntactic phenomena include:
– relative clauses – The man that Mary liked went home.
– co-ordinations – Bill and Mary got married.
– question constructions – What book did John buy?
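As a rough illustration of the active/passive mapping on this slide, the sketch below takes clauses that are assumed to be already parsed and normalises both voices to one logical form. The `Clause` type and the `logical_form` function are hypothetical names chosen for this example, not part of any grammar formalism discussed here.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    """Output of a hypothetical parser for a simple transitive clause."""
    subject: str   # surface subject
    verb: str      # verb lemma
    other: str     # direct object (active) or by-phrase agent (passive)
    passive: bool

def logical_form(c: Clause) -> str:
    # In a passive clause the surface subject is the logical object.
    agent, patient = (c.other, c.subject) if c.passive else (c.subject, c.other)
    return f"exists(Y): {patient}(Y) & {c.verb}('{agent}',Y)"

# "John bought a toy" / "A toy was bought by John"
active = Clause(subject="John", verb="buy", other="toy", passive=False)
passive = Clause(subject="toy", verb="buy", other="John", passive=True)

print(logical_form(active))                            # exists(Y): toy(Y) & buy('John',Y)
print(logical_form(active) == logical_form(passive))   # True
```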

Parsing… Pure CFGs not sufficient
• most automatically extracted grammars employ pure CFGs
• need information on moved phrases to generate semantics
• dependency information is crucial
• most machine learnt grammars focus on raw crossing bracket rates

Parsing Issues: PP attachment ambiguity
• prepositional phrase attachment ambiguity
e.g. “I drank scotch on ice”
• better to leave PPs unattached rather than guessing wrong
• combine both position-based and meaning-based matching
• hybrid representations that combine logical form, syntactic structure and string/position based representations needed

Recollect: Search Engines
Question: How tall is Mt Everest?
– IR could give the following answers:
• “Mt Everest was first climbed by Hillary”
• “Mt Everest is part of the Himalayan range”
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
• … plus a large number of irrelevant answers
• Search engines are not so good at reasoning with syntax and semantics

Matching: Getting the right answer
Matching answers with questions
• “Susan Armstrong the 28 yr old rep from New York climbed the 8800m high Mt Everest”
Reasoning using logical form and lexical relations:
– Meaning representation: X = Mt_Everest & tall(X, ?Y)
– “How tall …” – asking for a numeric measure
– “tall” is related to “height/high”
– “the 8800m high Mt Everest”:
high(X,Y) & Y=8800m & X = Mt_Everest
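The matching step above can be sketched as a lookup over extracted facts, where the question predicate is expanded with lexically related predicates. This is a minimal sketch under stated assumptions: the `RELATED` table, the fact triples and the `match` function are hypothetical, and the facts are assumed to have been extracted from the candidate answers beforehand.

```python
# Hypothetical lexical-relation table: predicates treated as the same measure.
RELATED = {"tall": {"tall", "high", "height"}}

def match(question, facts):
    """question: (predicate, entity); facts: list of (predicate, entity, value).
    Return the values of facts about the same entity whose predicate is
    lexically related to the question predicate."""
    q_pred, q_entity = question
    allowed = RELATED.get(q_pred, {q_pred})
    return [v for p, e, v in facts if e == q_entity and p in allowed]

# Facts assumed to come from the candidate answer sentences.
facts = [
    ("part_of", "Mt_Everest", "Himalayan_range"),
    ("high", "Mt_Everest", "8800m"),
]

# tall(Mt_Everest, ?Y) is answered through high(Mt_Everest, 8800m).
print(match(("tall", "Mt_Everest"), facts))  # ['8800m']
```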

Matching – Lexical Relations
• Lexical relations are semantic relations between
words:
– Synonym : (human – person)
– Antonym: (tall – short)
– Hyponym: (BMW – car)
– Meronym: (door – house)
– Entailment: (fire – smoke)
– … plus many more
• Matching algorithm computes the semantic distance between the Question and the Answer.
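One simple way to turn lexical relations into a distance is to count the relation links on the shortest path between two words. The slides do not say which measure the matching algorithm uses, so the sketch below is only an illustration; the graph reuses the example pairs from this slide plus one made-up link (car, door), and all function names are assumptions.

```python
from collections import deque

# Toy relation graph (undirected, every link costs 1).
# ("car", "door") is a made-up link added only to connect the examples.
EDGES = [("human", "person"), ("tall", "short"), ("BMW", "car"),
         ("door", "house"), ("fire", "smoke"), ("car", "door")]

def build_graph(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def semantic_distance(graph, start, goal):
    """Number of relation links on the shortest path, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:  # breadth-first search
        word, dist = queue.popleft()
        if word == goal:
            return dist
        for nxt in graph[word] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None

G = build_graph(EDGES)
print(semantic_distance(G, "BMW", "car"))    # 1
print(semantic_distance(G, "BMW", "house"))  # 3
print(semantic_distance(G, "fire", "car"))   # None
```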

Matching and reasoning
• reasoning crucial to Q-A
• WordNet provides:
– hypernym, synonym, antonym, meronym, etc.
• (but) common relations required for Q-A tasks missing:
– noun-adjective (benefit-beneficial)
– verb-noun (punish-punishment)
– entailment (penalty-punishment)
– telic (hammer-break)
• limited current research on learning of semantic relations

Search engines
Current state of the art
– Google uses backlinks to determine the most relevant pages
– Most of the other search engines use keyword scanning techniques
– Online directories, such as DMOZ, use human editors to sort and rank content

Beyond Search Engines
Essential ingredients
– A better syntactic understanding of document contents
– More efficient means of grouping “similar” or related elements together
– A better understanding of “relevance” to the user … A better query interface?

The ideal situation
• Error-free disambiguation of natural language in documents
• Categorization of documents by subject and intent of the author, rather than by scanned keywords
• The opportunity to clarify the information needs of individual users, to match more closely what they want

Possible dimensions for queries …
Who is the First Lord of the Treasury of the United Kingdom?
[Diagram: candidate dimensions for the query]
– First Lords of the Treasury
– Prime Ministers
– Head of State in the UK

Are Document Dimensions an answer?
In data warehousing terminology, a dimension is “a structure that categorizes data in order to enable end users to answer questions”
Charles Bachman urged programmers to think in terms of multi-dimensional space as far back as 1973
Mothè experimented with document metadata in dimensional space (2001 and 2003)
Roelleke used the accessibility dimension to determine relevance within a document

How are dimensions created?
• Cleanse and tokenize source data
• Shallow parse the source data to resolve some syntactic ambiguity
• Extract a series of unique terms, words or phrases
• Determine the similarity between individual terms
• Organize similar terms into dimensions, groups of semantically related elements

Analysing source documents
The source data was the sample newswire articles from TREC-11: 3 gigabytes of XML-formatted data, consisting of around 20,000 articles
• Stripping XML formatting
• Detecting sentence boundaries in articles
• POS tagging individual sentences
• Named Entity and Coreference annotation
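The first two preprocessing steps can be sketched with the Python standard library alone. This is a minimal sketch, not the pipeline used for the TREC data: the tag names in the sample are made up rather than taken from the actual TREC schema, the sample text is invented, and the sentence splitter is a naive rule that fails on abbreviations.

```python
import re
import xml.etree.ElementTree as ET

def strip_xml(doc):
    """Return the text content of an XML document with whitespace normalised."""
    root = ET.fromstring(doc)
    return " ".join(" ".join(root.itertext()).split())

def split_sentences(text):
    # Naive rule: break after ., ! or ? when followed by whitespace and a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# Invented sample article with hypothetical tag names.
sample = "<DOC><TEXT>Stocks rose on Monday. Analysts were surprised.</TEXT></DOC>"

text = strip_xml(sample)
sentences = split_sentences(text)
print(sentences)  # ['Stocks rose on Monday.', 'Analysts were surprised.']
```

Each resulting sentence would then be passed on to the POS tagger and the named entity annotator.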

NE annotation
Kenneth Joseph Lenihan, a New York research sociologist who helped refine the scientific methods used in criminology, died May 25 at his home in Manhattan.
Named entities annotated in the example:
– Kenneth Joseph Lenihan: Named Entity (Person name)
– New York: Named Entity (Location name)
– May 25: Named Entity (Temporal entity)
– Manhattan: Named Entity (Location name)

Semantic distance
• Semantic distance uses the concept of relatedness, or the semantic similarity between two lexical concepts
• Grouping synonyms together seems intuitive,
e.g. humans ≈ people ≈ beings
• But surprisingly, other lexical concepts such as meronyms, hyponyms, hypernyms, troponyms and even antonyms can also be semantically close.
• Different semantic distance algorithms for WordNet quantify “relatedness” in different ways (Budanitsky 2001)

Semantic distance – continued
Dimensions are found by setting an inclusion distance, an experimentally derived figure for semantic distance
The inclusion distance differs between algorithms, and can sometimes even differ depending on the dataset
All terms which are within the specified inclusion distance are grouped in the same dimension
Terms within a dimension serve as a starting point for searching related concepts
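Grouping by inclusion distance amounts to a threshold test against a seed term. The sketch below assumes a distance function is supplied from elsewhere; the distance values, the seed and the name `build_dimension` are made up for illustration and are not results from the work described here.

```python
def build_dimension(seed, terms, distance, inclusion_distance):
    """Return the terms whose distance to the seed is within the threshold."""
    return sorted(t for t in terms
                  if t != seed and distance(seed, t) <= inclusion_distance)

# Hypothetical precomputed distances from the seed "duck" (verb); values are made up.
TOY_DISTANCE = {"avoid": 1.0, "elude": 1.5, "sidestep": 2.0, "quibble": 3.5, "bank": 9.0}

def toy_distance(seed, term):
    # Stand-in for a real semantic distance algorithm.
    return TOY_DISTANCE[term]

terms = list(TOY_DISTANCE)
print(build_dimension("duck", terms, toy_distance, inclusion_distance=2.0))
# ['avoid', 'elude', 'sidestep']
print(build_dimension("duck", terms, toy_distance, inclusion_distance=4.0))
# ['avoid', 'elude', 'quibble', 'sidestep']
```

Raising the inclusion distance admits more loosely related terms, which is one reason the figure has to be tuned per algorithm and sometimes per dataset.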

Dimension example
#dimension(duck:verb)
Synonyms: avoid, move, crouch, elude, sidestep
Antonyms: straighten, unbend
Troponyms: quibble
Hypernyms: avoid

Issues with dimensions
• Search space explosion (at least thrice the number of documents are returned)
• Stored semantic knowledge is not sufficiently granular:
– “duck” has 85 different entries in Roget’s thesaurus; 47 verb definitions, 21 noun definitions and 18 uses as an adjective.
– However, the term database stores only the part of speech tag. Thus, all 47 uses of “duck” as a verb are clumped together
• WordNet is not sufficiently rich in lexical relations, nor sufficiently inclusive of modern language idioms

The way forward
• Adding Natural language processing techniques to search
is the answer
– Processing capabilities allow NLP techniques to be included
without significant degradation of speed
– Richer lexicons and language resources are being developed
– People are continually asking harder questions of available
information resources; keyword searches no longer satisfy
end users!
• IBM’s WebFountain and the MOMINIS research project are two of several research initiatives to bring focused crawling and natural language processing techniques to search

Thanks for listening
