Tutorial: Lucene - Toolkit For Information Retrieval

Data Mining
Spring 2015
Lucene: An Open Source Toolkit for Information Retrieval
Lucene: an Open Source IR Toolkit

We are going to learn:
How to index documents using Lucene
How to represent a query using Lucene
How to process a query using Lucene
Reference
Lucene in Action, 2nd Edition (by Michael McCandless, Erik Hatcher, Otis Gospodnetić)
Lucene is freely available at http://lucene.apache.org/core/ under the Apache open source license.
Lucene Structure
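[Figure: a Document (field1, field2, field3) is passed to IndexWriter via addDocument() and stored in the Lucene Index; a Query is passed to IndexSearcher via search(), which returns Hits (matching docs) from the index.]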
Indexing in Lucene
The Lucene indexer is based on the inverted index concept (remember posting lists): each term maps to the list of documents that contain it.
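To make the posting-list idea concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's implementation or index format; the class name InvertedIndexSketch and its tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy inverted index, not Lucene's index format.
class InvertedIndexSketch {

    // Maps each lowercased term to the ascending list of IDs of documents containing it.
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
                // Documents are visited in order, so checking the tail avoids duplicates.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Lucene in Action", "Lucene for Dummies",
                "Managing Gigabytes", "The Art of Computer Science");
        Map<String, List<Integer>> index = build(docs);
        System.out.println("lucene -> " + index.get("lucene"));
        System.out.println("art -> " + index.get("art"));
    }
}
```

Looking up a term is then a single map access, which is why searching an inverted index does not require scanning every document.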
Indexing in Lucene (Example)

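import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Build an in-memory index (this is the Lucene 3.0-era API).
Directory directory = new RAMDirectory();
Analyzer analyzer = new SimpleAnalyzer();
// true = create a new index; LIMITED caps the number of terms indexed per field
IndexWriter writer = new IndexWriter(directory, analyzer, true,
        IndexWriter.MaxFieldLength.LIMITED);
addDoc(writer, "Lucene in Action"); // first document
addDoc(writer, "Lucene for Dummies"); // second document
addDoc(writer, "Managing Gigabytes"); // third document
addDoc(writer, "The Art of Computer Science"); // fourth document
writer.close();

// Wraps the value in a single-field Document and adds it to the index.
private static void addDoc(IndexWriter writer, String value) throws IOException {
    Document doc = new Document();
    doc.add(new Field("contents", value, Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
}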
Field Options
Field.Index.ANALYZED: Breaks the field's value into a stream of separate tokens and makes each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).
Field.Index.NOT_ANALYZED: Indexes the field, but treats the entire field's value as a single token. Useful for fields such as URLs.
Field.Index.NO: Does not make this field's value available for searching.
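The difference between the analyzed and not-analyzed options can be sketched in plain Java as follows. This only mimics the effect of the options; it is not Lucene's analyzer code. The class and method names are hypothetical, and the tokenizer handles ASCII letters only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: mimics the effect of the field options, not Lucene's analyzers.
class FieldOptionSketch {

    // Like an ANALYZED field: lowercase the value and split it at non-letters
    // (ASCII letters only in this sketch), so each token is searchable.
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Like a NOT_ANALYZED field: the whole value is kept as one searchable token.
    static List<String> notAnalyzed(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        String url = "http://lucene.apache.org/core/";
        System.out.println(analyzed(url));
        System.out.println(notAnalyzed(url));
    }
}
```

With the analyzed form, a search for the exact URL can no longer match it as a unit, which is why a URL field is usually kept as a single token.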
Deleting Documents
Lucene's IndexWriter provides a variety of methods for deleting indexed documents.
deleteDocuments(Term): deletes all documents containing the provided term.
deleteDocuments(Term[]): deletes all documents containing any of the terms in the provided array.
deleteDocuments(Query): deletes all documents matching the provided query.
deleteDocuments(Query[]): deletes all documents matching any of the queries in the provided array.
deleteAll(): deletes all documents in the index.
Deleting Single Document

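If you want to delete a single document, the best solution is to store a unique ID for each document in a separate field and delete by that ID.
IndexWriter writer = getWriter();
// computer is assumed here to be a Term that identifies the document to delete
writer.deleteDocuments(computer);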
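The reason for the unique ID field can be shown with a toy document store in plain Java. This is a sketch of the idea only, not Lucene's IndexWriter; the class DeleteByTermSketch and the field names "id" and "contents" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy document store, not Lucene's IndexWriter.
class DeleteByTermSketch {

    // Each document is a map from field name to value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(String id, String contents) {
        docs.add(Map.of("id", id, "contents", contents));
    }

    // Deletes every document whose field contains the term; returns the number deleted.
    int deleteDocuments(String field, String term) {
        int before = docs.size();
        docs.removeIf(doc -> {
            String value = doc.getOrDefault(field, "").toLowerCase(Locale.ROOT);
            return Arrays.asList(value.split("\\s+")).contains(term);
        });
        return before - docs.size();
    }

    int size() {
        return docs.size();
    }
}
```

Deleting by a content term such as "lucene" removes every document that contains it, whereas deleting by a term on the unique ID field removes exactly one document.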
Expressing Query in Lucene

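public static void main(String[] args) throws ParseException
{
    String querystr = "hello world";
    // "contents" is the default field; analyzer is the Analyzer used at indexing time
    QueryParser parser = new QueryParser("contents", analyzer);
    Query query = parser.parse(querystr);
}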
Query Expression
The default operator in Lucene is OR.
The query "hello world" is equivalent to "hello OR world".
The class QueryParser parses the query expression into a Query object.
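The effect of the default OR operator can be sketched in plain Java by comparing it with AND matching. This is an illustration of the semantics only, not QueryParser's implementation; the class DefaultOperatorSketch and its tokenization rule are assumptions.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: OR versus AND matching, not Lucene's query evaluation.
class DefaultOperatorSketch {

    // Lowercases the text and splits it into a set of alphanumeric tokens.
    static Set<String> tokens(String text) {
        Set<String> out = new TreeSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns the IDs of matching documents. With requireAll = false (OR), a document
    // matches if it contains at least one query term; with true (AND), it needs all.
    static Set<Integer> match(List<String> docs, String query, boolean requireAll) {
        Set<String> queryTerms = tokens(query);
        Set<Integer> hits = new TreeSet<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            Set<String> docTerms = tokens(docs.get(docId));
            boolean matches = requireAll
                    ? docTerms.containsAll(queryTerms)
                    : queryTerms.stream().anyMatch(docTerms::contains);
            if (matches) {
                hits.add(docId);
            }
        }
        return hits;
    }
}
```

For the documents "hello there", "world news", "hello world" and "goodbye", the query "hello world" matches the first three under OR but only the third under AND.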
In Lucene we can specify several kinds of query expressions, for example term, phrase, Boolean, wildcard, fuzzy, and range queries.
Processing Query
After expressing the query, we need to create a Searcher to search the index. Then we need to collect the top 10 scoring hits, either with a TopScoreDocCollector or, as here, by calling search(query, 10).
Searcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
System.out.println(topDocs.totalHits + " total results");
// scoreDocs holds at most 10 entries, even when totalHits is larger
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc match = topDocs.scoreDocs[i];
    Document doc = searcher.doc(match.doc);
    System.out.println(doc.get("contents"));
}
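Collecting only the top-scoring hits can be sketched with a bounded priority queue in plain Java. This shows the general technique, not Lucene's TopScoreDocCollector code; the class TopKSketch and its score array input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: top-k selection with a bounded min-heap.
class TopKSketch {

    // Returns the indices of the k highest scores, best first.
    // If there are fewer than k scores, all indices are returned.
    static List<Integer> topK(double[] scores, int k) {
        Comparator<Integer> byScore = Comparator.comparingDouble((Integer i) -> scores[i]);
        // Min-heap on score: the weakest of the current candidates sits at the head.
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int doc = 0; doc < scores.length; doc++) {
            heap.add(doc);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest-scoring candidate
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(byScore.reversed());
        return result;
    }
}
```

The heap never holds more than k + 1 entries, so memory stays bounded no matter how many documents match the query.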
How Lucene Scores Documents

By default, Lucene's ranking strategy is based on the vector space model.
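As a rough illustration of vector space ranking, the sketch below scores a document against a query by the cosine similarity of their tf-idf vectors. It is a textbook-style formulation and not Lucene's exact scoring formula; the class VectorSpaceSketch, the raw term-frequency weighting, and the idf variant 1 + ln(N / (df + 1)) are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: textbook tf-idf cosine scoring, not Lucene's exact formula.
class VectorSpaceSketch {

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Inverse document frequency: rarer terms get larger weights (assumed variant).
    static double idf(String term, List<String> docs) {
        long df = docs.stream().filter(d -> tokens(d).contains(term)).count();
        return 1.0 + Math.log((double) docs.size() / (df + 1));
    }

    // Builds a sparse tf-idf vector: term -> (raw term frequency * idf).
    static Map<String, Double> vector(String text, List<String> docs) {
        Map<String, Double> v = new HashMap<>();
        for (String t : tokens(text)) {
            v.merge(t, 1.0, Double::sum);
        }
        v.replaceAll((term, tf) -> tf * idf(term, docs));
        return v;
    }

    // Cosine of the angle between two sparse vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    static double score(String query, String doc, List<String> docs) {
        return cosine(vector(query, docs), vector(doc, docs));
    }
}
```

For the query "lucene action" over the four example titles, "Lucene in Action" scores higher than "Lucene for Dummies", and "Managing Gigabytes" scores zero because it shares no term with the query.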
Explaining Rank Scores

With the Explanation class, Lucene can explain the rank score of a document by breaking it down into the factors that contribute to it.
Searcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
System.out.println(topDocs.totalHits + " total results");
// scoreDocs holds at most 10 entries, even when totalHits is larger
for (int i = 0; i < topDocs.scoreDocs.length; i++) {
    ScoreDoc match = topDocs.scoreDocs[i];
    // Ask the searcher how the score of this hit was computed
    Explanation explanation = searcher.explain(query, match.doc);
    Document doc = searcher.doc(match.doc);
    System.out.println(doc.get("contents"));
    System.out.println(explanation.toString());
}