
Lucene and Solr

Lucene
Doug Cutting
Created in 1999
Donated to Apache in 2001

Features
Highly scalable
Java (1.4)
Ports to many other languages
No crawler
No document parsing
No PageRank

Lucene
Powered by Lucene
IBM Omnifind Y! Edition
Technorati
Wikipedia
Internet Archive
LinkedIn
monster.com

Indexing

Logical structure
An index is a collection of documents
A document is a collection of fields
Fields hold the content

Indexed terms stored in inverted index

Stored: kept verbatim for retrieval with results
Indexed: tokenized and made searchable

Physical structure
Multiple documents (with all fields) are stored in segments
All segments together make up the index
mergeFactor

IndexWriter is the interface object for the entire index

Indexing
Documents
0  Little Red Riding Hood
1  Robin Hood
2  Little Women

Inverted index (term: document ids)
aardvark
hood      0 1
little    0 2
red       0
riding    0
robin     1
women     2
zoo
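The term-to-document mapping above can be sketched in a few lines of Python. This is only an illustration of the inverted-index idea, not Lucene's actual data structure; the function name `build_inverted_index` and the whitespace/lowercase tokenization are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: split on whitespace and lowercase, which is enough for these titles.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
for term, postings in index.items():
    print(term, postings)
```

Looking up a term is then a single dictionary access, which is why searching an inverted index does not need to scan the documents themselves.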

Indexing

Analysis
Extract tokens from text (tokenizer)
Whitespace
Hyphens

Manipulate or modify tokens (token filter)
Stemming
Removal

Tokenizer / Token Filter chains are called analyzers
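As a rough sketch of the tokenizer and token filter chain idea, the Python below composes one tokenizer with a list of filters. It only mimics the concept and is not Lucene's API; the helper names (`make_analyzer`, `stop_filter`) and the tiny stop-word list are assumptions, and reading "Removal" as stop-word removal is also an assumption.

```python
def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=frozenset({"a", "an", "the", "of"})):
    """Token filter: drop tokens found in a (hypothetical) stop-word list."""
    return [t for t in tokens if t not in stop_words]


def make_analyzer(tokenizer, *filters):
    """Analyzer: one tokenizer followed by zero or more token filters, applied in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, stop_filter)
print(analyzer("The Call of the Wild"))
```

The order of the filters matters here: lowercasing runs first so that "The" is recognized as the stop word "the".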

Indexing
Input text: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1): Lex | Corp | LexCorp | BFG | 9000
LowercaseFilter: lex | corp | lexcorp | bfg | 9000

Searching

Query Creation
Query parser, e.g. title:Bell author:Hemmingway^3.0
Manual query construction from terms

Query terms are analyzed


Same analyzer for indexing and searching on each field

Searching
Indexed text: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter (catenateWords=1): Lex | Corp | LexCorp | BFG | 9000
LowercaseFilter: lex | corp | lexcorp | bfg | 9000

Query text: Lex corp bfg9000
WhitespaceTokenizer: Lex | corp | bfg9000
WordDelimiterFilter (catenateWords=0): Lex | corp | bfg | 9000
LowercaseFilter: lex | corp | bfg | 9000

A Match!
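The index-side and query-side chains can be imitated in Python to see why the two inputs end up matching. This is a simplified stand-in for WordDelimiterFilter, not its real implementation: the regular expression, the function names, and the rule "join all alphabetic parts of a token when catenating" are assumptions chosen to reproduce the example.

```python
import re

# Split a token into runs of capitals, capitalized/lowercase words, and digits.
PART_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def whitespace_tokenizer(text):
    return text.split()


def word_delimiter_filter(tokens, catenate_words=False):
    """Simplified imitation: split on case changes, letter/digit changes and punctuation."""
    out = []
    for token in tokens:
        parts = PART_RE.findall(token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        # Simplification: when catenating, also emit all alphabetic parts joined together.
        if catenate_words and len(words) > 1:
            out.append("".join(words))
    return out


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def analyze(text, catenate_words):
    tokens = whitespace_tokenizer(text)
    tokens = word_delimiter_filter(tokens, catenate_words)
    return lowercase_filter(tokens)


index_terms = analyze("LexCorp BFG-9000", catenate_words=True)
query_terms = analyze("Lex corp bfg9000", catenate_words=False)
is_match = set(query_terms) <= set(index_terms)
print(index_terms, query_terms, is_match)
```

Every query term appears among the indexed terms, so the document is a candidate match even though the raw strings differ in case, spacing and punctuation.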

Searching

Many query types


Term

Phrase
"bad wolf"

Proximity
"quick fox"~4

Wildcard / prefix
pla?e (plate or place or plane)
practic* (practice or practical or practically)

Fuzzy (edit distance)
planting~0.75 (granting or planning)
roam~ (default is 0.5)

Range
date:[05072007 TO 05232007] (inclusive)
author:{king TO mason} (exclusive)
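To make the wildcard and fuzzy examples concrete, here is a small Python sketch. It uses the standard library's `fnmatch` as a stand-in for `?` and `*` matching and a plain Levenshtein edit distance for the fuzzy case; it does not reproduce how Lucene turns an edit distance into a similarity score, and the vocabulary list is made up for the example.

```python
from fnmatch import fnmatchcase


def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical term dictionary for illustration only.
vocabulary = ["plate", "place", "plane", "planet",
              "practice", "practical", "practically",
              "planning", "granting"]

single_char = [t for t in vocabulary if fnmatchcase(t, "pla?e")]
prefix = [t for t in vocabulary if fnmatchcase(t, "practic*")]
distances = {t: edit_distance("planting", t) for t in ("planning", "granting")}

print(single_char)
print(prefix)
print(distances)
```

Both fuzzy candidates are within two edits of "planting", which is consistent with the slide listing them as matches.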

Searching

Multiple searchers at once


Thread safe

Additions or deletions to index are not reflected in already open searchers


Must be closed and reopened

Use commit or optimize on the IndexWriter

Lucene Sub-projects

Nutch
Web crawler with document parsing

Hadoop
Distributed data processor
Implements MapReduce

Solr

Solr
Yonik Seeley
Developed at CNET
Donated to Apache in 2006

Features
Servlet
Web Administration Interface
XML/HTTP, JSON Interfaces
Faceting
Schema to define types and fields
Highlighting
Caching
Index Replication (Master / Slaves)
Pluggable
Java 5

Solr
Powered by Solr
Netflix
CNET
Smithsonian
AOL:sports and music
RightNow ??
Drupal module
GameSpot

Configuration (solrconfig.xml)
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>
<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>

Schema (schema.xml)
Fields
<uniqueKey>id</uniqueKey>
<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>
<copyField source="keywords" dest="keywordsSorted"/>
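How a field name might be resolved against explicit fields and dynamicField patterns can be sketched as follows. This is an approximation for illustration, not Solr's lookup code: it uses Python's `fnmatch` for the `*` patterns, checks explicit fields first, and does not model how Solr chooses between several matching dynamic patterns. The function name `resolve_type` is hypothetical.

```python
from fnmatch import fnmatchcase

# Explicit fields and dynamic patterns copied from the schema fragment above.
FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}
DYNAMIC_FIELDS = [("*_i", "integer"), ("desc_*", "string")]


def resolve_type(field_name):
    """Return the field type for a name, or None if nothing in the schema matches."""
    if field_name in FIELDS:
        return FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


for name in ("products", "price_i", "desc_short", "unknown"):
    print(name, resolve_type(name))
```

Dynamic fields let documents carry fields such as `price_i` without each one being declared in schema.xml ahead of time.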

Schema
Analyzers
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>

Insertion

HTTP POST to http://localhost:8983/solr/update/


<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>
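A minimal Python sketch of building such an add message and posting it is shown below. The helper names `build_add_message` and `post_update` are hypothetical, the `text/xml` content type is an assumption about what the update handler expects, and `post_update` is defined but not called because it needs a running Solr instance at the URL from the slide.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> message; each doc is a list of (field name, value) pairs."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            # Repeating a field name yields a multi-valued field, as with "skills".
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def post_update(xml_message, url="http://localhost:8983/solr/update/"):
    """POST the message to the update URL (requires a running Solr server)."""
    request = urllib.request.Request(
        url,
        data=xml_message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


message = build_add_message([
    [("employeeId", "05991"), ("office", "Bridgewater"),
     ("skills", "Perl"), ("skills", "Java")],
])
print(message)
```

Building the XML with ElementTree rather than string concatenation means special characters in field values are escaped automatically.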

Documents or fields can have boosts attached

Update / Delete
Inserting a document whose uniqueKey is already present replaces the original

Deleting

By uniqueKey field
<delete><id>05991</id></delete>

By query
<delete><query>name:Anthony</query></delete>

<commit/>
<optimize/>

Search

Core parameters
qt: query type (request handler)
wt: writer type (response writer)

Common parameters
q
sort
start
rows
fq: filters
fl: return fields
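The parameters above are passed as an ordinary URL query string. The Python sketch below assembles one with the standard library; the helper name `build_select_url` and the specific values (filter, sort order, field list) are made up for illustration, while the base URL is the select URL used elsewhere in these slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(params, base="http://localhost:8983/solr/select"):
    """Return a select URL; params is a list of (name, value) pairs, repeats allowed."""
    return base + "?" + urlencode(params)


url = build_select_url([
    ("q", "ipod"),           # the query itself
    ("fq", "inStock:true"),  # filter (hypothetical value)
    ("sort", "id asc"),      # sort order (hypothetical value)
    ("start", "0"),          # offset of the first result
    ("rows", "10"),          # number of results to return
    ("fl", "id,cat"),        # fields to return (hypothetical value)
    ("wt", "xml"),           # response writer
])
print(url)
print(parse_qs(urlsplit(url).query))
```

Passing the parameters as a list of pairs, rather than a dictionary, keeps it possible to repeat a parameter such as `fq`.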

Search

Faceting
Available in StandardRequestHandler and DisMaxRequestHandler

Search
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=1&facet.field=cat&facet.mincount=1&facet.field=inStock

<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
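A client could read the facet counts out of a response like this with a few lines of Python. The sketch below parses a copy of the response shown on the slide using the standard library's ElementTree; the function name `facet_counts` is hypothetical, and the code assumes the XML response layout shown here.

```python
import xml.etree.ElementTree as ET

# Copy of the facet response from the slide.
RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>3</QTime></responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def facet_counts(response_xml):
    """Return {facet field: {value: count}} from an XML facet response."""
    root = ET.fromstring(response_xml.strip())
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    return {
        field.get("name"): {e.get("name"): int(e.text) for e in field.findall("int")}
        for field in root.findall(path)
    }


counts = facet_counts(RESPONSE)
num_found = int(ET.fromstring(RESPONSE.strip()).find("result").get("numFound"))
print(num_found, counts)
```

Because the request used rows=0, the result element carries only the hit count and the facet lists carry all of the useful data.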

Many more features

Replication
Master / Slave architecture for load balancing and backups

More-like-this
Easy to add RequestHandlers and ResponseWriters
Responses in many formats
Hit highlighting

Sources

http://lucene.apache.org/
http://lucene.apache.org/solr/
http://people.apache.org/~yonik/presentations/
