Lucene and Solr

Lucene
Doug Cutting
Created in 1999
Donated to Apache in 2001
Features
Highly scalable
Java (1.4)
Ports to many other languages
No crawler
No document parsing
No PageRank
Lucene
Powered by Lucene
IBM Omnifind Y! Edition, Technorati, Wikipedia, Internet Archive, LinkedIn, monster.com
Indexing
Logical structure
Index is a collection of documents
Documents are a collection of fields
Fields are the content
  Stored: stored verbatim for retrieval with results
  Indexed: tokenized and made searchable
Indexed terms are stored in an inverted index
Physical structure
Multiple documents (with all fields) are stored in segments
All segments together make up the index
  mergeFactor
IndexWriter is the interface object for the entire index
Indexing
Documents
  0  Little Red Riding Hood
  1  Robin Hood
  2  Little Women

Inverted index (term: document IDs)
  aardvark
  hood: 0, 1
  little: 0, 2
  red: 0
  riding: 0
  robin: 1
  women: 2
  zoo
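
The term-to-document mapping above can be sketched in a few lines of Python. This is an illustrative toy, not Lucene code: the function name `build_inverted_index` is hypothetical, and it assumes the simplest possible analysis (lowercasing and splitting on whitespace).

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the sorted IDs
    of the documents that contain it (toy model, not Lucene's format)."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms alphabetically, as in the dictionary shown on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up `index["hood"]` returns the IDs of both titles that contain the term, which is the lookup a term query performs.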
Indexing
Analysis
Extract tokens from text (tokenizer)
  Whitespace
  Hyphens
Manipulate or modify tokens (token filter)
  Stemming
  Removal
Tokenizer / token filter chains are called analyzers
Indexing
Input: LexCorp BFG-9000
WhitespaceTokenizer: LexCorp | BFG-9000
WordDelimiterFilter catenateWords=1: Lex Corp LexCorp | BFG 9000
LowercaseFilter: lex corp lexcorp | bfg 9000
Searching
Query Creation
Query parser
Manual query construction from terms
title:Bell author:Hemingway^3.0
Query terms are analyzed

Same analyzer for indexing and searching on each field
Searching
Index-time analysis of the document text "LexCorp BFG-9000"
  WhitespaceTokenizer: LexCorp | BFG-9000
  WordDelimiterFilter catenateWords=1: Lex Corp LexCorp | BFG 9000
  LowercaseFilter: lex corp lexcorp | bfg 9000

Query-time analysis of the query "Lex corp bfg9000"
  WhitespaceTokenizer: Lex | corp | bfg9000
  WordDelimiterFilter catenateWords=0: Lex | corp | bfg 9000
  LowercaseFilter: lex | corp | bfg 9000

A Match!
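
The two chains can be imitated with a small Python sketch to see why the query matches. The functions below are hypothetical stand-ins, not the Lucene or Solr classes: the word-delimiter step is a simplified approximation that splits on case changes, letter/digit boundaries, and punctuation, and the final match check (every query term appears among the indexed terms) ignores token positions.

```python
import re


def whitespace_tokenizer(text):
    """Split text on runs of whitespace."""
    return text.split()


def word_delimiter_filter(tokens, catenate_words=False):
    """Simplified approximation: split each token into subwords at
    lower-to-upper case changes, letter/digit boundaries, and punctuation.
    With catenate_words, also emit the joined alphabetic subwords."""
    out = []
    for token in tokens:
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        if catenate_words and len(words) > 1:
            out.append("".join(words))
    return out


def lowercase_filter(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def analyze(text, catenate_words):
    """Run the tokenizer followed by both filters."""
    tokens = whitespace_tokenizer(text)
    tokens = word_delimiter_filter(tokens, catenate_words=catenate_words)
    return lowercase_filter(tokens)


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
matches = set(query_tokens) <= set(index_tokens)
print(index_tokens, query_tokens, matches)
```

Both sides reduce to the shared terms lex, corp, bfg, and 9000, so the differently written query still finds the document.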
Searching
Many query types

Term
Phrase
  "bad wolf"
Proximity
  "quick fox"~4
Wildcard and prefix
  pla?e (plate or place or plane)
  practic* (practice or practical or practically)
Fuzzy (edit distance)
  planting~0.75 (granting or planning)
  roam~ (default is 0.5)
Range
  date:[05072007 TO 05232007] (inclusive)
  author:{king TO mason} (exclusive)
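
Fuzzy queries rest on edit distance, which can be sketched as a plain Levenshtein computation. The function below is an illustration only: the name `edit_distance` is hypothetical, and how Lucene converts a distance into the 0 to 1 similarity value used in `planting~0.75` is not described in these slides and is not modeled here.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


for candidate in ("planning", "granting"):
    print(candidate, edit_distance("planting", candidate))
```

Both example matches from the slide are within two edits of "planting", which is why a fuzzy query can reach them.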
Searching

Multiple searchers at once
  Thread safe
Additions or deletions to the index are not reflected in already open searchers
  Searchers must be closed and reopened
  Use commit or optimize on the IndexWriter
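
The snapshot behaviour described above can be modeled with a toy sketch. `ToyIndex` and `ToySearcher` are hypothetical classes invented for illustration; they are not the Lucene API and only mimic the idea that a searcher sees the index as it was when the searcher was opened.

```python
class ToySearcher:
    """Searches a fixed snapshot taken when the searcher was opened."""

    def __init__(self, snapshot):
        self._snapshot = snapshot

    def search(self, term):
        return [doc for doc in self._snapshot if term in doc.lower().split()]


class ToyIndex:
    """Holds committed documents plus additions that await a commit."""

    def __init__(self):
        self._committed = ()
        self._pending = []

    def add(self, doc):
        self._pending.append(doc)

    def commit(self):
        # Publish pending additions as a new immutable snapshot.
        self._committed = self._committed + tuple(self._pending)
        self._pending = []

    def open_searcher(self):
        return ToySearcher(self._committed)


index = ToyIndex()
index.add("Robin Hood")
index.commit()
old_searcher = index.open_searcher()

index.add("Little Red Riding Hood")
index.commit()
new_searcher = index.open_searcher()

print(old_searcher.search("hood"))  # still sees only the first document
print(new_searcher.search("hood"))  # the reopened searcher sees both
```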
Lucene Sub-projects

Nutch
  Web crawler with document parsing
Hadoop
  Distributed data processor
  Implements MapReduce
Solr
Solr
Yonik Seeley
Developed at CNET
Donated to Apache in 2006
Features
Servlet
Web Administration Interface
XML/HTTP, JSON Interfaces
Faceting
Schema to define types and fields
Highlighting
Caching
Index Replication (Master / Slaves)
Pluggable
Java 5
Solr
Powered by Solr
Netflix, CNET, Smithsonian, AOL: sports and music, RightNow ??, Drupal module, GameSpot
Configuration (solrconfig.xml)
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>

<requestHandler name="standard" class="solr.StandardRequestHandler" />
<requestHandler name="custom" class="your.package.CustomRequestHandler" />

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>
Schema (schema.xml)
Fields
<uniqueKey>id</uniqueKey>

<field name="products" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true"/>
<field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>

<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
<dynamicField name="desc_*" type="string" indexed="true" stored="false"/>

<copyField source="keywords" dest="keywordsSorted"/>
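
How a field name might be resolved against these declarations can be sketched as follows. This is an assumption-laden illustration: `resolve_field_type` is a hypothetical helper, explicit fields are assumed to win over dynamic patterns, and Solr's actual rule for choosing between several matching dynamic fields is not modeled.

```python
from fnmatch import fnmatchcase

# Field declarations copied from the schema fragment above.
STATIC_FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}
DYNAMIC_FIELDS = [("*_i", "integer"), ("desc_*", "string")]


def resolve_field_type(name):
    """Return the declared type for a field name, or None if nothing matches.
    Assumption: explicit fields take precedence over dynamic patterns."""
    if name in STATIC_FIELDS:
        return STATIC_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for name in ("keywords", "count_i", "desc_en", "unknown"):
    print(name, resolve_field_type(name))
```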
Schema
Analyzers
<fieldtype name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>

<fieldtype name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>

<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
  </analyzer>
</fieldtype>
Insertion
HTTP POST to http://localhost:8983/solr/update/

<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
    <field name="skills">Perl</field>
    <field name="skills">Java</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>
Documents or fields can have boosts attached
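
A client could assemble such an add message and the matching POST request roughly as below. This is a sketch using only the Python standard library; the helper names `build_add_message` and `build_update_request` are hypothetical, the `text/xml` content type is an assumption, boosts are left out, and the request is built but not sent so the example runs without a Solr server.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Serialize a list of dicts as an <add> message.
    A list value produces one <field> element per item (multi-valued field)."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, values in doc.items():
            if not isinstance(values, list):
                values = [values]
            for value in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, url="http://localhost:8983/solr/update/"):
    """Prepare (but do not send) the HTTP POST carrying the message."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed type
        method="POST",
    )


employee = {"employeeId": "05991", "office": "Bridgewater",
            "skills": ["Perl", "Java"]}
body = build_add_message([employee])
request = build_update_request(body)
print(body)
# To actually send it to a running server:
# urllib.request.urlopen(request)
```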
Update / Delete
Inserting a document with an already present uniqueKey will erase the original
Deleting
  By uniqueKey field
    <delete><id>05991</id></delete>
  By query
    <delete><query>name:Anthony</query></delete>
<commit/>
<optimize/>
Search
Core parameters
  qt: query type (request handler)
  wt: writer type (response writer)
Common parameters
  q
  sort
  start
  rows
  fq: filters
  fl: return fields
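
These parameters are ordinary URL query parameters, so a request URL can be put together with the standard library. In this sketch the helper `build_select_url` is hypothetical, and the field names `id`, `name`, and `price` as well as the filter `inStock:true` are made-up values for illustration.

```python
from urllib.parse import urlencode


def build_select_url(params, base="http://localhost:8983/solr/select"):
    """Join the base URL and URL-encoded parameters.
    params is a list of (name, value) pairs so a name can repeat."""
    return base + "?" + urlencode(params)


url = build_select_url([
    ("q", "ipod"),            # the query itself
    ("fq", "inStock:true"),   # filter (hypothetical value)
    ("fl", "id,name"),        # fields to return (hypothetical names)
    ("start", 0),             # offset of the first result
    ("rows", 10),             # number of results to return
    ("sort", "price asc"),    # sort order (hypothetical field)
    ("wt", "json"),           # response writer
])
print(url)
```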
Search
Faceting
Available in StandardRequestHandler and DisMaxRequestHandler
Search
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock

<response>
  <responseHeader>
    <status>0</status>
    <QTime>3</QTime>
  </responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
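
On the client side, the facet counts can be pulled out of such a response with a standard XML parser. The sketch below embeds a trimmed copy of the response shown above and uses a hypothetical helper, `parse_facet_fields`; it assumes the response has exactly this shape.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the facet response from the slide.
RESPONSE = """
<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def parse_facet_fields(xml_text):
    """Return {field name: {facet value: count}} from a facet response."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    facets = {}
    for field in root.findall(path):
        facets[field.get("name")] = {
            count.get("name"): int(count.text) for count in field.findall("int")
        }
    return facets


facets = parse_facet_fields(RESPONSE)
print(facets)
```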
Many more features
Replication
  Master / Slave architecture for load balancing and backups
More-like-this
Easy to add RequestHandlers and ResponseWriters
Responses in many formats
Hit highlighting
Sources

http://lucene.apache.org/
http://lucene.apache.org/solr/
http://people.apache.org/~yonik/presentations/
