Search Engines Performance - Explained

About me
Over 12 years in the software world
- Israeli Air Force
- Israel Discount Bank
- SAP
Team Leader, System Architect

Java Eco-System, Continuous Delivery, Search, Big Data
Contact: @alonaizenberg, alonaizenberg.blogspot.com, alon.aizenberg@gmail.com
Apache Solr
In this talk we use Apache Solr as the example search engine, but most of the concepts and mechanisms apply to the majority of products available on the market.
Agenda
- History
- Market at a glance
- Anatomy of a typical search system
- Scenarios and problems
- Scaling the search scenario
  - Handling large data-sets
  - Handling request load
- Achieving high availability
History
1994 - Lycos
1995 - AltaVista, Yahoo!
1997 - Yandex
1998 - Google, MSN Search
2000 - First Lucene version (marks the rise of custom search implementations)
2006 - Ask.com, AOL Search
2009 - Bing
Market at a Glance
Many open source offerings:
- Apache Lucene
- Apache Solr (built on Lucene)
- Nutch
- Sphinx
- ElasticSearch (built on Lucene)
- Xapian
- many more...

Some enterprise solutions:
- Google (Google Search Appliance, Google Mini)
- SAP (TREX, Enterprise Search)
- IBM (OmniFind)
- Oracle (Oracle Secure Enterprise Search)
- Microsoft (FAST Search Server)

Almost no standards:
- OpenSearch
- Robot Exclusion Standard
Anatomy of a typical search system

How is data stored in the engine?
- Index file(s)
- Each index is a collection of Documents (we will see later that this is not entirely true)
- A Document is a collection of data fields
- A field can be of any data type (text, integer, boolean, etc.)
- An index file has an internal data structure that maps terms to Documents (an inverted index)
- Very similar to a database table

How is information indexed?
- An indexing API allows programs to index information in a transparent way
- Remote

How is information retrieved / searched?
- A rich query language (like SQL) allows complex search queries
- Remote
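To make the term-to-Document mapping concrete, here is a minimal sketch of an inverted index. The sample documents, field names, and the whitespace tokenizer are assumptions for illustration only; Solr and Lucene use far richer analysis and on-disk structures.

```python
from collections import defaultdict

# Hypothetical sample documents: each document is a collection of fields.
documents = {
    1: {"title": "Solr scaling", "body": "shards and replicas"},
    2: {"title": "Lucene basics", "body": "inverted index and segments"},
    3: {"title": "Index replicas", "body": "replicas serve search requests"},
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, fields in docs.items():
        for value in fields.values():
            # Assumed tokenizer: lowercase and split on whitespace.
            for term in value.lower().split():
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, term):
    """Look up a single term; unknown terms match no documents."""
    return index.get(term.lower(), [])

index = build_inverted_index(documents)
print(search(index, "replicas"))  # [1, 3]
print(search(index, "index"))     # [2, 3]
```

A lookup is a single dictionary access on the term, which is why the index is organized by term rather than by document.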
Solr index structure
Scenarios and their problems
2 main scenarios
Search scenario: search for a term
Problems:
- How to execute a search on a big data-set, fast.
- How to scale the solution to serve any given number of concurrent requests.
- How to provide a highly available service.

Indexing scenario: build indexes via add/delete/update document operations
Problems:
- How to index a large number of documents, fast.

We will discuss only the search scenario. If time permits, we will touch on the indexing scenario too.
Scaling the Search Scenario

handling large data-sets
Handling Large Data-Sets

- The time it takes to search an index is a function of the data size.
- To process a big data-set efficiently, we have to break the data down into smaller parts, process them concurrently, and then combine the results.
- This principle is also called Map Reduce.
- In search, we split large indexes into shards and search each shard concurrently.
- Concurrent processing of a search request can happen on the same machine across multiple CPUs, or on different machines.
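The split, search concurrently, and combine idea can be sketched on a single machine with a thread pool. The shard contents and the function names `search_shard` and `search_all` are hypothetical; this is not how Solr implements distributed search, only an illustration of the principle.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one is a small term -> document IDs index.
shards = [
    {"solr": [1, 4], "lucene": [2]},
    {"solr": [7], "sphinx": [9]},
    {"lucene": [12], "solr": [15, 18]},
]

def search_shard(shard, term):
    """Search one shard; this is the part that runs concurrently."""
    return shard.get(term, [])

def search_all(shards, term):
    """Fan the query out to every shard, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        return sorted(doc_id for part in partials for doc_id in part)

print(search_all(shards, "solr"))  # [1, 4, 7, 15, 18]
```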
Handling Large Data-Sets - Map Reduce

A two-step process:
- "Map" step - The master node takes the input, partitions it into smaller subproblems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node.
- "Reduce" step - The master node then collects the answers to all the subproblems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
Map Reduce Examples

- Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
- Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
- Generate Inverted Index: The map function parses each document and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
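Two of the examples above can be sketched with a tiny single-process `map_reduce` helper. The helper, the sample log lines, and the sample documents are assumptions for illustration; a real Map Reduce framework would run the map and reduce functions on many worker nodes.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn on every input, group emitted pairs by key, then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Count of URL access frequency: map emits <URL, 1>, reduce sums the ones.
log_lines = ["/home", "/search", "/home", "/home"]
url_counts = map_reduce(
    log_lines,
    lambda url: [(url, 1)],
    lambda url, ones: sum(ones),
)

# Inverted index: map emits <word, document ID>, reduce sorts the IDs.
docs = [(1, "fast search"), (2, "search engine"), (3, "fast engine")]

def emit_words(doc):
    doc_id, text = doc
    return [(word, doc_id) for word in text.split()]

inverted = map_reduce(docs, emit_words, lambda word, ids: sorted(ids))

print(url_counts)          # {'/home': 3, '/search': 1}
print(inverted["search"])  # [1, 2]
```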
Scaling Map Reduce

Scale up (vertically) approach: add more CPU / memory so that more threads can run concurrently, each searching a different part of the index.
Pros:
- Easy to implement on a single machine.
- Usually no performance compromises as we add more CPU / memory (almost linear scalability).
Cons:
- We want to use small, cheap machines and still run big scenarios.
- Large data sets cannot fit into one physical machine.
A "scale up"-only approach is not realistic because of these cons.
Scaling Map Reduce

Scale out (horizontally) approach: split the data across multiple machines and run the search tasks in parallel on multiple nodes (a.k.a. sharding or distributed search).
Pros:
- Cheap machines.
- No limits on data-set size.
Cons:
- Complex implementation.
- Performance penalty for large clusters (does not scale linearly).
- Each request is handled by all machines / shards.
This approach is what most big projects implement.
Handling Large Data-Sets - Distributed Sharding
Handling Large Data-Sets - Distributed Sharding Problems

- The more data we have, the more index shards we split the cluster into.
- Adding shards is not free. Each shard brings a performance penalty:
  - Distributing the query to multiple nodes costs time and network bandwidth.
  - We wait for results from all nodes. Not all nodes behave equally, even when they have the same hardware specifications and the same amount of indexed data.
  - Time is spent executing the reduce function.
- The system does not scale linearly: the more nodes there are, the smaller the performance gain from each new node.
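A toy model can illustrate why waiting for all nodes hurts. It assumes that a distributed query takes as long as its slowest shard plus a merge cost that grows with the number of shards. The function name, the latency figures, and the per-shard merge cost are all hypothetical, not measurements.

```python
def query_latency_ms(shard_latencies_ms, merge_cost_per_shard_ms=0.5):
    """Toy model: the query waits for the slowest shard, then merges results."""
    slowest = max(shard_latencies_ms)
    merge = merge_cost_per_shard_ms * len(shard_latencies_ms)
    return slowest + merge

# Three well-behaved shards (hypothetical latencies in milliseconds).
print(query_latency_ms([20, 22, 21]))      # 23.5

# Adding one slow node drags the whole query down to its speed.
print(query_latency_ms([20, 22, 21, 60]))  # 62.0
```

In this model a single straggler sets the latency of every request, regardless of how fast the other shards are.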
Scaling the Search Scenario

handling request load
Handling Request Load

Now we know how to cope with a LOT of data, but how do we handle a LOT of users / search requests?
- Scale this solution horizontally (scale out again) by replicating each shard.
- Each shard exists on one Master machine and on multiple Slave machines (replicas of the Master).
- The Master is responsible for running indexing requests only. It has the most recent index instance.
- Slaves replicate the index(es) from the Master node and serve search requests.
- Load balance the Slaves with standard hardware / software load balancing solutions.
- The more load / users you have, the more Slaves you add to handle the search requests.
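The division of work between the Master and its Slaves can be sketched as a simple router. The `ShardCluster` class, the node names, and the round-robin policy are assumptions for illustration; a real deployment would use a hardware or software load balancer in front of the Slaves.

```python
from itertools import cycle

class ShardCluster:
    """One shard: a master for indexing and slaves for search requests."""

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = cycle(slaves)  # assumed policy: round-robin

    def route(self, request_type):
        """Send indexing to the master and spread searches over the slaves."""
        if request_type == "index":
            return self.master
        return next(self._slaves)

cluster = ShardCluster("master-1", ["slave-1a", "slave-1b", "slave-1c"])
print([cluster.route("search") for _ in range(4)])
# ['slave-1a', 'slave-1b', 'slave-1c', 'slave-1a']
print(cluster.route("index"))  # master-1
```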
Handling Request Load - Replication

- Indexes may be composed of multiple sub-indexes, or segments.
- Each segment is a fully independent index, which could be searched separately.
- Common scenario: add a bulk of documents to an index shard on a master server.
  - A new segment is created or altered (in remove document or update scenarios).
  - The Master takes a snapshot of the index state at a given time, marking the new / changed segments in the index.
  - Slaves poll the Master to see if any segment should be replicated.
  - Segments are replicated to all slave nodes.
  - A new 'view' is created for the new segment configuration.
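One poll cycle of the scenario above can be sketched as a comparison between the Master's snapshot and the segments a Slave already holds. The segment names, the version numbers, and both function names are hypothetical; this is not Solr's actual replication protocol.

```python
def segments_to_fetch(master_snapshot, slave_segments):
    """Return the names of segments that are new or changed on the master."""
    return sorted(
        name
        for name, version in master_snapshot.items()
        if slave_segments.get(name) != version
    )

def replicate(master_snapshot, slave_segments):
    """One poll cycle: find the changed segments and build the new view."""
    changed = segments_to_fetch(master_snapshot, slave_segments)
    new_view = dict(master_snapshot)  # the slave's new 'view' of the index
    return changed, new_view

# Hypothetical state: seg_1 was altered and seg_3 is new on the master.
master = {"seg_1": 3, "seg_2": 1, "seg_3": 1}
slave = {"seg_1": 2, "seg_2": 1}

changed, view = replicate(master, slave)
print(changed)  # ['seg_1', 'seg_3']
```

Only the changed segments travel over the network; the unchanged `seg_2` stays where it is.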
Handling Request Load - Replication Problems

- As an index grows, it becomes more segmented, and search becomes less efficient.
- Therefore, index optimization runs on all master and slave nodes to merge and compact the segments.
- The replication protocol can be selected and tuned, including the replication rules.
- Solr supports Unix rsync / script-based or pure Java replication mechanisms.
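Why merging helps can be sketched with segments modeled as small term-to-document-ID maps: a search over a segmented index needs one lookup per segment, while a merged index needs only one. The segment contents and the `merge_segments` function are assumptions for illustration; real segment merging in Lucene works on files and does considerably more.

```python
def merge_segments(segments):
    """Combine several independent segments into one compact segment."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, []).extend(doc_ids)
    return {term: sorted(set(ids)) for term, ids in merged.items()}

# Hypothetical segments created by three separate bulk additions.
segments = [
    {"solr": [1, 2]},
    {"solr": [5], "lucene": [4]},
    {"lucene": [6]},
]

merged = merge_segments(segments)
print(merged["solr"])    # [1, 2, 5]
print(merged["lucene"])  # [4, 6]
```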
Handling Request Load - Search Query

1. A user executes a search query.
2. The load balancer selects a slave node on one of the shards and forwards the request to it.
3. That node receives the request, distributes it to the other index shards, and processes its own piece of the index.
4. When all shards finish processing, they send their results back to the node that received the original request.
5. All results are sorted and returned to the user.
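The final merge step of this flow can be sketched as follows, assuming each shard returns (score, document ID) pairs and the receiving node keeps the best `top_k` of them. The shard names, scores, and the `merge_results` function are hypothetical.

```python
import heapq

# Hypothetical per-shard results as (score, document ID) pairs.
shard_results = {
    "shard-1": [(0.91, "doc-3"), (0.40, "doc-8")],
    "shard-2": [(0.87, "doc-14"), (0.65, "doc-11")],
    "shard-3": [(0.72, "doc-21")],
}

def merge_results(results_by_shard, top_k):
    """Combine the hits from all shards and keep the top_k by score."""
    all_hits = (hit for hits in results_by_shard.values() for hit in hits)
    return heapq.nlargest(top_k, all_hits)

print(merge_results(shard_results, 3))
# [(0.91, 'doc-3'), (0.87, 'doc-14'), (0.72, 'doc-21')]
```

The merge cannot start returning a complete answer until every shard has responded, which is why the receiving node waits for all of them.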
Handling Request Load - Search Query Problems

- The more users / requests we have, the more Slaves we can add.
- Adding Slaves is not free. Each Slave adds more network chatter to the system:
  - Each Slave polls the Master for updates, putting load on the Master and using bandwidth.
  - Each Slave replicates the index deltas, putting additional load on the Master and the network.
- The more you distribute, the more performance overhead you get. The system does not scale linearly.
Achieving high availability

- If we have many Slaves serving the same search function, we can continue to serve search requests even if not all the Slaves in a given shard are available.
- The Solr search API is HTTP based. With HTTP health checks on the HTTP load balancer, we can take problematic Slave nodes out of the cluster.
- The more Slaves we have for each shard, the more highly available the system is.
- We get high availability for search for free.
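The health-check idea can be sketched as a filter over the Slave pool. The node names and the `status` table stand in for real HTTP health checks and are hypothetical; in practice the load balancer would issue an HTTP request to each node.

```python
def healthy_nodes(nodes, is_healthy):
    """Keep only the nodes whose health check passes."""
    return [node for node in nodes if is_healthy(node)]

def shard_available(nodes, is_healthy):
    """A shard can serve searches while at least one slave is healthy."""
    return len(healthy_nodes(nodes, is_healthy)) > 0

# Hypothetical health-check results; a real check would be an HTTP request.
status = {"slave-1a": True, "slave-1b": False, "slave-1c": True}
slaves = list(status)

print(healthy_nodes(slaves, status.get))    # ['slave-1a', 'slave-1c']
print(shard_available(slaves, status.get))  # True
```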
Summary
- To handle more data, split the system into shards.
- To handle more requests, add more Slave nodes.
- We achieved a highly available and fully scalable (data-wise and load-wise) search system.
Questions?
Thank you
