2010-2011 Facebook Computer Science Clinic Project Improving the Hive Database System


Facebook

Facebook has over 500 million active users, who generate terabytes of data every day. Analyzing this data is a challenge that traditional database systems cannot meet. Facebook created the Hive distributed database system to address its needs.

Hive runs on top of Hadoop, an open-source distributed computation framework.

It provides users with a familiar SQL-like syntax and table-based storage.

It enables high-throughput queries on massive datasets.

SELECT user_id, gender FROM Data WHERE age="22"

Figure: the query passes through Hive (QL queries) to Hadoop (MapReduce jobs), which read many files and return the answer.

Hive is a database layer on top of Hadoop, which uses MapReduce to distribute computation across multiple servers.
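
As a rough illustration of that idea, the sample query above can be pictured as a filter that runs independently on each file, with the partial results gathered at the end. This is only a sketch in plain Python, not Hive or Hadoop code; the `FILES` data and the `map_phase` and `run_query` names are assumptions made for the example.

```python
# Hypothetical rows, split across "files" the way a large table would be.
FILES = [
    [{"user_id": 1, "gender": "F", "age": "22"},
     {"user_id": 2, "gender": "M", "age": "31"}],
    [{"user_id": 3, "gender": "M", "age": "22"}],
]

def map_phase(rows):
    """Emit (user_id, gender) for every row that satisfies WHERE age = "22"."""
    for row in rows:
        if row["age"] == "22":
            yield row["user_id"], row["gender"]

def run_query(files):
    """Apply the map phase to each file independently, then gather the results.

    On a real cluster each file would be handled by a different worker;
    here the loop simply runs them one after another.
    """
    results = []
    for rows in files:
        results.extend(map_phase(rows))
    return sorted(results)

answer = run_query(FILES)
print(answer)
```

A simple filter like this needs no reduce step; queries with grouping or joins would add one.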

Indexing

Indexing is a technique that can be used to improve data lookup times in a table. An index is an auxiliary data structure that provides a faster means to access rows of a table by using the values of a particular set of columns as a key.
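
A minimal sketch of such an auxiliary structure, assuming an in-memory table held as a list of dictionaries, might look like the following. The `build_index` and `lookup` names and the sample rows are illustrative only and do not reflect how Hive stores its indexes.

```python
from collections import defaultdict

def build_index(rows, column):
    """Map each value of `column` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index

def lookup(rows, index, value):
    """Fetch matching rows through the index instead of scanning every row."""
    return [rows[position] for position in index.get(value, [])]

# Hypothetical sample table.
rows = [
    {"user_id": 100, "browser": "Chrome"},
    {"user_id": 101, "browser": "Firefox"},
    {"user_id": 102, "browser": "Chrome"},
]
browser_index = build_index(rows, "browser")
chrome_users = [row["user_id"] for row in lookup(rows, browser_index, "Chrome")]
print(chrome_users)
```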

Before this project, Hive supported only a rudimentary indexing framework with a single type of index. The Facebook clinic team improved this framework by adding support for bitmap indexes and for using indexes automatically when running queries.

Automatic Index Usage

Using indexes was difficult in the previous Hive system, as it required the user to understand how each type of index was implemented. Using an index to speed up a query required the user to:

• Query the index to produce an intermediate file of relevant regions of the table. The form of this intermediate query depends on the specific type of index being used.

• Set Hive to scan only the regions referenced by this intermediate file.

The speedup comes from Hive reading only the relevant parts of the table to evaluate the query, instead of the entire table.
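
The two manual steps can be sketched as follows. The sketch assumes a table stored as numbered blocks and an index that maps each value to the blocks containing it; `BLOCKS`, `INDEX`, `query_index`, and `scan_blocks` are hypothetical names, and Hive's actual intermediate file format is not shown here.

```python
# Hypothetical table stored as numbered blocks of (user_id, browser) rows.
BLOCKS = {
    0: [(100, "Chrome"), (101, "Firefox")],
    1: [(102, "Chrome"), (103, "Safari")],
    2: [(104, "Firefox"), (105, "Chrome")],
}

# Hypothetical index: browser value -> blocks that contain that value.
INDEX = {"Chrome": [0, 1, 2], "Firefox": [0, 2], "Safari": [1]}

def query_index(index, value):
    """Step 1: produce the intermediate list of relevant regions (blocks)."""
    return index.get(value, [])

def scan_blocks(blocks, block_ids, value):
    """Step 2: read only the listed blocks and apply the original predicate."""
    return [row for block_id in block_ids
            for row in blocks[block_id]
            if row[1] == value]

relevant = query_index(INDEX, "Safari")
rows = scan_blocks(BLOCKS, relevant, "Safari")
print(relevant, rows)
```

Only one of the three blocks is read for the `"Safari"` lookup, which is where the saving comes from.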

Automatic index usage allows users to benefit from indexes without having to understand implementation details. Our team worked on allowing indexes to be automatically used to speed up queries that contain WHERE clauses.

Automatic index usage was implemented as a standalone optimization. It receives a graph of MapReduce jobs and works by:

• Determining if the graph came from a query that can be sped up using an index.

• Generating additional MapReduce jobs that do the intermediary work of querying the index.

• Augmenting the original graph of jobs with the new jobs.
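
The three steps above can be sketched as a small graph rewrite. This is a simplified illustration under stated assumptions, not the optimizer code delivered to Facebook: the `Job` and `Plan` classes, the `filter_column` field, and the `use_index` function are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Plan:
    jobs: List[Job]
    filter_column: Optional[str] = None  # column used in the WHERE clause, if any

def use_index(plan, indexed_columns):
    """Return a plan that queries an index first, when one applies."""
    # 1. Determine whether the query can be sped up using an index.
    if plan.filter_column not in indexed_columns:
        return plan
    # 2. Generate the additional job that queries the index.
    index_job = Job("query_index_" + plan.filter_column)
    # 3. Augment the graph: jobs that had no dependencies now wait for the index job.
    new_jobs = [Job(job.name, list(job.depends_on) or [index_job.name])
                for job in plan.jobs]
    return Plan([index_job] + new_jobs, plan.filter_column)

original = Plan([Job("scan"), Job("output", ["scan"])], filter_column="age")
optimized = use_index(original, {"age"})
print([(job.name, job.depends_on) for job in optimized.jobs])
```

Returning the plan unchanged when no index applies keeps the pass safe to run on every query.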

Query Evaluation

Hive works by compiling query statements into a directed acyclic graph of MapReduce jobs, which can then be run on the underlying Hadoop cluster. The compiler is divided into various sections that are shown in the figure below.

Our work on automatic index usage focused on the Optimizer. The Optimizer applies a series of optimizations, each of which rearranges the graph of MapReduce jobs to improve query run time.

Figure: the query SELECT user_id, gender FROM Data WHERE age="22" passes through the compiler. Stages shown: Parser, Semantic Analyzer, Logical Plan Generator, Physical Plan Generator, Optimizer, Hadoop. Intermediate results shown: Parse Tree, Logical Plan (graph of OP nodes), Physical Plan (graph of MR jobs), Optimized Plan (graph of MR jobs).

The Hive Compiler

Bitmap Indexing

Bitmap indexing is an indexing technique that is effective for columns that hold few distinct values. Examples of such columns that may be present in Facebook’s databases include gender and relationship status.

A bitmap index uses a series of binary bit vectors to represent a column, with one bit vector for each possible value of the column. Each bit in a vector corresponds to a row; it is set to 1 if that row holds the vector's value and 0 otherwise.

Bitmap indexes are powerful because bit vectors can be efficiently combined using bit-wise operations, which quickly narrows down the rows that need to be accessed in queries that combine several predicates.
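
As a minimal sketch of this layout, the following builds one bit vector per distinct value of a column. The gender values are taken from the User Statistics example on this poster; the `build_bitmap_index` function is a hypothetical helper, and lists of 0/1 are used for readability where a real index would pack bits into machine words.

```python
def build_bitmap_index(values):
    """Return {distinct value: bit vector}, with one bit per row of the column."""
    index = {}
    for row, value in enumerate(values):
        # Create an all-zero vector the first time a value is seen,
        # then set the bit for the current row.
        index.setdefault(value, [0] * len(values))[row] = 1
    return index

# gender column for user_id 100 through 107
genders = ["Male", "Female", "Female", "Male", "Male", "Female", "Male", "Male"]
gender_index = build_bitmap_index(genders)
print(gender_index["Male"])
print(gender_index["Female"])
```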

User Statistics

user_id  gender  browser  os
100      Male    Chrome   Linux
101      Female  Firefox  Linux
102      Female  Chrome   Windows
103      Male    Safari   Mac OS X
104      Male    Firefox  Windows
105      Female  Chrome   Linux
106      Male    Safari   Windows
107      Male    Firefox  Windows

Bitmap Index

gender          browser                    os
Male  Female    Chrome  Firefox  Safari    Linux  Windows  OS X
1     0         1       0        0         1      0        0
0     1         0       1        0         1      0        0
0     1         1       0        0         0      1        0
1     0         0       0        1         0      0        1
1     0         0       1        0         0      1        0
0     1         1       0        0         1      0        0
1     0         0       0        1         0      1        0
1     0         0       1        0         0      1        0

This shows the layout of a bitmap index. There is a bit vector column for every option in every column of the original table.

Male  AND  Firefox  AND  Windows  =  Male, Firefox, & Windows
1          0             0           0
0          1             0           0
0          0             1           0
1          0             0           0
1          1             1           1
0          0             0           0
1          0             1           0
1          1             1           1


Using the bitmap index eliminates many rows when combining the predicates in the query:

SELECT * FROM user_statistics WHERE gender='Male' AND browser='Firefox' AND os='Windows'
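
The combination step for this query can be sketched in a few lines. The three bit vectors are the Male, Firefox, and Windows vectors from the User Statistics example; the `bitwise_and` helper is a hypothetical name, and a production index would AND packed machine words rather than Python lists.

```python
USER_IDS = [100, 101, 102, 103, 104, 105, 106, 107]
MALE     = [1, 0, 0, 1, 1, 0, 1, 1]
FIREFOX  = [0, 1, 0, 0, 1, 0, 0, 1]
WINDOWS  = [0, 0, 1, 0, 1, 0, 1, 1]

def bitwise_and(*vectors):
    """AND any number of equal-length bit vectors, position by position."""
    return [int(all(bits)) for bits in zip(*vectors)]

combined = bitwise_and(MALE, FIREFOX, WINDOWS)
matches = [user_id for user_id, bit in zip(USER_IDS, combined) if bit]
print(combined)  # [0, 0, 0, 0, 1, 0, 0, 1]
print(matches)   # [104, 107]
```

Only the two rows whose bit survives the AND need to be read from the table; the other six are eliminated without being accessed.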

Deliverables

We have delivered the following products to Facebook:

• Source code patches for our work on automatic index usage and bitmap indexing.

• Documentation of our new features on their wiki.

• Benchmarking results, shown below.

Benchmarking Results

Facebook asked us to run benchmarks to measure the efficacy of the indexing framework. We executed test queries on columns with different numbers of distinct values:

Many Distinct Values (e.g. user_id): Both indexing methods significantly improved execution time. The compact index outperformed the bitmap index, as expected.

Few Distinct Values (e.g. browser): Both indexing methods again significantly improved execution time. The bitmap index outperformed the compact index, as expected.

Average Distinct Values (e.g. access_date): On a column with a moderate number of distinct values, indexing gives no advantage when the query returns many results. When the query returns few results, both indexing methods help.

All tests were conducted using a 5GB table with 45 million rows from the generic TPC-H dataset, not user data from Facebook.

Figure: bar chart of Query Execution Time [s] (axis from 0 to 300) for No Indexes, Compact Index, and Bitmap Index, grouped by Many Distinct Values, Few Distinct Values, Many Values Returned, and Few Values Returned.

Execution time for columns (Col) with different numbers of distinct values, for queries of the form:

SELECT * FROM Data WHERE Col = val

Acknowledgments

Facebook Liaisons:

Jonathan Hsu ’01, John Sichi, Yongqiang He

Team Members:

Skye Berghel, Jeffrey Lym, Russell Melick, Marquis Wang

Faculty Advisor: Robert Keller