Documente Academic
Documente Profesional
Documente Cultură
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
What is a column-store?
row-store
Date Store Product Customer Price
column-store
Date Store Product Customer Price
+ only need to read in relevant data - tuple writes require multiple accesses
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
The only fundamental difference is the storage layout However: we need to look at the big picture
different storage layouts proposed row-stores row-stores++ 90s 00s row-stores++ today column-stores converge?
70s
80s
How did we get here, and where we are heading Part 1 What are the column-specific optimizations? Part 2 How do we improve CPU efficiency when operating on Cs
Part 3
VLDB 2009 Tutorial Column-Oriented Database Systems 3
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l
Introduction to key features From DSM to column-stores and performance tradeoffs Column-store architecture overview Will rows and columns ever converge?
Part 2: Column-oriented execution Daniel Part 3: MonetDB/X100 and CPU efficiency Peter
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
One Size Fits All? - Part 2: Benchmarking Results Stonebraker et al. CIDR 2007
QUERY 2 SELECT account.account_number, sum (usage.toll_airtime), sum (usage.toll_price) FROM usage, toll, source, account WHERE usage.toll_id = toll.toll_id AND usage.source_id = source.source_id AND usage.account_id = account.account_id AND toll.type_ind in (AE. AA) AND usage.toll_price > 0 AND source.type != CIBER AND toll.rating_method = IS AND usage.invoice_date = 20051013 GROUP BY account.account_number
star schema
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
read pages containing entire rows one row = 212 columns! is this typical? (it depends)
What about vertical partitioning? (it does not work with ad-hoc queries)
VLDB 2009 Tutorial
read only columns needed in this example: 7 columns caveats: select * not any faster clever disk prefetching clever tuple reconstruction
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Why?
l
l l
Rows contain values from different domains => more entropy, difficult to dense-pack Columns exhibit significantly less entropy Male, Female, Female, Female, Male Examples:
1998, 1998, 1999, 1999, 1999, 2000
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Use multiple overlapping column collections Sorted columns compress better Range queries are faster Use sparse clustered indexes
What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Amortizes function-call cost, improves CPU cache performance Software-based: bitwise operations Hardware-based: SIMD Part 3
Avoid decompression: operate directly on compressed Delay decompression (and tuple reconstruction)
l
more in Part 2
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Stores vs Row-Stores: How Different are They Really? Abadi, Hachem, and Madden. SIGMOD 2008.
Time (sec)
original C-store
10
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Storage layout
compression
Part 2
column operators
Part 1
Part 2
Execution engine
vectorized operations l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l
Introduction to key features From DSM to column-stores and performance tradeoffs Column-store architecture overview Will rows and columns ever converge?
Part 2: Column-oriented execution Daniel Part 3: MonetDB/X100 and CPU efficiency Peter
12
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
1990s: Commercialization through SybaseIQ Late 90s 2000s: Focus on main-memory performance
l l
10+ startups
13
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Proposed as an alternative to NSM 2 indexes: clustered on ID, non-clustered on value Speeds up queries projecting few columns Requires more storage value
ID
0100 0962 1000 .. 1 2 3 4 ..
14
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Cache Conscious Algorithms for from: Relational Query Processing. Shatdal, Kant, Naughton. VLDB 1994. Database Architecture Optimized for to: the New Bottleneck: Memory Access. Boncz, Manegold, Kersten. VLDB 1999.
DBMSs on a modern processor: and: Where does time go? Ailamaki, DeWitt, Hill, Wood. VLDB 1999.
Weaving Relations for Cache Performance. Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.
15
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Fractured mirrors
l
16
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l l
Main-memory Improve computational efficiency by avoiding expression interpreter DSM with virtual IDs natural choice Developed new query execution algebra Pointed out memory-wall in DBMSs Cache-conscious projections and joins
VLDB 2009 Tutorial Column-Oriented Database Systems 17
Initial contributions:
l l l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Faster CPUs, larger memories, disk bandwidth limit Multi-terabyte Data Warehouses Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression
C-store paper:
l
MonetDB/X100
l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Optimized DSM in Fractured Mirrors, 2002 Apples-to-apples comparison Performance Tradeoffs in ReadOptimized Databases Harizopoulos, Liang, Abadi, Madden, VLDB06
l l
Follow-up study
Query Processing Techniques Fast Scans and Joins Using Flash for Solid State Drives Drives Shah, Harizopoulos, Tsirogiannis, Harizopoulos, Wiener, Graefe. DaMoN08 Shah, Wiener, Graefe, VLDB 2009 Tutorial Column-Oriented Database Systems 19 SIGMOD09
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Leaf nodes contain values Eliminate IDs, amortize header overhead Custom implementation on Shore
TID Column Data
sparse B-tree on ID 3
1 2 3 4 5
a1 a2 a3 a4 a5
a1 1
2 a1
a2
a3
4 4
a4
a5
a2 a3
a4 a5
Similar: storage density Efficient columnar comparable storage in B-trees Graefe. to column stores Sigmod Record 03/2007.
VLDB 2009 Tutorial Column-Oriented Database Systems 20
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
columns projected: 1 2 3 4 5
optimized DSM
Read in segments of M pages Merge segments in memory Becomes CPU-bound after 5 pages
VLDB 2009 Tutorial Column-Oriented Database Systems 21
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-scanner implementation
row scanner
column scanner
Joe 45
Joe 45
apply predicate(s)
S
Direct I/O Joe Sue prefetch ~100ms worth of data
S
#POS 45 #POS
1 Joe 45 2 Sue 37
apply predicate #1 45 37
22
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Scan performance
l l l l l
Large prefetch hides disk seeks in columns Column-CPU efficiency with lower selectivity Row-CPU suffers from memory stalls Memory stalls disappear in narrow tuples Compression: similar to narrow
23
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Time (s)
20
15
10
0 1 2 3 4 5 6 7 Columns Returned 8 9 10
Non-selective queries, narrow tuples, favor well-compressed rows Materialized views are a win Column-joins are Scan times determine early materialized joins covered in part 2!
VLDB 2009 Tutorial Column-Oriented Database Systems 24
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
(cpdb)
+++
_ = + ++
8 12 16 20 24 28 32 36
tuple width
l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
time (sec)
Column 2
30 20 10 0 4 8 12 16 20 24 28 32
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
time (sec)
30 20 10 0 4
Column, 48 Row, 48
40 30 20 10 0 Column, 8 Row, 8 4 12 20 28
12
20
28
No prefetching hurts columns in single scans Under competing traffic, columns outperform rows for any prefetch size
VLDB 2009 Tutorial Column-Oriented Database Systems 27
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Performance
l l l
DSM vs. NSM: CPU performance trade offs in block-oriented query processing Boncz, Zukowski, Nes, DaMoN08
Benefit in on-the-fly conversion between NSM and DSM DSM: sequential access (block fits in L2), random in L1 NSM: random access, SIMD for grouped Aggregation
28
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Performance characteristics
l l
very fast random reads, slow random writes fast sequential reads and writes
Flash RAID: very tight bandwidth/cm3 packing (4GB/sec inside the box) useful for delta structures and logs still suboptimal if I/O block size > record size therefore column stores profit mush less than horizontal stores the larger the data, the deeper clustering one can exploit
Column-Oriented Database Systems 29
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
New-generation SSDs
l l
30K Read IOps, 3K Write Iops 250MB/s Read BW, 200MB/s Write
Very fast random reads, slower random writes Fast sequential RW, comparable to HDD arrays
l l
No expensive seeks across columns FlashScan and Flashjoin: PAX on SSDs, inside Postgres
Query Processing Techniques for Solid State Drives Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD09 mini-pages with no qualified attributes are not accessed
30
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
..to 2x slower
..to same
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l
Introduction to key features From DSM to column-stores and performance tradeoffs Column-store architecture overview Will rows and columns ever converge?
Part 2: Column-oriented execution Daniel Part 3: MonetDB/X100 and CPU efficiency Peter
32
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Architecture of a column-store
storage layout
l l l l
read-optimized: dense-packed, compressed organize in extends, batch updates multiple sort orders engine sparse indexes l block-tuple operators l new access methods l optimized relational operators system-level system-wide column support loading / updates scaling through multiple nodes transactions / redundancy
VLDB 2009 Tutorial Column-Oriented Database Systems 33
l l l l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
C-Store
l l l l l l l l l l l
Compress columns No alignment Big disk blocks Only materialized views (perhaps many) Focus on Sorting not indexing Data ordered on anything, not just time Automatic physical DBMS design Optimize for grid computing Innovative redundancy Xacts but no need for Mohan Column optimizer and executor
VLDB 2009 Tutorial Column-Oriented Database Systems 34
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l l l l l
Projection (MV) is some number of columns from a fact table Plus columns in a dimension table with a 1-n join between Fact and Dimension table Stored in order of a storage key(s) Several may be stored! With a permutation, if necessary, to map between them Table (as the user specified it and sees it) is not stored! No secondary indexes (they are a one column sorted MV plus a permutation, if you really want one)
User view:
EMP (name, age, salary, dept) Dept (dname, floor)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Trickle Load
A B C
Memory based Unsorted / Uncompressed Segmented Low latency / Small quick inserts
VLDB 2009 Tutorial Column-Oriented Database Systems
(A B C | A)
36
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
37
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Data Warehousing
l l l
High end (clustering) Mid end/Mass Market Personal Analytics E.g. Proximity
Data Mining
l
l l
Semantic web data management Terabyte TREC SciDB initiative SLOAN Digital Sky Survey on MonetDB
Information retrieval
l
Scientific datasets
l l
38
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Cantor (history) Sybase IQ SenSage (former Addamark Technologies) Kdb 1010data MonetDB C-Store/Vertica X100/VectorWise KickFire SAP Business Accelerator Infobright ParAccel Exasol
VLDB 2009 Tutorial Column-Oriented Database Systems 39
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l
Introduction to key features From DSM to column-stores and performance tradeoffs Column-store architecture overview Will rows and columns ever converge?
Part 2: Column-oriented execution Daniel Part 3: MonetDB/X100 and CPU efficiency Peter
40
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Store
TID Value
Product
TID Value
Customer
TID Value TID
Price
Value
1 2 3
1 2 3
1 2 3
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Store
TID Value
Product
TID Value
Customer
TID Value TID
Price
Value
01/01
1 2 3
1 2 3
42
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Experiments
l
Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance. ONeil et. al. ICDE 2008.
l l
Implemented by professional DBA Original row-store plus 2 column-store simulations on same row-store product
250.0 200.0
Time (seconds)
Column-Stores vs Row-Stores: How Different are They Really? Abadi, Hachem, and Madden. SIGMOD 2008.
43
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Work well when workload is known ..and queries access disjoint sets of columns See automated physical design
Tuple Header TID Column Data
1 2 3
Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
Column-Stores vs. Row-Stores: How Different Are They Really? Abadi, Madden, and Hachem. SIGMOD 2008.
VLDB 2009 Tutorial
Complete fact table takes up ~4 GB (compressed) Vertically partitioned tables take up 0.7-1.1 GB (compressed)
Column-Oriented Database Systems 44
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Tuple construction
l
SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = Canada GROUP BY store_name
Result of lower part of query plan is a set of TIDs that passed all predicates Need to extract SELECT attributes at these TIDs
l l
BUT: index maps value to TID You really want to map TID to value (i.e., a vertical partition)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
So.
l l
All indexes approach is a poor way to simulate a column-store Problems with vertical partitioning are NOT fundamental
l l l
Store tuple header in a separate partition Allow virtual TIDs Combine clustered indexes, vertical partitioning Might be possible, BUT: l Need better support for vertical partitioning at the storage layer l Need support for column-specific optimizations at the executer level l Full integration: buffer pool, transaction manager, ..
..unless new technology / new objectives change the game (SSDs, Massively Parallel Platforms, Energy-efficiency)
Column-Oriented Database Systems 46
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
End of Part 1
l
Introduction to key features From DSM to column-stores and performance tradeoffs Column-store architecture overview Will rows and columns ever converge?
Part 2: Column-oriented execution Daniel Part 3: MonetDB/X100 and CPU efficiency Peter
47
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Part 2 Outline
l
48
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Compression
l l
Higher data value locality in column stores Techniques such as run length encoding far more useful Can use extra space to store multiple copies of data in different sort orders
50
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Run-length Encoding
Quarter Product ID Price Q1 Q1 Q1 Q1 Q1 Q1 Q1 Q2 Q2 Q2 Q2 1 1 1 1 1 2 2 1 1 1 2 5 7 2 9 6 8 5 3 8 1 4
Column-Oriented Database Systems
Quarter
(value, start_pos, run_length)
Price 5 7 2 9 6 8 5 3 8 1 4
51
(Q1, 1, 300) (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Integrating Compression and Execution in ColumnOriented Database Systems Abadi et. al, SIGMOD 06
Bit-vector Encoding
l
ID: 1 1 1 1 1 1 0 0 1 1 0 0
ID: 2 0 0 0 0 0 1 1 0 0 1 0
ID: 3 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0
52
b[i] = 1 if c[i] = v
Good for columns with few unique values Each bit-vector can be further compressed if sparse
1 1 1 2 2 1 1 2 3
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Integrating Compression and Execution in ColumnOriented Database Systems Abadi et. al, SIGMOD 06
Dictionary Encoding
l
For each unique value create dictionary entry Dictionary can be per-block or per-column Column-stores have the advantage that dictionary entries may encode multiple values at once
Quarter Quarter Q1 Q2 Q4 Q1 Q3 Q1 Q1 Q1 Q2 Q4 Q3 Q3
0 1 3 0 2 0 0 0 1 3 2 2
+
Dictionary 0: Q1 1: Q2 2: Q3 3: Q4
OR
+
Dictionary 24: Q1, Q2, Q4, Q1 122: Q2, Q4, Q3, Q3 128: Q3, Q1, Q1, Q1
53
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Encodes values as b bit offset from chosen frame of reference Special escape code (e.g. all bits set to 1) indicates a difference larger than can be stored in b bits
l
Price 45 54 48 55 51 53 40 50 49 62 52 50
Price
Frame: 50
-5 4 -2 5 1 3 40 0 -1 62 2 0
54
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Differential Encoding
l l
Encodes values as b bit offset from previous value Special escape code (just like frame of reference encoding) indicates a difference larger than can be stored in b bits
l
Time 5:00 5:02 5:03 5:03 5:04 5:06 5:07 5:08 5:10 5:15 5:16 5:16
Improved Word-Aligned Binary Compression for Text Indexing Ahn, Moffat, TKDE06
55
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Is the data numerical and exhibit good locality? yes Frame of Reference Encoding no Leave Data Uncompressed
yes
no
Bit-vector Compression
Dictionary Compression
OR
Heavyweight Compression
56
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
l l
Modern disk arrays can achieve > 1GB/s 1/3 CPU for decompression 3GB/s needed
Lightweight compression schemes are better Even better: operate directly on compressed data
VLDB 2009 Tutorial Column-Oriented Database Systems 57
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Integrating Compression and Execution in ColumnOriented Database Systems Abadi et. al, SIGMOD 06
I/O - CPU tradeoff is no longer a tradeoff Reduces memoryCPU bandwidth requirements Opens up possibility of operating on multiple records at once
58
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Integrating Compression and Execution in ColumnOriented Database Systems Abadi et. al, SIGMOD 06
2 0 0 1 0 0 0 1 0 0 1 0
3 0 1 0 0 0 1 0 0 1 0 1
ProductID, COUNT(*))
1 0 0 1 1 0 0 1 0 0 0
(1, 3) (2, 1) (3, 2) SELECT ProductID, Count(*) FROM table WHERE (Quarter = Q2) GROUP BY ProductID
59
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Integrating Compression and Execution in ColumnOriented Database Systems Abadi et. al, SIGMOD 06
Selection Operator
60
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Row-store:
l l
Column projection involves removing unneeded columns from tuples Generally done as early as possible Operation is almost completely opposite from a row-store Column projection involves reading needed columns from storage and extracting values for a listed set of tuples
Column-store:
l l
l l
62
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
Construct (4,1,4) 2 1 3 1
storeID
2 3 3 3
7 13 42 80
Need to construct ALL tuples Need to decompress data Poor memory bandwidth utilization
prodID
custID price
63
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
AGG
Data Source custID
0 1 0 1
QUERY: SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
AND 4 4 4 4
prodID
2 1 3 1
storeID
4 4 4 4
2 1 3 1
2 3 3 3
7 13 42 80
64
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
AGG
Data Source custID Data Source price
0 1 0 1
SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
AND AND 1 1 1 1 0 1 0 1 4 4 4 4 2 1 3 1 2 3 3 3 7 13 42 80
65
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
AGG
Data Source custID Data Source price
3 3 2 0 3 1 3 0 3 1
custID
13 1 80 1 7 0 13 1 42 0 80 1
price
SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
Data Source
Data Source
AND
0 1 0 1
4 4 4 4
2 1 3 1
2 3 3 3
7 13 42 80
66
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
AGG 3 1 1 93
Data Source custID Data Source price
SELECT custID,SUM(price) FROM table WHERE (prodID = 4) AND (storeID = 1) AND GROUP BY custID
AND
AGG 4 4 4 4 2 1 3 1 2 3 3 3 7 13 42 80
3 1 3 1
13 1 80 1
67
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Materialization Strategies in a Column-Oriented DBMS Abadi, Myers, DeWitt, and Madden. ICDE 2007.
QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
l
Time (seconds)
Early Materialization
Late Materialization
68
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Materialization Strategies in a Column-Oriented DBMS Abadi, Myers, DeWitt, and Madden. ICDE 2007.
QUERY: SELECT C1, SUM(C2) FROM table WHERE (C1 < CONST) AND (C2 < CONST) GROUP BY C1
Time (seconds)
Early Materialization
Late Materialization
69
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
B, C, E, H, F
Project Join G=K
F K
Scan
G F, K
Project Join A=D
B, C, E, G, H
Join 1 A=D
G D
Scan
E, H
A, B, C
Scan
D, E, G, H
Scan
A
Scan
Column-Oriented Database Systems
R1 (A, B, C) R2 (D, E, G, H)
VLDB 2009 Tutorial
R1 (A, B, C) R2 (D, E, G, H)
70
70
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
B, C, E, H, F
Project Join G=K
F K
Scan
G F, K
B, C, E, G, H Early
Late Project
Join A=D
materialization Join 1
A=D Scan R3 (F, K, L)
materialization (F, K, L) R3
G D
Scan
71
E, H
A, B, C
Scan
D, E, G, H
Scan
A
Scan
Column-Oriented Database Systems
R1 (A, B, C) R2 (D, E, G, H)
VLDB 2009 Tutorial
R1 (A, B, C) R2 (D, E, G, H)
71
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
1 2 3
custID
Facts
VLDB 2009 Tutorial
Customers
Column-Oriented Database Systems 72
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
73
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
(4,1,4)
12 1 11 1
6 1 2 1
2 3 1 3
7 13 42 80
1 2 3
custID
Late materialized join causes out of order probing of projected columns from the inner relation
prodID
Facts
VLDB 2009 Tutorial
Customers
Column-Oriented Database Systems 74
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Nave LM join about 2X slower than EM join on typical queries (due to random I/O)
l
75
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
Column-Stores vs Row-Stores: How Different are They Really? Abadi, Madden, and Hachem. SIGMOD 2008.
Designed for typical joins when data is modeled using a star schema
l
One (fact) table is joined with multiple dimension tables select c_nation, s_nation, d_year, sum(lo_revenue) as revenue from customer, lineorder, supplier, date where lo_custkey = c_custkey and lo_suppkey = s_suppkey and lo_orderdate = d_datekey and c_region = 'ASIA and s_region = 'ASIA and d_year >= 1992 and d_year <= 1997 group by c_nation, s_nation, d_year order by d_year asc, revenue desc;
Typical query:
76
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
custkey region nation 1 ASIA CHINA 2 ASIA INDIA 3 ASIA INDIA 4 EUROPE FRANCE
Column-Stores vs Row-Stores: How Different are They Really? Abadi, Madden, and Hachem. SIGMOD 2008.
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
orderkey custkey suppkey orderdate revenue 1 3 1 01011997 43256 2 3 2 01011997 33333 3 4 3 01021997 12121 4 1 1 01021997 23233 5 4 2 01021997 45456 6 1 2 01031997 43251 7 3 2 01031997 34235
Column-Stores vs Row-Stores: How Different are They Really? Abadi et. al. SIGMOD 2008. 1 1 1 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 1
& & = +
1 0 0 1 0 0 0
custkey 3 3 4 1 4 1 3
VLDB 2009 Tutorial
1 1 0 1 0 1 1
suppkey 1 2 3 1 2 2 2
1 0 1 1 0 0 0
1 1 1 1 1 1 1
78
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey 3 3 4 1 4 1 3 suppkey 1 2 3 1 2 2 2 orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997
+ + +
1 0 0 1 0 0 0
3 1
+ +
= =
1997 1997 1997
INDIA CHINA
1 0 0 1 0 0 0
1 1
RUSSIA RUSSIA
1 0 0 1 0 0 0
01011997 01021997
JOIN
1997 1997
79
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
custkey region nation 1 ASIA CHINA 2 ASIA INDIA 3 ASIA INDIA 4 EUROPE FRANCE
Column-Stores vs Row-Stores: How Different are They Really? Abadi, Madden, and Hachem. SIGMOD 2008.
Range [1-3]
(between-predicate rewriting)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Invisible Join
l
Bottom Line
l
Many data warehouses model data using star/snowflake schemes Joins of one (fact) table with many dimension tables is common Invisible join takes advantage of this by making sure that the table that can be accessed in position order is the fact table for each join Position lists from the fact table are then intersected (in position order) This reduces the amount of data that must be accessed out of order from the dimension tables Between-predicate rewriting trick not relevant for this discussion
Column-Oriented Database Systems 81
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
custkey 3 3 4 1 4 1 3 suppkey 1 2 3 1 2 2 2 orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997
+ + +
1 0 0 1 0 0 0
3 1
+ +
= =
1997 1997 1997
INDIA CHINA
RUSSIA RUSSIA
1 0 0 1 0 0 0
01011997 01021997
JOIN
1997 1997
82
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
3 1
INDIA CHINA
Fast Joins using Join Indices. Li and Ross, VLDBJ 8:1-24, 1999.
Query Processing Techniques for Solid State Drives. Tsirogiannis, Harizopoulos et. al. SIGMOD 2009.
83
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
1.
2.
3.
4.
Add column with dense ascending integers from 1 Sort new position list by second column Probe projected column in order using new sorted position list, keeping first column from position list around Sort new result by first column
1 2
3 1
+ +
= =
2 1
INDIA CHINA
2 1
1 3
CHINA INDIA
1 2
INDIA CHINA
84
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Jive/Flash Join
l
Bottom Line
l
Sort join index Probe projected columns in order Sort result using an added column LM has the extra sorts (EM accesses all columns in order) LM only has to fit join columns into memory (EM needs join columns and all projected columns)
LM vs EM tradeoffs:
l l
Results in big memory and CPU savings (see part 3 for why there is CPU savings)
l l
LM only has to materialize relevant columns In many cases LM advantages outweigh disadvantages
LM would be a clear winner if not for those pesky sorts can we do better?
85
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Radix Cluster/Decluster
l
l l
We just want to access the storage blocks in order (we dont mind random access within a block) So do a radix sort and stop early By stopping early, data within each block is accessed out of order, but in the order specified in the original join index
l
Database Architecture Optimized for the New Bottleneck: Memory Access VLDB99 Generic Database Cost Models for Hierarchical Memory Systems, VLDB02 (all Manegold, Boncz, Kersten)
86
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Radix Cluster/Decluster
l
Bottom line
l
Both sorts from the Jive join can be significantly reduced in overhead Only been tested when there is sufficient memory for the entire join index to be stored three times
l
Technique is likely applicable to larger join indexes, but utility will go down a little Dont want to use radix cluster/decluster if you have variablewidth column values or compression schemes that can only be decompressed starting from the beginning of the block
87
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
LM vs EM joins
l
Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins Research papers show that LM joins become 2X faster than EM joins (instead of 2X slower) for a wide array of query types
88
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
For queries with selective predicates, aggregations, or compressed data, use late materialization For joins:
l
Research papers:
l
Always use late materialization Inner table to a join often materialized before join (reduces system complexity): Some systems will use LM only if columns from inner table can fit entirely in memory
Commercial systems:
l
89
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Outline
l
l l
l
l
l
l
Conclusions
future work
90
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Architecture
Elements: l Storage
l
l l
Pipelined SIMD
92
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Metrics
93
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU Metrics
94
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
DRAM Metrics
95
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
96
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Hazards
l
Data hazards
l l
Control Hazards
l l l
Out-of-order execution addresses data hazards l control hazards typically more expensive
VLDB 2009 Tutorial Column-Oriented Database Systems 97
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SIMD
Same operation applied on a vector of values MMX: 64 bits, SSE: 128bits, AVX: 256bits SSE, e.g. multiply 8 short integers
Column-Oriented Database Systems 98
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
102
ivan
350 37 *50 7 30
next()
102 ivan
37 7
350
PROJECT next()
101 102 alice ivan 37 22 > 30 ? 22 TRUE 37 FALSE
SELECT id, name (age-30)*50 AS bonus FROM employee WHERE age > 30
SELECT next()
102 101 ivan alice 37 22
SCAN
99
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
102
ivan
350 37 *50 7 30
next()
Operators
Iterator interface -open() -next(): tuple -close()
PROJECT next()
37 22 > 30 ?
100
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
102
ivan
350 37 *50 7 30
next()
102 ivan
37 7
350
Primitives
Provide computational functionality All arithmetic allowed in expressions, e.g. Multiplication
7 * 50
PROJECT next()
101 102 alice ivan 37 22 > 30 ? 22 TRUE 37 FALSE
SELECT next()
102 101 ivan alice 37 22
SCAN
mult(int,int) int
101
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Data hazards
l l
Control Hazards
l l l
SIMD
work on one tuple at a time Large Tree/Hash Structures Code footprint of all operators in query plan exceeds L1 cache Data-dependent conditions PROJECT
Operators
Iterator interface -open() -next(): tuple -close()
SELECT
SCAN Next() late binding method calls DBMSs On A Modern Processor: Where Does Time Go? Tree, List, Hash traversal Complex NSM record navigation Ailamaki, DeWitt, Hill, Wood, VLDB99
VLDB 2009 Tutorial Column-Oriented Database Systems
102
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
? 26.2s 28.1s
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
a column-store
l
save disk I/O when scan-intensive queries columns avoid an expression interpreter to improve computational efficiency
need a few
106
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
107
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
108
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
109
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
a column-store
l l
Often SIMD for free Plus: use of cache-conscious algorithms for Sort/Aggr/Join
Column-Oriented Database Systems 110
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
a Faustian pact
l
I take scalability
l
111
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
SIGMOD 1985
B MonetD bra lge BAT A ts suppor B MonetD G, L, ODM M SQL, X ..RDF
MIL Primitives for Querying a Fragmented World, Boncz, Kersten, VLDBJ98 Flattening an Object Algebra to Provide Performance Boncz, Wilschut, Kersten, ICDE98 MonetDB/XQuery: a fast XQuery processor powered C- by a relational engine Boncz, Grust, vanKeulen, port on e p RDF su SW-Stor Rittinger, Teubner, SIGMOD06 / SW-Store: a vertically partitioned DBMS for Semantic STORE Web data management Abadi, Marcus, Madden, Hollenbach, VLDBJ09
Column-Oriented Database Systems 113
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
The
Extensible query lang frontend framework XQuery .. SGL? Extensible Dynamic
Software Stack
RDF
Arrays
MonetDB kernel
OGIS
114
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
optimizer
frontend
backend
as a research platform
l
Cache-Conscious Joins
l
MonetDB/XQuery:
l
Bottleneck: Memory Access VLDB99 Cost Models, Radix-cluster, Generic Database Cost Models for Hierarchical Memory Radix-decluster Systems, VLDB02 (all Manegold, Boncz, Kersten) Cache-Conscious Radix-Decluster Projections, Manegold, Boncz, Nes, VLDB04
MonetDB/XQuery: a fast XQuery processor powered by a relational engine Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD06 Database Cracking, CIDR07 Updating a cracked database , SIGMOD07 Self-organizing tuple reconstruction in columnstores, SIGMOD09 (all Idreos, Manegold, Kersten)
Cracking:
l
Recycling:
l
An architecture for recycling intermediates in a using materialized intermediates column-store, Ivanova, Kersten, Nes, Goncalves, SIGMOD09
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
117
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
next()
SCAN
118
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
next()
> 30 ?
PROJECT
next()
22 37 45 25 22 37 45 25
next()
SCAN
119
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
In Cache
next()
102 ivan 350 104 peggy 750 - 30 * 50 102 ivan 37 7 350 104 peggy 45 15 750 > 30 ?
next() called much less often more time spent in primitives vector = array of ~100 less in overhead processed in a tight loop primitive calls process an array of values cache Resident CPU in a loop:
PROJECT
next()
22 37 45 25 22 37 45 25
next()
SCAN
120
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Observations: next() called much less often more time spent in primitives less in overhead primitive calls process an array of values in a loop:
CPU Efficiency depends on nice code - out-of-order execution - few dependencies (control,data) - compiler support
102 ivan 350 104 peggy 750 - 30 * 50 102 ivan 37 7 350 104 peggy 45 15 750 > 30 ?
next()
PROJECT
next()
22 37 45 25 22 37 45 25
next()
Compilers like simple loops over arrays - loop-pipelining - automatic SIMD
SCAN
121
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Observations: next() called much less often more time spent in primitives less in overhead primitive calls process an array of values in a loop:
CPU Efficiency depends on nice code - out-of-order execution - few dependencies (control,data) - compiler support Compilers like simple loops over arrays - loop-pipelining - automatic SIMD
PROJECT
FALSE res[i] = (col[i] - x) TRUE TRUE FALSE SELECT for(i=0; i<n; i++)
* 50 350 750
res[i] = (col[i] * x)
SCAN
122
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Tricks being played: - Late materialization - Materialization avoidance using selection vectors
> 30 ?
PROJECT
next()
1 2
SELECT next()
101 102 104 105 alice ivan peggy victor 22 37 45 25
SCAN
123
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
map_mul_flt_val_flt_col( float *res, int* sel, Tricks being played: float val, float *col, int n) - Late materialization next() { - Materialization avoidance for(int i=0; i<n; i++) using selection vectors res[i] = val * col[sel[i]]; } selection vectors used to reduce vector copying contain selected positions next()
101 102 104 105 alice ivan peggy victor 22 37 45 25
- 30 * 50 7 15 350 750
> 30 ?
PROJECT
next()
1 2
SELECT
SCAN
124
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
map_mul_flt_val_flt_col( float *res, int* sel, Tricks being played: float val, float *col, int n) - Late materialization next() { - Materialization avoidance for(int i=0; i<n; i++) using selection vectors res[i] = val * col[sel[i]]; } selection vectors used to reduce vector copying contain selected positions next()
101 102 104 105 alice ivan peggy victor 22 37 45 25
- 30 * 50 7 15 350 750
> 30 ?
PROJECT
next()
1 2
SELECT
SCAN
125
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
MonetDB/X100
l
Both efficiency
l
and scalability..
l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Memory Hierarchy
CPU cache
RAM
(raid) Disk(s)
VLDB 2009 Tutorial Column-Oriented Database Systems 127
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Memory Hierarchy
Vectors are only the in-cache representation RAM & disk representation might actually be different
CPU cache
RAM
(raid) Disk(s)
VLDB 2009 Tutorial Column-Oriented Database Systems 128
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
All vectors together should fit the CPU cache Optimizer should tune this, given the query characteristics.
CPU cache
RAM
(raid) Disk(s)
VLDB 2009 Tutorial Column-Oriented Database Systems 129
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Less and less iterator.next() and primitive function calls (interpretation overhead)
130
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
131
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
132
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
CPU cache
RAM
(raid) Disk(s)
VLDB 2009 Tutorial Column-Oriented Database Systems 133
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
iterator.next(), primitives
Buffering Database Operations for Enhanced Instruction Cache Performance Zhou, Ross, SIGMOD04
Block oriented processing of relational database operations in Less Data Cache Misses modern computer architectures l Cache-conscious data placement Padmanabhan, Malkemus, Agarwal, ICDE01 No Tuple Navigation
l
134
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Balancing Vectorized Query Execution with Bandwidth Optimized Storage Zukowski, CWI 2008
Project Select
l
Aggregation
l
Ordered, Hashed
Sort
l
Join
l
Merge-join + Hash-join
135
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Key Ingredients
l
l
Decompress small vectors of tuples from a column into the CPU cache
l
137
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Key Ingredients
l
l
Decompress small vectors of tuples from a column into the CPU cache
l
138
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Vectorized Decompression
X100 query engine
Idea: decompress a vector only compression: -between CPU and RAM -Instead of disk and RAM (classic)
CPU cache
RAM
(raid) Disk(s)
VLDB 2009 Tutorial Column-Oriented Database Systems 139
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
RAM-Cache Decompression
l
Decompress vectors on-demand into the cache RAM-Cache boundary only crossed once More (compressed) data cached in RAM Less bandwidth use
140
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
0.8
0.6
0.4
Intel Xeon E5410 (Harpertown) - Q6b Compressed Intel Xeon X5560 (Nehalem) - Q6b Normal Intel Xeon X5560 (Nehalem) - Q6b Compressed
0 1 2 3 4 5 6 7 8
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Decoding loop over cache-resident vectors of code words Avoid control dependencies within decoding loop
l
142
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Forward growing section of arbitrary size code words (code word size fixed per block)
143
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Forward growing section of arbitrary size code words (code word size fixed per block) Backwards growing exception list
144
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Use reserved value from code word domain (MAXCODE) to mark exception positions
145
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patching
l
Maintain a patch-list through code word section that links exception positions
148
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patching
l
Maintain a patch-list through code word section that links exception positions After decoding, patch up the exception positions with the correct values
149
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patched Decompression
150
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Patched Decompression
151
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Decompression Bandwidth
Patching can be applied to: Frame of Reference (PFOR) Delta Compression (PFOR-DELTA) Dictionary Compression (PDICT)
Makes these methods much more applicable to noisy data without additional cost
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Summary (1/2)
l
not efficiently: row-stores need change actually independent issues, on-the-fly conversion pays off column favors sequential access, row random Fractured mirrors PAX, Clotho Data morphing
154
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Summary (2/2)
l
Storage
l
Lean headers, sparse indices, fast positional access Operating on compressed data Lightweight, vectorized decompression Non-join: LM always wins Nave/Invisible/Jive/Flash/Radix Join (LM often wins) Vectorized in-cache execution Exploiting SIMD
Column-Oriented Database Systems 155
Compression
l l
Execution
l l
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Future Work
l
156
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (1/3)
l
In-place: I/O for each column + re-compression + multiple sorted replicas + sparse tree indices
Update-in-place is infeasible!
157
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (2/3)
l
Differential lists/files or more advanced (e.g. PDTs) Updates buffered in RAM, merged on each query Checkpointing merges differences in bulk sequentially
l
trade RAM for converting random into sequential I/O this trade is also needed in Flash (do not write randomly!) Depends on available RAM for buffering (how long until full) Checkpoint must be done within that time The longer it can run, the less it molests queries Using Flash for buffering differences buys a lot of time Hundreds of GBs of differences per server
158
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Updates (3/3)
l l
Differential transactions favored by hardware trends Snapshot semantics accepted by the user community
l
My conclusion: a system that combines row- and columns needs differentially implemented transactions. Starting from a pure column-store, this is a limited add-on. Starting from a pure row-store, this implies a full rewrite.
VLDB 2009 Tutorial Column-Oriented Database Systems 159
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Future Work
l
l l
compression/materialization/join tricks cost models? can row-stores learn new column tricks?
l
Study of the minimal number changes one needs to make to a row store to get the majority of the benefits of a column-store Alternative: add features to column-stores that make them more like row stores
160
Re-use permitted when acknowledging the original Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Conclusion
l
Data warehousing, BI Information retrieval, graphs, e-science Without these, existing row systems do not benefit Row-stores may adopt some column-store techniques Column-stores add row-store (or PAX) functionality
161