Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
Motivation
Three access patterns over the same wide table (columns C1–C7, rows R1–R9):
- OLAP-style query (multi-dimensional analysis)
- Sequential access (big scan)
- Random access (narrow scan)
Design Goals
CarbonData:
- Read-optimized columnar storage
- Leverages a multi-level index for low latency
- Supports column groups, to retain the benefits of row-based storage
- Enables dictionary encoding, allowing deferred decoding during aggregation
- Optimized streaming-ingestion support
- Broad integration across the Hadoop ecosystem
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
CarbonData File Format
Carbon Data File:
- Blocklet 1 … Blocklet N: each blocklet holds column chunks (Col1 Chunk, Col2 Chunk, …) and column-group chunks (ColGroup1 Chunk, ColGroup2 Chunk, …)
- File Footer:
  - Blocklet Index: Blocklet 1 … N index nodes; multi-dimensional index (startKey, endKey)
  - Blocklet Info: Blocklet 1 Info … Blocklet N Info
  - File Metadata: version, number of rows, Segment Info, Schema
Blocklet
- Data is sorted along the MDK (multi-dimensional key)
- Data is stored as an index, in columnar format
Blocklet Logical View

Source table:

Years  Quarters  Months  Territory  Country    Quantity  Sales
2003   QTR1      Jan     EMEA       Germany    142       11,432
2003   QTR1      Jan     APAC       China      541       54,702
2003   QTR1      Jan     EMEA       Spain      443       44,622
2003   QTR1      Feb     EMEA       Denmark    545       58,871
2003   QTR1      Feb     EMEA       Italy      675       56,181
2003   QTR1      Mar     APAC       India      52        9,749
2003   QTR1      Mar     EMEA       UK         570       51,018
2003   QTR1      Mar     Japan      Japan      561       55,245
2003   QTR2      Apr     APAC       Australia  525       50,398
2003   QTR2      Apr     EMEA       Germany    144       11,532

Encoding (each dimension is dictionary-encoded into a surrogate key, forming the MDK; [Quantity, Sales] are the measures):
[1,1,1,1,1] : [142,11432]
[1,1,1,3,2] : [541,54702]
[1,1,1,1,3] : [443,44622]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,3,6] : [52,9749]
[1,1,3,1,7] : [570,51018]

Sort (MDK Index):
[1,1,1,1,1] : [142,11432]
[1,1,1,1,3] : [443,44622]
[1,1,1,3,2] : [541,54702]
[1,1,2,1,4] : [545,58871]
[1,1,2,1,5] : [675,56181]
[1,1,3,1,7] : [570,51018]
[1,1,3,2,8] : [561,55245]
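The encode-and-sort pipeline above can be sketched in a few lines of plain Python. This is a simplified illustration, not CarbonData's loader; dictionary keys here are assigned in first-seen order, so the exact surrogate values may differ from the slide.

```python
# Dictionary-encode each dimension column into a surrogate key, form the
# multi-dimensional key (MDK), then sort rows by MDK before writing blocklets.

rows = [
    # (Years, Quarters, Months, Territory, Country, Quantity, Sales)
    ("2003", "QTR1", "Jan", "EMEA", "Germany", 142, 11432),
    ("2003", "QTR1", "Jan", "APAC", "China",   541, 54702),
    ("2003", "QTR1", "Jan", "EMEA", "Spain",   443, 44622),
    ("2003", "QTR1", "Feb", "EMEA", "Denmark", 545, 58871),
]

DIM_COUNT = 5  # the first five columns are dimensions, the rest are measures

def build_dictionaries(rows):
    """One dictionary per dimension: value -> surrogate key (1-based)."""
    dicts = [{} for _ in range(DIM_COUNT)]
    for row in rows:
        for d, value in enumerate(row[:DIM_COUNT]):
            dicts[d].setdefault(value, len(dicts[d]) + 1)
    return dicts

def encode(rows, dicts):
    """Replace dimension values by surrogate keys; keep measures as-is."""
    return [
        (tuple(dicts[d][v] for d, v in enumerate(row[:DIM_COUNT])),
         row[DIM_COUNT:])
        for row in rows
    ]

dicts = build_dictionaries(rows)
encoded = encode(rows, dicts)
encoded.sort(key=lambda kv: kv[0])   # sort along the MDK
```

After the sort, rows with a common MDK prefix (same year, quarter, month, …) sit next to each other, which is what makes the multi-dimensional index effective.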
Blocklet Index

The file footer carries one index entry per block:
- Block1: Start Key1, End Key1, C1(Min, Max) … C7(Min, Max)
- Block2: Start Key2, End Key2, C1(Min, Max) … C7(Min, Max)
- Block3: Start Key3, End Key3, C1(Min, Max) … C7(Min, Max)
- Block4: Start Key4, End Key4, C1(Min, Max) … C7(Min, Max)
The index tree's root spans (Start Key1, End Key4), so a scan can prune whole blocks by MDK range or by per-column min/max.
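The point of keeping per-block start/end keys and per-column (min, max) values in the footer is block pruning. A minimal sketch, with hypothetical footer entries:

```python
# Prune blocks using footer index entries: a block can be skipped when the
# query predicate falls outside its per-column (min, max) range.

blocks = [
    # hypothetical footer entries: (name, {column: (min, max)})
    ("Block1", {"C1": (1, 1), "C7": (1000, 12000)}),
    ("Block2", {"C1": (1, 2), "C7": (5000, 12000)}),
    ("Block3", {"C1": (1, 3), "C7": (1000, 20000)}),
    ("Block4", {"C1": (2, 3), "C7": (11000, 20000)}),
]

def prune(blocks, column, value):
    """Return only the blocks whose (min, max) range can contain `value`."""
    return [name for name, stats in blocks
            if stats[column][0] <= value <= stats[column][1]]

# A scan for C7 = 15000 touches only the blocks whose range covers 15000:
survivors = prune(blocks, "C7", 15000)
```

The same idea applies one level up with (startKey, endKey) on the MDK, which is how the narrow-scan "random access" case avoids reading most of the file.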
Blocklet Physical View (sort columns within each column chunk)

[Figure: each column chunk C1–C7 stores its values sorted as data (d) alongside an inverted index of row positions (r), so runs of equal values compress well while row order remains reconstructable]
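The "sort within column chunk" idea can be illustrated with a pair of toy functions (an illustrative sketch, not the actual reader/writer):

```python
# Store a column chunk as (sorted values, original row positions), so the
# chunk compresses well and can still be restored to row order on read.

def write_chunk(values):
    """Sort values, remembering each value's original row id."""
    pairs = sorted((v, rowid) for rowid, v in enumerate(values))
    data = [v for v, _ in pairs]      # sorted data ("d" in the figure)
    rowids = [r for _, r in pairs]    # inverted index ("r" in the figure)
    return data, rowids

def read_chunk(data, rowids):
    """Invert the permutation to restore the original row order."""
    out = [None] * len(data)
    for v, r in zip(data, rowids):
        out[r] = v
    return out

col = [3, 1, 2, 1, 3]
data, rowids = write_chunk(col)
assert data == [1, 1, 2, 3, 3]           # sorted run, friendly to RLE
assert read_chunk(data, rowids) == col   # round-trips back to row order
```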
Column Group
- Allows multiple columns to form a column group
- The group is stored as a single column chunk, in row-based format
- Suitable for a set of columns frequently fetched together
- Saves the stitching cost of reconstructing rows
[Figure: Blocklet 1 with Col1–Col5 stored together as one row-based column-group chunk, and Col6 as a separate column chunk]
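The stitching-cost argument can be made concrete with a toy comparison (illustrative values only):

```python
# A column group stores several columns together in row format inside one
# chunk, so frequently co-accessed columns come back as ready-made tuples
# instead of being stitched from separate column chunks.

col1 = [10, 50, 60, 68]
col2 = [23, 10, 29, 32]

# Plain columnar layout: each column is its own chunk; reading a row means
# pulling one value out of every chunk and stitching them together.
def read_row_columnar(chunks, rowid):
    return tuple(chunk[rowid] for chunk in chunks)

# Column-group layout: the group is one chunk of row tuples; a row read is
# a single lookup, no stitching.
group_chunk = list(zip(col1, col2))
```

With many grouped columns and wide scans, the per-row stitching loop is what the column-group layout eliminates.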
Structs & Arrays (nested types)

Array — logical view:
Name  Array<Ph_Number>
John  [192, 191]
Sam   [121, 345, 333]
Bob   [198, 787]

Stored flattened as a child column plus (start, len) offsets:
Name  Array[start, len]
John  0, 2
Sam   2, 3
Bob   5, 2
Ph_Number: 192, 191, 121, 345, 333, 198, 787

Struct — logical view:
Name  Info Struct<age, gender>
John  [31, M]
Sam   [45, F]
Bob   [16, M]

Stored flattened as one column per field:
Name  Info.age  Info.gender
John  31        M
Sam   45        F
Bob   16        M
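The (start, len) offset scheme above can be sketched as follows (hypothetical helper names):

```python
# Flatten an Array<T> column into a child value column plus (start, len)
# offsets in the parent, as in the layout above.

table = {"John": [192, 191], "Sam": [121, 345, 333], "Bob": [198, 787]}

def flatten(arrays):
    """Return parallel columns: names, (start, len) offsets, child values."""
    names, offsets, values = [], [], []
    for name, arr in arrays.items():
        names.append(name)
        offsets.append((len(values), len(arr)))  # where this array starts
        values.extend(arr)                       # child column, concatenated
    return names, offsets, values

def lookup(names, offsets, values, name):
    """Rebuild one array from the flattened layout."""
    start, length = offsets[names.index(name)]
    return values[start:start + length]

names, offsets, values = flatten(table)
```

Both the parent and child columns stay flat and columnar, so they benefit from the same encoding and indexing as scalar columns.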
Big Win:
- Speed up aggregation
- Reduce run-time memory footprint
- Enable deferred decoding
- Enable fast distinct count
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
CarbonData Modules
- Carbon-Spark: integration
- Carbon-Hadoop: Input/Output format
- Carbon-core: reader/writer
- Carbon-format: Thrift definition
Spark Integration
Query a CarbonData table via:
- DataFrame API
- Spark SQL statements
Spark Integration
- Table-level MDK tree index
- Query optimization
[Figure: a table is partitioned into blocks; each block contains blocklets plus a footer with its index; within a blocklet, column chunks C1–C9 carry an inverted index]
Data Ingestion

Bulk data ingestion:

df.write
  .format("org.apache.spark.CarbonSource")
  .options(Map("dbName" -> "db1", "tableName" -> "tbl1"))
  .mode(SaveMode.Overwrite)
  .save("/path")
Data Compaction
Outline
Use Case & Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
Performance Comparison

Carbon vs. popular columnar stores, across benchmark queries SQL1–SQL9:
- OLAP/interactive queries: 20x to 33x faster
- Random-access queries: 26x to 688x faster
[Chart: per-query response times in seconds for Carbon vs. popular columnar stores, SQL1–SQL9]
Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
Live Demo

High-throughput / full-scan query:
SELECT PROD_BRAND_NAME, SUM(STR_ORD_QTY)
FROM oscon_demo
GROUP BY PROD_BRAND_NAME;

OLAP/interactive query:
SELECT PROD_COLOR, SUM(STR_ORD_QTY)
FROM oscon_demo
WHERE CUST_COUNTRY = 'New Zealand'
  AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X'
GROUP BY PROD_COLOR;

Random-access query:
SELECT *
FROM oscon_demo
WHERE CUST_PRFRD_FLG = 'Y'
  AND PROD_BRAND_NAME = 'Huawei'
  AND PROD_COLOR = 'BLACK'
  AND CUST_LAST_RVW_DATE = '2015-12-11 00:00:00'
  AND CUST_COUNTRY = 'New Zealand'
  AND CUST_CITY = 'Auckland'
  AND PRODUCT_NAME = 'Huawei Honor 4X';

Demo Environment:
Number of Nodes: 5 VMs (AWS r3.4xlarge)
vCPU: 80 (16/node)
Memory:
#Columns: 300
Data Size: 600 GB
#Records: 300M
Outline
Motivation: Why introduce a new file format?
CarbonData File Format Deep Dive
Frameworks Integrated with CarbonData
Performance
Demo
Future Plan
Future Plan
Community
Main Contributors:
- Jihong MA, Vimal, Raghu, Ramana, Ravindra, Vishal, Aniket, Liang Chenliang, Jacky Likun, Jarry Qiuheng, David Caiqiang, Eason Linyixin, Ashok, Sujith, Manish, Manohar, Shahid, Ravikiran, Naresh, Krishna, Babu, Ayush, Santosh, Zhangshunyu, Liujunjie, Zhujing (Huawei)
- Jean-Baptiste Onofre (Talend, ASF member), Henry Saputra (eBay, ASF member), Uma Maheswara Rao G (Intel, Hadoop PMC)
Thank you
www.huawei.com