
HydraFS

C. Ungureanu, B. Atkin, A. Aranya, et al.

Slides: Joe Buck, CMPS 229, Spring 2010

April 27, 2010




Introduction

✤ What is HydraFS?

✤ Why is it necessary?



HYDRAstor

✤ What is HYDRAstor?

✤ Immutable data

✤ High Latency

✤ Jitter

✤ Put / Get API

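A minimal sketch of the put/get idea in Python (not the real HYDRAstor API; all names are invented). The store chooses the address and hands it back to the client; blocks are immutable once written, and duplicate data is stored only once:

import hashlib

class ToyBlockStore:
    """Toy stand-in for Hydra's put/get block interface (illustrative only)."""

    def __init__(self):
        self._blocks = {}                     # address -> immutable bytes

    def put(self, data: bytes) -> str:
        # The store derives the address and returns it; the client treats it
        # as opaque. (Real Hydra addresses are not a plain content hash.)
        addr = hashlib.sha256(data).hexdigest()
        self._blocks.setdefault(addr, data)   # duplicate data is stored once
        return addr

    def get(self, addr: str) -> bytes:
        return self._blocks[addr]

store = ToyBlockStore()
addr = store.put(b"some backup data")
assert store.get(addr) == b"some backup data"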


Hydra Diagram

✤ HydraFS acts as a front end for Hydra, a distributed, content-addressable block store

[Figure 1: an Access Node runs the HydraFS File Server and Commit Server on top of the HYDRAstor Block Access Library; underneath, Hydra spans multiple Storage Nodes and presents a single-system content-addressable store]

CAS

[Diagram: a client issues a stream of 4 KB writes to the CAS]



CAS - continued

[Diagram: the client's 4 KB writes now pass through a chunker before reaching the CAS]



CAS - continued

[Diagram: the chunker has cut a variable-sized 10 KB chunk, stored in the CAS as cas1; 14 KB of client data remains buffered]



CAS - continued

Client
1 KB 4 KB

CAS

cas1: 10KB cas2: 9KB

Tuesday, April 27, 2010


A little more on CAS addresses

✤ Same data doesn’t mean the same address

✤ Impossible to calculate prior to write

✤ Foreground processing writes shallow trees

✤ Root cannot be updated until all child nodes are set

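Since an address is only known once the store confirms the write, file trees have to be built bottom-up: leaves first, then interior nodes that embed the returned child addresses, and the root last. A hypothetical sketch of that order:

import hashlib, json

blocks = {}                                   # toy store: address -> bytes

def put(data: bytes) -> str:                  # stand-in for a Hydra write
    addr = hashlib.sha256(data).hexdigest()
    blocks[addr] = data
    return addr

# 1. Write the leaf data blocks; their addresses exist only after the writes return.
leaf_addrs = [put(chunk) for chunk in (b"chunk-0", b"chunk-1", b"chunk-2")]

# 2. Only now can an interior node be written, because it embeds child addresses.
interior_addr = put(json.dumps(leaf_addrs).encode())

# 3. The root is written last, once everything beneath it is durable.
root_addr = put(json.dumps({"file": interior_addr}).encode())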


Issues for a CAS FS

✤ Updates are more expensive

✤ Metadata cache misses cause significant performance issues

✤ The combination of high latency and high throughput means lots of buffering



Design Decisions

✤ Decouple data and metadata processing

✤ Fixed size caches with admission control

✤ Second-order cache for metadata



Issues - continued

✤ Immutable Blocks

✤ FS can only reference blocks already written

✤ Forms DAGs

✤ Height of DAGs needs to be minimized



Issues - continued

✤ High latency

✤ Instead of milliseconds to tens of milliseconds of latency, Hydra has latencies of hundreds of milliseconds to 1 s

✤ Stream hints

✤ Delay writes to batch streams together

✤ High degree of parallelism needed to mask high latencies

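With per-request latencies in the hundreds of milliseconds, throughput can only come from keeping many requests in flight at once. A rough sketch of the idea with a thread pool (the 500 ms latency and the concurrency level are made-up numbers):

import time
from concurrent.futures import ThreadPoolExecutor

def put(block: bytes) -> str:
    time.sleep(0.5)                          # pretend each write takes ~500 ms
    return "addr-%02x" % block[0]            # fake opaque address

blocks = [bytes([i]) * 4096 for i in range(64)]

# Issued one at a time this would take ~32 s; with 32 writes in flight it
# finishes in roughly 1 s, hiding the per-request latency.
with ThreadPoolExecutor(max_workers=32) as pool:
    addrs = list(pool.map(put, blocks))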


Issues - continued

✤ Variable sized blocks

✤ Avoids the “shifting window” problem

✤ Use a balanced tree structure

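A simplified content-defined chunker, to show why variable-sized blocks avoid the shifting-window problem: boundaries are cut where the data itself says so, so inserting a few bytes only disturbs nearby chunks. This is not the paper's algorithm; the hash and the cut-point parameters are invented:

def chunk(data: bytes, mask: int = 0x3FF, min_size: int = 512,
          max_size: int = 64 * 1024):
    """Cut a chunk whenever a crude rolling hash hits a fixed bit pattern."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF      # older bytes shift out over time
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

parts = chunk(b"example data " * 1000)
assert b"".join(parts) == b"example data " * 1000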


FS design

✤ High Throughput

✤ Minimize the number of dependent I/O operations

✤ Availability guarantees no worse than standard Unix FS

✤ Efficiently support both local and remote access



File System Layout
[Figure 3: HydraFS software layout. Super blocks point to an imap handle; the imap (a B-tree over a segmented array of content addresses) maps inode numbers to inodes. A directory inode's B-tree points to directory blocks, whose entries map filenames to inode numbers and types (e.g. Filename1 → 321 R, Filename3 → 442 D); a regular file inode's B-tree points to the file's data blocks.]

✤ The imap translates inode numbers to content addresses and also allocates and frees inode numbers

✤ A regular file inode indexes its blocks with a B-tree so it can accommodate very large files

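A dictionary-backed sketch of the imap's role, leaving out the segmented array and B-tree: translate an inode number to the content address of that inode's current root block, and hand out / reclaim inode numbers. Names are illustrative:

class ToyImap:
    """Maps inode numbers to content addresses (illustrative only)."""

    def __init__(self):
        self._roots = {}                     # inode number -> content address
        self._next_ino = 1

    def allocate(self) -> int:               # hand out a fresh inode number
        ino, self._next_ino = self._next_ino, self._next_ino + 1
        return ino

    def set_root(self, ino: int, addr: str) -> None:
        self._roots[ino] = addr              # updated when the inode's tree is rewritten

    def lookup(self, ino: int) -> str:
        return self._roots[ino]

    def free(self, ino: int) -> None:
        del self._roots[ino]

imap = ToyImap()
ino = imap.allocate()
imap.set_root(ino, "addr-of-inode-block")
assert imap.lookup(ino) == "addr-of-inode-block"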
HydraFS Software Stack

✤ Uses FUSE

✤ Split into file server and commit server

✤ Simplifies metadata locking

✤ Amortizes the cost of metadata updates via batching

✤ Each server has its own caching strategy



Writing Data

✤ Data stored in inode specific buffer

✤ Chunked, marked dirty and written to Hydra

✤ After write confirmation, the buffer is freed and the block's address is entered in the uncommitted block table

✤ Needed until metadata is flushed to storage

✤ Designed for append-style writing; in-place updates are expensive

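A rough sketch of that write path (fixed-size chunks for simplicity, and all names invented): buffer incoming data per inode, cut chunks, write them to the store, and keep the returned addresses in an uncommitted-block table until the metadata flush makes them reachable from the file system tree:

import hashlib

class ToyInodeWriter:
    def __init__(self, chunk_size: int = 64 * 1024):
        self.chunk_size = chunk_size
        self.buf = bytearray()               # per-inode write buffer
        self.offset = 0
        self.uncommitted = []                # (offset, length, address)
        self.store = {}                      # stand-in for Hydra

    def _put(self, data: bytes) -> str:
        addr = hashlib.sha256(data).hexdigest()
        self.store[addr] = data
        return addr

    def append(self, data: bytes) -> None:
        self.buf += data
        while len(self.buf) >= self.chunk_size:
            chunk = bytes(self.buf[:self.chunk_size])
            del self.buf[:self.chunk_size]
            addr = self._put(chunk)          # buffer memory can be released now...
            self.uncommitted.append((self.offset, len(chunk), addr))
            self.offset += len(chunk)        # ...but the mapping stays until commit

    def flush_metadata(self):
        done, self.uncommitted = self.uncommitted, []
        return done                          # would be folded into the inode B-tree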


Metadata Cleaning

✤ Dirty data kept until the commit server applies changes

✤ New versions of the file system are created periodically

✤ Metadata in separate structures, tagged by time

✤ Always clean (in Hydra), can be dropped from cache at any time

✤ Cleaning lets file servers drop dirty changes once they appear in the new FS version

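A sketch of the periodic commit, under the simplifying assumption that a file-system version is just a root record pointing at an imap snapshot: once the new version is durable in the store, the file server can drop the dirty metadata that it covers. Everything here is illustrative:

import hashlib, json, time

store = {}                                   # toy block store

def put(data: bytes) -> str:
    addr = hashlib.sha256(data).hexdigest()
    store[addr] = data
    return addr

def commit_version(version: int, imap_snapshot: dict) -> str:
    """Persist a new FS version; the returned address is the new root."""
    root = {"version": version,
            "created": time.time(),          # versions are tagged by time
            "imap": imap_snapshot}           # inode number -> content address
    return put(json.dumps(root, sort_keys=True).encode())

root_v1 = commit_version(1, {"1": "addr-of-root-directory-inode"})
root_v2 = commit_version(2, {"1": "addr-of-root-directory-inode",
                             "2": "addr-of-new-file-inode"})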


Admission Control

✤ Events assume worst-case memory usage

✤ If insufficient resources are available, the event blocks

✤ Limits the number of active events

✤ Memory usage is tuned to the amount of physical memory

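A sketch of admission control with a condition variable: the budget is sized to physical memory, each event reserves its worst-case memory before running, and it blocks until enough is free. The budget and reservation sizes below are made up:

import threading

class AdmissionControl:
    def __init__(self, budget_bytes: int):
        self.free = budget_bytes             # tuned to physical memory
        self.cond = threading.Condition()

    def admit(self, worst_case_bytes: int) -> None:
        with self.cond:
            while self.free < worst_case_bytes:
                self.cond.wait()             # block until memory is released
            self.free -= worst_case_bytes

    def release(self, worst_case_bytes: int) -> None:
        with self.cond:
            self.free += worst_case_bytes
            self.cond.notify_all()

ac = AdmissionControl(budget_bytes=256 * 1024 * 1024)
ac.admit(8 * 1024 * 1024)                    # a write event reserves its worst case
# ... handle the event ...
ac.release(8 * 1024 * 1024)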


Read Processing

✤ Aggressive read-ahead

✤ Multiple fetches to get metadata

✤ Weighted caching to favor metadata over data

✤ Fast range map

✤ Metadata read-ahead

✤ Primes the fast range map (FRM) and the cache

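A sketch of a fast range map as a sorted extent list: reads translate a file offset straight to a content address, and a miss falls back to walking the inode B-tree (not shown); metadata read-ahead would prime this structure. Purely illustrative:

import bisect

class FastRangeMap:
    def __init__(self):
        self.starts = []                     # sorted extent start offsets
        self.extents = []                    # parallel (start, length, address)

    def insert(self, start: int, length: int, addr: str) -> None:
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.extents.insert(i, (start, length, addr))

    def lookup(self, offset: int):
        """Return the address covering offset, or None on a miss."""
        i = bisect.bisect_right(self.starts, offset) - 1
        if i >= 0:
            start, length, addr = self.extents[i]
            if start <= offset < start + length:
                return addr
        return None

frm = FastRangeMap()
frm.insert(0, 65536, "cas1")
frm.insert(65536, 65536, "cas2")
assert frm.lookup(70000) == "cas2" and frm.lookup(200000) is None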


Deletion

✤ File deletion removes the entry from the current FS

✤ Data remains until there are no pointers to it

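A toy mark-and-sweep over a block store, just to illustrate the point above: after a delete, the data block is no longer reachable from the current FS root, and only then can it be reclaimed (Hydra's real space reclamation is a distributed process, not this):

import json

store = {
    "root":    json.dumps(["inode-a"]).encode(),   # current FS version
    "inode-a": json.dumps(["data-1"]).encode(),    # file that still exists
    "data-1":  b"live data",
    "data-2":  b"deleted file's data",             # nothing points here anymore
}

def sweep(root_addr: str) -> None:
    live, stack = set(), [root_addr]
    while stack:                                   # mark everything reachable
        addr = stack.pop()
        if addr in live or addr not in store:
            continue
        live.add(addr)
        if store[addr].startswith(b"["):           # toy convention: JSON list = child addresses
            stack.extend(json.loads(store[addr]))
    for addr in list(store):                       # reclaim the rest
        if addr not in live:
            del store[addr]

sweep("root")
assert "data-1" in store and "data-2" not in store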


Performance

[Figure: normalized throughput of the raw block device and of the file system built on it, for Read (iSCSI), Read (Hydra), Write (iSCSI), and Write (Hydra)]

Metadata Intensive

✤ Postmark

✤ Generates files, then issues transactions.

✤ File size: 512 B - 16 KB

            Create           Delete
         Alone     Tx     Alone     Tx    Overall
ext3     1,851     68     1,787     68      136
HydraFS     61     28       676     28       57

Table 1: Postmark comparing HydraFS with ext3 on similar hardware

✤ HydraFS limits how many uncommitted blocks can accumulate, since a large backlog increases commit-server turnaround and makes user-operation latency unpredictable

✤ ext3 has no such limitation and writes all metadata updates to its journal


Write Performance vs Dedup

✤ Write throughput rises with the duplicate ratio, since duplicate data needs correspondingly fewer disk I/Os

✤ In all cases, HydraFS throughput is within 12% of Hydra throughput, so HydraFS meets its goal of maintaining high throughput

[Figure: write throughput (MB/s) versus duplicate ratio (%) for Hydra and HydraFS]

Write Behind

[Figure 7: write completion order, plotting file offset (GB) against time (s); many block writes are in flight in parallel]

Hydra Latency

[Figure: cumulative distribution of Hydra request latency, Pr(t <= x) versus time (ms), over roughly 0-70 ms]



Future Work

✤ Allow multiple nodes to manage same FS

✤ Makes failover transparent and automatic

✤ Exposing snapshots to users

✤ Incorporating SSD storage to lower latencies, making HydraFS usable as primary storage



Thank you

✤ Questions?

✤ Comments?

✤ email: buck@soe.ucsc.edu

✤ Paper: http://www.usenix.org/events/fast10/tech/full_papers/ungureanu.pdf



Sample Operations

✤ Block Write

✤ Block Read

✤ Searchable Block Write

✤ Searchable Block Read

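A toy illustration of the four operations above (not the real Hydra API): regular blocks are written and read by the opaque address the store returns, while searchable blocks are written under a client-chosen search key and read back by that key, which lets a client find a well-known record such as the latest superblock without having to remember any address:

import hashlib

class ToyHydra:
    def __init__(self):
        self.blocks = {}          # content address -> data
        self.searchable = {}      # client-chosen search key -> content address

    # Regular blocks: the address is assigned by the store and returned.
    def block_write(self, data: bytes) -> str:
        addr = hashlib.sha256(data).hexdigest()
        self.blocks[addr] = data
        return addr

    def block_read(self, addr: str) -> bytes:
        return self.blocks[addr]

    # Searchable blocks: retrievable later by key alone.
    def searchable_block_write(self, key: str, data: bytes) -> str:
        addr = self.block_write(data)
        self.searchable[key] = addr
        return addr

    def searchable_block_read(self, key: str) -> bytes:
        return self.block_read(self.searchable[key])

h = ToyHydra()
h.searchable_block_write("fs-root:v42", b"superblock contents")
assert h.searchable_block_read("fs-root:v42") == b"superblock contents"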
