
HydraFS

C. Ungureanu, B. Atkin, A. Aranya, et al.

Slides: Joe Buck, CMPS 229, Spring 2010

April 27, 2010




Introduction

✤ What is HydraFS?

✤ Why is it necessary?


HydraFS is a file system built on top of HYDRAstor, a scalable, distributed CAS.


Applications don't write to a CAS interface, they write to a file system interface, so an adapter
layer is needed; hence HydraFS.
The CAS uses a put/get model.
HYDRAstor

✤ What is HYDRAstor

✤ Immutable data

✤ High Latency

✤ Jitter

✤ Put / Get API


Note the inconsistent use of capitalization in the acronyms (e.g., HYDRAstor vs. Hydra).


Jitter, in this case, means variation in the spacing between writes reaching storage.
Mention chunking.
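
The put/get model is easy to picture with a toy content-addressable store: put takes an immutable block and returns an address derived from its content, and get returns the block for an address. A minimal sketch for intuition only (not the HYDRAstor API; the class and method names are invented), ignoring for now that retention levels can make identical data map to different addresses:

```python
import hashlib

class ContentAddressableStore:
    """Toy CAS: blocks are immutable and keyed by a hash of their content."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, so identical data
        # written twice maps to the same stored block (deduplication).
        addr = hashlib.sha256(data).hexdigest()
        self._blocks[addr] = data
        return addr

    def get(self, addr: str) -> bytes:
        return self._blocks[addr]

cas = ContentAddressableStore()
a1 = cas.put(b"hello world")
a2 = cas.put(b"hello world")      # duplicate write, same address
assert a1 == a2
assert cas.get(a1) == b"hello world"
```
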
Hydra Diagram

[Figure 1: HYDRAstor architecture. An Access Node runs the HydraFS File Server and Commit Server on top of the HYDRAstor Block Access Library; below it, Hydra is a collection of Storage Nodes that together form a single-system content-addressable store. HydraFS acts as a front end for the Hydra distributed, content-addressable block store.]

CAS

[Diagram: a client holding six 4 KB blocks of data, sitting above the CAS.]


CAS - continued

[Diagram: the client's 4 KB blocks now pass through a chunker on their way to the CAS.]

The chunker uses a heuristic based on the content itself, plus hard-set minimum and maximum size limits, to split the data into variable-sized chunks.
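
A rough sketch of what content-defined chunking with hard min/max limits might look like. This is a generic rolling-hash scheme, not the exact algorithm HydraFS uses; the parameters are invented:

```python
def chunk(data: bytes, min_size=2048, max_size=16384, mask=0x1FFF):
    """Split data into variable-sized chunks at content-defined boundaries.

    A simple rolling sum stands in for a real rolling hash (e.g. Rabin
    fingerprints). A boundary is declared when the hash matches the mask,
    but never before min_size bytes or after max_size bytes.
    """
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = ((rolling << 1) + b) & 0xFFFFFFFF
        length = i - start + 1
        content_cut = (rolling & mask) == 0 and length >= min_size
        if content_cut or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks

import os
data = os.urandom(100_000)
pieces = chunk(data)
assert b"".join(pieces) == data
assert all(len(p) <= 16384 for p in pieces)
```

Because boundaries depend on the content rather than on absolute offsets, inserting a few bytes near the front of a file only changes the chunks around the edit instead of shifting every subsequent chunk.
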
CAS - continued

[Diagram: the chunker has produced the first chunk, which is written to the CAS as object cas1 (10 KB); the client still holds the remaining data.]

Objects in the CAS have IDs that are pointed to by metadata.


cas1 is 10 KB in size.

CAS - continued

[Diagram: a second chunk has been written to the CAS as object cas2 (9 KB); the client holds the remaining 1 KB and 4 KB of data. The CAS now contains cas1 (10 KB) and cas2 (9 KB).]

cas2 is 9 KB in size.

A little more on CAS addresses

✤ Same data doesn’t mean the same address

✤ Impossible to calculate prior to write

✤ Foreground processing writes shallow trees

✤ Root cannot be updated until all child nodes are set


Differing retention levels can produce different CAS addresses for the same data.


Collisions can be detected but are very unlikely.
Writes are done asynchronously; the file system blocks only on the root node commit.
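
Because a block's address is only known after it has been written, a tree must be written bottom-up: the parent can only be filled in once the addresses of all of its children have come back. A minimal sketch of that ordering constraint (all names here are hypothetical, not HydraFS code):

```python
import hashlib, json

class ToyCAS:
    """Toy content-addressable store: address = hash of the block."""
    def __init__(self):
        self._blocks = {}
    def put(self, data: bytes) -> str:
        addr = hashlib.sha256(data).hexdigest()
        self._blocks[addr] = data
        return addr
    def get(self, addr: str) -> bytes:
        return self._blocks[addr]

def write_tree(cas, chunks):
    """Children must be written before the parent, since the parent stores
    their content addresses; only the final root write needs to be waited on."""
    child_addrs = [cas.put(c) for c in chunks]       # children first (can be async)
    root = json.dumps(child_addrs).encode()          # root references the children
    return cas.put(root)                             # root written last

def read_tree(cas, root_addr):
    return b"".join(cas.get(a) for a in json.loads(cas.get(root_addr)))

cas = ToyCAS()
root = write_tree(cas, [b"chunk-a", b"chunk-b", b"chunk-c"])
assert read_tree(cas, root) == b"chunk-achunk-bchunk-c"
```
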
Issues for a CAS FS

✤ Updates are more expensive

✤ Metadata cache misses cause significant performance issues

✤ The combination of high latency and high throughput means lots of buffering


Updates must touch all metadata that points to the affected data.


Buffering allows for optimal write ordering; the read cache is important as well.
Design Decisions

✤ Decouple data and metadata processing

✤ Fixed size caches with admission control

✤ Second-order cache for metadata


From the previous 3 issues come 3 design decisions:


1) Done via a log, which allows batching of metadata updates.
2) Prevents swapping and other resource over-allocation.
3) Removes operations from reads via cache hits and improves the metadata cache hit rate.
Issues - continued

✤ Immutable Blocks

✤ FS can only reference blocks already written

✤ Forms DAGs

✤ Height of DAGs needs to be minimized


If a block is updated, every block above it, all the way up to the root, must also be rewritten, which makes updates quite expensive.
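
With immutable blocks, an in-place update is really a copy: the changed chunk gets a new address, so its parent must be rewritten, and so on up to the root. A sketch of that cost on a toy two-level tree (all names hypothetical, not the HydraFS implementation):

```python
import hashlib, json

store = {}   # toy CAS: content address -> block

def put(data: bytes) -> str:
    addr = hashlib.sha256(data).hexdigest()
    store[addr] = data
    return addr

def update_chunk(root_addr: str, index: int, new_chunk: bytes) -> str:
    """Replace one chunk in a two-level tree.

    No existing block is modified; a new leaf and a new root are written and
    the caller gets back a new root address. In a deeper DAG every block on
    the path from the leaf to the root would be rewritten, which is why
    updates are expensive and why DAG height should be minimized.
    """
    child_addrs = json.loads(store[root_addr])
    child_addrs[index] = put(new_chunk)               # new leaf
    return put(json.dumps(child_addrs).encode())      # new root

old_root = put(json.dumps([put(b"aaa"), put(b"bbb")]).encode())
new_root = update_chunk(old_root, 1, b"BBB")
assert new_root != old_root and old_root in store     # the old version is untouched
```
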
Issues - continued

✤ High latency

✤ Instead of the milliseconds to tens of milliseconds typical of disk, Hydra has latencies of hundreds of milliseconds to a second

✤ Stream hints

✤ Delay writes to batch streams together

✤ High degree of parallelism needed to mask high latencies


For an I/O operation, Hydra must:


scan the entire block to compute its content address, compress/decompress it, determine the block's location, fragment/reassemble it using ECCs, and route it to/from the storage nodes.
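
Hundreds of milliseconds per operation is only survivable if many operations are in flight at once: throughput is roughly (operations in flight) / (latency). A sketch of masking latency with a pool of concurrent writers (generic Python concurrency, not HydraFS code; the delay and pool size are made up):

```python
import concurrent.futures, time

def slow_put(block: bytes) -> str:
    time.sleep(0.3)                        # pretend each write takes ~300 ms
    return f"addr-{hash(block) & 0xffff:x}"

blocks = [bytes([i]) * 1024 for i in range(64)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    addrs = list(pool.map(slow_put, blocks))
elapsed = time.time() - start

# 64 writes * 0.3 s would take ~19 s serially; with 32 in flight it takes ~0.6 s.
print(f"{len(addrs)} writes in {elapsed:.1f}s")
```
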
Issues - continued

✤ Variable sized blocks

✤ Avoids the “shifting window” problem

✤ Use a balanced tree structure


This is the “chunking” referred to in the paper.


There is a minimum and a maximum size for chunks.
The balanced tree helps keep the DAGs shallow.
FS design

✤ High Throughput

✤ Minimize the number of dependent I/O operations

✤ Availability guarantees no worse than standard Unix FS

✤ Efficiently support both local and remote access


Close-to-open consistency (an fsync acknowledgment means the data is persisted)


Remote access could be NFS or CIFS
File System Layout

[Figure 2: HydraFS persistent layout. A superblock holds the imap handle; the imap is a segmented array of content addresses indexed by a B-tree, used to translate inode numbers to inodes and to allocate and free inode numbers. Directory inodes and regular file inodes each index their blocks through a per-inode B-tree: directory blocks hold (filename, inode number, type) entries, and regular file contents are stored as variable-sized data blocks produced by the chunking algorithm.]

The inode map is similar to the one in a log-structured file system.
Files dedup across file systems: chunking is content-based, which increases the likelihood that a block written to the block store by one file system matches a block written by another.

HydraFS Software Stack

✤ Uses FUSE

✤ Split into file server and commit server

✤ Simplifies metadata locking

✤ Amortizes the cost of metadata updates via batching

✤ Each server has its own caching strategy


The file server manages the interface to the client, records file modifications in a transaction log stored in Hydra, and keeps an in-memory cache of recent file modifications.
The commit server reads the transaction log, updates the FS metadata, and generates new FS versions.
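
The split can be pictured like this: the file server appends modification records to a log, while the commit server periodically replays the log, folds the changes into the metadata, and publishes a new file system version. A highly simplified sketch; the record format and all names are invented, not taken from HydraFS:

```python
import json

class FileServer:
    """Accepts client writes and records them in an append-only log."""
    def __init__(self, log):
        self.log = log

    def write(self, inode, offset, chunk_addr):
        # No metadata trees are touched here; we only log the intent.
        self.log.append(json.dumps(
            {"inode": inode, "offset": offset, "addr": chunk_addr}))

class CommitServer:
    """Periodically replays the log and emits a new FS version (snapshot)."""
    def __init__(self, log):
        self.log = log
        self.versions = [{}]             # list of {inode: {offset: addr}} maps

    def commit(self):
        new = {ino: dict(m) for ino, m in self.versions[-1].items()}
        for rec in self.log:             # batch all pending updates at once
            r = json.loads(rec)
            new.setdefault(r["inode"], {})[r["offset"]] = r["addr"]
        self.log.clear()
        self.versions.append(new)        # publish the new, immutable version

log = []
fs, commit = FileServer(log), CommitServer(log)
fs.write(inode=7, offset=0, chunk_addr="cas1")
fs.write(inode=7, offset=10240, chunk_addr="cas2")
commit.commit()
assert commit.versions[-1][7] == {0: "cas1", 10240: "cas2"}
```

Batching many logged updates into one commit is what amortizes the cost of metadata updates across operations.
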
Writing Data

✤ Data stored in inode specific buffer

✤ Chunked, marked dirty and written to Hydra

✤ After write confirmation, the block is freed and entered in the uncommitted block table

✤ Needed until metadata is flushed to storage

✤ Designed for append writes; in-place updates are expensive


Chunks have a maximum size; when it is reached, a chunk is cut.


Writes are cached in memory until Hydra confirms them (this allows reads to be served in the meantime and covers failures in Hydra).
Data is not visible in Hydra until a new FS version is created.
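
A sketch of the uncommitted block table idea: once Hydra confirms a chunk, its buffer can be dropped and only the (offset → address) mapping is kept, so reads of freshly written data can still be served until the commit server folds the mapping into the persistent metadata. All names here are illustrative, not from the HydraFS code:

```python
import hashlib

class ToyCAS:
    """Stand-in for Hydra: immutable blocks keyed by a content hash."""
    def __init__(self):
        self.blocks = {}
    def put(self, data):
        addr = hashlib.sha256(data).hexdigest()
        self.blocks[addr] = data
        return addr
    def get(self, addr):
        return self.blocks[addr]

class OpenFile:
    """Tracks freshly written chunks whose metadata is not yet committed."""
    def __init__(self, cas):
        self.cas = cas
        self.uncommitted = {}             # file offset -> content address

    def write_chunk(self, offset, data):
        addr = self.cas.put(data)         # chunk is durable in the block store...
        self.uncommitted[offset] = addr   # ...but the FS metadata isn't yet
        return addr                       # the in-memory buffer can be freed now

    def read_chunk(self, offset):
        # Reads of freshly written data are served from the table; otherwise
        # fall back to the committed metadata (not shown here).
        addr = self.uncommitted.get(offset)
        return self.cas.get(addr) if addr else None

    def metadata_committed(self, offsets):
        # Once the commit server persists the metadata, the entries can go.
        for off in offsets:
            self.uncommitted.pop(off, None)

f = OpenFile(ToyCAS())
f.write_chunk(0, b"first chunk")
assert f.read_chunk(0) == b"first chunk"
f.metadata_committed([0])
assert f.read_chunk(0) is None            # now resolved via committed metadata
```
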
Metadata Cleaning

✤ Dirty data kept until the commit server applies changes

✤ New versions of file systems are created periodically

✤ Metadata in separate structures, tagged by time

✤ Always clean (in Hydra), can be dropped from cache at any time

✤ Cleaning allows file servers to drop changes in the new FS version


A new FS version allows a file server to clean its dirty metadata proactively.


Admission Control

✤ Events assume worst-case memory usage

✤ If insufficient resources are available, the event blocks

✤ Limits the number of active events

✤ Memory usage is tuned to the amount of physical memory


Not all memory used by an action is freed when it completes; the cache, for example, is retained but can be flushed if the system finds it needs to reclaim memory.
Avoiding swapping is key to keeping latencies low and performance up.
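
Admission control can be sketched as a credit pool sized to physical memory: each event reserves its worst-case memory need up front and blocks if the pool is exhausted, so the process never over-commits and never swaps. A generic sketch (the budget and names are made up, not HydraFS values):

```python
import threading

class AdmissionController:
    """Reserve worst-case memory for each event; block when none is left."""
    def __init__(self, budget_bytes):
        self.available = budget_bytes
        self.cond = threading.Condition()

    def admit(self, worst_case_bytes):
        with self.cond:
            while self.available < worst_case_bytes:
                self.cond.wait()              # event blocks until memory frees up
            self.available -= worst_case_bytes

    def release(self, worst_case_bytes):
        with self.cond:
            self.available += worst_case_bytes
            self.cond.notify_all()

ac = AdmissionController(budget_bytes=512 * 1024 * 1024)   # tuned to physical RAM

def handle_write_event(size):
    ac.admit(size)                  # assume the worst case before starting
    try:
        pass                        # ... process the event ...
    finally:
        ac.release(size)            # actual usage may be lower; credits returned
```

Limiting the number of active events this way also bounds how much dirty state can pile up at once.
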
Read Processing

✤ Aggressive read-ahead

✤ Multiple fetches to get metadata

✤ Weighted caching to favor metadata over data

✤ Fast range map

✤ Metadata read-ahead

✤ Primes FRM, cache


Read-ahead goes into an in-memory LRU cache, default is 20 MB.


HydraFS caches both metadata and data; it uses large leaf nodes and high fan-out parent nodes.
The fast range map is a look-aside buffer that translates a file offset to a content address.
The FRM and B-tree read-ahead add about 36% to performance for a small memory/CPU overhead.
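
The fast range map can be thought of as a look-aside table of (offset range → content address) entries populated by read-ahead, so a read that hits it skips the B-tree walk entirely. A toy sketch using an interval lookup via bisect; the real structure is more involved and the names are invented:

```python
import bisect

class FastRangeMap:
    """Look-aside buffer mapping file offset ranges to content addresses."""
    def __init__(self):
        self.starts = []         # sorted range start offsets
        self.entries = []        # parallel list of (end_offset, address)

    def insert(self, start, end, addr):
        i = bisect.bisect_left(self.starts, start)
        self.starts.insert(i, start)
        self.entries.insert(i, (end, addr))

    def lookup(self, offset):
        """Return the content address covering `offset`, or None on a miss."""
        i = bisect.bisect_right(self.starts, offset) - 1
        if i >= 0:
            end, addr = self.entries[i]
            if offset < end:
                return addr
        return None              # miss: fall back to the inode B-tree

frm = FastRangeMap()
frm.insert(0, 10240, "cas1")         # bytes [0, 10240) live in block cas1
frm.insert(10240, 19456, "cas2")
assert frm.lookup(512) == "cas1"
assert frm.lookup(15000) == "cas2"
assert frm.lookup(30000) is None
```
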
Deletion

✤ File deletion removes the entry from the current FS

✤ Data remains until there are no pointers to it


The data will remain in storage until all FS versions that reference it are garbage collected.
A block may be pointed to by other files as well.
The FS only marks roots for deletion; Hydra handles reference counting and storage reclamation.
Performance

[Figure 5: Comparison of raw device and file system throughput; normalized sequential throughput for Read (iSCSI), Read (Hydra), Write (iSCSI), and Write (Hydra), each comparing the raw block device against the file system.]

Sequential throughput.
iSCSI is 6 disks per node in software RAID 5 (likely the write penalty iSCSI takes).
Block size is 64 KB for both iSCSI and Hydra.
HydraFS achieves 82% of the raw device throughput on reads and 88% on writes.

Metadata Intensive

✤ Postmark

✤ Generates files, then issues transactions.

✤ File size: 512 B - 16 KB

            Create          Delete         Overall
            Alone    Tx     Alone    Tx
ext3        1,851    68     1,787    68     136
HydraFS        61    28       676    28      57

Table 1: Postmark comparing HydraFS with ext3 on similar hardware.

This is a worst case for HydraFS.


Had to create FSes on the fly due to the limit on outstanding metadata updates.
Fewer operations to amortize costs over.
(From the paper: the benchmark is limited by the number of inodes HydraFS creates without going through the metadata update in the commit server, so that the system does not accumulate a large number of uncommitted blocks, which would increase commit server turnaround times and unpredictably increase the latency of user operations. In contrast, ext3 has no such limitation and writes all metadata updates to its journal.)

Write Performance vs Dedup

[Figure 6: Hydra and HydraFS write throughput (MB/s) with varying duplicate ratio (%). Throughput rises with the duplicate ratio, as expected, since the number of I/Os to disk is correspondingly reduced.]

HydraFS write throughput stays within 12% of raw Hydra throughput across the whole range, so HydraFS meets the goal of maintaining high throughput.

Write Behind

[Figure 7: Write completion order; completed write offset (GB) vs. time (s), showing many blocks completing out of order and in parallel.]

This helps with buffering: there is no I/O in the write "critical path".
There is a lot of jitter around 6 seconds; the biggest gap is 1.5 GB.

Hydra Latency

[Figure 8: Write event lifetimes; CDF of Pr(t <= x) vs. time (ms).]

The 90th percentile is at about 10 ms.
Point: even though Hydra is jittery and high latency, HydraFS still performs well (it smoothes things out).

Future Work

✤ Allow multiple nodes to manage same FS

✤ Makes failover transparent and automatic

✤ Exposing snapshots to users

✤ Incorporating SSD storage to lower latencies and make HydraFS usable as primary storage



Thank you

✤ Questions?

✤ Comments?

✤ email: buck@soe.ucsc.edu

✤ Paper: http://www.usenix.org/events/fast10/tech/full_papers/ungureanu.pdf



Sample Operations

✤ Block Write

✤ Block Read

✤ Searchable Block Write

✤ Searchable Block Read


Writes trade blocks for CAS addresses; reads invert that.


Labels can group data for retention or deletion; garbage collection reaps all the data that isn't part of a tree anchored by a retention block.
