
Cumulus: Filesystem Backup to the Cloud
Paper: Michael Vrable, Stefan Savage & Geoffrey M. Voelker
Slides: Joe Buck, CMPS 229 - Spring 2010

1
Thursday, May 27, 2010
Introduction

• “The Cloud” is a new-ish platform
• Really a spectrum
• Rethink applications
• Backup

Backup

• Store data off-site
• File or sub-file granularity
• Point-in-time checkpoints
• Efficient backup, restore
• Full vs Incremental
Cloud Backup

• Simple interface
• All logic is in the client
• Minimize resource usage & cost
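
A minimal sketch of what such a "simple interface" could look like, assuming the server offers only whole-object put, get, list and delete (the slide does not name the operations, so this set is an assumption). `SimpleStore` is a hypothetical in-memory stand-in, not part of Cumulus:

```python
from typing import Dict, List


class SimpleStore:
    """Hypothetical in-memory stand-in for a cloud object store.

    Only whole-object operations are offered; all backup logic is
    assumed to live in the client that calls them.
    """

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        """Store (or overwrite) one object under the given name."""
        self._objects[name] = data

    def get(self, name: str) -> bytes:
        """Return the full contents of one object."""
        return self._objects[name]

    def list(self) -> List[str]:
        """Return the names of all stored objects, sorted."""
        return sorted(self._objects)

    def delete(self, name: str) -> None:
        """Remove one object."""
        del self._objects[name]
```

Because the server never interprets the data, a client built on this interface could in principle target any provider that stores named blobs.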

Example Backup
[Figure: Cumulus Backup Format. Monday's snapshot root points to metadata entries (photos/A, photos/B, mbox, paper), which point to data blocks (photoA, photoB, mbox1, paper1).]

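Example Backup - cont.
[Figure: Cumulus Backup Format. Snapshot roots for Monday and Tuesday point to metadata entries (photos/A, photos/B, mbox, paper, mbox', paper'), which point to data blocks (photoA, photoB, mbox1, paper1, mbox2, paper2). Unchanged entries are shared between the two snapshots.]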

• Stores filesystem snapshots at multiple points in time
• Data blocks shared within, between snapshots
• Minimizes storage, upload bandwidth needed
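
The sharing in the Monday/Tuesday example can be sketched as follows. This is an illustration only: it assumes one block per file and identifies blocks by a SHA-256 digest of their contents, neither of which is stated on the slide, and the helper names are hypothetical.

```python
import hashlib
from typing import Dict

# Block id -> contents. One store is shared by every snapshot.
block_store: Dict[str, bytes] = {}


def store_block(data: bytes) -> str:
    """Store a block once and return its id (assumed: SHA-256 hex digest)."""
    block_id = hashlib.sha256(data).hexdigest()
    block_store.setdefault(block_id, data)  # no-op if already stored
    return block_id


def take_snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Return snapshot metadata mapping each path to its data block id."""
    return {path: store_block(data) for path, data in files.items()}


monday = take_snapshot({"photos/A": b"photoA", "photos/B": b"photoB",
                        "mbox": b"mbox1", "paper": b"paper1"})
tuesday = take_snapshot({"photos/A": b"photoA", "photos/B": b"photoB",
                         "mbox": b"mbox2", "paper": b"paper2"})

# Two snapshots of four files each, but only six blocks are stored:
# the two unchanged photos are shared.
print(len(block_store))
```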

Example Backup - cont.
[Figure: Aggregation: Minimizing Per-Block Costs. The Monday and Tuesday snapshot roots, with metadata entries (photos/A, photos/B, mbox, paper, mbox', paper') and data blocks (photoA, photoB, mbox1, paper1, mbox2, paper2) grouped into segments.]

• May have per-file in addition to per-byte costs
• Protocol overhead: Slower backups from more transactions
• Per-file overhead at storage server
• May be exposed as monetary cost by provider
• Cumulus reduces these costs by aggregating blocks into segments before storage
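
A minimal sketch of aggregation, assuming a simple greedy policy that fills each segment up to a target size. The policy and the name `pack_into_segments` are illustrative assumptions, not the packing logic Cumulus actually uses.

```python
from typing import List


def pack_into_segments(blocks: List[bytes], target_size: int) -> List[List[bytes]]:
    """Greedily group blocks into segments of roughly target_size bytes."""
    segments: List[List[bytes]] = []
    current: List[bytes] = []
    current_size = 0
    for block in blocks:
        # Close the current segment if this block would push it past the target.
        if current and current_size + len(block) > target_size:
            segments.append(current)
            current, current_size = [], 0
        current.append(block)
        current_size += len(block)
    if current:
        segments.append(current)
    return segments


blocks = [b"x" * 300] * 10  # ten 300-byte blocks
segments = pack_into_segments(blocks, target_size=1000)
print(len(blocks), "blocks ->", len(segments), "uploads")
```

Each segment is uploaded as one object, so per-file and per-transaction costs scale with the number of segments rather than the number of blocks.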
Repacking

• Snapshots can link to old segments
• Cleaning allows space to be reclaimed
• Client-driven, threshold based
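
A sketch of threshold-based cleaning, assuming the client tracks how many bytes of each segment are still referenced by retained snapshots. The utilization measure (live bytes / total bytes) and the function name are assumptions made for illustration.

```python
from typing import Dict, List


def segments_to_clean(live_bytes: Dict[str, int],
                      total_bytes: Dict[str, int],
                      threshold: float) -> List[str]:
    """Return segments whose utilization (live / total) is below threshold."""
    return sorted(
        name for name, total in total_bytes.items()
        if total > 0 and live_bytes.get(name, 0) / total < threshold
    )


total = {"seg1": 1000, "seg2": 1000, "seg3": 1000}
live = {"seg1": 950, "seg2": 400, "seg3": 0}
print(segments_to_clean(live, total, threshold=0.6))  # ['seg2', 'seg3']
```

In this sketch, any data still live in a selected segment would have to be rewritten into a new segment before the old one is deleted, so a higher threshold reclaims more space at the price of more uploads.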

Implementation Notes

• ~4,000 lines of C++ and Python
• Segments are the basis
• Data can be packaged after segmenting
• Compression, encryption, indexing, etc.
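
One way to picture "packaging after segmenting" is to build the segment first and then pass the whole thing through a filter. The sketch below assumes a tar container and a gzip filter, both chosen for illustration; the slide names only compression, encryption and indexing in general. An encryption step could occupy the same position in the pipeline.

```python
import gzip
import io
import tarfile
from typing import Dict


def package_segment(objects: Dict[str, bytes]) -> bytes:
    """Pack named objects into a tar archive, then gzip the whole archive."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, data in objects.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return gzip.compress(raw.getvalue())


def unpack_segment(blob: bytes) -> Dict[str, bytes]:
    """Reverse package_segment: gunzip, then read every tar member."""
    raw = io.BytesIO(gzip.decompress(blob))
    with tarfile.open(fileobj=raw, mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


objects = {"000001": b"a" * 1000, "000002": b"hello"}
blob = package_segment(objects)
print(len(blob), "bytes after packaging")
```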

Analysis

• What overhead is introduced by the design choices?
• Analyze from a cost perspective
• Quantify the effects of aggregation and tuning

Evaluation Traces
Trace Data
                      Fileserver    User
Duration (days)       157           223
Entries               26673083      122007
Files                 24344167      116426
File Sizes
  Median              0.996 KB      4.4 KB
  Average             153 KB        21.4 KB
  Maximum             54.1 GB       169 MB
  Total               3.47 TB       2.37 GB
Update Rates
  New data/day        9.50 GB       10.3 MB
  Changed data/day    805 MB        29.9 MB
  Total data/day      10.3 GB       40.2 MB

Evaluation

• Model an ideal backup solution
• All unique data stored on the server
• All new data transferred over the wire
• Compare Cumulus to baseline
• Ignore compression and metadata
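
The comparison against the baseline can be read as a relative overhead: how much more is stored or transferred than the ideal model would need. A small sketch follows; the function name and the numbers are hypothetical and are not results from the evaluation.

```python
def overhead_pct(actual: float, optimal: float) -> float:
    """Percentage by which `actual` exceeds the ideal baseline `optimal`."""
    if optimal <= 0:
        raise ValueError("optimal must be positive")
    return 100.0 * (actual - optimal) / optimal


# Hypothetical example: 3.0 GB stored where the ideal model needs 2.5 GB.
print(overhead_pct(actual=3.0, optimal=2.5))  # 20.0
```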
Is Cleaning Necessary?
[Figure: Benefit of Cleaning. Storage utilization (0.5 to 1) over time (0 to 200 days), comparing "With Cleaning" and "No Cleaning".]
How Much Data is Transferred?
[Figure: Data Transfer Measured. Overhead vs. optimal (%) and raw size (MB/day) against cleaning threshold (0 to 1), for 128 kB, 512 kB, 1 MB, 4 MB and 16 MB segments.]
What is the Storage Overhead?
[Figure: Storage Overhead. Overhead vs. optimal (%) and raw size (GB) against cleaning threshold (0 to 1), for 128 kB, 512 kB, 1 MB, 4 MB and 16 MB segments.]

What Settings Minimize Total Cost?
[Figure: Cost. Cost increase vs. optimal (%) against cleaning threshold (0 to 1), for 128 kB, 512 kB, 1 MB, 4 MB and 16 MB segments.]

Simulation Results

• Storage cost was > 75% of total cost
• Cumulus was within 5-10% of ideal
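
To see how a total cost splits into storage, bandwidth and per-request parts, here is a small cost-model sketch. The default prices and the workload are assumed placeholders for illustration, not figures from the slides, and `monthly_cost` is a hypothetical helper.

```python
from typing import Dict


def monthly_cost(stored_gb: float, uploaded_gb: float, uploads: int,
                 storage_price: float = 0.15,         # $/GB-month (assumed)
                 upload_price: float = 0.10,          # $/GB uploaded (assumed)
                 request_price: float = 0.01 / 1000   # $/upload request (assumed)
                 ) -> Dict[str, float]:
    """Break one month's bill into storage, bandwidth and request parts."""
    parts = {
        "storage": stored_gb * storage_price,
        "bandwidth": uploaded_gb * upload_price,
        "requests": uploads * request_price,
    }
    parts["total"] = sum(parts.values())
    return parts


# Hypothetical workload: 2 GB stored, 0.5 GB uploaded as 500 segments.
bill = monthly_cost(stored_gb=2.0, uploaded_gb=0.5, uploads=500)
print({name: round(value, 3) for name, value in bill.items()})
```

With these assumed inputs the storage term dominates the bill. The split depends entirely on the prices and workload chosen.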

Prototype Results

• The code worked
• Ongoing costs ~ $0.25/month (2 GB)
• “Better” than Jungle Disk & Brackup

Summary
• Cumulus is a cost-effective tool for network backup
• Tunable metrics evaluated
• Low-overhead backup feasible on top of a simple interface
• Limited Deduplication
My Thoughts

• Client-side cost?
• Segmentation...

More Material

• Code available
• http://sysnet.ucsd.edu/projects/cumulus/
• FAST ’09 Presentation
• http://www.usenix.org/media/events/fast09/tech/videos/vrable.mov
