Cumulus:

Filesystem Backup to the Cloud

Paper: Michael Vrable, Stefan Savage & Geoffrey M.Voelker Slides: Joe Buck, CMPS 229 - Spring 2010

Thursday, May 27, 2010


Introduction

“The Cloud” is a new-ish platform

Really a spectrum

Rethink applications

Backup

The cloud is new and shiny; we need to rethink solutions in light of its characteristics. The spectrum runs from thin cloud (S3) to thick cloud (Salesforce.com, Google Docs). Interesting systems work exists in the asymmetries.

Backup

Store data off-site

File or sub-file granularity

Point-in-time checkpoints

Efficient backup, restore

Full vs Incremental


Costs include time, client storage space, and time to restore. A restore can target a single file or the whole file system.

Cloud Backup

Simple interface

All logic is in the client

Minimize resource usage & cost


Take the thin cloud approach: Cumulus uses only a get/put interface. By going client-heavy, any storage vendor can be used, at the cost of more network traffic.
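The thin-cloud interface the notes describe can be sketched as follows. This is an illustration, not the paper's code; the class name and the local-directory backend are my own stand-ins for whatever remote store (e.g. S3) is configured:

```python
# Minimal sketch of the "thin cloud" storage interface: the server only
# needs get/put/list/delete on opaque, named blobs. A local directory
# stands in for the remote store here.
import os

class ThinStore:
    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, name, data):
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(data)

    def get(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

    def list(self):
        return sorted(os.listdir(self.root))

    def delete(self, name):
        os.remove(os.path.join(self.root, name))
```

Because all backup logic lives on the client, any vendor that can implement these four operations is usable as a backend.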

Example Backup

[Diagram: Monday's snapshot Sn points to metadata Me (photos/A, photos/B, mbox, paper), which points to data Da (photoA, photoB, mbox1, paper1).]


On the first day of the week, back up all data.

Example Backup - cont.

[Diagram: Tuesday's snapshot shares Monday's unchanged photo data (photoA, photoB); new objects mbox' and paper' reference new data mbox2 and paper2.]


On the second day, mbox and the paper change; the photos do not. Files are determined to have changed by their mtime and ctime entries. If the local metadata cache is lost, all files are treated as new.
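The change-detection step can be sketched like this (my own simplified illustration of the mtime/ctime approach the notes describe; function and cache names are mine):

```python
# Sketch of change detection via a local metadata cache: a file is
# treated as changed only if its size/mtime/ctime differ from the
# cached entry. If the cache is lost, every file looks new.
import os

def changed_files(paths, cache):
    """Return paths whose stat metadata differs from `cache`
    (a dict: path -> (size, mtime, ctime)); update the cache."""
    changed = []
    for p in paths:
        st = os.stat(p)
        sig = (st.st_size, st.st_mtime, st.st_ctime)
        if cache.get(p) != sig:
            changed.append(p)
            cache[p] = sig
    return changed
```

An empty cache makes every file "changed", which matches the failure mode in the notes: losing the local metadata cache forces a full re-read.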

Example Backup - cont.

[Diagram: the Monday and Tuesday objects packed into fixed-size segments; unchanged segments are shared between the two snapshots.]


Data is stored in segments of a fixed size. The segment size must balance network and storage costs. Transfer protocols tend to favor larger units of data (they work well over high-latency links and don't require parallelism for throughput).
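The packing step can be sketched as follows. This is a simplified illustration, not the paper's code; the 4 MB target is one of the sizes the evaluation sweeps (128 kB to 16 MB), and the names are mine:

```python
# Sketch of segment packing: object chunks are appended to an in-memory
# buffer and flushed as one storage blob once a target size is reached.
SEGMENT_SIZE = 4 * 1024 * 1024  # illustrative 4 MB target

def pack_segments(chunks, segment_size=SEGMENT_SIZE):
    """chunks: iterable of (name, bytes).
    Yield (segment_bytes, [(name, offset, length), ...]) pairs."""
    buf, index = bytearray(), []
    for name, data in chunks:
        index.append((name, len(buf), len(data)))
        buf.extend(data)
        if len(buf) >= segment_size:
            yield bytes(buf), index
            buf, index = bytearray(), []
    if buf:  # flush the final, possibly partial, segment
        yield bytes(buf), index
```

The per-segment index is what lets a snapshot reference an object by (segment, start, length) later.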

Repacking

Snapshots can link to old segments

Cleaning allows space to be reclaimed

Client-driven, threshold based


Links look like (segment, start, length). Cleaning moves valid data into new segments so space can be reclaimed. Cleaning thresholds must balance space savings against data transfer. Cleaning involves reads and writes, much like RAID-5 updates.
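The threshold test at the heart of client-driven cleaning can be sketched as below (an illustration with my own names; the 0.6 default is arbitrary, not a value from the paper):

```python
# Sketch of threshold-based cleaning: a segment whose live fraction of
# bytes falls below the threshold gets its still-referenced data
# rewritten into fresh segments, after which the old blob is deleted.
def segments_to_clean(segments, threshold=0.6):
    """segments: dict name -> (live_bytes, total_bytes).
    Return names whose utilization is below `threshold`."""
    return [name for name, (live, total) in segments.items()
            if total and live / total < threshold]
```

A higher threshold reclaims more space but forces more read/rewrite traffic, which is exactly the trade-off the notes call out.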

Implementation Notes

~4,000 lines of C++ and Python

Segments are the basis

Data can be packaged after segmenting

Compression, encryption, indexing, etc.


Segments are the units of operation. They can hold parts of files or multiple files. Compression, etc., is applied to segments at the client.
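Because segments are opaque blobs to the server, any filter can be applied client-side before upload. A minimal sketch, using zlib as a stand-in for whichever compression (or encryption) filter is configured:

```python
# Sketch of client-side segment post-processing: compress each packed
# segment before upload, decompress after download. zlib is an
# illustrative choice, not necessarily what the prototype uses.
import zlib

def prepare_segment(raw):
    return zlib.compress(raw, 6)

def unpack_segment(stored):
    return zlib.decompress(stored)
```

The server never needs to understand the format, which is what keeps the storage interface thin.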

Analysis

What overhead is introduced by the design choices?

Analyze from a cost perspective

Quantify the effects of aggregation and tuning


Costs are based on current web-service pricing.

Trace Data

 

                     Fileserver    User
Duration (days)      157           223
Entries              26,673,083    122,007
Files                24,344,167    116,426
File sizes:
  Median             0.996 KB      4.4 KB
  Average            153 KB        21.4 KB
  Maximum            54.1 GB       169 MB
  Total              3.47 TB       2.37 GB
Update rates:
  New data/day       9.50 GB       10.3 MB
  Changed data/day   805 MB        29.9 MB
  Total data/day     10.3 GB       40.2 MB


The analysis is based on traces; most of the numbers come from the User trace.

Evaluation

Model an ideal backup solution

All unique data stored on the server

All new data transferred over the wire

Compare Cumulus to baseline

Ignore compression and metadata


Storage Utilization

Benefit of Cleaning

[Plot: storage utilization (y-axis, 0.5-1.0) vs. time in days (x-axis, 0-200), comparing With Cleaning against No Cleaning.]

Data Transfer Measured

[Plot: measured data transfer for segment sizes of 128 kB, 512 kB, 1 MB, 4 MB, and 16 MB.]

The cleaning threshold is the level of utilization below which a segment is cleaned.

Storage Overhead

[Plot: storage overhead for segment sizes from 128 kB to 16 MB.]

Cost

[Plot: cost for segment sizes from 128 kB to 16 MB.]

There are per-segment charges, so the sweet spot is in the middle.
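The sweet spot follows from the fee structure: per-request charges penalize many small segments, while cleaning overhead penalizes very large ones. A rough sketch of such a cost model, with illustrative S3-era prices that are my own assumptions, not figures from the paper:

```python
# Toy monthly cost model with a per-request (per-segment) component.
# All rates are illustrative placeholders, not the paper's numbers.
def monthly_cost(stored_gb, uploaded_gb, num_puts,
                 storage_rate=0.15,    # $/GB-month stored
                 transfer_rate=0.10,   # $/GB uploaded
                 per_1000_puts=0.01):  # $ per 1000 put requests
    return (stored_gb * storage_rate
            + uploaded_gb * transfer_rate
            + num_puts / 1000 * per_1000_puts)
```

Halving the segment size doubles `num_puts` for the same `uploaded_gb`, which is why the request term alone pushes toward larger segments.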

Simulation Results

Storage cost was >75% of total cost

Cumulus was within 5-10% of ideal


The paper tried to call out integrated solutions, but I think that's an apples-to-oranges comparison, as all their limitations painted them into a corner.

Prototype Results

The code worked

Ongoing costs ~ $0.25/month (2 GB)

“Better” than Jungle Disk & Brackup


Snapshots were restorable. Jungle Disk & Brackup weren't tuned for cost.

Summary

Cumulus is a cost-effective tool for network backup

Tunable metrics evaluated

Low-overhead backup is feasible on top of a simple interface

Limited Deduplication


My Thoughts

Client-side cost?

Segmentation


They never seem to quantify the client-side cost of storing the metadata and block-hash maps. Segmentation seems like just chunking a tar file. Could segment size simply be auto-tuned per network, or per vendor?

More Material

Code available

FAST ’09 Presentation
