DS 10 DataReplication PDF

Distributed Systems
Data Replication
Prof. Dr.-Ing. Torben Weis

Universität Duisburg-Essen
Problem Statement
Data objects copied (replicated) to other machines
Read/write operations on copies
How to keep them synchronized?
Data Replication
Goals:
Increased availability
Shorter response times
Higher throughput
Parallel processing of requests
Less communication overhead
[Figure: data objects A, B, C and D replicated across Nodes 1 to 7]

Logical & Physical Data Objects
Data objects are stored on nodes (computers)

Each node has a unique identifier
Any pair of nodes can communicate
Logical data object:
Has a unique value
Supports read and write operations
Physical copy of a data object:
Logical object stored on a certain node
Supports atomic read and write operations
Multiple copies can be stored on arbitrary nodes
Failure Model
We use the same failure model as for transactions
Nodes can fail
Crash failures (but no arbitrary behavior)
Network communication can fail

Message loss (but no message corruption)
Network partitioning is possible
Assumption: nodes and network eventually recover from failures

Problems with Network Partitioning
Questions:
Which operations are still possible during a partitioning?
What has to be done after recovery from network partitioning?

Correctness of Replication Management
Concurrency chapter: serializable schedule of transactions
Concurrent execution of transactions is correct if equivalent to a serial execution (and assuming all transactions themselves are correct)
Concurrent execution of transactions with replication is one-copy serializable
... if equivalent to a serial execution without replication (i.e. as if there was only one copy)
A one-copy serializable execution is correct

Consistency Approaches
Classification: syntactic or semantic, each either pessimistic or optimistic
Syntactic consistency
Process the syntax of data but do not know the
meaning of it
Example: HOLA (string)
Example: 5 (number)

Consistency Approaches (2)
Semantic consistency
Have special application knowledge about data
Example: measured temperature
36.6 °C and 36.7 °C are considered to be equal
Example: unordered elements in a set
{A, B, C, D} equals {D, A, B, C}
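
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the function names and the tolerance of 0.15 °C are assumptions chosen for the example, not part of the lecture.

```python
def syntactically_equal(a, b):
    # Syntactic view: compare the raw representation, without knowing its meaning.
    return repr(a) == repr(b)

def temperatures_equal(a_celsius, b_celsius, tolerance=0.15):
    # Semantic view: measured temperatures within a tolerance count as equal.
    # The tolerance value is an assumption made for this sketch.
    return abs(a_celsius - b_celsius) <= tolerance

def sets_equal(a, b):
    # Semantic view: element order does not matter for a set.
    return set(a) == set(b)

if __name__ == "__main__":
    print(syntactically_equal(36.6, 36.7))            # False
    print(temperatures_equal(36.6, 36.7))             # True
    print(syntactically_equal(list("ABCD"), list("DABC")))  # False
    print(sets_equal(list("ABCD"), list("DABC")))     # True
```

A syntactic method would treat both pairs as conflicting values, while a semantic method can accept them as consistent.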

Pessimistic methods
Consistent data in every failure situation
Guaranteed one-copy serializability
But with limited availability during failures

Optimistic methods
Increase the availability of data
Accept temporary inconsistencies in case of failures
Accept non-serializable executions
Usually with a specific application scenario in mind,
which tolerates inconsistencies

Syntactic Pessimistic Methods
Two example methods: Read One, Write All and Read All, Write One
[Figure: for each method, a plot of read cost and write cost against the degree of replication]
Between those two extremes, any trade-off between read and write costs is possible
Required Protocols
1. Synchronization protocol
2. Replication management protocol
3. Update protocol for physical copies

Required Protocols: 1. Synchronization Protocol
Synchronizes physical operations on object copies
e.g. distributed locking or distributed 2PL
Guarantees serializability on a physical object
But: No guarantee for one-copy serializability (does
not manage multiple copies of the same object)


Required Protocols: 2. Replication Management Protocol
Maps logical read/write operations to physical ones
e.g. Read One, Write All:
logical read → read one of the physical objects
logical write → write all physical objects
Together with the synchronization protocol (1.), guarantees one-copy serializability

Required Protocols: 3. Update Protocol
Node failures and partitioning cause physical copies
to become stale (outdated)
Update protocol makes outdated copies current
(up-to-date) after recovery from failure
Sometimes combined with replication management

Definitions
Logical object: x
Physical copy of x on node p: xp
Set of all copies of x: copies[x]
Set of all nodes with copies of x: nodes[x]
Number of copies of x: n[x]
Read/write x : r[x] / w[x]
Read/write xp : r[xp] / w[xp]

Protocol Read One, Write All
[Figure: a read accesses one arbitrary copy among x1 ... xn; a write accesses all copies x1 ... xn]
Read: An arbitrary copy is read

Write: All copies are written

Protocol Read One, Write All
Replication management protocol:

Read operation r[x]
Read any copy of x (i.e. any xp)
Choose the nearest/fastest/best reachable copy of x
All copies will always contain the same current value
Write operation w[x]
Write to all physical copies: w[xp] for all xp ∈ copies[x]
In order to do a write, all copies must be locked
Update protocol: not applicable
If one copy is not available (i.e. lock fails), write
operations will not succeed
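
A minimal in-memory sketch of this mapping in Python is shown here. The `Node` class, the `up` flag (standing in for a crashed or unreachable node) and the function names are assumptions made for illustration; locking is left out, so the sketch covers the replication management part only.

```python
class Node:
    """Holds one physical copy xp. All names here are hypothetical."""
    def __init__(self, name):
        self.name = name
        self.up = True      # False models a crashed or unreachable node
        self.value = None

def rowa_read(nodes):
    # Logical read: any single reachable copy is sufficient.
    for node in nodes:
        if node.up:
            return node.value
    raise RuntimeError("no copy of x is reachable")

def rowa_write(nodes, value):
    # Logical write: succeeds only if every copy can be written.
    if not all(node.up for node in nodes):
        return False        # one unavailable copy blocks the write
    for node in nodes:
        node.value = value
    return True

if __name__ == "__main__":
    copies = [Node("A"), Node("B"), Node("C")]
    print(rowa_write(copies, 1))   # True
    copies[0].up = False
    print(rowa_read(copies))       # 1
    print(rowa_write(copies, 2))   # False
```

Because a write either reaches every copy or none, no copy can become stale, which is why no update protocol is needed.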

Properties of Read One, Write All
Availability for reading increases with the number of copies
pk : Probability of correct functioning of a node
pr : Probability of reading availability for an object that is replicated n times
pr = 1 - (1 - pk)^n
Example: pk = 0.9; n = 4 → pr = 0.9999
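
The formula can be derived in one step, assuming that nodes fail independently of each other (an assumption the slide does not state explicitly). A read fails only if all n copies are unavailable at the same time:

```latex
\begin{align*}
P(\text{all } n \text{ copies unavailable}) &= (1 - p_k)^n \\
p_r &= 1 - (1 - p_k)^n \\
p_k = 0.9,\ n = 4: \quad p_r &= 1 - 0.1^4 = 1 - 0.0001 = 0.9999
\end{align*}
```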

Properties of Read One, Write All (2)
Availability for writing decreases with the number of copies
pk : Probability of correct functioning of a node
pw : Probability of writing availability for an object that is replicated n times
pw = pk^n
Examples:
pk = 0.9; n = 4 → pw = 0.6561
pk = 0.9; n = 10 → pw ≈ 0.3487
pk = 0.99; n = 4 → pw ≈ 0.9606
pk = 0.99; n = 10 → pw ≈ 0.9044
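
The example values can be reproduced with a short Python script. The function names are chosen for this sketch, and independent node failures are assumed.

```python
def read_availability(pk, n):
    # At least one of n copies must be reachable.
    return 1 - (1 - pk) ** n

def write_availability(pk, n):
    # All n copies must be reachable.
    return pk ** n

if __name__ == "__main__":
    for pk, n in [(0.9, 4), (0.9, 10), (0.99, 4), (0.99, 10)]:
        print(f"pk={pk}, n={n}: "
              f"pr={read_availability(pk, n):.4f}, "
              f"pw={write_availability(pk, n):.4f}")
```

Adding copies moves the two values in opposite directions: read availability approaches 1 while write availability shrinks.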

Properties of Read One, Write All (3)
Costs for reading/writing

Read: about as expensive as without replication
Write: costs increase proportionally with number of
copies
Availability in case of network partitioning

Read: still possible, as long as there is one copy in my
network partition
Write: not possible, because copies in the other
network partition are unreachable

Protocol Primary Copy
[Figure: reads and writes both lock the primary copy x*; a read then accesses one copy xp, a write accesses x* and the copies x1 ... xn]
Copies do not have equal rights

One is the primary copy x*, the others are replicas
For all operations, a lock on x* is required
For reading, any copy can be used
For writing, x* and all other available copies are
written
Protocol Primary Copy (2)
Read operation r[x] :

1. shared(x*)
Get shared read lock
2. read xp
xp can be nearest/fastest/best reachable copy
3. unlock(x*)
Write operation w[x] :

1. lock(x*) and write(x*)
Get exclusive write lock
2. Write all available copies xp ∈ copies[x] \ {x*}
3. unlock(x*)
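
The two operations can be sketched in Python as follows. This is a simplified single-process illustration: the class names are hypothetical, and one `threading.Lock` stands in for both the shared and the exclusive lock on x*, so concurrent readers are serialized here although the protocol itself would let them share the lock.

```python
import threading

class Copy:
    """One physical copy xp; `available` is False while its node is unreachable."""
    def __init__(self, node):
        self.node = node
        self.value = None
        self.available = True

class PrimaryCopyObject:
    def __init__(self, primary, replicas):
        self.primary = primary           # x*
        self.replicas = replicas         # copies[x] without x*
        self.lock = threading.Lock()     # lock on x* (assumption: one mutex for both modes)

    def read(self, preferred):
        with self.lock:                  # 1. shared(x*), 3. unlock(x*) on exit
            if not preferred.available:
                raise RuntimeError("chosen copy is not available")
            return preferred.value       # 2. read xp

    def write(self, value):
        with self.lock:                  # 1. lock(x*), 3. unlock(x*) on exit
            self.primary.value = value   # 1. write(x*)
            for copy in self.replicas:   # 2. write all available copies
                if copy.available:
                    copy.value = value

if __name__ == "__main__":
    b, c = Copy("B"), Copy("C")
    x = PrimaryCopyObject(Copy("A"), [b, c])
    c.available = False                  # C is cut off and becomes stale
    x.write(7)
    print(x.read(b), c.value)            # 7 None
```

The unavailable copy keeps its old value, which is the situation the update protocol has to repair.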

Protocol Primary Copy: Failures
1. Network Partitioning
Read/write succeeds as long as the primary copy is
available in the network partition
Only primary copy + available copies are written
2. Crash of the primary copy

The other nodes elect a new primary copy
Distributed election algorithm required
Only reliable if it can be guaranteed that there is no
network partitioning
Not easy to achieve

Protocol Primary Copy: Update Protocol
After crash and restart, a node ...

1. ... makes its copies unavailable for any operations
2. ... sets the value of its copy to the current value of
the primary copy (i.e. it updates)
3. ... makes its copies available for operations again
When partitioned networks are rejoined, ...

1. ... all copies are unavailable for any operations
2. ... the copies of the partition(s) without primary copy
are updated
3. ... the copies are available for operations again
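
Both recovery cases follow the same three steps, which can be sketched in Python as follows. The `Copy` class and the function names are assumptions for illustration, and reading the primary copy is modeled as a plain attribute access.

```python
from dataclasses import dataclass

@dataclass
class Copy:
    value: object = None
    available: bool = True

def update_after_restart(copy, primary):
    copy.available = False          # 1. block all operations on the copy
    copy.value = primary.value      # 2. take over the current value of x*
    copy.available = True           # 3. allow operations again

def update_after_rejoin(primary, all_copies, stale_copies):
    for copy in all_copies:         # 1. all copies become unavailable
        copy.available = False
    for copy in stale_copies:       # 2. update copies from partitions without x*
        copy.value = primary.value
    for copy in all_copies:         # 3. all copies become available again
        copy.available = True
```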
Protocol Primary Copy: Example
Node E wants to read:

r[x]: shared(x*); r[xE]; unlock(x*)
Node E wants to write:
w[x]: lock(x*); w[x*]; w[xB]; w[xE]; unlock(x*)
After recovery from the partitioning:
r[x*]; w[xC]; w[xD]

Protocol Primary Copy: Properties
Costs for reading/writing

Similar to Read One, Write All, though with constant
locking overhead
Read: additional but constant costs for locking
(scalable, as long as the primary copy does not
become a bottleneck)
Write: costs increase proportionally with number of
available copies

Availability in case of network partitioning
Read/write operations are possible only in the partition that contains the primary copy
Protocol Primary Copy: Properties (2)
Crash of the primary copy must be distinguishable from partitioning
Otherwise no one can continue working during a primary copy crash
Node restart and network joining must be detectable in order to trigger the update protocol

Voting Approaches
Votes:
Each copy gets a number of votes
The copies vote on each operation
Quorum:
A set of copies with a certain minimum number of
votes forms a quorum
A read or write operation needs a read or write
quorum, respectively

Version Numbers of Copies
Copies of an object can be current or stale

Use versioning to determine the current one
On creation of an object, all copies are assigned the
version number 1
With every write operation, the version numbers of all
written copies are set to a value that is greater than
all previous version numbers
Only those copies with the highest version number
contain the current value of an object
Version numbers have to be examined before reading and writing
Protocol Majority Consensus
[Figure: a read and a write each access a majority of the copies xA ... xE]
Each copy has one vote and a version number

Attempt to lock a majority of copies
A read/write operation is possible when a
majority of copies can be locked
Protocol Majority Consensus: Read
Read operation r[x]:

Starting node:
Ask every p ∈ nodes[x] for a read vote on x
if m > n[x] / 2 positive responses arrived (before timeout)
then r[xpr], where pr is the responding node with the highest version number
else abort
Responding node p:
if xp is not write-locked
then set shared read lock and return a positive response with version number of xp

Protocol Majority Consensus: Write
Write operation w[x]:

Starting node:
Ask every p ∈ nodes[x] for a write vote on x
if m > n[x] / 2 positive responses arrived (before timeout)
then calculate the maximum version number + 1 and write to all nodes which responded
else abort
Responding node p:
if xp is not read-locked or write-locked
then set write lock and return a positive response with version number of xp
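
The read and write operations can be sketched together in Python. This is a single-threaded illustration under stated assumptions: the `Replica` class and all function names are hypothetical, a positive response is modeled as "the node is up", and the locks and timeouts of the protocol are omitted.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 1      # all copies start with version number 1
        self.up = True        # False models a crash or an unreachable partition

def collect_votes(replicas):
    # Assumption: every reachable replica answers positively.
    return [r for r in replicas if r.up]

def majority_read(replicas):
    voters = collect_votes(replicas)
    if len(voters) <= len(replicas) / 2:
        return None           # abort: no majority
    newest = max(voters, key=lambda r: r.version)
    return newest.value       # the highest version number holds the current value

def majority_write(replicas, value):
    voters = collect_votes(replicas)
    if len(voters) <= len(replicas) / 2:
        return False          # abort: no majority
    new_version = max(r.version for r in voters) + 1
    for r in voters:          # only the responding nodes are written
        r.value, r.version = value, new_version
    return True

if __name__ == "__main__":
    copies = [Replica() for _ in range(5)]
    majority_write(copies, "a")
    copies[0].up = copies[1].up = False
    majority_write(copies, "b")          # written to copies 2, 3, 4 only
    copies[0].up = copies[1].up = True
    copies[3].up = copies[4].up = False
    print(majority_read(copies))         # b
```

In the usage example the read quorum contains two stale copies, but it overlaps with the last write quorum in one copy, whose higher version number identifies the current value.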

Protocol Majority Consensus: Properties (1)
Update protocol is not needed

Will not read from stale copies because their version
number is lower than the version of a current copy
Stale copies are eventually updated by later write operations
Any two quorums will always overlap

Two write operations will always access at least one
copy that is part of both quorums (majorities)
A read and a write operation will always access at
least one copy that is part of both quorums,
i.e. read operations read at least one current copy
Protocol Majority Consensus: Properties (2)
Identical costs for reading and writing

Increase proportionally with number of copies
Compared to Read One, Write All: reading is more
expensive, but writing is less expensive
Identical availability for reading and writing
Compared to Read One, Write All: worse availability
for reading, but better availability for writing
At most one partition can continue reading/writing
Availability in case of node failures
At least 2n+1 copies needed to tolerate n crashes
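
The relation between the number of copies and the number of tolerated crashes can be checked with two small Python helpers. The function names are assumptions of this sketch, and one vote per copy is assumed.

```python
def tolerated_crashes(copies):
    # A strict majority must survive, so fewer than half of the copies may crash.
    return (copies - 1) // 2

def copies_needed(crashes):
    # Inverse direction: 2n + 1 copies tolerate n crashes.
    return 2 * crashes + 1

if __name__ == "__main__":
    for copies in (3, 4, 5):
        print(copies, "copies tolerate", tolerated_crashes(copies), "crash(es)")
```

An even number of copies gives no advantage over the next smaller odd number: 4 copies tolerate one crash, just like 3 copies.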
Protocol Majority Consensus: Example
Problems in this example (half-and-half partitioning):

There is no majority for operations on x
Without partitioning, at most 1 node failure is tolerated
Each operation will have to access 3 of the 4 copies
[Figure: partition 1 contains Node 1 (x1: VN = 2) and Node 3 (x3: VN = 2); partition 2 contains Node 4 (x4: VN = 2) and Node 2 (x2: VN = 2)]

Protocol Majority Consensus: Example (2)
To improve the availability of the protocol, we would need to...

... flexibly allocate votes
... adapt the required number of votes to the relative frequency of read and write operations
... take parameters like failure probability, access speed etc. into account
[Figure: partition 1 contains Node 1 (x1: VN = 2) and Node 3 (x3: VN = 2); partition 2 contains Node 4 (x4: VN = 2) and Node 2 (x2: VN = 2)]

Protocol Weighted Voting
A generalization of the majority consensus protocol
Each copy xp has q[xp] votes, where q[xp] ≥ 0
Total number of votes: q[x] = Σ_{p ∈ nodes[x]} q[xp]
For every logical object x, a read threshold qr[x] and a write threshold qw[x] are defined

Protocol Weighted Voting
Read quorum Qr: Σ_{p ∈ Qr} q[xp] ≥ qr[x]
Write quorum Qw: Σ_{p ∈ Qw} q[xp] ≥ qw[x]
A write quorum must overlap with all other write or read quorums
Conditions for setting the read and write thresholds:
qw[x] > q[x] / 2 → two writes overlap
qr[x] + qw[x] > q[x] → read and write overlap
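
Both threshold conditions and the quorum test can be written down directly in Python. The function names and the dictionary that maps node names to votes are assumptions of this sketch.

```python
def thresholds_valid(votes, qr, qw):
    """Check the two overlap conditions for read threshold qr and write threshold qw."""
    total = sum(votes.values())           # q[x]
    writes_overlap = qw > total / 2       # two write quorums share a copy
    read_write_overlap = qr + qw > total  # a read quorum meets every write quorum
    return writes_overlap and read_write_overlap

def is_quorum(members, votes, threshold):
    """A set of nodes forms a quorum if its votes reach the threshold."""
    return sum(votes[p] for p in members) >= threshold

if __name__ == "__main__":
    votes = {"A": 1, "B": 1, "C": 1, "D": 1}
    print(thresholds_valid(votes, qr=1, qw=4))   # True
    print(thresholds_valid(votes, qr=2, qw=2))   # False
    print(is_quorum({"A", "B", "C"}, votes, 3))  # True
```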

Protocol Weighted Voting: Special Cases
1. Read One, Write All

For object x, let
q[xp] = 1 for all xp ∈ copies[x] (1 vote per copy)
qr[x] = 1; qw[x] = n[x] (read threshold = 1, write threshold = total number of nodes)
2. Majority Consensus
For object x, let
q[xp] = 1 for all xp ∈ copies[x]
qr[x] = qw[x] = n[x] / 2 + 1 for even n[x]
qr[x] = qw[x] = (n[x] + 1) / 2 for odd n[x]
Improvements over Majority Consensus
Adapt number of votes assigned to a node

Copies on reliable nodes get a high number of votes
Total availability is increased
Unreliable local copies can be assigned a null vote,
i.e. q[xp] = 0
Failure does not affect the total availability
Example: reliable, high-bandwidth main server, plus
secondary servers with lower reliability or bandwidth

Improvements over Majority Consensus (2)
Adapt thresholds to read and write frequency

What happens more often: read or write operations?
Objects that are read more often than written are
assigned qr[x] < qw[x]
Higher read availability
Reading is less expensive
Example: documents on a web server are usually read
more frequently than written

Improvements over Majority Consensus (3)
Example: 4 copies
1. Majority Consensus (1 vote per copy)
Possible quorums: {x1, x2, x3}, {x1, x2, x4}, {x1, x3, x4}, {x2, x3, x4}
Tolerates only one node failure
2. Weighted Voting (x1 has 2 votes)
Possible quorums: {x1, x2}, {x1, x3}, {x1, x4}, {x2, x3, x4}
Can tolerate two node failures, if x1 is alive
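
The two quorum lists can be reproduced by enumerating all minimal quorums in Python. The sketch assumes that x1 holds 2 votes and the other copies 1 vote each, and that the threshold is 3 in both cases (a majority of 4 and of 5 votes, respectively); the function name is hypothetical.

```python
from itertools import combinations

def minimal_quorums(votes, threshold):
    """Return all vote-sufficient sets of copies that contain no smaller quorum."""
    names = sorted(votes)
    quorums = []
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            if sum(votes[p] for p in group) < threshold:
                continue                    # not enough votes
            if any(set(q) <= set(group) for q in quorums):
                continue                    # contains a smaller quorum
            quorums.append(group)
    return quorums

if __name__ == "__main__":
    majority = {"x1": 1, "x2": 1, "x3": 1, "x4": 1}
    weighted = {"x1": 2, "x2": 1, "x3": 1, "x4": 1}
    print(minimal_quorums(majority, 3))
    print(minimal_quorums(weighted, 3))
```

With the extra vote on x1, any single other copy completes a quorum, which is why two failures among x2, x3 and x4 can be tolerated.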

Pessimistic versus Optimistic
Pessimistic consistency protocols

One-copy serializability defines correctness
Limited availability when partitioning and/or node
failures occur
Some applications can tolerate short-lived inconsistencies in exchange for higher availability
Optimistic consistency protocols
Provide weak consistency
Short-lived inconsistencies are possible
But copies eventually converge to a consistent state
Achieve higher availability
Data Replication Summary
Pessimistic Methods
Two extremes: Read One, Write All and Read All, Write One (compromises between those two are possible)
Primary Copy (requires update protocol)
Majority Consensus
Weighted voting
Optimistic Methods
Update as soon as possible
