
Distributed Systems

Data Replication

Prof. Dr.-Ing. Torben Weis


Universität Duisburg-Essen
Problem Statement

[Figure: copies of a data object (obj) on several machines]

Data objects are copied (replicated) to other machines
Read/write operations on copies
How to keep them synchronized?
Data Replication
Goals:
Increased availability
Shorter response times
Higher throughput
Parallel processing of requests
Less communication overhead

[Figure: copies of the objects A, B, C and D distributed over Nodes 1 to 7]
Logical & Physical Data Objects

Data objects are stored on nodes (computers)


Each node has a unique identifier
Any pair of nodes can communicate
Logical data object:
Has a unique value
Supports read and write operations
Physical copy of a data object:
Logical object stored on a certain node
Supports atomic read and write operations
Multiple copies can be stored on arbitrary nodes
Failure Model

We use the same failure model as for transactions

Nodes can fail
Crash failures (but no arbitrary behavior)

Network communication can fail
Message loss (but no message corruption)
Network partitioning is possible

Assumption: nodes and network eventually recover from failures

Problems with Network Partitioning

Questions:
Which operations are still possible during a partitioning?
What has to be done after recovery from network partitioning?

Correctness of Replication Management

Concurrency chapter: serializable schedule of transactions
Concurrent execution of transactions is correct if it is equivalent to a serial execution (assuming all transactions themselves are correct)

Concurrent execution of transactions with replication is one-copy serializable if it is equivalent to a serial execution without replication (i.e. as if there were only one copy)

A one-copy serializable execution is correct


Consistency Approaches

[Figure: classification of consistency approaches. They are either syntactic or semantic, and each kind is either pessimistic or optimistic]

Syntactic consistency
Processes the syntax of the data without knowing its meaning
Example: HOLA (string)
Example: 5 (number)

Consistency Approaches (2)


Semantic consistency
Uses application-specific knowledge about the data
Example: measured temperature
36.6 °C and 36.7 °C are considered to be equal
Example: unordered elements in a set
{A, B, C, D} equals {D, A, B, C}
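
The two examples can be sketched in a few lines of Python. This is only an illustration: the function names and the tolerance of 0.15 degrees are assumptions chosen for the sketch, not values from the lecture.

```python
import math

def temperatures_equal(a_celsius: float, b_celsius: float,
                       tolerance: float = 0.15) -> bool:
    """Semantic comparison: readings within `tolerance` degrees count as equal.

    The tolerance is a hypothetical, application-specific choice.
    """
    return math.isclose(a_celsius, b_celsius, rel_tol=0.0, abs_tol=tolerance)

def collections_equal_as_sets(a, b) -> bool:
    """Semantic comparison: element order is irrelevant for sets."""
    return set(a) == set(b)

# A syntactic comparison only looks at the representation ...
print("36.6" == "36.7")                          # different strings
print(["A", "B", "C", "D"] == ["D", "A", "B", "C"])  # different sequences

# ... while a semantic comparison uses knowledge about the meaning.
print(temperatures_equal(36.6, 36.7))
print(collections_equal_as_sets(["A", "B", "C", "D"], ["D", "A", "B", "C"]))
```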

Consistency Approaches (3)


Pessimistic methods
Consistent data in every failure situation
Guaranteed one-copy serializability
But availability is limited during failures

Consistency Approaches (4)


Optimistic methods
Increase the availability of data
Accept temporary inconsistencies in case of failures
Accept non-serializable executions
Usually with a specific application scenario in mind,
which tolerates inconsistencies

Syntactic Pessimistic Methods

Two example methods:

[Figure: two plots of cost against degree of replication. For Read One, Write All the write curve lies above the read curve; for Read All, Write One the read curve lies above the write curve]

Between those two extremes, any trade-off between read and write costs is possible
Required Protocols

1. Synchronization protocol
2. Replication management protocol
3. Update protocol for physical copies

Required Protocols

1. Synchronization protocol
Synchronizes physical operations on object copies
e.g. distributed locking or distributed 2PL
Guarantees serializability on a physical object
But: No guarantee for one-copy serializability (does
not manage multiple copies of the same object)

2. Replication management protocol


3. Update protocol for physical copies

Required Protocols

1. Synchronization protocol
2. Replication management protocol
Maps logical read/write operations to physical ones
e.g. Read One, Write All:
logical read → read one of the physical objects
logical write → write all physical objects
Together with the synchronization protocol (1.), guarantees one-copy serializability

3. Update protocol for physical copies

Required Protocols

1. Synchronization protocol
2. Replication management protocol
3. Update protocol for physical copies
Node failures and partitioning cause physical copies
to become stale (outdated)
Update protocol makes outdated copies current
(up-to-date) after recovery from failure
Sometimes combined with replication management

Definitions

Logical object: x
Physical copy of x on node p: xp (x with subscript p)
Set of all copies of x: copies[x]
Set of all nodes with copies of x: nodes[x]
Number of copies of x: n[x]
Read/write x: r[x] / w[x]
Read/write xp: r[xp] / w[xp]

Protocol Read One, Write All

[Figure: a logical read of x reads one copy xp out of x1 ... xn; a logical write of x writes all copies x1 ... xn]

Read: An arbitrary copy is read
Write: All copies are written

Protocol Read One, Write All

Replication management protocol:
Read operation r[x]
Read any copy of x (i.e. any xp)
Choose the nearest/fastest/best reachable copy of x
All copies always contain the same current value
Write operation w[x]
Write to all physical copies: w[xp] for all xp ∈ copies[x]
In order to do a write, all copies must be locked
Update protocol: not applicable
If one copy is not available (i.e. its lock cannot be acquired), write operations will not succeed
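
The mapping from logical to physical operations can be sketched as a small single-process simulation. The classes `Copy` and `ReadOneWriteAll` and the `available` flag are hypothetical helpers for illustration; real nodes would exchange messages and use a distributed locking protocol.

```python
class Copy:
    """One physical copy xp of a logical object x (hypothetical helper)."""

    def __init__(self, value):
        self.value = value
        self.available = True   # False models a crashed or unreachable node
        self.locked = False


class ReadOneWriteAll:
    """Maps logical r[x] / w[x] to physical operations on the copies."""

    def __init__(self, copies):
        self.copies = copies

    def read(self):
        # r[x]: any reachable copy will do, since all copies are current.
        for copy in self.copies:
            if copy.available and not copy.locked:
                return copy.value
        raise RuntimeError("read failed: no copy reachable")

    def write(self, value):
        # w[x]: every copy must be lockable, otherwise the write fails.
        if not all(c.available and not c.locked for c in self.copies):
            raise RuntimeError("write failed: not all copies can be locked")
        for c in self.copies:
            c.locked = True
        for c in self.copies:
            c.value = value
        for c in self.copies:
            c.locked = False


x = ReadOneWriteAll([Copy(0) for _ in range(3)])
x.write(5)
x.copies[0].available = False   # one node crashes
print(x.read())                 # reading still works via another copy
```

In this sketch a single unavailable copy is enough to block every write, which is the behavior the slide describes.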

Properties of Read One, Write All

Availability for reading increases with the number of copies
$p_k$: probability that a node functions correctly
$p_r$: probability that an object replicated $n$ times is available for reading
$p_r = 1 - (1 - p_k)^n$
Example: $p_k = 0.9$, $n = 4$ gives $p_r = 0.9999$

Properties of Read One, Write All (2)

Availability for writing decreases with the number of copies
$p_k$: probability that a node functions correctly
$p_w$: probability that an object replicated $n$ times is available for writing
$p_w = p_k^n$
Examples:
$p_k = 0.9$, $n = 4$: $p_w = 0.6561$
$p_k = 0.9$, $n = 10$: $p_w \approx 0.3487$
$p_k = 0.99$, $n = 4$: $p_w \approx 0.9606$
$p_k = 0.99$, $n = 10$: $p_w \approx 0.9044$
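
The read and write availabilities of Read One, Write All can be recomputed with a short script. It is a sketch that assumes nodes fail independently of each other, which the formulas imply but the slides do not state; the function names are made up for the example.

```python
def read_availability(pk: float, n: int) -> float:
    """Reading works unless all n copies are down: 1 - (1 - pk)^n."""
    return 1 - (1 - pk) ** n


def write_availability(pk: float, n: int) -> float:
    """Writing works only if all n copies are up: pk^n."""
    return pk ** n


for pk, n in [(0.9, 4), (0.9, 10), (0.99, 4), (0.99, 10)]:
    print(f"pk={pk}, n={n}: "
          f"pr={read_availability(pk, n):.4f}, "
          f"pw={write_availability(pk, n):.4f}")
```

Adding copies pushes the read availability towards 1 while the write availability shrinks, which is why the protocol suits read-heavy workloads.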

Properties of Read One, Write All (3)

Costs for reading/writing


Read: about as expensive as without replication
Write: costs increase proportionally with number of
copies

Availability in case of network partitioning


Read: still possible, as long as there is at least one copy in the reader's network partition
Write: not possible, because copies in the other network partition are unreachable

Protocol Primary Copy

[Figure: for a read, the primary copy x* is locked and one of the copies is read; for a write, x* is locked and the copies x*, x1 ... xn are written]

Copies do not have equal rights
One is the primary copy x*, the others are replicas
For all operations, a lock on x* is required
For reading, any copy can be used
For writing, x* and all other available copies are written
Protocol Primary Copy (2)

Read operation r[x] :


1. shared(x*)
Get shared read lock
2. read xp
xp can be nearest/fastest/best reachable copy
3. unlock(x*)

Write operation w[x] :


1. lock(x*) and write(x*)
Get exclusive write lock
2. Write all available copies xp ∈ copies[x] \ {x*}
3. unlock(x*)
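
The numbered steps can be sketched as a single-process simulation. The class `PrimaryCopy`, its fields and the counter-based lock are assumptions made for illustration; a real system would use a distributed lock on x* and message passing between nodes.

```python
class PrimaryCopy:
    """Simulates r[x] / w[x] under the Primary Copy protocol (sketch)."""

    def __init__(self, nodes, primary, value):
        self.values = {p: value for p in nodes}     # xp for every node p
        self.available = {p: True for p in nodes}   # False: crashed/unreachable
        self.primary = primary                      # node holding x*
        self.readers = 0                            # shared locks held on x*
        self.writer = False                         # exclusive lock held on x*

    def _require_primary(self):
        if not self.available[self.primary]:
            raise RuntimeError("primary copy x* is not reachable")

    def read(self, preferred):
        self._require_primary()
        if self.writer:
            raise RuntimeError("x* is write-locked")
        self.readers += 1                           # 1. shared(x*)
        try:
            node = preferred if self.available[preferred] else self.primary
            return self.values[node]                # 2. read xp
        finally:
            self.readers -= 1                       # 3. unlock(x*)

    def write(self, value):
        self._require_primary()
        if self.writer or self.readers:
            raise RuntimeError("x* is locked")
        self.writer = True                          # 1. lock(x*) ...
        try:
            self.values[self.primary] = value       #    ... and write(x*)
            for p in self.values:                   # 2. write available copies
                if p != self.primary and self.available[p]:
                    self.values[p] = value
        finally:
            self.writer = False                     # 3. unlock(x*)


x = PrimaryCopy(["A", "B", "C"], primary="A", value=0)
x.available["B"] = False    # B is unreachable, the write still succeeds
x.write(1)
print(x.values)             # B keeps its stale value until it is updated
```

The sketch leaves the copy on B stale on purpose: bringing it up to date is the job of the update protocol, not of the write operation.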

Protocol Primary Copy: Failures

1. Network Partitioning
Read/write succeeds as long as the primary copy is
available in the network partition
Only primary copy + available copies are written

2. Crash of the primary copy


The other nodes elect a new primary copy
Distributed election algorithm required
Only reliable if it can be guaranteed that there is no
network partitioning
Not easy to achieve

Protocol Primary Copy: Update Protocol

After crash and restart, a node ...


1. ... makes its copies unavailable for any operations
2. ... sets the value of its copy to the current value of
the primary copy (i.e. it updates)
3. ... makes its copies available for operations again

When partitioned networks are rejoined, ...


1. ... all copies are unavailable for any operations
2. ... the copies of the partition(s) without primary copy
are updated
3. ... the copies are available for operations again
Protocol Primary Copy: Example

Node E wants to read:
r[x]: shared(x*); r[xE]; unlock(x*)

Node E wants to write:
w[x]: lock(x*); w[x*]; w[xB]; w[xE]; unlock(x*)

After recovery from the partitioning:
r[x*]; w[xC]; w[xD]
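
The example can be replayed with a small simulation. Two things are assumptions, since the original figure is not available here: the primary copy x* is taken to live on node A, and the partition is taken to separate {A, B, E} from {C, D}. Locking is left out to keep the sketch short, and the function names are hypothetical.

```python
def write(copies, reachable, primary, value):
    """w[x] under Primary Copy: needs x*, then writes every reachable copy."""
    if primary not in reachable:
        raise RuntimeError("primary copy is not in this partition")
    for node in reachable:
        copies[node] = value


def update_after_rejoin(copies, primary, stale_nodes):
    """Update protocol: overwrite stale copies with the value of x*."""
    for node in stale_nodes:
        copies[node] = copies[primary]


copies = {node: "old" for node in "ABCDE"}

# Node E writes while the network is partitioned: w[x*]; w[xB]; w[xE]
write(copies, reachable={"A", "B", "E"}, primary="A", value="new")
print(copies)   # C and D still hold the old value

# After recovery from the partitioning: r[x*]; w[xC]; w[xD]
update_after_rejoin(copies, primary="A", stale_nodes={"C", "D"})
print(copies)   # all copies are current again
```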

Protocol Primary Copy: Properties

Costs for reading/writing


Similar to Read One, Write All, though with constant
locking overhead
Read: additional but constant costs for locking
(scalable, as long as the primary copy does not
become a bottleneck)
Write: costs increase proportionally with number of
available copies

Availability in case of network partitioning


Read/write operations are possible only in the partition that contains the primary copy
Protocol Primary Copy: Properties (2)

Crash of the primary copy must be distinguishable from partitioning
Otherwise no one can continue working during a primary copy crash

Node restart and network joining must be detectable in order to trigger the update protocol

Voting Approaches

Votes:
Each copy gets a number of votes
The copies vote on each operation

Quorum:
A set of copies with a certain minimum number of
votes forms a quorum
A read or write operation needs a read or write
quorum, respectively

Version Numbers of Copies

Copies of an object can be current or stale


Use versioning to determine the current one
On creation of an object, all copies are assigned the
version number 1
With every write operation, the version numbers of all
written copies are set to a value that is greater than
all previous version numbers
Only those copies with the highest version number
contain the current value of an object

Version numbers have to be examined before reading and writing
Protocol Majority Consensus

[Figure: copies xA, xB, xC, xD and xE; a read of x and a write of x each access a majority of the copies]

Each copy has one vote and a version number
Attempt to lock a majority of copies
A read/write operation is possible when a majority of copies can be locked
Protocol Majority Consensus: Read

Read operation r[x]:
Starting node:
  Ask every p ∈ nodes[x] for a read vote on x
  if m > n[x] / 2 positive responses arrived (before timeout)
  then r[xpr], where pr is the responding node with the highest version number
  else abort
Responding node p:
  if xp is not write-locked
  then set shared read lock and return a positive response with version number of xp

Protocol Majority Consensus: Write

Write operation w[x]:
Starting node:
  Ask every p ∈ nodes[x] for a write vote on x
  if m > n[x] / 2 positive responses arrived (before timeout)
  then calculate the maximum version number + 1 and write to all nodes which responded
  else abort
Responding node p:
  if xp is neither read-locked nor write-locked
  then set write lock and return a positive response with version number of xp
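
The read and write pseudocode of Majority Consensus can be turned into a runnable single-process sketch. The `Replica` class, the `up` flag (standing in for timeouts and lost messages) and the function names are assumptions made for illustration only.

```python
class Replica:
    """One copy xp with a value, a version number and simple lock state."""

    def __init__(self, value):
        self.value, self.version = value, 1
        self.up = True                  # False: node does not respond in time
        self.read_locks, self.write_locked = 0, False

    def vote_read(self):
        """Responding node: grant a shared lock unless write-locked."""
        if self.up and not self.write_locked:
            self.read_locks += 1
            return self.version
        return None

    def vote_write(self):
        """Responding node: grant a write lock unless already locked."""
        if self.up and not self.write_locked and self.read_locks == 0:
            self.write_locked = True
            return self.version
        return None


def read(replicas):
    voters = [r for r in replicas if r.vote_read() is not None]
    try:
        if len(voters) <= len(replicas) / 2:
            raise RuntimeError("abort: no read majority")
        # Read from the responding copy with the highest version number.
        return max(voters, key=lambda r: r.version).value
    finally:
        for r in voters:
            r.read_locks -= 1           # release the shared locks


def write(replicas, value):
    voters = [r for r in replicas if r.vote_write() is not None]
    try:
        if len(voters) <= len(replicas) / 2:
            raise RuntimeError("abort: no write majority")
        new_version = max(r.version for r in voters) + 1
        for r in voters:                # write to all nodes which responded
            r.value, r.version = value, new_version
    finally:
        for r in voters:
            r.write_locked = False      # release the write locks


replicas = [Replica("a") for _ in range(5)]
replicas[3].up = replicas[4].up = False
write(replicas, "b")                    # succeeds with 3 of 5 votes
replicas[3].up = replicas[4].up = True
replicas[0].up = replicas[1].up = False
print(read(replicas))                   # a different majority still sees "b"
```

The final read uses two stale copies and one current copy; the version number picks out the current one, so no separate update step is needed.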

Protocol Majority Consensus: Properties (1)

Update protocol is not needed
Will not read from stale copies because their version number is lower than the version number of a current copy
Stale copies will eventually be updated by later write operations

Any two quorums will always overlap
Two write operations will always access at least one copy that is part of both quorums (majorities)
A read and a write operation will always access at least one copy that is part of both quorums, i.e. read operations read at least one current copy
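
The overlap property can be checked by brute force for small numbers of copies. The following sketch enumerates every majority of n copies and tests all pairs; the function names are hypothetical.

```python
from itertools import combinations


def majorities(n):
    """Yield every set of more than n/2 out of n copies."""
    nodes = range(n)
    for size in range(n // 2 + 1, n + 1):
        for group in combinations(nodes, size):
            yield set(group)


def all_majorities_overlap(n):
    """True if every pair of distinct majorities shares at least one copy."""
    quorums = list(majorities(n))
    return all(a & b for a, b in combinations(quorums, 2))


for n in range(1, 8):
    print(n, all_majorities_overlap(n))
```

The reason behind the result is a counting argument: two sets that each contain more than n/2 of the n copies have more than n members in total, so they cannot be disjoint.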
Protocol Majority Consensus: Properties (2)

Identical costs for reading and writing


Increase proportionally with number of copies
Compared to Read One, Write All: reading is more expensive, but writing is less expensive
Identical availability for reading and writing
Compared to Read One, Write All: worse availability for reading, but better availability for writing
Availability in case of network partitioning
At most one partition can continue reading/writing
Availability in case of node failures
At least 2f + 1 copies are needed to tolerate f crashes
Protocol Majority Consensus: Example

[Figure: copies x1, x2, x3 and x4 on Nodes 1 to 4, each with version number VN = 2; the network is split into partition 1 and partition 2]

Problems in this example:
Half and half partitioning
There is no majority for operations on x
Without partitioning, at most 1 node failure is tolerated
Each operation will have to access 3 of the 4 copies

Protocol Majority Consensus: Example (2)

[Figure: copies x1, x2, x3 and x4 on Nodes 1 to 4, each with version number VN = 2; the network is split into partition 1 and partition 2]

To improve the availability of the protocol, we would need to...
... flexibly allocate votes
... adapt the required number of votes to the relative frequency of read and write operations
... take parameters like failure probability, access speed etc. into account

Protocol Weighted Voting

A generalization of the majority consensus protocol
Each copy xp has q[xp] votes, where q[xp] ≥ 0

Total number of votes: $q[x] = \sum_{p \in \mathrm{nodes}[x]} q[x_p]$

For every logical object x, a read threshold qr[x] and a write threshold qw[x] are defined

Protocol Weighted Voting

Read quorum $Q_r$: $\sum_{p \in Q_r} q[x_p] \ge q_r[x]$

Write quorum $Q_w$: $\sum_{p \in Q_w} q[x_p] \ge q_w[x]$

A write quorum must overlap with all other write or read quorums
Conditions for setting the read and write thresholds:
$q_w[x] > q[x] / 2$ (two writes overlap)
$q_r[x] + q_w[x] > q[x]$ (read and write overlap)
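
The two threshold conditions and the quorum test translate directly into code. In this sketch the vote assignment is stored in a dictionary; the function names and the sample vote distribution are assumptions for illustration.

```python
def thresholds_valid(votes, qr, qw):
    """Check q_w > q/2 (writes overlap) and q_r + q_w > q (read/write overlap)."""
    total = sum(votes.values())
    return qw > total / 2 and qr + qw > total


def is_quorum(votes, members, threshold):
    """A set of copies is a quorum if its votes reach the threshold."""
    return sum(votes[p] for p in members) >= threshold


votes = {"x1": 2, "x2": 1, "x3": 1, "x4": 1}   # hypothetical assignment, q = 5

print(thresholds_valid(votes, qr=3, qw=3))     # both conditions hold
print(thresholds_valid(votes, qr=2, qw=3))     # q_r + q_w = q, not enough
print(is_quorum(votes, {"x1", "x2"}, 3))       # 2 + 1 votes
print(is_quorum(votes, {"x2", "x3"}, 3))       # 1 + 1 votes
```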

Protocol Weighted Voting: Special Cases

1. Read One, Write All
For object x, let
q[xp] = 1 for all xp ∈ copies[x] (1 vote per copy)
qr[x] = 1; qw[x] = n[x] (read threshold = 1, write threshold = total number of nodes)

2. Majority Consensus
For object x, let
q[xp] = 1 for all xp ∈ copies[x]
$q_r[x] = q_w[x] = \begin{cases} n[x] / 2 + 1 & \text{for even } n[x] \\ (n[x] + 1) / 2 & \text{for odd } n[x] \end{cases}$
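
Both special cases can be written as threshold functions and checked against the overlap conditions of Weighted Voting. This is a sketch with hypothetical function names; it uses the fact that the even and odd cases of the majority threshold collapse into the single integer expression `n // 2 + 1`.

```python
def rowa_thresholds(n):
    """Read One, Write All with one vote per copy: (q_r, q_w) = (1, n)."""
    return 1, n


def majority_thresholds(n):
    """Majority Consensus with one vote per copy: q_r = q_w = floor(n/2) + 1."""
    q = n // 2 + 1
    return q, q


def satisfies_overlap_conditions(n, qr, qw):
    """With one vote per copy the total number of votes equals n."""
    return qw > n / 2 and qr + qw > n


for n in (3, 4, 5):
    print(n, rowa_thresholds(n), majority_thresholds(n))
```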
Improvements over Majority Consensus

Adapt number of votes assigned to a node
Copies on reliable nodes get a high number of votes
Total availability is increased
Unreliable local copies can be assigned a null vote, i.e. q[xp] = 0
Their failure does not affect the total availability
Example: reliable, high-bandwidth main server, plus secondary servers with lower reliability or bandwidth

Improvements over Majority Consensus (2)

Adapt thresholds to read and write frequency


What happens more often: read or write operations?
Objects that are read more often than written are
assigned qr[x] < qw[x]
Higher read availability
Reading is less expensive
Example: documents on a web server are usually read
more frequently than written

Improvements over Majority Consensus (3)

Example: 4 copies

1. Majority Consensus
Possible quorums: {x1, x2, x3}, {x1, x2, x4}, {x1, x3, x4}, {x2, x3, x4}
Tolerates only one node failure

2. Weighted Voting
[Figure: one copy is labeled "2 votes"; judging by the quorums below, this is x1]
Possible quorums: {x1, x2}, {x1, x3}, {x1, x4}, {x2, x3, x4}
Can tolerate two node failures, if x1 is alive
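
The quorum lists of this example can be reproduced by enumeration. The sketch below assumes a vote assignment that is consistent with the listed quorums but not spelled out in the extracted slide: x1 has 2 votes, the other copies have 1 vote each, and the threshold is 3 in both cases. The function name is hypothetical.

```python
from itertools import combinations


def minimal_quorums(votes, threshold):
    """Return all minimal sets of copies whose votes reach the threshold."""
    names = sorted(votes)
    quorums = []
    # Growing the group size ensures smaller quorums are found first,
    # so any superset of an existing quorum can be skipped.
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            enough = sum(votes[p] for p in group) >= threshold
            if enough and not any(set(q) <= set(group) for q in quorums):
                quorums.append(group)
    return quorums


majority = {"x1": 1, "x2": 1, "x3": 1, "x4": 1}
weighted = {"x1": 2, "x2": 1, "x3": 1, "x4": 1}   # assumed vote assignment

print(minimal_quorums(majority, threshold=3))
print(minimal_quorums(weighted, threshold=3))
```

With the extra vote for x1, any single other copy completes a quorum, which is why two failures among x2, x3 and x4 can be tolerated as long as x1 is alive.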

Pessimistic versus Optimistic

Pessimistic consistency protocols
One-copy serializability defines correctness
Limited availability when partitioning and/or node failures occur
Some applications can tolerate short-lived inconsistencies in exchange for higher availability

Optimistic consistency protocols
Provide weak consistency
Short-lived inconsistencies are possible
But copies eventually converge to a consistent state
Achieve higher availability
Data Replication Summary

Pessimistic Methods
Two extremes: Read One, Write All and Read All, Write One
(compromises between those two are possible)
Primary Copy (requires update protocol)
Majority Consensus
Weighted voting

Optimistic Methods
Update as soon as possible
