
2-Node Clustering

Active-Standby Deployment

2-Node Deployment Topology
Active-Standby Requirements

Requirements
Configuration of the Primary controller in the cluster (Must)
The Primary controller services the Northbound IP address; a Secondary takes over the NB IP upon
failover (Must)
Configuration of whether, on failover & recovery, the configured Primary controller reasserts
leadership (Must)
Configuration of the merge strategy on failover & recovery (Want)
The Primary controller is master of all devices and leader of all shards (Must)
Initial config (the design should allow for alternatives: multi-shard / multiple device masters)
Single-node operation allowed (datastore access without quorum) (Want)
Failure of Primary
Scenario 1: Primary Stays Offline

Failover Sequence
1. Secondary controller becomes master of all devices and leader of all shards

Failure of Primary
Scenario 2: Primary Comes Back Online

Recovery Sequence
1. Controller A comes back online and its data is replaced by all of Controller B's data
2. Depending on the Reassert-Leadership configuration:
1. (ON) Controller A becomes master of all devices and leader of all shards
2. (OFF) Controller B stays master of all devices and maintains leadership of all shards

Network Partition
Scenario 1: During Network Partition

Failover Sequence
1. Controller A becomes master of devices in its network segment and leader of all shards
2. Controller B becomes master of devices in its network segment and leader of all shards

Network Partition
Scenario 2: Network Partition Recovers

Recovery Sequence
1. Merge data according to the pluggable merge strategy
(Default: the Secondary's data is replaced with the Primary's data; see the sketch after this list.)
2. Depending on the Reassert-Leadership configuration:
1. (ON) Controller A becomes master of all devices and leader of all shards again
2. (OFF) Controller B becomes master of all devices and leader of all shards

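The merge strategy could be exposed as a small pluggable interface. The sketch below is illustrative only; the interface and class names are assumptions rather than an existing ODL API, and the shard data type is kept generic.

```java
/**
 * Hypothetical sketch of a pluggable merge strategy; the names and the
 * generic data type are assumptions, not an existing ODL API.
 */
public interface ShardMergeStrategy<T> {
    /** Reconciles the two divergent shard data sets once the partition heals. */
    T merge(T primaryData, T secondaryData);
}

/** Default behavior described above: the Primary's data wins outright. */
final class PrimaryWinsMergeStrategy<T> implements ShardMergeStrategy<T> {
    @Override
    public T merge(T primaryData, T secondaryData) {
        return primaryData; // the Secondary's divergent writes are discarded
    }
}
```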
No-Op Failures
Failures That Do Not Result in Any Role Changes
Scenarios
1. Secondary controller failure.
2. Any single link failure.
3. Secondary controller loses network connectivity (but device connections to the Primary are
maintained)

Cluster Configuration Options
Global & Granular Configuration
Global
1. Cluster Leader (aka Primary)
1. Allow this to be changed on a live system (e.g. for maintenance)
2. Assigned (2-Node Case), Elected (Larger Cluster Case)
2. Cluster Leader Northbound IP
3. Reassert Leadership on Failover and Recovery
4. Network Partition Detection Algorithm (pluggable)
5. Global Overrides of the Per-Device/Group and Per-Shard items (below)
Per Device / Group
1. Master / Slave
Per Shard
1. Shard Leader (Shard Placement Strategy pluggable)
2. Shard Data Merge (Shard Merge Strategy pluggable)

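Purely as an illustration, the global options above might be captured in a configuration holder along the lines of the sketch below; every name here is hypothetical.

```java
/** Hypothetical holder for the global cluster options listed above. */
public final class ClusterGlobalConfig {
    String clusterLeader;               // assigned Primary (2-node case) or null if elected
    String leaderNorthboundIp;          // Northbound IP alias serviced by the cluster leader
    boolean reassertLeadershipOnFailback;
    String partitionDetectionAlgorithm; // name of the pluggable detection algorithm
    // Global overrides of the per-device/group and per-shard settings would sit here.
}
```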
HA Deployment Scenarios
Simplified Global HA Settings
Can we Abstract Configurations to Admin-Defined Deployment
Scenarios?
e.g. Admin Configures 2-Node (Active-Standby):
This means the Primary controller is master of all devices and leader of all shards.
Conflicting configurations are overridden by the deployment scenario.

Implementation Dependencies
Potential Changes to Other ODL Projects
Clustering:
1. Refactoring of Raft Actor vs. 2-Node Raft Actor code.
2. Define Cluster Leader
3. Define Northbound Cluster Leader IP Alias

OpenFlow Plugin:
1. OpenFlow Master/Slave Roles
2. Grouping of Master/Slave Roles (aka Regions)

System:
1. Be Able to SUSPEND the Secondary controller to support Standby mode.

Open Issues
Follow-up Design Discussion Topics
TBD:
1. Is the Master/Slave definition too tied to OpenFlow? (Generalize?)
Should device ownership/mastership be implemented by the OF Plugin?
2. How to define the Northbound Cluster Leader IP in a platform-independent way?
(Linux/Mac OS X: IP alias; Windows: possible)
Gratuitous ARP on leader change.
3. When both controllers are active in the network-partition scenario, which controller
owns the Northbound Cluster Leader IP?
4. Define controller-wide SUSPEND behavior (how?)
5. On failure, a new Primary controller should be elected (in the 2-node case the Secondary is the
only candidate)
6. How (and whether) to detect management-plane failure? (Heartbeat timeout >> worst-case GC pause?)

Implementation
(DRAFT)

Change Summary

Cluster Primary: (OF Master & Shard Leader)

  Northbound IP Address
    (Config) Define Northbound IP Alias Address
    (Logic) <Pluggable> Northbound IP Alias Implementation (Platform Dependent)
  Behavior
    (Config / Logic) <Pluggable> Define Default Primary Controller (see the sketch below)
      1. Assigned (Configuration); Default for 2-Node
      2. Calculated (Election Algorithm)
      Redefine Default Primary Controller on a Running Cluster
    (Logic) Control OF Master Role
    (Logic) Control Datastore Shards
      Global Config (Overridden)
      Shard Placement (On Primary)
      <Pluggable> Leadership Determination
        Match OF Master (Default for 2-Node)
        Election Based (With Influence)

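The "Assigned vs. Calculated" choice above suggests a pluggable strategy; the sketch below is a hypothetical illustration (all names are assumptions, not existing ODL code).

```java
import java.util.List;

/** Hypothetical pluggable determination of the cluster Primary. */
interface PrimaryControllerStrategy {
    /** Returns the member that should act as cluster Primary. */
    String selectPrimary(List<String> liveMembers);
}

/** "Assigned" variant: the Primary is fixed by configuration (the 2-node default). */
final class AssignedPrimaryStrategy implements PrimaryControllerStrategy {
    private final String configuredPrimary;

    AssignedPrimaryStrategy(String configuredPrimary) {
        this.configuredPrimary = configuredPrimary;
    }

    @Override
    public String selectPrimary(List<String> liveMembers) {
        // Fall back to another live member only if the configured Primary is down.
        return liveMembers.contains(configuredPrimary) ? configuredPrimary : liveMembers.get(0);
    }
}
```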
Change Summary
(Continued)
Cluster Primary: (OF Master & Shard Leader)
Behavior (Continued)
  Network Partition & Failure Detection
    (Config / Logic) <Pluggable> Detection Algorithm; Default: Akka Clustering Alg.
  Failover
    (Config / Logic) <Pluggable> Secondary Controller Behavior
      (Logic) Suspend (Dependent APP, Datastore, etc.)
      (Logic) Resume (Become Primary) (OF Mastership, Shard Leader, Non-Quorum Datastore Access)
  Failback
    (Logic) <Pluggable> Data Merge Strategy; Default: Current Primary Overrides Secondary
    (Config) Primary Re-Asserts Leadership on Failback
      (OF Master & Shard Leader Roles After Merge)

Dependencies

1. Southbound
   Device Ownership & Roles
2. System Suspend Behavior
   How to Enforce System-Wide Suspend When Desired? (Config Subsystem? OSGi?)
3. Config Subsystem
4. Resolving APP Data
   Notifications?

Measure Failover Times
   No Data Exchange
   Various Data Exchange Cases (Sizes)

RAFT/Sharding Changes
(DRAFT)

(Current) Shard Design
ShardManager is an actor that does the following:
Creates all local shard replicas on a given cluster node and maintains the shard information
Monitors the cluster members and their status, and stores their addresses
Finds local shards
Shard is an actor (an instance of RaftActor) that represents a sub-tree within the data store
Uses an in-memory data store
Handles requests from the three-phase-commit cohorts
Handles data-change-listener requests and notifies the listeners upon state changes
Responsible for data replication among the shard (data sub-tree) replicas
Shard uses RaftActorBehavior for two tasks:
Leader election for a given shard
Data replication
A RaftActorBehavior can be in any of the following roles at any given point in time:
Leader
Follower
Candidate

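A much-simplified sketch of the pattern described above: the actor delegates every message to its current RaftActorBehavior and swaps the behavior when the role changes. This is illustrative only, not the controller's actual code.

```java
// Simplified illustration of the current design (not the actual controller code):
// one behavior object per role, swapped by the owning actor as the Raft state changes.
enum RaftState { LEADER, FOLLOWER, CANDIDATE }

interface RaftActorBehavior {
    /** Handles a Raft message and returns the state the shard should be in afterwards. */
    RaftState handleMessage(Object sender, Object message);
    RaftState state();
}

abstract class SimplifiedRaftActor {
    private RaftActorBehavior currentBehavior; // Leader, Follower or Candidate

    void onReceive(Object sender, Object message) {
        RaftState next = currentBehavior.handleMessage(sender, message);
        if (next != currentBehavior.state()) {
            switchBehavior(next); // e.g. a Candidate wins an election and becomes Leader
        }
    }

    abstract void switchBehavior(RaftState newState);
}
```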
(Current) Shard Class Diagram
(UML class diagram reproduced as text.)

Shard (extends RaftActor)
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  +commit(sender : ActorRef, serialized : Object)

RaftActor (holds one current RaftActorBehavior, 1:1)
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  #onLeaderChanged()
  -switchBehavior(state : RaftState)

<<interface>> RaftActorBehavior
  handleMessage(sender : ActorRef, message : Object)
  state()
  getLeaderId()

AbstractRaftActorBehavior (implements RaftActorBehavior)
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender : ActorRef, requestVote : RequestVote)
  #handleRequestVoteReply(sender : ActorRef, requestVoteReply : RequestVoteReply)
  #handleAppendEntries(sender : ActorRef, appendEntries : AppendEntries)
  #handleAppendEntriesReply(sender : ActorRef, appendEntriesReply : AppendEntriesReply)
  +handleMessage(sender : ActorRef, message : Object)
  #stopElection()
  #scheduleElection(interval : FiniteDuration)

Leader (extends AbstractRaftActorBehavior)
  -followers : Set<String>
  +handleMessage(sender : ActorRef, originalMessage : Object)
  -replicate(replicate : Replicate)
  -sendHeartBeat()
  -installSnapShotIfNeeded()
  -handleInstallSnapshotReply(reply : InstallSnapshotReply)
  -sendAppendEntries()

Candidate (extends AbstractRaftActorBehavior)
  -voteCount : int
  -votesRequired : int
  +handleMessage(sender : ActorRef, originalMessage : Object)
  -startNewTerm()
  #handleRequestVoteReply(sender : ActorRef, requestVoteReply : RequestVoteReply)
  #scheduleElection(interval : FiniteDuration)

Follower (extends AbstractRaftActorBehavior)
  -memberName
  #handleAppendEntries(sender : ActorRef, appendEntries : AppendEntries)
  #handleMessage(sender : ActorRef, originalMessage : Object)
  -handleInstallSnapshot()
(Proposed) Shard Design
Intent
Support a two-node cluster by separating shard data replication from leader election
Elect one of the ODL nodes as master and mark it as Leader for all the shards
Make leader election pluggable
The current Raft leader-election logic should continue to work for a 3-node deployment
Design Idea
Minimize the impact on ShardManager and Shard
Separate the leader-election and data-replication logic currently combined in the
RaftActorBehavior classes: create two separate abstract classes and interfaces, one for leader
election and one for data replication
The Shard actor will hold a reference to a RaftReplicationActorBehavior instance
(currentBehavior).
The RaftReplicationActorBehavior will hold a reference to an ElectionActorBehavior instance.
Both the RaftReplicationActorBehavior and the ElectionActorBehavior instances will be in one
of the following roles at any given point in time:
Leader
Follower
Candidate
The RaftReplicationActorBehavior will update its ElectionActorBehavior instance based on the
messages it receives. A message could be sent either by one of the ElectionActorBehavior
instances or by a module that implements the 2-node cluster logic (see the sketch below).
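A minimal sketch of the separation described above, reusing the names from this proposal; the exact method sets are placeholders (the proposed members appear in the class diagram on the next slide).

```java
// Minimal sketch of the proposed split (method lists abbreviated; see the class diagram).
enum RaftState { LEADER, FOLLOWER, CANDIDATE }

interface ElectionActorBehavior {
    void handleMessage(Object sender, Object message); // RequestVote traffic, heartbeats
    String getLeaderId();
}

interface RaftReplicationActorBehavior {
    void handleMessage(Object sender, Object message); // AppendEntries traffic, snapshots
    /** Lets replication (or an external 2-node cluster module) drive the election role. */
    void switchElectionBehavior(RaftState newState);
}
```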
(Proposed) Shard Class Diagram
(UML class diagram reproduced as text. The election side and the replication side each have their
own Leader / Candidate / Follower role classes.)

Shard (extends RaftActor)
  -configParams : ConfigParams
  -store : InMemoryDOMDataStore
  -name : ShardIdentifier
  -dataStoreContext : DataStoreContext
  -schemaContext : SchemaContext
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  +commit(sender : ActorRef, serialized : Object)

RaftActor (holds one current behavior, 1:1)
  #context : RaftActorContext
  -currentBehavior : RaftActorBehavior
  +onReceiveRecover(message : Object)
  +onReceiveCommand(message : Object)
  #onLeaderChanged()
  -switchBehavior(state : RaftState)

<<interface>> ElectionActorBehavior
  handleMessage(sender : ActorRef, message : Object)
  state()
  getLeaderId()

AbstractElectionActorBehavior (implements ElectionActorBehavior)
  #context : RaftActorContext
  #leaderId : String
  #requestVote(sender : ActorRef, requestVote : RequestVote)
  #handleRequestVoteReply(sender : ActorRef, requestVoteReply : RequestVoteReply)
  +handleMessage(sender : ActorRef, message : Object)
  #stopElection()
  #scheduleElection(interval : FiniteDuration)
  #currentTerm()
  #voteFor()

Election roles (extend AbstractElectionActorBehavior):
  Leader
    -followers : Set<String>
    -heartbeatSchedule : Cancellable
    #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
    -sendHeartBeat()
  Candidate
    -voteCount : int
    +handleMessage(sender : ActorRef, originalMessage : Object)
    -startNewTerm()
    #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
  Follower
    #handleRequestVoteReply(:ActorRef, :RequestVoteReply)
    #scheduleElection(interval : FiniteDuration)

<<interface>> RaftReplicationActorBehavior
  handleMessage(sender : ActorRef, message : Object)
  +switchElectionBehavior(:RaftState)
  getLeaderId()

AbstractRaftReplicationActorBehavior (implements RaftReplicationActorBehavior)
  #context : RaftActorContext
  #handleAppendEntries(sender : ActorRef, appendEntries : AppendEntries)
  #handleAppendEntriesReply(sender : ActorRef, appendEntriesReply : AppendEntriesReply)
  +handleMessage(sender : ActorRef, message : Object)
  #applyLogToStateMachine(index : long)

Replication roles (extend AbstractRaftReplicationActorBehavior):
  Leader
    -minReplicationLog : int
    #handleAppendEntries(sender : ActorRef, appendEntries : AppendEntries)
    -handleInstallSnapshotReply()
    -sendAppendEntries()
    -installSnapshotIfNeeded()
    +sendSnapshotChunk(:ActorSelection, :String)
  Candidate
    -startNewTerm()
    #handleAppendEntriesReply(:ActorRef, :AppendEntriesReply)
  Follower
    -snapshotChunksCollected : ByteString
    +handleInstallSnapshot(:ActorRef, :InstallSnapshot)
    #handleAppendEntriesReply(:ActorRef, :AppendEntriesReply)
    #handleAppendEntries(:ActorRef, :AppendEntries)
2-node cluster work flow
Method-1: Run 2-node cluster protocol outside of ODL
The external cluster protocol decides which node is master and which node
is standby. Once the master election is complete, the master sends node
roles and node-membership information to all the ODL instances.
The Cluster module within ODL defines the cluster node model and provides
REST APIs to configure the cluster information by modifying the *.conf
files.
The Cluster module will send RAFT messages to all other cluster
members about the cluster membership information & shard RAFT state.
ShardActors on both cluster nodes will handle these messages,
instantiate the corresponding ReplicationBehavior & ElectionBehavior
role instances, and switch to the new roles (a hypothetical payload is sketched below).
The Northbound virtual IP is OS dependent and out of scope here.

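The role/membership information pushed into ODL by the external protocol could be as simple as the hypothetical message below; all names here are assumptions, not an existing ODL type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical payload an external 2-node cluster protocol could push into ODL. */
public final class ClusterRoleInfo {
    public enum Role { MASTER, STANDBY }

    public final String memberName;       // this ODL instance
    public final Role role;               // role decided by the external protocol
    public final List<String> allMembers; // current cluster membership

    public ClusterRoleInfo(String memberName, Role role, List<String> allMembers) {
        this.memberName = memberName;
        this.role = role;
        this.allMembers = Collections.unmodifiableList(new ArrayList<>(allMembers));
    }
}
```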
Reference diagram for Method-2

[Diagram labels: 1a. switch-to-controller connectivity-state polling; cluster protocol (primary
path); 1b. cluster protocol (secondary path)]

2-node cluster work flow
Method-2: Run cluster protocol within ODL
The Cluster Module within each ODL instance talks to the other ODL
instance and elects the master and standby nodes.
If the cluster protocol times out, a node will check other factors (for
example, cross-checking with the connected OpenFlow switches for
primary-controller information, or using an alternative path) before a
new master election.
The Cluster module will send RAFT messages to all other cluster
members about the cluster membership information & shard RAFT
state.
ShardActors on both cluster nodes will handle these
messages, instantiate the corresponding ReplicationBehavior
& ElectionBehavior role instances, and switch to the new roles.
The Northbound virtual IP is OS dependent and out of scope here.

3-node Cluster work flow
The ShardManager will create the local shards based on the shard
configuration.
Each shard starts off as a Candidate both for role election and for
data-replication messages, by instantiating the
ElectionBehavior and ReplicationBehavior classes in their
Candidate roles.
Each Candidate then starts sending RequestVote messages to the
other members.
A Leader is elected using the Raft leader-election algorithm, and
the winning shard sets its state to Leader by switching its
ElectionBehavior & ReplicationBehavior instances to Leader.
The remaining Candidates receive the leader assertion messages and
move to the Follower state by switching their
ElectionBehavior & ReplicationBehavior instances to Follower.
(Working Proposal) ConsensusStrategy
Provide Hooks to Influence Key RAFT Decisions
(Shard Leader Election / Data Replication)
https://git.opendaylight.org/gerrit/#/c/12588/

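The Gerrit change above defines the actual interface; purely to illustrate the idea of hooks that let an outer policy influence leader election and replication, a hypothetical shape might look like the sketch below. These method names are assumptions, not the ones in the patch.

```java
/**
 * Hypothetical illustration of "hooks to influence key RAFT decisions";
 * the real interface lives in the Gerrit change referenced above.
 */
public interface ConsensusStrategy {
    /** Whether this member may start an election at all. */
    boolean canStartElection(String memberName);

    /** Whether this member should grant its vote to the given candidate. */
    boolean shouldGrantVote(String candidateId, String memberName);

    /** Whether an entry may be considered committed with the given replica count. */
    boolean isReplicationSufficient(int replicaCount, int clusterSize);
}
```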
Config Changes
(DRAFT)

(Current) Config
Config Files (Karaf: /config/initial)
Read Once on Startup (Default Settings For New Modules)
(sal-clustering-commons) Hosts Akka & Config Subsystem Reader/Resolver/Validator
Currently No Config Subsystem Config Properties Defined?

Akka/Cluster Config: (akka.conf)
Akka-Specific Settings (actor systems for data/rpc, mailbox, logging, serializers, etc.)
Cluster Config (IPs, names, network parameters)
Shard Config: (modules.conf, modules-shards.conf)
Shard Name / Namespace
Sharding Strategies
Replication (# and Location)
Default Config

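Since akka.conf is standard Typesafe Config (HOCON), it can be inspected with the Typesafe Config API. The sketch below only shows the mechanism; the file path and the rendered content depend on the installation, and no specific key names are assumed.

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import java.io.File;

public final class InitialConfigReader {
    public static void main(String[] args) {
        // Path is a placeholder for Karaf's initial-config directory mentioned above.
        Config akkaConf = ConfigFactory.parseFile(new File("config/initial/akka.conf"));
        // Dump the parsed tree; the shipped akka.conf defines the actual section names
        // (actor systems, remoting/cluster addresses, serializers, etc.).
        System.out.println(akkaConf.root().render());
    }
}
```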
(Proposal) Config
Intent
Continue to Keep Config Outside of Shard/RAFT/DistributedDatastore Code
Provide Sensible Defaults and Validate Settings When Possible
Error/Warn on Any Changes That Are Not Allowed On a Running System
Provide REST Config Access (where appropriate)

Design Idea
Host Configuration Settings in Config Subsystem
Investigate Using Karaf Cellar To Distribute Common Cluster-Wide Config

Move Current Config Processing (org.opendaylight.controller.cluster.common.actor) to the
existing sal-clustering-config?
Akka-Specific Config:
Keep Most of the Existing akka.conf File as the Default Settings
Separate Cluster Member Config (see Cluster Config)
Options:
Provide Specific Named APIs, e.g. setTCPPort()
Allow Akka <type,value> Config To Be Set Directly

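The two options listed above could be contrasted with a hypothetical service interface such as the sketch below; neither method exists today.

```java
/** Hypothetical contrast of the two config-API options above; nothing here exists yet. */
public interface ClusterAkkaConfigService {
    // Option 1: specific named APIs for well-known settings.
    void setTcpPort(int port);

    // Option 2: pass arbitrary Akka <type, value> pairs through directly.
    void setAkkaSetting(String path, String value);
}
```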
(Proposal) Config
Design Idea (Continued)
Cluster Config:
Provide a Single Point For Configuring A Cluster
Feeds Back to Akka-Specific Settings, etc.
Define Northbound Cluster IP Config (alias)
Shard Config:
Define Shard Config (Name / Namespace / Sharding Strategy)
Will NOT Support Changing Running Shard For Now
Other Config:
2-Node:
Designate the Cluster's Primary Node or an Election Algorithm (dynamic)
Failback to the Primary Node (dynamic)
Strategies (to Influence These in RAFT): Separate Bundles?
Election
Consensus

Northbound IP Alias
(DRAFT)

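This section is still a placeholder. As one possible Linux-only approach, consistent with the IP-alias and gratuitous-ARP notes in the Open Issues slide, the new Primary could add the alias address and announce it. The sketch below shells out to the standard ip and arping tools; the interface name and addresses are placeholders, and error handling is omitted.

```java
import java.io.IOException;

/** Hypothetical Linux-only sketch: claim the Northbound IP alias and announce it. */
public final class NorthboundIpAlias {
    public static void claim(String iface, String cidrAddress, String address)
            throws IOException, InterruptedException {
        // Add the alias address to the interface (e.g. "10.0.0.100/24" on "eth0").
        new ProcessBuilder("ip", "addr", "add", cidrAddress, "dev", iface)
                .inheritIO().start().waitFor();
        // Send gratuitous ARP so neighbors update their caches to the new owner.
        new ProcessBuilder("arping", "-U", "-c", "3", "-I", iface, address)
                .inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        claim("eth0", "10.0.0.100/24", "10.0.0.100"); // placeholder values
    }
}
```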