
Course at the University of Applied Sciences - FH München
Prof. Dr. Christian Vogt
 TBD
 Gregory Pfister: In Search of Clusters, 2nd ed., Pearson 1998
 Documentation for the Windows Server 2008 Failover Cluster (on the Microsoft web pages)
 Sven Ahnert: Virtuelle Maschinen mit VMware und Microsoft, 2nd ed., Addison-Wesley 2007 (the 3rd edition is announced for June 26, 2009).
 Wikipedia says:
A computer cluster is a group of linked
computers, working together closely so that in
many respects they form a single computer.
 Gregory Pfister says:
A cluster is a type of parallel or distributed
system that:
 consists of a collection of interconnected whole
computers,
 and is utilized as a single, unified computing
resource.
 High Performance Computing
 Load Balancing
 High Availability
 Scalability
 Simplified System Management
 Single System Image
 High Performance Computing (HPC) Clusters

 Load Balancing Clusters (aka Server Farms)

 High-Availability Clusters (aka Failover Clusters)
 VMScluster (DEC 1984, today: HP)
 Shared everything cluster with up to 96 nodes.
 IBM HACMP (High Availability Cluster
Multiprocessing, 1991)
 Up to 32 nodes (IBM System p with AIX or Linux).
 IBM Parallel Sysplex (1994)
 Shared everything, up to 32 nodes (mainframes
with z/OS).
 Solaris Cluster, aka Sun Cluster
 Up to 16 nodes.
 Heartbeat (HA Linux project, started in 1997)
 No architectural limit for the number of nodes.
 Red Hat Cluster Suite
 Up to 128 nodes; includes a Distributed Lock Manager (DLM).
 Windows Server 2008 Failover Cluster
 Was: Microsoft Cluster Server (MSCS, since 1997).
 Up to 16 nodes on x64 (8 nodes on x86).

 Oracle Real Application Cluster (RAC)


 Two or more computers, each running an instance of the
Oracle Database, concurrently access a single database.
 Up to 100 nodes.
 One physical machine as hot standby for several physical machines:
(diagram: physical / virtual / cluster)
 Consolidation of several clusters:
(diagram: physical / virtual / cluster)
 Clustering hosts (failing over whole VMs):
(diagram: physical / virtual / cluster)
 Internet Small Computer Systems Interface
 is a storage area network (SAN) protocol,
 carries SCSI commands over IP networks (LAN, WAN, Internet),
 is an alternative to Fibre Channel (FC), using an existing network infrastructure.
 An iSCSI client is called an iSCSI Initiator.
 An iSCSI server is called an iSCSI Target.
 An iSCSI initiator initiates a SCSI session, i.e. it sends SCSI commands to the target.

 A Hardware Initiator (host bus adapter, HBA)
 handles the iSCSI and TCP processing and the Ethernet interrupts independently of the main CPU.
 A Software Initiator
 runs as a memory-resident device driver,
 uses an existing network card,
 leaves all protocol handling to the main CPU.
 An iSCSI target
 waits for iSCSI initiators' commands,
 provides the required input/output data transfers.

 Hardware Target:
A storage array (SAN) may offer its disks via the iSCSI protocol.
 Software Target:
 offers (parts of) the local disks to iSCSI initiators,
 uses an existing network card,
 leaves all protocol handling to the main CPU.
 A Logical Unit Number (LUN)
 is the unit offered by iSCSI targets to iSCSI initiators,
 represents an individually addressable SCSI device,
 appears to an initiator like a locally attached device,
 may physically reside on a non-SCSI device, and/or be part of a RAID set,
 may restrict access to a single initiator,
 may be shared between several initiators (leaving the handling of access conflicts to the file system or operating system, or to some cluster software).
Attention: many iSCSI target solutions do not offer this functionality.
 iSCSI
 optionally uses the Challenge-Handshake Authentication Protocol (CHAP) for authentication of initiators to the target,
 does not provide cryptographic protection for the data transferred.
 CHAP
 uses a three-way handshake,
 bases the verification on a shared secret, which must be known to both the initiator and the target.
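The exchange can be illustrated with a short Python sketch that computes a CHAP response as specified in RFC 1994: the target sends an identifier and a random challenge, the initiator replies with the MD5 hash of identifier, shared secret, and challenge, and the target verifies the hash with its own copy of the secret. All values below are made up for the example; in iSCSI this exchange happens during the login phase.

```python
import hashlib
import os

def chap_response(identifier: int, secret: bytes, challenge: bytes) -> bytes:
    """CHAP response per RFC 1994: MD5(identifier || secret || challenge)."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# --- hypothetical example values ---
secret = b"shared-chap-secret"      # known to both initiator and target
identifier = 7                      # chosen by the target for this challenge
challenge = os.urandom(16)          # random challenge sent by the target

# 1. target -> initiator: (identifier, challenge)
# 2. initiator -> target: response
response = chap_response(identifier, secret, challenge)

# 3. target verifies by recomputing the hash with its copy of the secret
assert response == chap_response(identifier, secret, challenge)
print("CHAP response:", response.hex())
```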
 In order to build a Windows Server 2008 Failover Cluster you need to:
 Install the Failover Cluster feature (in Server Manager).
 Connect networks and storage:
 Public network
 Heartbeat network
 Storage network (FC or iSCSI, unless you use SAS)
 Validate the hardware configuration (Cluster Validation Wizard in the Failover Cluster Management snap-in).
 All disks on a shared storage bus are
automatically placed in an offline state when
first mapped to a cluster node. This allows
storage to be simultaneously mapped to all
nodes in a cluster even before the cluster is
created. No longer do nodes have to be booted
one at a time, disks prepared on one and then
the node shut down, another node booted, the
disk configuration verified, and so on.
 Run the Cluster Validation Wizard (in Failover Cluster Management).
 Adjust your configuration until the wizard does not report any errors.
 An error-free cluster validation is a prerequisite for obtaining Microsoft support for your cluster installation.
 A full run of the wizard tests:
 System configuration
 Inventory
 Network
 Storage
 Use the Create Cluster Wizard (in Failover Cluster Management) to create the cluster.
You will have to specify
 which servers are to be part of the cluster,
 a name for the cluster,
 an IP address for the cluster.
 Other parameters will be chosen automatically and can be changed later.
 (Node) Fencing is the act of forcefully disabling a cluster node (or at least keeping it from doing disk I/O: Disk Fencing).
 The decision when a node needs to be fenced is taken by the cluster software.
 Some ways in which a node can be fenced:
 by disabling its port(s) on a Fibre Channel switch,
 by (remotely) powering down the node,
 by using SCSI-3 Persistent Reservations.
 Some Fibre Channel switches allow programs to fence a node by disabling the switch port(s) that it is connected to.
 “Shoot the other node in the head”.
 A special STONITH device (a network power switch) allows a cluster node to power down other cluster nodes.
 Used, for example, in Heartbeat, the Linux HA project.
 Allows multiple nodes to access a SCSI device.
 Blocks other nodes from accessing the device.
 Supports multiple paths from host to disk.
 Reservations are persistent across SCSI bus resets and node reboots.
 Uses registrations and reservations.
 To eject another system's registration, a node issues a preempt-and-abort command.
 Windows Server 2008 Failover Cluster uses SCSI-3 Persistent Reservations.

 All shared storage solutions (e.g. iSCSI targets) used in the cluster must use SCSI-3 commands, and in particular support persistent reservations.
(Many open source iSCSI targets do not fulfill this requirement, e.g. OpenFiler or the FreeNAS target.)
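The registration/reservation/preempt mechanism can be made concrete with a small in-memory model. The following Python sketch only illustrates the semantics described above (register a key, reserve the LUN, preempt-and-abort another node's registration to fence it); it does not talk to a real SCSI device, and all class and method names are invented for the example.

```python
class PersistentReservationLun:
    """Toy model of SCSI-3 Persistent Reservations on a single LUN."""

    def __init__(self):
        self.registrations = {}   # node -> registration key
        self.holder = None        # node currently holding the reservation

    def register(self, node, key):
        self.registrations[node] = key

    def reserve(self, node):
        if node not in self.registrations:
            raise PermissionError("node must register before reserving")
        if self.holder is None:
            self.holder = node

    def write(self, node):
        # with a "registrants only" style reservation, only registered
        # nodes are allowed to write to the LUN
        if node not in self.registrations:
            raise PermissionError(f"{node} is fenced off from the LUN")
        return f"{node} wrote to the LUN"

    def preempt_and_abort(self, node, victim_key):
        # eject every registration using victim_key and take over the reservation
        if node not in self.registrations:
            raise PermissionError("only registered nodes may preempt")
        self.registrations = {n: k for n, k in self.registrations.items()
                              if k != victim_key}
        self.holder = node

lun = PersistentReservationLun()
lun.register("node1", key=0x11)
lun.register("node2", key=0x22)
lun.reserve("node1")
print(lun.write("node2"))             # allowed: node2 is registered
lun.preempt_and_abort("node1", 0x22)  # fence node2
try:
    lun.write("node2")
except PermissionError as e:
    print("fenced:", e)
```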
 In that case the Cluster Validation Wizard reports a corresponding error.
 Cluster Partitioning (Split-Brain) is the situation when the cluster nodes break up into groups which can communicate within their groups, and with the shared storage, but not between groups.
 Cluster Partitioning can lead to serious problems, including data corruption on the shared disks.
 Cluster Partitioning can be avoided by using
a Quorum Scheme:
 A group of nodes is only allowed to run as a cluster
when it has quorum.
 Quorum consists of a majority of votes.
 Votes can be contributed by
 Nodes
 Disks
 File Shares
each of which can provide one or more votes.
 In Windows Server 2008 Failover Cluster votes
can be contributed by
 a node,
 a disk (called the witness disk),
 a file share,
each of which provides exactly one vote.

 A Witness Disk or File Share contains the cluster registry hive in the \Cluster directory.
(The same information is also stored on each of the cluster nodes, but may be out of date.)
Windows Server 2008 Failover Cluster can use any of four different Quorum Schemes:
 Node Majority
 Recommended for a cluster with an odd number of nodes.
 Node and Disk Majority
 Recommended for a cluster with an even number of nodes.
 Node and File Share Majority
 Recommended for a multi-site cluster with an even number of nodes.
 No Majority: Disk Only
 A group of nodes may run as a cluster if it has access to the witness disk.
 The witness disk is a single point of failure.
 Not recommended. (Only for backward compatibility with Windows Server 2003.)
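The arithmetic behind these quorum schemes can be illustrated with a short Python sketch: count one vote per surviving node plus, if configured and reachable, one vote for the witness, and require a strict majority of all configured votes. This is only a model of the vote counting described above, not of the actual cluster service implementation.

```python
def has_quorum(total_nodes: int,
               live_nodes: int,
               scheme: str = "node_majority",
               witness_configured: bool = False,
               witness_reachable: bool = False) -> bool:
    """Return True if the partition of live_nodes may run as a cluster."""
    if scheme == "no_majority_disk_only":
        # legacy scheme: whoever can access the witness disk wins
        return witness_reachable

    total_votes = total_nodes + (1 if witness_configured else 0)
    my_votes = live_nodes + (1 if witness_configured and witness_reachable else 0)
    return my_votes > total_votes // 2   # strict majority of all votes

# 4-node cluster with a witness disk (Node and Disk Majority):
# a 2-node partition that still sees the witness has 3 of 5 votes -> quorum
print(has_quorum(4, 2, "node_and_disk_majority",
                 witness_configured=True, witness_reachable=True))   # True
# the other 2-node partition without the witness has 2 of 5 votes -> no quorum
print(has_quorum(4, 2, "node_and_disk_majority",
                 witness_configured=True, witness_reachable=False))  # False
# 3-node cluster with Node Majority: 2 live nodes out of 3 votes -> quorum
print(has_quorum(3, 2))                                              # True
```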
 Resources
 Groups
 Services and Applications
 Dependencies
 Failover
 Failback
 Looks-Alive ("Basic resource health check", default interval: 5 sec.)
 Is-Alive ("Thorough resource health check", default interval: 1 min.)
 DFS Namespace Server
 DHCP Server
 Distributed Transaction Coordinator (DTC)
 File Server
 Generic Application
 Generic Script
 Generic Service
 Internet Storage Name Service (iSNS) Server
 Message Queuing
 Other Server
 Print Server
 Virtual Machine (Hyper-V)
 WINS Server
 General:
 Name
 Preferred Owner(s)
(Must be specified if failback is desired.)
 Failover:
 Period (Default: 6 hours)
Number of hours in which the failover threshold must not be exceeded.
 Threshold (Default: 2 [?, 2 for File Server])
Maximum number of times to attempt a restart or failover in the specified period. When this number is exceeded, the application is left in the failed state.
 Failback:
 Prevent failback (Default)
 Allow failback
 Immediately
 Failback between (specify range of hours of the day)
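The Threshold/Period rule can be sketched in a few lines of Python: the cluster remembers recent failover attempts for a service or application and, once more than the configured threshold of attempts falls within the configured period, stops trying and leaves the group in the failed state. The class and method names are invented for this illustration and are not part of any Windows API.

```python
import time

class FailoverPolicy:
    """Illustrative model of the per-group failover Threshold/Period rule."""

    def __init__(self, threshold=2, period_hours=6):
        self.threshold = threshold            # max failover attempts within the period
        self.period = period_hours * 3600     # period in seconds
        self.attempts = []                    # timestamps of recent failovers

    def may_fail_over(self, now=None):
        now = time.time() if now is None else now
        # only attempts inside the sliding window count
        self.attempts = [t for t in self.attempts if now - t < self.period]
        return len(self.attempts) < self.threshold

    def record_failover(self, now=None):
        self.attempts.append(time.time() if now is None else now)

policy = FailoverPolicy(threshold=2, period_hours=6)
for i in range(3):
    if policy.may_fail_over(now=float(i)):
        policy.record_failover(now=float(i))
        print(f"attempt {i + 1}: failing the group over")
    else:
        print(f"attempt {i + 1}: threshold exceeded, group stays failed")
```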
In addition to all services and applications
mentioned before:

 File Share Quorum Witness


 IP Address
 IPv6 Address
 IPv6 Tunnel Address
 MSMQ Triggers
 Network Name
 NFS Share
 Physical Disk
 Volume Shadow Copy Service Task
 General:
 Resource Name
 Resource Type
 Dependencies
 Policies:
 Do not restart
 Restart (Default)
 Threshold: Maximum number of restarts in the period.
Default: 1
 Period: Period for restarts. Default: 15 min.
 Failover all resources in the service/application if restart
fails? Default: yes
 If restart fails, begin restarting again after ... Default: 1 hour
 Pending Timeout. Default: 3 minutes
 Advanced Policies:
 Possible Owners.
 Basic resource health check interval / Thorough
resource health check interval
 Default: Use standard time period for the resource type
 Use specified time period (defaults: 5 sec. / 1 min.)
 Run resource in separate Resource Monitor.
Default: no.
 Further parameters depending on the type of
the resource.
(Diagram: the Windows Server 2008 Failover Cluster storage stack. User mode: CluAdmin.msc and the Validate step use ClusAPI and WMI to reach ClusSvc.exe, CPrepSrv and RHS.exe, which loads ClusRes.dll for the disk resource managing the clustered volumes (C:\, F:\). Kernel mode: NetFT, ClusDisk.sys, PartMgr.sys, Disk.sys, the MS MPIO filter, Storport and the miniport driver, down to the HBA and the storage enclosure. Major change: ClusDisk is no longer in the disk fencing business.)
 Database Manager
 Manages the configuration database contained in the registry of each cluster node.
 Coordinates updates of the database.
 Makes sure that updates are atomic across the cluster nodes.
 Node Manager (or: Membership Manager)
 Maintains cluster membership.
 The node managers of all cluster nodes communicate in order to detect the failure of a node.
 Event Processor
 Is responsible for communicating events to the applications and to other components of the cluster service.
 Communication Manager
 Is responsible for the communication between the cluster services on the cluster nodes, e.g. related to
 negotiating the entrance of a node into the cluster,
 information about resource states,
 failover and failback operations.
 Global Update Manager
 Component for distributing update requests to all cluster nodes.
 Resource/Failover Manager: is responsible for
 managing the dependencies between resources,
 starting and stopping resources,
 initiating failover and failback.
 Resource Monitors handle the communication between the cluster service and the resources.
 A Resource Monitor is a separate process, using resource-specific DLLs.
 A Resource Monitor uses one "poller thread" per 16 resources for performing the LooksAlive and IsAlive tests.
 The Resource API for writing your own resource DLLs distinguishes two types of functions:
 Callback routines, which can be called from the DLL:
 LogEvent
 SetResourceStatus
 Entry-point routines, which are called by the resource
monitor:
 Startup (called once for every resource type)
 Open (executed when creating a new resource)
 Online (limit: 300 ms or asynch. in worker thread)
 LooksAlive (limit: 300 ms, recommended: < 50 ms)
 IsAlive (limit: 400 ms, recomm.: < 100 ms, or asynch.)
 Offline (limit: 300 ms, or asynch. in worker thread)
 Terminate (on error in offline or pending-timeout)
 Close (executed when deleting a resource)
 ResourceControl and ResourceTypeControl (for "private properties")
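Real resource DLLs are written in C against the Resource API; the following Python sketch is only a conceptual illustration of the entry points listed above. It models a hypothetical resource that controls one service process, with Online/Offline/LooksAlive/IsAlive methods a resource monitor would call and a callback playing the role of SetResourceStatus. All names and the example command are invented.

```python
import subprocess
import sys

class GenericServiceResource:
    """Conceptual stand-in for a resource DLL that controls one service process."""

    def __init__(self, service_cmd, report_status):
        self.service_cmd = service_cmd      # command line that runs the service
        self.report_status = report_status  # plays the role of SetResourceStatus
        self.process = None

    def online(self):
        # Online: bring the resource into service (must return quickly;
        # long-running work would be moved to a worker thread)
        self.process = subprocess.Popen(self.service_cmd)
        self.report_status("online")

    def looks_alive(self):
        # LooksAlive: cheap check, called frequently (default every 5 sec.)
        return self.process is not None and self.process.poll() is None

    def is_alive(self):
        # IsAlive: thorough check, called less often (default every 1 min.);
        # a real resource DLL might also probe the service's functionality here
        return self.looks_alive()

    def offline(self):
        # Offline: take the resource out of service gracefully
        if self.process is not None:
            self.process.terminate()
            self.process.wait()
        self.report_status("offline")

# A resource monitor would call looks_alive()/is_alive() periodically.
# The "service" here is just a sleeping Python process used as a placeholder.
res = GenericServiceResource([sys.executable, "-c", "import time; time.sleep(60)"],
                             report_status=print)
res.online()
print("looks alive:", res.looks_alive())
res.offline()
```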
(Flow chart: resource failure handling by the Cluster Service.)
 The Cluster Service calls the LooksAlive entry point every LooksAlivePollInterval milliseconds and the IsAlive entry point every IsAlivePollInterval milliseconds.
 If LooksAlive returns True, polling continues; if it returns False (or there is no response), IsAlive is called.
 If IsAlive returns True, polling continues; if it returns False (or there is no response), the resource has failed. (The resource DLL may also independently report that the resource has failed.)
 After a resource failure:
 If RestartAction is set to DontRestart, the resource remains in a failed state.
 Otherwise, if there have not yet been more than RestartThreshold restart attempts within RestartPeriod minutes, the Cluster Service attempts to restart the resource by calling the Online entry point. On success the resource is back online; on failure the next restart attempt is evaluated the same way.
 If there have been more than RestartThreshold restart attempts within RestartPeriod minutes, the RestartAction setting decides:
 RestartNoNotify: the resource remains in a failed state.
 RestartNotify: the resource is failed over to a new node and a restart is attempted there. If there have been more than FailoverThreshold failover attempts, the resource remains in a failed state; otherwise Online is called on the new node, and on success the resource is back online (on another node).
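The same decision logic can be condensed into a short Python sketch: after a failure the resource is restarted locally up to RestartThreshold times and then, depending on RestartAction, either left in the failed state or failed over to another node (up to FailoverThreshold attempts). This is a simplified teaching model, not the actual Cluster Service implementation; all identifiers are chosen for the example.

```python
from enum import Enum

class RestartAction(Enum):
    DONT_RESTART = 0
    RESTART_NO_NOTIFY = 1   # restart locally, never fail over
    RESTART_NOTIFY = 2      # restart locally, then fail over (default)

def handle_resource_failure(resource, nodes,
                            restart_action=RestartAction.RESTART_NOTIFY,
                            restart_threshold=1, failover_threshold=2):
    """Simplified model of the restart/failover flow for one failure event."""
    if restart_action is RestartAction.DONT_RESTART:
        return "failed"

    # 1. local restart attempts (the RestartPeriod window is not modeled here)
    for _ in range(restart_threshold):
        if resource.online(nodes[0]):
            return "online"

    if restart_action is RestartAction.RESTART_NO_NOTIFY:
        return "failed"

    # 2. fail over to other nodes, up to FailoverThreshold attempts
    for attempt, node in enumerate(nodes[1:], start=1):
        if attempt > failover_threshold:
            break
        if resource.online(node):
            return f"online on {node}"
    return "failed"

class FlakyResource:
    """Toy resource that only comes online on node 'B'."""
    def online(self, node):
        print(f"calling Online on node {node}")
        return node == "B"

print(handle_resource_failure(FlakyResource(), ["A", "B", "C"]))
```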
