
Cluster VOTES, EXPECTED_VOTES, and Quorum

The core requirement of an OpenVMS cluster is simple: to maintain data integrity; to avoid corrupting your data.

To this end, an OpenVMS cluster implements two key components: the Connection
Manager (CNXMAN) and the Distributed Lock Manager (DLM). CNXMAN controls
access into and membership in the cluster, and maintains and monitors connectivity.
The DLM provides services that coordinate access to arbitrary resources, whether files,
records within files, data structures, or most anything else that can be represented
by an application-defined text-based resource name.

CNXMAN and Quorum

Data integrity is achieved by coordinating access, which means that all cluster member
nodes synchronize activity with all other nodes. Unsynchronized or uncoordinated access
arises in cases of “partitioning” — when one or more nodes are not connected to and
communicating with one or more other nodes.

Partitioning is bad. Partitioning leads to uncoordinated access. Uncoordinated access
leads to data corruption.

CNXMAN requires that each node in a cluster have direct access to every other member
node, and ensures that the cluster has sufficient structures and resources to allow
processing; that is, that the cluster has and maintains a “quorum”. The combination of these two
CNXMAN requirements prevents partitioning, as nodes or groups of nodes without
quorum are not permitted to modify shared resources. Total connectivity avoids the
difficulties of ensuring, maintaining, and verifying delegation; each node can always
communicate with every other node, and can verify the cluster configuration directly.

Additional details on cluster communications are available here at HoffmanLabs.

To manage cluster quorum, CNXMAN uses a voting scheme. Each node contributes zero
or more votes, and EXPECTED_VOTES is set to the number of votes that should be
present across the entire cluster.

So long as more than half of the total number of votes are present and accounted for
within the cluster, the cluster has quorum and can modify shared resources. If half or
fewer of the votes are present, a partition is assumed and all shared processing is deliberately
halted. To implement this quorum scheme, the VOTES and EXPECTED_VOTES system
parameters, and the QDSKVOTES and DISK_QUORUM system parameters, must be set
appropriately on all cluster nodes. In aggregate, a value for the cluster quorum is derived
and compared against the number of votes actually present, and processing is
permitted or denied accordingly.
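
As a worked example of the arithmetic (the values here are illustrative), the quorum value is derived as (EXPECTED_VOTES + 2) / 2, using integer division. A three-node cluster with one vote per node would carry entries along the following lines in each node's MODPARAMS.DAT, applied via AUTOGEN and a reboot:

    ! MODPARAMS.DAT additions on each of three voting nodes (illustrative values)
    VOTES = 1                ! this node contributes one vote
    EXPECTED_VOTES = 3       ! total votes that should be present cluster-wide

The derived quorum is (3 + 2) / 2 = 2, so any two of the three nodes can continue processing; a single isolated node holds one vote, which is below quorum, and hangs rather than risk partitioned access.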

Should CNXMAN detect that the aggregate number of votes present is below quorum, the
node will deliberately lock up all non-critical processing, and will prevent all modify
access to shared resources. This is done to prevent corruptions. This case is better known as a
“quorum hang”; it is an interlock that prevents the corruption of user data.

The Quorum Disk

For configurations with few nodes, and particularly for two-node clusters, there is no host-
based voting mechanism by which either of the two nodes can continue upon an outage
involving the other. You are left to pick one node that must always be up, and a node that
can arrive or depart largely independently. While certainly useful, this particular
asymmetric configuration has its limitations.

One approach to avoid the asymmetry is to use a disk on a shared interconnect — DSSI
or SCSI, most commonly — that can be configured to contribute votes. This scheme is
known as the “quorum disk”. The quorum disk is implemented by setting the disk name
in the DISK_QUORUM parameter, and setting the number of votes in the QDSKVOTES
parameter.

The quorum disk is particularly valuable in these two-node configurations, to allow either
of the two nodes to continue processing when the other node is offline, so long as the
quorum disk is accessible.

Configuration caveat: do not configure the quorum disk inside the physical system
enclosure of one of the two systems, as this disk will itself typically be powered down
when the system is powered down, leaving it offline from both nodes.

The quorum disk is polled by what are known as “quorum disk watchers” (the nodes
with direct access to the shared interconnect). So long as the nodes read and write to a
reserved area on the quorum disk and only detect the expected nodes reading and writing
the disk, the quorum disk contributes its votes. If unexpected nodes are detected reading
and writing the quorum disk, the votes from the quorum disk will not be available.
Detection is based on writing and reading in a polling loop, and this means that the
scheme cannot react within less than a specified multiple of the polling interval
(QDSKINTERVAL).

The quorum disk watchers (those nodes with direct access to the quorum disk) should
have the same non-blank disk name setting across all of the watchers present. Cluster
member nodes that lack direct access to the quorum disk should generally have the
DISK_QUORUM parameter setting left blank.
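
A minimal sketch of the corresponding settings for a two-node cluster with a quorum disk follows; the device name $1$DGA10 and the vote counts are hypothetical, and should be replaced with the actual shared device and values in use:

    ! MODPARAMS.DAT additions on both watcher nodes (hypothetical device and values)
    VOTES = 1                    ! each host contributes one vote
    EXPECTED_VOTES = 3           ! one vote per host, plus one from the quorum disk
    DISK_QUORUM = "$1$DGA10"     ! same non-blank device name on both watchers
    QDSKVOTES = 1                ! votes contributed by the quorum disk
    ! QDSKINTERVAL sets the quorum disk polling interval described above

With quorum at (3 + 2) / 2 = 2, either host can continue with the quorum disk when the other host is down, while a host that loses both the other host and the quorum disk hangs.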

EXPECTED_VOTES and Data Corruptions


As it is clearly undesirable to encounter a hang, there can be some temptation to set
EXPECTED_VOTES too low.

Do not do this.

There are failure paths and console commands that can trigger corruptions when
EXPECTED_VOTES is set below the actual value. One of the more common cases involves
shared multi-host SCSI and booting two hosts from the same system disk root, whether due
to a firmware reset of the boot settings or through direct entry of an erroneous boot command. The
nodes sharing the SCSI and unintentionally sharing the system root will not be able to
establish communications due to the duplicated cluster and network addresses, and — if
each node finds that it has quorum because EXPECTED_VOTES was set too low — will
proceed to corrupt the shared storage.

Always set EXPECTED_VOTES to the number of VOTES that will be present cluster-
wide.
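
One way to sanity-check the values in effect on a running node is a quick look with SYSGEN and the DCL F$GETSYI lexical; this is a sketch, and assumes the commonly used CLUSTER_VOTES and CLUSTER_QUORUM item codes:

    $ MCR SYSGEN
    SYSGEN> SHOW EXPECTED_VOTES                    ! this node's parameter setting
    SYSGEN> SHOW VOTES
    SYSGEN> EXIT
    $ WRITE SYS$OUTPUT F$GETSYI("CLUSTER_VOTES")   ! votes currently present cluster-wide
    $ WRITE SYS$OUTPUT F$GETSYI("CLUSTER_QUORUM")  ! quorum value currently in effect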

The only typical reason to set EXPECTED_VOTES low is during cluster formation or
error recovery; for instance, to allow specific configurations to mount and initialize the quorum disk
structures. Once connections are established, the running quorum value
derived from EXPECTED_VOTES is automatically floated upward and corrected.
(This means that the exposure to the problem is limited
to environments with degraded connectivity. It also means that no further use is
made of the low setting once connections have been established; quorum calculations will be
based on the total votes present.)

Configuration caveat: the quorum disk structures and the QUORUM.DAT file can only
be created and initialized when the system parameter STARTUP_P1 is set for a full
OpenVMS system startup, and when the quorum disk can be mounted and accessed from
a quorum disk watcher node that has quorum without the votes from the quorum
disk.

The Distributed Lock Manager

The Distributed Lock Manager (DLM) is the central mechanism for coordination within
OpenVMS, and within an OpenVMS Cluster.

The DLM operates on far more than files and records. DLM coordination is based
entirely on text-based resource names, and not on any particular objects. The resource
name might represent a device name, a file, a record within a file, or most anything
else of interest. Cooperating applications can use the DLM to select a primary process
from a pool of available server processes, to notify applications of changes to a global
section, or for any number of other activities requiring distributed coordination.

Within a cluster configuration, the DLM coordinates access to shared resources such as
directories, and can be used to maintain connectivity and to receive notifications when an
application or a host exits the cluster.

A future article here at HoffmanLabs will provide details on the DLM.

Quorum Hang

The quorum hang is neither designed nor intended to annoy you. It is intended to protect
your data from corruption.

On various system consoles, you can request the IPL C (IPC) handler to attempt to clear a
quorum hang, though this can easily run afoul of either console restrictions or of the
cluster sanity timers. See the OpenVMS Documentation for details of the IPC handler.

On VAX and Alpha, you deposit an interrupt request register value that triggers
the handler and then continue processing, as sketched below. Specific Alpha systems — including the
AlphaServer ES47, AlphaServer ES80, and AlphaServer GS1280 — cannot continue
from a halt, so you must use other means to access the underlying mechanisms and clear
the hang. If you remain at the console too long, you will trigger the cluster sanity timers,
and the node will be forced out of the cluster.
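
As an illustration, the classic VAX console sequence is roughly as follows; the register deposited and the available IPC commands vary by platform and version, so treat this as a sketch rather than a recipe:

    >>> D/I 14 C       ! deposit IPL C into the software interrupt request register
    >>> C              ! continue; the IPC> prompt appears on the console
    IPC> Q             ! request a recalculation of the quorum
    IPC> ^Z            ! exit IPC and resume processing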

On Integrity, OpenVMS I64 implements a ^P (CTRL/P) handler within the console
environment, and the handler triggers IPC-like processing on V8.2-1 and later.

With the various difficulties around IPC and the requirement for local console-level
access, the preferred approach is to have AMDS or Availability Manager configured and
available, and to use AMDS or AvailMan to capture and clear the quorum condition and
to resume processing. AMDS and AvailMan allow the quorum to be adjusted from
any authorized client system, from anywhere within the local LAN.

If you are shutting down nodes in a cluster in a controlled fashion, the command
SET CLUSTER /EXPECTED_VOTES is available, and is the easiest means to
incrementally reduce the running calculation of the quorum. Issue this command on one
of the remaining nodes after each voting node is shut down. A similar command
sequence can be used if a key node crashes and is expected to be down for some time.
This reconfigures the cluster toward improved uptime of the remaining nodes, should
subsequent failures arise.
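
For example, issued on any remaining member after a voting node has left (the explicit vote count shown is hypothetical):

    $ SET CLUSTER /EXPECTED_VOTES       ! recalculate based on the votes currently present
    $ SET CLUSTER /EXPECTED_VOTES=3     ! or state the expected cluster-wide total explicitly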

Configuration Tools, Recommendations

To enable clustering on an OpenVMS node, you will be using CLUSTER_CONFIG or
CLUSTER_CONFIG_LAN, and you can potentially be making manual changes to
parameters.

When enabling clustering, HoffmanLabs recommends you always set VAXCLUSTER to
2, and that you do not set VAXCLUSTER to 1. Do this whether or not you set
NISCS_LOAD_PEA0. Setting VAXCLUSTER to 2 avoids a potential corruption path
that can arise during cases of degenerate failures.
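
If you are making manual parameter changes rather than relying entirely on the configuration procedures, the relevant MODPARAMS.DAT entries look along these lines (a sketch; CLUSTER_CONFIG and CLUSTER_CONFIG_LAN normally manage these, and AUTOGEN plus a reboot applies them):

    ! MODPARAMS.DAT additions (a sketch; adjust for your configuration)
    VAXCLUSTER = 2           ! always participate in a cluster; do not use 1
    NISCS_LOAD_PEA0 = 1      ! load PEDRIVER, enabling cluster communications over the LAN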

The parameter NISCS_LOAD_PEA0 can enable all NICs found for cluster
communications. This usage is normal and desirable for most clusters, and using the
network path(s) assists in more easily meeting the CNXMAN requirement for total
connectivity. There are additional mechanisms available to configure and control which
NICs should be used for cluster communications and to control device prioritization of
the cluster traffic, including the SCACP tool.
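
As a rough sketch of the SCACP approach (the subcommands and qualifiers shown are indicative rather than exact, and the device name EWA0 is hypothetical):

    $ MCR SCACP
    SCACP> SHOW LAN_DEVICE                     ! list LAN devices available for cluster traffic
    SCACP> SET LAN_DEVICE EWA0 /PRIORITY=2     ! prefer this NIC for cluster communications
    SCACP> EXIT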

Consider use of the CHECK_CLUSTER system parameter, which causes the
OpenVMS bootstrap to explicitly confirm with the system manager that the node should
be booted with VAXCLUSTER set to zero. This can be used as a safety mechanism, to
catch cases where a node is inadvertently booted with clustering turned off. Booting with
clustering disabled should only be done with care, typically only when strictly local
resources are accessed, and only when NISCS_LOAD_PEA0 is also set to zero.

If feasible, a third voting node is preferable to a quorum disk. This is because a third node
can speed cluster transitions, and it avoids the need for a shared interconnect for the quorum disk.

Configuration caveat: You cannot use Host-based Volume Shadowing (HBVS) to
shadow the quorum disk. This restriction avoids cases where the shadowset members become
visible individually. For purposes of avoiding disk failures, you can use controller-level
RAID — so long as there is no possibility that the RAIDset can be split and constituent
disks can become directly visible from multiple nodes.

Configuration caveat: do not configure a quorum disk on a private interconnect;
quorum disks should always be on shared buses. Having a single path to a quorum
disk means that the host serving that path must be up, so for reasons of simplicity and
speed it is easier to have that host contribute the votes. There is no point in having only a
single quorum disk watcher for a quorum disk. Move the quorum vote(s) to the host and
remove the quorum disk, or move the quorum disk onto a shared multi-host bus.

Further Reading

• Low-End Cluster VOTES, EXPECTED_VOTES, for a discussion of two-node or
few-node configurations; configurations using shared SCSI or other shared
interconnect and a quorum disk.
• OpenVMS Tips: Cluster VOTES, EXPECTED_VOTES, and Quorum
• OpenVMS Tips: Disk Allocation Classes, Shadowing, Quorum Disks

Cluster Configuration Assistance

Formal cluster configuration assistance and troubleshooting is available; contact
HoffmanLabs for details and availability.
