cluster
A high-performance cluster is a group of computers (real or virtual) that share application load in order to achieve results quickly. A high availability cluster is designed primarily to secure the highest possible availability of services.
cluster partition
Whenever communication fails between one or more nodes and the rest of the cluster, a cluster partition occurs. The nodes of a partition are still active and able to communicate with each other, but they are unaware of the nodes with which they cannot communicate. As the loss of the other partition cannot be confirmed, a split brain scenario develops (see also split brain).
consensus cluster membership (CCM)
The CCM determines which nodes make up the cluster and shares this information across the cluster. Any new addition and any loss of nodes or quorum is delivered by the CCM. A CCM module runs on each node of the cluster.
cluster information base (CIB)
A representation of the whole cluster configuration and status (node membership, resources, constraints, etc.) written in XML and residing in memory. A master CIB is kept and maintained on the DC and replicated to the other nodes.
cluster resource manager (CRM)
The main management entity responsible for coordinating all nonlocal interactions. Each node of the cluster has its own CRM, but the one running on the DC is the one elected to relay decisions to the other nonlocal CRMs and process their input. A CRM interacts with a number of components: local resource managers both on its own node and on the other nodes, nonlocal CRMs, administrative commands, the fencing functionality, and the membership layer.
designated coordinator (DC)
The master node of the cluster. This node is where the master copy of the CIB is kept. All other nodes get their configuration and resource allocation information from the current DC. The DC is elected from all nodes in the cluster after a membership change.
distributed replicated block device (DRBD)
DRBD is a block device designed for building high availability clusters. The whole block device is mirrored via a dedicated network and is seen as a network RAID-1.
failover
Occurs when a resource or node fails on one machine and the affected resources are started on another node.
fencing
Describes the concept of preventing access to a shared resource by non-cluster members. Fencing can be achieved by killing (shutting down) a misbehaving node to prevent it from causing trouble, by locking resources away from a node whose status is uncertain, or in several other ways. A distinction is made between node fencing and resource fencing.
Heartbeat resource agent
Heartbeat resource agents were widely used with Heartbeat version 1. Their use is deprecated, but still supported in version 2. A Heartbeat resource agent can perform start, stop, and status operations and resides under /etc/ha.d/resource.d or /etc/init.d. For more information about Heartbeat resource agents, refer to http://www.linux-ha.org/HeartbeatResourceAgent.
local resource manager (LRM)
The local resource manager (LRM) is responsible for performing
operations on resources. It uses the resource agent scripts to
carry out the work. The LRM is
dumb in that it does not
know of any policy by itself. It needs the DC to tell it what to do.
LSB resource agent
LSB resource agents are standard LSB init scripts. LSB init scripts are not limited to use in a high availability context. Any LSB-compliant Linux system uses LSB init scripts to control services. Any LSB resource agent supports the start, stop, restart, status, and force-reload options and may optionally support try-restart and reload as well. LSB resource agents are located in /etc/init.d. Find more information about LSB resource agents and the actual specification at http://www.linux-ha.org/LSBResourceAgent and http://www.linux-foundation.org/spec/refspecs/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
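The shape of such an init script can be sketched as follows. This is an illustrative example only: "mydaemon" is a hypothetical service, and a plain state file stands in for a real daemon and its PID file under /var/run, so the script can run anywhere. The dispatcher is written as a shell function for easy inline testing.

```shell
#!/bin/sh
# Minimal sketch of an LSB-style init script (hypothetical service).
# A state file stands in for the running daemon process.
STATE="${TMPDIR:-/tmp}/mydaemon.state"

mydaemon_ctl() {
    case "$1" in
        start)
            touch "$STATE" && echo "Starting mydaemon: done"
            ;;
        stop)
            rm -f "$STATE" && echo "Stopping mydaemon: done"
            ;;
        restart|force-reload)
            mydaemon_ctl stop
            mydaemon_ctl start
            ;;
        status)
            # LSB status exit codes: 0 = running, 3 = not running
            if [ -f "$STATE" ]; then
                echo "mydaemon is running"
                return 0
            else
                echo "mydaemon is not running"
                return 3
            fi
            ;;
        *)
            echo "Usage: mydaemon_ctl {start|stop|restart|force-reload|status}" >&2
            return 2
            ;;
    esac
}

mydaemon_ctl start
mydaemon_ctl status
mydaemon_ctl stop
```

The status exit codes matter in a cluster context: the CRM relies on them to decide whether a resource is running on a node.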
node
Any computer (real or virtual) that is a member of a cluster and invisible to the user.
pingd
The ping daemon. It continuously contacts one or more servers outside the cluster with ICMP pings.
policy engine (PE)
The policy engine computes the actions that need to be taken to implement policy changes in the CIB. This information is then passed on to the transition engine, which in turn implements the policy changes in the cluster setup. The PE always runs on the DC.
OCF resource agent
OCF resource agents are similar to LSB resource agents (init scripts). Any OCF resource agent must support start, stop, and status (sometimes called monitor) options. Additionally, they support a metadata option that returns the description of the resource agent type in XML. Additional options may be supported, but are not mandatory. OCF resource agents reside in /usr/lib/ocf/resource.d/provider. Find more information about OCF resource agents and a draft of the specification at http://www.linux-ha.org/OCFResourceAgent and http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD.
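A skeletal OCF-style agent can be sketched as below. The agent and its state file are hypothetical stand-ins (modeled loosely on a dummy resource), but the option names and the exit codes shown (0 for success, 7 for "not running" from monitor) follow the OCF convention, and meta-data emits the XML self-description mentioned above.

```shell
#!/bin/sh
# Sketch of an OCF-style resource agent (hypothetical resource).
# A state file stands in for the managed service.
STATE="${TMPDIR:-/tmp}/myresource.state"

OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

myra() {
    case "$1" in
        start)
            touch "$STATE"
            return $OCF_SUCCESS ;;
        stop)
            rm -f "$STATE"
            return $OCF_SUCCESS ;;
        monitor|status)
            # 0 = resource active, 7 = resource cleanly stopped
            [ -f "$STATE" ] && return $OCF_SUCCESS
            return $OCF_NOT_RUNNING ;;
        meta-data)
            cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="myresource">
  <version>0.1</version>
  <longdesc lang="en">Illustrative dummy resource agent.</longdesc>
  <shortdesc lang="en">Dummy RA</shortdesc>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
            return $OCF_SUCCESS ;;
        *)
            return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

myra start
myra monitor && echo "resource active"
```

In a real deployment the script would live under /usr/lib/ocf/resource.d/provider and would read its parameters from OCF_RESKEY_* environment variables set by the LRM.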
quorum
In a cluster, a cluster partition is defined to have quorum (to be quorate) if it has the majority of nodes (or votes). Quorum distinguishes exactly one partition. It is part of the algorithm to prevent several disconnected partitions or nodes from proceeding and causing data and service corruption (split brain). Quorum is a prerequisite for fencing, which then ensures that quorum is indeed unique.
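The majority rule can be stated as a one-line check: a partition is quorate only if it holds strictly more than half of the cluster's votes, which is why an exact half (as in a two-node split) never has quorum on its own. A minimal sketch, with a function name of our choosing:

```shell
#!/bin/sh
# Majority-quorum check: a partition is quorate when it holds a
# strict majority of the cluster's votes (2 * partition > total).
has_quorum() {
    partition_votes=$1
    total_votes=$2
    [ $((partition_votes * 2)) -gt "$total_votes" ]
}

has_quorum 3 5 && echo "quorate"      # 3 of 5 votes: strict majority
has_quorum 2 4 || echo "not quorate"  # 2 of 4 votes: exact half, no quorum
```

The strict inequality is what guarantees that at most one partition can be quorate at any time.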
resource
Any type of service or application that is known to Heartbeat. Examples include an IP address, a file system, or a database.
resource agent (RA)
A resource agent (RA) is a script acting as a proxy to manage a resource. There are three different kinds of resource agents: OCF (Open Cluster Framework) resource agents, LSB resource agents (Standard LSB init scripts), and Heartbeat resource agents (Heartbeat v1 resources).
Single Point of Failure (SPOF)
A single point of failure (SPOF) is any component of a cluster that, should it fail, triggers the failure of the entire cluster.
split brain
A scenario in which the cluster nodes are divided into two or more groups that do not know of each other (through either a software or hardware failure). To prevent a split brain situation from badly affecting the entire cluster, STONITH must come to the rescue. Also known as a partitioned cluster scenario.
STONITH
The acronym for "Shoot the other node in the head", which basically means bringing down a misbehaving node to prevent it from causing trouble in the cluster.
transition engine (TE)
The transition engine (TE) takes the policy directives from the PE and carries them out. The TE always runs on the DC. From there, it instructs the local resource managers on the other nodes which actions to take.