azure-events-az: The Scheduled Events processor for use across Azure Availability Zones


Redevelopment of the original azure-events resource agent

Not everyone is crazy enough to go line-by-line through someone else’s code when it’s poorly documented and, as a result, difficult to support.

Me: Enters stage left

Hi!

My name is Grant and I support Microsoft’s Azure Customer Service and Support (CSS) team directly so they can support their end-customers with Pay-As-You-Go SUSE workloads running in Azure.

For this blog post, I read through every line of the azure-events-az resource agent (source located here) in an effort to fix a gaping knowledge gap for this agent.

After taking twelve pages of hand-written notes, submitting several improvements to the upstream project, running tests in a cluster, and revisiting the code, here’s what I learned along my journey.

The fact I could even do this is only made possible with the awesome power of open-source!

What is the point of this blog post?

If more people understand how this agent works, I get to do less unnecessary work, Azure Support also gets to do less unnecessary work, and customers don’t need to put in unnecessary support requests. It’s a win/win/win situation!

What is the purpose of this agent?

In order to make the cluster resilient against planned and unplanned platform maintenance that may cause service disruption, this agent regularly polls the Azure Instance Metadata Service (IMDS) API for information about upcoming Scheduled Events. When a relevant upcoming event is detected, the agent prepares the affected node by marking it as unhealthy, which it does by assigning a value to a special attribute, #health-azure. In response to this change, the cluster gracefully stops active resources on the node before the event is processed.

Migration to the new azure-events-az agent is recommended, but not required. New cluster installations should implement azure-events-az. If the old agent is obsoleted at some point, an announcement will be made and this blog post will be updated (or taken down if better documentation becomes available).
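To make the polling concrete, here is roughly what a Scheduled Events query looks like from inside a VM. This is a hand-rolled illustration, not the agent’s own code; the endpoint and api-version are the documented IMDS values:

```shell
# Query the IMDS Scheduled Events endpoint from inside the VM.
# The Metadata:true header is mandatory; IMDS rejects requests without it.
curl -s -H "Metadata:true" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# A response with no pending events looks like:
# {"DocumentIncarnation": 1, "Events": []}
```

When an event such as a Reboot is pending, the Events array contains entries with fields like EventId, EventType, EventStatus, and NotBefore.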

Why the rewrite?

The original azure-events resource agent is functionally limited to cluster nodes located in an Availability Set, which is only possible within a single data center. The original agent does not work with clusters spread across Availability Zones, in which nodes are located in different data centers within the same geographical region. The original agent depends on a node’s ability to query the Azure Instance Metadata Service (IMDS) API for its own scheduled events as well as the events of its peer node(s). Instance metadata is not exposed to nodes that are not within the same Availability Set, hence the need for an events processing agent that works without access to peer node instance metadata.

What changed with the updated agent?

The global cluster attribute attr_globalPullState is made obsolete, as nodes no longer need to be aware of whether a peer node is already pulling instance metadata. With the original agent, instance metadata must be pulled serially: other nodes are prevented from polling for scheduled events while attr_globalPullState does not have a value of AVAILABLE.

New node-level attributes azure-events-az_curNodeState, azure-events-az_lastDocVersion, azure-events-az_pendingEventIDs, and special attribute #health-azure are added to the CIB in order to keep track of event processing and define node health status.

Adds new resource agent parameters which allow end-users to modify how frequently a scheduled event poll is retried (retry_wait), the retry count before the RA determines polling failure (retry_count), and the timeout for when a poll does not receive a response (request_timeout).

Changes the action taken when preparing for an upcoming event. In the original version of azure-events, the affected node is placed into standby mode by setting the standby=true node-level attribute. With the new rewrite, a special node-level attribute, #health-azure, is used instead. This special attribute should not be confused with the HEALTHY and UNHEALTHY general node status, which is determined by the membership state of a node and changes when a node unexpectedly disappears.

Works with nodes spread across different Availability Zones.

The Node class block is reduced, and the raAzEvents class is extended to address the change in functionality.

What’s the same?

The clusterHelper class block and section that defines node status are virtually identical to the original (except for an updated timestamp in a comment).

Much of the Node class block is retained with some functionality added.

Why add a new agent instead of updating the original?

In two words: Risk reduction. According to the developer of the new agent, this was a mutual decision between Microsoft and SUSE engineering to avoid the risk of disruption to existing workloads that do not need the features added by this agent. To force all users to use the rewritten agent unnecessarily would be a bad business decision if anything were to go wrong, as it would create more work for customers whose deployments are not affected by the limitations of the original agent.

How does it work under the hood? Gimme the gory details!

Each node has its own azure-events-az_lastDocVersion attribute recorded in the CIB. When an event is added to an instance, the document version published by the API is incremented. As long as the document version in the IMDS Scheduled Events metadata matches the version set in the CIB, the monitor completes with no action.

When the scheduled events query returns a newer document version, and the event is one configured to mark the node as unhealthy, the agent adds the event ID(s) to azure-events-az_pendingEventIDs and sets the #health-azure attribute to -1000000, which is used by a node-level location constraint to force all resources away from the node. Once all resources are stopped, the platform event can be processed safely. The agent then sends an API request to begin the queued scheduled event(s) immediately and records azure-events-az_lastDocVersion with the value pulled from the IMDS.

When Pacemaker comes back online, the agent executes a one-time monitor “probe” action. If all events have been processed, the agent can then safely put the node back online.

To make the cluster more resilient against a series of events, one may add node-health-strategy=custom to the cluster property line and allow-unhealthy-nodes=true to the azure-events-az clone resource’s meta line. This allows only the scheduled events agent resource to start up and continue running local monitors even on a node marked as unhealthy. In most cases, the initial probe should do what’s needed, but this configuration adds a layer of resiliency.
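The flow above can be sketched in plain shell against IMDS and the Pacemaker CLI. This is a simplified illustration of what the agent does internally, not its actual code; it assumes jq is installed and only handles the first pending event:

```shell
IMDS="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# 1. Poll the Scheduled Events metadata.
doc="$(curl -s -H 'Metadata:true' "$IMDS")"

# 2. Compare the published document version with the last one recorded in the CIB.
new_ver="$(echo "$doc" | jq -r '.DocumentIncarnation')"
old_ver="$(crm_attribute --node "$(hostname)" \
  --name azure-events-az_lastDocVersion --query --quiet 2>/dev/null)"

if [ "$new_ver" != "$old_ver" ]; then
  # 3. Mark the node unhealthy so the location constraint pushes resources away.
  crm_attribute --node "$(hostname)" --name '#health-azure' --update -1000000

  # 4. Once resources are stopped, approve the event so the platform processes
  #    it immediately instead of waiting for its NotBefore time.
  event_id="$(echo "$doc" | jq -r '.Events[0].EventId')"
  curl -s -H 'Metadata:true' -X POST \
    -d "{\"StartRequests\": [{\"EventId\": \"$event_id\"}]}" "$IMDS"
fi
```

The POST body with StartRequests is the documented IMDS mechanism for starting a scheduled event early; the real agent additionally tracks pending event IDs and updates azure-events-az_lastDocVersion once processing completes.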

In contrast to the new agent, the original azure-events resource agent expects the ability to take a peer node out of standby. By default, only Reboot and Redeploy will trigger a migration. The Freeze event may optionally be added to the primitive’s eventTypes parameter in the CIB.
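If Freeze should also trigger the behavior, it can be appended to the eventTypes parameter. Assuming the primitive is named health-azure-events, as in the configuration shown later in this post, the change could look like:

```shell
# Append Freeze to the event types the agent reacts to.
# (Primitive name health-azure-events is assumed from the example config.)
crm resource param health-azure-events set eventTypes "Reboot,Redeploy,Freeze"
```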

IMPORTANT: Depending on cluster configuration, the affected resources may or may not be restarted on a different node. Sometimes a “migration” is not needed; such is the case with SAP’s Enqueue Replication Server (ERS) in an SAP Central Services cluster. In this case, the resource(s) will simply be started up again on the affected node once it is healthy again.

The azure-events and azure-events-az resource agents also pick up on Reboot and Redeploy user events initiated with the platform controls. When a user presses the “Reboot” button in the Azure Portal WebUI or initiates a reboot with azure-cli, the platform does not immediately reboot the VM. It first places a scheduled event in the instance metadata, which the azure-events-az resource agent picks up on its next poll to the IMDS scheduled events API. The agent then sets the #health-azure attribute for that node accordingly. The location constraint then gives any resource whose name does not start with “health-” a score of -1000000 (effectively -INF) and moves it off of the affected node. I tested whether this worked without a location constraint using an ocf:heartbeat:Dummy resource, and found that the resource was not automatically migrated when #health-azure was set to -1000000 unless I set this location constraint.
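That test can be reproduced with a throwaway Dummy resource. The resource name rsc_dummy is arbitrary, and setting #health-azure by hand only simulates what the agent does, so reset it afterwards:

```shell
# Create a no-op test resource.
crm configure primitive rsc_dummy ocf:heartbeat:Dummy op monitor interval=30s

# Simulate the agent marking the local node unhealthy.
crm_attribute --node "$(hostname)" --name '#health-azure' --update -1000000

# With the location constraint and node-health-strategy=custom in place,
# rsc_dummy should now move off this node; verify with:
crm_mon -1

# Reset the node back to healthy when done.
crm_attribute --node "$(hostname)" --name '#health-azure' --update 0
```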

PRO TIP:

By default in Azure, there is a 216-second (3 minute 36 second) delay on Pacemaker startup, set in /etc/sysconfig/sbd. This was an optimization done to resolve intermittent startup issues. If the node doesn’t rejoin immediately after starting up, it may be expected behavior. Give it some time; it should rejoin on its own within 5 minutes. One must also take into account that when a reboot is triggered by an operator, there is a delay between when the event is scheduled and when it is picked up and executed. If it takes a node longer than 8 minutes to rejoin, then something may be wrong.
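The startup delay is controlled by the SBD_DELAY_START setting, so a quick check on the node shows what is actually configured (216 seconds is what this post describes; your deployment may differ):

```shell
# Inspect the startup delay configured for SBD.
grep ^SBD_DELAY_START /etc/sysconfig/sbd
```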

What to do if you are switching to the new agent:

    1. Put the cluster into maintenance mode.
    2. Delete the azure-events primitive, the cln_azure-events clone resource, and the attr_globalPullState property from the CIB.
    3. Remove standby=false from the node attributes (optional; it can be retained for convenience if an admin/operator plans to do manual maintenance tasks in standby mode).
    4. Run the following commands:
      # crm configure property node-health-strategy=custom
      # crm configure location loc_azure_health /'!health-.*'/ rule '#health-azure': defined #uname

      Run the below command for all the nodes in the cluster, replacing <node> with a node name.

      # crm_attribute --node <node> --name '#health-azure' --update 0

      Then run the following once to create the resource primitive and its clone:

      # crm configure primitive health-azure-events ocf:heartbeat:azure-events-az \
      params eventTypes="Reboot,Redeploy" verbose=false retry_count=3 retry_wait=20 request_timeout=15 \
      op start start-delay=60s \
      op monitor interval=10s \
      meta failure-timeout=120s
      
      # crm configure clone health-azure-events-cln health-azure-events \
      meta allow-unhealthy-nodes=true
    5. Take the cluster out of maintenance mode.

Under what condition should “Freeze” be added as an event to stop cluster services?

As a general rule, the Freeze event should not be included: it is unlikely to cause an issue and in many cases completes within one second. In a majority of cases, failover due to an upcoming scheduled Freeze event costs more in terms of uptime than the Freeze event itself. If the cost of downtime is not a concern, one may still choose to protect their services in case anything goes wrong with the Freeze process. Again, a Freeze event is extremely unlikely to cause an issue under normal circumstances, but some may choose to let it trigger a failover depending on the nature of their workload.

What does the resource agent look like in the CIB after configuring based on the Azure Doc?

I’m glad you asked.  Please use this as a reference (omissions of irrelevant config indicated with […]), in case you need to check whether a customer has configured their agent in compliance with best practices:

node 1: node1 \
attributes #health-azure=0 azName=node1
node 2: node2 \
attributes #health-azure=0 azName=node2

[...]

primitive health-azure-events azure-events-az \
params eventTypes="Reboot,Redeploy" verbose=false retry_count=3 retry_wait=20 request_timeout=15 \
op start start-delay=60s interval=0s \
op monitor interval=10s \
meta failure-timeout=120s
[...]

clone health-azure-events-cln health-azure-events \
meta allow-unhealthy-nodes=true interleave=true
location loc_azure_health /!health-.*/ \
rule #health-azure: defined #uname

property cib-bootstrap-options: \
[...]
node-health-strategy=custom \
[...]

The node attributes azure-events-az_curNodeState, azure-events-az_pendingEventIDs, and azure-events-az_lastDocVersion will be generated as events are processed.
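Once events have been processed, these attributes can be inspected per node with crm_attribute (the node name node1 is taken from the example configuration above):

```shell
# Query the event-tracking attributes for a given node.
for attr in azure-events-az_curNodeState \
            azure-events-az_pendingEventIDs \
            azure-events-az_lastDocVersion; do
  crm_attribute --node node1 --name "$attr" --query
done
```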

Additional Resources:

Official definition in Event Properties table here:
https://learn.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events#event-properties

Official azure-events-az documentation here:
https://learn.microsoft.com/en-us/azure/sap/workloads/high-availability-guide-suse-pacemaker?tabs=msi#configure-pacemaker-for-azure-scheduled-events
