Failed cluster actions in crm_mon

This document (7012145) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 11 (HAE)
SUSE Linux Enterprise Server 11

Situation

Cluster resources appear to be running fine, but errors show up in crm_mon.
In some cases, a cluster resource has failed to start on all nodes.

# /usr/sbin/crm_mon -r -1
--snip--
Online: [ node4 node5 node6 node19 ]

Full list of resources:

STONITH_SBD    (stonith:external/sbd): Started node19
Clone Set: ctdb-clone [ctdb]
  ctdb:0     (ocf::heartbeat:CTDB):  Started node4 FAILED
  ctdb:1     (ocf::heartbeat:CTDB):  Started node5 FAILED
  ctdb:2     (ocf::heartbeat:CTDB):  Started node6 FAILED
Resource Group: firewall_group
  External_IP (ocf:heartbeat:IPaddr):  Started on node6
  FW_Rules    (lsb:iptables):  Started on node6
  netmon      (ocf:heartbeat:ethmonitor):  Started on node6
Resource Group: apache2
  webip       (ocf:heartbeat:IPaddr):   Started node2
  websrv      (ocf:heartbeat:apache):   Stopped

Failed actions:
  ctdb:0_monitor_10000 (node=node4, call=155, rc=1, status=complete): unknown error
  ctdb:1_monitor_10000 (node=node5, call=135, rc=1, status=complete): unknown error
  ctdb:2_monitor_10000 (node=node6, call=499, rc=1, status=complete): unknown error
  netmon_monitor_7000 (node=node4, call=45, rc=-2, status=Time Out): unknown exec error
  websrv (node=node4, call=15, rc=5, status=complete): not installed
  websrv (node=node5, call=12, rc=5, status=complete): not installed
  websrv (node=node6, call=16, rc=5, status=complete): not installed

Resolution

Try cleaning up the failed resources with the crm shell. For example:

# crm resource cleanup ctdb-clone
# crm resource cleanup netmon node4

# crm resource cleanup apache2

From the "Failed actions:" list above, notice that netmon failed only on node4. The crm shell lets you reduce cluster communication by cleaning up a resource on that node only. The other resources failed on all nodes and need to be cleaned up on all nodes, so the node argument was left off those crm shell commands.
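The rule described above (clean up on a single node when a resource failed on only one node, otherwise clean up cluster-wide) can be applied mechanically to the "Failed actions:" list. The following is a minimal sketch, not a SUSE tool: the function name suggest_cleanup is hypothetical, it only prints the commands so they can be reviewed before running, and it assumes the crm_mon output format shown in the Situation section. It reduces each failed action to its primitive ID (for example ctdb or websrv), so you may still prefer to clean up the parent clone or group (ctdb-clone, apache2) instead.

```sh
#!/bin/sh
# suggest_cleanup: hypothetical helper, not part of crmsh or Pacemaker.
# Reads "crm_mon -r -1" output on stdin and PRINTS (does not run) one
# "crm resource cleanup" command per resource found under "Failed actions:".
# Rule used: a resource that failed on exactly one node gets a node-specific
# cleanup; a resource that failed on several nodes gets a cluster-wide one.
suggest_cleanup() {
    awk '
        # The list header; its wording differs between Pacemaker versions.
        /^Failed (Resource )?[Aa]ctions:/ { in_failed = 1; next }
        # A blank line ends the list.
        in_failed && NF == 0 { in_failed = 0 }
        in_failed && /\(node=/ {
            rsc = $1
            # Strip an operation/interval suffix such as "_monitor_10000".
            sub(/_(start|stop|monitor|promote|demote|notify|migrate_to|migrate_from)_[0-9]+$/, "", rsc)
            # Strip a clone instance number such as ":0".
            sub(/:[0-9]+$/, "", rsc)
            # Extract the value of "node=..." up to the next comma.
            match($0, /node=[^,]+/)
            node = substr($0, RSTART + 5, RLENGTH - 5)
            key = rsc SUBSEP node
            if (!(key in seen)) {
                seen[key] = 1
                if (!(rsc in count)) order[++n] = rsc   # keep first-seen order
                count[rsc]++
                lastnode[rsc] = node
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                rsc = order[i]
                if (count[rsc] == 1)
                    print "crm resource cleanup " rsc " " lastnode[rsc]
                else
                    print "crm resource cleanup " rsc
            }
        }
    '
}

# Usage: crm_mon -r -1 | suggest_cleanup
```

Against the sample output in the Situation section, this would suggest a node-specific cleanup for netmon on node4 and cluster-wide cleanups for ctdb and websrv, matching the reasoning above.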

Cause


Additional Information

Some errors are valid and current. Others are left over from previous failures and just need to be reset. In this case, the netmon resource is running successfully on node6, so its failure on node4 only needs to be cleaned up to clear the error. The ctdb-clone resource, however, is currently experiencing an error, as shown by the FAILED messages. You can try cleaning up this resource too. If it fails again, more in-depth troubleshooting is needed.
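Whether a cleanup actually fixed a current error, or the resource failed again, can be checked by looking for FAILED next to the resource in a fresh crm_mon run. The sketch below assumes the resource-list format shown in the Situation section (for example "ctdb:0 ... Started node4 FAILED"); the function name still_failed and the 60-second wait in the usage comment are hypothetical choices, not part of any SUSE or Pacemaker tool.

```sh
#!/bin/sh
# still_failed: hypothetical helper, not part of crmsh or Pacemaker.
# Reads "crm_mon -r -1" output on stdin. Exits 0 if any instance of the
# resource named in $1 (for a clone: ctdb:0, ctdb:1, ... when $1 is "ctdb")
# is shown with FAILED in the resource list, and exits 1 otherwise.
still_failed() {
    awk -v rsc="$1" '
        {
            name = $1
            sub(/:[0-9]+$/, "", name)   # map a clone instance to its primitive
            if (name == rsc && / FAILED/) found = 1
        }
        END { exit !found }
    '
}

# Usage sketch (the 60-second wait is an arbitrary assumption; it should be
# long enough for the resource's monitor operation to run again):
#   crm resource cleanup ctdb-clone
#   sleep 60
#   if crm_mon -r -1 | still_failed ctdb; then
#       echo "ctdb is still FAILED: more in-depth troubleshooting is needed"
#   fi
```

Only lines whose first field is the resource (or one of its clone instances) are checked, so entries in the "Failed actions:" list, such as ctdb:0_monitor_10000, do not count as a current failure on their own.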

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7012145
  • Creation Date: 15-Apr-2013
  • Modified Date: 03-Mar-2020
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server
