Failed cluster actions in crm_mon

This document (7012145) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 11 (HAE)
SUSE Linux Enterprise Server 11

Situation

Cluster resources are running normally, but crm_mon still reports an error.
Alternatively, a cluster resource has failed to start on all nodes.

# /usr/sbin/crm_mon -r -1
--snip--
Online: [ node4 node5 node6 node19 ]

Full list of resources:

STONITH_SBD    (stonith:external/sbd): Started node19
Clone Set: ctdb-clone [ctdb]
  ctdb:0     (ocf::heartbeat:CTDB):  Started node4 FAILED
  ctdb:1     (ocf::heartbeat:CTDB):  Started node5 FAILED
  ctdb:2     (ocf::heartbeat:CTDB):  Started node6 FAILED
Resource Group: firewall_group
  External_IP (ocf:heartbeat:IPaddr):  Started on node6
  FW_Rules    (lsb:iptables):  Started on node6
  netmon      (ocf:heartbeat:ethmonitor):  Started on node6
Resource Group: apache2
  webip       (ocf:heartbeat:IPaddr):   Started node2
  websrv      (ocf:heartbeat:apache):   Stopped

Failed actions:
  ctdb:0_monitor_10000 (node=node4, call=155, rc=1, status=complete): unknown error
  ctdb:1_monitor_10000 (node=node5, call=135, rc=1, status=complete): unknown error
  ctdb:2_monitor_10000 (node=node6, call=499, rc=1, status=complete): unknown error
  netmon_monitor_7000 (node=node4, call=45, rc=-2, status=Time Out): unknown exec error
  websrv (node=node4, call=15, rc=5, status=complete): not installed
  websrv (node=node5, call=12, rc=5, status=complete): not installed
  websrv (node=node6, call=16, rc=5, status=complete): not installed

Resolution

Try cleaning up the failed resources with the crm shell. For example:

# crm resource cleanup ctdb-clone
# crm resource cleanup netmon node4
# crm resource cleanup apache2

In the Failed actions: list above, notice that netmon failed only on node4. The crm shell lets you reduce cluster communication by cleaning up the resource on that node alone. The other resources failed on all nodes and need to be cleaned up on all nodes, so the node argument is omitted from those commands.
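
The per-node versus cluster-wide choice described above can be sketched as a small filter over the crm_mon output. This is only an illustration under stated assumptions: the function name suggest_cleanup is hypothetical, the section header is assumed to read "Failed actions:" as in the output shown in this document, and "failed on exactly one node" is used as a simple stand-in for "does not need a cluster-wide cleanup". The sketch prints commands rather than running them, and it prints primitive IDs (for example ctdb), not the clone or group names used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: read `crm_mon -r -1` output on stdin and print
# (not execute) one suggested cleanup command per failed resource.
suggest_cleanup() {
    awk '
        /^Failed actions:/   { in_failed = 1; next }  # start of the section
        in_failed && NF == 0 { in_failed = 0 }        # a blank line ends it
        in_failed && match($0, /node=[^,)]+/) {
            node = substr($0, RSTART + 5, RLENGTH - 5)
            rsc = $1
            # Drop a trailing _<operation>_<interval>, e.g. _monitor_10000
            sub(/_(monitor|start|stop|promote|demote|notify|migrate_to|migrate_from)_[0-9]+$/, "", rsc)
            sub(/:[0-9]+$/, "", rsc)                  # clone instance: ctdb:0 -> ctdb
            if (!((rsc, node) in seen)) {
                seen[rsc, node] = 1
                if (!(rsc in count)) order[++n] = rsc
                count[rsc]++
                last[rsc] = node
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                rsc = order[i]
                if (count[rsc] == 1)
                    print "crm resource cleanup " rsc " " last[rsc]  # one node only
                else
                    print "crm resource cleanup " rsc                # several nodes
            }
        }
    '
}
```

A possible way to use it is `crm_mon -r -1 | suggest_cleanup`, reviewing the printed commands before running any of them by hand.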

Additional Information

Some errors are valid and current. Others are left over from previous failures and just need to be reset. In this case, the netmon resource is running successfully on node6 and only needs to be cleaned up to clear the old error from node4. The ctdb-clone resource is currently experiencing an error, as shown by the FAILED messages. You can try cleaning up this resource too. If it fails again, you will need to do more in-depth troubleshooting.
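
The distinction between current and leftover errors can be roughly approximated by comparing the Failed actions: list with the resource list in the same crm_mon output. The sketch below is an assumption-laden illustration, not a diagnostic tool: the function name classify_failures is hypothetical, and it treats a resource as a current problem only if its resource-list line contains FAILED or Stopped, which matches the output format shown in this document but may not hold for other versions.

```shell
#!/bin/sh
# Hypothetical helper: read `crm_mon -r -1` output on stdin and label each
# resource from the "Failed actions:" list as a current or a leftover error.
classify_failures() {
    awk '
        /^Failed actions:/   { in_failed = 1; next }
        in_failed && NF == 0 { in_failed = 0 }
        # Resource list line, e.g. "netmon (ocf:heartbeat:ethmonitor): Started on node6"
        !in_failed && $2 ~ /^\(.*\):$/ {
            rsc = $1
            sub(/:[0-9]+$/, "", rsc)                  # clone instance: ctdb:0 -> ctdb
            if ($0 ~ /FAILED|Stopped/) bad[rsc] = 1   # any bad instance marks the resource
            else if (!(rsc in bad))    bad[rsc] = 0
        }
        in_failed && /node=/ {
            rsc = $1
            sub(/_(monitor|start|stop|promote|demote|notify|migrate_to|migrate_from)_[0-9]+$/, "", rsc)
            sub(/:[0-9]+$/, "", rsc)
            if (!(rsc in seen)) { seen[rsc] = 1; order[++n] = rsc }
        }
        END {
            for (i = 1; i <= n; i++) {
                rsc = order[i]
                if (!(rsc in bad))  print rsc ": not found in resource list"
                else if (bad[rsc])  print rsc ": current failure"
                else                print rsc ": probably stale"
            }
        }
    '
}
```

Run as `crm_mon -r -1 | classify_failures`. A resource reported as "probably stale" is a candidate for a plain cleanup, while one reported as "current failure" is likely to fail again after cleanup and need further troubleshooting.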

Disclaimer

This Support Knowledgebase provides a valuable tool for NetIQ/Novell/SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7012145
  • Creation Date: 15-APR-13
  • Modified Date: 15-APR-13
    • SUSE Linux Enterprise High Availability Extension
    • SUSE Linux Enterprise Server
