PostgreSQL resource fails to start with "My data may be inconsistent"

This document (7023166) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise High Availability Extension 12

Situation

A PostgreSQL database is configured in a Pacemaker cluster with a setup similar to the following:

/usr/sbin/crm configure show
node 1084772924: zrs-db-t-01 \
        attributes pgsql-data-status=LATEST standby=off
node 1084772925: zrs-db-t-02 \
        attributes pgsql-data-status="STREAMING|ASYNC" standby=off
primitive Service_ip IPaddr2 \
        params ip=192.168.82.62 cidr_netmask=24 \
        op monitor interval=10s timeout=20s \
        meta target-role=Started multiple-active=stop_start
primitive pgsql pgsql \
        params pgdata="/data/pgdata" config="/data/pgdata/postgresql.conf" rep_mode=async node_list="zrs-db-t-01 zrs-db-t-02" master_ip=192.168.82.62 repuser=repuser restore_command="test ! -f archivedir/%f && cp %p  archivedir/%f" restart_on_promote=true \
        op monitor timeout=30 interval=30 depth=0
ms ms_pgsql pgsql \
        meta notify=true master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 target-role=Started
colocation col_vip_on_prim inf: Service_ip ms_pgsql:Master
order demote-then-stop-ip Mandatory: _rsc_set_ ms_pgsql:demote Service_ip:stop symmetrical=false
order promote-then-ip Mandatory: _rsc_set_ ms_pgsql:promote Service_ip:start symmetrical=false
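
To see at a glance which node holds which data status, the pgsql-data-status attributes can be pulled out of the configuration output above. The following is a minimal sketch, not part of the resource agent: the function name pgsql_data_status is hypothetical, and it assumes the attribute appears on the "attributes" line that follows each "node" line, as in the listing above.

```shell
#!/bin/sh
# Sketch only: print "<node name> <pgsql-data-status>" for each node found
# in "crm configure show" output read from stdin.
# pgsql_data_status is a hypothetical helper name, not a cluster tool.
pgsql_data_status() {
    awk '
    # Remember the node name from lines such as: node 1084772924: zrs-db-t-01 \
    /^node / { node = $3 }
    # Find the attribute on the following "attributes" line.
    /pgsql-data-status=/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^pgsql-data-status=/) {
                val = $i
                sub(/^pgsql-data-status=/, "", val)
                gsub(/"/, "", val)   # drop quotes around values like "STREAMING|ASYNC"
                print node, val
            }
        }
    }'
}
```

It could be used as: crm configure show | pgsql_data_status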


When the slave node is set to standby and then set back online, the PostgreSQL resource agent fails to start the resource with the following error:

My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start

Resolution

Before putting the node back online, remove the lock file:

rm /var/lib/pgsql/tmp/PGSQL.lock

Then set the node online:

crm node online <node name>
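
The two steps can also be wrapped in a small script so that the lock file is always removed before the node is set online. This is a sketch under assumptions, not an official tool: the function names remove_pgsql_lock and bring_node_online are hypothetical, the default lock path is taken from the error message above, and the script is assumed to run as root on the node that reported the error.

```shell
#!/bin/sh
# Sketch only: remove a stale pgsql lock file, then set the node online.
# The default path matches the error message; override LOCK_FILE if the
# resource agent is configured with a different temporary directory.
LOCK_FILE="${LOCK_FILE:-/var/lib/pgsql/tmp/PGSQL.lock}"

# Remove the lock file if it exists.
# Succeeds when no file is left at the given path afterwards.
remove_pgsql_lock() {
    lock="${1:-$LOCK_FILE}"
    if [ -e "$lock" ]; then
        echo "removing $lock"
        rm -f -- "$lock" || return 1
    else
        echo "no lock file at $lock"
    fi
    [ ! -e "$lock" ]
}

# Remove the lock file first; only then set the node online.
bring_node_online() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: bring_node_online <node name>" >&2
        return 2
    fi
    remove_pgsql_lock && crm node online "$node"
}
```

After sourcing the script, a call such as bring_node_online zrs-db-t-02 performs both steps. As with the manual procedure, it should only be used when the standby is known to have been intentional.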

Cause

The resource agent has only limited information about the actual state of the database because of how synchronization between the nodes works in PostgreSQL. Data corruption is a possibility even in this case.

As a result, the agent cannot distinguish between a legitimate standby (as in this case) and data corruption. Manual intervention is therefore needed even when the user knows that enabling the slave again is safe: the lock file has to be deleted before the slave is set back online.

This is a known limitation of the resource agent at the moment.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 7023166
  • Creation Date: 10-Jul-2018
  • Modified Date: 03-Mar-2020
    • SUSE Linux Enterprise High Availability Extension
