Estimate Multipath Timeouts

This document (7016305) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 Service Pack 3 (SLES 11 SP3)
Device Mapper Multipathing IO (MPIO) on Fibre Channel SAN


Situation

It might be desirable to estimate how long it takes before the Linux kernel reports a path failure or a failed LUN to the upper layers, and to predict the effect of the parameters in /etc/multipath.conf.
This is useful when timeouts need to be defined for applications, monitoring agents, watchdogs, or high availability clusters.

Resolution

Overview

Simplified, three factors define the time taken by the Linux kernel before an inaccessible LUN leads to IO errors:

 - number of recently seen active paths (seen_paths)
 - polling interval (polling_interval)
 - path retries (no_path_retry)

The two factors polling_interval and no_path_retry are configuration parameters of MPIO. All three factors are explained below. The formula can be written as:

time = ( seen_paths * polling_interval * no_path_retry ) + x

The additional x at the end varies between 0 and something in the range of fast_io_fail_tmo. Usually it should be close to 0.

The model given above holds only for unbuffered IO. If IO spends significant time inside the various Linux buffers and caches, the observed time might be longer than estimated. Reads are therefore more convenient when doing tests.

The factors in detail

- Number of recently seen active paths

The number of recently seen active paths depends on the number of available paths, as configured in the SAN, and on the MPIO parameter dev_loss_tmo. If all paths fail within a period shorter than dev_loss_tmo seconds, the number of recently seen active paths is equal to the number of paths configured in the SAN. If paths fail one after the other, with intervals of more than dev_loss_tmo seconds between the failures, the number of recently seen active paths is lower.

Example 1:
available paths, as configured in the SAN: 4
dev_loss_tmo: 300
All 4 paths fail within 200 seconds, so the number of recently seen active paths is 4 for this event. Since all paths are gone, the LUN is inaccessible.

Example 2:
available paths, as configured in the SAN: 4
dev_loss_tmo: 300
2 paths fail within 200 seconds. The LUN is still accessible, but the number of recently seen active paths is 2 for the next failure. Thus it takes only half the time until an IO error is shown.

- Polling interval

The parameter polling_interval is the interval between two path checks in seconds. For properly functioning paths, the interval between checks will gradually increase to max_polling_interval.

- Path retries

The parameter no_path_retry specifies the number of retries until queuing for that path is disabled.

Testing example

This example shows how to list, disable, and re-enable MPIO paths with basic Linux commands. For real tests, it is necessary to unplug cables or at least block ports on the FC switch.

First show the normal situation and find the paths to disable. In this case we have two path groups. The output differs on other systems.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

Disable the paths. Here we disable only one path. We just tell the kernel to set a path offline. This is not the same as unplugging a cable; here the failure is reported upwards immediately.

# echo offline >/sys/block/sdg/device/state

Show how multipathing looks now. The path is still probed.
# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active faulty offline
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

After a while the path is set to failed.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  failed faulty offline
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

In the system log, you will see the failure as well.

Feb 17 12:09:02 sapnode1 multipathd: checker failed path 8:96 in map 3600...
Feb 17 12:09:02 sapnode1 multipathd: 3600...: remaining active paths: 3
Feb 17 12:09:02 sapnode1 kernel: ... multipath: Failing path 8:96.

Re-enable the path.

# echo running >/sys/block/sdg/device/state

Show how multipathing looks now. The path is still probed.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  failed undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

After a while the path is back.
# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

Script samples for testing

These samples are provided under the GNU General Public License v2. There is absolutely no warranty for anything.

- Generate read IOs

#!/bin/bash
# chkio
DEV=/dev/dm-0
while true; do
    date
    dd if=$DEV of=/dev/null count=1 2>/dev/null
    echo $?
    sleep 1
done
#

- Filter messages from system log

#!/bin/bash
# logio
LOG=/var/log/messages
tail -f $LOG | grep -e "multipathd.*error" -e "multipathd.*path.*down" \
    -e "multipathd.*failed.path" -e "multipathd.*remaining.*paths" \
    -e "multipath.*Failing.path" -e "blocked FC.*port" \
    -e "I/O.error" -e "transport.error"
#

- Stop the IO, in case it is not possible to unplug cables

Never do this on a production system. This test is not the same as unplugging cables. The paths will be set offline and the failure is reported upwards immediately. Make sure you know how to re-enable the paths afterwards. The SCSI devices to be blocked can be seen with "multipath -l".

#!/bin/bash
# stpio
DEV=$*
for d in $DEV; do
    test -w /sys/block/$d/device/state || continue
    #
    echo offline >/sys/block/$d/device/state
    echo -n "$d: "
    cat /sys/block/$d/device/state
done
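The formula from the overview can also be put into a small helper script. The parameter values below are only illustrative assumptions, not defaults; substitute the values from your own /etc/multipath.conf and path configuration.

```shell
#!/bin/bash
# estio - sketch of the timeout estimate, using assumed example values.
# Replace these with the values from your own configuration:
SEEN_PATHS=4         # paths seen active within the last dev_loss_tmo seconds
POLLING_INTERVAL=5   # polling_interval from /etc/multipath.conf
NO_PATH_RETRY=12     # no_path_retry from /etc/multipath.conf

TIME=$(( SEEN_PATHS * POLLING_INTERVAL * NO_PATH_RETRY ))
echo "estimated time until IO errors: ${TIME} seconds (plus up to fast_io_fail_tmo)"
```

With these assumed values the estimate is 240 seconds. As in Example 2 above, if two of the four paths were already lost more than dev_loss_tmo seconds ago, SEEN_PATHS drops to 2 and the estimate is halved.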
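The stpio sample sets paths offline; a matching sketch to set them back to running might look as follows. The script name rstio is hypothetical and not part of the original samples, and the same warning applies: never use this blindly on a production system.

```shell
#!/bin/bash
# rstio - hypothetical counterpart to stpio (an assumption, not an
# original sample): set the given SCSI devices, e.g. "rstio sdg sde",
# back to the running state.
DEV=$*
for d in $DEV; do
    test -w /sys/block/$d/device/state || continue
    echo running >/sys/block/$d/device/state
    echo -n "$d: "
    cat /sys/block/$d/device/state
done
```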

Additional Information

Online documentation

 https://www.suse.com/documentation/sles11/stor_admin/data/bookinfo.html

Technical Information Documents

 Using LVM on Multipath (DM MPIO) Devices
 https://www.suse.com/support/kb/doc.php?id=7007498

 Recommendations for the usage of user_friendly_names in multipath configurations
 https://www.suse.com/support/kb/doc.php?id=7001133

Manual pages

multipath.conf(5)
multipath(8)
rescan-scsi-bus.sh(1)
sg3_utils(8)
sbd(8)
iostat(1)

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID:7016305
  • Creation Date: 13-Mar-2015
  • Modified Date:03-Mar-2020
    • SUSE Linux Enterprise Server
