Estimate Multipath Timeouts

This document (7016305) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 11 Service Pack 3 (SLES 11 SP3)
Device Mapper Multipathing IO (MPIO) on Fibre Channel SAN


Situation

It might be desirable to estimate how long it takes before the Linux kernel reports a path failure or a failed LUN to the upper layers, and to predict the effect of the parameters in /etc/multipath.conf.
This is useful when timeouts need to be defined for applications, monitoring agents, watchdogs, or high availability clusters.

Resolution

Overview

Simplified, three factors define the time taken by the Linux kernel before an inaccessible LUN leads to IO errors:

 - number of recently seen active paths (seen_paths)
 - polling interval (polling_interval)
 - path retries (no_path_retry)

The two factors polling_interval and no_path_retry are configuration parameters of MPIO. All three factors are explained below. The formula can be written as:

time = ( seen_paths * polling_interval * no_path_retry ) + x

The additional x at the end varies between 0 and something in the range of fast_io_fail_tmo. Usually it should be close to 0.

The model given above holds only for unbuffered IO. If IO spends significant time inside the various Linux buffers and caches, the observed time might be longer than estimated. Reads are therefore more convenient when doing tests.

The factors in detail

- Number of recently seen active paths

The number of recently seen active paths depends on the number of available paths, as configured in the SAN, and on the MPIO parameter dev_loss_tmo. If all paths fail within a period shorter than dev_loss_tmo seconds, the number of recently seen active paths is equal to the number of paths configured in the SAN. If paths fail one after the other, with intervals of more than dev_loss_tmo seconds between the failures, the number of recently seen active paths is lower.

Example 1:
available paths, as configured in the SAN: 4
dev_loss_tmo: 300
All 4 paths fail within 200 seconds, so the number of recently seen active paths is 4 for this event. Since all paths are gone, the LUN is inaccessible.

Example 2:
available paths, as configured in the SAN: 4
dev_loss_tmo: 300
2 paths fail within 200 seconds. The LUN is still accessible, but the number of recently seen active paths is 2 for the next failure. Thus it takes only half the time until an IO error is shown.

- Polling interval

The parameter polling_interval is the interval between two path checks in seconds. For properly functioning paths, the interval between checks will gradually increase to max_polling_interval.

- Path retries

The parameter no_path_retry specifies the number of retries until queuing for that path is disabled.

Testing example

This example shows how to list, disable, and re-enable MPIO paths with basic Linux commands. For real tests, it is necessary to unplug cables or at least block ports on the FC switch.

First show the normal situation and find the paths to disable. In this case we have two path groups. The output differs on other systems.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

Disable the paths. Here we disable only one path. We just tell the kernel to set a path offline. This is not the same as unplugging a cable; here the failure is reported upwards immediately.

# echo offline >/sys/block/sdg/device/state

Show how multipathing looks now. The path is still probed.
# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active faulty offline
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

After a while the path is set to failed.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  failed faulty offline
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

In the system log, you will see the failure as well.

Feb 17 12:09:02 sapnode1 multipathd: checker failed path 8:96 in map 3600...
Feb 17 12:09:02 sapnode1 multipathd: 3600...: remaining active paths: 3
Feb 17 12:09:02 sapnode1 kernel: ... multipath: Failing path 8:96.

Re-enable the path.

# echo running >/sys/block/sdg/device/state

Show how multipathing looks now. The path is still probed.

# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  failed undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

After a while the path is back.
# multipath -l
3600a0b800016abf3000017795141a1f9 dm-0 IBM,1722-600
size=339G features='1 queue_if_no_path' hwhandler='1 rdac' wp=rw
|-+- policy='queue-length 0' prio=0 status=active
| |- 4:0:1:1 sdg 8:96  active undef running
| `- 3:0:1:1 sdi 8:128 active undef running
`-+- policy='queue-length 0' prio=0 status=enabled
  |- 3:0:0:1 sdc 8:32  active undef running
  `- 4:0:0:1 sde 8:64  active undef running

Script samples for testing

These samples are provided under the GNU General Public License v2. There is absolutely no warranty for anything.

- Generate read IOs

#!/bin/bash
# chkio
DEV=/dev/dm-0
while true; do
    date
    dd if=$DEV of=/dev/null count=1 2>/dev/null
    echo $?
    sleep 1
done
#

- Filter messages from system log

#!/bin/bash
# logio
LOG=/var/log/messages
tail -f $LOG | grep -e "multipathd.*error" -e "multipathd.*path.*down" \
    -e "multipathd.*failed.path" -e "multipathd.*remaining.*paths" \
    -e "multipath.*Failing.path" -e "blocked FC.*port" \
    -e "I/O.error" -e "transport.error"
#

- Stop the IO, in case it is not possible to unplug cables

Never do this on a production system. This test is not the same as unplugging cables. The paths will be set offline and the failure is reported upwards immediately. Make sure you know how to re-enable the paths afterwards. The SCSI devices to be blocked can be seen with "multipath -l".

#!/bin/bash
# stpio
DEV=$*
for d in $DEV; do
    test -w /sys/block/$d/device/state || continue
    #
    echo offline >/sys/block/$d/device/state
    echo -n "$d: "
    cat /sys/block/$d/device/state
done
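The formula from the overview can also be put into a small helper script. The parameter values below are only illustrative assumptions, not defaults; substitute the values from your own /etc/multipath.conf and path configuration.

```shell
#!/bin/bash
# estio - sketch of the timeout estimate, using assumed example values.
# Replace these with the values from your own configuration:
SEEN_PATHS=4         # paths seen active within the last dev_loss_tmo seconds
POLLING_INTERVAL=5   # polling_interval from /etc/multipath.conf
NO_PATH_RETRY=12     # no_path_retry from /etc/multipath.conf

TIME=$(( SEEN_PATHS * POLLING_INTERVAL * NO_PATH_RETRY ))
echo "estimated time until IO errors: ${TIME} seconds (plus up to fast_io_fail_tmo)"
```

With these assumed values the estimate is 240 seconds. As in Example 2 above, if two of the four paths were already lost more than dev_loss_tmo seconds ago, SEEN_PATHS drops to 2 and the estimate is halved.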
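The stpio sample sets paths offline; a matching sketch to set them back to running might look as follows. The script name rstio is hypothetical and not part of the original samples, and the same warning applies: never use this blindly on a production system.

```shell
#!/bin/bash
# rstio - hypothetical counterpart to stpio (an assumption, not an
# original sample): set the given SCSI devices, e.g. "rstio sdg sde",
# back to the running state.
DEV=$*
for d in $DEV; do
    test -w /sys/block/$d/device/state || continue
    echo running >/sys/block/$d/device/state
    echo -n "$d: "
    cat /sys/block/$d/device/state
done
```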

Additional Information

Online documentation

 https://www.suse.com/documentation/sles11/stor_admin/data/bookinfo.html

Technical Information Documents

 Using LVM on Multipath (DM MPIO) Devices
 https://www.suse.com/support/kb/doc.php?id=7007498

 Recommendations for the usage of user_friendly_names in multipath configurations
 https://www.suse.com/support/kb/doc.php?id=7001133

Manual pages

multipath.conf(5)
multipath(8)
rescan-scsi-bus.sh(1)
sg3_utils(8)
sbd(8)
iostat(1)

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID:7016305
  • Creation Date: 13-Mar-2015
  • Modified Date:03-Mar-2020
    • SUSE Linux Enterprise Server
