7.11 Configuring Path Failover Policies and Priorities

In a Linux host, when there are multiple paths to a storage controller, each path appears as a separate block device, which results in multiple block devices for a single LUN. The Device Mapper Multipath service detects multiple paths with the same LUN ID and creates a new multipath device with that ID. For example, a host with two HBAs attached to a storage controller with two ports via a single unzoned Fibre Channel switch sees four block devices: /dev/sda, /dev/sdb, /dev/sdc, and /dev/sdd. The Device Mapper Multipath service creates a single block device, /dev/mpath/mpath1, that reroutes I/O through those four underlying block devices.

This section describes how to specify policies for failover and configure priorities for the paths.

7.11.1 Configuring the Path Failover Policies

Use the multipath command with the -p option to set the path failover policy:

multipath devicename -p policy 

Replace policy with one of the following policy options:

Table 7-5 Group Policy Options for the multipath -p Command

Policy Option       Description
failover            (Default) One path per priority group.
multibus            All paths in one priority group.
group_by_serial     One priority group per detected serial number.
group_by_prio       One priority group per path priority value. Priorities are determined by callout programs specified as a global, per-controller, or per-multipath option in the /etc/multipath.conf configuration file.
group_by_node_name  One priority group per target node name. Target node names are fetched from the /sys/class/fc_transport/target*/node_name location.

7.11.2 Configuring Failover Priorities

You must manually enter the failover priorities for the device in the /etc/multipath.conf file. Examples for all settings and options can be found in the /usr/share/doc/packages/multipath-tools/multipath.conf.annotated file.

If you modify the /etc/multipath.conf file, the changes are not automatically applied when you save the file. For information, see Section 7.6.3, Verifying the Multipath Setup in the /etc/multipath.conf File and Section 7.6.4, Applying the /etc/multipath.conf File Changes to Update the Multipath Maps.

Understanding Priority Groups and Attributes

A priority group is a collection of paths that go to the same physical LUN. By default, I/O is distributed in a round-robin fashion across all paths in the group. The multipath command automatically creates priority groups for each LUN in the SAN based on the path_grouping_policy setting for that SAN. The multipath command multiplies the number of paths in a group by the group’s priority to determine which group is the primary. The group with the highest calculated value is the primary. When all paths in the primary group are failed, the priority group with the next highest value becomes active.
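The selection rule described above can be sketched in a few lines of shell. This is only an illustration of the arithmetic (number of paths multiplied by group priority), not the code that multipath itself runs. The function name primary_group and the "name:path_count:priority" argument format are assumptions made for this example.

```shell
#!/bin/bash
# Illustration only: choose the primary priority group by score.
# Each argument is a hypothetical "name:path_count:priority" triple.
primary_group() {
    local group name count prio score
    local best_name="" best_score=-1
    for group in "$@"; do
        IFS=: read -r name count prio <<< "$group"
        # Score = number of paths in the group x group priority.
        score=$(( count * prio ))
        if [ "$score" -gt "$best_score" ]; then
            best_score=$score
            best_name=$name
        fi
    done
    echo "$best_name $best_score"
}

# Two paths at priority 50 outscore four paths at priority 10.
primary_group "optimized:2:50" "non_optimized:4:10" "standby:2:1"
# prints: optimized 100
```

Because the score depends on the path count as well as the priority, a large group of low-priority paths can outscore a small group of high-priority paths.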

A path priority is an integer value assigned to a path. The higher the value, the higher the priority is. An external program is used to assign priorities for each path. For a given device, the paths with the same priorities belong to the same priority group.

Multipath Tools 0.4.9 for SLES 11 SP2 uses the prio setting in the defaults{} or devices{} section of the /etc/multipath.conf file. It silently ignores the keyword prio when it is specified for an individual multipath definition in the multipaths{} section. Multipath Tools 0.4.8 for SLES 11 SP1 and earlier allows the prio setting in the individual multipath definition in the multipaths{} section to override the prio settings in the defaults{} or devices{} section.

The syntax for the prio keyword in the /etc/multipath.conf file is changed in multipath-tools-0.4.9. The prio line specifies the prioritizer. If the prioritizer requires an argument, you specify the argument by using the prio_args keyword on a second line. Previously, the prioritizer and its arguments were included on the prio line.

PRIO Settings for the Defaults or Devices Sections

prio

Specifies the prioritizer program to call to obtain a path priority value. Weights are summed for each path group to determine the next path group to use in case of failure.

Use the prio_args keyword to specify arguments if the specified prioritizer requires arguments.

Values

If no prio keyword is specified, all paths are equal. The default setting is “const” with a prio_args setting with no value.

prio      "const"
prio_args ""

Example prioritizer programs include:

Prioritizer Program  Description
alua                 Generates path priorities based on the SCSI-3 ALUA settings.
const                Generates the same priority for all paths.
emc                  Generates the path priority for EMC arrays.
hds                  Generates the path priority for Hitachi HDS Modular storage arrays.
hp_sw                Generates the path priority for Compaq/HP controller in active/standby mode.
ontap                Generates the path priority for NetApp arrays.
random               Generates a random priority for each path.
rdac                 Generates the path priority for LSI/Engenio RDAC controller.

weightedpath

Generates the path priority based on the weighted values you specify in the arguments for prio_args, such as:

<hbtl|devname> <regex1> <prio1> <regex2> <prio2>...

The hbtl regex argument format uses the SCSI H:B:T:L notation (such as 1:0:.:. and *:0:0:.) with a weight value, where H, B, T, L are the host, bus, target, and LUN IDs for a device. For example:

prio "weightedpath"
prio_args "hbtl 1:.:.:. 2 4:.:.:. 4"
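To make the pairing of patterns and weights more concrete, the following shell sketch maps an H:B:T:L address to a weight from a list of regex and priority pairs. It is not the weightedpath prioritizer itself. The function name weighted_prio is hypothetical, and the sketch assumes that the first matching pattern wins and that a path matching no pattern gets a weight of 0. Verify the actual behavior against your multipath-tools version.

```shell
#!/bin/bash
# Illustration only: look up a weight for an H:B:T:L address.
# Usage: weighted_prio <hbtl> <regex1> <prio1> [<regex2> <prio2> ...]
# Assumption: first match wins; no match yields 0.
weighted_prio() {
    local path="$1"
    shift
    while [ "$#" -ge 2 ]; do
        # The pattern is treated as an extended regular expression.
        if [[ "$path" =~ $1 ]]; then
            echo "$2"
            return 0
        fi
        shift 2
    done
    echo 0
}

# Same pairs as in the prio_args example above.
weighted_prio "1:0:0:0" "1:.:.:." 2 "4:.:.:." 4   # prints: 2
weighted_prio "4:0:0:1" "1:.:.:." 2 "4:.:.:." 4   # prints: 4
```

With these pairs, paths through host 4 receive twice the weight of paths through host 1.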

The devname regex argument format uses a device node name with a weight value for each device. For example:

prio "weightedpath"
prio_args "devname sda 50 sde 10 sdc 50 sdf 10"

prio_args

Specifies the arguments for the specified prioritizer program that requires arguments. Most prio programs do not need arguments.

Values

There is no default. The value depends on the prio setting and whether the prioritizer requires arguments.

prio      "const"
prio_args ""

Multipath Attributes

Multipath attributes are used to control the behavior of multipath I/O for devices. You can specify attributes as defaults for all multipath devices. You can also specify attributes that apply only to a given multipath device by creating an entry for that device in the multipaths section of the multipath configuration file.

user_friendly_names

Specifies whether to use world-wide IDs (WWIDs) or to use the /var/lib/multipath/bindings file to assign a persistent and unique alias to the multipath devices in the form of /dev/mapper/mpathN.

This option can be used in the devices section and the multipaths section.

Values

Value  Description
no     (Default) Use the WWIDs shown in the /dev/disk/by-id/ location.
yes    Autogenerate user-friendly names as aliases for the multipath devices instead of the actual ID.

failback

Specifies whether to monitor the failed path recovery, and indicates the timing for group failback after failed paths return to service.

When the failed path recovers, the path is added back into the multipath enabled path list based on this setting. Multipath evaluates the priority groups and changes the active priority group when the priority of the primary group exceeds that of the secondary group.

Values

Value      Description
manual     (Default) The failed path is not monitored for recovery. The administrator runs the multipath command to update enabled paths and priority groups.
immediate  When a path recovers, enable the path immediately.
n          When the path recovers, wait n seconds before enabling the path. Specify an integer value greater than 0.

We recommend a failback setting of “manual” for multipath in cluster environments in order to prevent multipath failover ping-pong.

failback "manual"

IMPORTANT: Ensure that you verify the failback setting with your storage system vendor. Different storage systems can require different settings.

getuid_callout

The default program and arguments to call to obtain a unique path identifier. Specify the location with an absolute Linux path.

This attribute is deprecated in SLES 11 SP3 and later. It is replaced by the uid_attribute.

Values

The default location and arguments are:

/lib/udev/scsi_id -g -u -s

Example:

getuid_callout "/lib/udev/scsi_id -g -u -d /dev/%n"

getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"

no_path_retry

Specifies the behaviors to use on path failure.

Values

Value  Description
n      Specifies the number of retries until multipath stops the queuing and fails the path. Specify an integer value greater than 0. In a cluster, you can specify a value of “0” to prevent queuing and allow resources to fail over.
fail   Specifies immediate failure (no queuing).
queue  Never stop queuing (queue forever until the path comes alive).

We recommend a retry setting of “fail” or “0” in the /etc/multipath.conf file when working in a cluster. This causes the resources to fail over when the connection is lost to storage. Otherwise, the messages queue and the resource failover cannot occur.

no_path_retry "fail"
no_path_retry "0"

IMPORTANT: Ensure that you verify the retry settings with your storage system vendor. Different storage systems can require different settings.

path_checker

Determines the state of the path.

Values

Value                Description
directio             (Default in multipath-tools version 0.4.8 and later) Reads the first sector by using direct I/O. This is useful for DASD devices. Logs failure messages in /var/log/messages.
readsector0          (Default in multipath-tools version 0.4.7 and earlier; deprecated and replaced by directio.) Reads the first sector of the device. Logs failure messages in /var/log/messages.
tur                  Issues a SCSI test unit ready command to the device. This is the preferred setting if the LUN supports it. On failure, the command does not fill up /var/log/messages with messages.
custom_vendor_value  Some SAN vendors provide custom path_checker options:

  • cciss_tur: Checks the path state for HP Smart Storage Arrays.

  • emc_clariion: Queries the EMC Clariion EVPD page 0xC0 to determine the path state.

  • hp_sw: Checks the path state (Up, Down, or Ghost) for HP storage arrays with Active/Standby firmware.

  • rdac: Checks the path state for the LSI/Engenio RDAC storage controller.

path_grouping_policy

Specifies the path grouping policy for a multipath device hosted by a given controller.

Values

Value               Description
failover            (Default) One path is assigned per priority group so that only one path at a time is used.
multibus            All valid paths are in one priority group. Traffic is load-balanced across all active paths in the group.
group_by_prio       One priority group exists for each path priority value. Paths with the same priority are in the same priority group. Priorities are assigned by an external program.
group_by_serial     Paths are grouped by the SCSI target serial number (controller node WWN).
group_by_node_name  One priority group is assigned per target node name. Target node names are fetched from /sys/class/fc_transport/target*/node_name.

path_selector

Specifies the path-selector algorithm to use for load balancing.

Values

Value           Description
round-robin 0   (Default in SLES 11 SP2 and earlier) The load-balancing algorithm used to balance traffic across all active paths in a priority group.
queue-length 0  A dynamic load balancer that balances the number of in-flight I/O on paths similar to the least-pending option.
service-time 0  (Default in SLES 11 SP3 and later) A service-time oriented load balancer that balances I/O on paths according to the latency.

pg_timeout

Specifies path group timeout handling.

Values

NONE (internal default)

polling_interval

Specifies the time in seconds between the end of one path checking cycle and the beginning of the next path checking cycle.

Values

Specify an integer value greater than 0. The default value is 5. Ensure that you verify the polling_interval setting with your storage system vendor. Different storage systems can require different settings.

prio_callout

Specifies the program and arguments to use to determine the layout of the multipath map.

Multipath prio_callout programs are located in shared libraries in /lib/libmultipath/lib*. By using shared libraries, the callout programs are loaded into memory on daemon startup.

When queried by the multipath command, the specified mpath_prio_* callout program returns the priority for a given path in relation to the entire multipath layout.

When it is used with the path_grouping_policy of group_by_prio, all paths with the same priority are grouped into one multipath group. The group with the highest aggregate priority becomes the active group.

When all paths in a group fail, the group with the next highest aggregate priority becomes active. Additionally, a failover command (as determined by the hardware handler) might be sent to the target.

The mpath_prio_* program can also be a custom script created by a vendor or administrator for a specified setup.

  • A %n in the command line expands to the device name in the /dev directory.

  • A %b in the command line expands to the device number in major:minor format in the /dev directory.

  • A %d in the command line expands to the device ID in the /dev/disk/by-id directory.

If devices are hot-pluggable, use the %d flag instead of %n. This addresses the short time that elapses between the time when devices are available and when udev creates the device nodes.

Values

Value       Description
(No value)  If no prio_callout attribute is used, all paths are equal. This is the default.
/bin/true   Specify this value when the group_by_prio is not being used.

The prioritizer programs generate path priorities when queried by the multipath command. The program names must begin with mpath_prio_ and are named by the device type or balancing method used. Current prioritizer programs include the following:

Prioritizer Program        Description
mpath_prio_alua %n         Generates path priorities based on the SCSI-3 ALUA settings.
mpath_prio_balance_units   Generates the same priority for all paths.
mpath_prio_emc %n          Generates the path priority for EMC arrays.
mpath_prio_hds_modular %b  Generates the path priority for Hitachi HDS Modular storage arrays.
mpath_prio_hp_sw %n        Generates the path priority for Compaq/HP controller in active/standby mode.
mpath_prio_netapp %n       Generates the path priority for NetApp arrays.
mpath_prio_random %n       Generates a random priority for each path.
mpath_prio_rdac %n         Generates the path priority for LSI/Engenio RDAC controller.
mpath_prio_tpc %n          You can optionally use a script created by a vendor or administrator that gets the priorities from a file where you specify priorities to use for each path.
mpath_prio_spec.sh %n      Provides the path of a user-created script that generates the priorities for multipathing based on information contained in a second data file. (This path and filename are provided as an example. Specify the location of your script instead.) The script can be created by a vendor or administrator. The script’s target file identifies each path for all multipathed devices and specifies a priority for each path. For an example, see Section 7.11.3, Using a Script to Set Path Priorities.

rr_min_io

Specifies the number of I/O transactions to route to a path before switching to the next path in the same path group, as determined by the specified algorithm in the path_selector setting.

The rr_min_io attribute is used only for kernels 2.6.31 and earlier. It is obsoleted in SLES 11 SP2 and replaced by the rr_min_io_rq attribute.

Values

Specify an integer value greater than 0. The default value is 1000.

rr_min_io "1000"

rr_min_io_rq

Specifies the number of I/O requests to route to a path before switching to the next path in the current path group, using request-based device-mapper-multipath.

This attribute is available for systems running SLES 11 SP2 and later. It replaces the rr_min_io attribute.

Values

Specify an integer value greater than 0. The default value is 1.

rr_min_io_rq "1"

rr_weight

Specifies the weighting method to use for paths.

Values

Value       Description
uniform     (Default) All paths have the same round-robin weights.
priorities  Each path’s weight is determined by the path’s priority times the rr_min_io_rq setting (or the rr_min_io setting for kernels 2.6.31 and earlier).

uid_attribute

A udev attribute that provides a unique path identifier. The default value is ID_SERIAL.

Configuring for Round-Robin Load Balancing

All paths are active. I/O is configured for some number of seconds or some number of I/O transactions before moving to the next open path in the sequence.

Configuring for Single Path Failover

A single path with the highest priority (highest value setting) is active for traffic. Other paths are available for failover, but are not used unless failover occurs.

Grouping I/O Paths for Round-Robin Load Balancing

Multiple paths with the same priority fall into the active group. When all paths in that group fail, the device fails over to the next highest priority group. All paths in the group share the traffic load in a round-robin load balancing fashion.

7.11.3 Using a Script to Set Path Priorities

You can create a script that interacts with Device Mapper Multipath (DM-MPIO) to provide priorities for paths to the LUN when set as a resource for the prio_callout setting.

First, set up a text file that lists information about each device and the priority values you want to assign to each path. For example, name the file /usr/local/etc/primary-paths. Enter one line for each path in the following format:

host_wwpn target_wwpn scsi_id priority_value

Return a priority value for each path on the device. Ensure that the variable FILE_PRIMARY_PATHS resolves to a real file with appropriate data (host wwpn, target wwpn, scsi_id and priority value) for each device.

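The contents of the primary-paths file for eight paths might look like the following. In this sample, each line also carries the SCSI H:B:T:L address and the device node name between the target WWPN and the SCSI ID: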

0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51

To continue the example mentioned in prio_callout, create a script named /usr/local/sbin/path_prio.sh. You can use any path and filename. The script does the following:

  • On query from multipath, grep the device and its path from the /usr/local/etc/primary-paths file.

  • Return to multipath the priority value in the last column for that entry in the file.
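A minimal sketch of such a script follows. It is an illustration under stated assumptions, not a vendor-supplied implementation. It assumes that multipath passes the device node name (for example, sdb) as the first argument, as the %n flag does, and that the data file uses the six-column layout of the sample above, with the device node name in the fourth column. It uses awk rather than grep so that the match and the column extraction happen in one step. The function name path_prio and the fallback priority of 0 for unlisted devices are choices made for this example.

```shell
#!/bin/bash
# Hypothetical sketch of /usr/local/sbin/path_prio.sh.
# Assumed file layout, one path per line:
#   host_wwpn target_wwpn H:B:T:L devname scsi_id priority
FILE_PRIMARY_PATHS="${FILE_PRIMARY_PATHS:-/usr/local/etc/primary-paths}"

path_prio() {
    local dev="$1"
    # Print the last column of the first line whose fourth column is the
    # device node name; print 0 (an assumed fallback) if nothing matches.
    awk -v dev="$dev" '
        $4 == dev { print $NF; found = 1; exit }
        END { if (!found) print 0 }
    ' "$FILE_PRIMARY_PATHS"
}

# When called with a device node name, print its priority.
if [ -n "${1:-}" ]; then
    path_prio "$1"
fi
```

With the sample file above, calling the script with sdb as the argument prints 50, and calling it with sdj prints 51.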

7.11.4 Configuring ALUA (mpath_prio_alua)

The mpath_prio_alua(8) command is used as a priority callout for the Linux multipath(8) command. It returns a number that is used by DM-MPIO to group SCSI devices with the same priority together. This path priority tool is based on ALUA (Asymmetric Logical Unit Access).

Syntax

mpath_prio_alua [-d directory] [-h] [-v] [-V] device [device...] 

Prerequisite

SCSI devices.

Options

-d directory

Specifies the Linux directory path where the listed device node names can be found. The default directory is /dev. When you use this option, specify the device node name only (such as sda) for the device or devices you want to manage.

-h

Displays help for this command, then exits.

-v

Turns on verbose output to display status in human-readable format. Output includes information about which port group the specified device is in and its current state.

-V

Displays the version number of this tool, then exits.

device [device...]

Specifies the SCSI device (or multiple devices) that you want to manage. The device must be a SCSI device that supports the Report Target Port Groups (sg_rtpg(8)) command. Use one of the following formats for the device node name:

  • The full Linux directory path, such as /dev/sda. Do not use with the -d option.

  • The device node name only, such as sda. Specify the directory path by using the -d option.

  • The major and minor number of the device separated by a colon (:) with no spaces, such as 8:0. This creates a temporary device node in the /dev directory with a name in the format of tmpdev-<major>:<minor>-<pid>. For example, /dev/tmpdev-8:0-<pid>.
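If you need the major:minor pair for a device node name, one way to look it up is through sysfs, which exposes it in the dev attribute of each block device. The following sketch assumes a standard sysfs layout. The function name dev_majmin and its optional second argument (an alternate sysfs root, useful for testing) are hypothetical.

```shell
#!/bin/bash
# Illustration only: print the major:minor pair for a block device.
# Usage: dev_majmin <devname> [sysfs_root]
# Example: dev_majmin sda
dev_majmin() {
    local dev="$1"
    local sysroot="${2:-/sys}"
    # The dev attribute holds the pair in major:minor form, such as 8:0.
    cat "$sysroot/block/$dev/dev"
}
```

The value that the function prints can be passed to mpath_prio_alua in the major:minor format described above.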

Return Values

On success, returns a value of 0 and the priority value for the group. Table 7-6 shows the priority values returned by the mpath_prio_alua command.

Table 7-6 ALUA Priorities for Device Mapper Multipath

Priority Value  Description
50              The device is in the active, optimized group.
10              The device is in an active but non-optimized group.
1               The device is in the standby group.
0               All other groups.

Values are widely spaced because of the way the multipath command handles them. It multiplies the number of paths in a group by the priority value for the group, then selects the group with the highest result. For example, if a non-optimized path group has six paths (6 x 10 = 60) and the optimized path group has a single path (1 x 50 = 50), the non-optimized group has the highest score, so multipath chooses the non-optimized group. Traffic to the device uses all six paths in the group in a round-robin fashion.

On failure, returns a value of 1 to 5 indicating the cause for the command’s failure. For information, see the man page for mpath_prio_alua.

7.11.5 Reporting Target Path Groups

Use the SCSI Report Target Port Groups (sg_rtpg(8)) command. For information, see the man page for sg_rtpg(8).