14.12 Adding SR-IOV Devices

Single Root I/O Virtualization (SR-IOV) capable PCIe devices can replicate their resources, so they appear to be multiple devices. Each of these "pseudo-devices" can be assigned to a VM Guest.

SR-IOV is an industry specification that was created by the Peripheral Component Interconnect Special Interest Group (PCI-SIG) consortium. It introduces physical functions (PF) and virtual functions (VF). PFs are full PCIe functions used to manage and configure the device; they can also move data. VFs lack the configuration and management part and can only move data, with a reduced set of configuration functions. Since VFs do not have all PCIe functions, the host operating system or the Hypervisor must support SR-IOV to be able to access and initialize VFs. The theoretical maximum is 256 VFs per device (consequently, the maximum for a dual-port Ethernet card would be 512). In practice this maximum is much lower, since each VF consumes resources.

14.12.1 Requirements

The following requirements must be met to be able to use SR-IOV:

  • An SR-IOV-capable network card (as of SUSE Linux Enterprise Server 12 SP3, only network cards support SR-IOV)

  • An AMD64/Intel 64 host supporting hardware virtualization (AMD-V or Intel VT-x); see Section 7.3, KVM Hardware Requirements for more information

  • A chipset that supports device assignment (AMD-Vi or Intel VT-d); a quick check for both the CPU and the chipset requirement is sketched after this list

  • libvirt 0.9.10 or newer

  • SR-IOV drivers must be loaded and configured on the host system

  • A host configuration that meets the requirements listed at Requirements for VFIO and SR-IOV

  • A list of the PCI addresses of the VF(s) that will be assigned to VM Guests
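
Whether the host meets the CPU and chipset requirements from the list above can be checked on the command line, for example as follows (a rough sketch; the exact messages vary by platform and firmware):

tux > grep -E -wo 'vmx|svm' /proc/cpuinfo | sort -u
tux > sudo dmesg | grep -i -e DMAR -e IOMMU

If the first command prints vmx (Intel) or svm (AMD), the CPU supports hardware virtualization; the second command should show messages indicating that the IOMMU (Intel VT-d or AMD-Vi) was initialized.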

HINT: Checking if a Device is SR-IOV-Capable

Whether a device is SR-IOV-capable can be determined from its PCI descriptor by running lspci. A device that supports SR-IOV reports a capability similar to the following:

Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
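
For example, assuming a card at PCI address 01:00.0 (as in the examples below), the capability can be queried as follows; run lspci as root, otherwise the capability list may not be displayed:

tux > sudo lspci -v -s 01:00.0 | grep SR-IOV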

NOTE: Adding an SR-IOV Device at VM Guest Creation

To add an SR-IOV device to a VM Guest when initially setting up the guest, the VM Host Server already needs to be configured as described in Section 14.12.2, Loading and Configuring the SR-IOV Host Drivers.

14.12.2 Loading and Configuring the SR-IOV Host Drivers

To be able to access and initialize VFs, an SR-IOV-capable driver needs to be loaded on the host system.

  1. Before loading the driver, make sure the card is properly detected by running lspci. The following example shows the lspci output for the dual-port Intel 82576NS network card:

    tux > sudo /sbin/lspci | grep 82576
    01:00.0 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    01:00.1 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    04:00.0 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    04:00.1 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)

    In case the card is not detected, it is likely that the hardware virtualization support in the BIOS/EFI has not been enabled.

  2. Check whether the SR-IOV driver is already loaded by running lsmod. In the following example a check for the igb driver (for the Intel 82576NS network card) returns a result. That means the driver is already loaded. If the command returns nothing, the driver is not loaded.

    tux > sudo /sbin/lsmod | egrep "^igb "
    igb                   185649  0
  3. Skip this step if the driver is already loaded.

    If the SR-IOV driver is not yet loaded, the non-SR-IOV driver needs to be removed before loading the new driver. Use rmmod to unload a driver. The following example unloads the non-SR-IOV driver for the Intel 82576NS network card:

    sudo /sbin/rmmod igbvf

    Then load the SR-IOV driver using the modprobe command; the VF parameter (max_vfs) is mandatory:

    sudo /sbin/modprobe igb max_vfs=8
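
    To make the setting persistent across reboots, the module parameter can be placed in a configuration file under /etc/modprobe.d, for example (a sketch; the file name 50-igb-sriov.conf is an arbitrary example):

    tux > echo "options igb max_vfs=8" | sudo tee /etc/modprobe.d/50-igb-sriov.conf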

    Alternatively, enable the VFs via SYSFS:

    Find the PCI ID of the physical NIC by listing Ethernet devices:

    tux > sudo lspci | grep Eth
    06:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
    06:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)

    To enable VFs, write the desired number of VFs to the sriov_numvfs attribute. Note that the write needs to happen with root privileges, therefore use tee rather than a plain shell redirection:

    tux > echo 1 | sudo tee /sys/bus/pci/devices/0000:06:00.1/sriov_numvfs

    Verify that the VF NIC was loaded:

    tux > sudo lspci | grep Eth
    06:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
    06:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)
    06:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Skyhawk) (rev 10)

    Obtain the maximum number of VFs available:

    tux > sudo lspci -vvv -s 06:00.1 | grep 'Initial VFs'
                           Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 01
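
    The same information is also exposed in SYSFS: the sriov_totalvfs attribute of the PF reports the maximum number of VFs the device supports, and sriov_numvfs the number currently enabled:

    tux > cat /sys/bus/pci/devices/0000:06:00.1/sriov_totalvfs
    tux > cat /sys/bus/pci/devices/0000:06:00.1/sriov_numvfs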
  4. Create a before.service file that loads the VF(s) via SYSFS on boot:

    [Unit]
    Before=
    [Service]
    Type=oneshot
    RemainAfterExit=true
    ExecStart=/bin/bash -c "echo 1 > /sys/bus/pci/devices/0000:06:00.1/sriov_numvfs"
    # beware, executable is run directly, not through a shell, check the man pages
    # systemd.service and systemd.unit for full syntax
    [Install]
    # target in which to start the service
    WantedBy=multi-user.target
    #WantedBy=graphical.target

    Copy it to /etc/systemd/system/before.service.
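
    Afterward, reload the systemd configuration and enable the service so that the VFs are created on every boot (assuming the unit file was saved as before.service):

    tux > sudo systemctl daemon-reload
    tux > sudo systemctl enable before.service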

    Additionally, you need to create another service file (after-local.service) pointing to an /etc/init.d/after.local script that detaches the NIC before the VM Guest is started; otherwise, the VM Guest would fail to start:

    [Unit]
    Description=/etc/init.d/after.local Compatibility
    After=libvirtd.service
    Requires=libvirtd.service
    [Service]
    Type=oneshot
    ExecStart=/etc/init.d/after.local
    RemainAfterExit=true
    
    [Install]
    WantedBy=multi-user.target

    Copy it to /etc/systemd/system/after-local.service. The /etc/init.d/after.local script it points to detaches the VF from the host, for example:

    #! /bin/sh
    #
    # Copyright (c) 2010 SuSE LINUX Products GmbH, Germany.  All rights reserved.
    # ...
    virsh nodedev-detach pci_0000_06_08_0

    Then save it as /etc/init.d/after.local.
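
    The script must be executable, and the compatibility service needs to be enabled, for example:

    tux > sudo chmod +x /etc/init.d/after.local
    tux > sudo systemctl daemon-reload
    tux > sudo systemctl enable after-local.service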

  5. Reboot the machine and check whether the SR-IOV driver is loaded by re-running the lspci command from the first step of this procedure. If the SR-IOV driver was loaded successfully, you should see additional lines for the VFs:

    01:00.0 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    01:00.1 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    01:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    01:10.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    01:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    [...]
    04:00.0 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    04:00.1 Ethernet controller: Intel Corporation 82576NS Gigabit Network Connection (rev 01)
    04:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    04:10.1 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    04:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
    [...]
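
    The mapping between a PF and its VFs can also be inspected via SYSFS. Each virtfnN symbolic link below the PF's device directory points to the PCI address of one VF (replace eth0 with the network interface name of your PF):

    tux > ls -l /sys/class/net/eth0/device/virtfn*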

14.12.3 Adding a VF Network Device to an Existing VM Guest

When the SR-IOV hardware is properly set up on the VM Host Server, you can add VFs to VM Guests. To do so, you need to collect some data first.

Note: The following procedure uses example data. Make sure to replace it with appropriate data from your setup.

  1. Use the virsh nodedev-list command to get the PCI address of the VF you want to assign and its corresponding PF. Numerical values from the lspci output shown in Section 14.12.2, Loading and Configuring the SR-IOV Host Drivers (for example 01:00.0 or 04:00.1) are transformed by adding the prefix "pci_0000_" and by replacing colons and dots with underscores. So a PCI ID listed as "04:00.0" by lspci is listed as "pci_0000_04_00_0" by virsh (see the shell sketch after the following listing). The following example lists the PCI IDs for the second port of the Intel 82576NS network card:

    tux > sudo virsh nodedev-list | grep 0000_04_
    pci_0000_04_00_0
    pci_0000_04_00_1
    pci_0000_04_10_0
    pci_0000_04_10_1
    pci_0000_04_10_2
    pci_0000_04_10_3
    pci_0000_04_10_4
    pci_0000_04_10_5
    pci_0000_04_10_6
    pci_0000_04_10_7
    pci_0000_04_11_0
    pci_0000_04_11_1
    pci_0000_04_11_2
    pci_0000_04_11_3
    pci_0000_04_11_4
    pci_0000_04_11_5

    The first two entries represent the PFs, whereas the other entries represent the VFs.
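
    The name transformation described above can also be reproduced with a small shell snippet (a Bash sketch):

    tux > ADDR=04:10.0; echo "pci_0000_${ADDR//[:.]/_}"
    pci_0000_04_10_0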

  2. Get more data that will be needed by running the command virsh nodedev-dumpxml on the PCI ID of the VF you want to add:

    tux > sudo virsh nodedev-dumpxml pci_0000_04_10_0
    <device>
      <name>pci_0000_04_10_0</name>
      <parent>pci_0000_00_02_0</parent>
      <capability type='pci'>
        <domain>0</domain>
        <bus>4</bus>
        <slot>16</slot>
        <function>0</function>
        <product id='0x10ca'>82576 Virtual Function</product>
        <vendor id='0x8086'>Intel Corporation</vendor>
        <capability type='phys_function'>
          <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
        </capability>
      </capability>
    </device>

    The following data is needed for the next step:

    • <domain>0</domain>

    • <bus>4</bus>

    • <slot>16</slot>

    • <function>0</function>

  3. Create a temporary XML file (for example /tmp/vf-interface.xml) containing the data necessary to add a VF network device to an existing VM Guest. The minimal content of the file needs to look like the following:

    <interface type='hostdev'>
     <source>
      <address type='pci' domain='0' bus='4' slot='16' function='0'/>
     </source>
    </interface>

    VFs do not get a fixed MAC address; it changes every time the host reboots. When adding network devices the traditional way with <hostdev>, you would need to reconfigure the VM Guest's network device after each host reboot because of the MAC address change. To avoid this kind of problem, libvirt introduced the interface type='hostdev' directive, which sets up network-specific data before assigning the device (see the sketch at the end of this step).

    In the <address> element, specify the domain, bus, slot, and function values you acquired in the previous step.
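
    Because the interface type='hostdev' directive is network-aware, you can optionally also pin a fixed MAC address for the VF in the same file. The following is a minimal sketch that creates such a file from the shell; the MAC address is an arbitrary example and the PCI values must match the data from your setup:

    tux > cat > /tmp/vf-interface.xml <<'EOF'
    <interface type='hostdev'>
      <mac address='52:54:00:6d:90:02'/>
      <source>
        <address type='pci' domain='0' bus='4' slot='16' function='0'/>
      </source>
    </interface>
    EOF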

  4. A device that is still attached to the host cannot be assigned to a VM Guest. To make it available for guests, detach it from the host first:

    virsh nodedev-detach pci_0000_04_10_0
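
    To give the device back to the host later, use the virsh nodedev-reattach command:

    virsh nodedev-reattach pci_0000_04_10_0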
  5. Finally, add the VF interface to an existing VM Guest:

    virsh attach-device GUEST /tmp/vf-interface.xml --OPTION

    GUEST needs to be replaced by the domain name, ID, or UUID of the VM Guest, and --OPTION can be one of the following:

    --persistent

    This option will always add the device to the domain's persistent XML. In addition, if the domain is running, it will be hotplugged.

    --config

    This option will only affect the persistent XML, even if the domain is running. The device will only show up in the guest on next boot.

    --live

    This option will only affect a running domain. If the domain is inactive, the operation will fail. The device is not persisted in the XML and will not be available in the guest on next boot.

    --current

    This option affects the current state of the domain. If the domain is inactive, the device is added to the persistent XML and will be available on next boot. If the domain is active, the device is hotplugged but not added to the persistent XML.

    To detach a VF interface, use the virsh detach-device command, which also takes the options listed above.
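
    For example, the following pair of commands hotplugs the VF into a running VM Guest, keeps it in the persistent configuration, and later removes it again (GUEST is a placeholder for your domain name, ID, or UUID):

    tux > sudo virsh attach-device GUEST /tmp/vf-interface.xml --persistent
    tux > sudo virsh detach-device GUEST /tmp/vf-interface.xml --persistent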

14.12.4 Dynamic Allocation of VFs from a Pool

If you statically define the PCI address of a VF in a guest's configuration as described in Section 14.12.3, Adding a VF Network Device to an Existing VM Guest, it is hard to migrate such a guest to another host. The host must have identical hardware in the same location on the PCI bus, or the guest configuration must be modified prior to each start.

Another approach is to create a libvirt network with a device pool that contains all the VFs of an SR-IOV device. The guest then references this network, and each time it is started, a single VF is dynamically allocated to it. When the guest is stopped, the VF is returned to the pool, available for another guest.

Defining Network with Pool of VFs on VM Host Server

The following example of network definition creates a pool of all VFs for the SR-IOV device with its physical function (PF) at the network interface eth0 on the host:

<network>
  <name>passthrough</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='eth0'/>
  </forward>
</network>

To use this network on the host, save the above code to a file, for example /tmp/passthrough.xml, and execute the following commands. Remember to replace eth0 with the real network interface name of your SR-IOV device's PF:

virsh net-define /tmp/passthrough.xml
virsh net-autostart passthrough
virsh net-start passthrough
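
To verify that the network has been defined and started, list all libvirt networks on the host:

tux > sudo virsh net-list --all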

Configuring VM Guest to Use VF from the Pool

The following example of guest device interface definition uses a VF of the SR-IOV device from the pool created in Defining Network with Pool of VFs on VM Host Server. libvirt automatically derives the list of all VFs associated with that PF the first time the guest is started.

<interface type='network'>
  <source network='passthrough'/>
</interface>

To verify the list of associated VFs, run virsh net-dumpxml passthrough on the host after the first guest that uses the network with the pool of VFs starts.

<network connections='1'>
  <name>passthrough</name>
  <uuid>a6a26429-d483-d4ed-3465-4436ac786437</uuid>
  <forward mode='hostdev' managed='yes'>
    <pf dev='eth0'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x10' function='0x1'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x10' function='0x3'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x10' function='0x5'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x10' function='0x7'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x11' function='0x1'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x11' function='0x3'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x11' function='0x5'/>
  </forward>
</network>