Field Notes: Optimizing a Dell PowerStore CSI installation on SUSE Virtualization


What is this?

I’m starting a series of blog posts on various tips, suggestions and optimizations I’ve devised here and there with SUSE products as a Solutions Architect!

This one in particular collects some notes and suggestions for improving the documentation at: https://infohub.delltechnologies.com/static/media/client/7phukh/DAM_4b73960f-163e-4463-bec1-d10280d8160e.pdf

DISCLAIMER: This is not official documentation, just some field exploration and experimentation. ALWAYS use the supported/tested versions of all mentioned products here!

 

Custom tools

I created these tools to make it easier to generate configuration files and automate processes:

Harvester Installation File Generator: https://doccaz.github.io/harvester-configurator/

This tool can be used to create a YAML file with pre-configured options DURING THE INSTALLATION. This is the file you can supply at the end of the installation wizard. Note that the syntax is different from the other ones!

 

Harvester Cloudinit CRD Architect: https://doccaz.github.io/harvester-cloud-init/

This tool can be used to create YAML files for automating/persisting the configuration of the Harvester servers themselves, with optional selectors to limit the scope of the nodes affected. It has a template library with the most common actions, like configuring networks, VLANs, storage, etc. This is used to configure things AFTER THE INSTALLATION.

 

Visual Cloud-Init Generator: https://doccaz.github.io/cloud-init/

This tool is used to create the cloud-config templates you can use to pre-configure virtual machines, services, install packages, configure users, network, etc. This tool is to be used for configuring VMs!

 

Official Product documentation

The official documentation for SUSE Virtualization can be found here: https://documentation.suse.com/en-us/cloudnative/virtualization/

PowerStore CSI Installation

The procedure described by Dell for installing the PowerStore CSI on page 10 was incomplete for my environment: their script needs to pull files via Git, and we can no longer easily install packages on the Harvester appliance (there is no transactional-update binary in 1.7.x+). However, upon analyzing the Dell package, I saw that there is a second script for generating an “offline bundle.” I generated the bundle (csi-powerstore-bundle-2.16.0.tar.gz), but I also had to pull a few more Helm dependencies that the offline bundle script did not handle (actually, just one; it’s in extra-charts.tgz in the same directory). The version I’m using is the latest CSI version (2.16).

Here are the steps I followed.

On the local machine:

# git clone -b v2.16.0 https://github.com/dell/csi-powerstore.git
# cd csi-powerstore
# cd helm-charts/charts/
# for d in */; do (cd "$d" && helm dependency update); done
# tar cvzf ../../extra-charts.tgz csm*
# cd ../../dell-csi-helm-installer
# ./csi-offline-bundle.sh -c

 

(the namespace and the secret were already created in my lab)

Copy the bundle and extra-charts.tgz to one of the Harvester nodes.

Edit the my-powerstoresettings.yaml file if necessary (downloaded from https://github.com/dell/helm-charts/raw/csi-powerstore-2.16.0/charts/csi-powerstore/values.yaml)


# cd /home/rancher
# tar xvzf csi-powerstore-bundle-2.16.0.tar.gz
# cd csi-powerstore-bundle-2.16.0/helm-charts/charts
# tar xvzf ../../../extra-charts.tgz
# cd ../../dell-csi-helm-installer/
# NODEUSER=rancher ./csi-install.sh --namespace csi-powerstore --values ../../my-powerstoresettings.yaml

------------------------------------------------------
> Installing CSI Driver: csi-powerstore on 1.34
------------------------------------------------------
------------------------------------------------------
> Checking to see if CSI Driver is already installed
------------------------------------------------------
------------------------------------------------------
> Verifying Kubernetes and driver configuration
------------------------------------------------------
|- Kubernetes Version: 1.34
|
|- Driver: csi-powerstore
|
|- Verifying Kubernetes version
|
|--> Verifying minimum Kubernetes version Success
|
|--> Verifying maximum Kubernetes version Failed
|
|- Verifying that required namespaces have been created Success
|
|- Verifying that required secrets have been created Success
|
|- Verifying alpha snapshot resources
|
|--> Verifying that alpha snapshot CRDs are not installed Success
|
|- Verifying Replication requirements
|
|--> Verifying that Dell CSI Replication CRDs are available Failed
|
|- Verifying iSCSI installation Failed
|
|- Verifying NVMe installation Failed
|
|- Verifying NVMe installation Failed
|
|- Verifying helm version Success

------------------------------------------------------
> Verification Complete - With Warnings
------------------------------------------------------
Warnings:
- Kubernetes version 1.34 is newer than the version that has been tested. Latest tested version is: 1.33
- To ensure the driver is fully supported run cert-csi and make sure all tests pass. More details: https://dell.github.io/csm-docs/docs/support/cert-csi/
- The CRD for DellCSIReplicationGroups is not installed. This needs to be installed if you are going to enable replication support
- Either iSCSI client was not found on node: 172.16.160.110 or not able to verify
- Either iSCSI client was not found on node: 172.16.160.111 or not able to verify
- Either iscsid service is not running on node: 172.16.160.111 or not able to verify
- Either NVMeTCP module is not loaded on node: 172.16.160.111 or not able to verify
- Either NVMeFC module is not loaded on node: 172.16.160.110 or not able to verify
- Either NVMeFC module is not loaded on node: 172.16.160.111 or not able to verify
------------------------------------------------------
WARNING:
Kubernetes validation failed but installation can continue.
This may affect driver installation.

Press 'y' to continue or any other key to exit: y
|
|- Installing Driver |
|- Installing helm dependencies Success
|
|--> Waiting for Deployment powerstore-controller to be ready Success
|
|--> Waiting for DaemonSet powerstore-node to be ready Success
------------------------------------------------------
> Operation complete
------------------------------------------------------

Even though it complains about a newer version of Kubernetes (1.34), I have found no issues so far. It also complains that it cannot find the NVMe/iSCSI clients and daemons. Looking at csi-powerstore/dell-csi-helm-installer/verify.sh, I see that it runs “ssh $NODEUSER@${node} ” to check them.
Those checks fail because SSH asks to accept the key fingerprints and then prompts for a password. Adding some ssh-copy-id commands at the start of the document would probably solve this problem.
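To make that suggestion concrete, here is a minimal sketch of what those ssh-copy-id commands could look like. The node IPs are the ones from the warnings above. The accept-new host key policy is my assumption, and I have not confirmed that key-based login alone is enough to make every check in verify.sh pass.

```shell
# Sketch: pre-authorize SSH access to each Harvester node so that the
# installer's "ssh $NODEUSER@${node}" checks do not stop at a prompt.
# Assumptions: the node IPs below come from the warnings above, and a
# local key pair already exists (ssh-copy-id uses the default identity).
NODEUSER="${NODEUSER:-rancher}"
NODES="${NODES:-172.16.160.110 172.16.160.111}"

copy_keys() {
    for node in $NODES; do
        # accept-new stores unknown host keys automatically but still
        # refuses a host whose key has changed (OpenSSH 7.6 or newer).
        ssh-copy-id -o StrictHostKeyChecking=accept-new "${NODEUSER}@${node}" || return 1
    done
}
```

Call copy_keys once from the machine where you run csi-install.sh (expect one last password prompt per node), then run the installer with the same NODEUSER.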

Additional iSCSI network configuration and iSCSI multipathd

Since the “wicked” network engine is no longer available in Harvester 1.7.x/SLE Micro 6.1 (only NetworkManager), the provided YAML that created ifcfg files is no longer valid. You need to create NetworkManager .nmconnection files (or run the corresponding nmcli commands).

PRO TIP: You can test everything manually first (for example with “nmcli”, or by starting multipathd manually) BEFORE you commit to a persistent file on /oem. Also remember that since 1.7.x, /etc is persistent by default.

When you’re ready to make things persistent on /oem, I created an online editor for CloudInit CRD files that can do this for you. It’s available at https://doccaz.github.io/harvester-cloud-init/

You can use “kubectl apply -f <FILE>” to apply a file, “kubectl get cloudinit” to see the currently applied ones, and “kubectl delete cloudinit <NAME>” to remove one. The CRD files can be scoped to run on specific nodes or on all nodes at once.

Here’s an example of one such file combining all the necessary files mentioned above.

# cat storage-vlan-config.yaml 
apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
 name: storage-vlan-config
 namespace: harvester-system
spec:
 matchSelector:
   kubernetes.io/hostname: hrv01
 filename: 99_storage_vlan_config.yaml
 contents: |
   stages:
     initramfs:
       - name: "Action: service"
         systemctl:
           enable:
             - multipathd
       - name: "Action: service"
         systemctl:
           enable:
             - iscsid
       - name: "Action: service"
         systemctl:
           start:
             - iscsid
       - name: "Action: service"
         systemctl:
           start:
             - multipathd
     boot:
       - name: "Action: write_file"
         files:
           - path: /etc/NetworkManager/system-connections/vlan66.nmconnection
             permissions: 0600
             owner: 0
             content: |
               [connection]
               id=vlan66
               uuid=fd3a8fa2-4352-4ac1-b7ba-2a75224d3c6c
               type=vlan

               [ethernet]
               mtu=9000

               [vlan]
               flags=1
               id=66
               parent=enp101s0f1

               [ipv4]
               method=manual
               address1=172.16.66.130/24

               [ipv6]
               method=disabled

       - name: "Action: run_cmd"
         commands:
           - "nmcli connection reload"
       - name: "Action: write_file"
         files:
           - path: /etc/multipath.conf
             permissions: 0644
             owner: 0
             content: |
               defaults {
                  polling_interval 5
                  checker_timeout 15
                  disable_changed_wwids yes
                  find_multipaths no
               }
               devices {
                  device {
                     vendor DellEMC
                     product PowerStore
                     detect_prio "yes"
                     path_selector "queue-length 0"
                     path_grouping_policy "group_by_prio"
                     path_checker tur
                     failback immediate
                     fast_io_fail_tmo 5
                     no_path_retry 3
                     rr_min_io_rq 1
                     max_sectors_kb 1024
                     dev_loss_tmo 10
                  }
                  device {
                     vendor .*
                     product dellemc-powerstore
                     uid_attribute ID_WWN
                     prio ana
                     failback immediate
                     path_grouping_policy "group_by_prio"
                     path_checker "none"
                     path_selector "queue-length 0"
                     detect_prio "yes"
                     fast_io_fail_tmo 5
                     no_path_retry 3
                     rr_min_io_rq 1
                     max_sectors_kb 1024
                     dev_loss_tmo 10
                  }
               }

               blacklist {
                  device {
                     vendor "!DellEMC"
                     product "!PowerStore"
                  }
                  device {
                     vendor "!.*"
                     product "!dellemc-powerstore"
                  }
               }
               blacklist_exceptions {
                  device {
                     vendor "DellEMC"
                     product "PowerStore"
                  }
                  device {
                     vendor ".*"
                     product "dellemc-powerstore"
                  }
               }

     network:
       - name: "Action: write_file"
         files:
           - path: /var/lib/iscsi/ifaces/iface130
             permissions: 0644
             owner: 0
             content: |
               iface.transport_name = tcp
               iface.net_ifacename = enp101s0f1.66
               iface.ipaddress = 172.16.66.130

IMPORTANT: Set one of the PowerStore StorageClasses as the DEFAULT, or else image uploads will get stuck at 0%. This is due to the way the scratch/prime volumes are handed over between Longhorn and PowerStore: these temporary volumes are created on Longhorn by default!
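If you prefer the command line for this, the sketch below flips the default using the standard Kubernetes annotation storageclass.kubernetes.io/is-default-class. The function names are mine, and both class names in the usage example are placeholders: check “kubectl get storageclass” for the real names in your cluster. I'm also assuming that patching the annotation directly behaves the same as setting the default through the Harvester UI.

```shell
# Sketch: make a PowerStore StorageClass the cluster default.
# Assumption: the class names passed in must match what
# "kubectl get storageclass" shows in your cluster.
DEFAULT_ANNOTATION="storageclass.kubernetes.io/is-default-class"

set_default_class() {
    # $1 = StorageClass name, $2 = "true" or "false"
    kubectl patch storageclass "$1" \
        -p "{\"metadata\":{\"annotations\":{\"${DEFAULT_ANNOTATION}\":\"$2\"}}}"
}

switch_default_class() {
    # $1 = current default class, $2 = new default class
    # Unset the old default first so two classes are never default at once.
    set_default_class "$1" false && set_default_class "$2" true
}
```

For example: switch_default_class <CURRENT_DEFAULT_CLASS> <POWERSTORE_CLASS>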

Troubleshooting Tips

How to test for jumbo frames


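Try sending a large packet with fragmentation disallowed (8972 bytes of payload plus 28 bytes of IP and ICMP headers add up to a 9000-byte MTU):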

# ping -M do -s 8972 172.16.66.230


If the ping fails, check the current MTU:

# ip link show ens1f0 
6: ens1f0: mtu 1500 qdisc mq master mgmt-bo state UP mode DEFAULT group default qlen 1000
link/ether 70:ca:9b:ce:a2:94 brd ff:ff:ff:ff:ff:ff altname enp18s0f0

Set the MTU manually:

# ip link set dev ens1f0 mtu 9000


Check it was set:

# ip link show ens1f0 
6: ens1f0: mtu 9000 qdisc mq master mgmt-bo state UP mode DEFAULT group default qlen 1000
link/ether 70:ca:9b:ce:a2:94 brd ff:ff:ff:ff:ff:ff altname enp18s0f0
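If you test with other MTU values, the payload size for ping changes accordingly. Here is a tiny helper, just as a sketch, that subtracts the 20-byte IPv4 header and the 8-byte ICMP header and prints the matching ping command. The function name and the “-c 3” packet count are my own choices; the target address is the one used above.

```shell
# Sketch: derive the largest ICMP payload that fits in a given MTU.
# 20 bytes of IPv4 header plus 8 bytes of ICMP header are subtracted.
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Print (not run) the ping command for a 9000-byte MTU.
MTU=9000
TARGET=172.16.66.230
echo "ping -M do -c 3 -s $(icmp_payload "$MTU") $TARGET"
```

For a standard 1500-byte MTU, the same calculation gives a payload of 1472 bytes.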

iSCSI debugging


Checking the status of the current sessions:

# iscsiadm -m session 


Rescanning the sessions (this picks up new or resized LUNs; it does not log the sessions out and back in):

# iscsiadm -m session --rescan

PowerStore debugging

Checking the controller pods/deployment:

# kubectl get pods -n csi-powerstore
# kubectl describe deploy -n csi-powerstore powerstore-controller

There are 6 containers internally:

  • attacher
  • resizer
  • provisioner
  • snapshotter
  • csi-metadata-retriever
  • driver

You can look at the logs for each container directly:

# kubectl logs -l name=powerstore-controller -c driver -n csi-powerstore

Add “-f” to follow the logs in real-time.
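When I don't know yet which container is misbehaving, it can help to dump a few lines from all of them in one go. The loop below is a sketch using the container names, label and namespace shown above. The function name and the default of 20 lines are arbitrary choices of mine.

```shell
# Sketch: print the last lines of every container in the controller pods.
# The container names, label and namespace are the ones used above.
NAMESPACE="${NAMESPACE:-csi-powerstore}"
CONTAINERS="attacher resizer provisioner snapshotter csi-metadata-retriever driver"

controller_logs() {
    # $1 = number of lines per container (defaults to 20)
    for c in $CONTAINERS; do
        echo "===== $c ====="
        kubectl logs -l name=powerstore-controller -c "$c" \
            -n "$NAMESPACE" --tail="${1:-20}"
    done
}
```

For example, “controller_logs 50” shows the last 50 lines of each container.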


PVC creation debugging

You can run “kubectl get pvc -n <NAMESPACE>” to see all the PVCs and their current status. Use “kubectl describe pvc -n <NAMESPACE>” to get a detailed description of the current status. Cross-check the IDs found here with the PowerStore logs, especially those of the provisioner and attacher containers, to find possible issues.
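On a busy cluster the PVC list gets long, so a small filter for anything that is not Bound can save some scrolling. This is only a sketch: the function name is mine, and it assumes the column order that “kubectl get pvc -A --no-headers” prints (NAMESPACE, NAME, STATUS, ...).

```shell
# Sketch: list only the PVCs that are not in the Bound state.
# With "-A --no-headers" the columns are NAMESPACE, NAME, STATUS, ...
# so the status is the third field.
not_bound() {
    awk '$3 != "Bound" { print $1 "/" $2 " is " $3 }'
}
```

Usage: kubectl get pvc -A --no-headers | not_bound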

Be aware that in Harvester whenever you’re uploading an “Image” (like an ISO or qcow2), internally it will pre-allocate a “prime PVC” and a “scratch PVC”. These are used for pre-allocating the necessary space while the data is streamed to it. After the upload is done, the data is cloned to the real PVC provided by PowerStore, and the temporary PVCs are deleted.

IMPORTANT: There is a known issue where you MUST define a default StorageClass that uses PowerStore or else the uploads get stuck. That occurs because by default Harvester uses Longhorn to provision the prime/scratch PVCs, and that can take a while before control is passed on to PowerStore, which is responsible for allocating the “final” PVC. As a result, the PowerStore CSI “thinks” that the provisioning of the prime/scratch PVCs failed and gets into a loop, even if those PVCs were successfully provisioned by Longhorn.


Seeing the global cluster events

You can run “kubectl get events -A” to see all the events stored on the cluster, or “kubectl get events -A -w” to keep following the events as they occur. This can be useful to correlate which parts/containers are involved in each step. For example, you may see volume operations like attach or mount happening in PowerStore while virt-launcher operations are happening on the KubeVirt side.
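The full event list can be noisy. As a sketch, the helper below narrows it down to Warning events, sorted by creation time, optionally limited to one namespace. The function name is mine; the --field-selector and --sort-by flags are standard kubectl options.

```shell
# Sketch: show only Warning events, oldest first.
warning_events() {
    # $1 = optional namespace; all namespaces when omitted
    if [ -n "${1:-}" ]; then scope="-n $1"; else scope="-A"; fi
    # $scope is left unquoted on purpose so "-n <ns>" splits into two words.
    kubectl get events $scope --field-selector type=Warning \
        --sort-by=.metadata.creationTimestamp
}
```

For example, “warning_events csi-powerstore” shows only the warnings in the driver's namespace.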

How to restart a Deployment

Restarting a deployment will terminate all the related pods and recreate them. This is useful when you changed something (or something else in the cluster changed status) and you want to re-deploy the pods again without waiting for the regular timeouts. Don’t be afraid! This is how Kubernetes is supposed to work. The pods will be recreated exactly as intended on the deployment object.

First, let’s find out exactly what the deployment name is. If you already know the namespace:

# kubectl get deploy -n <NAMESPACE>

 

If you don’t know the namespace:

# kubectl get deploy -A

 

Look for the deployment name in the NAME column (the second column when you use “-A”, since the namespace comes first). The “READY” column shows actual replicas/desired replicas.

Restart the affected deployment:

# kubectl rollout restart deployment <DEPLOYMENT> -n <NAMESPACE>


You can follow the deployment with:

# kubectl rollout status deployment <DEPLOYMENT> -n <NAMESPACE>

 

Or, if you’re a bit impatient like me, monitor for the pods directly:

# kubectl get pods -n <NAMESPACE> -w
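The restart and the wait can also be chained in a single helper, sketched below. The function name and the 120s default timeout are assumptions of mine; --timeout is a standard option of “kubectl rollout status”.

```shell
# Sketch: restart a deployment and block until the rollout finishes
# (or the timeout expires).
restart_and_wait() {
    # $1 = namespace, $2 = deployment name, $3 = timeout (defaults to 120s)
    kubectl rollout restart "deployment/$2" -n "$1" &&
        kubectl rollout status "deployment/$2" -n "$1" --timeout="${3:-120s}"
}
```

For example: restart_and_wait csi-powerstore powerstore-controller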

 

I’ll keep posting tips and tricks whenever I can. Enjoy! 

Erico Mendonca Solution Architect @ SUSE