Get Up and Running with NVIDIA GPUs in Rancher Kubernetes Clusters


Discover what’s new in Rancher 2.5

With massive adoption of Kubernetes at enterprises worldwide, we are seeing Kubernetes going to new extremes. On the one hand, Kubernetes is being adopted for workloads at the edge and delivering value beyond the data center. On the other hand, Kubernetes is being used to drive Machine Learning (ML) and high-quality, high-speed data analysis capabilities. The activity we are seeing with ML results from developments in Kubernetes starting around v1.10, when Graphics Processing Units (GPUs) became a schedulable resource – a feature that is now in beta. Taken independently, these are both really exciting developments in Kubernetes. What is perhaps even more exciting is the adoption of GPUs with Kubernetes both in the data center and at the edge. In the data center, GPUs are a means to train ML models. Those trained models are then moved out to edge Kubernetes clusters as ML inference tools, providing data analysis as close as possible to where the data is collected.

In the early days, Kubernetes was about providing a pool of CPU and RAM resources as a foundation for distributed applications. If we can have CPU and RAM pools, why not a GPU pool? Sure, no problem – but not all servers have GPUs. So how can we make the GPUs in our servers available to Kubernetes?

In this blog, I’ll lay out the absolute easiest way to start using GPU resources in Kubernetes clusters. In a future blog, we’ll take our GPUs to the edge with Kubernetes and show you how to accomplish that as well. To really simplify things, I will describe the process for enabling GPU in terms of the Rancher user interface. The Rancher UI is simply a client to the Rancher RESTful APIs. You can use other clients to the APIs, such as Golang, Python and Terraform, in GitOps, DevOps and other automated solutions. We won’t delve into any of those here.

Fundamentally, the process is simple:

  1. Build out your infrastructure for a Kubernetes cluster
  2. Install Kubernetes
  3. Install the gpu-operator via Helm
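For readers scripting step 3 outside of Rancher, a minimal sketch of the Helm install looks like the following. This assumes Helm v3 and uses NVIDIA's public Helm repository; chart flags and defaults can vary by GPU Operator version, so treat it as illustrative rather than definitive.

```shell
# Add NVIDIA's Helm repository (the same chart source we will later
# register as a Rancher catalog) and refresh the local chart index.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator with default values into its own namespace.
# --wait blocks until the operator's components report ready.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```

The rest of this post accomplishes the same result through the Rancher catalog and Apps UI, which is the path we recommend when Rancher is managing the cluster.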

Getting Up and Running with Rancher and Available GPU Resources

Rancher is a multi-cluster management solution and is the initiator and glue for the process steps above. You can find a pure NVIDIA solution to simplified GPU management in their blog, along with some great information about how the gpu-operator differs from building the GPU driver stack without the operator.


Here is the bill of materials (BOM) for getting up and running with GPUs in Rancher:

  1. Rancher
  2. GPU Operator
  3. Infrastructure – we’ll use GPU nodes on AWS

There is plenty of documentation to set up an HA installation of Rancher, so we will assume you already have Rancher installed.
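If you are starting from scratch, the Helm-based Rancher install is brief. The sketch below follows Rancher's documented chart install; the hostname is a placeholder, and cert-manager (required for Rancher-generated certificates) is assumed to already be in place per the Rancher docs.

```shell
# Add the Rancher Helm repository and refresh the chart index.
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo update

# Rancher installs into the cattle-system namespace.
kubectl create namespace cattle-system

# rancher.example.com is a placeholder hostname for your environment.
helm install rancher rancher-latest/rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com
```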

Process Steps

Install a Kubernetes cluster with GPUs

With Rancher installed, we will first build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).

Under the Global context, we select Add Cluster

Image 01

and under the section “With RKE and new nodes in an infrastructure provider,” we select the Amazon EC2 provider.

Image 02

We are doing this with the node drivers – a set of preconfigured infrastructure templates, some of which have GPU resources.

Image 03

Notice that there are three node pools: one for the masters, one for standard workers and one for workers with GPUs. The template for GPU is based on the p3.2xlarge machine type and uses an Ubuntu 18.04 Amazon Machine Image or AMI (ami-0ac80df6eff0e70b5). These selections are, of course, variable according to each infrastructure provider and enterprise needs. Also, we left the Kubernetes options in the “Add Cluster” form set to the defaults.

Set Up the GPU Operator

Now we will set up a catalog in Rancher using the GPU Operator repository. (Alternative solutions to expose the GPUs include using the Linux for Tegra [L4T] Linux distribution or the device plugin.) As of this writing, the GPU Operator has been tested and verified with the NVIDIA Tesla Driver 440.

Using the Rancher Global context menu, we select the cluster we want to install into

Image 04

and then use the Tools menu to see the list of catalogs.

Image 05

We click the Add Catalog button, give it a name and add the URL of the GPU Operator repository.

We have also selected Helm v3 and the cluster scope. We click Create to add the catalog to Rancher. When using automation, we can do this step as part of the cluster buildout. Depending on enterprise policies, we could add this catalog to every cluster even if it does not yet have GPU nodes or node pools. This step provides us with access to the GPU Operator chart, which we will install next.
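When automating this step, the same catalog can be registered through Rancher's v3 REST API. The sketch below is hypothetical: `RANCHER_URL`, `TOKEN` and `CLUSTER_ID` are placeholders, the catalog URL shown is NVIDIA's public Helm repository, and the exact endpoint and field names (such as `helmVersion`) should be verified against the API of your Rancher version.

```shell
# Hypothetical sketch: add a cluster-scoped catalog via the Rancher v3 API.
# TOKEN is a Rancher API bearer token in "user:key" form.
curl -s -u "${TOKEN}" \
  -X POST "${RANCHER_URL}/v3/clustercatalogs" \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "nvidia",
        "clusterId": "'"${CLUSTER_ID}"'",
        "url": "https://helm.ngc.nvidia.com/nvidia",
        "branch": "master",
        "helmVersion": "helm_v3"
      }'
```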

Image 06

Now we want to use the Rancher context menu in the top left to get into the “System” project for the cluster where we added the GPU Operator capability.

Image 07

In the System project, select Apps

Image 08

and click the Launch button on the top right.

Image 09

We can search for “nvidia” or scroll down to the catalog we just created.

Image 10

We click on the gpu-operator app, then at the bottom of the following page we click Launch.

Image 11

All the defaults should work without issue in this case. Again, we can add this step to automation through the Rancher APIs.

Make use of the GPUs

Now that GPUs are accessible, we can deploy a GPU-capable workload. We can also verify that this installation was successful by looking at the Cluster -> Nodes page in Rancher. We see that the GPU Operator has installed Node Feature Discovery (NFD) and kindly labeled our nodes for GPU usage.
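As a quick smoke test from the command line, you can confirm the NFD labels and run a throwaway pod that requests a GPU. This is a sketch: the NFD label key shown (PCI vendor ID `10de` is NVIDIA) and the CUDA image tag may differ across NFD and CUDA versions.

```shell
# List nodes carrying the NFD label for NVIDIA PCI devices.
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Launch a one-off pod that requests a single GPU and runs nvidia-smi.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once the pod completes, its logs should show the nvidia-smi device table.
kubectl logs pod/gpu-smoke-test
```

If the pod schedules and `nvidia-smi` prints the GPU details, the driver stack, container runtime hooks and device plugin installed by the GPU Operator are all working.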

Image 12


This simplified approach to getting Kubernetes up and running with GPUs takes advantage of these three things:

  1. The GPU Operator from NVIDIA
  2. Node Feature Discovery from the Kubernetes SIG of the same name
  3. Rancher’s cluster deployment and catalog app integration

Watch for another blog in this space, where we take our GPUs to the edge.

