Get Up and Running with NVIDIA GPUs in Rancher Kubernetes Clusters
With massive adoption of Kubernetes at enterprises worldwide, we are seeing Kubernetes going to new extremes. On the one hand, Kubernetes is being adopted for workloads at the edge, delivering value beyond the data center. On the other hand, Kubernetes is being used to drive machine learning (ML) and high-quality, high-speed data analysis. The ML activity we are seeing results from developments in Kubernetes starting around v1.10, when graphics processing units (GPUs) became a schedulable resource (a feature that is now in beta). Taken independently, these are both really exciting developments in Kubernetes. What is perhaps even more exciting is the adoption of GPUs with Kubernetes both in the data center and at the edge. In the data center, GPUs are a means to train ML models. Those trained models are moved out to edge Kubernetes clusters as ML inference tools that provide data analysis as close as possible to where the data is collected.
In the early days, Kubernetes was about providing a pool of CPU and RAM resources as a foundation for distributed applications. If we can have CPU and RAM pools, why not a GPU pool? Sure, no problem, but not all servers have GPUs. So, how can we make our servers with GPUs available in Kubernetes?
In this blog, I’ll lay out the absolute easiest way to start using GPU resources in Kubernetes clusters. In a future blog, we’ll take our GPUs to the edge with Kubernetes and show you how to accomplish that as well. To really simplify things, I will describe the process for enabling GPUs in terms of the Rancher user interface. The Rancher UI is simply a client of the Rancher RESTful APIs. You can use other API clients, such as Golang, Python and Terraform, in GitOps, DevOps and other automated solutions. We won’t delve into any of those here.
Fundamentally, the process is simple:
- Build out your infrastructure for a Kubernetes cluster
- Install Kubernetes
- Install the gpu-operator via Helm
Getting Up and Running with Rancher and Available GPU Resources
Rancher is a multi-cluster management solution, and it is both the initiator and the glue for the process steps above. NVIDIA describes a pure NVIDIA solution for simplified GPU management in its blog, along with some great information about how the gpu-operator differs from building the GPU driver stack without the operator.
Here is the bill of materials (BOM) for getting up and running with GPUs in Rancher:
There is plenty of documentation to set up an HA installation of Rancher, so we will assume you already have Rancher installed.
Install a Kubernetes cluster with GPUs
With Rancher installed, we will first build and configure a Kubernetes cluster (you can use any cluster with NVIDIA GPUs).
Under the Global context, we select Add Cluster, and under the section “With RKE and new nodes in an infrastructure provider,” we select the Amazon EC2 provider.
We are doing this with the node drivers – a set of preconfigured infrastructure templates, some of which have GPU resources.
Notice that there are three node pools: one for the masters, one for standard workers and one for workers with GPUs. The GPU template is based on the p3.2xlarge machine type and uses an Ubuntu 18.04 Amazon Machine Image (AMI), ami-0ac80df6eff0e70b5. These selections will, of course, vary according to each infrastructure provider and enterprise needs. We also left the Kubernetes options in the “Add Cluster” form set to the defaults.
Set Up the GPU Operator
Now we will set up a catalog in Rancher using the GPU Operator repository. (Alternative solutions to expose the GPUs include using the Linux for Tegra [L4T] Linux distribution or the device plugin.) As of this writing, the GPU Operator has been tested and verified with the NVIDIA Tesla Driver 440.
Using the Rancher Global context menu, we select the cluster we want to install into and then use the Tools menu to see the list of catalogs.
We click the Add Catalog button and give it a name and add the URL https://nvidia.github.io/gpu-operator.
We have also selected Helm v3 and the cluster scope. We click Create to add the catalog to Rancher. When using automation, we can do this step as part of the cluster buildout. Depending on enterprise policies, we could add this catalog to every cluster even if it does not yet have GPU nodes or node pools. This step provides us with access to the GPU Operator chart, which we will install next.
Now we want to use the Rancher context menu in the top left to get into the “System” project for the cluster where we added the GPU Operator capability.
In the System project, select Apps and click the Launch button on the top right.
We can search for “nvidia” or scroll down to the catalog we just created.
Click on the gpu-operator app and, at the bottom of the following page, click Launch.
All the defaults should work without issue in this case. Again, we can add this step to automation through the Rancher APIs.
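For readers who prefer a command line to the UI, the catalog and launch steps above correspond roughly to adding the chart repository and installing the chart with the Helm v3 CLI. The following is a minimal sketch, not what Rancher runs internally: the repository alias `nvidia`, the release name and the `gpu-operator` namespace are assumptions chosen for illustration, and only the repository URL comes from the steps above. The script only prints the commands, so we can review them before running anything.

```python
import shlex

# Chart repository URL, the same one used for the Rancher catalog above.
GPU_OPERATOR_REPO_URL = "https://nvidia.github.io/gpu-operator"


def gpu_operator_helm_commands(repo_name="nvidia",
                               release="gpu-operator",
                               namespace="gpu-operator"):
    """Return Helm v3 commands that add the repo and install the chart.

    repo_name, release and namespace are illustrative defaults (assumptions),
    not values taken from Rancher.
    """
    return [
        ["helm", "repo", "add", repo_name, GPU_OPERATOR_REPO_URL],
        ["helm", "repo", "update"],
        ["helm", "install", release, f"{repo_name}/gpu-operator",
         "--namespace", namespace, "--create-namespace", "--wait"],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for cmd in gpu_operator_helm_commands():
        print(shlex.join(cmd))
```

To actually execute the commands, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where `helm` is configured against the target cluster.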
Make use of the GPUs
Now that GPUs are accessible, we can deploy a GPU-capable workload. We can also verify that this installation was successful by looking at the Cluster -> Nodes page in Rancher. We see that the GPU Operator has installed Node Feature Discovery (NFD) and kindly labeled our nodes for GPU usage.
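As an illustration of what a GPU-capable workload looks like, here is a minimal sketch that builds a Pod manifest requesting one GPU through the `nvidia.com/gpu` resource limit. The pod name and the image are hypothetical placeholders: substitute any CUDA-capable image you trust. Running `nvidia-smi` as the command assumes the NVIDIA container runtime makes that tool available inside the container.

```python
import json


def gpu_pod_manifest(image, name="gpu-smoke-test", gpus=1):
    """Build a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs.

    `name` is an illustrative default; `image` must be supplied by the caller.
    """
    if gpus < 1:
        raise ValueError("a GPU workload must request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "cuda",
                "image": image,
                # Assumption: nvidia-smi is available in the running container.
                "command": ["nvidia-smi"],
                # GPUs are requested as whole units under resource limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical image reference, for illustration only.
    manifest = gpu_pod_manifest(image="registry.example.com/cuda-sample:latest")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be piped to `kubectl apply -f -`. The pod will only be scheduled onto a node that advertises a free `nvidia.com/gpu` resource.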
This simplified approach to getting Kubernetes up and running with GPUs takes advantage of these three things:
- The GPU Operator from NVIDIA
- Node Feature Discovery from the Kubernetes SIGs project of the same name
- Rancher’s cluster deployment and catalog app integration
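The node labels applied by Node Feature Discovery can also be checked from a script rather than from the Nodes page. The sketch below takes the parsed output of `kubectl get nodes -o json` and reports how many GPUs each labeled node advertises. It assumes the label key `feature.node.kubernetes.io/pci-10de.present` (10de being NVIDIA's PCI vendor ID), which may differ depending on how NFD is configured in your cluster. The node names in the sample data are made up.

```python
# Assumption: NFD labels NVIDIA hardware with this key (10de = NVIDIA PCI vendor ID).
NVIDIA_PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(node_list):
    """Map node name -> allocatable GPU count for nodes labeled as NVIDIA hosts.

    `node_list` is the dict parsed from `kubectl get nodes -o json`.
    """
    result = {}
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if labels.get(NVIDIA_PCI_LABEL) != "true":
            continue
        allocatable = node.get("status", {}).get("allocatable", {})
        # Allocatable quantities are strings; a missing key means no GPUs
        # are schedulable on that node yet.
        result[metadata.get("name", "")] = int(allocatable.get(GPU_RESOURCE, "0"))
    return result


if __name__ == "__main__":
    # Made-up sample standing in for real `kubectl get nodes -o json` output.
    sample = {
        "items": [
            {"metadata": {"name": "worker-gpu-1",
                          "labels": {NVIDIA_PCI_LABEL: "true"}},
             "status": {"allocatable": {"cpu": "8", GPU_RESOURCE: "1"}}},
            {"metadata": {"name": "worker-std-1", "labels": {}},
             "status": {"allocatable": {"cpu": "4"}}},
        ]
    }
    print(gpu_nodes(sample))
```

A node that carries the label but reports zero allocatable GPUs usually means the operator's driver and device plugin pods have not finished starting, so it is worth checking again after a few minutes.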
Watch for another blog in this space, where we take our GPUs to the edge.