
Scaling Kubernetes GitOps with Fleet: Experiment Results and Lessons Learnt


Fleet, Rancher’s built-in GitOps engine, is designed to scale up to thousands of clusters. But how far can it scale in a real-world scenario, you might ask?
Earlier this year, we wrote about the Fleet benchmark tool, and we made a few discoveries that were very instructive, especially concerning resource consumption and its impact on deployment performance.

To come up with some limits and a set of best practices for running the operators safely, I deployed Fleet on a four-node RKE2 cluster on AWS.
Our focus is on scaling the number of clusters, so all tests will use up to 50 bundles. A Fleet ‘bundle’ represents deployable content, e.g. a single folder from the Git repository. While perhaps not representative of typical Fleet usage, deploying all bundles to all clusters simultaneously makes for an excellent stress test.

The scaling experiments did not cover polling large numbers of Git repositories, large bundles with huge manifests, or bundles with lots of resources. They also did not cover the latency and agility of resource updates, nor all the other possible dimensions of scaling. Instead, I focused on deploying workloads from Git to a large fleet of clusters.

The experiments tested Fleet in “standalone” mode, without a Rancher manager, to eliminate noise while testing. Furthermore, I aimed to exceed the 500 clusters covered in Rancher’s official documentation.

Fleet deployments at scale

The requirements say that to deploy 50 applications to 500 clusters, one needs 16 vCPUs and 64 GB of RAM. However, I wanted to use an identical setup for all experiments, so I chose the c5d.12xlarge instance type. With 48 vCPUs and 96 GB of RAM, that instance type is larger than required. The Fleet controllers ran on a dedicated RKE2 agent node; the other three nodes of my RKE2 setup ran etcd and the k8s API server.

Setting up the downstream clusters was the most complicated part of the test setup. In the end, I was able to create and register 500 virtual k3k clusters in 35 minutes. The 500 virtual clusters ran in a separate cluster, on six r6gd.8xlarge agent nodes, with each node running about 330 pods. This number of pods is not officially supported, but the virtual clusters only ran the Fleet agent.

Running such small, virtual downstream clusters made the test environment easier to set up. However, a key limitation of sharing a node across multiple virtual clusters is the restricted complexity of deployable bundles: should one virtual cluster over-utilize resources, it risks destabilizing several other clusters on the same node, especially with so many pods per node and little headroom. Using bigger, preferably real, downstream clusters would have made the experiment too costly and the setup more complicated. The experiments used a bundle that contains only a single ConfigMap. This works nicely, as I did not intend to measure downstream deployment speed, but how well the Fleet control plane can produce inputs for the agents.
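
For illustration, a bundle of this kind could be nothing more than a Git folder containing a single ConfigMap manifest, targeted at every registered cluster through a GitRepo resource. The repository URL, folder name and ConfigMap content below are hypothetical placeholders; only the GitRepo fields follow Fleet’s documented API.

```yaml
# Hypothetical bundle folder in Git: test-bundle/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: benchmark-data
data:
  key: "value"
---
# GitRepo applied to the upstream cluster, pointing Fleet at that folder
# and targeting all registered clusters.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: benchmark
  namespace: fleet-default
spec:
  repo: https://github.com/example/fleet-benchmark-bundles   # placeholder URL
  branch: main
  paths:
    - test-bundle
  targets:
    - clusterSelector: {}   # an empty selector matches all clusters
```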

With a Fleet agent running on each cluster, I could start measuring.

Deployment Workflow

Fleet bundles deployment workflow

Medium-Sized Experiment

Fleet was able to consistently deploy 50 bundles to 500 clusters in 90 seconds. Updating all bundles in Git and waiting for the change to be applied to all clusters took the same amount of time.

Running the full benchmark suite against 500 clusters took 8 minutes, as opposed to 1 minute for two clusters. The individual experiments showed:

  • initial bundle creation, including Git cloning: 3 seconds for one repo, 8 seconds for 50
  • targeting: 40 seconds
  • deployment phase: 4 minutes

This worked well, but the setup required some special attention, like increasing the number of reconcile workers and using a node selector to ensure enough CPU is available. More on that in the “Lessons Learnt” section.

Large Experiment

Building on the success of the medium-sized experiment, I wanted to push the limits further.
I tried 2000 downstream clusters. The setup remained the same, but I used 23 nodes to run virtual clusters, instead of 7.

It took 7 minutes, instead of 90 seconds, to roll out 50 bundles to 2000 clusters. CPU usage for the controllers, the API server and etcd was very high.
At this scale, the number of resources stored in etcd clearly became a limiting factor. Fleet creates a BundleDeployment resource for each bundle-cluster combination; that’s 100k resources. Listing that many resources with kubectl can take up to 20 seconds.
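
If you want to see this on your own upstream cluster, a quick way to count the BundleDeployments across all cluster namespaces is sketched below. It assumes kubectl access to the Fleet management cluster; the timing will of course vary with your setup.

```sh
# Count BundleDeployments across all namespaces and time how long the listing takes.
time kubectl get bundledeployments.fleet.cattle.io -A --no-headers | wc -l
```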

Fleet resources relationship

Additionally, sensitive information, like Helm values, is stored in secrets. So, the number of objects stored in etcd doubles when using Helm-style bundles. The actual number of resources in etcd is:

  • Manifest- or kustomize-style bundles: clusters * bundles
  • Helm-style bundles: clusters * bundles * 2
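
Applied to the large experiment, that is 2,000 clusters * 50 bundles = 100,000 BundleDeployments for manifest-style bundles; the same setup with Helm-style bundles would put roughly 200,000 objects into etcd (100,000 BundleDeployments plus 100,000 secrets).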

Waiting for 100k resources

The performance impact was surprisingly high. Both resource types, the secret and the BundleDeployment, are normally very small. The actual bundle content is stored in another resource, which Fleet deduplicates; in theory, all 100k bundle deployments could re-use the same content resource. It’s the sheer number of resources, and the constant reading and updating of them, that overwhelms the k8s API.
While the size of a resource makes etcd grow bigger, it is the number of resources that makes the API server slower.

Additionally, etcd keeps a copy of every update to a resource. At this scale, the etcd blob store size had to be increased from 2 GB to 8 GB.
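
As a rough sketch of what that can look like: on RKE2, extra etcd flags can be passed through the server’s config file, and etcd’s storage quota is controlled by its --quota-backend-bytes flag. The snippet below is illustrative only; check your distribution’s documentation before applying it.

```yaml
# /etc/rancher/rke2/config.yaml on the server/etcd nodes (illustrative sketch)
etcd-arg:
  - "quota-backend-bytes=8589934592"   # raise the etcd quota to 8 GiB
```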

Too Big Experiment

Creating 50 bundles on 4000 clusters is where the experiment broke. The API servers and etcd would need much more tuning to handle that many resources. Fleet reached its limits, too.

For example, Fleet stores the deployment results in the resource’s status, but it doesn’t always truncate the status correctly to stay within Kubernetes’ maximum resource size. At this scale, features like per-cluster resource state reporting would need to be toggled off, or the resource status itself would become too large.

Fleet could scale much higher if BundleDeployments and their secrets were not stored natively in Kubernetes. CPU and RAM don’t seem to be the problem.

When the experiment failed, the failures were often not easy to diagnose. The OOM killer was removing seemingly random processes; the etcd database reached its limit and needed to be compacted manually while the nodes were taken offline. I also observed the k8s API server taking 20 seconds to return a list of resources. Controllers got stuck as the reconcilers increased their exponential backoff due to conflicts, and the event queue could not be worked off because there were too many events and not enough workers.

These problems remind me of my dev setup: one of the best things about Kubernetes is that you can run most setups on your laptop. I use Rancher Desktop and run k3d inside it. With just 4 GB of RAM, that works for most tasks, but add too many clusters in k3d or create too many resources in k8s, and it quickly hits limits and starts crashing. Very much like a bare-metal Linux server when you create a million small files or run out of swap…

Lessons Learnt

Scale       Number of Bundles   Number of Clusters   Total BundleDeployments   Control Plane Hardware
Medium      50                  500                  25,000                    16 vCPUs, dedicated node, 3 GHz CPU
Large       50                  2,000                100,000                   same as above
Too Large   50                  4,000                200,000                   same as above

The number of created resources is key. That means deploying 25 bundles to 4000 clusters should work well (25 * 4000 = 100k), while deploying a hundred bundles to 2000 clusters breaks the limits.
Of course, managing that number of clusters becomes possible when not all bundles are deployed to all clusters. Having 4000 clusters split into four groups of 1000, each group targeted by 10 different bundles, is still 40 bundles in total, but only 40k resources. That’s closer to a medium setup than a large one.

Tuning the Control Plane

Fleet Installation Settings

Fleet deployments workers count

  • Worker counts: These can be changed when installing Fleet and allow Fleet to process more events in parallel (see the example after this list):
      • controller.reconciler.workers.bundle: needed to create bundle deployments from bundles; for 50 bundles, 50 workers are sufficient.
      • controller.reconciler.workers.bundledeployment: needed for status updates and cleanup; with 100k bundle deployments, this can be set to 5000.
      • controller.reconciler.workers.gitrepo: if polling more than 50 Git repositories, this can safely be set to 200.
      • controller.reconciler.workers.cluster: needed to update and delete bundles; for 500 clusters, this can be set to 500.
  • Sharding: Fleet supports static sharding to distribute the workload of the controllers, where each shard can be assigned to a node. This is helpful if the nodes running the Fleet controllers do not have enough CPU.
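
As an illustration, the worker counts map to Helm values of the Fleet chart, so an installation tuned along the lines of the medium experiment could look roughly like the following. The chart reference and version are placeholders; the value keys are the ones listed above.

```sh
# Illustrative sketch: tune Fleet's reconciler worker counts at install/upgrade time.
# Replace <chart-ref> and <version> with the Fleet chart you actually use.
helm upgrade --install fleet <chart-ref> --version <version> \
  --namespace cattle-fleet-system --create-namespace \
  --set controller.reconciler.workers.bundle=50 \
  --set controller.reconciler.workers.bundledeployment=5000 \
  --set controller.reconciler.workers.gitrepo=200 \
  --set controller.reconciler.workers.cluster=500
```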

Additionally, a bundle’s rollout settings influence deployment speed; the default is batches of 50 clusters. For more information, refer to https://fleet.rancher.io/rollout.
A larger partition size drastically reduces the number of bundle controller reconciliations and creates bundledeployments faster. Agents only react to bundledeployments, so they would simply have to do more work in parallel.
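
For reference, the rollout settings live under rolloutStrategy in a bundle’s fleet.yaml. The values below are only an example of raising the partition size; consult the rollout documentation linked above for the exact semantics and defaults.

```yaml
# fleet.yaml (per bundle folder) - illustrative rollout tuning
rolloutStrategy:
  autoPartitionSize: 20%     # larger partitions mean fewer reconciliations
  maxUnavailable: 100%       # deploy to all clusters within a partition at once
```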

Finally, if this is not enough and you need 5000 or 10000 clusters, one could run multiple Fleets that poll the same Git repositories for bundles. That way, one can scale beyond the limits of the k8s control plane. However, the Fleets would be independent of each other, which could even be advantageous: each Fleet could represent a geographical region, for example.

The discussion continues

Fleet v0.14.0 is now available and addresses several issues identified during the experiment, including better management of Fleet’s internal resources. You can review the release details on GitHub. What are your experiences with scaling GitOps? We would like to know. Join us on Rancher’s Slack in the #fleet channel.
