Accelerating Machine Learning with MLOps and FuseML: Part One

星期日, 25 七月, 2021

Building successful machine learning (ML) production systems requires a specialized re-interpretation of the traditional DevOps culture and methodologies. MLOps, short for machine learning operations, is a relatively new engineering discipline and a set of practices meant to improve the collaboration and communication between the various roles and teams that together manage the end-to-end lifecycle of machine learning projects.

Helping enterprises adapt and succeed with open source is one of SUSE’s key strengths. At SUSE, we have the experience to understand the difficulties posed by adopting disruptive technologies and accelerating digital transformation. Machine learning and MLOps are no different.

The SUSE AI/ML team has recently launched FuseML, an open source orchestration framework for MLOps. FuseML brings a novel holistic interpretation of MLOps advocated practices to help organizations reshape the lifecycle of their Machine Learning projects. It facilitates frictionless interaction between all roles involved in machine learning development while avoiding massive operational changes and vendor lock-in.

This is the first in a series of articles that provides a gradual introduction to machine learning, MLOps and the FuseML project. We start here by rediscovering some basic facts about machine learning and why it is a fundamentally atypical technology. In the next articles, we will look at some of the key MLOps findings and recommendations and how we interpret and incorporate them into the FuseML project principles.

MLOps Overview

Old habits that need changing can be difficult to unlearn, even more difficult than re-learning everything. It’s true for people, and it’s even truer for teams and organizations where the combined inertia that makes important changes difficult to implement is several orders of magnitude greater.

With the AI hype on the rise, organizations have been investing more and more in machine learning to make better and faster business decisions or automate key aspects of their operations and production processes. But if history taught us anything about adopting disruptive software technologies like virtualization, containerization and cloud computing, it’s that getting results doesn’t happen overnight. It often requires significant operational and cultural changes. With machine learning, this challenge is very pronounced, with more than 80 percent of AI projects failing to deliver business outcomes, as reported by Gartner in 2019 and repeatedly confirmed by business analysts and industry leaders throughout 2020 and 2021.

Naturally, following this realization about the challenges of using machine learning in production, a lot of effort went into investigating the “whys” and “whats” about this state of affairs. Today, the main causes of this phenomenon are better understood. A brand new engineering discipline – MLOps – was created to tackle the specific problems that machine learning systems encounter in production.

The recommendations and best practices assembled under the MLOps label are rooted in the recognition that machine learning systems have specialized requirements that demand changes in the development and operational project lifecycle and organizational culture. MLOps doesn’t propose to reinvent how we do DevOps with software projects. It’s still DevOps but pragmatically applied to machine learning.

MLOps ideas can be traced back to the defining characteristics of machine learning. The remainder of this article is focused on revisiting what differentiates machine learning from conventional programming. We’ll use the fundamental insights in this exercise as stepping stones when we dive deeper into MLOps in the next chapter of this series.

Machine Learning Characteristics

Solving a problem with traditional programming requires a human agent to formulate a solution, usually in the form of one or more algorithms, and then translate it into a set of explicit instructions that the computer can execute efficiently and reliably. Generally speaking, conventional programs, when correctly developed, are expected to give accurate results and to have highly predictable and easily reproducible behaviors. When a program produces an erroneous result, we treat that as a defect that needs to be reproduced and fixed. As a best practice, we also process conventional software through as much testing as possible before deploying it in production, where the business cost incurred for a defect could be substantial. We rely on the results of proactive testing to give us some guarantees about how the program will behave in the future, another characteristic derived from the predictability aspect of conventional software. As a result, once released, a software product is expected to take significantly less effort to maintain compared to development.

Some of these statements are highly generic. One might say they could even be used to describe products in general, software or otherwise. They all have in common that they no longer hold as entirely valid when applied to machine learning.

Machine learning algorithms are distinguished by their ability to learn from experience (i.e., from patterns in input data) to behave in a desired way, rather than being programmed to do so through explicit instructions. Human interaction is only required during the so-called training phase when the ML algorithm is carefully calibrated and data is fed into it, resulting in a trained program, also called an ML model. With proper automation in place, it may even seem that human interaction could be eliminated. Still, as we’ll see later in this post, it’s just that the human responsibilities shift from programming to other activities, such as data collection and processing and ML algorithm selection, tuning and monitoring.

Machine learning can be used to solve a specific class of problems:

  • the problem is extremely difficult to solve mathematically or programmatically, or it has only solutions that are too computationally expensive to be practical
  • a fair amount of data exists (or can be generated) containing a pattern that an ML algorithm can learn

Let’s look at two examples, similar but situated at opposite ends of the spectrum as far as utility is concerned.

Sum of Two Numbers

A very simple example, albeit with no practical application whatsoever, is training an ML model to calculate the sum of two real numbers. Doing this with conventional programming is trivial and always yields very accurate results.

Training and using an ML model for the same task could be summarized by the following phases:

Data Preparation

First, we need to prepare the input data that will be used to train the ML model. Generally speaking, training data is structured as a set of entries. Each entry associates a concrete set of values used as input for the target problem with the correct answer (sometimes known as a target or label in ML terms). In our example, each entry maps a pair of real input values (X, Y) to the desired result (X+Y) that we expect the model to learn to compute. For this purpose, we can generate the training data entirely using conventional programming. Still, it’s often the case with machine learning that training data is not readily available and expensive to acquire and prepare. The code used to generate the input dataset could look like this:

import numpy as np 
train_data = np.array([[1.0,1.0]])
train_targets = np.array([2.0])
for i in range(3,10000,2):
  train_data = np.append(train_data,[[i,i]],axis=0)
  train_targets = np.append(train_targets,[i+i])

Deciding what kind of data is needed, how much of it and how it needs to be structured and labeled to yield acceptable results during ML training is the realm of data science. The data collection and preparation phase is critical to ensuring the success of ML projects. It takes experimentation and experience to find out which approach yields the best result, and data scientists often need to iterate several times through this phase and improve the quality of their training data to raise the accuracy of ML models.

Model Training

Next, we need to define the ML algorithm and train it (also known as fitting) on the input data. For our goal, we can use an Artificial Neural Network (ANN) suitable for this type of problem (regression). The code for it could look like this:

import tensorflow as tf
from tensorflow import keras
import numpy as np


model = keras.Sequential([
  keras.layers.Flatten(input_shape=(2,)),
  keras.layers.Dense(20, activation=tf.nn.relu),
  keras.layers.Dense(20, activation=tf.nn.relu),
  keras.layers.Dense(1)
])


model.compile(optimizer='adam', 
  loss='mse',
  metrics=['mae'])


model.fit(train_data, train_targets, epochs=10, batch_size=1)

Similar to data preparation, deciding which ML algorithm to use and what values should be configured for its parameters for best results (e.g., the neural network architecture, optimizer, loss, epochs) requires specific ML knowledge and iterative experimentation. However, by now, ML is mature enough to make finding an algorithm that fits the problem not difficult, especially given that there are countless open source libraries, examples, ready-to-use ML models and documented use-case patterns and recipes available for all major classes of problems that can be solved with ML, that one can start from. Moreover, many of the decisions and activities required to develop a high-performing ML model (e.g., hyper-parameter tuning, neural architecture search) can already be fully automated or accelerated through partial automation through a special category of tools called AutoML.

Model Prediction

We now have a trained ML model that we can use to calculate the sum of any two numbers (i.e. make predictions):

def sum(x, y):
  s = model.predict([[x, y]])[0][0]
  print("%f + %f = %f" % (x, y, s))

The first thing to note is that the summation results produced by the trained model are not at all accurate. It’s fair to say that the ML model is not behaving like it’s calculating the result, but more like it’s giving a ballpark estimation of what the result might be, as shown in this set of examples:

# sum(2000, 3000)
2000.000000 + 3000.000000 = 4857.666992
# sum(4, 5)
4.000000 + 5.000000 = 9.347977

Another notable characteristic is, as we move further away from the pattern of values on which the model was trained, the model’s predictions get worse. In other words, the model is better at estimating summation results for input values that are more similar to the examples on which it was trained:

# sum(10, 10000)
10.000000 + 10000.000000 = 8958.944336
# sum(1000000, 4)
1000000.000000 + 4.000000 = 1318969.375000
# sum(4, 1000000)
4.000000 + 1000000.000000 = 895098.750000
# sum(0.1, 0.1)
0.100000 + 0.100000 = 0.724608
# sum(0.01, 0.01)
0.010000 + 0.010000 = 0.549576

This phenomenon is well known to ML engineers. If not properly understood and addressed, it can lead to ML specific problems that take various forms and names:

  • bias: using incomplete, faulty or prejudicial data to train ML models that end up producing biased results
  • training-serving skew: training an ML model on a dataset that is not representative of the real-world conditions in which the ML model will be used
  • data drift, concept drift or model decay: the degradation, in time, of the model quality, as the real-world data used for predictions changes to the point where the initial assumptions on which the ML model was trained are no longer valid

In our case, it’s easy to see that the model is performing poorly due to a skew situation: we inadvertently trained the model on pairs of equal numbers, which is not representative of the real-world conditions in which we want to use it. Our model also completely missed the point that addition is commutative, but that’s not surprising, given that we didn’t use training data representative of this property either.

When developing ML models to solve complex, real-world problems, detecting and fixing this type of problem is rarely that simple. Machine learning is as much an art as it is a science and engineering endeavor.

In training ML models, there is usually also a validation step involved, where the labeled input data is split, and part of it is used to test the trained model and calculate its accuracy. This step is intentionally omitted here for the sake of simplicity. The full exercise of implementing this example, with complete code and detailed explanations, is covered in this article.

The Three-Body Problem

At the other end of the spectrum is a physics (classical mechanics) problem that has inspired one of the greatest mathematicians of all times, Isaac Newton, to invent an entirely new branch of math and nowadays a source of constant frustration among high school students: Calculus.

Finding the solution to the set of equations that describe the motion of two celestial bodies (e.g., the Earth and the Moon) given their initial positions and velocities is already a complicated problem. Extending the problem to include a third body (e.g., the Sun) complicates things to the point where a solution cannot be found, and the entire system starts behaving chaotically. With no mathematical solution in sight, Newton himself felt that supernatural powers had to be at play to account for the apparent stability of our solar system.

This problem and its generalized form, the many-body problem, are so famous because solving them is a fundamental part of space travel, space exploration, cosmology and astrophysics. Partial solutions can be calculated using analytical and numerical methods, but it requires immense computational power.

All life forms on this planet are constantly used to dealing with gravity. We are well equipped to learn from experience, and we’re able to make pretty accurate predictions regarding its effects on our bodies and the objects we interact with. It is not entirely surprising that Machine Learning can estimate the motion of objects under the effect of gravity.

Using Machine Learning, researchers at the University of Edinburgh have been able to train an ML model capable of solving the three-body problem 100 million times faster than traditional means. The full story covering this achievement is available here, and the original scientific paper can be read here.

Solving the three-body problem with ML is similar to our earlier trivial example of adding two numbers together. The training and validation datasets are also generated through simulation, and an ANN is also involved here, albeit one with a more complex structure. The main differences are the complexity of the problem and ML’s immediate practical application to this use case. However, the observations previously stated about general ML characteristics apply equally to both cases, regardless of complexity and utility.

Conclusion

We haven’t even begun to look at MLOps in detail. Still, we can already identify and summarize key takeaways representative of ML in general just by comparing classical programming to Machine Learning:

  1. Not all problems are good candidates for machine learning
  2. The process of developing ML models is iterative, exploratory and experimental
  3. Developing a machine learning system requires dealing with new categories of artifacts with specialized behaviors that don’t fit the patterns of conventional software
  4. It’s usually not possible to produce fully accurate results with ML models
  5. Developing and working with machine learning based systems requires a specialized set of skills, in addition to those needed for traditional software engineering
  6. Running ML systems in the real world is far less predictable than what we’re used to with regular software
  7. Finally, developing ML systems would be next to impossible without specialized tools

Machine Learning characteristics summarized here are reflected in the MLOps discipline and distilled in the principles on which we based the FuseML orchestration framework project. The next article will give a detailed account of MLOps recommendations and how an MLOps orchestration framework like FuseML can make developing and operating ML systems an automated and frictionless experience.

The Business Case for Container Adoption

星期二, 2 四月, 2019

Developers often believe that demonstrating the need for an IT-based solution should be very easy. They should be able to point to the business problem that needs a solution, briefly explain what technology should be selected, and the funds, staff, and computer resources will be provided by the organization. Unfortunately, this is seldom the actual process that is followed.

Developing a Business Case for New Technology Isn’t Always Easy

Most organizations require that both a business and a technical case be made before a project can be approved. Depending on the size and culture of the organization, building both cases can be a long, and sometimes arduous, process.

Part of the challenge developers face can be summed up simply: business decision-makers and technical decision-makers have different priorities, use different metrics, and, in short, think differently.

Business Managers Think in Different Terms Than Developers

Business decision-makers are almost always thinking in terms of the investment required, the costs expected, and the revenues the organization can expect that can be attributed to the successful completion of the project not the technical merit, the tools selected, or the development methodology that will be deployed to complete the project.

They may use technology every day, but many think of it as a means to an end, not something they enjoy using.

As David Ingram pointed out in his recent article on business decision making, managers often use a 7-step process:

  1. Identify the problem
  2. Seek information to clarify what’s actually happening
  3. Brainstorm potential solutions
  4. Weigh the alternatives
  5. Choose an alternative
  6. Implement the chosen plan
  7. Evaluate the outcome

You’ll note that the best technology, the best approach to development, the best platform, how to achieve the best performance, how to achieve the highest levels of availability, and other technical factors that technologists consider may be seen as secondary issues. From the perspective of a business decision-maker, the extensive work that constitutes this type of evaluation might all be wrapped up into the “weigh the alternatives” step.

Factors of the Business Decision

Let’s break this down a bit. Business decision-makers will consider the *overall** investment required and weigh it against the potential benefits that might be received. This includes a number of factors that may not appear to be directly associated with a specific project.

They also will be considering if this the right project to be addressing at this time or whether other issues are more pressing.

While working with an executive at a major IT supplier, I was once told “solving the wrong problem, no matter how efficiently and well-done, is still solving the wrong problem.”

Here are a few of the factors they are likely to consider:

  • Staff: the number of staff, the levels of expertise, the amount of time they’ll need to be assigned to the project, the business overhead associated with having those people on staff, whether they should be full-time, part-time, or contractors
  • Costs: the costs of all resources required, including:
    • Data center operational costs: floor space, power, air conditioning, networking, maintenance, real estate
    • Systems: number of systems, memory required, external storage, maintenance
    • Software: software licenses, software maintenance
  • Time to market: can this project be completed quickly enough to address the needs of the market. This sometimes is called “time to profit.”
  • Revenues: will the project directly or indirectly lead to increased revenues?

If the costs of doing the project outweigh the projected revenues that can be attributed to the completion of the project, the business decision-makers are likely to look for another solution which may include not doing it at all, purchasing a packaged software product that will solve the problem in a general way, or subscribing to an online service that will address the issue.

In the end, business decision-makers will be focused on increasing the organization’s revenues and decreasing its costs.

What Developers Think About

Developers, on the other hand, tend to think more about the technical problem in front of them and how it can be solved.

What Needs to Be Accomplished

Often, a developer’s first consideration is to fully understand what needs to be accomplished to address the situation. It is quite possible that the developers will be unable to focus on the issues in a way that takes into account the needs of the whole organization. This siloed perspective sometimes results in several business units solving the same problem in different, and sometimes incompatible, ways.

How It Can Be Accomplished

The next consideration for developers is how a solution can be accomplished. Developers are very busy people and need to get things done quickly and efficiently. This often means that they select development tools and methodology they are most familiar with rather than casting about to discover new, and potentially better, approaches. The result is that, from an outsider’s perspective, developers will select the same tool regardless of if it is the best one for the job. As Abraham Maslow pointed out, “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail” (“The Psychology of Science” 1962).

How To Systematize or Automate Solutions

Developers have a tendency to also focus on how to systematize or automate the approach to accomplishing a solution. Developers who have experience introducing new systems will not only consider how to accomplish this difficult task, but also whether the current manual processes have some merit as well.

Costs Are Often Ignored or Secondary to Other Considerations

Developers often do not have access to reports showing the overall costs, the investment required, or even the revenues of a given project. Since they are busy working on projects, they often don’t think about those factors at all. This situation, by the way, is the root of many communication challenges faced when developers are attempting to persuade business decision-makers to approve a project. They don’t have all of the data they need.

I’m reminded of a conversation with a CFO of my company who didn’t understand the need for a different type of database than the one used by the company for another purpose. At first I thought of him a “a man who knows the price of everything and the value of nothing,” to quote Oscar Wilde.

After thinking about his comments, I built a different justification that focused on speaking to him in his own language by discussing the project in terms of the investment required, the costs that were going to be incurred, and the revenue potential the new approach would provide. It took some work to obtain that information, but it was worth the effort in the end.

It was only after a longer conversation with the CFO that he began to be able to understand why Lotus Notes wasn’t the best tool for the creation of a transaction-oriented system for research and analysis.

Are you speaking to your business decision-makers using acronyms, development procedures and the names of open source projects you’d like to deploy? If so, you’re not helping your cause.

Where to Start

A good place to start is to think in terms of where and how money can be saved, where and how previous investments can be enhanced or reused rather than being discarded, and how your proposed project would result in increased opportunities for revenue.

It would also be wise to offer a vision of how the use of containers will help the organization achieve its overall goals, including factors such as:

  • Scaling to address the needs of a larger or at least a new market
  • Reducing overall IT costs
  • Allow the organization to rapidly adapt to a rapidly changing environment, to take advantage of emerging opportunities
  • Quickly develop new products or services
  • Reach new customers while being able to maintain relationships with today’s customer base

For Many Companies Adoption of Containers Must Be Carefully Justified

The move to a Container-based environment is one of those journeys that developers can easily understand as beneficial that can be challenging to justify to a business decision-maker.

After all, some things aren’t fully known until they’ve been done at least once. So, quantifying investments required, cost savings that will be realized, and the actual size of revenue increases can be difficult.

What can be said is that adopting Containers can reduce costs and reduce risk by supporting rapid and inexpensive prototyping of solutions. Pointing out that doing this prototyping in inexpensive cloud computing services rather than acquiring new systems would help them understand that you are focused on meeting your objects while still helping the organization keep costs under control. Tell the business decision makers that this approach also offers them a choice in the future. Once something is developed, documented, and proven to be able to do the job, it can either stay where it is or be moved in-house depending upon which will be the best overall business decision.

Where Can Containers Help a Company Reduce Costs?

Developers understand that being able to decompose a problem into smaller, more manageable problems can improve their efficiency, reduce their time-to-solution, and make reuse of code and services easier.

Reducing the Number of Operating System Instances to Maintain

Explain that containerized applications need fewer copies of operating systems when compared to using virtual machine technology, less processor power, less system memory, and less external storage. Developers can to speak in terms of reducing system requirements and how they can result in a direct savings that the business decision-makers can appreciate.

A few related factors are helpful to bring up as well. This approach reduces the number of software licenses that are required and the cost of software maintenance agreements.

Increasing the Amount of Useful Work Systems Can Accomplish

Since the systems won’t be carrying the heavy weight of unneeded operating systems for each application component or service, performance should be improved. After all, switching from one container to another is much faster than switching from one VM to another. There is no need to roll huge images into and out of storage.

Improving Productivity

Since productivity is important to most organizations, show that a move to containers is a great foundation for the use of a rapid application development and deployment (DevOps) strategy. By decomposing applications into functions, application development can be faster because functions are easier to build, document, and support. This should result in lower development costs while improving overall time to solution.

This approach also can reduce the time to deployment because functions can be developed in parallel by smaller independent teams.

Improving Application Capabilities

Adopting a container-based approach provides a number of other benefits that should be mentioned as well, including:

  • Container management and automation functions are improving all the time which should result in lower costs of administration and operations
  • Container workload management and migration technology is also improving all the time which should result in higher levels of application availability, higher levels of performance, and fewer losses due to downtime
  • Decomposing applications into independent functions and services also makes them easier to develop and maintain which should reduce the costs of development, support, and operations

Facilitating a Move to the Cloud

Most business decision-makers have read about cloud computing, but don’t really understand how it can be adopted. Help them understand that the adoption of containers can facilitate the organization’s ability to deploy functions or complete applications locally, in the cloud, or in a combined hybrid environment, quickly and easily.

So, the answer to the question of whether to move to the cloud or continue on-premise computing is “yes, both.”

Reducing Time to Profit

When the business decision-maker begins to understand the business benefits of containerization, they’ll also see that this approach not only can reduce the overall time to market for applications, but, more importantly, it can reduce the time to profit. Lower development and support costs combined with rapid development can lead to quicker streams of revenue and profit.

Establishing a Foundation for the Future

It is also helpful for the business decision-maker to understand that one of your goals is establishing a platform for the future. Containers are supported in many different computing environments, by many different suppliers, and the organization gets the benefits.

Some of those benefits are:

  • Containerized functions can be used as part of many applications without having to be rearchitected or redeveloped
  • They can be enhanced or updated as needed without requiring other unrelated functions to be changed.
  • Support of the application can be easier and less costly.
  • Scalability is improved since the same functions can be run in multiple places with the help of workload management technology

How Can Containers Help a Company Increase Revenue?

A key question to consider is how adopting Containers can help the company increase its revenues. There are a number of elements that directly and indirectly address that question.

Since applications can be developed quicker, perform better, and can be supported more easily, the organization can address a rapidly changing business and regulatory environment more effectively. This also means that the organization can capture additional market share from organizations that continue to only use older approaches to information systems.

It also means that the organization can conduct experiments and prototype solutions quickly. This means that the organization can succeed or fail quicker and that organizational learning will be accelerated.

Where an application or its components execute are flexible. This means that a successful solution can execute locally, in the cloud, or in both places as needed. Business decision-makers usually appreciate flexible solutions that don’t impose extra costs.

This approach also ensures that the resulting solutions can scale from small to large as needed. So, organizations can feel more comfortable trying out something new and know that if it succeeds, it can be put into production effectively. Business decision-makers are often encouraged by approaches that allow for a low investment at first and with opportunities for growth as revenues increase rather than forcing a heavy investment up front. This means that the organization is exposed to lever levels of risk.

Summary

Adopting a container-focused approach can be beneficial to both technical and business decision-makers because it addresses the needs for rapid and effective solution development and reduction in overall costs and risks. It also results in a foundation for future growth and the ability to address a changing market.

This approach brings greater complexity along with it, but the benefits outweigh the challenges in many environments. The rapid improvement in container system management, automation, as well as the strong industry support for this approach makes it a safer choice.

If developers focus on helping business decision-makers understand how this approach also facilitates lower costs, improved time to market, and time to profit, the business side is likely to get on board quicker. They are likely to appreciate the reduced costs of solution support, operations, and development. They are also likely to be pleased that future investment can be based on revenue production rather than facing investing up front based upon a rosy forecast for future revenues.

Developing a Strategy for Kubernetes adoption

Like containers, Kubernetes sits at the intersection of DevOps and ITOps and many organizations are trying to figure out key questions such as: who should own kubernetes, how many clusters to deploy, how to deliver it as a service, how to build a security policy, and how much standardization is critical for adoption. Rancher co-founder Shannon Williams discusses these questions and more in the free online class Building an Enterprise Kubernetes Strategy.

Tags: ,, Category: 未分类 Comments closed

Rancher 2.2 Hits the GA Milestone

星期二, 26 三月, 2019
Expert Training in Kubernetes and Rancher
Join our free online training sessions to learn more about Kubernetes, containers, and Rancher.

We released version 2.2.0 of Rancher today, and we’re beyond excited. The latest release is the culmination of almost a year’s work and brings new features to the product that will make your Kubernetes installations more stable and easier to manage.

When we released Preview 1 in December and Preview 2 in February, we
covered their features extensively in blog articles, meetups, videos,
demos, and at industry events. I won’t make this an article that
rehashes what others have already written, but in case you haven’t seen
the features we’ve packed into this release, I’ll do a quick recap.

Rancher Global DNS

There’s a telco concept of the “last mile,” which is the final
communications link between the infrastructure and the end user. If
you’re all in on Kubernetes, then you’re using tools like CI/CD or some
other automation to deploy workloads. Maybe it’s only for testing, or
maybe your teams have full control over what they deploy.

DNS is the last mile for Kubernetes applications. No one wants to deploy
an app via automation and then go manually add or change a DNS record.

Rancher Global DNS solves this by provisioning and maintaining an
external DNS record that corresponds to the IP addresses of the
Kubernetes Ingress for an application. This, by itself, isn’t a new
concept, but Rancher will also do it for applications deployed to
multiple clusters.

Imagine what this means. You can now deploy an app to as many clusters
as you want and have DNS automatically update to point to the Ingress
for that application on all of them.

Rancher Cluster BDR

This is probably my favorite feature in Rancher 2.2. I’m a huge fan of
backup and disaster recovery (BDR) solutions. I’ve seen too many things
fail, and when I know I have backups in place, failure isn’t a big deal.
It’s just a part of the job.

When Rancher spins up a cluster on cloud compute instances, vSphere, or
via the Custom option, it deploys Rancher Kubernetes Engine (RKE).
That’s the CNCF-certified Kubernetes distribution that Rancher
maintains.

Rancher 2.2 adds support for backup and restore of the etcd datastore
directly into the Rancher UI/API and the Kubernetes API. It also adds
support for S3-compatible storage as the endpoint, so you can
immediately get your backups off of the hosts without using NFS.

When the unthinkable happens, you can restore those backups directly
into the cluster via the UI.

You’ve already been making snapshots of your cluster data and moving
them offsite, right? Of course you have.…but just in case you
haven’t, it’s now so easy to do that there’s no reason not to do it.

Rancher Advanced Monitoring

Rancher has always used Prometheus for monitoring and alerts. This
release enables Prometheus to reach even further into Kubernetes and
deliver even more information back to you. One of the flagship features
in Rancher is single cluster
multi-tenancy
,
where one or more users have access to a Project and can only see the
resources within that
Project

even if there are other users or other Projects on the cluster.

Rancher Advanced Monitoring deploys Prometheus and Grafana in a way that
respects the boundaries of a multi-tenant environment. Grafana installs
with pre-built cluster and Project dashboards, so once you check the box
to activate the advanced metrics, you’ll be looking at useful graphs a
few minutes later.

Rancher Advanced Monitoring covers everything from the cluster nodes to
the Pods within each Project, and if your application exposes its own
metrics, Prometheus will scrape those and make them available for you to
use.

Multi-Cluster Applications

Rancher is built to manage multiple clusters. It has a strong
integration with Helm via the Application
Catalog
, which takes
Helm’s key/value YAML and turns it into a form that anyone can use.

In Rancher 2.2 the Application Catalog also exists at the Global level,
and you can deploy apps via Helm simultaneously to multiple Projects in
any number of clusters. This saves a tremendous amount of time for
anyone who has to maintain applications in different environments,
particularly when it’s time to upgrade all of those applications.
Rancher will batch upgrades and rollbacks using Helm’s features for
atomic releases.

Because multi-cluster apps are built on top of Helm, they’ll work out of
the box with CI/CD systems or any other automated provisioner.

Multi-Tenant Catalogs

In earlier versions of Rancher the configuration for the Application
Catalog and any external Helm repositories existed at the Global level
and propagated to the clusters. This meant that every cluster had access
to the same Helm charts, and while that worked for most installations,
it didn’t work for all of them.

Rancher 2.2 has cluster-specific and project-specific configuration for
the Application Catalog. You can remove it completely, change what a
particular cluster or project has access to, or add new Helm
repositories for applications that you’ve approved.

Conclusion

The latest version of Rancher gives you the tools that you need for “day
two” Kubernetes operations — those tasks that deal with the management
and maintenance of your clusters after launch. Everything focuses on
reliability, repeatability, and ease of use, because using Rancher is
about helping your developers accelerate innovation and drive value for
your business.

Rancher 2.2 is available now for deployment in dev and staging environments as rancher/rancher:latest. Rancher recommends that production environments hold out for rancher/rancher:stable before upgrading, and that tag will be available in the coming days.

If you haven’t yet deployed Rancher, now is a great time to start! With two easy steps you can have Rancher up and running, ready to help you manage Kubernetes.

Join the Rancher 2.2 Online Meetup on April 3rd

To kick off this release and explain in detail each of these new, powerful features, we’re hosting an Online Meetup on April 3rd. It’s free to join and there will be live Q&A with the engineers who directly worked on the project. Get your spot here.

Expert Training in Kubernetes and Rancher
Join our free online training sessions to learn more about Kubernetes, containers, and Rancher.

Continuous Delivery of Everything with Rancher, Drone, and Terraform

星期三, 16 八月, 2017

It’s 8:00 PM. I just deployed to production, but nothing’s working.
Oh, wait. the production Kinesis stream doesn’t exist, because the
CloudFormation template for production wasn’t updated.
Okay, fix that.
9:00 PM. Redeploy. Still broken. Oh, wait. The production config file
wasn’t updated to use the new database.
Okay, fix that. Finally, it
works, and it’s time to go home. Ever been there? How about the late
night when your provisioning scripts work for updating existing servers,
but not for creating a brand new environment? Or, a manual deployment
step missing from a task list? Or, a config file pointing to a resource
from another environment? Each of these problems stems from separating
the activity of provisioning infrastructure from that of deploying
software, whether by choice, or limitation of tools. The impact of
deploying should be to allow customers to benefit from added value or
validate a business hypothesis. In order to accomplish this,
infrastructure and software are both needed, and they normally change
together. Thus, a deployment can be defined as:

  • reconciling the infrastructure needed with the infrastructure that
    already exists; and
  • reconciling the software that we want to run with the software that
    is already running.

With Rancher, Terraform, and Drone, you can build continuous delivery
tools that let you deploy this way. Let’s look at a sample system:
This simple
architecture has a server running two microservices,
[happy-service]
and
[glad-service].
When a deployment is triggered, you want the ecosystem to match this
picture, regardless of what its current state is. Terraform is a tool
that allows you to predictably create and change infrastructure and
software. You describe individual resources, like servers and Rancher
stacks, and it will create a plan to make the world match the resources
you describe. Let’s create a Terraform configuration that creates a
Rancher environment for our production deployment:

provider "rancher" {
  api_url = "${var.rancher_url}"
}

resource "rancher_environment" "production" {
  name = "production"
  description = "Production environment"
  orchestration = "cattle"
}

resource "rancher_registration_token" "production_token" {
  environment_id = "${rancher_environment.production.id}"
  name = "production-token"
  description = "Host registration token for Production environment"
}

Terraform has the ability to preview what it’ll do before applying
changes. Let’s run terraform plan.

+ rancher_environment.production
    description:   "Production environment"
    ...

+ rancher_registration_token.production_token
    command:          "<computed>"
    ...

The pluses and green text indicate that the resource needs to be
created. Terraform knows that these resources haven’t been created yet,
so it will try to create them. Running terraform apply creates the
environment in Rancher. You can log into Rancher to see it. Now let’s
add an AWS EC2 server to the environment:

# A look up for rancheros_ami by region
variable "rancheros_amis" {
  default = {
      "ap-south-1" = "ami-3576085a"
      "eu-west-2" = "ami-4806102c"
      "eu-west-1" = "ami-64b2a802"
      "ap-northeast-2" = "ami-9d03dcf3"
      "ap-northeast-1" = "ami-8bb1a7ec"
      "sa-east-1" = "ami-ae1b71c2"
      "ca-central-1" = "ami-4fa7182b"
      "ap-southeast-1" = "ami-4f921c2c"
      "ap-southeast-2" = "ami-d64c5fb5"
      "eu-central-1" = "ami-8c52f4e3"
      "us-east-1" = "ami-067c4a10"
      "us-east-2" = "ami-b74b6ad2"
      "us-west-1" = "ami-04351964"
      "us-west-2" = "ami-bed0c7c7"
  }
  type = "map"
}


# this creates a cloud-init script that registers the server
# as a rancher agent when it starts up
resource "template_file" "user_data" {
  template = <<EOF
#cloud-config
write_files:
  - path: /etc/rc.local
    permissions: "0755"
    owner: root
    content: |
      #!/bin/bash
      for i in {1..60}
      do
      docker info && break
      sleep 1
      done
      sudo docker run -d  --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.1 $${registration_url}
EOF

  vars {
    registration_url = "${rancher_registration_token.production_token.registration_url}"
  }
}

# AWS ec2 launch configuration for a production rancher agent
resource "aws_launch_configuration" "launch_configuration" {
  provider = "aws"
  name = "rancher agent"
  image_id = "${lookup(var.rancheros_amis, var.terraform_user_region)}"
  instance_type = "t2.micro"
  key_name = "${var.key_name}"
  user_data = "${template_file.user_data.rendered}"

  security_groups = [ "${var.security_group_id}"]
  associate_public_ip_address = true
}


# Creates an autoscaling group of 1 server that will be a rancher agent
resource "aws_autoscaling_group" "autoscaling" {
  availability_zones        = ["${var.availability_zones}"]
  name                      = "Production servers"
  max_size                  = "1"
  min_size                  = "1"
  health_check_grace_period = 3600
  health_check_type         = "ELB"
  desired_capacity          = "1"
  force_delete              = true
  launch_configuration      = "${aws_launch_configuration.launch_configuration.name}"
  vpc_zone_identifier       = ["${var.subnets}"]
}

We’ll put these in the same directory as environment.tf, and run
terraform plan again:

+ aws_autoscaling_group.autoscaling
    arn:                            ""
    ...

+ aws_launch_configuration.launch_configuration
    associate_public_ip_address: "true"
    ...

+ template_file.user_data
    ...

This time, you’ll see that rancher_environment resources is missing.
That’s because it’s already created, and Rancher knows that it
doesn’t have to create it again. Run terraform apply, and after a few
minutes, you should see a server show up in Rancher. Finally, we want to
deploy the happy-service and glad-service onto this server:

resource "rancher_stack" "happy" {
  name = "happy"
  description = "A service that's always happy"
  start_on_create = true
  environment_id = "${rancher_environment.production.id}"

  docker_compose = <<EOF
    version: '2'
    services:
      happy:
        image: peloton/happy-service
        stdin_open: true
        tty: true
        ports:
            - 8000:80/tcp
        labels:
            io.rancher.container.pull_image: always
            io.rancher.scheduler.global: 'true'
            started: $STARTED
EOF

  rancher_compose = <<EOF
    version: '2'
    services:
      happy:
        start_on_create: true
EOF

  finish_upgrade = true
  environment {
    STARTED = "${timestamp()}"
  }
}

resource "rancher_stack" "glad" {
  name = "glad"
  description = "A service that's always glad"
  start_on_create = true
  environment_id = "${rancher_environment.production.id}"

  docker_compose = <<EOF
    version: '2'
    services:
      glad:
        image: peloton/glad-service
        stdin_open: true
        tty: true
        ports:
            - 8000:80/tcp
        labels:
            io.rancher.container.pull_image: always
            io.rancher.scheduler.global: 'true'
            started: $STARTED
EOF

  rancher_compose = <<EOF
    version: '2'
    services:
      glad:
        start_on_create: true
EOF

  finish_upgrade = true
  environment {
    STARTED = "${timestamp()}"
  }
}

This will create two new Rancher stacks; one for the happy service and
one for the glad service. Running terraform plan once more will show
the two Rancher stacks:

+ rancher_stack.glad
    description:              "A service that's always glad"
    ...

+ rancher_stack.happy
    description:              "A service that's always happy"
    ...

And running terraform apply will create them. Once this is done,
you’ll have your two microservices deployed onto a host automatically
on Rancher. You can hit your host on port 8000 or on port 8001 to see
the response from the services:
We’ve created each
piece of the infrastructure along the way in a piecemeal fashion. But
Terraform can easily do everything from scratch, too. Try issuing a
terraform destroy, followed by terraform apply, and the entire
system will be recreated. This is what makes deploying with Terraform
and Rancher so powerful – Terraform will reconcile the desired
infrastructure with the existing infrastructure, whether those resources
exist, don’t exist, or require modification. Using Terraform and
Rancher, you can now create the infrastructure and the software that
runs on the infrastructure together. They can be changed and versioned
together, too. In the future blog entries, we’ll look at how to
automate this process on git push with Drone. Be sure to check out the
code for the Terraform configuration are hosted on
[github].
The
[happy-service]
and
[glad-service]
are simple nginx docker containers. Bryce Covert is an engineer at
pelotech. By day, he helps teams accelerate
engineering by teaching them functional programming, stateless
microservices, and immutable infrastructure. By night, he hacks away,
creating point and click adventure games. You can find pelotech on
Twitter at @pelotechnology.

Tags: , Category: 未分类 Comments closed

Joining as VP of Business Development

星期一, 19 六月, 2017

Nick Stinemates, VP Business DevelopmentI am incredibly excited to be
joining such a talented, diverse group at Rancher Labs as Vice President
of Business Development. In this role, I’ll be building upon my
experience of developing foundational and strategic relationships based
on open source technology. This change is motivated by my desire to go
back to my roots, working with small, promising companies with
passionate teams. I joined Docker, Inc. in 2013, just as it started to
bring containers out of the shadows and empower developers to write
software with the tools of their choice, while redefining their
relationship with infrastructure. Now that Docker is available in every
cloud environment, embedded in developer tools, and integrated in
development pipelines, the focus has shifted to making it more efficient
and sustainable for business. As users look for more integrated
solutions, the complexity of interrelated services and software rises
dramatically, giving an advantage to vendors that are proactively
reaching out and collaborating with best of breed tools. This is, I
believe, one of Rancher Labs’ strengths.

The Rancher container management
platform implements a layer of infrastructure services and drivers
designed specifically to power containerized applications. Since
networking, storage, load balancer, DNS, and security services are
deployed as containers, Rancher is in a unique position to integrate
technology efficiently, holistically, and at scale. Similarly, Rancher
also makes ISV and open source applications available via
its application catalog. The public
catalog delivers more than 90 popular applications and development
tools, many of which are contributed by the Rancher community. In
addition to further developing the Rancher ecosystem via technology and
ISV partnerships, I will be working to expand the Rancher Labs Partner
Network
. We will be building a
comprehensive partner program designed to expand the company’s global
reach, increase enterprise adoption, and provide partners and customers
with tools for success. From what I can tell after my first week, I am
in the right place. I’m looking forward to becoming part of the Rancher
Labs family, and collaborating with the broader ecosystem while
developing new relationships. As for immediate plans, I am coming up to
speed as fast as I can, and spending as much time talking to as many
people in the ecosystem as possible. If you’d like to explore
opportunities to collaborate, please consider becoming a
partner
. Nick is the
Vice President of Business Development at Rancher Labs where he is
focused on defining and executing Partner strategy. Prior to joining
Rancher Labs, Nick was the Vice President of Business Development and
Technical Alliances at Docker for four years. At Docker, Nick was
responsible for creating and driving the overall partner engagement and
strategy, as well as cultivating many company-defining strategic
alliances. Nick has over 15 years’ experience participating in and
contributing to the open source ecosystem as well as 10 years in
management functions in the enterprise financial space.

Tags: , Category: 未分类 Comments closed

Unlocking the Business Value of Docker

星期二, 25 四月, 2017

Why Smart Container Management is Key

For anyone working in IT, the excitement around containers has been hard
to miss. According to RightScale, enterprise deployments of Docker over
doubled in 2016 with 29% of organizations using the software versus just
14% in 2015 [1]. Even more impressive, fully 67%
of organizations surveyed are either using Docker or plan to adopt it.
While many of these efforts are early stage, separate research shows
that over two thirds of organizations who try Docker report that it
meets or exceeds expectations [2], and the
average Docker deployment quintuples in size in just nine months.

Clearly, Docker is here to stay. While exciting, containers are hardly
new. They’ve existed in various forms for years. Some examples include
BSD jails, Solaris Zones, and more modern incarnations like Linux
Containers (LXC). What makes Docker (based on LXC) interesting is that
it provides the tooling necessary for users to easily package
applications along with their dependencies in a format readily portable
between environments. In other words, Docker has made containers
practical and easy to use.

Re-thinking Application Architectures

It’s not a coincidence that Docker exploded in popularity just as
application architectures were themselves changing. Driven by the
global internet, cloud, and the explosion of mobile apps, application
services are increasingly designed for internet scale. Cloud-native
applications are comprised of multiple connected components that are
resilient, horizontally scalable, and wired together via secured virtual
networks. As these distributed, modular architectures have become the
norm, Docker has emerged as a preferred way to package and deploy
application components. As Docker has matured, the emphasis has shifted
from the management of the containers themselves to the orchestration
and management of complete, ready-to-run application services. For
developers and QA teams, the potential for productivity gains are
enormous. By being able to spin up fully-assembled dev, test and QA
environments, and rapidly promote applications to production, major
sources of errors, downtime and risk can be avoided. DevOps teams
become more productive, and organizations can get to market faster with
higher quality software. With opportunities to reduce cost and improve
productivity, Docker is no longer interesting just to technologists –
it’s caught the attention of the board room as well.

New Opportunities and Challenges for the Enterprise

Done right, deploying a containerized application environment can bring
many benefits:

  • Improved developer and QA productivity
  • Reduced time-to-market
  • Enhanced competitiveness
  • Simplified IT operations
  • Improved application reliability
  • Reduced infrastructure costs

While Docker provides real opportunities for enterprise deployments, the
devil is in the details. Docker is complex, comprised of a whole
ecosystem of rapidly evolving open-source projects. The core Docker
projects are not sufficient for most deployments, and organizations
implementing Docker from open-source wrestle with a variety of
challenges including management of virtual private networks, managing
databases and object stores, securing applications and registries, and
making the environment easy enough to use that it is accessible to
non-specialists. They also are challenged by skills shortages and
finding people knowledgeable about various aspects of Docker
administration. A business guide to effective
container app management –
Compounding these challenges, orchestration technologies essential to
realizing the value of Docker are also evolving quickly. There are
multiple competing solutions, including Kubernetes, Docker Swarm and
Mesos. The same is true with private cloud management frameworks.
Because Docker environments tend to grow rapidly once deployed,
organizations are concerned about making a misstep, and finding
themselves locked into a particular technology. In the age of rapid
development and prototyping, what is a sandbox one day may be in
production the next. It is important that the platform used for
evaluation and prototyping has the capacity to scale into production.
Organizations need to retain flexibility to deploy on bare-metal, public
or private clouds, and use their choice of orchestration solutions and
value-added components. For many, the challenge is not whether to deploy
Docker, but how do so cost-effectively, quickly and in a way that
minimizes business and operational risk so the potential of the
technology can be fully realized.

Reaping the Rewards with Rancher

In a sense, the Rancher® container management platform is to Docker what
Docker is to containers: just as Docker makes it easy to package,
deploy and manage containers, Rancher software does the same for the
entire application environment and Docker ecosystem. Rancher software
simplifies the management of Docker environments helping organizations
get to value faster, reduce risk, and avoid proprietary lock-in.
Written with a
technology and business audience in mind, in a recently published
whitepaper, Unlocking the Value of Docker in the Enterprise,
Rancher Labs explores the challenges of container management and
discusses and quantifies some of the specific areas that Rancher
software can provide value to the business. To learn more about Rancher,
and understand why it has become the choice of leading organizations
deploying Docker, download the whitepaper and
learn what Rancher can do for your business.

[1]
http://assets.rightscale.com/uploads/pdfs/rightscale-2016-state-of-the-cloud-report-devops-trends.pdf
[2]
https://www.twistlock.com/2016/09/23/state-containers-industry-reports-shed-insight/

Tags: ,, Category: Rancher Blog, Rancher Blog Comments closed

NeuVector UI Extension for Rancher Enhances Secure Cloud Native Stack

星期四, 14 三月, 2024

We have officially released the first version of the NeuVector UI Extension for Rancher! This release is an exciting first step for integrating NeuVector security monitoring and enforcement into the Rancher Manager UI. 

The security vision for SUSE and its enterprise container management (ECM) products has always been to enable easy deployment, monitoring and management of a secure cloud native stack. The full-lifecycle container security solution NeuVector offers a comprehensive set of security observability and controls, and by integrating this with Rancher, users can protect the sensitive data flows and business-critical applications managed by Rancher.

Rancher users can deploy NeuVector through Rancher and monitor the key security metrics of each cluster through the NeuVector UI extension. This extension includes a cluster security score, ingress/egress connection risks and vulnerability risks for nodes and pods.

 

 

Thanks to the single sign-on (SSO) integration with Rancher, users can then open the full NeuVector console (through the convenient links in the upper right above) without logging in again. Through the NeuVector console, users can do a deeper analysis of security events and vulnerabilities, configure admission control policies and manage the zero trust run-time security protections NeuVector provides.

The NeuVector UI Extension also supports user interaction to investigate security details from the dashboard. In particular, it displays a dynamic Security Risk Score for the entire cluster and its workloads and offers a guided wizard for ‘How to Improve Your Score.’ As shown below, one action turns on automated scanning of nodes and pods for vulnerabilities and compliance violations.

 

Rancher Extensions Architecture provides a decoupling of releases

Extensions allow users, developers, partners and customers to extend and enhance the Rancher UI. In addition, users can make changes and create enhancements to their UI functionality independent of Rancher releases. Extensions will enable users to build on top of Rancher to tailor it to their respective environments better. In this case, the NeuVector extension can be continuously enhanced and updated independent of Rancher releases.

 

Rancher Prime and NeuVector Prime

The new UI extension for NeuVector is available as part of the Rancher Prime and NeuVector Prime commercial offerings. Commercial subscribers can install the extension directly from the Rancher Prime registry, and it comes pre-installed with Rancher Prime.

 

What’s next: The Rancher-NeuVector Integration roadmap

This is an exciting first phase for UI integration, with many more phases planned over the following months. For example, the ability to view scan results for pods and nodes directly in the Rancher cluster resources views and manually trigger scanning is planned for the next phase. We are also working on more granular SSO/RBAC integration between Rancher users/groups and NeuVector roles, as well as integrating admission controls from Kubewarden and NeuVector.

 

Want to learn more?

For more information, see the NeuVector documentation and release notes. The NeuVector UI Extension requires NeuVector version 5.3.0+ and Rancher version 2.7.0+.

Getting Started with Cluster Autoscaling in Kubernetes

星期二, 12 九月, 2023

Autoscaling the resources and services in your Kubernetes cluster is essential if your system is going to meet variable workloads. You can’t rely on manual scaling to help the cluster handle unexpected load changes.

While cluster autoscaling certainly allows for faster and more efficient deployment, the practice also reduces resource waste and helps decrease overall costs. When you can scale up or down quickly, your applications can be optimized for different workloads, making them more reliable. And a reliable system is always cheaper in the long run.

This tutorial introduces you to Kubernetes’s Cluster Autoscaler. You’ll learn how it differs from other types of autoscaling in Kubernetes, as well as how to implement Cluster Autoscaler using Rancher.

The differences between different types of Kubernetes autoscaling

By monitoring utilization and reacting to changes, Kubernetes autoscaling helps ensure that your applications and services are always running at their best. You can accomplish autoscaling through the use of a Vertical Pod Autoscaler (VPA)Horizontal Pod Autoscaler (HPA) or Cluster Autoscaler (CA).

VPA is a Kubernetes resource responsible for managing individual pods’ resource requests. It’s used to automatically adjust the resource requests and limits of individual pods, such as CPU and memory, to optimize resource utilization. VPA helps organizations maintain the performance of individual applications by scaling up or down based on usage patterns.

HPA is a Kubernetes resource that automatically scales the number of replicas of a particular application or service. HPA monitors the usage of the application or service and will scale the number of replicas up or down based on the usage levels. This helps organizations maintain the performance of their applications and services without the need for manual intervention.

CA is a Kubernetes resource used to automatically scale the number of nodes in the cluster based on the usage levels. This helps organizations maintain the performance of the cluster and optimize resource utilization.

The main difference between VPA, HPA and CA is that VPA and HPA are responsible for managing the resource requests of individual pods and services, while CA is responsible for managing the overall resources of the cluster. VPA and HPA are used to scale up or down based on the usage patterns of individual applications or services, while CA is used to scale the number of nodes in the cluster to maintain the performance of the overall cluster.

Now that you understand how CA differs from VPA and HPA, you’re ready to begin implementing cluster autoscaling in Kubernetes.

Prerequisites

There are many ways to demonstrate how to implement CA. For instance, you could install Kubernetes on your local machine and set up everything manually using the kubectl command-line tool. Or you could set up a user with sufficient permissions on Amazon Web Services (AWS), Google Cloud Platform (GCP) or Azure to play with Kubernetes using your favorite managed cluster provider. Both options are valid; however, they involve a lot of configuration steps that can distract from the main topic: the Kubernetes Cluster Autoscaler.

An easier solution is one that allows the tutorial to focus on understanding the inner workings of CA and not on time-consuming platform configurations, which is what you’ll be learning about here. This solution involves only two requirements: a Linode account and Rancher.

For this tutorial, you’ll need a running Rancher Manager server. Rancher is perfect for demonstrating how CA works, as it allows you to deploy and manage Kubernetes clusters on any provider conveniently from its powerful UI. Moreover, you can deploy it using several providers, including these popular options:

If you are curious about a more advanced implementation, we suggest reading the Rancher documentation, which describes how to install Cluster Autoscaler on Rancher using Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling groups. However, please note that implementing CA is very similar on different platforms, as all solutions leverage Kubernetes Cluster API for their purposes. Something that will be addressed in more detail later.

What is Cluster API, and how does Kubernetes CA leverage it

Cluster API is an open source project for building and managing Kubernetes clusters. It provides a declarative API to define the desired state of Kubernetes clusters. In other words, Cluster API can be used to extend the Kubernetes API to manage clusters across various cloud providers, bare metal installations and virtual machines.

In comparison, Kubernetes CA leverages Cluster API to enable the automatic scaling of Kubernetes clusters in response to changing application demands. CA detects when the capacity of a cluster is insufficient to accommodate the current workload and then requests additional nodes from the cloud provider. CA then provisions the new nodes using Cluster API and adds them to the cluster. In this way, the CA ensures that the cluster has the capacity needed to serve its applications.

Because Rancher supports CA and RKE2, and K3s works with Cluster API, their combination offers the ideal solution for automated Kubernetes lifecycle management from a central dashboard. This is also true for any other cloud provider that offers support for Cluster API.

Link to the Cluster API blog

Implementing CA in Kubernetes

Now that you know what Cluster API and CA are, it’s time to get down to business. Your first task will be to deploy a new Kubernetes cluster using Rancher.

Deploying a new Kubernetes cluster using Rancher

Begin by navigating to your Rancher installation. Once logged in, click on the hamburger menu located at the top left and select Cluster Management:

Rancher's main dashboard

On the next screen, click on Drivers:

**Cluster Management | Drivers**

Rancher uses cluster drivers to create Kubernetes clusters in hosted cloud providers.

For Linode LKE, you need to activate the specific driver, which is simple. Just select the driver and press the Activate button. Once the driver is downloaded and installed, the status will change to Active, and you can click on Clusters in the side menu:

Activate LKE driver

With the cluster driver enabled, it’s time to create a new Kubernetes deployment by selecting Clusters | Create:

**Clusters | Create**

Then select Linode LKE from the list of hosted Kubernetes providers:

Create LKE cluster

Next, you’ll need to enter some basic information, including a name for the cluster and the personal access token used to authenticate with the Linode API. When you’ve finished, click Proceed to Cluster Configuration to continue:

**Add Cluster** screen

If the connection to the Linode API is successful, you’ll be directed to the next screen, where you will need to choose a region, Kubernetes version and, optionally, a tag for the new cluster. Once you’re ready, press Proceed to Node pool selection:

Cluster configuration

This is the final screen before creating the LKE cluster. In it, you decide how many node pools you want to create. While there are no limitations on the number of node pools you can create, the implementation of Cluster Autoscaler for Linode does impose two restrictions, which are listed here:

  1. Each LKE Node Pool must host a single node (called Linode).
  2. Each Linode must be of the same type (eg 2GB, 4GB and 6GB).

For this tutorial, you will use two node pools, one hosting 2GB RAM nodes and one hosting 4GB RAM nodes. Configuring node pools is easy; select the type from the drop-down list and the desired number of nodes, and then click the Add Node Pool button. Once your configuration looks like the following image, press Create:

Node pool selection

You’ll be taken back to the Clusters screen, where you should wait for the new cluster to be provisioned. Behind the scenes, Rancher is leveraging the Cluster API to configure the LKE cluster according to your requirements:

Cluster provisioning

Once the cluster status shows as active, you can review the new cluster details by clicking the Explore button on the right:

Explore new cluster

At this point, you’ve deployed an LKE cluster using Rancher. In the next section, you’ll learn how to implement CA on it.

Setting up CA

If you’re new to Kubernetes, implementing CA can seem complex. For instance, the Cluster Autoscaler on AWS documentation talks about how to set permissions using Identity and Access Management (IAM) policies, OpenID Connect (OIDC) Federated Authentication and AWS security credentials. Meanwhile, the Cluster Autoscaler on Azure documentation focuses on how to implement CA in Azure Kubernetes Service (AKS), Autoscale VMAS instances and Autoscale VMSS instances, for which you will also need to spend time setting up the correct credentials for your user.

The objective of this tutorial is to leave aside the specifics associated with the authentication and authorization mechanisms of each cloud provider and focus on what really matters: How to implement CA in Kubernetes. To this end, you should focus your attention on these three key points:

  1. CA introduces the concept of node groups, also called by some vendors autoscaling groups. You can think of these groups as the node pools managed by CA. This concept is important, as CA gives you the flexibility to set node groups that scale automatically according to your instructions while simultaneously excluding other node groups for manual scaling.
  2. CA adds or removes Kubernetes nodes following certain parameters that you configure. These parameters include the previously mentioned node groups, their minimum size, maximum size and more.
  3. CA runs as a Kubernetes deployment, in which secrets, services, namespaces, roles and role bindings are defined.

The supported versions of CA and Kubernetes may vary from one vendor to another. The way node groups are identified (using flags, labels, environmental variables, etc.) and the permissions needed for the deployment to run may also vary. However, at the end of the day, all implementations revolve around the principles listed previously: auto-scaling node groups, CA configuration parameters and CA deployment.

With that said, let’s get back to business. After pressing the Explore button, you should be directed to the Cluster Dashboard. For now, you’re only interested in looking at the nodes and the cluster’s capacity.

The next steps consist of defining node groups and carrying out the corresponding CA deployment. Start with the simplest and follow some best practices to create a namespace to deploy the components that make CA. To do this, go to Projects/Namespaces:

Create a new namespace

On the next screen, you can manage Rancher Projects and namespaces. Under Projects: System, click Create Namespace to create a new namespace part of the System project:

**Cluster Dashboard | Namespaces**

Give the namespace a name and select Create. Once the namespace is created, click on the icon shown here (ie import YAML):

Import YAML

One of the many advantages of Rancher is that it allows you to perform countless tasks from the UI. One such task is to import local YAML files or create them on the fly and deploy them to your Kubernetes cluster.

To take advantage of this useful feature, copy the following code. Remember to replace <PERSONAL_ACCESS_TOKEN> with the Linode token that you created for the tutorial:

---
apiVersion: v1
kind: Secret
metadata:
  name: cluster-autoscaler-cloud-config
  namespace: autoscaler
type: Opaque
stringData:
  cloud-config: |-
    [global]
    linode-token=<PERSONAL_ACCESS_TOKEN>
    lke-cluster-id=88612
    defaut-min-size-per-linode-type=1
    defaut-max-size-per-linode-type=5
    do-not-import-pool-id=88541

    [nodegroup "g6-standard-1"]
    min-size=1
    max-size=4

    [nodegroup "g6-standard-2"]
    min-size=1
    max-size=2

Next, select the namespace you just created, paste the code in Rancher and select Import:

Paste YAML

A pop-up window will appear, confirming that the resource has been created. Press Close to continue:

Confirmation

The secret you just created is how Linode implements the node group configuration that CA will use. This configuration defines several parameters, including the following:

  • linode-token: This is the same personal access token that you used to register LKE in Rancher.
  • lke-cluster-id: This is the unique identifier of the LKE cluster that you created with Rancher. You can get this value from the Linode console or by running the command curl -H "Authorization: Bearer $TOKEN" https://api.linode.com/v4/lke/clusters, where STOKEN is your Linode personal access token. In the output, the first field, id, is the identifier of the cluster.
  • defaut-min-size-per-linode-type: This is a global parameter that defines the minimum number of nodes in each node group.
  • defaut-max-size-per-linode-type: This is also a global parameter that sets a limit to the number of nodes that Cluster Autoscaler can add to each node group.
  • do-not-import-pool-id: On Linode, each node pool has a unique ID. This parameter is used to exclude specific node pools so that CA does not scale them.
  • nodegroup (min-size and max-size): This parameter sets the minimum and maximum limits for each node group. The CA for Linode implementation forces each node group to use the same node type. To get a list of available node types, you can run the command curl https://api.linode.com/v4/linode/types.

This tutorial defines two node groups, one using g6-standard-1 linodes (2GB nodes) and one using g6-standard-2 linodes (4GB nodes). For the first group, CA can increase the number of nodes up to a maximum of four, while for the second group, CA can only increase the number of nodes to two.

With the node group configuration ready, you can deploy CA to the respective namespace using Rancher. Paste the following code into Rancher (click on the import YAML icon as before):

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: autoscaler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "namespaces"
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create","list","watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: autoscaler

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: autoscaler

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: autoscaler
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler-amd64:v1.26.1
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=2
            - --cloud-provider=linode
            - --cloud-config=/config/cloud-config
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
            - name: cloud-config
              mountPath: /config
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-certificates.crt"
        - name: cloud-config
          secret:
            secretName: cluster-autoscaler-cloud-config

In this code, you’re defining some labels; the namespace where you will deploy the CA; and the respective ClusterRole, Role, ClusterRoleBinding, RoleBinding, ServiceAccount and Cluster Autoscaler.

The difference between cloud providers is near the end of the file, at command. Several flags are specified here. The most relevant include the following:

  • Cluster Autoscaler version v.
  • cloud-provider; in this case, Linode.
  • cloud-config, which points to a file that uses the secret you just created in the previous step.

Again, a cloud provider that uses a minimum number of flags is intentionally chosen. For a complete list of available flags and options, read the Cloud Autoscaler FAQ.

Once you apply the deployment, a pop-up window will appear, listing the resources created:

CA deployment

You’ve just implemented CA on Kubernetes, and now, it’s time to test it.

CA in action

To check to see if CA works as expected, deploy the following dummy workload in the default namespace using Rancher:

Sample workload

Here’s a review of the code:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-workload
  labels:
    app: busybox
spec:
  replicas: 600
  strategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox
        image: busybox
        imagePullPolicy: IfNotPresent
        
        command: ['sh', '-c', 'echo Demo Workload ; sleep 600']

As you can see, it’s a simple workload that generates 600 busybox replicas.

If you navigate to the Cluster Dashboard, you’ll notice that the initial capacity of the LKE cluster is 220 pods. This means CA should kick in and add nodes to cope with this demand:

**Cluster Dashboard**

If you now click on Nodes (side menu), you will see how the node-creation process unfolds:

Nodes

New nodes

If you wait a couple of minutes and go back to the Cluster Dashboard, you’ll notice that CA did its job because, now, the cluster is serving all 600 replicas:

Cluster at capacity

This proves that scaling up works. But you also need to test to see scaling down. Go to Workload (side menu) and click on the hamburger menu corresponding to busybox-workload. From the drop-down list, select Delete:

Deleting workload

A pop-up window will appear; confirm that you want to delete the deployment to continue:

Deleting workload pop-up

By deleting the deployment, the expected result is that CA starts removing nodes. Check this by going back to Nodes:

Scaling down

Keep in mind that by default, CA will start removing nodes after 10 minutes. Meanwhile, you will see taints on the Nodes screen indicating the nodes that are candidates for deletion. For more information about this behavior and how to modify it, read “Does CA respect GracefulTermination in scale-down?” in the Cluster Autoscaler FAQ.

After 10 minutes have elapsed, the LKE cluster will return to its original state with one 2GB node and one 4GB node:

Downscaling completed

Optionally, you can confirm the status of the cluster by returning to the Cluster Dashboard:

**Cluster Dashboard**

And now you have verified that Cluster Autoscaler can scale up and down nodes as required.

CA, Rancher and managed Kubernetes services

At this point, the power of Cluster Autoscaler is clear. It lets you automatically adjust the number of nodes in your cluster based on demand, minimizing the need for manual intervention.

Since Rancher fully supports the Kubernetes Cluster Autoscaler API, you can leverage this feature on major service providers like AKS, Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). Let’s look at one more example to illustrate this point.

Create a new workload like the one shown here:

New workload

It’s the same code used previously, only in this case, with 1,000 busybox replicas instead of 600. After a few minutes, the cluster capacity will be exceeded. This is because the configuration you set specifies a maximum of four 2GB nodes (first node group) and two 4GB nodes (second node group); that is, six nodes in total:

**Cluster Dashboard**

Head over to the Linode Dashboard and manually add a new node pool:

**Linode Dashboard**

Add new node

The new node will be displayed along with the rest on Rancher’s Nodes screen:

**Nodes**

Better yet, since the new node has the same capacity as the first node group (2GB), it will be deleted by CA once the workload is reduced.

In other words, regardless of the underlying infrastructure, Rancher makes use of CA to know if nodes are created or destroyed dynamically due to load.

Overall, Rancher’s ability to support Cluster Autoscaler out of the box is good news; it reaffirms Rancher as the ideal Kubernetes multi-cluster management tool regardless of which cloud provider your organization uses. Add to that Rancher’s seamless integration with other tools and technologies like Longhorn and Harvester, and the result will be a convenient centralized dashboard to manage your entire hyper-converged infrastructure.

Conclusion

This tutorial introduced you to Kubernetes Cluster Autoscaler and how it differs from other types of autoscaling, such as Vertical Pod Autoscaler (VPA) and Horizontal Pod Autoscaler (HPA). In addition, you learned how to implement CA on Kubernetes and how it can scale up and down your cluster size.

Finally, you also got a brief glimpse of Rancher’s potential to manage Kubernetes clusters from the convenience of its intuitive UI. Rancher is part of the rich ecosystem of SUSE, the leading open Kubernetes management platform. To learn more about other solutions developed by SUSE, such as Edge 2.0 or NeuVector, visit their website.

Advanced Monitoring and Observability​ Tips for Kubernetes Deployments

星期一, 28 八月, 2023

Cloud deployments and containerization let you provision infrastructure as needed, meaning your applications can grow in scope and complexity. The results can be impressive, but the ability to expand quickly and easily makes it harder to keep track of your system as it develops.

In this type of Kubernetes deployment, it’s essential to track your containers to understand what they’re doing. You need to not only monitor your system but also ensure your monitoring delivers meaningful observability. The numbers you track need to give you actionable insights into your applications.

In this article, you’ll learn why monitoring and observability matter and how you can best take advantage of them. That way, you can get all the information you need to maximize the performance of your deployments.

Why you need monitoring and observability in Kubernetes

Monitoring and observability are often confused but worth clarifying for the purposes of this discussion. Monitoring is the means by which you gain information about what your system is doing.

Observability is a more holistic term, indicating the overall capacity to view and understand what is happening within your systems. Logs, metrics and traces are core elements. Essentially, observability is the goal, and monitoring is the means.

Observability can include monitoring as well as logging, tracing, continuous integration and even chaos engineering. Focusing on each facet gets you as close as possible to full coverage. Correcting that can improve your observability if you’ve overlooked one of these areas.

In addition, using black boxes, such as third-party services, can limit observability by making monitoring harder. Increasing complexity can also add problems. Your metrics may not be consistent or relevant if collected from different services or regions.

You need to work to ensure the metrics you collect are taken in context and can be used to provide meaningful insights into where your systems are succeeding and failing.

At a higher level, there are several uses for monitoring and observability. Performance monitoring tells you whether your apps are delivering quickly and what resources they’re consuming.

Issue tracking is also important. Observability can be focused on specific tasks, letting you see how well they’re doing. This can be especially relevant when delivering a new feature or hunting a bug.

Improving your existing applications is also vital. Examining your metrics and looking for areas you can improve will help you stay competitive and minimize your costs. It can also prevent downtime if you identify and fix issues before they lead to performance drops or outages.

Best practices and tips for monitoring and observability in Kubernetes

With distributed applications, collecting data from all your various nodes and containers is more involved than with a standard server-based application. Your tools need to handle the additional complexity.

The following tips will help you build a system that turns information into the elusive observability that you need. All that data needs to be tracked, stored and consolidated. After that, you can use it to gain the insights you need to make better decisions for the future of your application.

Avoid vendor lock-in

The major Kubernetes management services, including Amazon Elastic Kubernetes Service (EKS)Azure Kubernetes Service (AKS) and Google Kubernetes Engine (GKE), provide their own monitoring tools. While these tools include useful features, you need to beware of becoming overdependent on any that belong to a particular platform, which can lead to vendor lock-in. Ideally, you should be able to change technologies and keep the majority of your metric-gathering system.

Rancher, a complete software stack, lets you consolidate information from other platforms that can help solve issues arising when companies use different technologies without integrating them seamlessly. It lets you capture data from a wealth of tools and pipe your logs and data to external management platforms, such as Grafana and Prometheus, meaning your monitoring isn’t tightly coupled to any other part of your infrastructure. This gives you the flexibility to swap parts of your system in and out without too much expense. With platform-agnostic monitoring tools, you can replace other parts of your system more easily.

Pick the right metrics

Collecting metrics sounds straightforward, but it requires careful implementation. Which metrics do you choose? In a Kubernetes deployment, you need to ensure all layers of your system are monitored. That includes the application, the control plane components and everything in between.

CPU and memory usage are important but can be tricky to use across complex deployments. Other metrics, such as API response, request and error rates, along with latency, can be easier to track and give a more accurate picture of how your apps are performing. High disk utilization is a key indicator of problems with your system and should always be monitored.

At the cluster level, you should track node availability and how many running pods you have and make sure you aren’t in danger of running out of nodes. Nodes can sometimes fail, leaving you short.

Within individual pods, as well as resource utilization, you should check application-specific metrics, such as active users or parts of your app that are in use. You also need to track the metrics Kubernetes provides to verify pod health and availability.

Centralize your logging

Diagram showing multiple Kubernetes clusters piping data to Rancher, which sends it to a centralized logging store, courtesy of James Konik

Kubernetes pods keep their own logs, but having logs in different places is hard to keep track of. In addition, if a pod crashes, you can lose them. To prevent the loss, make sure any logs or metrics you require for observability are stored in an independent, central repository.

Rancher can help with this by giving you a central management point for your containers. With logs in one place, you can view the data you need together. You can also make sure it is backed up if necessary.

In addition to piping logs from different clusters to the same place, Rancher can also help you centralize authorization and give you coordinated role-based access control (RBAC).

Transferring large volumes of data will have a performance impact, so you need to balance your requirements with cost. Critical information should be logged immediately, but other data can be transferred on a regular basis, perhaps using a queued operation or as a scheduled management task.

Enforce data correlation

Once you have feature-rich tools in place and, therefore, an impressive range of metrics to monitor and elaborate methods for viewing them, it’s easy to lose focus on the reason you’re collecting the data.

Ultimately, your goal is to improve the user experience. To do that, you need to make sure the metrics you collect give you an accurate, detailed picture of what the user is experiencing and correctly identify any problems they may be having.

Lean toward this in the metrics you pick and in those you prioritize. For example, you might want to track how many people who use your app are actually completing actions on it, such as sales or logins.

You can track these by monitoring task success rates as well as how long actions take to complete. If you see a drop in activity on a particular node, that can indicate a technical problem that your other metrics may not pick up.

You also need to think about your alerting systems and pick alerts that spot performance drops, preferably detecting issues before your customers.

With Kubernetes operating in a highly dynamic way, metrics in different pods may not directly correspond to one another. You need to contextualize different results and develop an understanding of how performance metrics correspond to the user’s experience and business outcomes.

Artificial intelligence (AI) driven observability tools can help with that, tracking millions of data points and determining whether changes are caused by the dynamic fluctuations that happen in massive, scaling deployments or whether they represent issues that need to be addressed.

If you understand the implications of your metrics and what they mean for users, then you’re best suited to optimize your approach.

Favor scalable observability solutions

As your user base grows, you need to deal with scaling issues. Traffic spikes, resource usage and latency all need to be kept under control. Kubernetes can handle some of that for you, but you need to make sure your monitoring systems are scalable as well.

Implementing observability is especially complex in Kubernetes because Kubernetes itself is complicated, especially in multi-cloud deployments. The complexity has been likened to an iceberg.

It gets more difficult when you have to consider problems that arise when you have multiple servers duplicating functionality around the world. You need to ensure high availability and make your database available everywhere. As your deployment scales up, so do these problems.

Rancher’s observability tools allow you to deploy new clusters and monitor them along with your existing clusters from the same location. You don’t need to work to keep up as you deploy more widely. That allows you to focus on what your metrics are telling you and lets you spend your time adding more value to your product.

Conclusion

Kubernetes enables complex deployments, but that means monitoring and observability aren’t as straightforward as they would otherwise be. You need to take special care to ensure your solutions give you an accurate picture of what your software is doing.

Taking care to pick the right metrics makes your monitoring more helpful. Avoiding vendor lock-in gives you the agility to change your setup as needed. Centralizing your metrics brings efficiency and helps you make critical big-picture decisions.

Enforcing data correlation helps keep your results relevant, and thinking about scalability ahead of time stops your system from breaking down when things change.

Rancher can help and makes managing Kubernetes clusters easier. It provides a vast range of Kubernetes monitoring and observability features, ensuring you know what’s going on throughout your deployments. Check it out and learn how it can help you grow. You can also take advantage of free, community training for Kubernetes & Rancher at the Rancher Academy.