Istio: A New Routing Tier for Cloud Foundry
If you are an active user of Cloud Foundry, you may be familiar with gorouter, which handles all the HTTP requests in your deployment. You may also have heard of TCP router, which is a wrapper around HAProxy. Today, in any standard Cloud Foundry deployment, if you want both of them deployed you simply opt in to routing-release, and that takes care of all your routing needs.
But how does it work behind the scenes?
When an app is pushed in Cloud Foundry, Cloud Controller creates identifiers for the app plus some routing metadata (a DesiredLRP plus routing metadata) and forwards those to Diego. At this point, Diego schedules the application by finding a home for it on one of the available Diego cells. Once the application is up and running, Diego's BBS API notifies the route-emitters about the app. The route-emitters then forward the routes, along with the IP and port of the application container, to two places: NATS, which gorouter uses to receive its route updates, and Routing-API, which provides routes to TCP router. Gorouter and TCP router then update their routing tables with the updates they received from NATS and Routing-API respectively.
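To make the route-emitter-to-NATS step concrete, here is a minimal sketch of a route registration message. gorouter subscribes to the "router.register" NATS subject and merges each message into its routing table; the field names below follow gorouter's registration format, but treat the exact values and shape as illustrative:

```python
import json

# Hypothetical route registration for one app instance. gorouter maps
# each URI in the message to the container's host:port.
register_message = {
    "host": "10.0.16.23",           # IP of the cell hosting the app container
    "port": 61001,                  # host-side port mapped to the container
    "uris": ["myapp.example.com"],  # external routes for the app
}

payload = json.dumps(register_message)
print(payload)
```

Every instance of the app publishes (and periodically re-publishes) a message like this, which is how gorouter keeps its routing table fresh.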
You may ask, how do Cloud Foundry components get their routes registered? They follow a slightly different workflow. Every Cloud Foundry component VM (or pod, in the case of containerized Cloud Foundry) runs a co-located job called route-registrar. On startup, route-registrar announces the IP and route for that job and forwards the details to NATS. An example of the workflow is outlined below:
What is the new routing change then?
In the routing world we are introducing something new: Istio. Istio is a project that initially set out to provide a better routing tier for Kubernetes, but as the project grew it became more platform agnostic. It was at this point that other platforms, such as Cloud Foundry, Apache Mesos, and Consul, decided to integrate with Istio.
Before we talk about how Istio is going to architecturally change the routing tier in Cloud Foundry, let’s talk about what Istio is and how it works.
What is Istio?
At its core, Istio is a service mesh: an easy way to create a network on top of an existing infrastructure. It provides a global view of your network and all of its dependencies via a single point of control. Its key features are traffic management, policy enforcement, network observability, and service identity and security, and it ships with load balancing, service-to-service authentication, monitoring, telemetry, and more.
What makes Istio shine is that it does not require any code changes to your application in order for it to deliver all those features!
In the past, if I wanted to do all the fancy stuff above, I had to rely on some sort of library and import it into my project. The problem begins when you demand consistent behaviour across all the services in your network, which are written in Go, Java, C++, Python, and so on. Can a library give you that consistent behaviour? The answer is simply no. I have tried it, and it is just not consistent no matter how hard you try.
When Istio manages the network, every application container is paired with a proxy instance (Envoy). Applications then use these proxies for east-west (service-to-service) and north-south (to the outside world) network requests. Because these co-located proxies are the ones configured to handle load balancing, TLS, mTLS, timeouts, and other features, consistent network behaviour is established.
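The benefit of pushing this logic into the sidecar is easiest to see with a toy example. Below is a minimal sketch (not Envoy itself, just an illustration of the idea): retry policy lives in the proxy, so every app gets identical behaviour regardless of its language or libraries.

```python
class SidecarProxy:
    """Toy stand-in for an Envoy sidecar: every outbound call from the
    app goes through it, so retry policy lives in one place."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries

    def call(self, upstream):
        # Retry the upstream on failure, identically for every app,
        # regardless of the language the app itself is written in.
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return upstream()
            except ConnectionError as e:
                last_error = e
        raise last_error

# A flaky upstream that fails on the first attempt, then succeeds.
attempts = {"n": 0}
def flaky_upstream():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("upstream reset")
    return "200 OK"

proxy = SidecarProxy(max_retries=2)
result = proxy.call(flaky_upstream)
print(result)  # the app never sees the first failure
```

In a real mesh the application issues a plain request and the co-located Envoy applies the policy transparently; no retry code appears in the app at all.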
Istio consists of several sub-components, each of which performs a very specific task. These components are:
Envoy (The Proxy)
Envoy is the L4/L7 proxy for Istio. It is a C++ project with a very small memory footprint, battle tested with some serious benchmarks (I recommend looking into them). Envoy started as a project at Lyft and was later contributed to the Cloud Native Computing Foundation as open source software. You can consider Envoy to be in the same league as HAProxy and NGINX, but what makes Envoy very special is that you can configure it at runtime! What?! Yes, you can reconfigure Envoy without restarting it. You provide it with resource configuration through its discovery endpoints, and this is how you populate its routing table and traffic rules. Among the discovery APIs it exposes are LDS (Listener Discovery Service), RDS (Route Discovery Service), CDS (Cluster Discovery Service), and EDS (Endpoint Discovery Service); you can find the rest in the docs.
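The discovery-driven model can be sketched in a few lines. This is not Envoy's real API (the response shape below is invented for illustration); it only shows the mechanic: the proxy replaces its routing state from a discovery response at runtime, with no restart involved.

```python
class Proxy:
    """Toy proxy whose routing state is driven entirely by a
    discovery service, the way Envoy refreshes clusters and
    endpoints via CDS/EDS."""

    def __init__(self):
        self.endpoints = {}  # cluster name -> list of "host:port" endpoints

    def apply_discovery_response(self, response):
        # Swap in the new table; no restart, no config file reload.
        self.endpoints = {c["name"]: c["endpoints"]
                          for c in response["clusters"]}

proxy = Proxy()
proxy.apply_discovery_response(
    {"clusters": [{"name": "reviews", "endpoints": ["10.0.0.5:8080"]}]}
)
# A later poll reports a scaled-up cluster -- the proxy just picks it up.
proxy.apply_discovery_response(
    {"clusters": [{"name": "reviews",
                   "endpoints": ["10.0.0.5:8080", "10.0.0.6:8080"]}]}
)
print(proxy.endpoints["reviews"])
```

The real protocol is richer (versioned, incremental, streamed over gRPC), but the core idea is the same: configuration is data served to the proxy, not a file baked into it.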
Another reason Envoy is very special is that you can set certain HTTP headers that tell Envoy how to apply retries or timeouts to your request. This is a cool feature, because most of the time you would either have to write that logic yourself or rely on a third-party library to do it for you, and, as mentioned previously, achieving consistent results across multiple platforms and services is NOT an easy task. You can find out more about these awesome headers in the docs.
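As a small illustration, an app can delegate its retry and timeout behaviour to Envoy just by attaching headers to the request. The header names below are Envoy's documented router headers (check the Envoy docs for the full list); the send function itself is a placeholder, not a real HTTP call.

```python
# Envoy's documented request headers for per-request retry/timeout policy.
envoy_headers = {
    "x-envoy-retry-on": "5xx,connect-failure",  # which failures to retry
    "x-envoy-max-retries": "3",                 # retry budget per request
    "x-envoy-upstream-rq-timeout-ms": "2000",   # 2-second overall timeout
}

def send_via_envoy(url, extra_headers=None):
    # Placeholder: a real app would issue the HTTP request here, and the
    # Envoy sitting in front of the upstream would enforce the policy.
    headers = dict(envoy_headers)
    headers.update(extra_headers or {})
    return headers

headers = send_via_envoy("http://reviews.internal/list")
print(headers["x-envoy-max-retries"])
```

The app's own HTTP client stays dumb; the policy travels with the request and is enforced by the proxy, identically for every caller.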
Pilot (Traffic Management)
Pilot is responsible for managing and monitoring the service mesh, and it provides the Envoys with their configuration. It consumes Istio config (CRDs) and service discovery data from various platforms, abstracts the platform-specific service discovery logic, and converts it all into a standard, platform-agnostic format that conforms to the Envoy data plane API. This is how Istio can operate in multiple environments (such as Kubernetes, Consul, Eureka, and Cloud Foundry); Pilot exposes that data through its discovery APIs to the Envoys in your mesh. The config fed into Pilot can be anything from individual routes for your services to routing logic for prod/staging testing, canary deployments, A/B testing, load balancing, retries, or even fault injection, circuit breakers, and timeouts.
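The normalization step is the heart of Pilot's platform independence. A toy version of it, with both input record shapes invented purely for illustration, might look like this:

```python
# Normalize service discovery data from different platforms into one
# platform-agnostic shape. Field names on the inputs are hypothetical.

def from_kubernetes(endpoint):
    return {"service": endpoint["serviceName"],
            "address": f'{endpoint["podIP"]}:{endpoint["port"]}'}

def from_cloud_foundry(route):
    return {"service": route["hostname"],
            "address": f'{route["container_ip"]}:{route["container_port"]}'}

mesh = [
    from_kubernetes({"serviceName": "reviews",
                     "podIP": "10.1.0.4", "port": 9080}),
    from_cloud_foundry({"hostname": "myapp.example.com",
                        "container_ip": "10.0.16.23",
                        "container_port": 61001}),
]
# The proxies only ever see the normalized records, so they do not
# care which platform a service came from.
print(mesh)
```

In real Istio the output format is the Envoy data plane API rather than a flat dict, but the pattern is the same: one adapter per platform in, one format out.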
Mixer (Policies and Telemetry)
Mixer is the policy enforcement and telemetry server. At a high level it can perform access control, metric capture, and quota enforcement, or even act as a billing system. As traffic flows through the Envoys, they collect certain attributes about it. These attributes can be anything from the destination IP, source IP, and HTTP headers to annotations you specify through configuration. The Envoys then use this data to check with Mixer whether any applicable policies should be enforced on that traffic. Monitoring and tracing are further features that Mixer is responsible for, and you can leverage them without instrumenting or making any code changes to your apps.
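The check flow can be reduced to a small sketch: the proxy forwards a bag of attributes, and a policy decides whether the request may proceed. The attribute names and the policy below are invented for illustration, not Mixer's real configuration model.

```python
# Toy Mixer-style policy check: allow or deny a request based on the
# attributes the proxy reports about it.

def check(attributes, denied_sources):
    """Return True if the request is allowed under the policy."""
    return attributes["source.ip"] not in denied_sources

attrs = {
    "source.ip": "10.0.0.9",
    "destination.service": "billing",
    "request.path": "/invoices",
}

allowed = check(attrs, denied_sources={"10.0.0.66"})
print(allowed)
```

Because the decision happens in the mesh rather than in the app, the same attribute stream can also feed metrics, quotas, and tracing without any application changes.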
Citadel (Security)
Citadel is responsible for end-user authentication and service-to-service security based on mutual TLS, with built-in identity and credential management. Citadel's implementation is currently based on SPIFFE (spiffe.io), a set of standards for how a workload receives and presents an identity in a secure way. Security can be challenging, which is why a lot of people find it easy to opt out. But when you are part of an organization that wants to secure all the traffic in its network, including internal traffic, there is a lot of work involved: certificate management, rotation, distribution, and more. Citadel is designed specifically for these operations, and it can manage the identities in your network.
Now that we have talked about how Istio and its components work, let's talk about how it fits into Cloud Foundry. We know that Istio is platform agnostic and that we can make it work as long as we feed it the right data and models. In Cloud Foundry, making this work meant introducing a new component called Copilot.
Copilot runs at the edge of your Cloud Foundry deployment and is responsible for collecting the route data and converting it into Istio-specific configuration and service discovery data compatible with what Pilot ingests. Pilot then ingests those configs and forwards them to the Envoy instances it recognizes. That is how a service becomes routable.
What did we need to change in Cloud Foundry?
With the introduction of Copilot and Istio, we no longer need gorouter and TCP router, since Envoys are capable of handling both L4 and L7 traffic. As a result, the configuration providers for gorouter and TCP router are no longer needed either, which allowed us to remove NATS and Routing-API as well. As mentioned earlier, NATS and Routing-API receive their route info via the route-emitters. This left us with the option of having the route-emitters talk directly to Copilot, but we decided to take it one step further and remove the route-emitters too. With the route-emitters out of the picture, where does Copilot get its route info? We decided to have Copilot talk directly to Cloud Controller and Diego BBS, which allows for a much simpler internal workflow with fewer points of failure and an overall cleaner architecture, as shown in the diagram below:
As you can see, we managed to remove quite a few components from the current routing tier in Cloud Foundry, and we are left with a single router responsible for all types of routing (TCP and HTTP). The new architecture also lets us leverage some of the features that were already on our roadmap: with Istio and Envoy we can soon get HTTP/2, weighted routing, TLS termination, SNI, health checks, fault injection, and more out of the box.
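Of those features, weighted routing is the easiest to picture. The sketch below shows the mechanic behind a 90/10 traffic split between two app versions (backend names and weights are illustrative); Istio expresses the same idea declaratively in its route rules, and Envoy performs the selection.

```python
import random

# Send ~90% of traffic to v1 and ~10% to v2 by weighted random choice.
backends = [("myapp-v1", 90), ("myapp-v2", 10)]

def pick_backend(rng=random.random):
    """Pick a backend with probability proportional to its weight."""
    total = sum(weight for _, weight in backends)
    point = rng() * total
    for name, weight in backends:
        point -= weight
        if point < 0:
            return name
    return backends[-1][0]  # guard against floating-point edge cases

counts = {"myapp-v1": 0, "myapp-v2": 0}
for _ in range(10_000):
    counts[pick_backend()] += 1
print(counts)  # roughly 9000 vs. 1000
```

Shifting the weights from 90/10 to 50/50 to 0/100 is how a canary deployment gradually promotes a new version without redeploying anything.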
Where is it all at now?
The basic HTTP routing functionality, along with weighted routing, is available; however, this work is under heavy development and is not production ready, and not all of the features mentioned above are fully available. If you are interested in playing around with the new routing tier in Cloud Foundry, take a look at the Cloud Foundry istio-release.