How to set up Nodelocal DNS cache on Rancher 2.x

This document (000020174) is provided subject to the disclaimer at the end of this document.

Situation

Why use Nodelocal DNS cache?

Like many applications in a containerised architecture, CoreDNS (or kube-dns) runs in a distributed fashion. In certain circumstances, DNS reliability and latency can be impacted by this approach. The causes relate notably to conntrack race conditions or exhaustion, cloud provider limits, and the unreliable nature of the UDP protocol.

A number of workarounds exist; however, long-term mitigation of these and other issues prompted a redesign of the Kubernetes DNS architecture, the result being the Nodelocal DNS cache project.

Requirements
  • A Kubernetes cluster of v1.15 or greater created by Rancher v2.x or RKE
  • A Linux cluster, Windows is currently not supported
  • Access to the cluster

Resolution

Installing

There are two installation approaches. Both should be non-invasive: pods that are currently running will not be modified, and the DNS configuration will take effect for pods started after the install is complete.

RKE1: Using Rancher v2.4.x or later, or RKE v1.1.0 or later

Update the cluster using 'Edit as YAML' in the Rancher UI. With RKE, edit the cluster.yml file instead.

Note: Updating the cluster using the below will create the node-local-dns DaemonSet, and restart the kubelet container on each node.

As in the documentation, update or add the dns.nodelocal.ip_address field using the following as an example:

  dns:
    [..]
    nodelocal:
      ip_address: "169.254.20.10"

New pods created after the change will configure the node-local-dns link-local address as the nameserver in /etc/resolv.conf.
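
To confirm the node-local-dns DaemonSet was created in the kube-system namespace, it can be listed with:

kubectl get daemonset -n kube-system node-local-dns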

Note: No further action is needed to use node-local-dns (such as the Option A or B steps described below); the changes to /etc/resolv.conf will take effect for pods started from this point onwards.

RKE1: Using a Rancher version before v2.4.x, or RKE version before v1.1.0

Install the YAML manifest by navigating to the cluster and clicking the Launch kubectl button in the Rancher UI. The command below can also be run from a terminal where a kubeconfig for the cluster is currently configured.

Placeholder variables in the manifest are replaced by sed before it is applied. One assumption is that the cluster service discovery domain is cluster.local (the default); adjust the command if needed.

curl -sL https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml \
  | sed -e 's/__PILLAR__DNS__DOMAIN__/cluster.local/g' \
  | sed -e "s/__PILLAR__DNS__SERVER__/$(kubectl get service --namespace kube-system kube-dns -o jsonpath='{.spec.clusterIP}')/g" \
  | sed -e 's/__PILLAR__LOCAL__DNS__/169.254.20.10/g' \
  | kubectl apply -f -

Ensure the node-local-dns pods start successfully; a pod should start on each control plane and worker node.

kubectl get -n kube-system pod -l k8s-app=node-local-dns
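
Output similar to the following is expected, with one line per node (pod names and ages will differ):

NAME                   READY   STATUS    RESTARTS   AGE
node-local-dns-x2x6p   1/1     Running   0          2m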

When deploying the YAML manifest, there are two options to configure the cluster to use the new node-local-dns configuration; please choose Option A or B below.

Option A - Configure the Kubelet

By default, the Kubelet will configure the /etc/resolv.conf of pods with the kube-dns Service ClusterIP as the nameserver. Configuring all new pods to query node-local-dns will require updating the Kubelet arguments.

Note: Updating the arguments using the below will restart the kubelet container on each node.

  • If the cluster was provisioned by Rancher, edit the cluster in the UI and click on Edit as YAML.
  • If the cluster was provisioned by RKE, edit the cluster.yml file directly.

Update the kubelet service with the cluster-dns argument and IP address. Click Save in the UI, or run rke up, to put this change into effect.

services:
  kubelet:
    extra_args:
      cluster-dns: "169.254.20.10"
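
To verify the new argument is in effect after the update, the kubelet container arguments can be checked on a node. This sketch assumes the Docker-based RKE1 kubelet container, named kubelet:

docker inspect kubelet | grep cluster-dns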

New pods created after the change will configure the node-local-dns link-local address as the nameserver in /etc/resolv.conf.

Option B - Configure Workloads

Alternatively, node-local-dns can be configured on a per-workload basis by updating the workload with a dnsPolicy and dnsConfig.

  • If using the Rancher UI, edit the workload, navigate to Show advanced options > Networking > DNS Nameservers and add 169.254.20.10. Additionally, adjust the DNS Policy to None.

  • If configuring by YAML, patch the following into the pod spec to adjust the dnsPolicy and dnsConfig:
    spec:
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
        - 169.254.20.10
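
As a sketch, the equivalent change can be applied to an existing Deployment with kubectl patch, where my-app is a hypothetical Deployment name; substitute your own workload and namespace:

kubectl patch deployment my-app --type merge -p '{"spec":{"template":{"spec":{"dnsPolicy":"None","dnsConfig":{"nameservers":["169.254.20.10"]}}}}}'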

RKE2: Using any RKE2 Kubernetes version

Update the default HelmChart for CoreDNS; setting the nodelocal.enabled: true value will install node-local-dns in the cluster. Please see the RKE2 documentation for more details.
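
As an example, this can be done with a HelmChartConfig resource placed in /var/lib/rancher/rke2/server/manifests/ on a server node. The sketch below assumes the packaged rke2-coredns chart; check the RKE2 documentation for the values supported by your version:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true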

Testing

Once installed, start a new pod to test DNS queries.

kubectl run --restart=Never --rm -it --image=tutum/dnsutils dns-test -- dig google.com

Unless Option B was used to configure node-local-dns, you should expect to see 169.254.20.10 as the server, and a successful answer to the query.
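
The relevant portion of the dig output will look similar to the following:

;; SERVER: 169.254.20.10#53(169.254.20.10)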

To verify a pod or container is using node-local-dns, check the /etc/resolv.conf file, for example:

kubectl exec -it <pod name> -- grep nameserver /etc/resolv.conf
nameserver 169.254.20.10

Removing Nodelocal DNS cache

To remove Nodelocal DNS cache from a cluster, the installation steps are reversed.

Note: Pods created with the node-local-dns nameserver in /etc/resolv.conf will need to be restarted to use the kube-dns Service as a nameserver again.

Using Rancher v2.4.x or later, or RKE v1.1.0 or later

Remove the dns.nodelocal configuration from the cluster YAML.

Using a Rancher version before v2.4.x, or RKE version before v1.1.0

  1. Remove the Kubelet configuration (Option A), or remove the dnsConfig from workloads (Option B).

  2. If Option A was taken, delete any pods in workloads that were started since the Kubelet configuration change, so that they are started with the kube-dns ClusterIP again (one approach is shown after this list).

  3. Remove the node-local-dns objects with the following command:

    curl -sL https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml | kubectl delete -f -

Note: It is important to perform these steps in order, and only complete step 3 once the pods using node-local-dns have been started with the kube-dns ClusterIP configured in /etc/resolv.conf again.
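
For step 2, one approach is a rollout restart of the affected workloads, for example:

kubectl rollout restart deployment <deployment name> -n <namespace>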

Additional Information

Troubleshooting

Node-local-dns will perform external lookups on behalf of pods; this lookup occurs from the node-local-dns DaemonSet pod running on the same node as the pod.

For internal lookups, CoreDNS will be used. By default, node-local-dns will cache successful queries for 30s and negative queries for 5s. For an architecture overview, please see the diagram in the upstream Kubernetes NodeLocal DNSCache documentation.
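
The cache settings in use can be confirmed by printing the Corefile from the node-local-dns ConfigMap and locating the cache plugin block:

kubectl get configmap -n kube-system node-local-dns -o jsonpath='{.data.Corefile}'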

In no specific order, the following can help understand a DNS issue further.

Check all kube-dns and node-local-dns objects

Ensure there are no obvious issues with scheduling CoreDNS and node-local-dns pods in the cluster.

kubectl get all -n kube-system -l k8s-app=node-local-dns
kubectl get all -n kube-system -l k8s-app=kube-dns

All node-local-dns and kube-dns pods should be ready and running, and the kube-dns Service should exist. Check the events if needed to locate any warning or failed event messages.

kubectl describe ds -n kube-system -l k8s-app=node-local-dns
kubectl describe rs -n kube-system -l k8s-app=kube-dns

Check the logs and ConfigMap of kube-dns and node-local-dns pods

kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=node-local-dns
kubectl get configmap -n kube-system coredns -o yaml
kubectl get configmap -n kube-system node-local-dns -o yaml

Enable logging and perform a DNS test

Note: Query logging can increase the log output from CoreDNS; enabling this temporarily while investigating is suggested.
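
As a sketch, query logging can be enabled by editing the coredns ConfigMap and adding the log plugin to the server block of the Corefile, as in the excerpt below. When the reload plugin is present in the Corefile, CoreDNS will apply the change after a short delay; otherwise, restart the CoreDNS pods.

kubectl edit configmap -n kube-system coredns

.:53 {
    log
    errors
    [..]
}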

  • Run a DaemonSet to perform queries from a pod running on each node in the cluster, as in the sketch below
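
A minimal sketch of such a DaemonSet follows, using the tutum/dnsutils image from the testing section; each pod performs a timestamped query every 10 seconds, so failures can be correlated with the node the pod is running on. Tolerations may be needed to also schedule pods on control plane nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: dns-test
  template:
    metadata:
      labels:
        app: dns-test
    spec:
      containers:
      - name: dns-test
        image: tutum/dnsutils
        command:
        - sh
        - -c
        - while true; do date; dig +short +time=2 +tries=1 google.com || echo "query failed"; sleep 10; done

The results can then be reviewed per pod with kubectl logs -n default -l app=dns-test.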

Ask questions to further eliminate the issue

  • Is it only DNS that is affected, or is all connectivity affected?
  • Are internal, external or all DNS queries failing?
  • Are all nodes and workloads experiencing the issue, or a specific node or workload?
    • Nodes use the upstream DNS configured in /etc/resolv.conf; queries failing from a node could indicate the issue is with upstream DNS
  • What is the error reported by applications?
    • If logs are aggregated, queries can be performed on the logs to identify timelines and impact
  • Is the issue intermittent or constantly occurring?
    • If the issue is intermittent, configure monitoring or a loop to identify when the issue occurs; when it does, are internal, external or all queries affected?

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020174
  • Creation Date: 31-Oct-2021
  • Modified Date: 27-Jul-2022
  • SUSE Rancher
