How to set up Nodelocal DNS cache on Rancher 2.x

This document (000020174) is provided subject to the disclaimer at the end of this document.

Situation

Why use Nodelocal DNS cache?

Like many applications in a containerised architecture, CoreDNS (or kube-dns) runs in a distributed fashion. In certain circumstances, DNS reliability and latency can be impacted by this approach. The causes relate notably to conntrack race conditions or exhaustion, cloud provider limits, and the unreliable nature of the UDP protocol.

A number of workarounds exist; however, long-term mitigation of these and other issues prompted a redesign of the Kubernetes DNS architecture, the result being the Nodelocal DNS cache project.

Requirements
  • A Kubernetes cluster of v1.15 or greater created by Rancher v2.x or RKE
  • A Linux cluster, Windows is currently not supported
  • Access to the cluster

Resolution

Installing

There are two installation approaches. Both should be non-invasive: pods that are currently running will not be modified, and the DNS configuration will take effect for pods started after the install is complete.

RKE1: Using Rancher v2.4.x or later, or RKE v1.1.0 or later

Update the cluster using 'Edit as YAML' in the Rancher UI. With RKE, edit the cluster.yml file instead.

Note: Updating the cluster using the below will create the node-local-dns DaemonSet, and restart the kubelet container on each node.

As in the documentation, update or add the dns.nodelocal.ip_address field using the following as an example:

  dns:
    [..]
    nodelocal:
      ip_address: "169.254.20.10"

New pods created after the change will configure the node-local-dns link-local address as the nameserver in /etc/resolv.conf.
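
To confirm the node-local-dns DaemonSet was created in the kube-system namespace, it can be listed with:

kubectl get daemonset -n kube-system node-local-dns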

Note: No further action is needed to use node-local-dns (such as the Option A or B steps described below); the changes to /etc/resolv.conf will take effect for pods started from this point onwards.

RKE1: Using a Rancher version before v2.4.x, or RKE version before v1.1.0

Install the YAML manifest by navigating to the cluster and clicking the Launch kubectl button in the Rancher UI. The command below can also be run from a terminal where a kubeconfig for the cluster is currently configured.

Placeholder variables in the manifest are replaced by sed before it is applied. One assumption is that the cluster service discovery domain is cluster.local (the default); adjust the command if needed.

curl -sL https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml \
  | sed -e 's/__PILLAR__DNS__DOMAIN__/cluster.local/g' \
  | sed -e "s/__PILLAR__DNS__SERVER__/$(kubectl get service --namespace kube-system kube-dns -o jsonpath='{.spec.clusterIP}')/g" \
  | sed -e 's/__PILLAR__LOCAL__DNS__/169.254.20.10/g' \
  | kubectl apply -f -

Ensure the node-local-dns pods start successfully; a pod should start on each control plane and worker node.

kubectl get -n kube-system pod -l k8s-app=node-local-dns
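
Output similar to the following is expected, with one line per node (pod names and ages will differ):

NAME                   READY   STATUS    RESTARTS   AGE
node-local-dns-x2x6p   1/1     Running   0          2m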

When deploying the YAML manifest, there are two options to configure the cluster to use the new node-local-dns configuration; please choose Option A or B below.

Option A - Configure the Kubelet

By default, the Kubelet will configure the /etc/resolv.conf of pods with the kube-dns Service ClusterIP as the nameserver. Configuring all new pods to query node-local-dns will require updating the Kubelet arguments.

Note: Updating the arguments using the below will restart the kubelet container on each node.

  • If the cluster was provisioned by Rancher, edit the cluster in the UI and click on Edit as YAML.
  • If the cluster was provisioned by RKE, edit the cluster.yml file directly.

Update the kubelet service with the cluster-dns argument and IP address. Click Save in the UI, or run rke up, to put this change into effect.

services:
  kubelet:
    extra_args:
      cluster-dns: "169.254.20.10"
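
To verify the new argument is in effect after the update, the kubelet container arguments can be checked on a node. This sketch assumes the Docker-based RKE1 kubelet container, named kubelet:

docker inspect kubelet | grep cluster-dns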

New pods created after the change will configure the node-local-dns link-local address as the nameserver in /etc/resolv.conf.

Option B - Configure Workloads

Alternatively, node-local-dns can be configured on a per-workload basis by updating the workload with a dnsPolicy and dnsConfig.

  • If using the Rancher UI, edit the workload, navigate to Show advanced options > Networking > DNS Nameservers and add 169.254.20.10. Additionally, adjust the DNS Policy to None.

  • If configuring by YAML, patch the following into the pod spec to adjust the dnsPolicy and dnsConfig:
    spec:
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
        - 169.254.20.10
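
As a sketch, the equivalent change can be applied to an existing Deployment with kubectl patch, where my-app is a hypothetical Deployment name; substitute your own workload and namespace:

kubectl patch deployment my-app --type merge -p '{"spec":{"template":{"spec":{"dnsPolicy":"None","dnsConfig":{"nameservers":["169.254.20.10"]}}}}}'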

RKE2: Using any RKE2 Kubernetes version

Update the default HelmChart for CoreDNS; setting the nodelocal.enabled: true value will install node-local-dns in the cluster. Please see the RKE2 documentation for more details.
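
As an example, this can be done with a HelmChartConfig resource placed in /var/lib/rancher/rke2/server/manifests/ on a server node. The sketch below assumes the packaged rke2-coredns chart; check the RKE2 documentation for the values supported by your version:

apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rke2-coredns
  namespace: kube-system
spec:
  valuesContent: |-
    nodelocal:
      enabled: true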

Testing

Once installed, start a new pod to test DNS queries.

kubectl run --restart=Never --rm -it --image=tutum/dnsutils dns-test -- dig google.com

Unless Option B was used to configure node-local-dns, you should expect to see 169.254.20.10 as the server, and a successful answer to the query.
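
The relevant portion of the dig output will look similar to the following:

;; SERVER: 169.254.20.10#53(169.254.20.10)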

To verify a pod or container is using node-local-dns, check the /etc/resolv.conf file, for example:

kubectl exec -it <pod name> -- grep nameserver /etc/resolv.conf
nameserver 169.254.20.10

Removing Nodelocal DNS cache

To remove Nodelocal DNS cache from a cluster, the installation steps are reversed.

Note: Pods created with the node-local-dns nameserver in /etc/resolv.conf will need to be restarted to use the kube-dns Service as a nameserver again.

Using Rancher v2.4.x or later, or RKE v1.1.0 or later

Remove the dns.nodelocal configuration from the cluster YAML.

Using a Rancher version before v2.4.x, or RKE version before v1.1.0

  1. Remove the Kubelet configuration (Option A), or remove the dnsConfig from workloads (Option B).

  2. If Option A was taken, delete any pods in workloads that were started since the Kubelet configuration change, so that they are started with the kube-dns ClusterIP again (one approach is shown after this list).

  3. Remove the node-local-dns objects with the following command:

    curl -sL https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml | kubectl delete -f -

Note: It is important to perform these steps in order, and only complete step 3 once the pods using node-local-dns have been started with the kube-dns ClusterIP configured in /etc/resolv.conf again.
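
For step 2, one approach is a rollout restart of the affected workloads, for example:

kubectl rollout restart deployment <deployment name> -n <namespace>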

Additional Information

Troubleshooting

Node-local-dns will perform external lookups on behalf of pods; this lookup occurs from the node-local-dns DaemonSet pod running on the same node as the pod.

For internal lookups, CoreDNS will be used. By default, node-local-dns will cache successful queries for 30s and negative queries for 5s. For an architecture overview, please see the diagram in the upstream Kubernetes NodeLocal DNSCache documentation.
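
The cache settings in use can be confirmed by printing the Corefile from the node-local-dns ConfigMap and locating the cache plugin block:

kubectl get configmap -n kube-system node-local-dns -o jsonpath='{.data.Corefile}'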

In no specific order, the following can help understand a DNS issue further.

Check all kube-dns and node-local-dns objects

Ensure there are no obvious issues with scheduling CoreDNS and node-local-dns pods in the cluster.

kubectl get all -n kube-system -l k8s-app=node-local-dns
kubectl get all -n kube-system -l k8s-app=kube-dns

All node-local-dns and kube-dns pods should be ready and running, and the kube-dns Service should exist. Check the events if needed to locate any warning or failed event messages.

kubectl describe ds -n kube-system -l k8s-app=node-local-dns
kubectl describe rs -n kube-system -l k8s-app=kube-dns

Check the logs and ConfigMap of kube-dns and node-local-dns pods

kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=node-local-dns
kubectl get configmap -n kube-system coredns -o yaml
kubectl get configmap -n kube-system node-local-dns -o yaml

Enable logging and perform a DNS test

Note: Query logging can increase the log output from CoreDNS; enabling this temporarily while investigating is suggested.
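
As a sketch, query logging can be enabled by editing the coredns ConfigMap and adding the log plugin to the server block of the Corefile, as in the excerpt below. When the reload plugin is present in the Corefile, CoreDNS will apply the change after a short delay; otherwise, restart the CoreDNS pods.

kubectl edit configmap -n kube-system coredns

.:53 {
    log
    errors
    [..]
}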

  • Run a DaemonSet to perform queries from a pod running on each node in the cluster, as in the sketch below
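
A minimal sketch of such a DaemonSet follows, using the tutum/dnsutils image from the testing section; each pod performs a timestamped query every 10 seconds, so failures can be correlated with the node the pod is running on. Tolerations may be needed to also schedule pods on control plane nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: dns-test
  template:
    metadata:
      labels:
        app: dns-test
    spec:
      containers:
      - name: dns-test
        image: tutum/dnsutils
        command:
        - sh
        - -c
        - while true; do date; dig +short +time=2 +tries=1 google.com || echo "query failed"; sleep 10; done

The results can then be reviewed per pod with kubectl logs -n default -l app=dns-test.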

Ask questions to further eliminate the issue

  • Is it only DNS that is affected, or is all connectivity affected?
  • Are internal, external or all DNS queries failing?
  • Are all nodes and workloads experiencing the issue, or a specific node or workload?
    • Nodes use the upstream DNS configured in /etc/resolv.conf; queries failing from a node could indicate the issue is with upstream DNS
  • What is the error reported by applications?
    • If logs are aggregated, queries can be performed on the logs to identify timelines and impact
  • Is the issue intermittent or constantly occurring?
    • If the issue is intermittent, configure monitoring or a loop to identify when the issue occurs; when it does, are internal, external or all queries affected?

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000020174
  • Creation Date: 31-Oct-2021
  • Modified Date: 27-Jul-2022
  • SUSE Rancher
