Troubleshooting high ingress request times

This document (000020014) is provided subject to the disclaimer at the end of this document.

Environment

An RKE or RKE2 cluster
Use of the Nginx Ingress Controller

Situation

This article aims to provide steps to gather and analyse data to help troubleshoot ingress performance issues.

Resolution

Review request times

To narrow down which requests are taking the longest, start by analyzing the ingress-nginx logs.

Retrieve requests with a high (>2s) upstream_response_time. This log field represents the time taken for the response from the upstream target - the pod endpoint in the service.

kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=2000 | awk '/- -/ && $(NF-2)>2.0'

The same can be done for request_time, which represents the time taken to complete the entire request, including the upstream_response_time above.

kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=2000 | awk '/- -/ && $(NF-7)>2.0'

Please adjust the threshold to suit; >2.0 filters for any times greater than 2.0 seconds.
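As a minimal sketch, the two filters above can be wrapped in one helper that takes the field offset and the threshold as arguments. The function name slow_requests is hypothetical, and the offsets (2 for upstream_response_time, 7 for request_time) assume the same log field positions as the commands above.

```shell
# slow_requests OFFSET [THRESHOLD]
# Reads ingress-nginx access log lines on stdin and prints those where the
# field OFFSET positions from the end of the line exceeds THRESHOLD seconds.
# (Hypothetical helper; offsets follow the awk commands in this article.)
slow_requests() {
  awk -v off="$1" -v t="${2:-2.0}" \
    '/- -/ && NF > off && ($(NF - off) + 0) > (t + 0)'
}

# Example usage against a cluster (not run here):
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | slow_requests 2 2.0
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | slow_requests 7 5.0
```

A value of "-" in the time field (no upstream was contacted) is treated as 0 and is therefore never reported.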

Comparing the difference in timings between request_time and upstream_response_time can help to understand the issue further:

- Locate any upstream targets (pods), or the nodes they run on, that are frequently associated with a higher upstream_response_time.
- If all upstream targets in a particular ingress/service are experiencing higher response times:
  - What dependencies does the application have? For example, external APIs, databases, or other services.
  - Investigate the application logs.
  - Simulate the same requests directly against the pods to bypass ingress-nginx. Are they also slow?
- If the upstream_response_time is much lower than request_time, the time is being spent elsewhere. Check for tuning, performance, or resource issues on the nodes.
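The comparison described above can be sketched as a per-target summary. The function name upstream_summary is hypothetical, and the sketch assumes the upstream address sits four fields from the end of each access log line, consistent with the field positions used by the earlier commands.

```shell
# upstream_summary
# Reads access log lines on stdin and prints one row per upstream target:
#   <upstream_addr> <requests> <avg upstream_response_time> <avg (request_time - upstream_response_time)>
# Rows are sorted by average upstream_response_time, slowest first.
# (Hypothetical helper; field positions are an assumption based on this article.)
upstream_summary() {
  awk '/- -/ && NF >= 8 {
    addr = $(NF - 4)
    up   = $(NF - 2) + 0      # upstream_response_time
    req  = $(NF - 7) + 0      # request_time
    n[addr]++
    up_sum[addr]  += up
    gap_sum[addr] += req - up
  }
  END {
    for (a in n)
      printf "%s %d %.3f %.3f\n", a, n[a], up_sum[a] / n[a], gap_sum[a] / n[a]
  }' | LC_ALL=C sort -k3,3 -rn
}

# Example usage (without -f, so the stream ends and the summary is printed):
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | upstream_summary
```

A target with a high third column points at the pod or its node, while a high fourth column suggests the time is being spent outside the upstream.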

Note: The request_time metric is also used to create the ingress controller graphs when Cluster Monitoring is enabled.

Review request details

Along with the output from the previous step, it is also useful to analyse the request details for common patterns: the request itself, source/destination IP address, response code, user agent, and the unique name for the ingress.
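One way to surface such patterns is to count requests per response code and upstream name. The sketch below is illustrative only: request_patterns is a hypothetical name, the status code is assumed to follow the quoted request, and the bracketed upstream name is assumed to sit six fields from the end of the line.

```shell
# request_patterns
# Reads access log lines on stdin and prints "<count> <status> <[upstream-name]>",
# most frequent combination first.
# (Hypothetical helper; field layout is an assumption, check it against your logs.)
request_patterns() {
  awk '/- -/ && NF >= 8 {
    split($0, q, "\"")        # q[3] holds " <status> <bytes> " after the quoted request
    split(q[3], s, " ")
    key = s[1] " " $(NF - 6)  # status code plus upstream name
    count[key]++
  }
  END { for (k in count) print count[k], k }' | LC_ALL=C sort -rn
}

# Example usage, limited to slow requests (without -f so the summary is printed):
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 \
#     | awk '/- -/ && $(NF-7)>2.0' | request_patterns
```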

You may need to review these with the related application teams. For example, a request to retrieve a large amount of data, or to perform a complex query, may genuinely take a long time; these can potentially be ignored.

Some requests may open a websocket. If the service scales up and down regularly, a small number of upstream targets can end up holding the long-running connections, which distributes load unevenly across the targets.

It's also worthwhile to consider the time when the issue occurs, the number of pods in the service, performance metrics, and requests/limits in place. For example, do the requests occur during a peak load time? Is HPA configured to scale the deployment? Is monitoring data available to identify trends and correlate with the logs?
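To help answer the timing question from the logs alone, slow requests can be grouped per minute and compared with known peak periods or monitoring graphs. This is a rough sketch: slow_by_minute is a hypothetical name, and it assumes the timestamp is the fourth whitespace-separated field in the form [dd/Mon/yyyy:HH:MM:SS.

```shell
# slow_by_minute [THRESHOLD]
# Reads access log lines on stdin and prints "<dd/Mon/yyyy:HH:MM> <count>" for
# requests whose request_time exceeds THRESHOLD seconds (default 2.0),
# in the order the minutes first appear in the input.
# (Hypothetical helper; field positions are an assumption based on this article.)
slow_by_minute() {
  awk -v t="${1:-2.0}" '/- -/ && NF >= 8 && ($(NF - 7) + 0) > (t + 0) {
    minute = substr($4, 2, 17)            # strip "[" and the seconds
    if (!(minute in count)) order[++n] = minute
    count[minute]++
  }
  END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Example usage (without -f so the summary is printed):
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | slow_by_minute 2.0
```

Because logs from several ingress-nginx pods are interleaved, the minutes may not appear in strict chronological order.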

Check ingress-nginx logs

With the focus previously on requests themselves, it is also useful to exclude the access logs and ensure there are no fundamental issues with ingress-nginx.

The following command should exclude all access.log output, retrieving output from the ingress controller and the nginx error.log only.

kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=100 | awk '!/- -/'

Please adjust the --tail flag as needed; this example retrieves the last 100 lines from each ingress-nginx pod.
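When the non-access output is large, a quick count per severity can show where to look first. The sketch below is an assumption-laden illustration: error_summary is a hypothetical name, controller lines are assumed to start with a klog-style severity letter followed by the date (for example W0101), and nginx error.log lines are assumed to carry a bracketed level such as [error].

```shell
# error_summary
# Reads ingress-nginx pod logs on stdin, skips access log lines, and prints
# "<count> <source:level>" with the most frequent level first.
# (Hypothetical helper; the two line formats are assumptions, verify against your logs.)
error_summary() {
  awk '!/- -/ {
    if ($0 ~ /^[EWIF][0-9][0-9][0-9][0-9] /)
      level = "controller:" substr($0, 1, 1)
    else if (match($0, /\[(emerg|alert|crit|error|warn|notice|info|debug)\]/))
      level = "nginx:" substr($0, RSTART + 1, RLENGTH - 2)
    else
      level = "other"
    count[level]++
  }
  END { for (l in count) print count[l], l }' | LC_ALL=C sort -rn
}

# Example usage (without -f so the summary is printed):
#   kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | error_summary
```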

Real-time view of all requests

Another option to get a broader overview is using a tool like goaccess. After installing the package, the below can be used to feed ingress-nginx logs to goaccess to get a real-time view of the logs.

kubectl logs -f -n ingress-nginx -l app=ingress-nginx --tail=2000 | goaccess --log-format="%h - - [%d:%t] \"%m %r %H\" %s %b \"%R\" \"%u\" %^ %T [%v]" --time-format '%H:%M:%S %z' --date-format "%d/%b/%Y"

Please adjust the history of logs with the --tail flag.

Measure requests to ingress-nginx

If you have ruled out all areas so far, it might be worthwhile to focus on the Load Balancer or network devices that provide client access to ingress-nginx.

The following articles contain curl commands to perform SNI-compliant requests and measure statistics. These requests can also be compared with the ingress-nginx logs (as above) to understand what portion of the time was spent with ingress-nginx handling the request.

You may also be able to obtain metrics from your Load Balancer or infrastructure to troubleshoot this further.
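As a rough illustration of this kind of measurement (not the exact commands from the referenced articles), curl can send the request with the correct SNI hostname to a specific address and print its timing breakdown. The function name measure_ingress_request and the example hostname and address are hypothetical; --resolve and the -w timing variables are standard curl options.

```shell
# measure_ingress_request HOST ADDRESS [PATH]
# Sends an HTTPS request for HOST to ADDRESS (a load balancer or node IP),
# keeping HOST as the SNI and Host header, and prints curl's timing breakdown.
# (Hypothetical helper; HOST, ADDRESS and PATH are placeholders you supply.)
measure_ingress_request() {
  host="$1"
  address="$2"
  path="${3:-/}"
  curl -sS -o /dev/null \
    --resolve "${host}:443:${address}" \
    -w 'dns=%{time_namelookup}s connect=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s status=%{http_code}\n' \
    "https://${host}${path}"
}

# Example usage (placeholder values, not run here):
#   measure_ingress_request app.example.com 203.0.113.10 /healthz
```

Comparing total with the request_time logged by ingress-nginx for the same request gives an indication of how much time is spent before the request reaches the ingress controller.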

Additional Information

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.