The kubectl-debug plugin for debugging in Kubernetes pods

At the end of last year, a kubectl plugin that helps debug pods in a Kubernetes cluster, kubectl-debug, was announced on Reddit. The idea immediately struck our engineers as interesting and useful, so we decided to look at its implementation and are happy to share our results with Habr readers.

Why is it even needed?

Right now, debugging anything inside a pod is seriously inconvenient. The main goal when building a container image is to keep it minimal: as small as possible and containing as little "extra" as possible. However, when the software in the container misbehaves, or when you need to debug its communication with other services inside or outside the cluster, that minimalism plays a cruel joke on us: the container has nothing in it for actually tracking down the problem. Utilities such as netstat, ip, ping, curl and wget are usually not available.

Often it all ends with the engineer hastily installing the necessary software right into the running container in order to “see the light” and find the problem. It is for exactly such cases that the kubectl-debug plugin looked like a very useful tool, because it relieves this pressing pain.

With it, you can start a container with all the necessary tools on board in the context of the problem pod and examine all the processes “from the side” while being inside. If you have ever had to troubleshoot in Kubernetes, that sounds appealing, doesn't it?

What is this plugin?

In general terms, the architecture of this solution is a combination of a kubectl plugin and an agent launched via a DaemonSet controller. The plugin handles commands starting with kubectl debug …, and interacts with the agents on the cluster nodes. The agent, in turn, runs in the host network, and the host's docker.sock is mounted into the agent pod for full access to the containers on that server.

Accordingly, when a debug container is requested for a given pod, the pod's hostIP is looked up, and a request is sent to the agent (running on the corresponding host) to start the debug container in the namespaces of the target pod.
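
The lookup step can be illustrated with a minimal Python sketch. It is not the plugin's actual code: the function name agent_address is hypothetical, and the port 10027 is an assumption about the agent's listening port that should be checked against your DaemonSet manifest. The only thing taken from the Kubernetes API is the status.hostIP field of the Pod object.

```python
# Assumption: the port the debug agent listens on; verify it in your DaemonSet.
AGENT_PORT = 10027


def agent_address(pod: dict, agent_port: int = AGENT_PORT) -> str:
    """Return "hostIP:port" of the agent on the node that runs the given pod.

    `pod` is a Pod object as a dict, e.g. parsed from
    `kubectl get pod <name> -o json`.
    """
    host_ip = pod.get("status", {}).get("hostIP")
    if not host_ip:
        # A pod that has not been scheduled yet has no hostIP.
        raise ValueError("pod has no status.hostIP (not scheduled yet?)")
    return f"{host_ip}:{agent_port}"
```

Because the agent runs in the host network, the node IP from the pod status is enough to reach it; no Service or extra discovery is involved in this sketch.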

A more detailed description of these steps is available in the project documentation.

What is required for it to work?

The author of kubectl-debug claims compatibility with Kubernetes client / cluster versions 1.12.0+, but I had K8s 1.10.8 on hand, on which everything worked without visible problems ... with a single caveat: for the kubectl debug command to work in that form, kubectl itself must be version 1.12+. Otherwise, all the commands are the same, only invoked as kubectl-debug ….

When you deploy the DaemonSet template described in the README, do not forget about the taints you use on your nodes: without the appropriate tolerations, the agent pods will not be scheduled there, and as a result you will not be able to attach the debugger to pods living on such nodes.
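
To see which nodes would be left without an agent, the toleration matching rules can be sketched in Python. This is an illustration, not part of kubectl-debug: the function names are hypothetical, and the inputs are assumed to be the taints list from a node's spec and the tolerations list from the agent's pod template, both as plain dicts.

```python
# Only these taint effects prevent a pod from being placed on (or kept on) a node.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def toleration_matches(toleration: dict, taint: dict) -> bool:
    """Check whether one toleration covers one taint."""
    effect = toleration.get("effect", "")
    # An empty effect in a toleration matches every effect.
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return key == "" or key == taint.get("key")
    # Default operator "Equal": both key and value must match.
    return key == taint.get("key") and toleration.get("value", "") == taint.get("value", "")


def blocking_taints(taints: list, tolerations: list) -> list:
    """Return the taints that would keep the agent pod off the node."""
    return [
        taint
        for taint in taints
        if taint.get("effect") in HARD_EFFECTS
        and not any(toleration_matches(tol, taint) for tol in tolerations)
    ]
```

An empty result for every node means the agent can run everywhere; any taint returned points to a toleration that is missing from the DaemonSet.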

The debugger's help is quite complete and seems to describe all the current options for launching and configuring the plugin. In general, the utility offers a pleasingly large number of launch flags: you can supply certificates, specify the kubectl context, point to a separate kubectl config or to the address of the cluster API server, and more.

Working with the debugger

Getting to the “everything works” state takes two steps:

run kubectl apply -f agent_daemonset.yml;
install the plugin itself, following the steps described here.

How do you use it? Suppose we have the following problem: metrics from one of the services in the cluster are not being collected, and we want to check whether there are network problems between Prometheus and the target service. As you might guess, the Prometheus image lacks the tools needed for that.

Let's try to attach to the Prometheus container (if the pod has several containers, you will need to specify which one to attach to; otherwise the debugger picks the first one by default):

kubectl-debug --namespace kube-prometheus prometheus-main-0
Defaulting container name to prometheus.
pulling image nicolaka/netshoot:latest... 
latest: Pulling from nicolaka/netshoot
4fe2ade4980c: Already exists 
ad6ddc9cd13b: Pull complete 
cc720038bf2b: Pull complete 
ff17a2bb9965: Pull complete 
6fe9f5dade08: Pull complete 
d11fc7653a2e: Pull complete 
4bd8b4917a85: Pull complete 
2bd767dcee18: Pull complete 
Digest: sha256:897c19b0b79192ee5de9d7fb40d186aae3c42b6e284e71b93d0b8f1c472c54d3
Status: Downloaded newer image for nicolaka/netshoot:latest
starting debug container...
container created, open tty...
 [1]   → 
root @ /
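
The “Defaulting container name to prometheus” line in the output corresponds to the rule described above: with no container specified, the first one in the pod spec is used. A minimal sketch of that rule in Python, assuming a Pod object parsed into a dict; the function name target_container is hypothetical and this is not the plugin's own code:

```python
from typing import Optional


def target_container(pod: dict, requested: Optional[str] = None) -> str:
    """Pick the container to debug: the requested one, else the first in the spec."""
    names = [container["name"] for container in pod["spec"]["containers"]]
    if requested is None:
        # Same behaviour as the "Defaulting container name to ..." message.
        return names[0]
    if requested not in names:
        raise ValueError(f"container {requested!r} not found, available: {names}")
    return requested
```

For a pod with sidecars, relying on the default can silently attach you to the wrong container, so naming the container explicitly is the safer habit.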

Earlier we found out that the problem service lives at 10.244.1.214 and listens on port 8080. Of course, we could check its availability from the hosts as well, but for reliable debugging these operations should be reproduced under identical (or as close as possible) conditions. Therefore, checking from the Prometheus pod / container is the best option. Let's start simple:

 [1]   → ping 10.244.1.214
PING 10.244.1.214 (10.244.1.214) 56(84) bytes of data.
64 bytes from 10.244.1.214: icmp_seq=1 ttl=64 time=0.056 ms
64 bytes from 10.244.1.214: icmp_seq=2 ttl=64 time=0.061 ms
64 bytes from 10.244.1.214: icmp_seq=3 ttl=64 time=0.047 ms
64 bytes from 10.244.1.214: icmp_seq=4 ttl=64 time=0.049 ms
^C
--- 10.244.1.214 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 61ms
rtt min/avg/max/mdev = 0.047/0.053/0.061/0.007 ms

All is well. Maybe the port is unavailable?

 [1]   → curl -I 10.244.1.214:8080
HTTP/1.1 200 OK
Date: Sat, 12 Jan 2019 14:01:29 GMT
Content-Length: 143
Content-Type: text/html; charset=utf-8
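
If the same reachability check has to be repeated for several endpoints, it can be scripted. The following is a minimal sketch using only the Python standard library, assuming a Python interpreter is available in whatever debug image you run; the function name port_open is hypothetical. It only verifies that a TCP connection can be established, which is a weaker check than the HTTP request made by curl.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the service from this example the call would be port_open("10.244.1.214", 8080). A False result does not distinguish a closed port from a filtered one, so tcpdump remains the tool for the details.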

No problems there either. Then let's check whether Prometheus and the metrics endpoint are actually communicating:

 [2]   → tcpdump host 10.244.1.214
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:04:19.234101 IP prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278 > 10.244.1.214.8080: Flags [P.], seq 4181259750:4181259995, ack 2078193552, win 1444, options [nop,nop,TS val 3350532304 ecr 1334757657], length 245: HTTP: GET /metrics HTTP/1.1
14:04:19.234158 IP 10.244.1.214.8080 > prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278: Flags [.], ack 245, win 1452, options [nop,nop,TS val 1334787600 ecr 3350532304], length 0
14:04:19.290904 IP 10.244.1.214.8080 > prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278: Flags [P.], seq 1:636, ack 245, win 1452, options [nop,nop,TS val 1334787657 ecr 3350532304], length 635: HTTP: HTTP/1.1 200 OK
14:04:19.290923 IP prometheus-main-0.prometheus-operated.kube-prometheus.svc.cluster.local.36278 > 10.244.1.214.8080: Flags [.], ack 636, win 1444, options [nop,nop,TS val 3350532361 ecr 1334787657], length 0
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel

Requests go out and responses come back. From these checks we can conclude that there are no problems at the network level, which means that (most likely) we need to look at the application side. We attach to the exporter container (again using the debugger in question, of course, because exporters always have extremely minimal images) and ... are surprised to find a problem in the service configuration: for example, someone forgot to point the exporter at the correct address of the target application. Case solved!

Of course, other ways of debugging are possible in the situation described here, but we leave them outside the scope of this article. The takeaway is that kubectl-debug has plenty of possible uses: after all, you can run absolutely any image, and if you want, you can even build your own specific one (with the necessary set of tools).

What other use cases immediately come to mind?

A "silent" application for which the ~~pesky~~ developers did not implement proper logging. It does, however, allow you to connect to a service port and debug it with a specific tool, which of course does not belong in the final image.
Launching, next to the production application, an identical one in “manual” mode but with debugging enabled, to check its interaction with neighboring services.

In general, there are obviously many more situations in which such a tool can be useful. Engineers who run into them at work every day will be able to appreciate the utility's potential for “live” debugging.

Conclusions

kubectl-debug is a useful and promising tool. Of course, there are Kubernetes clusters and applications for which it does not make much sense, but more likely than not it will provide invaluable help in debugging, especially when it comes to the production environment and the need to find the cause of a problem quickly.

Our first experience with it revealed an acute need to attach to a pod / container that does not fully start (for example, one “stuck” in CrashLoopBackOff), precisely in order to check on the fly why the application fails to start. I created a corresponding issue in the project repository, to which the developer responded positively and promised an implementation in the near future. I was very pleased with the quick and adequate feedback. So we look forward to new features of the utility and its further development!