I recently had to write a proposal for a Kubernetes monitoring solution. Kubernetes is a tricky beast to monitor, especially with Prometheus, because it exposes so much information. I was daunted by the amount of effort and understanding I’d need just to begin the task, and it took me a few days to actually open the ticket.

After reading through a lot of articles on Kubernetes monitoring, I knew the answer wasn’t going to be a simple one. As I stared at the space above my monitor, pondering the problem, a simple thought struck me: what if I just wrote down every question my monitoring needed to answer, and then gathered the metrics that answer them? In every job I’ve ever had, the brief has always been “monitor x, y, and z”, never “tell me when this isn’t working properly.”

I started off my question chain as follows:

  1. Can my pods be scheduled?
  2. Are there nodes available to schedule pods?
  3. Are there resources available on my nodes?
  4. Are my pods running as expected?
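
One way to make the question chain concrete is to pair each question with a PromQL expression that returns series only when the answer is “no”. The sketch below is illustrative rather than definitive: it assumes kube-state-metrics v2 is installed (the kube_* metric names come from it and can differ between versions), it assumes Prometheus’s HTTP API is reachable, and the QUESTIONS table, the 0.9 threshold, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Each question maps to a PromQL expression that returns series only when the
# answer is "no". Metric names assume kube-state-metrics v2; the 0.9 CPU
# request threshold is an illustrative assumption, not a recommendation.
QUESTIONS = {
    "Can my pods be scheduled?":
        'kube_pod_status_scheduled{condition="false"} == 1',
    "Are there nodes available to schedule pods?":
        'sum(kube_node_status_condition{condition="Ready",status="true"}) == 0',
    "Are there resources available on my nodes?":
        'sum by (node) (kube_pod_container_resource_requests{resource="cpu"})'
        ' / sum by (node) (kube_node_status_allocatable{resource="cpu"}) > 0.9',
    "Are my pods running as expected?":
        "kube_deployment_status_replicas_available < kube_deployment_spec_replicas",
}


def query_url(prometheus_base: str, expr: str) -> str:
    """Build an instant-query URL for Prometheus's /api/v1/query endpoint."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def answer_is_yes(response_body: str) -> bool:
    """Interpret a /api/v1/query JSON response: no returned series means "yes"."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return len(payload["data"]["result"]) == 0
```

Fetching each URL (with urllib.request.urlopen, for example) and passing the body to answer_is_yes walks the chain in order; the first “no” tells you where to start looking.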

If I can answer each of these questions in the affirmative, I can generally expect a service to be running correctly. Likewise, if I ask these questions for each layer (hardware, control plane, nodes, services, ingress, etc.) and some answers come back negative, I can trace how the failure cascades through the dependencies in my compute, advise customers, and mitigate the outage as much as possible.
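
The cascading reasoning can be sketched as a simple triage: order the layers from the bottom of the stack up, treat the lowest failing layer as the likely root cause, and treat failures above it as probable symptoms. This is a minimal sketch that assumes the layers form a strict chain, which real clusters do not always do; the LAYERS ordering and the triage helper are hypothetical.

```python
from __future__ import annotations

# Layers ordered from the bottom of the stack to the top. This strict linear
# ordering is an illustrative assumption; real dependencies can branch.
LAYERS = ["hardware", "control plane", "nodes", "services", "ingress"]


def triage(healthy: dict[str, bool]) -> tuple[str | None, list[str]]:
    """Return (likely root cause, failing layers that are probably downstream).

    Layers missing from `healthy` are assumed to be healthy.
    """
    # Preserve bottom-up order so the first failure is the lowest one.
    failing = [layer for layer in LAYERS if not healthy.get(layer, True)]
    if not failing:
        return None, []
    # Everything failing above the lowest failure is treated as a symptom of it.
    return failing[0], failing[1:]
```

Prometheus’s Alertmanager can apply similar logic through inhibition rules, muting upper-layer alerts while a lower-layer alert is firing.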

Actually getting the metrics to answer these questions was super easy. I just opened up the Kubernetes Dashboard and deliberately broke something. When the Dashboard went red, I noted down the affected metrics. Once I’d collected all my metrics, I created Prometheus alerting rules for them. It’s that easy.
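
Once the metrics are collected, turning them into alerting rules is mechanical. The sketch below writes a Prometheus rule file from a list of (alert name, PromQL expression, for duration) tuples. The alert names, the expressions (which assume kube-state-metrics), the severity label, and the durations are placeholder assumptions; substitute the metrics you noted down while breaking things.

```python
from __future__ import annotations

import json

# Hypothetical alerts: (alert name, PromQL expression, "for" duration).
# Expressions assume kube-state-metrics; durations are illustrative only.
ALERTS = [
    ("PodsUnschedulable", 'kube_pod_status_scheduled{condition="false"} == 1', "10m"),
    ("DeploymentReplicasUnavailable",
     "kube_deployment_status_replicas_available < kube_deployment_spec_replicas", "15m"),
]


def rule_file(group_name: str, alerts: list[tuple[str, str, str]]) -> str:
    """Render alerts as a Prometheus rule file.

    JSON is emitted because JSON is valid YAML, so no third-party YAML library
    is needed; validate the output with `promtool check rules` before use.
    """
    rules = [
        {
            "alert": name,
            "expr": expr,
            "for": duration,
            "labels": {"severity": "warning"},
            "annotations": {"summary": f"{name} condition has held for at least {duration}"},
        }
        for name, expr, duration in alerts
    ]
    return json.dumps({"groups": [{"name": group_name, "rules": rules}]}, indent=2)
```

Once the file is listed under rule_files in prometheus.yml, each rule fires when its expression keeps returning series for the whole for duration.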

Have a good process, and good results will follow.

Having trouble monitoring Kubernetes?

Kubernetes is a complex beast, and luckily we’ve spent a lot of time taming it. Get in touch and we can chat about how to improve your cluster and application monitoring on Kubernetes.
