
How to Monitor Kubernetes with ServicePilot?

April 2nd, 2026

Introduction to Kubernetes monitoring

Kubernetes has become one of the cornerstones of modern infrastructure. By orchestrating containers at scale, it enables businesses to deploy and manage their applications with remarkable flexibility. This power comes with increasing complexity: a Kubernetes cluster is a dynamic ecosystem, composed of pods, services, nodes and multiple internal components whose health must be continuously monitored.

It is precisely in this context that ServicePilot truly shines: a monitoring platform offering a clear, centralized view of all Kubernetes metrics while simplifying anomaly detection and incident management.

Before monitoring a Kubernetes cluster, it is crucial to monitor the underlying resources: CPU, memory, disk and network. Kubernetes clusters are complex distributed systems that require adequate resources to ensure high performance and maximum availability.

Why monitor Kubernetes?

From an operational standpoint, monitoring Kubernetes requires understanding an architecture based on the ephemeral nature of resources, the variability of topologies, the layering of abstraction and the often non-deterministic interactions between different services.

Monitoring a Kubernetes cluster is therefore not optional; it is a necessity. In an environment where pods appear and disappear based on load, where nodes can become overloaded and where services must remain accessible at all times, complete visibility is essential.

With K8s cluster monitoring, it becomes possible to:

  • Ensure availability by quickly detecting faulty pods, overloaded nodes or unavailable services
  • Optimize performance by understanding CPU, memory, network and storage usage
  • Anticipate incidents by identifying trends, forecasting overloads and adjusting resources
  • Ensure security and compliance by monitoring abnormal behavior, activity spikes or unplanned deployments

The challenges are numerous. Kubernetes is a distributed and dynamic system with components that generate a wide variety of metrics. Correlating this information to understand the root cause of an incident can quickly become complex.

ServicePilot addresses these challenges by centralizing data and presenting it in an organized, easy-to-understand format. Identifying abnormal behavior and understanding what is actually happening in the cluster is easier thanks to:

  • Automatic cluster discovery
  • Automatic collection of Kubernetes metrics
  • Preconfigured dashboards for nodes, pods and services
  • Intelligent alerting engine
  • Integration with DevOps environments

Monitoring nodes, pods and services

One of ServicePilot’s key strengths is its overview dashboard, which provides a comprehensive view of the cluster’s status. At a glance, you can see which pods are running correctly, which ones are encountering errors, which services are available and how the nodes are performing. Statuses are clearly highlighted, making it easy to immediately detect anomalies.

Each resource then has a detailed view. Nodes are analyzed in terms of load, usage and availability. For pods, ServicePilot displays their status, number of restarts, and CPU and memory usage. For services, it shows availability and latency metrics. This level of detail allows you to pinpoint exactly where the bottlenecks are.

It is very easy to create custom widgets and dashboards to monitor performance metrics:

  • CPU: usage by node, by pod, by namespace
  • Memory: consumption, limits, risk of OOMKill
  • I/O: disk access, latency, saturation
  • Network: incoming/outgoing traffic, errors, bottlenecks

Examples of custom widgets:

  • Top CPU curves by node
  • Memory heatmap by pod
  • Network latency graphs
  • Histograms of pod restarts

Beyond infrastructure, ServicePilot allows you to monitor the actual performance of applications and workloads:

  • Response time
  • Error rate
  • Saturation
  • Dependencies between services

Teams can thus correlate an application incident with an infrastructure or configuration issue.

Incident and alert management

Alert management is a central component of monitoring. With ServicePilot, you can define different types of alerts based on your needs: for example, an administrator can be notified when a pod restarts too frequently, when a node exceeds a certain load level or when a service becomes unavailable.

Here are several types of alerts available in ServicePilot:

  • Threshold alerts: CPU, memory, network
  • Status alerts: pod error, unavailable node, unavailable service
  • Behavioral alerts: anomalies detected by AI

In addition to pre-built thresholds, you can also configure custom alerts based on these metrics, whether they involve simple thresholds or more complex trends.
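
To make the difference between a simple threshold and a trend concrete, here is a minimal Python sketch. It is not ServicePilot's alerting engine or its configuration format: the `ThresholdRule` class, the metric name and both helper functions are hypothetical and only illustrate the two ideas.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdRule:
    """Hypothetical rule: fire when a metric stays above `limit`
    for at least `min_consecutive` samples in a row."""
    metric: str
    limit: float
    min_consecutive: int = 3


def breaches(rule: ThresholdRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent samples all exceed the limit."""
    if len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.limit for value in recent)


def rising_trend(samples: Sequence[float], min_slope: float) -> bool:
    """Return True if the least-squares slope (per sample) is >= min_slope."""
    n = len(samples)
    if n < 2:
        return False
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den >= min_slope
```

Requiring several consecutive samples above the limit avoids alerting on a single spike, while the slope check catches a metric that is still below the limit but climbing steadily.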

When an incident occurs, ServicePilot does more than just report it. The platform classifies it by severity, gathers relevant information and provides an analysis to help understand the root cause of the problem.

Integration with other tools

In a DevOps environment, monitoring cannot be isolated. ServicePilot’s Webhooks enable seamless integration with the most common tools, such as GitHub, GitLab, Jenkins, Slack or Microsoft Teams. These integrations allow you to link alerts to CI/CD pipelines, send real-time notifications or automate certain actions.

Thanks to these connections, it becomes possible to trigger performance tests after a deployment, pause a pipeline in the event of an incident or automatically create tickets in a management tool. This automation enhances team responsiveness and contributes to greater stability of Kubernetes environments.
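
As an illustration of this kind of relay, the sketch below turns an alert into a Slack notification. The alert fields (`severity`, `resource`, `message`) are assumptions made for the example, not ServicePilot's documented webhook payload. The only external convention relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def to_slack_message(alert: dict) -> dict:
    """Format a (hypothetical) alert payload as a Slack incoming-webhook body."""
    severity = str(alert.get("severity", "unknown")).upper()
    resource = alert.get("resource", "unknown resource")
    message = alert.get("message", "")
    return {"text": f"[{severity}] {resource}: {message}"}


def build_request(webhook_url: str, alert: dict) -> urllib.request.Request:
    """Build, but do not send, the POST request that relays the alert."""
    body = json.dumps(to_slack_message(alert)).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen(request, timeout=5)` would actually send it. Building and sending are kept separate here so the formatting can be tested without network access.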

Real-world use cases

1. Detecting a scheduling issue

One of the most common incidents in a Kubernetes cluster involves the scheduler being unable to assign a node to a Pod. This type of issue typically manifests as:

  • Pods in the Pending state that never transition to Running
  • FailedScheduling events indicating that Kubernetes did not find any compatible nodes
  • CPU pressure on one or more nodes, typically reported by the scheduler as "Insufficient cpu", preventing the scheduling of new workloads

Thanks to monitoring, the team can quickly correlate these signals: a spike in CPU usage on a node pool, followed by an increase in the number of pending pods, is a classic sign of saturation.
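
For readers who want to check these signals by hand, the following Python sketch lists pods that the scheduler could not place. It assumes the input is the JSON produced by `kubectl get pods --all-namespaces -o json`; the function name is made up for this example and is not part of any tool.

```python
from typing import Any


def unschedulable_pods(pod_list: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (namespace/name, scheduler message) for Pending pods
    whose PodScheduled condition is False with reason Unschedulable."""
    results = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if (cond.get("type") == "PodScheduled"
                    and cond.get("status") == "False"
                    and cond.get("reason") == "Unschedulable"):
                meta = pod.get("metadata", {})
                name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
                results.append((name, cond.get("message", "")))
    return results
```

The scheduler message usually states why no node was compatible (for example, insufficient CPU), which is the same information carried by the FailedScheduling event.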

Two approaches are generally effective:

  • Adjust requests/limits to avoid artificial resource over-allocation
  • Add a node to the cluster or the relevant node pool to absorb the actual load

A platform like ServicePilot allows you to visualize resource pressure and associated events at a glance, which significantly speeds up diagnosis.
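
To see why adjusting requests matters, recall that the scheduler places pods based on their declared requests compared with a node's allocatable capacity, not on actual usage. The sketch below works through that arithmetic for CPU only. It is a deliberate simplification: it handles only plain core values and the `m` (millicore) suffix, and it ignores memory, taints, affinity and every other scheduling constraint. Both function names are hypothetical.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(allocatable: str, existing_requests: list[str], new_request: str) -> bool:
    """Return True if the new CPU request fits in the unreserved capacity."""
    reserved = sum(cpu_millicores(q) for q in existing_requests)
    free = cpu_millicores(allocatable) - reserved
    return cpu_millicores(new_request) <= free
```

On a node with 4 allocatable cores where existing pods request `1500m` and `2`, only 500 millicores remain. A pod requesting `750m` stays Pending even if the node's real CPU usage is low, which is the "artificial over-allocation" that right-sizing requests is meant to remove.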

2. Recurring CrashLoopBackOff

The CrashLoopBackOff status is another classic issue. It indicates that a container starts, crashes, restarts… and so on. To understand the cause, several observability sources are essential:

  • Container logs: for example, a repeated NullPointerException
  • OOMKilled events showing that the container is exceeding its memory limit
  • Memory consumption metrics revealing that the memory limit is too low relative to the application’s actual behavior

Correlating application logs, kubelet events and resource metrics allows you to quickly identify whether the problem is software-related (a bug) or infrastructure-related (limits that are too strict).

In the event of an OOM, the solution involves:

  • Adjusting memory limits to reflect actual needs
  • Optimizing the application if memory consumption is abnormal

Centralized monitoring facilitates this analysis by consolidating logs, metrics and events within a single context.
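
The first triage step can be sketched in a few lines of Python. The function below is a hypothetical helper, not a ServicePilot feature. It inspects one entry of a pod's `status.containerStatuses` (as returned by `kubectl get pod <name> -o json`) and separates an out-of-memory kill from an ordinary application crash.

```python
def diagnose_container(container_status: dict) -> str:
    """Classify a container status entry as OOM-related or application-related."""
    waiting = container_status.get("state", {}).get("waiting") or {}
    last = container_status.get("lastState", {}).get("terminated") or {}
    if waiting.get("reason") != "CrashLoopBackOff":
        return "not crash-looping"
    if last.get("reason") == "OOMKilled":
        return "memory limit exceeded (OOMKilled)"
    exit_code = last.get("exitCode", "unknown")
    return f"application exit (code {exit_code}); check container logs"
```

If the last termination reason is `OOMKilled`, memory limits and consumption metrics are the place to look. Otherwise the container logs are the better starting point.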

3. Application performance degradation

Performance issues in a microservices environment are often complex, as they involve multiple layers: application, network, external dependencies, storage, etc. Comprehensive monitoring allows you to identify:

  • High P99 latency, a sign that the slowest 1% of requests are degraded (tail latency), even when the median still looks healthy
  • Microservice saturation, visible through request rate, CPU usage or queue length
  • Distributed traces highlighting a bottleneck, such as an external database responding slowly
  • A dependency map revealing that a critical service is impacted and propagating latency throughout the entire chain

This type of analysis is particularly difficult without distributed observability. Traces allow you to pinpoint the faulty component, while metrics confirm the saturation.

The team can then prioritize actions: increase a service’s resources, optimize a SQL query, cache an external dependency, etc.
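
Since percentiles are easy to misread, here is a small Python sketch of how a P99 value can be computed from raw latency samples. It uses the nearest-rank method, which is one of several common percentile definitions; monitoring tools may interpolate or use histogram buckets instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

With 98 requests at 20 ms and 2 requests at 900 ms, `percentile(samples, 50)` is 20 while `percentile(samples, 99)` is 900. The median hides the slow requests that P99 exposes, which is why the two are worth tracking side by side.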

4. Intra-cluster network issue

Internal network incidents within the cluster are often underestimated, even though they can cause erratic behavior in microservices. Typical symptoms include:

  • TCP errors (retransmissions, resets) between services
  • Abnormally high inter-pod latency
  • Node network pressure indicating that the node’s network interface is saturated
  • Visualization of impacted flows showing which services can no longer communicate properly

Network monitoring integrated at the Kubernetes level allows you to quickly detect whether the problem stems from a node, a failing CNI, a link bottleneck or a service generating abnormal traffic.

Without this visibility, teams waste valuable time suspecting the application when the cause is purely network-related.
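
As a rough illustration of the retransmission symptom, the Python sketch below computes a retransmission ratio from two snapshots of the Linux TCP counters in `/proc/net/snmp` (`OutSegs` and `RetransSegs`). The class and function names are hypothetical, and how ServicePilot collects these metrics is not described here. What counts as a worrying ratio depends on the environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TcpCounters:
    out_segs: int      # segments sent, excluding retransmissions
    retrans_segs: int  # segments retransmitted


def parse_tcp_counters(snmp_text: str) -> TcpCounters:
    """Extract OutSegs and RetransSegs from /proc/net/snmp content."""
    tcp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp header line and one Tcp value line")
    names, values = tcp_lines
    fields = dict(zip(names, values))
    return TcpCounters(int(fields["OutSegs"]), int(fields["RetransSegs"]))


def retransmission_ratio(before: TcpCounters, after: TcpCounters) -> float:
    """Retransmitted segments per sent segment between two snapshots."""
    sent = after.out_segs - before.out_segs
    retrans = after.retrans_segs - before.retrans_segs
    if sent <= 0:
        return 0.0
    return retrans / sent
```

Comparing this ratio across nodes helps show whether retransmissions are concentrated on one node or spread across the cluster.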

A unified view for DevOps, SRE and IT teams

One of ServicePilot’s major advantages is its ability to consolidate all Kubernetes data into a single platform:

  • Cluster overview
  • Detailed workload analysis
  • Dynamic service mapping

This approach drastically reduces mean time to resolution (MTTR) and improves the overall reliability of Kubernetes environments.

K8s is powerful, but its complexity requires modern monitoring capable of tracking dynamic, distributed environments. With ServicePilot, teams have a unified platform for monitoring, diagnosing and optimizing their clusters and workloads.

The result: fewer incidents, better application performance and smoother Kubernetes operations.
