Signalling Health for Improved Kubernetes Microservice Availability

📅 2025-07-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Poll-based Container Monitoring (PCM) in Kubernetes requires careful tuning of polling parameters, can be slow to detect changes in container health, and can erroneously report failures, which degrades service availability. Method: The paper presents the design, implementation, and evaluation of Signal-based Container Monitoring (SCM) for Kubernetes, in which the container notifies the orchestrator when its status changes instead of being polled periodically, and introduces a mathematical model that predicts SCM's advantages over PCM. SCM does not require polling intervals to be tuned. Results: Across six experiments on the SockShop benchmark, SCM detects container failure 86% faster than PCM and detects container readiness in a comparable time, with limited resource overhead. PCM's erroneous failure detections reduce service availability by 4%, a loss that SCM avoids.

📝 Abstract

Microservices are often deployed and managed by a container orchestrator that can detect and fix failures to maintain the service availability critical in many applications. In Poll-based Container Monitoring (PCM), the orchestrator periodically checks container health. While a common approach, PCM requires careful tuning, may degrade service availability, and can be slow to detect container health changes. An alternative is Signal-based Container Monitoring (SCM), where the container signals the orchestrator when its status changes. We present the design, implementation, and evaluation of an SCM approach for Kubernetes and empirically show that it has benefits over PCM, as predicted by a new mathematical model. We compare the service availability of SCM and PCM over six experiments using the SockShop benchmark. SCM does not require that polling intervals are tuned, and yet detects container failure 86% faster than PCM and container readiness in a comparable time with limited resource overheads. We find PCM can erroneously detect failures, and this reduces service availability by 4%. We propose that orchestrators offer SCM features for faster failure detection than PCM without erroneous detections or careful tuning.
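The difference in detection delay between the two approaches can be illustrated with a minimal sketch. This is not the paper's mathematical model, which is not reproduced on this page. It assumes a simplified PCM in which probes fire at a fixed period, a container is declared failed after a fixed number of consecutive failed probes, and probe execution time is ignored. The function names and the `signal_latency` parameter are hypothetical. The values 10 and 3 in the usage lines match the Kubernetes defaults for `periodSeconds` and `failureThreshold`.

```python
def poll_detection_delay(failure_offset: float, period: float, failure_threshold: int) -> float:
    """Delay (seconds) between a failure and its detection under simplified PCM.

    failure_offset: seconds elapsed since the last probe when the failure occurs,
                    with 0 <= failure_offset < period (assumed model, not the paper's).
    period: seconds between consecutive probes.
    failure_threshold: consecutive failed probes required to declare failure.
    """
    if not 0 <= failure_offset < period:
        raise ValueError("failure_offset must lie in [0, period)")
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    # Wait for the next probe, then for the remaining (threshold - 1) probes.
    time_to_first_failed_probe = period - failure_offset
    return time_to_first_failed_probe + (failure_threshold - 1) * period


def expected_poll_delay(period: float, failure_threshold: int) -> float:
    """Mean of poll_detection_delay when the failure time is uniform within a period."""
    return (failure_threshold - 0.5) * period


def signal_detection_delay(signal_latency: float) -> float:
    """Under simplified SCM the delay is only the time the signal takes to arrive.

    signal_latency is a hypothetical parameter for this sketch.
    """
    return signal_latency


if __name__ == "__main__":
    # Compare the two approaches under the assumptions stated above.
    print("PCM expected delay (s):", expected_poll_delay(period=10, failure_threshold=3))
    print("SCM delay (s):", signal_detection_delay(signal_latency=0.1))
```

Under these assumptions the polling delay grows linearly with both the period and the failure threshold, which is why shortening either one speeds up detection at the cost of more probe traffic and a higher chance of erroneous detections. The signal-based delay has no such parameters to tune.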

Problem

Research questions and friction points this paper is trying to address.

Compares SCM and PCM for Kubernetes microservice availability

Evaluates faster failure detection without tuning in SCM

Addresses erroneous failure detection in PCM and the resource overhead of SCM

Innovation

Methods, ideas, or system contributions that make the work stand out.

Signal-based Container Monitoring for Kubernetes

Faster failure detection than polling

No tuning of polling intervals needed; avoids erroneous failure detections

🔎 Similar Papers

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis