Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unsafe recovery in modern microservice systems: dense dependencies mean that restarting one service can disrupt many callers, and autonomous remediation agents that issue raw infrastructure commands offer no safety guarantees. The authors separate planning from actuation. A three-agent architecture (diagnosis, planning, verification) proposes typed remediation plans over a seven-action ISA with explicit side-effect semantics, and a small microkernel validates and executes each plan transactionally. The agents are treated as untrusted; safety comes from the ISA and the microkernel. To decide where a restart is safe, the system infers recovery boundaries online from distributed traces, computing minimal restart groups and ordering constraints. On industrial traces (Alibaba, Meta) and DeathStarBench with fault injection, recovery-group inference runs in 21 ms at P99, and typed actuation reduces agent-caused harm by 95% in simulation and achieves 0% harm online. The main benefit is safety rather than speed: LLM inference overhead increases TTR for services with fast auto-restart.

📝 Abstract

Microreboot enables fast recovery by restarting only the failing component, but in modern microservices naive restarts are unsafe: dense dependencies mean rebooting one service can disrupt many callers. Autonomous remediation agents compound this by actuating raw infrastructure commands without safety guarantees. We make microreboot practical by separating planning from actuation: a three-agent architecture (diagnosis, planning, verification) proposes typed remediation plans over a seven-action ISA with explicit side-effect semantics, and a small microkernel validates and executes each plan transactionally. Agents are explicitly untrusted; safety derives from the ISA and microkernel. To determine where restart is safe, we infer recovery boundaries online from distributed traces, computing minimal restart groups and ordering constraints. On industrial traces (Alibaba, Meta) and DeathStarBench with fault injection, recovery-group inference runs in 21 ms at P99; typed actuation reduces agent-caused harm by 95% in simulation and achieves 0% harm online. The primary value is safety, not speed: LLM inference overhead increases TTR for services with fast auto-restart.
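
To make the idea of a restart group with ordering constraints concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes that traces have already been reduced to (caller, callee) edges, and that some edges are marked as "coupled", meaning the caller holds state that does not survive a callee restart. The function name `restart_plan`, the `coupled` annotation, and the service names are all hypothetical. The group is the failing service plus every caller reachable over coupled edges, and the order restarts each callee before its callers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def restart_plan(edges, failing, coupled):
    """Return the services to restart, callees before their callers.

    edges:   (caller, callee) pairs observed in traces (assumed input shape).
    failing: the service that triggered recovery.
    coupled: subset of edges assumed to carry state tied to the callee
             (a hypothetical annotation, not taken from the paper).
    """
    # callers_of[callee] = callers that must restart along with the callee
    callers_of = {}
    for caller, callee in edges:
        if (caller, callee) in coupled:
            callers_of.setdefault(callee, set()).add(caller)

    # Restart group: walk upward from the failing service over coupled edges.
    group, stack = {failing}, [failing]
    while stack:
        svc = stack.pop()
        for caller in callers_of.get(svc, ()):
            if caller not in group:
                group.add(caller)
                stack.append(caller)

    # Ordering constraints: each caller depends on its coupled callees.
    deps = {svc: set() for svc in group}
    for callee, callers in callers_of.items():
        if callee in group:
            for caller in callers:
                deps[caller].add(callee)

    # Raises graphlib.CycleError if coupled services depend on each other.
    return list(TopologicalSorter(deps).static_order())


edges = [
    ("frontend", "cart"),
    ("frontend", "search"),
    ("cart", "redis"),
    ("checkout", "cart"),
]
coupled = {("cart", "redis"), ("checkout", "cart")}
plan = restart_plan(edges, "redis", coupled)
print(plan)  # ['redis', 'cart', 'checkout']
```

In this example `frontend` stays out of the group because its edge to `cart` is not marked as coupled, which is the sense in which the group avoids indiscriminate restarts. A real system would also have to handle cyclic coupling, which this sketch only reports as an error.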

Problem

Research questions and friction points this paper is trying to address.

microreboot

microservice recovery

safety

dependency management

autonomous remediation

Innovation

Methods, ideas, or system contributions that make the work stand out.

microreboot

recovery boundaries

typed remediation