E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes an end-to-end paradigm for automated microservice remediation that addresses key limitations of existing large language model (LLM)-based approaches: reliance on handcrafted prompts, lack of runtime knowledge, and the accuracy and efficiency constraints of large general-purpose models. The method generates executable Ansible playbooks directly from diagnosis reports. The authors also introduce MicroRemed, a benchmark that automates deployment, fault injection, playbook execution, and post-repair verification. Using experience-simulation reinforcement fine-tuning, they train a specialized remediation model, E2E-REME, that does not depend on expert-crafted prompts or generic LLMs. On both public and industrial microservice platforms, E2E-REME outperforms nine representative LLMs in repair accuracy and execution efficiency.

📝 Abstract

Contemporary microservice systems continue to grow in scale and complexity, leading to increasingly frequent and costly failures. While recent LLM-based auto-remediation approaches have emerged, they primarily translate textual instructions into executable Ansible playbooks and rely on expert-crafted prompts, lacking runtime knowledge guidance and depending on large-scale general-purpose LLMs, which limits their accuracy and efficiency. We introduce End-to-End Microservice Remediation (E2E-MR), a new task that requires directly generating executable playbooks from diagnosis reports to autonomously restore faulty systems. To enable rigorous evaluation, we build MicroRemed, a benchmark that automates microservice deployment, failure injection, playbook execution, and post-repair verification. We further propose E2E-REME, an end-to-end auto-remediation model trained via experience-simulation reinforcement fine-tuning. Experiments on public and industrial microservice platforms, compared with nine representative LLMs, show that E2E-REME achieves superior accuracy and efficiency.
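
The four benchmark stages named in the abstract (deployment, failure injection, playbook execution, post-repair verification) can be read as a simple evaluation loop. The Python sketch below is illustrative only and is not MicroRemed's actual interface: `Case`, `ToyEnv`, `toy_model`, `evaluate`, and the fault names are all hypothetical, and the "playbook" is a plain string standing in for a generated Ansible playbook.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    """One hypothetical benchmark case: a fault to inject plus its diagnosis report."""
    fault: str
    diagnosis_report: str


class ToyEnv:
    """In-memory stand-in for a microservice deployment (assumption, not a real cluster)."""

    def __init__(self) -> None:
        self.active_faults: set = set()

    def deploy(self) -> None:
        # A fresh deployment starts with no active faults.
        self.active_faults.clear()

    def inject(self, fault: str) -> None:
        self.active_faults.add(fault)

    def execute(self, playbook: str) -> bool:
        # Toy executor: the playbook simply names the fault it clears.
        # An empty playbook counts as a failed execution.
        if not playbook:
            return False
        self.active_faults.discard(playbook)
        return True

    def verify(self) -> bool:
        # The system is considered healthy when no fault remains.
        return not self.active_faults


def toy_model(report: str) -> str:
    """Hypothetical remediation model that only recognizes CPU-related reports."""
    return "cpu_stress" if "cpu" in report.lower() else ""


def evaluate(cases: List[Case], env: ToyEnv, model: Callable[[str], str]) -> float:
    """Return the fraction of cases that verify as healthy after remediation."""
    if not cases:
        return 0.0
    repaired = 0
    for case in cases:
        env.deploy()                            # 1. deployment
        env.inject(case.fault)                  # 2. failure injection
        playbook = model(case.diagnosis_report)
        executed = env.execute(playbook)        # 3. playbook execution
        if executed and env.verify():           # 4. post-repair verification
            repaired += 1
    return repaired / len(cases)


cases = [
    Case("cpu_stress", "High CPU usage on the cart service"),
    Case("pod_failure", "Checkout pod crashed repeatedly"),
]
print(evaluate(cases, ToyEnv(), toy_model))
```

Redeploying before each case keeps faults from leaking between cases, so each remediation is judged against a clean baseline.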

Problem

Research questions and friction points this paper is trying to address.

microservices

auto-remediation

diagnosis reports

executable playbooks

LLM-based repair

Innovation

Methods, ideas, or system contributions that make the work stand out.

end-to-end auto-remediation

experience-simulation reinforcement fine-tuning

microservice remediation