TrioXpert: An automated incident management framework for microservice system

📅 2025-06-11

🤖 AI Summary

Existing incident management approaches for large-scale microservice systems often rely on a single data modality, struggle to handle anomaly detection (AD), failure triage (FT), and root cause localization (RCL) together, and offer little interpretability. To address these limitations, the paper proposes TrioXpert, an end-to-end incident management framework that uses multimodal data. It processes metrics, logs, and traces in three independent, modality-specific pipelines and applies a collaborative reasoning mechanism based on large language models (LLMs) to handle the three tasks simultaneously. Evaluated on two popular microservice system datasets, TrioXpert reports improvements of 4.7% to 57.7% in AD, 2.1% to 40.6% in FT, and 1.6% to 163.1% in RCL. It also provides clear reasoning evidence for its outputs, which supports interpretability.

📝 Abstract

Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, or traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert uses three independent data processing pipelines, each designed around the inherent characteristics of one modality, to characterize the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to handle multiple tasks simultaneously while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two popular microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks.
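An improvement above 100%, such as the 163.1% upper bound quoted for RCL, only makes sense if the gains are measured relative to a baseline score. The abstract does not state the formula, so the sketch below assumes the usual relative-improvement definition; the function name and the sample scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Percentage gain of `proposed` over `baseline`, relative to the baseline.

    Assumed definition: (proposed - baseline) / baseline * 100.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed - baseline) / baseline * 100.0


# Hypothetical scores for illustration only (not results from the paper).
baseline_score = 0.25
proposed_score = 0.75
print(relative_improvement(baseline_score, proposed_score))  # 200.0
```

Under this reading, a method that triples a weak baseline score shows a 200% gain even though the absolute difference is only 0.5.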

Problem

Research questions and friction points this paper is trying to address.

Automated incident management in microservice systems

Handling multiple tasks with multimodal data

Improving interpretability with reasoning evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multimodal data for incident management

Uses independent pipelines for different data modalities

Employs LLMs for collaborative reasoning and interpretability
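The three points above can be pictured with a minimal sketch: one small pipeline per modality, whose outputs are merged into a single textual context that an LLM could reason over for AD, FT, and RCL. This is an illustration under assumptions, not TrioXpert's implementation. Every name here (`Telemetry`, `summarize_metrics`, `summarize_logs`, `summarize_traces`, `build_prompt`), the `ERROR` filter, and the 500 ms latency threshold are hypothetical, and no LLM is actually called.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Telemetry:
    """Hypothetical container for one observation window."""
    metrics: dict[str, list[float]]        # metric name -> samples
    logs: list[str]                        # raw log lines
    traces: list[tuple[str, str, float]]   # (caller, callee, latency in ms)


def summarize_metrics(metrics: dict[str, list[float]]) -> dict[str, float]:
    # Numerical pipeline: reduce each metric series to its mean.
    return {name: mean(vals) for name, vals in metrics.items() if vals}


def summarize_logs(logs: list[str]) -> list[str]:
    # Textual pipeline: keep only lines flagged as errors.
    return [line for line in logs if "ERROR" in line]


def summarize_traces(traces: list[tuple[str, str, float]],
                     slow_ms: float = 500.0) -> list[str]:
    # Trace pipeline: report calls at or above an assumed latency threshold.
    return [f"{caller} -> {callee} ({latency:.0f} ms)"
            for caller, callee, latency in traces if latency >= slow_ms]


def build_prompt(t: Telemetry) -> str:
    # Merge the three modality summaries into one context for an LLM.
    return "\n".join([
        f"Metrics: {summarize_metrics(t.metrics)}",
        f"Error logs: {summarize_logs(t.logs)}",
        f"Slow calls: {summarize_traces(t.traces)}",
        "Tasks: detect anomaly, triage failure type, localize root cause.",
        "Explain the evidence behind each answer.",
    ])


window = Telemetry(
    metrics={"cpu": [1.0, 3.0]},
    logs=["INFO started", "ERROR db timeout"],
    traces=[("cart", "db", 900.0), ("web", "cart", 20.0)],
)
print(build_prompt(window))
```

Keeping the pipelines independent means each modality can be preprocessed in the way that suits it, while the final prompt asks for explicit evidence so that the answers stay inspectable.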
