TrioXpert: An automated incident management framework for microservice system

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing incident management approaches for large-scale microservice systems typically rely on a single data modality, struggle to handle anomaly detection (AD), failure triage (FT), and root cause localization (RCL) together, and offer little reasoning evidence to support interpretability. To address these limitations, the paper proposes TrioXpert, an end-to-end incident management framework that uses multimodal data. It builds three independent data processing pipelines, each matched to the characteristics of one modality (metrics, logs, or traces), to characterize system state along both numerical and textual dimensions, and it uses a collaborative reasoning mechanism based on large language models (LLMs) to handle the three tasks at once. Evaluated on two popular microservice system datasets, TrioXpert improves AD by 4.7% to 57.7%, FT by 2.1% to 40.6%, and RCL by 1.6% to 163.1%. It also provides clear reasoning evidence alongside its outputs, which supports interpretability.

📝 Abstract
Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, or traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two popular microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks.
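A figure such as 163.1% reads most naturally as a relative gain over a baseline score, although the abstract does not name the metric or the baselines. The sketch below shows that arithmetic under that assumption. The function name and both scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, proposed: float) -> float:
    """Return the percentage gain of `proposed` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (proposed - baseline) / baseline * 100.0


# Hypothetical scores chosen only to illustrate the calculation.
hypothetical_baseline = 0.30
hypothetical_proposed = 0.7893

gain = relative_improvement(hypothetical_baseline, hypothetical_proposed)
print(f"{gain:.1f}%")
```

Under this reading, a gain above 100% means the proposed score is more than double the baseline, which is only possible when the baseline score is low.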
Problem

Research questions and friction points this paper is trying to address.

Automated incident management in microservice systems
Handling multiple tasks with multimodal data
Improving interpretability with reasoning evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multimodal data for incident management
Uses independent pipelines for different data modalities
Employs LLMs for collaborative reasoning and interpretability
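
The points above can be sketched as three modality-specific functions whose outputs are merged into one prompt for an LLM. This is a minimal illustration only: the summary does not describe the internals of TrioXpert's pipelines, so the z-score check, the keyword filter, the latency threshold, and every name below are assumptions, and no LLM is actually called.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Telemetry:
    metrics: list[float]                   # e.g. latency samples for one service
    logs: list[str]                        # raw log lines
    traces: list[tuple[str, str, float]]   # (caller, callee, duration_ms)


def process_metrics(values, z_threshold=2.0):
    """Numerical pipeline (assumed): indices of samples far from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


def process_logs(lines):
    """Textual pipeline (assumed): keep lines with an error-level keyword."""
    return [ln for ln in lines if "ERROR" in ln or "FATAL" in ln]


def process_traces(spans, slow_ms=500.0):
    """Trace pipeline (assumed): caller/callee pairs whose call was slow."""
    return [(caller, callee) for caller, callee, ms in spans if ms > slow_ms]


def collect_evidence(t: Telemetry) -> dict:
    """Run each modality through its own pipeline, independently."""
    return {
        "metrics": process_metrics(t.metrics),
        "logs": process_logs(t.logs),
        "traces": process_traces(t.traces),
    }


def build_prompt(evidence: dict) -> str:
    """Merge per-modality evidence into one request covering all three tasks."""
    return (
        "Given the evidence below, perform anomaly detection, failure triage, "
        "and root cause localization, and explain each step.\n"
        f"Metric outlier indices: {evidence['metrics']}\n"
        f"Error logs: {evidence['logs']}\n"
        f"Slow calls: {evidence['traces']}"
    )


sample = Telemetry(
    metrics=[100, 102, 98, 101, 99, 100, 103, 97, 100, 900],
    logs=["INFO cart: ok", "ERROR cart: timeout calling payment"],
    traces=[("frontend", "cart", 40.0), ("cart", "payment", 870.0)],
)
evidence = collect_evidence(sample)
prompt = build_prompt(evidence)
print(prompt)
```

Keeping the three functions separate means each modality can be processed in the form that suits it (numbers for metrics, text for logs, call pairs for traces) before anything is handed to the reasoning step.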