TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Traditional root cause analysis (RCA) in cloud-native distributed systems suffers from heavy manual effort, while existing large language model (LLM)-based approaches struggle to model dynamic service dependencies and integrate heterogeneous, multi-source observability data (metrics, logs, traces).

Method: This paper proposes a tool-augmented multimodal LLM agent architecture. It achieves temporally aligned fusion of multi-source observability data via time-series alignment representation, integrates dedicated root-cause localization tools and fault classifiers, and employs dependency-aware prompt engineering to enable precise root-cause identification and context-driven remediation strategy generation under dynamic service topologies.

Results: Extensive experiments on multiple heterogeneous public datasets demonstrate significant improvements in root-cause identification accuracy and cross-scenario generalization capability. To the best of our knowledge, this is the first approach to achieve fine-grained, fully automated, structured, and context-consistent end-to-end fault diagnosis and actionable remediation recommendations.

📝 Abstract

With the development of distributed systems, microservices and cloud-native technologies have become central to modern enterprise software development. Despite bringing significant advantages, these technologies also increase system complexity and operational challenges. Traditional root cause analysis (RCA) struggles to achieve automated fault response and relies heavily on manual intervention. In recent years, large language models (LLMs) have made breakthroughs in contextual inference and domain knowledge integration, providing new solutions for Artificial Intelligence for Operations (AIOps). However, existing LLM-based approaches face three key challenges: text input constraints, dynamic service dependency hallucinations, and context window limitations. To address these issues, we propose a tool-assisted LLM agent with multi-modality observation data, namely TAMO, for fine-grained RCA. It unifies multi-modal observational data into time-aligned representations to extract consistent features and employs specialized root cause localization and fault classification tools for perceiving the contextual environment. This approach overcomes the limitations of LLMs in handling service dependencies that change in real time and raw observational data, and it guides the LLM to generate repair strategies aligned with the system context by structuring key information into a prompt. Experimental results show that TAMO performs well in root cause analysis on public datasets characterized by heterogeneity and common fault types, demonstrating its effectiveness.
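
To make the idea of a time-aligned representation concrete, the following is a minimal sketch, not the paper's implementation. It assumes three hypothetical input streams (metrics, logs, traces) given as lists of timestamped tuples and buckets them into shared fixed-width windows; the function name `align_to_windows`, the 60-second window, and the per-window features (mean metric value, log count, mean trace duration) are all illustrative assumptions.

```python
from collections import defaultdict


def align_to_windows(metrics, logs, traces, window_s=60):
    """Bucket three observation streams into shared fixed-width time windows.

    Hypothetical input shapes (assumptions for illustration only):
      metrics: list of (timestamp_s, value)
      logs:    list of (timestamp_s, message)
      traces:  list of (timestamp_s, duration_ms)
    Returns one feature row per non-empty window, sorted by window start.
    """
    buckets = defaultdict(
        lambda: {"metric_vals": [], "log_count": 0, "trace_durs": []}
    )

    def window_start(ts):
        # Map a timestamp to the start of the window that contains it.
        return int(ts // window_s) * window_s

    for ts, value in metrics:
        buckets[window_start(ts)]["metric_vals"].append(value)
    for ts, _message in logs:
        buckets[window_start(ts)]["log_count"] += 1
    for ts, duration in traces:
        buckets[window_start(ts)]["trace_durs"].append(duration)

    def mean(values):
        # None marks a modality with no observations in this window.
        return sum(values) / len(values) if values else None

    return [
        {
            "window_start": start,
            "metric_mean": mean(buckets[start]["metric_vals"]),
            "log_count": buckets[start]["log_count"],
            "trace_mean_ms": mean(buckets[start]["trace_durs"]),
        }
        for start in sorted(buckets)
    ]
```

Each returned row describes the same time span across all three modalities, so a downstream localization or classification tool can consume one consistent feature vector per window instead of three unaligned streams.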

Problem

Research questions and friction points this paper is trying to address.

Automating root cause analysis in complex distributed systems

Overcoming LLM limitations in handling multi-modal data

Reducing manual intervention in fault diagnosis and repair

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies multi-modal data into time-aligned representations

Employs specialized tools for root cause localization

Structures key information into prompts for LLM guidance
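
The last point, structuring key information into a prompt, can be sketched as follows. This is an illustrative assumption about how tool outputs might be serialized, not the prompt format used in the paper; the function `build_rca_prompt`, its parameters, and the wording of the prompt are all hypothetical.

```python
def build_rca_prompt(root_cause_candidates, fault_type, evidence, top_k=3):
    """Serialize tool outputs into a compact, structured prompt string.

    Hypothetical inputs (assumptions for illustration only):
      root_cause_candidates: list of (service_name, score) from a localization tool
      fault_type:            label predicted by a fault classifier
      evidence:              short text snippets summarizing observations
    """
    # Keep only the highest-scoring candidates to limit prompt length.
    ranked = sorted(root_cause_candidates, key=lambda c: c[1], reverse=True)[:top_k]

    lines = [
        "You are assisting with root cause analysis of a microservice incident.",
        "",
        "Root cause candidates (from localization tool, highest score first):",
    ]
    for rank, (service, score) in enumerate(ranked, start=1):
        lines.append(f"{rank}. {service} (score={score:.2f})")

    lines += ["", f"Predicted fault type (from classifier): {fault_type}", ""]

    lines.append("Supporting evidence:")
    for item in evidence:
        lines.append(f"- {item}")

    lines += ["", "Task: explain the most likely root cause and propose a repair strategy."]
    return "\n".join(lines)
```

Passing the LLM a short ranked list and a fault label, rather than raw metrics, logs, and traces, keeps the input within the context window and grounds the generated repair strategy in the tool outputs.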

🔎 Similar Papers

Large Language Models for Causal Discovery: Current Landscape and Future Directions