ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems

📅 2025-09-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Error propagation paths in cloud services are difficult to trace because error wrapping spreads a failure's context across many call layers, and existing approaches are not accurate enough. This paper proposes ErrorPrism, which combines static analysis with a large language model (LLM) agent. Static analysis first builds a function call graph and maps log strings to candidate functions; the LLM agent then runs an iterative backward search over those candidates, reasoning across multiple hops to reconstruct the full propagation path from the final log message back to the error's source. To our knowledge, this is the first work to jointly exploit static code structure and LLM-based semantic understanding for error-chain reconstruction. Evaluated on 67 production microservices and 102 real-world errors at ByteDance, our method achieves 97.0% path reconstruction accuracy, outperforming both static analysis and LLM-based baselines.

📝 Abstract

Reliability management in cloud service systems is challenging due to the cascading effect of failures. Error wrapping, a practice prevalent in modern microservice development, enriches errors with context at each layer of the function call stack, constructing an error chain that describes a failure from its technical origin to its business impact. However, this also presents a significant traceability problem when recovering the complete error propagation path from the final log message back to its source. Existing approaches are ineffective at addressing this problem. To fill this gap, we present ErrorPrism in this work for automated reconstruction of error propagation paths in production microservice systems. ErrorPrism first performs static analysis on service code repositories to build a function call graph and map log strings to relevant candidate functions. This significantly reduces the path search space for subsequent analysis. Then, ErrorPrism employs an LLM agent to perform an iterative backward search to accurately reconstruct the complete, multi-hop error path. Evaluated on 67 production microservices at ByteDance, ErrorPrism achieves 97.0% accuracy in reconstructing paths for 102 real-world errors, outperforming existing static analysis and LLM-based approaches. ErrorPrism provides an effective and practical tool for root cause analysis in industrial microservice systems.
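To make the traceability problem concrete, here is a minimal sketch of error wrapping. It is an illustration only: the function names (`query_db`, `charge_payment`, `create_order`), the messages, and the `WrappedError` helper are hypothetical and do not come from the paper, and the systems studied there may use a different language and wrapping idiom. Each layer prepends its own context, so the top-level log line holds the whole chain but no direct pointer to the functions that produced each part.

```python
from typing import List


class WrappedError(Exception):
    """Hypothetical helper: an error that adds context to an underlying cause."""

    def __init__(self, context: str, cause: Exception):
        super().__init__(f"{context}: {cause}")
        self.context = context
        self.cause = cause


def query_db() -> None:
    # Technical origin of the failure.
    raise TimeoutError("db connection timed out")


def charge_payment() -> None:
    try:
        query_db()
    except Exception as err:
        # Middle layer: wrap the low-level error with payment context.
        raise WrappedError("charge payment failed", err) from err


def create_order() -> None:
    try:
        charge_payment()
    except Exception as err:
        # Top layer: wrap again with the business-level context.
        raise WrappedError("create order failed", err) from err


def final_log_message() -> str:
    """Return the single message that the top layer would log."""
    try:
        create_order()
    except Exception as err:
        return str(err)
    return ""


def split_segments(message: str) -> List[str]:
    """Split a wrapped message into per-layer segments, outermost first."""
    return message.split(": ")


if __name__ == "__main__":
    msg = final_log_message()
    print(msg)
    print(split_segments(msg))
```

Splitting on `": "` only works here because every layer in this toy example uses the same separator. Real messages can embed formatted values and inconsistent delimiters, which is one reason plain string matching struggles to recover the path.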

Problem

Research questions and friction points this paper is trying to address.

Reconstructs error propagation paths in microservice systems

Addresses traceability issues from error wrapping practices

Automates root cause analysis by combining static analysis with an LLM agent

Innovation

Methods, ideas, or system contributions that make the work stand out.

Static analysis builds function call graph

LLM agent performs iterative backward search

Reconstructs multi-hop error propagation paths
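The three points above can be sketched as a small backward search. This is a rough illustration under stated assumptions, not ErrorPrism's implementation: the call graph, the log-string map, and every function name are hypothetical, and the `choose` callback is a deterministic stand-in for the step where the paper uses an LLM agent to pick among candidates.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical call graph (caller -> callees), as static analysis might produce.
CALL_GRAPH: Dict[str, List[str]] = {
    "create_order": ["charge_payment", "reserve_stock"],
    "charge_payment": ["query_db"],
    "reserve_stock": ["query_db"],
    "query_db": [],
}

# Hypothetical map from each function to the error strings found in its body.
LOG_STRINGS: Dict[str, List[str]] = {
    "create_order": ["create order failed"],
    "charge_payment": ["charge payment failed"],
    "reserve_stock": ["reserve stock failed"],
    "query_db": ["db connection timed out"],
}

# A chooser picks one function among the candidates for a message segment.
Chooser = Callable[[str, List[str]], Optional[str]]


def candidates_for(segment: str, scope: List[str]) -> List[str]:
    """Functions in scope whose known strings appear in the segment."""
    return [
        fn for fn in scope
        if any(text in segment for text in LOG_STRINGS.get(fn, []))
    ]


def first_candidate(segment: str, candidates: List[str]) -> Optional[str]:
    """Stand-in for the LLM agent: take the first candidate, if any."""
    return candidates[0] if candidates else None


def reconstruct_path(message: str, choose: Chooser = first_candidate) -> List[str]:
    """Walk from the final log message back toward the error's origin."""
    path: List[str] = []
    scope = list(CALL_GRAPH)  # any function may have emitted the outermost segment
    for segment in message.split(": "):
        picked = choose(segment, candidates_for(segment, scope))
        if picked is None:
            break  # no match: return the partial path found so far
        path.append(picked)
        # The call graph narrows the next hop to the callees of the pick.
        scope = CALL_GRAPH.get(picked, [])
    return path


if __name__ == "__main__":
    log = "create order failed: charge payment failed: db connection timed out"
    print(reconstruct_path(log))
```

Restricting each hop to the callees of the previous pick is what shrinks the search space in this sketch. A real chooser would have to handle ambiguous or partially matching candidates, which is where semantic reasoning over code and log text would come in.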

🔎 Similar Papers

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis