LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems

📅 2025-09-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address low efficiency, heavy human reliance, coarse-grained log scope definition, and insufficient data organization in root cause diagnosis for large-scale online services, this paper proposes an intent-aware log analysis framework. Methodologically, it introduces (1) an intent-driven dynamic log scoping technique; (2) a spatiotemporal log-chain-based clustering and sampling mechanism that significantly reduces LLM input size; and (3) an end-to-end diagnostic pipeline integrating PromQL semantic parsing, request execution reconstruction, and log-chain modeling. Evaluated on VolcEngine's production environment, the framework achieves a 50.34% improvement in root-cause summary utility, a 54.79% increase in precise localization accuracy, sub-60-second diagnosis latency per alert, and a cost of only $0.074 per alert. The solution has been successfully deployed in production.

📝 Abstract
Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. LogPilot introduces an intent-aware approach, interpreting the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLMs for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
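
The chain-reconstruction and clustering steps described in the abstract can be illustrated with a minimal Python sketch. It is not LogPilot's implementation: the `LogRecord` fields, the use of a request ID as the correlation key, and the function names are all hypothetical, and exact-match grouping on the ordered (service, template) sequence stands in for whatever similarity measure the paper actually uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    # All field names here are assumptions made for illustration.
    request_id: str   # correlation key, e.g. a trace ID
    timestamp: float  # seconds since epoch (the "temporal" part)
    service: str      # emitting component (the "spatial" part)
    template: str     # log message with variable parts masked out


def build_chains(records):
    """Group records by request and order each group by timestamp."""
    by_request = defaultdict(list)
    for rec in records:
        by_request[rec.request_id].append(rec)
    return {
        rid: sorted(recs, key=lambda r: r.timestamp)
        for rid, recs in by_request.items()
    }


def chain_signature(chain):
    """Reduce a chain to its ordered (service, template) steps."""
    return tuple((r.service, r.template) for r in chain)


def cluster_and_sample(chains, per_cluster=1):
    """Group chains with identical signatures and keep a few request IDs each.

    Returns (signature, cluster_size, sampled_request_ids) tuples,
    largest cluster first.
    """
    clusters = defaultdict(list)
    for rid, chain in chains.items():
        clusters[chain_signature(chain)].append(rid)
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(sig, len(rids), sorted(rids)[:per_cluster]) for sig, rids in ordered]
```

Only the sampled chains plus the cluster sizes would be passed to the LLM, so the prompt grows with the number of distinct execution patterns rather than with the number of requests.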
Problem

Research questions and friction points this paper is trying to address.

Automates log-based alert diagnosis for large-scale online service systems
Overcomes alert-agnostic log scoping and complex data organization challenges
Provides scalable root cause analysis using intent-aware LLM framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interprets alert definitions to identify relevant logs
Reconstructs request executions into spatiotemporal log chains
Clusters similar chains for efficient LLM diagnosis
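
As a rough illustration of interpreting an alert definition to scope logs, the Python sketch below pulls the metric name and label matchers out of a simple PromQL selector and uses them as a log filter. It is a sketch under stated assumptions, not the paper's parser: the regular expressions handle only a single plain `{...}` selector with no `}` inside label values, Python's `re` stands in for PromQL's RE2 syntax, and it assumes log fields share names with metric labels. The function names are hypothetical.

```python
import re

# One label matcher inside a selector, e.g. status=~"5.."
_MATCHER = re.compile(r'(\w+)\s*(=~|!~|!=|=)\s*"((?:[^"\\]|\\.)*)"')
# A metric name followed by a {...} selector (simplified; not full PromQL).
_SELECTOR = re.compile(r'([a-zA-Z_:][a-zA-Z0-9_:]*)\s*\{([^}]*)\}')


def parse_selector(expr):
    """Return (metric_name, [(label, op, value), ...]) for the first selector."""
    m = _SELECTOR.search(expr)
    if m is None:
        raise ValueError("no {...} selector found")
    return m.group(1), _MATCHER.findall(m.group(2))


def matches(fields, matchers):
    """Check a log's fields (a dict) against PromQL-style label matchers."""
    for label, op, value in matchers:
        actual = fields.get(label, "")  # a missing label is treated as empty
        if op in ("=", "!="):
            hit = actual == value
        else:
            # PromQL regex matchers are fully anchored, hence fullmatch.
            hit = re.fullmatch(value, actual) is not None
        if hit == op.startswith("!"):
            return False
    return True


# Hypothetical alert expression, used only to exercise the sketch.
expr = 'sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m])) > 10'
metric, matchers = parse_selector(expr)
in_scope = matches({"service": "checkout", "status": "503"}, matchers)
```

Deriving the filter from the alert expression narrows the log scope to the service and status codes the alert is actually about, instead of a fixed window over all logs.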