AI Summary
To address low efficiency, heavy human reliance, coarse-grained log scope definition, and insufficient data organization in root cause diagnosis for large-scale online services, this paper proposes an intent-aware log analysis framework. Methodologically, it introduces (1) an intent-driven dynamic log scoping technique; (2) a spatiotemporal log-chain-based clustering and sampling mechanism that significantly reduces LLM input size; and (3) an end-to-end diagnostic pipeline integrating PromQL semantic parsing, request execution reconstruction, and log-chain modeling. Evaluated on VolcEngine's production environment, the framework achieves a 50.34% improvement in root-cause summary utility, a 54.79% increase in precise localization accuracy, sub-60-second diagnosis latency per alert, and a cost of only $0.074 per alert. The solution has been successfully deployed in production.
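As a rough illustration of intent-driven log scoping, the sketch below pulls equality label matchers out of a PromQL alert expression and uses them to narrow the candidate logs. It is not the paper's parser: the regex-based extraction, the function names `extract_scope` and `in_scope`, the example alert expression, and the log record fields (`service`, `request_id`) are all assumptions made for this example.

```python
import re

# Assumption: a label matcher looks like name="value", name=~"regex", etc.
_MATCHER = re.compile(r'(\w+)\s*(=~|!~|!=|=)\s*"((?:[^"\\]|\\.)*)"')


def extract_scope(promql: str) -> dict:
    """Collect equality matchers from the first {...} selector in the query.

    Simplification: regex and negative matchers are ignored, and a '}' inside
    a label value is not handled.
    """
    selector = re.search(r"\{([^}]*)\}", promql)
    if selector is None:
        return {}
    return {
        name: value
        for name, op, value in _MATCHER.findall(selector.group(1))
        if op == "="
    }


def in_scope(log: dict, scope: dict) -> bool:
    """A log is in scope when it carries every label value the alert names."""
    return all(log.get(key) == value for key, value in scope.items())


# Hypothetical alert expression and log records.
alert_expr = 'sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m])) > 10'
logs = [
    {"service": "checkout", "request_id": "r1", "msg": "payment call failed"},
    {"service": "search", "request_id": "r2", "msg": "index refreshed"},
    {"service": "checkout", "request_id": "r3", "msg": "payment call failed"},
]

scope = extract_scope(alert_expr)
scoped = [log for log in logs if in_scope(log, scope)]
print(scope)
print([log["request_id"] for log in scoped])
```

A production system would need a real PromQL parser to handle nested expressions, multiple selectors, and regex matchers. The sketch only shows how an alert definition can drive which logs are examined.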
Abstract
Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. For intent awareness, LogPilot interprets the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLM for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
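The chain-building and clustering step can be sketched as follows, under strong simplifying assumptions. Logs are grouped by a hypothetical `request_id` field and ordered by a hypothetical `ts` timestamp. Each chain is reduced to its sequence of (`service`, `template`) pairs, and chains with identical sequences form one cluster. The abstract does not say how LogPilot measures chain similarity, so exact-match grouping is a stand-in, and all function and field names here are illustrative.

```python
from collections import defaultdict


def build_chains(logs):
    """Group logs by request and order each group by timestamp."""
    chains = defaultdict(list)
    for log in logs:
        chains[log["request_id"]].append(log)
    for events in chains.values():
        events.sort(key=lambda event: event["ts"])
    return dict(chains)


def signature(chain):
    """Where (service) and what (template) happened, in time order."""
    return tuple((event["service"], event["template"]) for event in chain)


def cluster_and_sample(chains, per_cluster=1):
    """Cluster chains by identical signature and keep a few request ids each."""
    clusters = defaultdict(list)
    for request_id, chain in chains.items():
        clusters[signature(chain)].append(request_id)
    return {sig: sorted(ids)[:per_cluster] for sig, ids in clusters.items()}


# Hypothetical logs, deliberately out of time order.
logs = [
    {"request_id": "r1", "ts": 2, "service": "db", "template": "query timeout"},
    {"request_id": "r1", "ts": 1, "service": "gateway", "template": "recv request"},
    {"request_id": "r2", "ts": 3, "service": "gateway", "template": "recv request"},
    {"request_id": "r2", "ts": 4, "service": "db", "template": "query timeout"},
    {"request_id": "r3", "ts": 5, "service": "gateway", "template": "recv request"},
    {"request_id": "r3", "ts": 6, "service": "db", "template": "query ok"},
]

samples = cluster_and_sample(build_chains(logs))
for sig, request_ids in samples.items():
    print(sig, request_ids)
```

Here three requests collapse into two execution patterns, so only one representative per pattern would be passed to the LLM. This is the mechanism by which clustering keeps the prompt compact while preserving each distinct behavior.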