🤖 AI Summary
In large language model (LLM) training, collective communication libraries are treated as black boxes, which hinders root-cause diagnosis of reliability issues and leads to wasted resources and degraded performance. To address this, we propose a lightweight, dependency-aware distributed tracing mechanism that, for the first time, enables runtime observability into collective communication internals. Our method constructs dynamic communication dependency graphs by integrating low-overhead tracing, online dependency modeling, and interpretable anomaly analysis. Deployed in ByteDance's production LLM training infrastructure for over six months, it detects 90% of failures within 15 seconds and precisely identifies root causes in 60% of cases within 20 seconds. Controlled fault-injection experiments confirm its accuracy and robustness. This work establishes a practical, production-ready observability infrastructure for enhancing the reliability of LLM training systems.
📝 Abstract
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective-communication-related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.
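To make the key idea concrete, here is a minimal, hypothetical sketch of the kind of analysis such per-rank tracing enables. It is not Mycroft's actual implementation; the data format (a snapshot of each rank's last completed collective operation on a communicator) and the function name are illustrative assumptions. The underlying reasoning is standard for collectives: an operation cannot complete until every participating rank enters it, so a rank whose progress counter trails its peers is the natural root-cause suspect.

```python
# Hypothetical sketch of dependency-aware stall localization.
# Input: {rank: sequence number of the last collective op that rank
# completed on one communicator}, as a tracing layer might report it.
# This is illustrative, not the actual Mycroft code or data format.

def find_lagging_ranks(progress, tolerance=0):
    """Return ranks trailing the fastest rank by more than `tolerance`
    operations. Because a collective only completes once every rank has
    entered it, a lagging rank is the likely source of a hang."""
    leader = max(progress.values())
    return sorted(rank for rank, seq in progress.items()
                  if leader - seq > tolerance)

if __name__ == "__main__":
    # Ranks 0-3 report their last finished collective; rank 2 is stuck
    # several operations behind and gets flagged as the suspect.
    snapshot = {0: 1042, 1: 1042, 2: 1037, 3: 1041}
    print(find_lagging_ranks(snapshot, tolerance=2))  # -> [2]
```

A small tolerance avoids flagging ranks that are merely a step or two behind due to normal pipelining; only a rank stalled well past its peers is reported.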