🤖 AI Summary
In large language model (LLM) training, collective communication libraries are treated as black boxes, which hinders root-cause diagnosis of reliability issues and leads to wasted resources and degraded performance. To address this, we propose a lightweight, dependency-aware distributed tracing mechanism that, for the first time, enables runtime observability into collective communication internals. Our method constructs dynamic communication dependency graphs by integrating low-overhead tracing, online dependency modeling, and interpretable anomaly analysis. Deployed in ByteDance's production LLM training infrastructure for over six months, it detects 90% of failures within 15 seconds and precisely identifies root causes in 60% of cases within 20 seconds. Controlled fault-injection experiments confirm its accuracy and robustness. This work establishes a practical, production-ready observability infrastructure for enhancing the reliability of LLM training systems.
📝 Abstract
Reliability is essential for ensuring efficiency in LLM training. However, many real-world reliability issues remain difficult to resolve, resulting in wasted resources and degraded model performance. Unfortunately, today's collective communication libraries operate as black boxes, hiding critical information needed for effective root cause analysis. We propose Mycroft, a lightweight distributed tracing and root cause analysis system designed to address previously hidden reliability issues in collective communication. Mycroft's key idea is to trace collective communication states and leverage internal control and data dependencies to resolve reliability problems in LLM training. Mycroft has been deployed at ByteDance for over six months to debug collective-communication-related issues at runtime. It detected anomalies within 15 seconds in 90% of cases and identified the root cause within 20 seconds in 60% of cases. We also conducted extensive fault injection experiments to demonstrate Mycroft's capability and efficiency.
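To make the key idea concrete, here is a minimal, hypothetical sketch of the kind of analysis such per-rank tracing enables. It is not Mycroft's actual implementation; the data format (a snapshot of each rank's last completed collective operation on a communicator) and the function name are illustrative assumptions. The underlying reasoning is standard for collectives: an operation cannot complete until every participating rank enters it, so a rank whose progress counter trails its peers is the natural root-cause suspect.

```python
# Hypothetical sketch of dependency-aware stall localization.
# Input: {rank: sequence number of the last collective op that rank
# completed on one communicator}, as a tracing layer might report it.
# This is illustrative, not the actual Mycroft code or data format.

def find_lagging_ranks(progress, tolerance=0):
    """Return ranks trailing the fastest rank by more than `tolerance`
    operations. Because a collective only completes once every rank has
    entered it, a lagging rank is the likely source of a hang."""
    leader = max(progress.values())
    return sorted(rank for rank, seq in progress.items()
                  if leader - seq > tolerance)

if __name__ == "__main__":
    # Ranks 0-3 report their last finished collective; rank 2 is stuck
    # several operations behind and gets flagged as the suspect.
    snapshot = {0: 1042, 1: 1042, 2: 1037, 3: 1041}
    print(find_lagging_ranks(snapshot, tolerance=2))  # -> [2]
```

A small tolerance avoids flagging ranks that are merely a step or two behind due to normal pipelining; only a rank stalled well past its peers is reported.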