AI Summary
This work addresses the challenge of diagnosing subtle OS-level anomalies in production-scale AI training: such anomalies can induce GPU latency and network performance degradation, yet they elude existing tools, which suffer from high overhead, a single-layer focus, and a lack of continuous monitoring. To overcome these limitations, the authors present the first low-overhead (merely 0.4%), continuously operating cross-layer observability system, integrating eBPF, adaptive hybrid stack unwinding, and NCCL event tracing to jointly analyze data across the CPU, GPU, and communication layers for precise root-cause identification. Deployed across Alibaba's cluster of over 80,000 GPUs, the system has operated stably for more than a year, diagnosing 94 real-world incidents and reducing average troubleshooting time from several days to approximately 10 minutes.
Abstract
Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10–30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.
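To make the idea of layered differential diagnosis concrete, here is a minimal sketch of cross-layer root-cause attribution. It is not the paper's implementation; the `Event` type, the layer names, and the cascade-window heuristic (the earliest anomalous layer is treated as the root cause, later anomalies as downstream symptoms) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Event:
    """A monitoring sample from one observability layer (hypothetical schema)."""
    layer: str       # e.g. "cpu" (stack profiling), "gpu" (kernel tracing), "nccl" (event instrumentation)
    ts: float        # timestamp in seconds
    anomalous: bool  # flagged by a per-layer anomaly detector

def root_cause_layer(events: List[Event], window: float = 1.0) -> Optional[str]:
    """Attribute a cascading slowdown to the layer whose anomaly appears first.

    Anomalies occurring within `window` seconds after the earliest one are
    treated as cascading effects rather than independent root causes.
    """
    anomalies = sorted((e for e in events if e.anomalous), key=lambda e: e.ts)
    if not anomalies:
        return None
    first = anomalies[0]
    # Everything inside the cascade window is considered a downstream symptom.
    cascade = [e for e in anomalies if e.ts - first.ts <= window]
    assert first in cascade
    return first.layer

# Example: an OS-level CPU stall precedes the GPU delay and NCCL slowdown,
# so the CPU layer is reported as the root cause.
events = [
    Event("cpu", 0.0, True),    # OS-level stall detected first
    Event("gpu", 0.2, True),    # GPU kernel launch delay follows
    Event("nccl", 0.3, True),   # collective communication slows down last
    Event("gpu", 5.0, False),   # later, healthy sample
]
print(root_cause_layer(events))  # → cpu
```

In the real system the per-layer detectors and correlation logic are far richer; the sketch only shows why joint, timestamp-aligned data from all three layers is needed, since any single layer in isolation would report only a symptom.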