SysOM-AI: Continuous Cross-Layer Performance Diagnosis for Production AI Training

📅 2026-03-31
🤖 AI Summary
This work addresses the challenge of diagnosing subtle OS-level anomalies in production-scale AI training. Such anomalies can cause GPU delays and network slowdowns, yet existing tools miss them because of high overhead, a single-layer focus, or a lack of continuous monitoring. The authors present SysOM-AI, a continuously operating cross-layer observability system with less than 0.4% overhead that combines eBPF-based tracing, adaptive hybrid stack unwinding, and NCCL event instrumentation to analyze the CPU, GPU, and communication layers jointly for root cause identification. Deployed at Alibaba across more than 80,000 GPUs for over a year, the system helped diagnose 94 confirmed production issues and reduced median diagnosis time from days to approximately 10 minutes.
📝 Abstract
Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10–30%), or lack continuous deployment capabilities, resulting in manual analyses spanning days. We argue that continuous, cross-layer observability enabled by OS-level instrumentation and layered differential diagnosis is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously integrates CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.
Problem

Research questions and friction points this paper is trying to address.

performance diagnosis
AI training
cross-layer observability
production-scale
OS-level issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-layer observability
eBPF-based tracing
adaptive hybrid stack unwinding
continuous performance diagnosis
production AI training
Yusheng Zheng
UC Santa Cruz
Wenan Mao
Alibaba Group, China
Shuyi Cheng
Alibaba Group, China
Fuqiu Feng
Alibaba Group, China
Guangshui Li
Alibaba Group, China
Zhaoyan Liao
Alibaba Group, China
Yongzhuo Huang
Alibaba Group, China
Zhenwei Xiao
Alibaba Group, China
Yuqing Li
East China Normal University
Deep Learning Theory
Andi Quinn
UC Santa Cruz, USA
Tao Ma
Alibaba Group, China