🤖 AI Summary
This work addresses the inefficiency and error-proneness of manually crafting CUDA kernels that jointly optimize computation and communication in large-scale distributed training and inference of large language models. Existing approaches typically optimize computation alone, neglecting the synergistic co-design of computation and communication. To overcome this limitation, the authors propose CUCo, the first training-free, agent-driven framework that enables end-to-end automated co-optimization of computation and communication kernels, breaking from the conventional decoupled design paradigm. By integrating an intelligent agent workflow, automatic CUDA kernel generation, and a co-scheduling mechanism, CUCo reduces end-to-end latency by up to 1.57× relative to the state-of-the-art baseline.
📝 Abstract
Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly orchestrate computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free, agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.
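To build intuition for why co-scheduling computation and communication reduces latency, the toy latency model below pipelines the two phases over chunks, so that while one chunk's partial result is being communicated, the next chunk is being computed. This is an illustrative sketch only: the function names and the equal-chunk cost model are assumptions for exposition, not CUCo's actual scheduler or kernels.

```python
def serial_latency(compute_ms: float, comm_ms: float) -> float:
    """Decoupled design: finish all computation, then communicate."""
    return compute_ms + comm_ms


def overlapped_latency(compute_ms: float, comm_ms: float, chunks: int) -> float:
    """Co-scheduled design (simplified model): split the work into `chunks`
    equal slices and pipeline them. The first compute slice cannot overlap
    with anything; in steady state the slower of the two stages dominates;
    the last communication slice drains the pipeline."""
    c = compute_ms / chunks  # per-chunk compute time
    m = comm_ms / chunks     # per-chunk communication time
    return c + (chunks - 1) * max(c, m) + m


# Example: 40 ms of compute and 40 ms of communication, pipelined over 8 chunks.
serial = serial_latency(40.0, 40.0)          # 80.0 ms
overlap = overlapped_latency(40.0, 40.0, 8)  # 5 + 7*5 + 5 = 45.0 ms
print(f"serial={serial} ms, overlapped={overlap} ms, speedup={serial / overlap:.2f}x")
```

When compute and communication times are balanced, the overlapped latency approaches the larger of the two as the chunk count grows, which is the headroom a decoupled design leaves on the table.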