🤖 AI Summary
This work addresses the inefficiency and error-proneness of manually crafting CUDA kernels that jointly optimize computation and communication in large-scale distributed training and inference of large language models. Existing approaches typically optimize computation alone, neglecting the synergistic co-design of computation and communication. To overcome this limitation, the authors propose CUCo, the first training-free, agent-driven framework that enables end-to-end automated co-optimization of computation and communication kernels, breaking from the conventional decoupled design paradigm. By integrating an intelligent agent workflow, automatic CUDA kernel generation, and a co-scheduling mechanism, CUCo reduces end-to-end latency by up to 1.57× relative to the state-of-the-art baseline.
📝 Abstract
Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly orchestrate computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free, agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.
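To build intuition for why co-scheduling computation and communication reduces latency, the toy latency model below pipelines the two phases over chunks, so that while one chunk's partial result is being communicated, the next chunk is being computed. This is an illustrative sketch only: the function names and the equal-chunk cost model are assumptions for exposition, not CUCo's actual scheduler or kernels.

```python
def serial_latency(compute_ms: float, comm_ms: float) -> float:
    """Decoupled design: finish all computation, then communicate."""
    return compute_ms + comm_ms


def overlapped_latency(compute_ms: float, comm_ms: float, chunks: int) -> float:
    """Co-scheduled design (simplified model): split the work into `chunks`
    equal slices and pipeline them. The first compute slice cannot overlap
    with anything; in steady state the slower of the two stages dominates;
    the last communication slice drains the pipeline."""
    c = compute_ms / chunks  # per-chunk compute time
    m = comm_ms / chunks     # per-chunk communication time
    return c + (chunks - 1) * max(c, m) + m


# Example: 40 ms of compute and 40 ms of communication, pipelined over 8 chunks.
serial = serial_latency(40.0, 40.0)          # 80.0 ms
overlap = overlapped_latency(40.0, 40.0, 8)  # 5 + 7*5 + 5 = 45.0 ms
print(f"serial={serial} ms, overlapped={overlap} ms, speedup={serial / overlap:.2f}x")
```

When compute and communication times are balanced, the overlapped latency approaches the larger of the two as the chunk count grows, which is the headroom a decoupled design leaves on the table.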