CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a critical limitation in evaluating large language models (LLMs) for CUDA program debugging: existing approaches often allow LLMs to bypass genuine bugs via performance-degrading fixes that merely satisfy test cases, thereby overestimating true debugging capability. To remedy this, we introduce CUDABeaver—the first benchmark specifically designed for performance-preserving CUDA debugging—comprising 213 tasks derived from real-world LLM failure cases. Each task enforces single-file editing constraints and provides error logs and build commands. We further propose a protocol-aware evaluation metric, pass@k(M,C,A), enabling multidimensional assessment. Experiments reveal that slightly tightening performance tolerance thresholds reduces the debugging success rate of seven state-of-the-art LLMs by up to 40 percentage points, exposing substantial optimistic bias in current evaluation practices.

📝 Abstract

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests through repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABeaver, a benchmark for CUDA debugging built from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABeaver evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
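The effect of a performance-loss tolerance on a pass@k-style score can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact definition of pass@k(M,C,A), so the code below assumes the standard unbiased pass@k estimator and treats a single protocol axis, a maximum allowed slowdown, as the only component of A. The `Attempt` record, the `slowdown` field, and the `max_slowdown` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Attempt:
    """One repair attempt by a fixer (hypothetical record, not the paper's schema)."""
    tests_pass: bool   # did the native build/test commands succeed?
    slowdown: float    # repaired runtime / reference runtime (assumed measure)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Every size-k subset contains at least one successful attempt.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def protocol_pass_at_k(attempts: list[Attempt], k: int, max_slowdown: float) -> float:
    """pass@k for one task where success also requires staying within
    the assumed performance-loss tolerance `max_slowdown`."""
    successes = sum(a.tests_pass and a.slowdown <= max_slowdown for a in attempts)
    return pass_at_k(len(attempts), successes, k)


# Illustrative, made-up attempts for a single task: three pass the tests,
# but only one preserves performance.
attempts = [
    Attempt(tests_pass=True, slowdown=1.02),
    Attempt(tests_pass=True, slowdown=1.80),
    Attempt(tests_pass=True, slowdown=3.00),
    Attempt(tests_pass=False, slowdown=1.00),
]

loose = protocol_pass_at_k(attempts, k=1, max_slowdown=5.0)   # 0.75
strict = protocol_pass_at_k(attempts, k=1, max_slowdown=1.1)  # 0.25
```

With these made-up numbers, the same fixer scores 0.75 under the loose tolerance and 0.25 under the strict one, which is the kind of gap the abstract attributes to repair by degeneration. A corpus-level score would average the per-task values over the tasks in C.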

Problem

Research questions and friction points this paper is trying to address.

CUDA debugging

LLM evaluation

performance preservation

code repair

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

CUDA debugging

LLM-based repair

performance preservation