Counting Without Running: Evaluating LLMs' Reasoning About Code Complexity

📅 2025-12-03
🤖 AI Summary
Existing code assistants lack the rigorous reasoning capability of expert GPU developers when predicting CUDA kernel performance, particularly regarding floating-point operation (FLOP) counts. Method: We introduce gpuFLOPBench—the first benchmark dedicated to FLOP prediction—built on 577 real-world CUDA kernels extracted from HeCBench. It systematically distinguishes statically analyzable *explicit* FLOPs (e.g., arithmetic operations) from *implicit* FLOPs influenced by compiler optimizations and runtime behavior (e.g., divisions, transcendental math function calls), and provides ground-truth single- and double-precision FLOP counts alongside eight execution attributes. Contribution/Results: Evaluation of leading closed-source reasoning LLMs reveals strong accuracy on simple kernels but errors spanning multiple orders of magnitude for kernels involving implicit FLOPs—exposing a fundamental failure to model hardware-specific microcode effects. This work identifies a critical blind spot in LLM-based GPU performance forecasting and establishes a new evaluation paradigm and actionable directions for trustworthy AI-assisted high-performance programming.

📝 Abstract
Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single- and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward kernels but still incur multiple order-of-magnitude errors whenever implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. These results surface a core limitation of existing code assistants -- the inability to internalize hardware-specific microcode effects -- and position gpuFLOPBench as a focused testbed for developing LLM tooling that can reason about performance with the same rigor as experienced GPU developers. Sources are available at our repository: https://github.com/Scientific-Computing-Lab/gpuFLOPBench
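The explicit/implicit split at the heart of the benchmark can be illustrated with a minimal static counter. This sketch is not the paper's methodology (the benchmark profiles compiled CUDA kernels for ground truth); it only shows why additions and multiplications are statically countable while divisions and transcendental calls are not -- their real FLOP cost depends on compiler codegen and hardware microcode. The `count_flops` helper and its expression-string input are hypothetical names for illustration.

```python
import ast

# Operators with a fixed, statically countable cost: "explicit" FLOPs.
EXPLICIT_OPS = (ast.Add, ast.Sub, ast.Mult)

def count_flops(expr: str):
    """Return (explicit_flops, implicit_sites) for an arithmetic expression.

    Illustrative sketch only: divisions and math-function calls are merely
    flagged as implicit-FLOP sites, because on a GPU they may expand into
    many hardware FLOPs (e.g., via reciprocal or polynomial microcode).
    """
    explicit, implicit = 0, 0
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if isinstance(node, ast.BinOp):
            if isinstance(node.op, EXPLICIT_OPS):
                explicit += 1
            elif isinstance(node.op, ast.Div):
                implicit += 1  # division: hardware-dependent FLOP cost
        elif isinstance(node, ast.Call):
            implicit += 1      # e.g. exp/sqrt: expands via microcode

    return explicit, implicit

print(count_flops("a*x + b"))       # fused multiply-add: (2, 0)
print(count_flops("exp(x)/y + 1"))  # two implicit sites: (1, 2)
```

A model that only counts syntax gets `a*x + b` right, but for `exp(x)/y + 1` it must know how the target GPU lowers `exp` and `/` -- exactly the knowledge the benchmark finds missing in current LLMs.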
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to predict FLOP counts without code execution
Assessing forward-looking reasoning about GPU kernel performance bottlenecks
Testing models on implicit FLOPs from compiler and runtime behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

gpuFLOPBench: a benchmark of 577 HeCBench CUDA kernels for evaluating LLMs' FLOP prediction
Models classify simple kernels perfectly but err by orders of magnitude on implicit FLOPs
A focused testbed for LLM tooling that reasons about hardware-specific microcode effects