CoCoNUT: Structural Code Understanding does not fall out of a tree

📅 2025-01-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper reveals severe deficiencies in large language models' (LLMs) understanding of code control flow, particularly for long execution traces, recursion, and parallel processing, where no tested model exceeds 5% path accuracy. Method: The authors introduce CoCoNUT, a fine-grained benchmark dedicated to execution-path understanding, covering three key control-flow structures absent from HumanEval: recursion, parallel processing, and object-oriented programming (OOP), including inheritance and polymorphism. The methodology combines HumanEval-based execution-trace generation, function-call sampling from the respective test sets, and exact path-matching evaluation, augmented by a structurally enriched test suite. Contribution/Results: Experiments on seven state-of-the-art LLMs show that even the top performer, Gemini, fully and correctly traces only 47% of HumanEval tasks, and that, apart from OOP, no model reaches 5% accuracy on the specialized traces. CoCoNUT establishes a reproducible, structure-aware standard for evaluating control-flow comprehension in code.
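The paper's tracing harness is not reproduced on this page, but the described methodology (executing HumanEval solutions on sampled test calls and recording the path taken) can be sketched with Python's built-in `sys.settrace` hook. The helper name `collect_trace`, the line-number trace format, and the `has_close_elements` example are illustrative assumptions, not the authors' actual code:

```python
import sys

def collect_trace(func, *args, **kwargs):
    """Run `func` and record the executed line numbers, in order.

    A minimal sketch of execution-trace generation; the paper's real
    harness and trace format are assumed here, not confirmed.
    """
    trace = []

    def tracer(frame, event, arg):
        if event == "line":          # record every executed source line
            trace.append(frame.f_lineno)
        return tracer                # keep tracing nested frames too

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)           # always detach the tracer
    return trace, result

# Hypothetical usage on a HumanEval-style solution (HumanEval/0),
# called with arguments sampled from its test set:
def has_close_elements(numbers, threshold):
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

path, result = collect_trace(has_close_elements, [1.0, 2.0, 3.9, 4.0], 0.3)
print(path, result)  # the executed line numbers, then the return value True
```

Matching a model's predicted path against such a recorded ground-truth trace is what the benchmark's path-matching evaluation then scores.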

📝 Abstract
Large Language Models (LLMs) have shown impressive performance across a wide array of tasks involving both structured and unstructured textual data. Recent results on various benchmarks for code generation, repair, or completion suggest that certain models have programming abilities comparable to, or even surpassing, those of humans. In this work, we demonstrate that high performance on such benchmarks does not correlate with humans' innate ability to understand structural control flow in code. To this end, we extract solutions from the HumanEval benchmark, on which the relevant models perform strongly, and trace their execution paths using function calls sampled from the respective test sets. Using this dataset, we investigate the ability of seven state-of-the-art LLMs to match these execution traces and find that, despite their ability to generate semantically identical code, they possess limited ability to trace execution paths, especially for longer traces and specific control structures. We find that even the top-performing model, Gemini, can fully and correctly generate only 47% of HumanEval task traces. Additionally, we introduce a subset for three key structures not contained in HumanEval: Recursion, Parallel Processing, and Object-Oriented Programming, including concepts like Inheritance and Polymorphism. Aside from OOP, we show that none of the investigated models achieve an accuracy over 5% on the relevant traces. Aggregating these specialized parts with the HumanEval tasks, we present the benchmark CoCoNUT: Code Control Flow for Navigation Understanding and Testing, which measures a model's ability to trace the execution of code given relevant calls, including advanced structural components. We conclude that current LLMs need significant improvement in their code reasoning abilities, and we hope our dataset helps researchers bridge this gap.
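As a rough illustration of the headline numbers (47% for Gemini on HumanEval traces, under 5% on the specialized subsets outside OOP), full-path accuracy can be read as the fraction of tasks whose predicted trace matches the ground-truth trace exactly. The sketch below assumes that reading; the authors' exact scoring rules may differ:

```python
def full_path_accuracy(predicted, reference):
    """Share of tasks whose predicted execution path equals the
    ground-truth trace exactly (one plausible reading of the paper's
    full-path accuracy; the official metric may differ in detail)."""
    if len(predicted) != len(reference):
        raise ValueError("need one prediction per reference trace")
    exact = sum(p == r for p, r in zip(predicted, reference))
    return exact / len(reference)

# Hypothetical traces, each a tuple of executed line numbers:
gold = [(2, 3, 4, 3, 5), (2, 3, 6)]
pred = [(2, 3, 4, 3, 5), (2, 3, 4)]   # second trace diverges
print(full_path_accuracy(pred, gold))  # 0.5
```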
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Complex Code Structure
Advanced Programming Concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Code Understanding
CoCoNUT Framework