🤖 AI Summary
This paper introduces the Gistify task to evaluate how deeply large language models understand large codebases and can model their program execution. The task requires a model to synthesize a minimal, self-contained, single-file program that precisely reproduces the runtime output of a specified entry-point command run on the original codebase. Unlike conventional unit-level testing, Gistify establishes the first codebase-level, behaviorally aligned evaluation paradigm: success demands end-to-end functional reconstruction, drawing on program execution tracing, inter-procedural dependency analysis, and minimal code-fragment synthesis. Empirical results show that current state-of-the-art code-generation models perform poorly on Gistify, particularly on tasks with long execution paths, failing to ensure correctness and conciseness simultaneously. This reveals fundamental limitations in their capacity to model cross-file control flow and capture dynamic, runtime semantics.
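As a rough illustration of the execution-tracing step the summary mentions, the sketch below uses Python's `sys.settrace` to record which functions actually execute for a given entrypoint. This is a minimal, assumed approach, not the paper's harness: a Gistify-style pipeline could use such a trace to decide which code fragments are essential for the single-file reproduction. All function names here are hypothetical toy examples.

```python
import sys

def collect_executed_functions(entrypoint, *args):
    """Run `entrypoint(*args)` under a trace and record the
    (filename, function-name) pairs that actually execute.
    Illustrative sketch only, not the paper's implementation."""
    executed = set()

    def tracer(frame, event, arg):
        if event == "call":  # fires once per new stack frame
            code = frame.f_code
            executed.add((code.co_filename, code.co_name))
        return tracer

    sys.settrace(tracer)
    try:
        result = entrypoint(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, executed

# Toy entrypoint: only `main` and `helper` should appear in the trace.
def helper(x):
    return x * 2

def unused():
    return "never runs"

def main(x):
    return helper(x) + 1
```

Filtering the traced filenames down to those inside the target repository would give a first approximation of the code that must survive in the gisted file.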
📝 Abstract
As coding agents are increasingly deployed in large codebases, automatically designing challenging, codebase-level evaluations becomes central. We propose Gistify, a task in which a coding LLM must create a single, minimal, self-contained file that reproduces a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the components essential to executing that command. Success on Gistify requires structural understanding of the codebase and accurate modeling of its execution flow, as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
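The success criterion described above, that the gisted file replicate the entrypoint's output under the full codebase, can be sketched as a simple output comparison. The command lists below are illustrative assumptions, not the paper's evaluation harness:

```python
import subprocess
import sys

def outputs_match(original_cmd, gist_cmd):
    """Run the entrypoint command under the full codebase and the
    gisted single file, then compare exit codes and stdout
    byte-for-byte. A hedged sketch of the Gistify success check."""
    orig = subprocess.run(original_cmd, capture_output=True, text=True)
    gist = subprocess.run(gist_cmd, capture_output=True, text=True)
    return orig.returncode == gist.returncode and orig.stdout == gist.stdout

# Hypothetical usage (paths are placeholders, not from the paper):
#   outputs_match([sys.executable, "-m", "pkg.entry"],
#                 [sys.executable, "gist.py"])
```

A real harness would likely also bound wall-clock time and check that the gisted file imports nothing from the original repository, but those details are beyond this sketch.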