🤖 AI Summary
This paper introduces the Gistify task to evaluate how deeply large language models understand large codebases and can model their program execution. The task requires a model to synthesize a minimal, self-contained, single-file program that precisely reproduces the runtime output of a specified entry-point command run on the original codebase. Unlike conventional unit-level testing, Gistify establishes the first codebase-level, behaviorally aligned evaluation paradigm: success demands end-to-end functional reconstruction, drawing on program execution tracing, inter-procedural dependency analysis, and minimal code-fragment synthesis. Empirical results show that current state-of-the-art code-generation models perform poorly on Gistify, particularly on tasks with long execution paths, failing to ensure correctness and conciseness simultaneously. This reveals fundamental limitations in their capacity to model cross-file control flow and capture dynamic, runtime semantics.
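As a rough illustration of the execution-tracing step the summary mentions, the sketch below uses Python's `sys.settrace` to record which functions actually execute for a given entrypoint. This is a minimal, assumed approach, not the paper's harness: a Gistify-style pipeline could use such a trace to decide which code fragments are essential for the single-file reproduction. All function names here are hypothetical toy examples.

```python
import sys

def collect_executed_functions(entrypoint, *args):
    """Run `entrypoint(*args)` under a trace and record the
    (filename, function-name) pairs that actually execute.
    Illustrative sketch only, not the paper's implementation."""
    executed = set()

    def tracer(frame, event, arg):
        if event == "call":  # fires once per new stack frame
            code = frame.f_code
            executed.add((code.co_filename, code.co_name))
        return tracer

    sys.settrace(tracer)
    try:
        result = entrypoint(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return result, executed

# Toy entrypoint: only `main` and `helper` should appear in the trace.
def helper(x):
    return x * 2

def unused():
    return "never runs"

def main(x):
    return helper(x) + 1
```

Filtering the traced filenames down to those inside the target repository would give a first approximation of the code that must survive in the gisted file.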
📝 Abstract
As coding agents are increasingly deployed in large codebases, automatically designing challenging, codebase-level evaluations becomes central. We propose Gistify, a task in which a coding LLM must create a single, minimal, self-contained file that reproduces a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the components essential to executing that command. Success on Gistify requires structural understanding of the codebase and accurate modeling of its execution flow, as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
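The success criterion described above, that the gisted file replicate the entrypoint's output under the full codebase, can be sketched as a simple output comparison. The command lists below are illustrative assumptions, not the paper's evaluation harness:

```python
import subprocess
import sys

def outputs_match(original_cmd, gist_cmd):
    """Run the entrypoint command under the full codebase and the
    gisted single file, then compare exit codes and stdout
    byte-for-byte. A hedged sketch of the Gistify success check."""
    orig = subprocess.run(original_cmd, capture_output=True, text=True)
    gist = subprocess.run(gist_cmd, capture_output=True, text=True)
    return orig.returncode == gist.returncode and orig.stdout == gist.stdout

# Hypothetical usage (paths are placeholders, not from the paper):
#   outputs_match([sys.executable, "-m", "pkg.entry"],
#                 [sys.executable, "gist.py"])
```

A real harness would likely also bound wall-clock time and check that the gisted file imports nothing from the original repository, but those details are beyond this sketch.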