🤖 AI Summary
This work addresses a critical yet overlooked failure mode in code generation by large language models: functional failures caused by subtle transcription errors, such as omitting or corrupting high-precision decimal constants, even in models with strong algorithmic reasoning. To bridge this gap, the paper introduces the first benchmark designed specifically to evaluate long-range state tracking and data fidelity in code transcription: models must embed high-precision decimal constants verbatim in Python programs and then perform an aggregate computation over them. Using an evaluation protocol based on exact string matching, multiple prompting strategies, and a structured failure-analysis framework, the study systematically characterizes silent transcription errors across current models. The proposed benchmark offers a reproducible stress-testing paradigm that complements existing reliability assessments of code-generation systems.
📝 Abstract
Many real-world software tasks require exact transcription of provided data into code, such as cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while still producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark to isolate this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, the evaluation protocol based on exact-string inclusion, and the analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.
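To make the exact-string-inclusion protocol concrete, here is a minimal sketch of such a check. The function name, the constant values, and the generated-code snippet are illustrative assumptions, not taken from the paper; the paper's actual harness may differ in its details.

```python
def check_transcription(generated_code: str, constants: list[str]) -> dict:
    """Return which constants are missing from the generated code.

    A constant counts as transcribed only if its exact decimal string
    appears verbatim in the code; any dropped, rounded, or reordered
    digit fails the check silently, mirroring the failure mode the
    benchmark targets.
    """
    missing = [c for c in constants if c not in generated_code]
    return {"pass": not missing, "missing": missing}


# Illustrative example: the second program silently drops a trailing digit.
constants = ["3.141592653589793238", "2.718281828459045235"]

faithful = "values = [3.141592653589793238, 2.718281828459045235]\nprint(sum(values))\n"
corrupted = "values = [3.141592653589793238, 2.71828182845904523]\nprint(sum(values))\n"

print(check_transcription(faithful, constants)["pass"])    # True
print(check_transcription(corrupted, constants)["pass"])   # False
```

Note that both programs above are syntactically valid and run without error, which is precisely why an execution-based check alone would miss the corruption; exact-string inclusion catches it.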