🤖 AI Summary
This work addresses the inefficiency of large language models (LLMs) in code editing, where full-code regeneration incurs high latency and computational cost, and conventional diff formats are poorly suited to LLM generation. The study presents a systematic analysis of how diff representations affect LLM-based code editing and introduces two structure-aware formats, BlockDiff and FuncDiff, that model code changes as block-level rewrites of syntactically coherent units. It also proposes AdaEdit, an adaptive output strategy that lets the model dynamically choose the more token-efficient representation between a diff format and full code. Experimental results show that, on long-code editing tasks, the proposed approach reduces latency and cost by over 30% compared to full-code generation while matching its accuracy.
📝 Abstract
Large Language Models (LLMs) are increasingly used for code editing, yet the prevalent full-code generation paradigm suffers from severe efficiency bottlenecks, posing challenges for interactive coding assistants that demand low latency and cost. Despite the predominant focus on scaling model capabilities, the edit format itself has been largely overlooked in model training. In this paper, we begin with a systematic study of conventional diff formats and reveal that fragile offsets and fragmented hunks make generation highly unnatural for LLMs. To address this, we introduce BlockDiff and FuncDiff, two structure-aware diff formats that represent changes as block-level rewrites of syntactically coherent units such as control structures and functions. Furthermore, we propose AdaEdit, a general adaptive edit strategy that trains LLMs to dynamically choose the more token-efficient format between a given diff format and full code. Extensive experiments demonstrate that AdaEdit paired with structure-aware diff formats consistently matches the accuracy of full-code generation, while reducing both latency and cost by over 30% on long-code editing tasks.
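To make the token-efficiency trade-off concrete, the following is a minimal sketch, not the paper's method. AdaEdit trains the model itself to pick a format, whereas this sketch compares the two candidates after the fact. The `OLD`/`NEW` marker layout, the function names `block_rewrite` and `choose_format`, and the whitespace-based token count are all assumptions made for illustration and are not the BlockDiff or FuncDiff specification.

```python
import difflib


def count_tokens(text: str) -> int:
    # Crude proxy (assumption): whitespace-separated pieces.
    # A real system would count with the model's own tokenizer.
    return len(text.split())


def block_rewrite(old: str, new: str) -> str:
    """Hypothetical offset-free edit format: each changed region is
    emitted as an old-block / new-block pair, with no line numbers."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged regions are not repeated in the edit
        # Note: a pure insertion yields an empty OLD block here; a real
        # format would need surrounding context to locate the edit.
        parts.append("<<<<<<< OLD")
        parts.extend(old_lines[i1:i2])
        parts.append("=======")
        parts.extend(new_lines[j1:j2])
        parts.append(">>>>>>> NEW")
    return "\n".join(parts)


def choose_format(old: str, new: str) -> tuple[str, str]:
    """Return the name and payload of the cheaper representation."""
    candidates = {"diff": block_rewrite(old, new), "full": new}
    name = min(candidates, key=lambda k: count_tokens(candidates[k]))
    return name, candidates[name]
```

Under these assumptions, a one-line change in a long file selects the `"diff"` payload, while a short file that is rewritten entirely selects `"full"`, because the markers and the repeated old lines then cost more than emitting the new code outright.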