🤖 AI Summary
Large language models (LLMs) struggle to keep pace with the rapid evolution of third-party library APIs because their pretraining data is static, and so they often generate non-executable or low-quality code. This work introduces CODESYNC, the first systematic framework for dynamically synchronizing LLMs' code knowledge. Its core components are: (1) the CODESYNC Data Engine, which captures real-time Python library API changes and generates dynamic knowledge updates; (2) CODESYNCBENCH, a novel benchmark comprising 3,300 test cases covering 220 real-world API updates, for quantitatively evaluating models' synchronization capability; and (3) an update-aware instruction-tuning dataset with multi-strategy alignment experiments (DPO, ORPO, SimPO). Evaluation across 14 state-of-the-art models reveals a significant lag between current knowledge-updating methods and actual API evolution. All resources, including data, benchmarks, and code, are publicly released to advance research on real-time code knowledge updating.
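The core idea behind a data engine that "captures API changes" can be illustrated by comparing callable signatures across library versions. The sketch below is a minimal illustration using Python's `inspect` module, not CODESYNC's actual implementation; the `fetch` / `fetch_v2` functions are hypothetical stand-ins for the same API in two releases.

```python
import inspect


def signature_fingerprint(func):
    """Return a hashable description of a callable's parameters:
    (name, parameter kind, whether it has a default)."""
    sig = inspect.signature(func)
    return tuple(
        (p.name, p.kind.name, p.default is not inspect.Parameter.empty)
        for p in sig.parameters.values()
    )


def api_changed(old_func, new_func):
    """Flag an update when parameter names, kinds, or defaults differ."""
    return signature_fingerprint(old_func) != signature_fingerprint(new_func)


# Hypothetical "old" and "new" versions of the same API, for illustration.
def fetch(url, timeout=10):  # old release
    ...


def fetch_v2(url, *, timeout=10, retries=3):  # new release: adds retries,
    ...                                       # makes timeout keyword-only


print(api_changed(fetch, fetch_v2))  # True: the signature changed
```

A real engine would additionally diff docstrings, deprecation warnings, and changelogs across released versions, but signature comparison is the simplest observable form of an API update.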
📝 Abstract
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To address this, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution; it covers real-world updates to 220 APIs from six Python libraries. The benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction-tuning dataset of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge-updating methods (e.g., DPO, ORPO, and SimPO). We believe our benchmark offers a strong foundation for developing more effective methods for real-time code knowledge updating. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
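The abstract's notion of "identifying outdated code patterns" can be sketched as a static scan for calls to deprecated APIs. The snippet below is an illustrative assumption, not the paper's method: `mylib.old_fetch` and its replacement are invented names, and the scan only handles simple dotted calls.

```python
import ast

# Hypothetical mapping of deprecated API names to replacements (illustrative).
DEPRECATED = {"mylib.old_fetch": "mylib.fetch"}


def dotted_name(node):
    """Reconstruct a dotted name like 'mylib.old_fetch' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def find_outdated_calls(source, deprecated):
    """Return (line, dotted_name) for each call to a deprecated API."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in deprecated:
                hits.append((node.lineno, name))
    return hits


code = "import mylib\nmylib.old_fetch('https://example.com')\n"
print(find_outdated_calls(code, DEPRECATED))  # [(2, 'mylib.old_fetch')]
```

Such a scanner paired with a deprecation mapping is one plausible way to mine the outdated-vs-updated code pairs that a benchmark like CODESYNCBENCH needs for its test cases.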