CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

📅 2024-07-08

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of adapting large language models (LLMs) to evolving APIs in code generation. Model knowledge is static, while libraries keep adding and changing functionality, and no prior work has studied how an LLM's knowledge of API functions can be updated. The paper presents CodeUpdateArena, a benchmark for knowledge editing in the code domain. Each instance pairs a synthetic API function update with a program synthesis example that uses the updated functionality, and the model must solve the example without documentation of the update at inference time. This requires reasoning about the semantics of the modified function rather than reproducing its syntax. The benchmark, built by prompting GPT-4, covers updates to 54 functions from seven Python packages, with 670 program synthesis examples in total. Experiments show that prepending documentation of the update to open-source code LLMs (DeepSeek, CodeLlama) does not allow them to incorporate the changes for problem solving, and that existing knowledge editing techniques leave substantial room for improvement.

📝 Abstract

Large language models (LLMs) are increasingly being used to synthesize and reason about source code. However, the static nature of these models' knowledge does not reflect the fact that libraries and API functions they invoke are continuously evolving, with functionality being added or changing. While numerous benchmarks evaluate how LLMs can generate code, no prior work has studied how an LLM's knowledge about code API functions can be updated. To fill this gap, we present CodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM to be able to solve this program synthesis example without providing documentation of the update at inference time. Compared to knowledge editing for facts encoded in text, success here is more challenging: a code LLM must correctly reason about the semantics of the modified function rather than just reproduce its syntax. Our dataset is constructed by first prompting GPT-4 to generate atomic and executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are prone to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that prepending documentation of the update to open-source code LLMs (i.e., DeepSeek, CodeLlama) does not allow them to incorporate changes for problem solving, and existing knowledge editing techniques also have substantial room for improvement. We hope our benchmark will inspire new methods for knowledge updating in code LLMs.
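
To make the instance structure concrete, here is a minimal sketch of how one update-plus-problem pair and its execution-based check might look. Everything in it is an illustrative assumption rather than the benchmark's actual schema or data: the `UpdateInstance` fields, the toy `sorted_updated` function, and the `passes` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UpdateInstance:
    """Hypothetical container for one instance (field names are assumptions)."""
    update_doc: str                    # description of the API change
    updated_fn: Callable               # executable implementation of the updated function
    problem: str                       # program synthesis prompt that should use the update
    tests: List[Tuple[tuple, object]]  # (arguments, expected result) pairs


def sorted_updated(values, reverse=False, limit=None):
    """Toy stand-in for an updated API: like sorted(), plus a new `limit` argument."""
    result = sorted(values, reverse=reverse)
    return result if limit is None else result[:limit]


instance = UpdateInstance(
    update_doc="sorted_updated gains a `limit` argument that truncates the sorted output.",
    updated_fn=sorted_updated,
    problem="Return the k largest scores in descending order.",
    tests=[(([3, 1, 4, 1, 5], 2), [5, 4]), (([2, 2], 5), [2, 2])],
)


def passes(instance, solution_src, entry_point="solve"):
    """Run a candidate solution against the tests with the updated API in scope."""
    namespace = {"sorted_updated": instance.updated_fn}
    # exec is only acceptable here because the code is trusted;
    # a real harness would sandbox model-generated programs.
    exec(solution_src, namespace)
    solve = namespace[entry_point]
    return all(solve(*args) == expected for args, expected in instance.tests)


# A solution that relies on the new semantics, and one that ignores the update.
uses_update = (
    "def solve(scores, k):\n"
    "    return sorted_updated(scores, reverse=True, limit=k)\n"
)
ignores_update = (
    "def solve(scores, k):\n"
    "    return sorted_updated(scores, reverse=True)\n"
)

print(passes(instance, uses_update))     # True
print(passes(instance, ignores_update))  # False
```

The point of the sketch is that success is judged by running the generated program, so a model that merely recalls the old behavior of the function fails the tests even if its code is syntactically valid.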

Problem

Research questions and friction points this paper is trying to address.

Benchmarking knowledge editing for API updates.

Updating LLMs on evolving code functionalities.

Improving LLMs' reasoning on modified function semantics.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks knowledge editing in code LLMs

Uses GPT-4 to generate function updates

Covers updates to 54 functions from seven Python packages, with 670 program synthesis examples
