GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

📅 2025-07-16

🤖 AI Summary

Problem: AI-generated code often fails to stay compatible with frequently updated Python libraries. Method: We propose an execution-verified, version-aware code generation benchmark comprising 328 Python code completion tasks, each with precise library version constraints and executable unit tests, for automated, quantitative evaluation of LLMs, LLM-powered agents, code assistants, and RAG systems. Contribution/Results: The benchmark reveals substantial version-adaptation deficiencies in state-of-the-art systems, with enterprise models reaching baseline success rates of only 48-51%. The dataset and evaluation code are publicly released to support research on more adaptable code generation.

📝 Abstract

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
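To make "execution-based evaluation" concrete, the following is a minimal sketch of how a generated solution might be checked by running it together with its unit test and then aggregated into a pass rate. It is not the GitChameleon harness: the function names `run_candidate` and `pass_rate` are hypothetical, and the sketch assumes the current interpreter stands in for an environment in which the pinned library version is already installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(solution_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate plus its unit test exits with status 0.

    Assumption: sys.executable stands in for the interpreter of an
    environment where the pinned library version is installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        # Concatenate the generated solution and the test into one script.
        script.write_text(solution_code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A hanging candidate counts as a failure.
            return False
        # A failed assertion or any uncaught exception gives a non-zero exit code.
        return result.returncode == 0


def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of problems whose candidate passed its test."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Running each candidate in a separate process keeps a crash or infinite loop in generated code from taking down the evaluator, which is one plausible reason to prefer a subprocess over an in-process `exec`.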

Problem

Research questions and friction points this paper is trying to address.

Evaluating AI code generation for Python library version compatibility

Assessing functional accuracy of version-conditioned code via execution tests

Addressing challenges in AI adaptability to frequent library updates

Innovation

Methods, ideas, or system contributions that make the work stand out.

GitChameleon dataset with version-specific Python problems

Execution-based evaluation for library version compatibility

Assessing LLMs on version-conditioned code generation accuracy
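As an illustration of what "version-conditioned" could mean in practice, here is a hedged sketch of a problem record and the prompt built from it. The schema is an assumption: the class `VersionedProblem`, its field names, the prompt wording, and the library name `examplelib` are all hypothetical and may differ from the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedProblem:
    # Hypothetical schema; the released dataset may use different fields.
    library: str
    version: str
    task: str
    test_code: str


def build_prompt(problem: VersionedProblem) -> str:
    """Condition the task description on one exact library version."""
    return (
        f"Use {problem.library}=={problem.version}.\n"
        f"Task: {problem.task}\n"
        "Return only Python code."
    )


def normalize(version: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '1.2.0', dropping trailing zeros.

    Simplification: pre-release and local version tags are not handled.
    """
    parts = [int(p) for p in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def environment_matches(problem: VersionedProblem, installed: str) -> bool:
    """Check that the installed release equals the pinned one."""
    return normalize(installed) == normalize(problem.version)


problem = VersionedProblem(
    library="examplelib",  # hypothetical library name
    version="1.2.0",
    task="Sum a list of integers.",
    test_code="assert total([1, 2]) == 3",
)
print(build_prompt(problem))
```

Checking the installed version before running the test guards against a silent mismatch, where a solution passes or fails because of the wrong library release, not because of the model's output.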

🔎 Similar Papers

PCART: Automated Repair of Python API Parameter Compatibility Issues