AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation

📅 2025-11-16
🏛️ International Conference on Automated Software Engineering
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of an effective evaluation benchmark for large language models (LLMs) on code snippet adaptation, a gap that hinders a deeper understanding of their utility in software reuse. To bridge it, we introduce AdaptEval, the first benchmark specifically designed for code adaptation, constructed from real-world developer practices by integrating data from Stack Overflow and GitHub. AdaptEval features a dual-layer annotation scheme, with requirements at both the task level and the adaptation level, and a joint adaptation-and-function testing framework, enabling fine-grained assessment of models' instruction-following and functional implementation capabilities. We evaluate six instruction-tuned and three reasoning-focused LLMs, revealing significant deficiencies in current models' ability to follow explicit adaptation instructions. The benchmark provides both critical insights and practical tools to advance future research in this domain.
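To make the dual-layer annotation concrete, below is a minimal sketch of what a task record carrying both task-level and adaptation-level requirements might look like. The schema, field names, and example comments are illustrative assumptions, not AdaptEval's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class AdaptationRequirement:
    """One adaptation-level requirement (hypothetical schema, not AdaptEval's real format)."""
    instruction: str  # e.g. "replace the hard-coded path with the caller-supplied argument"
    category: str     # e.g. "logic change" or "interface change" (illustrative labels)


@dataclass
class AdaptationTask:
    """Illustrative dual-layer task: one task-level goal plus several adaptation-level steps."""
    source_snippet: str    # snippet reused from a Stack Overflow answer
    target_context: str    # surrounding code from the target GitHub project
    task_requirement: str  # task-level description of the overall adaptation goal
    adaptations: list[AdaptationRequirement] = field(default_factory=list)
```

Under this reading, a model is prompted with the snippet, context, and requirements, then scored both per individual adaptation and on overall functional behavior.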

📝 Abstract
Recent advances in large language models (LLMs) have automated various software engineering tasks, with benchmarks emerging to evaluate their capabilities. However, for adaptation, a critical activity in code reuse, no benchmark exists to assess LLMs' performance, leaving their practical utility in this area unclear. To fill this gap, we propose AdaptEval, a benchmark designed to evaluate LLMs on code snippet adaptation. Unlike existing benchmarks, AdaptEval incorporates three distinctive features. First, practical context: tasks in AdaptEval are derived from developers' practices, preserving rich contextual information from the Stack Overflow and GitHub communities. Second, multi-granularity annotation: each task is annotated with requirements at both the task and adaptation levels, supporting the evaluation of LLMs across diverse adaptation scenarios. Third, fine-grained evaluation: AdaptEval includes a two-tier testing framework combining adaptation-level and function-level tests, which enables evaluating LLMs' performance on individual adaptations. Based on AdaptEval, we conduct the first empirical study to evaluate six instruction-tuned LLMs and three reasoning LLMs on code snippet adaptation. Experimental results demonstrate that AdaptEval enables the assessment of LLMs' adaptation capabilities from various perspectives. It also provides critical insights into their current limitations, particularly their struggle to follow explicit instructions. We hope AdaptEval can facilitate further investigation and enhancement of LLMs' capabilities in code snippet adaptation, supporting their real-world applications.
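As a rough illustration of the two-tier testing framework, the sketch below pairs an adaptation-level check (did the model apply one specific instructed change?) with a function-level test (does the adapted code behave correctly overall?). The function under test, the assumed instruction, and all assertions are hypothetical stand-ins, not tests drawn from the benchmark.

```python
# Hypothetical two-tier test sketch in pytest style. adapted_parse_size stands
# in for a model-adapted snippet; it is not code from AdaptEval itself.

def adapted_parse_size(text: str) -> int:
    """Stand-in adapted snippet: parse sizes like '2MB' or '1.5kb' into bytes."""
    value, unit = text[:-2], text[-2:].lower()
    factor = {"kb": 1024, "mb": 1024 ** 2}[unit]
    return int(float(value) * factor)


def test_adaptation_level():
    # Adaptation-level: verify one specific instructed change was applied,
    # e.g. the (assumed) instruction "extend the snippet to also handle MB".
    assert adapted_parse_size("2MB") == 2 * 1024 ** 2


def test_function_level():
    # Function-level: overall functional correctness on representative inputs.
    assert adapted_parse_size("1kb") == 1024
    assert adapted_parse_size("1.5KB") == 1536
```

Separating the two tiers lets an evaluation distinguish a model that merely produces working code from one that actually follows each explicit instruction.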
Problem

Research questions and friction points this paper is trying to address.

code snippet adaptation
large language models
benchmark
code reuse
software engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

code snippet adaptation
benchmark
large language models
multi-granularity annotation
fine-grained evaluation
🔎 Similar Papers
2024-03-25 · 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge) · Citations: 22
👥 Authors
Tanghaoran Zhang · National University of Defense Technology · software engineering
Xinjun Mao · College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Shangwen Wang · National University of Defense Technology · software engineering
Yuxin Zhao · College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Yao Lu · College of Computer Science and Technology, National University of Defense Technology, Changsha, China
Jin Zhang · Changsha University of Science and Technology, Changsha, China
Zhang Zhang · National University of Defense Technology
Kang Yang · National University of Defense Technology · AI4SE: Program Comprehension, Code Search/Generation, NLP: Text Summarization, GEC
Yue Yu · Professor at Pengcheng Laboratory · Software Engineering, Distributed Computing, Artificial Intelligence Systems