🤖 AI Summary
Existing benchmarks for evaluating large language models (LLMs) on programming tasks are overly simplified: they fail to capture the complexity of real-world scenarios and thereby risk both overestimating model capabilities and data contamination. To address this, the work proposes GeneBench, an automated framework based on multi-objective optimization that injects realistic code complexity into any programming benchmark while preserving the original task semantics and code readability. GeneBench is the first general-purpose, low-cost mechanism for complexity augmentation, substantially reducing the overhead and contamination risks of manually constructing high-quality benchmarks. Experiments on four widely used benchmarks show that applying GeneBench causes an average performance drop of 35.2% (ranging from 14.9% to 60.5%) across 13 prominent LLMs, yielding a more accurate assessment of their true programming proficiency.
📝 Abstract
Evaluating Large Language Models (LLMs) with respect to real-world code complexity is essential. Otherwise, there is a risk of overestimating LLMs' programming abilities based on simplistic benchmarks, only to be disappointed when using them in real-world settings. Recently, researchers have explored constructing more realistic benchmarks by mining or augmenting open-source repositories. Such solutions are usually task-specific, and data quality control for real-world projects can be time-consuming and error-prone. More importantly, evaluating LLMs on fixed benchmark problems is subject to data contamination and overfitting. We propose GeneBench, an automated technique to add real-world complexities to any programming benchmark. GeneBench leverages multi-objective optimization to increase the complexity of programming problems while keeping the code as readable as real-world programs. Transforming four widely-used programming benchmarks with GeneBench and evaluating 13 LLMs (including two reasoning LLMs) on them shows a notable performance drop across all programming tasks (14.9%-60.5%, avg=35.2%), demonstrating LLMs' struggle under real-world complexities. The struggle persists even when LLMs are few-shot prompted or fine-tuned with examples from different versions of GeneBench, demonstrating the challenging nature of the problems. Finally, we show that the performance of the studied LLMs in bug repair is similar under GeneBench and SWE-Bench. This, along with the consistent reproduction of the performance drop of all studied LLMs across four tasks under different versions of GeneBench, makes the technique suitable for evaluating LLMs without costly construction of real-world benchmarks.
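To make the multi-objective idea concrete, here is a minimal, hypothetical sketch of how candidate benchmark transformations could be selected on two competing objectives (added complexity vs. preserved readability) via Pareto-front filtering. The abstract does not specify GeneBench's actual algorithm, transformation set, or scoring functions, so every name and score below is an illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch only: generic Pareto-front selection over hypothetical
# code transformations, each scored on two objectives. Not GeneBench's
# actual optimizer or transformation catalog.
from dataclasses import dataclass


@dataclass(frozen=True)
class Transform:
    name: str
    complexity_gain: float  # higher = more real-world complexity injected
    readability: float      # higher = closer to readable real-world code


def pareto_front(candidates):
    """Keep candidates not dominated on both objectives by another candidate."""
    front = []
    for c in candidates:
        dominated = any(
            o.complexity_gain >= c.complexity_gain
            and o.readability >= c.readability
            and (o.complexity_gain > c.complexity_gain
                 or o.readability > c.readability)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical transformations with made-up objective scores.
candidates = [
    Transform("inline-helper", 0.2, 0.9),
    Transform("add-dependency-chain", 0.8, 0.6),
    Transform("obfuscate-names", 0.9, 0.1),
    Transform("noop", 0.1, 0.5),  # dominated by "inline-helper"
]
front = pareto_front(candidates)
print(sorted(t.name for t in front))
```

Under such a scheme, a transformation like the hypothetical "noop" is filtered out because another candidate beats it on both objectives, while trade-off points (high complexity but low readability, and vice versa) survive for a later selection step to weigh.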