BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

📅 2024-06-22
🏛️ arXiv.org
📈 Citations: 103 (12 influential)
🤖 AI Summary
Existing LLM tool-use benchmarks focus on simple, isolated tasks and fail to assess models' ability to coordinate multiple tools and comprehend complex instructions in realistic settings. Method: We introduce BigCodeBench, a fine-grained code-generation benchmark grounded in real-world programming scenarios, comprising 1,140 tasks across seven domains that require invoking diverse functions from 139 widely used libraries. Each task is verified by automatically generated test cases with an average branch coverage of 99%. We further present BigCodeBench-Instruct, a variant that distills each task's docstring into a short natural-language instruction to assess instruction following more directly. Contribution/Results: Across an evaluation of 60 state-of-the-art models, the best performer achieves only 60% accuracy, substantially below human performance (97%), revealing fundamental limitations in current LLMs' capabilities for complex tool orchestration and instruction following.
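For context, a BigCodeBench-style task pairs a function with a docstring and a unittest suite that exercises its branches. The miniature below is a hypothetical sketch in that spirit; the task, task_func, and its tests are invented for illustration and not drawn from the benchmark:

import re
import unittest
from collections import Counter

def task_func(text):
    """Count word frequencies in `text`, ignoring case and punctuation.

    Requirements: re, collections
    Returns: collections.Counter mapping each lowercase word to its count.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))

class TestTaskFunc(unittest.TestCase):
    # A handful of cases exercising the function's behavior, echoing the
    # benchmark's average of 5.6 test cases per task.
    def test_counts(self):
        self.assertEqual(task_func("Dog dog cat!")["dog"], 2)

    def test_empty(self):
        self.assertEqual(task_func(""), Counter())

    def test_punctuation(self):
        self.assertEqual(task_func("hi, hi."), Counter({"hi": 2}))

if __name__ == "__main__":
    unittest.main()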

📝 Abstract
Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, where tasks range from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short, self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability to utilize diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning and an accurate understanding of complex instructions. Fulfilling both of these characteristics poses a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses an average of 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, which automatically transforms the original docstrings into short instructions containing only the essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of at most 60%, significantly lower than the human performance of 97%. These results underscore the need for further advancements in this area.
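The abstract's two evaluation signals, test pass rate and branch coverage, can both be measured with standard tooling. A minimal sketch follows, assuming the third-party coverage package (pip install coverage) and a solution-plus-tests layout like the miniature task above; this is illustrative plumbing, not the benchmark's actual harness:

import unittest
import coverage

def evaluate_solution(test_module, solution_file):
    """Run a task's unittest suite under branch-coverage measurement.

    Returns (passed, branch_coverage_percent). Illustrative only; not
    BigCodeBench's actual evaluation code.
    """
    cov = coverage.Coverage(branch=True, include=[solution_file])
    cov.start()
    suite = unittest.defaultTestLoader.loadTestsFromModule(test_module)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    cov.stop()
    # report() prints a per-file table and returns the total coverage %.
    percent = cov.report(show_missing=False)
    return result.wasSuccessful(), percent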
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to handle diverse function calls for complex tasks
Assessing LLMs' performance in compositional reasoning with multiple tools
Benchmarking code generation accuracy under complex instructions and high coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs with diverse function calls
Evaluating compositional reasoning in code generation
Transforming docstrings into concise task instructions (see the sketch below)
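To make the last item concrete, here is an invented before/after for the kind of rewriting BigCodeBench-Instruct performs, condensing a structured docstring prompt into a short natural-language instruction. Both strings are hypothetical and not taken from the dataset:

# Hypothetical docstring-style prompt, as a model would see it.
DOCSTRING_PROMPT = '''\
Count word frequencies in `text`, ignoring case and punctuation.

Parameters:
    text (str): Input text.

Returns:
    collections.Counter: Lowercase word -> frequency.

Requirements:
    - re
    - collections

Example:
    >>> task_func("Dog dog cat!")["dog"]
    2
'''

# The Instruct variant keeps only the essential information.
INSTRUCT_PROMPT = (
    "Write a function task_func(text) that counts word frequencies in text, "
    "ignoring case and punctuation, and returns a collections.Counter."
)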
👥 Authors
Terry Yue Zhuo (Researcher): Large Language Models, Code Generation, AI4SE, Cybersecurity
Minh Chien Vu (CSIRO’s Data61)
Jenny Chim (Queen Mary University of London): natural language processing, computational linguistics
Han Hu (Singapore Management University)
Wenhao Yu (University of Notre Dame)
Ratnadira Widyasari (Singapore Management University): Computer science
Imam Nur Bani Yusuf (Singapore Management University): code generation, AI for software engineering, deep learning
Haolan Zhan (Monash University): Natural Language Processing, Dialogue Systems, Responsible AI
Junda He (Singapore Management University): software engineering
Indraneil Paul (TU Darmstadt | Amazon Inc. & IIIT Hyderabad): Deep Learning, NLP, Code Generation, Preference Learning, Function Calling
Simon Brunner (Independent)
Chen Gong (University of Virginia)
Thong Hoang (CSIRO’s Data61)
Armel Randy Zebaze (PhD Student, INRIA Paris)
Xiaoheng Hong (Intel)
Wen-Ding Li (Cornell University): Machine Learning
Jean Kaddour (University College London): LLMs
Ming Xu (Independent)
Zhihan Zhang (PhD student, University of Notre Dame): Natural Language Processing
Prateek Yadav (PhD, University of North Carolina Chapel Hill): Continual Learning, MoE, Model Merging, Modular Network, Efficient AI
Naman Jain (UC Berkeley)
Alex Gu (MIT): program synthesis, machine learning, large language models, code generation
Zhoujun Cheng (UC San Diego): Natural Language Processing, Artificial Intelligence
Jiawei Liu (UIUC)
Qian Liu (Sea AI Lab)
Zijian Wang (AWS AI Labs)
David Lo (Singapore Management University)
Binyuan Hui (Qwen Team, Alibaba Group): Large Language Models, CodeLLMs, Reasoning, Agent
Niklas Muennighoff (Stanford University): large language models, artificial intelligence, machine learning
Daniel Fried (Carnegie Mellon University): Natural Language Processing, Machine Learning
Xiaoning Du (Senior Lecturer (equivalent to U.S. Associate Professor), Monash University): Software Engineering, Artificial Intelligence, Cybersecurity, Runtime Verification
Harm de Vries (ServiceNow Research)
Leandro Von Werra (Hugging Face)