BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

📅 2024-06-22
🏛️ arXiv.org
📈 Citations: 103 (12 influential)
🤖 AI Summary
Existing LLM tool-use benchmarks focus on simple, isolated tasks and fail to assess models' ability to coordinate multiple tools and comprehend complex instructions in realistic settings. Method: We introduce BigCodeBench, a fine-grained code-generation benchmark grounded in real-world programming scenarios, comprising 1,140 tasks across seven domains that require invoking diverse functions from 139 widely used libraries. Each task is verified by automatically generated test cases with an average branch coverage of 99%. We further present BigCodeBench-Instruct, a variant that distills each task's docstring into a short natural-language instruction to assess instruction following more directly. Contribution/Results: Across an evaluation of 60 state-of-the-art models, the best performer achieves only 60% accuracy, substantially below human performance (97%), revealing fundamental limitations in current LLMs' capabilities for complex tool orchestration and instruction following.
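For context, a BigCodeBench-style task pairs a function with a docstring and a unittest suite that exercises its branches. The miniature below is a hypothetical sketch in that spirit; the task, task_func, and its tests are invented for illustration and not drawn from the benchmark:

import re
import unittest
from collections import Counter

def task_func(text):
    """Count word frequencies in `text`, ignoring case and punctuation.

    Requirements: re, collections
    Returns: collections.Counter mapping each lowercase word to its count.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))

class TestTaskFunc(unittest.TestCase):
    # A handful of cases exercising the function's behavior, echoing the
    # benchmark's average of 5.6 test cases per task.
    def test_counts(self):
        self.assertEqual(task_func("Dog dog cat!")["dog"], 2)

    def test_empty(self):
        self.assertEqual(task_func(""), Counter())

    def test_punctuation(self):
        self.assertEqual(task_func("hi, hi."), Counter({"hi": 2}))

if __name__ == "__main__":
    unittest.main()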

📝 Abstract
Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, where tasks range from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short, self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability to utilize diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning and an accurate understanding of complex instructions. Fulfilling both of these characteristics poses a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses an average of 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, which automatically transforms the original docstrings into short instructions containing only the essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores of at most 60%, significantly lower than the human performance of 97%. These results underscore the need for further advancements in this area.
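The abstract's two evaluation signals, test pass rate and branch coverage, can both be measured with standard tooling. A minimal sketch follows, assuming the third-party coverage package (pip install coverage) and a solution-plus-tests layout like the miniature task above; this is illustrative plumbing, not the benchmark's actual harness:

import unittest
import coverage

def evaluate_solution(test_module, solution_file):
    """Run a task's unittest suite under branch-coverage measurement.

    Returns (passed, branch_coverage_percent). Illustrative only; not
    BigCodeBench's actual evaluation code.
    """
    cov = coverage.Coverage(branch=True, include=[solution_file])
    cov.start()
    suite = unittest.defaultTestLoader.loadTestsFromModule(test_module)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    cov.stop()
    # report() prints a per-file table and returns the total coverage %.
    percent = cov.report(show_missing=False)
    return result.wasSuccessful(), percent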
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to handle diverse function calls for complex tasks
Assessing LLMs' performance in compositional reasoning with multiple tools
Benchmarking code generation accuracy under complex instructions and high coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLMs with diverse function calls
Evaluating compositional reasoning in code generation
Transforming docstrings into concise task instructions (see the sketch below)
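To make the last item concrete, here is an invented before/after for the kind of rewriting BigCodeBench-Instruct performs, condensing a structured docstring prompt into a short natural-language instruction. Both strings are hypothetical and not taken from the dataset:

# Hypothetical docstring-style prompt, as a model would see it.
DOCSTRING_PROMPT = '''\
Count word frequencies in `text`, ignoring case and punctuation.

Parameters:
    text (str): Input text.

Returns:
    collections.Counter: Lowercase word -> frequency.

Requirements:
    - re
    - collections

Example:
    >>> task_func("Dog dog cat!")["dog"]
    2
'''

# The Instruct variant keeps only the essential information.
INSTRUCT_PROMPT = (
    "Write a function task_func(text) that counts word frequencies in text, "
    "ignoring case and punctuation, and returns a collections.Counter."
)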
👥 Authors
Terry Yue Zhuo (Researcher): Large Language Models, Code Generation, AI4SE, Cybersecurity
Minh Chien Vu (CSIRO’s Data61)
Jenny Chim (Queen Mary University of London): natural language processing, computational linguistics
Han Hu (Singapore Management University)
Wenhao Yu (University of Notre Dame)
Ratnadira Widyasari (Singapore Management University): Computer science
Imam Nur Bani Yusuf (Singapore Management University): code generation, AI for software engineering, deep learning
Haolan Zhan (Monash University): Natural Language Processing, Dialogue Systems, Responsible AI
Junda He (Singapore Management University): software engineering
Indraneil Paul (TU Darmstadt | Amazon Inc. & IIIT Hyderabad): Deep Learning, NLP, Code Generation, Preference Learning, Function Calling
Simon Brunner (Independent)
Chen Gong (University of Virginia)
Thong Hoang (CSIRO’s Data61)
Armel Randy Zebaze (PhD Student, INRIA Paris)
Xiaoheng Hong (Intel)
Wen-Ding Li (Cornell University): Machine Learning
Jean Kaddour (University College London): LLMs
Ming Xu (Independent)
Zhihan Zhang (PhD student, University of Notre Dame): Natural Language Processing
Prateek Yadav (PhD, University of North Carolina Chapel Hill): Continual Learning, MoE, Model Merging, Modular Network, Efficient AI
Naman Jain (UC Berkeley)
Alex Gu (MIT): program synthesis, machine learning, large language models, code generation
Zhoujun Cheng (UC San Diego): Natural Language Processing, Artificial Intelligence
Jiawei Liu (UIUC)
Qian Liu (Sea AI Lab)
Zijian Wang (AWS AI Labs)
David Lo (Singapore Management University)
Binyuan Hui (Qwen Team, Alibaba Group): Large Language Models, CodeLLMs, Reasoning, Agent
Niklas Muennighoff (Stanford University): large language models, artificial intelligence, machine learning
Daniel Fried (Carnegie Mellon University): Natural Language Processing, Machine Learning
Xiaoning Du (Senior Lecturer (equivalent to U.S. Associate Professor), Monash University): Software Engineering, Artificial Intelligence, Cybersecurity, Runtime Verification
Harm de Vries (ServiceNow Research)
Leandro Von Werra (Hugging Face)