ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of large language models lack systematic coverage of multi-step, highly constrained function calling—particularly in realistic settings involving long parameters and large-context API orchestration (e.g., 128k-token contexts). Method: We introduce the first benchmark specifically designed for multi-step function calling across five real-world scenarios under 128k-context constraints. We formally define and quantify three core capabilities: parameter filling, value reasoning, and cross-step API orchestration. We further propose ComplexEval, an automated evaluation framework integrating LLM-driven data synthesis, rule-augmented validators, long-context-aware execution trajectory modeling, and multidimensional metrics (accuracy, consistency, robustness). Contribution/Results: Experiments expose pervasive failures in state-of-the-art models—including step disconnection, erroneous parameter instantiation, and constraint violations. Both the benchmark and implementation are open-sourced, establishing a standardized evaluation paradigm for API-augmented LLMs.

📝 Abstract
Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
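To make the "multi-step" aspect concrete, here is a minimal sketch of the kind of chained function-calling task the benchmark targets, where a later call's parameter must be filled from an earlier call's output. All function names, schemas, and values below are illustrative assumptions, not the benchmark's actual API format.

```python
# Hypothetical two-step tool-calling trajectory (illustrative only).

def search_hotels(city: str, checkin: str, checkout: str) -> dict:
    # Stand-in for a real-time API; returns a fixed mock result here.
    return {"hotel_id": "H123", "city": city,
            "checkin": checkin, "checkout": checkout}

def book_hotel(hotel_id: str, guests: int) -> dict:
    # The second step can only be parameterized with the hotel_id
    # produced by the first step (cross-step parameter reasoning).
    return {"confirmation": f"BOOK-{hotel_id}-{guests}"}

# A model under evaluation must plan both calls and thread the
# intermediate value between them rather than hallucinate it.
step1 = search_hotels("Paris", "2025-03-01", "2025-03-03")
step2 = book_hotel(step1["hotel_id"], guests=2)
print(step2["confirmation"])  # BOOK-H123-2
```

An automatic evaluator along the lines of ComplexEval would then check the emitted call sequence and parameter values against such a reference trajectory rather than only the final answer.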
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Rule-based Multi-step Function Calls
Long Context Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

ComplexFuncBench
ComplexEval
Multistep Function Calling