Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for tool-augmented language models (TaLMs) suffer from uncontrolled numbers of accessible functions, difficulty in adjusting task complexity, and susceptibility to data contamination. This paper introduces FuncBenchGen, the first contamination-free, fully controllable synthetic evaluation framework for TaLMs. It formalizes multi-step tool invocation as traversal over a hidden function-dependency directed acyclic graph (DAG), enabling precise control over key difficulty dimensions such as dependency depth and the number of distractor functions. Its core contributions are a controllable task-generation mechanism and a lightweight state-repair strategy that explicitly restates prior variable values to the model at each step, mitigating LLMs' fragility in state tracking. Experiments across seven LLMs show that reasoning-optimized models consistently outperform general-purpose ones, with GPT-5 achieving the top performance. Performance degrades sharply as task difficulty increases; the restatement strategy boosts GPT-5's success rate from 62.5% to 81.3%.
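The controllable generation mechanism can be illustrated with a small sketch. This is not the paper's released code; the function and variable naming scheme is hypothetical, but it shows how a layered DAG gives direct knobs for dependency depth, width, and distractor count:

```python
import random

def generate_task(depth, width, n_distractors, seed=0):
    """Sketch of FuncBenchGen-style task generation (naming is illustrative).

    Builds a layered function-dependency DAG: each function at level i
    consumes outputs from level i-1, so the target variable requires a
    call chain of length `depth`. Distractor functions consume real
    variables but produce outputs the target never needs.
    """
    rng = random.Random(seed)
    functions = {}                                  # name -> {"inputs": [...], "output": ...}
    initial = [f"v0_{j}" for j in range(width)]     # given variable values
    prev = list(initial)
    for level in range(1, depth + 1):
        current = []
        for j in range(width):
            out = f"v{level}_{j}"
            # each function consumes one or two variables from the previous level
            inputs = rng.sample(prev, k=min(2, len(prev)))
            functions[f"f{level}_{j}"] = {"inputs": inputs, "output": out}
            current.append(out)
        prev = current
    target = prev[0]
    # distractors: plausible-looking functions whose outputs are never required
    variables = initial + [f["output"] for f in functions.values()]
    for d in range(n_distractors):
        inputs = rng.sample(variables, k=min(2, len(variables)))
        functions[f"distractor_{d}"] = {"inputs": inputs, "output": f"d_{d}"}
    return {"functions": functions, "initial": initial, "target": target}
```

Because every task is freshly sampled from a seed, no instance can appear in pretraining data, which is the sense in which the benchmark is contamination-free.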

📝 Abstract
As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., a success rate improvement from 62.5% to 81.3% for GPT-5.
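The "correct call sequence" the abstract describes has a simple reference solution: walk the hidden DAG backwards from the target variable and emit the needed functions in a valid execution order. A minimal sketch, assuming the task representation used above (a `functions` dict mapping each function name to its input variable names and output variable):

```python
def call_sequence(functions, initial, target):
    """Hypothetical reference solver for a FuncBenchGen-style task.

    Resolves the target variable by recursive dependency resolution,
    returning the required function calls in a topological order.
    Distractor functions, whose outputs the target never needs, are
    simply never visited.
    """
    producer = {f["output"]: name for name, f in functions.items()}
    known = set(initial)    # variables whose values are already available
    order = []

    def resolve(var):
        if var in known:
            return
        fn = producer[var]
        for dep in functions[fn]["inputs"]:
            resolve(dep)    # compute all inputs before calling fn
        order.append(fn)
        known.add(var)

    resolve(target)
    return order
```

An evaluated model succeeds when its sequence of tool calls computes the target; comparing its calls against such a reference order is one way to localize where a trajectory went wrong (e.g., a distractor call, or a call issued before its inputs were available).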
Problem

Research questions and friction points this paper is trying to address.

Evaluates tool-augmented language models on multi-step function calling tasks
Addresses data contamination and insufficient control in existing benchmarks
Analyzes performance degradation with increasing dependency depth and distractors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic multi-step tool-use tasks
Models tool use as function-dependency DAG traversal
Introduces explicit variable restatement for state tracking
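The restatement mitigation from the last bullet is lightweight enough to sketch directly. The prompt wording below is illustrative, not the paper's exact template; the idea is only that every previously resolved variable value is repeated verbatim before each step, so the model never has to recover state from earlier turns:

```python
def restate_state(known_values):
    """Sketch of the explicit variable-restatement strategy (wording illustrative).

    Builds a message listing every variable resolved so far, to be
    prepended to the agent's context at each step. This targets the
    observed failure mode where models propagate stale or incorrect
    argument values across turns.
    """
    lines = ["Known variable values so far:"]
    for name, value in sorted(known_values.items()):
        lines.append(f"  {name} = {value!r}")
    return "\n".join(lines)
```

Injecting this message each turn is the change the paper reports lifting GPT-5's success rate from 62.5% to 81.3%.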