🤖 AI Summary
Existing tool-use benchmarks predominantly focus on single-turn interactions, failing to capture the complexities of real-world multi-turn, multi-party collaborative scenarios. Method: We propose DICE-BENCH, the first evaluation framework for high-complexity multi-turn tool invocation. It models function dependencies via tool graphs, simulates multi-role collaboration through embodied multi-agent interaction, and generates natural, cross-turn dependent dialogues using a hybrid rule-and-model approach. We further introduce DICE-SCORE—a novel metric quantifying tool information dispersion—and construct the DICE-BENCH dataset comprising 1,607 high-DICE-SCORE samples. Contribution/Results: Experiments reveal significant performance degradation across 19 state-of-the-art large language models on DICE-BENCH, underscoring the practical challenges in realistic tool orchestration. This work establishes a more realistic evaluation paradigm and provides a high-quality, challenging benchmark for advancing robust, collaborative tool use.
📝 Abstract
Existing function-calling benchmarks focus on single-turn interactions and thus overlook the complexity of real-world scenarios. To quantify how well existing benchmarks reflect practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information, such as function names and parameter values, throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are publicly available: https://snuhcc.github.io/DICE-Bench/.
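The abstract describes DICE-SCORE as measuring how tool-related information (function names, parameter values) is dispersed across dialogue turns. A minimal sketch of one such dispersion proxy, assuming a simple "fraction of contributing turns" definition — the paper's exact DICE-SCORE formula is not reproduced here, and the dialogue and `dispersion_score` helper below are hypothetical:

```python
from typing import List

def dispersion_score(turns: List[str], required_info: List[str]) -> float:
    """Toy dispersion proxy: the fraction of dialogue turns that contribute
    at least one required piece of tool-call information (a function name
    or parameter value). Higher values mean the information an agent must
    collect is scattered across more of the conversation. Illustrative
    only; not the paper's actual DICE-SCORE definition."""
    if not turns or not required_info:
        return 0.0
    contributing = sum(
        1 for turn in turns
        if any(item.lower() in turn.lower() for item in required_info)
    )
    return contributing / len(turns)

# Hypothetical multi-party dialogue: the information needed to call
# book_flight(dest="Paris", date="2024-07-01") spans three of four turns.
turns = [
    "Alice: Let's plan the trip; we should book a flight.",
    "Bob: I'd like to go to Paris.",
    "Alice: Sounds great, any day in mind?",
    "Bob: How about 2024-07-01?",
]
required = ["book a flight", "Paris", "2024-07-01"]
score = dispersion_score(turns, required)  # 3 of 4 turns contribute -> 0.75
```

A single-turn benchmark, where one utterance carries the full call specification, would score low under such a proxy, which matches the abstract's observation that existing benchmarks exhibit notably low DICE-SCORE.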