🤖 AI Summary
Existing tool-use benchmarks predominantly focus on single-turn interactions, failing to capture the complexities of real-world multi-turn, multi-party collaborative scenarios. Method: We propose DICE-BENCH, the first evaluation framework for high-complexity multi-turn tool invocation. It models function dependencies via tool graphs, simulates multi-role collaboration through embodied multi-agent interaction, and generates natural, cross-turn dependent dialogues using a hybrid rule-and-model approach. We further introduce DICE-SCORE—a novel metric quantifying tool information dispersion—and construct the DICE-BENCH dataset comprising 1,607 high-DICE-SCORE samples. Contribution/Results: Experiments reveal significant performance degradation across 19 state-of-the-art large language models on DICE-BENCH, underscoring the practical challenges in realistic tool orchestration. This work establishes a more realistic evaluation paradigm and provides a high-quality, challenging benchmark for advancing robust, collaborative tool use.
📝 Abstract
Existing function-calling benchmarks focus on single-turn interactions and thus overlook the complexity of real-world scenarios. To quantify how well existing benchmarks reflect practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information, such as function names and parameter values, throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are publicly available: https://snuhcc.github.io/DICE-Bench/.
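The abstract describes DICE-SCORE as measuring how tool-related information (function names, parameter values) is dispersed across dialogue turns. A minimal sketch of one such dispersion proxy, assuming a simple "fraction of contributing turns" definition — the paper's exact DICE-SCORE formula is not reproduced here, and the dialogue and `dispersion_score` helper below are hypothetical:

```python
from typing import List

def dispersion_score(turns: List[str], required_info: List[str]) -> float:
    """Toy dispersion proxy: the fraction of dialogue turns that contribute
    at least one required piece of tool-call information (a function name
    or parameter value). Higher values mean the information an agent must
    collect is scattered across more of the conversation. Illustrative
    only; not the paper's actual DICE-SCORE definition."""
    if not turns or not required_info:
        return 0.0
    contributing = sum(
        1 for turn in turns
        if any(item.lower() in turn.lower() for item in required_info)
    )
    return contributing / len(turns)

# Hypothetical multi-party dialogue: the information needed to call
# book_flight(dest="Paris", date="2024-07-01") spans three of four turns.
turns = [
    "Alice: Let's plan the trip; we should book a flight.",
    "Bob: I'd like to go to Paris.",
    "Alice: Sounds great, any day in mind?",
    "Bob: How about 2024-07-01?",
]
required = ["book a flight", "Paris", "2024-07-01"]
score = dispersion_score(turns, required)  # 3 of 4 turns contribute -> 0.75
```

A single-turn benchmark, where one utterance carries the full call specification, would score low under such a proxy, which matches the abstract's observation that existing benchmarks exhibit notably low DICE-SCORE.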