TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks for Tool-Integrated Reasoning (TIR) are limited in data quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. This work proposes TIDE-Bench, a TIR benchmark that combines mathematical reasoning and knowledge-intensive question answering with two newly designed tasks: a tool-grounded experimental design task and a dynamic interactive task. TIDE-Bench uses a task-aware, multidimensional automatic evaluation protocol covering four dimensions: final answer quality, process reliability, tool-use efficiency, and inference cost. It builds its evaluation sets by filtering low-discrimination instances from existing datasets, which reduces evaluation cost and concentrates on more challenging samples. Experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.

📝 Abstract

Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.
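
The abstract does not say how low-discrimination instances are identified. As an illustration only, one common proxy is the variance of per-model correctness on each instance: an instance that every reference model solves, or that every model fails, does not separate models and can be dropped. The sketch below assumes that proxy; the function names, the threshold of 0.1, and the toy results are hypothetical and not taken from the paper.

```python
from statistics import pvariance


def discrimination(outcomes):
    """Population variance of 0/1 outcomes across models.

    Zero when all models agree (all correct or all wrong). This is an
    assumed proxy, not the paper's stated criterion.
    """
    return pvariance(outcomes)


def filter_discriminative(results, threshold=0.1):
    """Return instance ids whose outcome variance exceeds the threshold.

    `results` maps an instance id to a list of 0/1 correctness flags,
    one per reference model. The threshold is a hypothetical choice.
    """
    return [
        instance_id
        for instance_id, outcomes in results.items()
        if discrimination(outcomes) > threshold
    ]


# Toy data: four instances scored by four hypothetical models.
results = {
    "q1": [1, 1, 1, 1],  # everyone solves it: no signal
    "q2": [0, 0, 0, 0],  # everyone fails it: no signal
    "q3": [1, 0, 1, 0],  # models split evenly
    "q4": [1, 0, 0, 0],  # only one model solves it
}
kept = filter_discriminative(results)
print(kept)
```

Under this proxy the retained set is smaller than the original pool, which is the mechanism by which filtering would reduce evaluation cost; whether TIDE-Bench uses variance or another statistic is not stated in the abstract.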

Problem

Research questions and friction points this paper is trying to address.

Tool-Integrated Reasoning

evaluation benchmark

task diversity

diagnostic comprehensiveness

evaluation efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-integrated reasoning

evaluation benchmark

task-aware evaluation

multi-tool coordination

diagnostic comprehensiveness

🔎 Similar Papers

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

2024-07-16 · arXiv.org · Citations: 24