🤖 AI Summary
Existing evaluation benchmarks for Tool-Integrated Reasoning (TIR) are limited in data quality, task diversity, diagnostic comprehensiveness, and assessment efficiency. This work proposes TIDE-Bench, a comprehensive TIR benchmark that combines mathematical reasoning and knowledge-intensive question answering with two newly designed tasks: a tool-grounded experimental design task and a dynamic interactive task. TIDE-Bench uses a task-aware, multidimensional automatic evaluation protocol covering four dimensions: answer correctness, process reliability, tool-use efficiency, and inference cost. It builds its evaluation sets by filtering low-discrimination instances out of existing datasets, which focuses evaluation on more challenging samples and substantially reduces evaluation cost. Experiments reveal persistent bottlenecks in current models' ability to ground reasoning in tool use, offering insights for future research.
📝 Abstract
Tool-integrated reasoning (TIR) has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality, unified evaluation benchmark: existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, with three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol that jointly measures final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality, discriminative evaluation sets by filtering low-discrimination instances out of existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.
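
The abstract does not state how low-discrimination instances are identified. As an illustration only, the sketch below assumes one simple criterion: an instance is scored by the variance of its pass/fail outcome across a set of reference models, `p * (1 - p)`, and instances that nearly all models solve or nearly all fail are dropped. The function names, the `min_score` threshold, and the sample data are hypothetical and are not taken from TIDE-Bench.

```python
from typing import Dict, List


def discrimination(outcomes: List[bool]) -> float:
    """Variance of a pass/fail outcome across models: p * (1 - p).

    Assumed criterion for illustration; not the paper's stated metric.
    """
    if not outcomes:
        raise ValueError("need at least one model outcome")
    p = sum(outcomes) / len(outcomes)  # fraction of models that solved the item
    return p * (1.0 - p)


def filter_discriminative(
    results: Dict[str, List[bool]], min_score: float = 0.1
) -> List[str]:
    """Keep item ids whose outcomes differ enough across models."""
    return [
        item_id
        for item_id, outcomes in results.items()
        if discrimination(outcomes) >= min_score
    ]


# Hypothetical per-item outcomes for four reference models.
results = {
    "easy": [True, True, True, True],        # every model solves it
    "impossible": [False, False, False, False],  # no model solves it
    "mixed": [True, False, True, False],     # separates the models
}
kept = filter_discriminative(results)
print(kept)
```

Under this assumed criterion, items with identical outcomes across all models score zero and are removed, which is how a filter of this kind could shrink the evaluation set while keeping the samples that separate models.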