TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward models (PRMs) struggle to model structured operations—such as subtable retrieval and schema interaction—in table reasoning, leading to ineffective supervision. To address this, we propose TaTToo, the first framework to embed tool invocation into reward modeling, explicitly representing multi-step table reasoning and enabling fine-grained, interpretable process supervision via tool execution verification. Our method introduces a tool-execution-driven data construction pipeline and a two-stage training paradigm: cold-start supervised fine-tuning followed by tool-augmented reinforcement learning. Evaluated on five table reasoning benchmarks, TaTToo boosts downstream large language models by an average of 30.9%. Notably, its 8B-parameter variant significantly outperforms strong baselines—including the 72B-parameter Qwen-PRM—demonstrating superior cross-task generalization.

📝 Abstract
Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
Problem

Research questions and friction points this paper is trying to address.

Existing PRMs struggle with table-specific operations in tabular reasoning
Current models lack tool-based verification for precise reward supervision
Performance bottlenecks exist in sub-table retrieval and schema interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-grounded PRM framework for tabular reasoning
Integrates tool-based verification for reward supervision
Dual-stage training: cold-start supervised fine-tuning followed by tool-grounded reinforcement learning
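To make the reward-supervision idea concrete, below is a minimal sketch of how a step-level PRM can drive test-time scaling via best-of-N selection: each candidate reasoning trace is scored step by step, scores are aggregated (min-aggregation is one common choice), and the highest-scoring trace wins. The `toy_scorer` heuristic and all names here are illustrative assumptions, not TaTToo's actual model or API.

```python
from typing import Callable, List

def best_of_n(
    candidates: List[List[str]],          # each candidate = a list of reasoning steps
    step_scorer: Callable[[str], float],  # PRM stand-in: step -> reward in [0, 1]
) -> int:
    """Return the index of the candidate whose worst step scores highest
    (min-aggregation over step rewards, a common PRM scoring choice)."""
    best_idx, best_score = -1, float("-inf")
    for i, steps in enumerate(candidates):
        score = min(step_scorer(s) for s in steps)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy scorer standing in for a trained PRM: it favors steps whose table
# operation was confirmed by tool execution (hypothetical heuristic).
def toy_scorer(step: str) -> float:
    return 0.9 if "verified" in step else 0.4

candidates = [
    ["retrieve subtable", "sum revenue column"],                    # unverified steps
    ["retrieve subtable (verified)", "sum revenue (verified)"],     # tool-verified steps
]
print(best_of_n(candidates, toy_scorer))  # → 1
```

In a real system the scorer would be the trained PRM, and the candidates would be sampled from the downstream policy model; the selection logic itself stays this simple.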