Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing general-purpose process reward models, which struggle to detect silent errors in data analysis agents and often misclassify exploratory behavior as grounding failures. To overcome these challenges, the authors propose DataPRM, an environment-aware generative process reward model tailored to agentic data analysis. DataPRM actively verifies intermediate execution states by interacting with the environment and uses a reflection-aware ternary reward strategy to distinguish correctable grounding errors from irrecoverable mistakes. The model is trained on more than 8K instances built through diversity-driven trajectory generation and knowledge-augmented step-level annotation. With Best-of-N inference, DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep. When integrated into reinforcement learning, it reaches 78.73% on DABench and 64.84% on TableBench, a substantial gain over outcome-reward baselines.
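
To make the notion of a silent error concrete, here is a minimal Python sketch. It is not taken from the paper: the data, the `-1` sentinel convention, and all function names are hypothetical and chosen only for illustration. The naive analysis step runs without raising any exception yet returns a wrong statistic, while a simple probe of the intermediate state exposes the problem.

```python
from statistics import mean

# Hypothetical readings; -1 is an assumed sentinel for "missing", not a real value.
readings = [12.0, 15.0, -1, 14.0, -1, 13.0]


def naive_mean(values):
    """Runs without any exception, but silently averages the sentinel as data."""
    return mean(values)


def checked_mean(values, sentinel=-1):
    """Drops sentinel entries before averaging."""
    valid = [v for v in values if v != sentinel]
    return mean(valid)


def probe_for_sentinel(values, sentinel=-1):
    """Verifier-style probe: count suspicious entries in the intermediate state."""
    return sum(1 for v in values if v == sentinel)


print(naive_mean(readings))          # about 8.67, wrong but no error is raised
print(checked_mean(readings))        # 13.5
print(probe_for_sentinel(readings))  # 2
```

A reward model that only reads the final output or the interpreter log has no signal that the first result is wrong; checking the intermediate state is what surfaces the error.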

📝 Abstract

Process Reward Models (PRMs) have achieved remarkable success in augmenting the reasoning capabilities of Large Language Models (LLMs) within static domains such as mathematics. However, their potential in dynamic data analysis tasks remains underexplored. In this work, we first present an empirical study revealing that general-domain PRMs struggle to supervise data analysis agents. Specifically, they fail to detect silent errors (logical flaws that yield incorrect results without triggering interpreter exceptions) and erroneously penalize exploratory actions, mistaking necessary trial-and-error exploration for grounding failures. To bridge this gap, we introduce DataPRM, a novel environment-aware generative process reward model that (1) can serve as an active verifier, autonomously interacting with the environment to probe intermediate execution states and uncover silent errors, and (2) employs a reflection-aware ternary reward strategy that distinguishes between correctable grounding errors and irrecoverable mistakes. We design a scalable pipeline to construct over 8K high-quality training instances for DataPRM via diversity-driven trajectory generation and knowledge-augmented step-level annotation. Experimental results demonstrate that DataPRM improves downstream policy LLMs by 7.21% on ScienceAgentBench and 11.28% on DABStep using Best-of-N inference. Notably, with only 4B parameters, DataPRM outperforms strong baselines and exhibits robust generalizability across diverse Test-Time Scaling strategies. Furthermore, integrating DataPRM into Reinforcement Learning yields substantial gains over outcome-reward baselines, achieving 78.73% on DABench and 64.84% on TableBench, validating the effectiveness of process reward supervision. Code is available at https://github.com/zjunlp/DataMind.
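
The Best-of-N setup with a three-way step label can be sketched roughly as follows. This is an illustration under stated assumptions, not the DataPRM implementation: the label names, the numeric values in `STEP_REWARD`, the rule of zeroing a trajectory that contains an irrecoverable step, and the keyword-based `toy_labeler` (a stand-in for a real reward model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumed numeric values for the three step labels; the abstract does not give them.
STEP_REWARD = {"correct": 1.0, "correctable": 0.5, "irrecoverable": 0.0}


@dataclass
class Trajectory:
    steps: List[str]
    answer: str


def trajectory_score(labels: Sequence[str]) -> float:
    """Mean step reward, set to 0 if any step is irrecoverable (assumed rule)."""
    if not labels or "irrecoverable" in labels:
        return 0.0
    return sum(STEP_REWARD[label] for label in labels) / len(labels)


def best_of_n(
    candidates: Sequence[Trajectory],
    label_steps: Callable[[Trajectory], List[str]],
) -> Trajectory:
    """Return the candidate whose step labels give the highest score."""
    return max(candidates, key=lambda t: trajectory_score(label_steps(t)))


def toy_labeler(traj: Trajectory) -> List[str]:
    """Keyword-based stand-in for a process reward model (hypothetical)."""
    labels = []
    for step in traj.steps:
        if "drop all rows" in step:
            labels.append("irrecoverable")
        elif "KeyError" in step:
            labels.append("correctable")  # a failed attempt that was later fixed
        else:
            labels.append("correct")
    return labels


a = Trajectory(["load csv", "drop all rows", "report mean"], "0")
b = Trajectory(["load csv", "KeyError on 'Price'; retry with 'price'", "report mean"], "13.5")
best = best_of_n([a, b], toy_labeler)
print(best.answer)  # 13.5
```

In this sketch the trajectory with a corrected lookup failure still scores well, whereas the one that destroys the data scores zero, which mirrors the distinction between correctable grounding errors and irrecoverable mistakes.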

Problem

Research questions and friction points this paper is trying to address.

Process Reward Models

silent errors

data analysis agents

grounding failures

dynamic data analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Process Reward Modeling

Data Analysis Agents

Silent Error Detection