🤖 AI Summary
In real-world robotic reinforcement learning, reward function design remains challenging: existing process reward models (PRMs) lack step awareness and rely on single-view perception, and reward shaping theory remains underdeveloped, often leading to semantic traps. Method: we propose a step-aware, multi-view-fused process reward modeling framework that (i) introduces step-level reward discretization and multi-view reward fusion, and (ii) establishes a policy-invariant reward shaping theory that provably eliminates misleading optimization. Using over 3,400 hours of multi-view robotic manipulation data, we train a General Reward Model (GRM), integrated with the Dopamine-RL framework and one-shot task adaptation. Results: GRM achieves state-of-the-art evaluation accuracy; on novel tasks, a single expert demonstration plus 150 online interactions (~1 hour) suffices to raise the policy success rate from ≈0% to 95%, demonstrating strong generalization.
📝 Abstract
The primary obstacle to applying reinforcement learning (RL) to real-world robotics is the design of effective reward functions. While recently proposed learning-based Process Reward Models (PRMs) are a promising direction, they are often hindered by two fundamental limitations: their reward models lack step-aware understanding and rely on single-view perception, leading to unreliable assessments of fine-grained manipulation progress; and their reward shaping procedures are theoretically unsound, often inducing a semantic trap that misguides policy optimization. To address these issues, we introduce Dopamine-Reward, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. At its core is our General Reward Model (GRM), trained on a 3,400+ hour dataset, which leverages Step-wise Reward Discretization for structural understanding and Multi-Perspective Reward Fusion to overcome perceptual limitations. Building on Dopamine-Reward, we propose Dopamine-RL, a robust policy learning framework that employs a theoretically sound Policy-Invariant Reward Shaping method, enabling the agent to leverage dense rewards for efficient self-improvement without altering the optimal policy, thereby fundamentally avoiding the semantic trap. Extensive experiments across diverse simulated and real-world tasks validate our approach: GRM achieves state-of-the-art accuracy in reward assessment, and Dopamine-RL built on GRM significantly improves policy learning efficiency. For instance, after GRM is adapted to a new task in a one-shot manner from a single expert trajectory, the resulting reward model enables Dopamine-RL to improve the policy from near-zero to 95% success with only 150 online rollouts (approximately 1 hour of real-robot interaction), while retaining strong generalization across tasks. Project website: https://robo-dopamine.github.io
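The classical way to make reward shaping policy-invariant is potential-based shaping (Ng et al., 1999), where the shaping bonus is the discounted difference of a state potential. The abstract does not give Dopamine-RL's exact formulation, so the sketch below is only an illustration of the general principle, with hypothetical function and variable names:

```python
# Hypothetical sketch of potential-based reward shaping, the classical
# policy-invariant scheme; Dopamine-RL's actual formulation may differ.

def shaped_reward(r, phi_s, phi_s_next, gamma=0.99):
    # F(s, a, s') = gamma * Phi(s') - Phi(s); adding F to the environment
    # reward leaves the optimal policy unchanged.
    return r + gamma * phi_s_next - phi_s

def shaped_return(rewards, potentials, gamma=1.0):
    # potentials[t] is Phi(s_t) for t = 0..T; potentials[T] is the
    # terminal state's potential. Shaping terms telescope, so with
    # gamma = 1 the shaped return equals the original return plus
    # Phi(s_T) - Phi(s_0) -- a constant offset that cannot change
    # which policy is optimal.
    total = 0.0
    for t, r in enumerate(rewards):
        total += shaped_reward(r, potentials[t], potentials[t + 1], gamma)
    return total

rewards = [0.0, 0.0, 1.0]            # sparse task reward at the final step
potentials = [0.2, 0.5, 0.9, 0.0]    # e.g. a PRM's progress estimate; terminal Phi = 0
dense_return = shaped_return(rewards, potentials, gamma=1.0)
# Equals sum(rewards) - Phi(s_0) = 1.0 - 0.2 = 0.8: dense guidance, same ranking.
```

This telescoping property is what lets a dense, learned progress signal (like GRM's step-level rewards, if used as a potential) guide exploration without introducing the semantic trap of directly optimizing a possibly-misleading dense reward.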