ARM: Advantage Reward Modeling for Long-Horizon Manipulation

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses credit assignment in long-horizon robotic manipulation, where sparse rewards provide little learning signal and dense progress rewards are costly to annotate and ill-suited to non-monotonic behaviors such as backtracking and recovery. The authors propose Advantage Reward Modeling (ARM), a framework that replaces absolute progress estimation with relative advantage comparisons between actions. ARM relies on low-cognitive-burden ternary human labels (progress, regression, or stagnation) for consistent yet inexpensive supervision, automatically generates rewards for both complete demonstrations and fragmented DAgger-style data, and feeds these rewards into an offline reinforcement learning pipeline. Evaluated on a challenging long-horizon towel-folding task, ARM achieves a 99.4% success rate, substantially outperforming existing Vision-Language-Action (VLA) baselines while demonstrating strong data efficiency and training stability.
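The page includes no code, but the summary's core idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch setup for a tri-state advantage reward model trained on ternary human labels, assuming precomputed observation embeddings; every name here (AdvantageRewardModel, the embedding size, the toy batch) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of a tri-state advantage reward model (assumption:
# PyTorch, precomputed observation embeddings; not the paper's code).
import torch
import torch.nn as nn

# Ternary label for a transition (s_t -> s_{t+k}): did the action help?
PROGRESSIVE, STAGNANT, REGRESSIVE = 0, 1, 2

class AdvantageRewardModel(nn.Module):
    """Classifies a pair of observation embeddings into one of three
    advantage classes, instead of regressing absolute task progress."""
    def __init__(self, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # logits for {progress, stagnant, regress}
        )

    def forward(self, obs_t: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([obs_t, obs_next], dim=-1))

# One training step on human ternary labels (3-way cross-entropy).
model = AdvantageRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
obs_t, obs_next = torch.randn(32, 256), torch.randn(32, 256)  # toy batch
labels = torch.randint(0, 3, (32,))
loss = nn.functional.cross_entropy(model(obs_t, obs_next), labels)
opt.zero_grad(); loss.backward(); opt.step()
```

Framing the label as a three-way classification over state pairs, rather than a regression to absolute progress, is what keeps annotation cheap and tolerant of non-monotonic trajectories.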
📝 Abstract
Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy -- Progressive, Regressive, and Stagnant -- that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.
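To connect the abstract's "adaptive action-reward reweighting" to something concrete, here is one plausible (but assumed) way to collapse the model's ternary probabilities into a scalar reward and use it to downweight suboptimal samples in a weighted behavior-cloning or offline RL loss; the exponential weighting follows the spirit of advantage-weighted regression and is not claimed to be the paper's exact scheme.

```python
# Hedged sketch: scalar rewards from ternary class probabilities, then
# advantage-based sample reweighting (assumption, not the paper's pipeline).
import torch

@torch.no_grad()
def arm_reward(model, obs_t, obs_next):
    """Map class probabilities to a scalar reward in [-1, 1]:
    +1 for progress (class 0), 0 for stagnation, -1 for regression (class 2)."""
    probs = model(obs_t, obs_next).softmax(dim=-1)
    return probs[:, 0] - probs[:, 2]

def reweighted_bc_loss(policy_logprob, rewards, beta: float = 1.0):
    """Exponentially upweight transitions the reward model judges as
    progressing, softly filtering out regressive (suboptimal) samples."""
    weights = torch.exp(rewards / beta).clamp(max=20.0)  # cap to avoid blow-up
    return -(weights * policy_logprob).mean()
```

Because regressive transitions receive weights below 1, this acts as the soft filter of suboptimal samples that the abstract describes, without discarding data outright.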
Problem

Research questions and friction points this paper is trying to address.

long-horizon manipulation
sparse rewards
credit assignment
dense progress rewards
non-monotonic behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Reward Modeling
long-horizon manipulation
tri-state labeling
offline reinforcement learning
progress estimation
Authors
Yiming Mao (LimX Dynamics)
Zixi Yu (LimX Dynamics; Beijing University of Posts and Telecommunications)
Weixin Mao (LimX Dynamics)
Yinhao Li (LimX Dynamics)
Qirui Hu (LimX Dynamics)
Zihan Lan (LimX Dynamics)
Minzhao Zhu (LimX Dynamics)
Hua Chen (Assistant Professor, ZJU-UIUC Institute; Co-founder, LimX Dynamics)
Robotics · Embodied AI · Robot Learning · Reinforcement Learning · Control