SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-horizon, contact-rich manipulation of deformable objects (e.g., T-shirt folding) suffers from heterogeneous demonstration quality and from the poor generalization of conventional frame-level reward modeling. The paper proposes a stage-aware video reward modeling framework that automatically generates temporal reward labels from natural-language subtask annotations and jointly predicts high-level task stages and fine-grained progress, thereby relaxing restrictive fixed-horizon assumptions. Built on video understanding models and combined with Reward-Aligned Behavior Cloning (RA-BC), the framework supports sample reweighting and filtering of high-quality demonstrations. Evaluated in simulation and on real robots, the method achieves an 83% success rate on folding T-shirts from the flattened state and 67% from the crumpled state, substantially outperforming vanilla behavior cloning (8% and 0%, respectively), while also generalizing well to out-of-distribution settings.
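The key labeling idea in the summary, assigning each frame a stage index plus within-stage progress so that variable-length demonstrations share one 0-to-1 progress scale, can be sketched as below. This is a minimal illustration under assumed inputs (per-stage frame boundaries derived from the subtask annotations), not the paper's exact implementation.

```python
# Hypothetical sketch: deriving dense reward labels from ordered
# subtask (stage) annotations. Each frame gets (stage index,
# global progress), so demos of different durations map onto a
# consistent [0, 1] progress scale instead of raw frame indices.

def reward_labels(frame_count, stage_boundaries):
    """stage_boundaries: ordered list of (start_frame, end_frame)
    per stage. Returns per-frame (stage_idx, progress in [0, 1])."""
    n_stages = len(stage_boundaries)
    labels = []
    for t in range(frame_count):
        for k, (s, e) in enumerate(stage_boundaries):
            if s <= t <= e:
                within = (t - s) / max(e - s, 1)    # fraction through stage k
                progress = (k + within) / n_stages  # global progress in [0, 1]
                labels.append((k, progress))
                break
    return labels

# A 10-frame demo with two stages of unequal length still spans 0..1:
labels = reward_labels(10, [(0, 3), (4, 9)])
```

Because progress is normalized by stage rather than by absolute frame index, a slow and a fast demonstration of the same subtask receive the same labels at corresponding stages, which is the property that fixed-horizon frame-index labeling lacks.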

📝 Abstract
Large-scale robot learning has recently shown promise for enabling robots to perform complex tasks by integrating perception, control, and language understanding. Yet, it struggles with long-horizon, contact-rich manipulation such as deformable object handling, where demonstration quality is inconsistent. Reward modeling offers a natural solution: by providing grounded progress signals, it transforms noisy demonstrations into stable supervision that generalizes across diverse trajectories. We introduce a stage-aware, video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress. Reward labels are automatically derived from natural language subtask annotations, ensuring consistent progress estimation across variable-length demonstrations. This design overcomes frame-index labeling, which fails in variable-duration tasks like folding a T-shirt. Our reward model demonstrates robustness to variability, generalization to out-of-distribution settings, and strong utility for policy training. Building on it, we propose Reward-Aligned Behavior Cloning (RA-BC), which filters high-quality data and reweights samples by reward. Experiments show the reward model alone outperforms baselines on validation and real robot rollouts. Integrated into RA-BC, our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state -- far surpassing vanilla behavior cloning, which attains only 8% and 0% success. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon manipulation.
Problem

Research questions and friction points this paper is trying to address.

Addressing inconsistent demonstration quality in long-horizon robot manipulation
Overcoming frame-index labeling limitations in variable-duration tasks
Improving policy training robustness and generalization in contact-rich manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-aware reward modeling for long-horizon manipulation
Automated reward labels from natural language subtasks
Reward-Aligned Behavior Cloning filters and reweights samples
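The RA-BC idea in the last bullet, filtering out low-reward demonstrations and reweighting the rest in the imitation loss, might look like the following. The threshold and the reward-proportional weighting rule are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of Reward-Aligned Behavior Cloning (RA-BC)
# reweighting: samples below a reward threshold are filtered out
# (weight 0); surviving samples get reward-proportional weights
# normalized so the kept samples average to weight 1.

def ra_bc_weights(sample_rewards, keep_threshold=0.2):
    """Return one weight per sample for a weighted BC loss."""
    kept = [r if r >= keep_threshold else 0.0 for r in sample_rewards]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(sample_rewards)  # nothing passes the filter
    n_kept = sum(1 for r in kept if r > 0.0)
    return [r * n_kept / total for r in kept]

# Low-reward samples drop out; the rest are scaled by reward.
w = ra_bc_weights([0.9, 0.1, 0.5])
```

In training, each weight would multiply that sample's behavior-cloning loss term, so high-reward (high-progress) demonstrations dominate the gradient while noisy ones contribute little or nothing.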