🤖 AI Summary
Long-horizon, contact-rich manipulation of deformable objects (e.g., T-shirt folding) suffers from heterogeneous demonstration quality, and conventional frame-level reward modeling generalizes poorly across such demonstrations. Method: We propose a stage-aware video reward modeling framework that automatically generates temporal reward labels from natural-language subtask annotations and jointly predicts high-level task stages and fine-grained progress, thereby relaxing restrictive fixed-horizon assumptions. Combined with a video understanding backbone and Reward-Aligned Behavior Cloning (RA-BC), the framework supports reward-based sample reweighting and filtering of high-quality demonstrations. Results: Evaluated on held-out validation data and real-robot rollouts, the method achieves an 83% success rate on folding flattened T-shirts and 67% on crumpled ones, substantially outperforming vanilla behavior cloning (8% and 0%, respectively), and shows strong out-of-distribution generalization and utility for policy learning.
📝 Abstract
Large-scale robot learning has recently shown promise for enabling robots to perform complex tasks by integrating perception, control, and language understanding. Yet it struggles with long-horizon, contact-rich manipulation such as deformable object handling, where demonstration quality is inconsistent. Reward modeling offers a natural solution: by providing grounded progress signals, it transforms noisy demonstrations into stable supervision that generalizes across diverse trajectories. We introduce a stage-aware, video-based reward modeling framework that jointly predicts high-level task stages and fine-grained progress. Reward labels are automatically derived from natural-language subtask annotations, ensuring consistent progress estimation across variable-length demonstrations. This design avoids the pitfalls of frame-index labeling, which fails in variable-duration tasks such as folding a T-shirt. Our reward model demonstrates robustness to demonstration variability, generalization to out-of-distribution settings, and strong utility for policy training. Building on it, we propose Reward-Aligned Behavior Cloning (RA-BC), which filters high-quality data and reweights samples by reward. Experiments show the reward model alone outperforms baselines on validation and on real robot rollouts. Integrated into RA-BC, our approach achieves 83% success on folding T-shirts from the flattened state and 67% from the crumpled state -- far surpassing vanilla behavior cloning, which attains only 8% and 0% success. Overall, our results highlight reward modeling as a key enabler for scalable, annotation-efficient, and robust imitation learning in long-horizon manipulation.