VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

📅 2024-05-26
🏛️ arXiv.org
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing vision-instruction correlation (VIC) methods, which learn reward models for long-horizon manipulation tasks from action-free videos and language instructions, suffer from weak sub-stage awareness, inadequate modeling of task complexity, and insufficient object state estimation. Method: We propose a hierarchical VIC reward model featuring a novel learnable stage detector and a motion progress evaluator, enabling precise quantification of task progress at multiple granularities; it integrates cross-modal representation learning, temporal action segmentation modeling, and hierarchical reward synthesis. Contribution/Results: Evaluated on both simulation and real-robot platforms, the approach significantly improves success rates on long-horizon tasks, improving over the best existing VIC baseline by 43%. The model bridges the gap between high-level language instructions and fine-grained visual dynamics without requiring explicit action annotations, enabling robust reward shaping for complex, temporally extended manipulation tasks.
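The card includes no code; as a minimal sketch of what such a hierarchical reward head might look like, the following PyTorch module pairs a stage classifier with a per-stage progress regressor. All module names, dimensions, and the reward composition `(stage_index + progress) / num_stages` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class HierarchicalVICReward(nn.Module):
    """Sketch of a hierarchical vision-instruction reward head.

    Assumes precomputed visual and instruction embeddings (e.g., from a
    frozen cross-modal encoder). A stage detector classifies the current
    sub-stage; a motion progress evaluator scores progress within it.
    Hypothetical design, not the paper's released code.
    """

    def __init__(self, visual_dim: int, text_dim: int, num_stages: int,
                 hidden: int = 256):
        super().__init__()
        fused = visual_dim + text_dim
        self.stage_detector = nn.Sequential(
            nn.Linear(fused, hidden), nn.ReLU(),
            nn.Linear(hidden, num_stages),
        )
        self.progress_evaluator = nn.Sequential(
            nn.Linear(fused + num_stages, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # within-stage progress in [0, 1]
        )
        self.num_stages = num_stages

    def forward(self, visual_emb: torch.Tensor,
                text_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([visual_emb, text_emb], dim=-1)
        stage_probs = self.stage_detector(x).softmax(dim=-1)
        progress = self.progress_evaluator(
            torch.cat([x, stage_probs], dim=-1)).squeeze(-1)
        # Coarse credit for stages already reached, fine-grained credit
        # for motion progress within the detected stage.
        stage_idx = stage_probs.argmax(dim=-1).float()
        return (stage_idx + progress) / self.num_stages
```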

📝 Abstract
We study reward models for long-horizon manipulation tasks by learning from action-free videos and language instructions, which we term the vision-instruction correlation (VIC) problem. Recent advancements in cross-modality modeling have highlighted the potential of reward modeling through visual and language correlations. However, existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. VICtoR precisely assesses task progress at various levels through a novel stage detector and motion progress evaluator, offering effective guidance for agents learning the task. To validate the effectiveness of VICtoR, we conducted extensive experiments in both simulated and real-world environments. The results show that VICtoR outperforms the best existing VIC methods, achieving a 43% improvement in success rates for long-horizon tasks.
Problem

Research questions and friction points this paper is trying to address.

Learning dense rewards for long-horizon manipulation from action-free videos and language instructions, without action annotations
Existing VIC methods lack sub-stage awareness, model task complexity poorly, and estimate object states inadequately
Sparse, uninformative reward signals keep RL success rates low on temporally extended tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical VIC reward model (VICtoR) that assesses task progress at both stage and motion levels
Learnable stage detector combined with a motion progress evaluator (a toy invocation follows this list)
43% higher success rates than the best existing VIC baseline, in both simulation and on a real robot
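Tying these together, a toy invocation of the sketch above on random embeddings; the dimensions and batch size are illustrative, not from the paper:

```python
import torch

# Toy shapes only; real embeddings would come from a cross-modal encoder.
model = HierarchicalVICReward(visual_dim=512, text_dim=512, num_stages=4)
frames = torch.randn(8, 512)        # batch of per-frame visual embeddings
instruction = torch.randn(8, 512)   # instruction embedding, repeated per frame
rewards = model(frames, instruction)
print(rewards.shape)  # torch.Size([8]) -> one dense reward per frame
```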