Action-Free Reasoning for Policy Generalization

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Imitation learning from human demonstration videos faces two key challenges: the absence of action labels in such videos and poor generalization across embodiments with significant morphological differences. Method: We propose a reasoning-driven paradigm for policy generalization. We introduce the first large-scale human hand-manipulation video dataset (3,377 clips) annotated with natural-language reasoning traces, compatible with the Bridge V2 benchmark. We design a multimodal vision-language reasoning model jointly trained on robot demonstration data (with both reasoning and action labels) and human video data (with reasoning labels only). Contribution/Results: Our core innovation replaces non-transferable action trajectories with transferable, interpretable reasoning processes to explicitly bridge the embodiment gap. Experiments demonstrate substantial improvements in cross-embodiment zero-shot task success rates, effective generalization to unseen tasks, and consistent performance gains as the reasoning-data scale increases.

📝 Abstract
End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but the resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos (essential for guiding robot actions) to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations, compatible with the Bridge V2 benchmark and aimed at facilitating future research on reasoning-driven robot learning. Our experiments show that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. Project page: https://rad-generalization.github.io
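
The mixed-supervision idea described in the abstract (action loss on robot demonstrations, reasoning loss on both robot and action-free human data) can be illustrated with a small loss sketch. The snippet below is a minimal PyTorch-style illustration under stated assumptions, not the authors' implementation: the class name, tensor shapes, and the `has_action_label` mask are hypothetical, and it simply zeroes out the action term for samples that come from action-free human clips.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointReasoningActionLoss(nn.Module):
    """Sketch of a joint objective for reasoning + action supervision.

    Robot demonstrations supervise both the reasoning trace and the
    low-level actions; action-free human clips supervise only the
    reasoning trace, so their action term is masked out.
    """

    def __init__(self, action_weight: float = 1.0):
        super().__init__()
        self.action_weight = action_weight

    def forward(
        self,
        reasoning_logits: torch.Tensor,   # (B, T, vocab) predicted reasoning tokens
        reasoning_targets: torch.Tensor,  # (B, T) target reasoning token ids
        pred_actions: torch.Tensor,       # (B, action_dim) predicted actions
        target_actions: torch.Tensor,     # (B, action_dim) labels (dummy for human clips)
        has_action_label: torch.Tensor,   # (B,) bool, True for robot demonstrations
    ) -> torch.Tensor:
        # Reasoning loss: token-level cross-entropy, applied to every sample.
        reasoning_loss = F.cross_entropy(
            reasoning_logits.flatten(0, 1), reasoning_targets.flatten()
        )

        # Action loss: per-sample regression error, zeroed for action-free samples.
        per_sample = F.mse_loss(pred_actions, target_actions, reduction="none").mean(dim=-1)
        mask = has_action_label.float()
        action_loss = (per_sample * mask).sum() / mask.sum().clamp(min=1.0)

        return reasoning_loss + self.action_weight * action_loss
```

In such a setup, robot and human samples can be mixed freely within a batch; the mask keeps the action head trained only where ground-truth actions exist, while the reasoning head benefits from the larger action-free corpus.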
Problem

Research questions and friction points this paper is trying to address.

Generalizing robot policies to new settings using human video data.
Leveraging language-based reasoning from action-free human videos.
Bridging the embodiment gap between human and robot actions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-based reasoning integration
Action-free human video utilization
Reasoning through Action-free Data (RAD)