🤖 AI Summary
To address weak generalization in robotic manipulation, the scarcity of affordance annotations, and the difficulty of cross-context transfer, this paper proposes a novel paradigm for learning manipulable affordance knowledge from videos of human behavior. We introduce HOVA-500K, the first large-scale image-level affordance dataset, with 500K annotated samples. We propose GLOVER++, a global-to-local collaborative cross-modal transfer framework that integrates multimodal contrastive learning, vision-action joint embedding, hierarchical attention, and open-vocabulary semantic alignment. We further establish a standardized multimodal benchmark dedicated to affordance understanding. Our method achieves state-of-the-art performance on HOVA-500K and significantly improves generalization across scenes, objects, and tasks. It also enables zero-shot action reasoning and embodied decision-making, demonstrating robust transferability and scalability for real-world robotic applications.
📄 Abstract
Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence, particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multimodal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.
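As a concrete illustration of the open-vocabulary affordance reasoning described above, the sketch below shows one plausible inference contract: an RGB image plus a free-form action query in, a per-pixel affordance heatmap out, with the heatmap peak taken as the actionable point. This is a minimal sketch under stated assumptions; the function `predict_affordance` and its Gaussian stub are hypothetical placeholders so the example runs stand-alone, not the released GLOVER++ API.

```python
# Hypothetical sketch of open-vocabulary affordance reasoning: given an RGB
# image and a free-form action query, a model returns a per-pixel affordance
# heatmap whose peak gives an actionable 2D point for manipulation.
# NOTE: `predict_affordance` is a stand-in stub, NOT the GLOVER++ API.
import numpy as np

def predict_affordance(image: np.ndarray, query: str) -> np.ndarray:
    """Return an (H, W) affordance heatmap in [0, 1].

    Stubbed with a centered Gaussian blob; a real model would ground
    `query` (e.g. "grasp the kettle handle") in the image content.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2  # placeholder peak location
    heatmap = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * (0.1 * h) ** 2))
    return heatmap / heatmap.max()

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB frame
heatmap = predict_affordance(image, "grasp the kettle handle")

# Reduce the dense heatmap to a single actionable pixel.
point = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(f"peak affordance at (row, col) = {point}")
```

In a real pipeline, the heatmap peak (or a small region around it) would presumably seed a grasp or contact-point proposal consumed by the downstream manipulation policy.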