GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation

πŸ“… 2025-05-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address weak generalization in robotic manipulation, scarcity of affordance annotations, and difficulty in cross-context transfer, this paper proposes a novel paradigm for learning manipulable affordance knowledge from human behavioral videos. We introduce HOVA-500Kβ€”the first large-scale image-level affordance dataset with 500K annotated samples. We propose GLOVER++, a global-local collaborative cross-modal transfer framework integrating multimodal contrastive learning, vision-action joint embedding, hierarchical attention, and open-vocabulary semantic alignment. Furthermore, we establish the first multimodal benchmark dedicated to affordance understanding. Our method achieves state-of-the-art performance on HOVA-500K and significantly improves generalization across scenes, objects, and tasks. It also enables zero-shot action reasoning and embodied decision-making, demonstrating robust transferability and scalability for real-world robotic applications.

Technology Category

Application Category

πŸ“ Abstract
Learning manipulation skills from human demonstration videos offers a promising path toward generalizable and interpretable robotic intelligence-particularly through the lens of actionable affordances. However, transferring such knowledge remains challenging due to: 1) a lack of large-scale datasets with precise affordance annotations, and 2) insufficient exploration of affordances in diverse manipulation contexts. To address these gaps, we introduce HOVA-500K, a large-scale, affordance-annotated dataset comprising 500,000 images across 1,726 object categories and 675 actions. We also release a standardized benchmarking suite for multi-modal affordance reasoning. Built upon HOVA-500K, we present GLOVER++, a global-to-local affordance training framework that effectively transfers actionable affordance knowledge from human demonstrations to downstream open-vocabulary reasoning tasks. GLOVER++ achieves state-of-the-art results on the HOVA-500K benchmark and demonstrates strong generalization across diverse downstream robotic manipulation tasks. By explicitly modeling actionable affordances, GLOVER++ facilitates robust transfer across scenes, modalities, and tasks. We hope that HOVA-500K and the GLOVER++ framework will serve as valuable resources for bridging the gap between human demonstrations and robotic manipulation capabilities.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale affordance-annotated datasets for robotic learning
Insufficient exploration of affordances in diverse manipulation contexts
Challenges in transferring affordance knowledge from human to robots
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale affordance-annotated dataset HOVA-500K
Global-to-local affordance training framework GLOVER++
Open-vocabulary reasoning for robotic manipulation