🤖 AI Summary
This work addresses key challenges in learning affordances from human manipulation videos: the difficulty of automatically identifying bimanual affordance regions, high annotation cost, and poor robot executability. To this end, we introduce 2HANDS, the first task-oriented, fine-grained bimanual affordance dataset, featuring action-descriptive narration labels and spatial annotations of bimanual interactions. Methodologically, we propose a multimodal framework grounded in vision-language models (VLMs), integrating self-supervised video parsing with action–region alignment to enable end-to-end prediction of robot-executable affordance regions. Experiments demonstrate that our model significantly outperforms existing baselines on affordance segmentation across diverse tasks. Crucially, real-world validation on a dual-arm robotic platform confirms that the predicted regions are actionable, yielding substantial improvements in grasping and manipulation success rates. Our core contributions are (1) the first formal modeling of fine-grained bimanual affordances and (2) a closed-loop verification paradigm bridging video understanding and robotic execution.
📝 Abstract
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands coordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset, and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.
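To make the input/output contract of such a predictor concrete, below is a minimal Python sketch of what a bimanual, task-conditioned affordance model of this kind consumes and produces. This is an illustrative assumption, not the paper's released code: `BimanualAffordance`, `predict_affordances`, and the placeholder masks are hypothetical names invented for this sketch.

```python
# Minimal sketch (hypothetical interface, not the authors' released API):
# a 2HandedAfforder-style predictor takes an RGB image and a task
# narration and returns one affordance-region mask per hand.

from dataclasses import dataclass

import numpy as np


@dataclass
class BimanualAffordance:
    """Per-hand affordance region segmentations for one image (hypothetical type)."""
    left_mask: np.ndarray   # (H, W) bool, region the left hand should contact
    right_mask: np.ndarray  # (H, W) bool, region the right hand should contact
    narration: str          # task description, e.g. "open the jar"


def predict_affordances(image: np.ndarray, narration: str) -> BimanualAffordance:
    """Placeholder predictor: a real system would query a VLM-based
    segmentation model conditioned on the narration. Here we return
    empty masks of the correct shape purely to illustrate the contract."""
    h, w = image.shape[:2]
    empty = np.zeros((h, w), dtype=bool)
    return BimanualAffordance(left_mask=empty, right_mask=empty.copy(),
                              narration=narration)


if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in camera frame
    pred = predict_affordances(rgb, "open the jar")
    # A robot stack could convert each mask into a contact target for the
    # corresponding arm, e.g. via the mask centroid in pixel space.
    print(pred.left_mask.shape, pred.right_mask.shape, pred.narration)
```

The design point the sketch highlights is that the narration conditions the prediction: the same object image can yield different left- and right-hand masks for different tasks, which is exactly the task-dependent, bimanual distinction the abstract argues naive part segmentation misses.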