🤖 AI Summary
This work addresses key challenges in learning affordances from human manipulation videos: the difficulty of automatically identifying bimanual affordance regions, high annotation cost, and poor robot executability. To this end, we introduce 2HANDS, the first task-oriented, fine-grained bimanual affordance dataset, featuring action-descriptive narration labels and spatial annotations of bimanual interactions. Methodologically, we propose a multimodal framework grounded in vision-language models (VLMs), integrating self-supervised video parsing with action–region alignment to enable end-to-end prediction of robot-executable affordance regions. Experiments demonstrate that our model significantly outperforms existing baselines on affordance segmentation across diverse tasks. Crucially, real-world validation on a dual-arm robotic platform confirms that the predicted regions are actionable, yielding substantial improvements in grasping and manipulation success rates. Our core contributions are (1) the first formal modeling of fine-grained bimanual affordances and (2) a closed-loop verification paradigm bridging video understanding and robotic execution.
📝 Abstract
When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands coordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset, and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.
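To make the input/output contract of such a predictor concrete, below is a minimal Python sketch of what a bimanual, task-conditioned affordance model of this kind consumes and produces. This is an illustrative assumption, not the paper's released code: `BimanualAffordance`, `predict_affordances`, and the placeholder masks are hypothetical names invented for this sketch.

```python
# Minimal sketch (hypothetical interface, not the authors' released API):
# a 2HandedAfforder-style predictor takes an RGB image and a task
# narration and returns one affordance-region mask per hand.

from dataclasses import dataclass

import numpy as np


@dataclass
class BimanualAffordance:
    """Per-hand affordance region segmentations for one image (hypothetical type)."""
    left_mask: np.ndarray   # (H, W) bool, region the left hand should contact
    right_mask: np.ndarray  # (H, W) bool, region the right hand should contact
    narration: str          # task description, e.g. "open the jar"


def predict_affordances(image: np.ndarray, narration: str) -> BimanualAffordance:
    """Placeholder predictor: a real system would query a VLM-based
    segmentation model conditioned on the narration. Here we return
    empty masks of the correct shape purely to illustrate the contract."""
    h, w = image.shape[:2]
    empty = np.zeros((h, w), dtype=bool)
    return BimanualAffordance(left_mask=empty, right_mask=empty.copy(),
                              narration=narration)


if __name__ == "__main__":
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in camera frame
    pred = predict_affordances(rgb, "open the jar")
    # A robot stack could convert each mask into a contact target for the
    # corresponding arm, e.g. via the mask centroid in pixel space.
    print(pred.left_mask.shape, pred.right_mask.shape, pred.narration)
```

The design point the sketch highlights is that the narration conditions the prediction: the same object image can yield different left- and right-hand masks for different tasks, which is exactly the task-dependent, bimanual distinction the abstract argues naive part segmentation misses.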