VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches rely on static visual or textual cues, which often fail to accurately identify functional regions on 3D objects that support human-object interactions (HOI). This work proposes a video-guided method for 3D functional region localization, introducing dynamic HOI videos as supervision signals for the first time. By leveraging multimodal alignment techniques to fuse action sequences from videos with 3D geometric structures, the method effectively resolves ambiguities inherent in static cues. To facilitate this research direction, we construct PVAD, the first dataset pairing HOI videos with aligned 3D object models annotated with functional regions. Experimental results demonstrate that our approach significantly outperforms static-cue-based baselines on PVAD, achieving state-of-the-art performance in 3D functional region localization.

📝 Abstract
3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective: humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first affordance dataset pairing HOI videos with 3D objects, providing functional supervision unavailable in prior work. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-cue-based baselines. The code and dataset will be released publicly.
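To make the task formulation concrete: the model receives a 3D object (a point cloud) plus a feature summarizing an interaction video, and outputs a per-point affordance score. The sketch below is a minimal toy illustration of that input/output contract only; the point embedding, the dot-product "fusion", and all names and shapes are assumptions for illustration, not the VAGNet architecture.

```python
import math
import random

# Toy illustration of video-guided 3D affordance grounding:
# score each point of an object's point cloud by its similarity to a
# feature vector summarizing the interaction video. The embedding and
# fusion here are illustrative stand-ins, not the paper's method.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def ground_affordance(points, video_feat):
    """Return one affordance score in (0, 1) per 3D point."""
    v = normalize(video_feat)
    scores = []
    for p in points:
        # Lift the 3D point into the video feature's dimension with a
        # trivial (hypothetical) map: tile coordinates, then truncate.
        feat = normalize((p * ((len(v) // 3) + 1))[:len(v)])
        logit = dot(feat, v)
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return scores

rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(512)]  # stand-in object
video_feat = [rng.gauss(0, 1) for _ in range(6)]                    # stand-in video embedding
scores = ground_affordance(points, video_feat)
```

In a real system the per-point features would come from a 3D backbone and the video feature from a video encoder, with learned fusion in place of the fixed similarity above; the point is only that supervision flows from dynamic interaction evidence to per-point predictions on the object.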
Problem

Research questions and friction points this paper is trying to address.

3D affordance grounding
human-object interaction
dynamic actions
contact region localization
embodied visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-guided affordance
3D affordance grounding
human-object interaction
dynamic interaction cues
VAGNet