VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to modeling 3D object affordances rely on static cues, which struggle to capture the temporal dynamics and causal relationships inherent in interactive scenarios. This work addresses this limitation by leveraging human-object interaction videos for 3D affordance modeling, introducing VIDA—a large-scale video dataset—and proposing VideoAfford, a novel framework that integrates multimodal large language models with dynamic interaction priors. VideoAfford incorporates a latent action encoder and a spatially aware loss function to jointly perform 3D affordance localization and commonsense reasoning in a unified manner. The method significantly outperforms existing approaches across multiple metrics, demonstrating exceptional generalization in open-world settings and superior affordance reasoning capabilities.
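For intuition about the pipeline described above, here is a minimal sketch of how a latent action encoder could distill per-frame video features into a dynamic interaction prior that conditions per-point affordance scoring. The module names, feature dimensions, and wiring are illustrative assumptions, not the published VideoAfford architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names, dimensions, and wiring are assumptions,
# not the architecture released with VideoAfford.

class LatentActionEncoder(nn.Module):
    """Pools per-frame video features into a compact dynamic-interaction prior."""
    def __init__(self, frame_dim=768, latent_dim=256):
        super().__init__()
        self.temporal = nn.GRU(frame_dim, latent_dim, batch_first=True)

    def forward(self, frame_feats):             # frame_feats: (B, T, frame_dim)
        _, hidden = self.temporal(frame_feats)  # hidden: (1, B, latent_dim)
        return hidden.squeeze(0)                # (B, latent_dim) action prior


class AffordanceHead(nn.Module):
    """Scores every point of the object cloud, conditioned on the action prior."""
    def __init__(self, point_dim=384, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(point_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, point_feats, action_prior):
        # point_feats: (B, N, point_dim), action_prior: (B, latent_dim)
        prior = action_prior.unsqueeze(1).expand(-1, point_feats.size(1), -1)
        logits = self.mlp(torch.cat([point_feats, prior], dim=-1))
        return logits.squeeze(-1)               # (B, N) per-point affordance logits


# Usage with random tensors standing in for video-frame and point-cloud features.
encoder, head = LatentActionEncoder(), AffordanceHead()
prior = encoder(torch.randn(2, 16, 768))         # 2 clips, 16 frames each
scores = head(torch.randn(2, 2048, 384), prior)  # 2 clouds, 2048 points each
```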

📝 Abstract
3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily learned affordance knowledge from static cues such as language and images, which struggle to provide the dynamic interaction context needed to reveal temporal and causal cues. To address this limitation, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline, VideoAfford, which equips multimodal large language models with affordance segmentation capabilities, enabling both world-knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function that allows VideoAfford to acquire comprehensive 3D spatial knowledge. Extensive experiments demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization together with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
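For readers curious what a spatial-aware loss could look like in practice, the sketch below pairs a standard per-point classification loss with a distance-weighted term that penalizes confident predictions lying far from the annotated affordance region. The exact formulation used by VideoAfford is defined in the paper; the function name, weighting, and tensor layout here are hypothetical.

```python
import torch
import torch.nn.functional as F

def spatial_aware_loss(logits, labels, points, dist_weight=0.1):
    """Hypothetical spatial-aware objective (not the paper's exact formulation).

    logits: (B, N) per-point affordance scores
    labels: (B, N) binary ground-truth affordance mask
    points: (B, N, 3) point-cloud coordinates
    """
    # Standard per-point classification term.
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())

    # Distance of every point to its nearest ground-truth-positive point.
    probs = torch.sigmoid(logits)                              # (B, N)
    dists = torch.cdist(points, points)                        # (B, N, N)
    pos_mask = labels.bool().unsqueeze(1)                      # (B, 1, N)
    dist_to_gt = dists.masked_fill(~pos_mask, float("inf")).min(dim=-1).values
    dist_to_gt = torch.nan_to_num(dist_to_gt, posinf=0.0)      # guard clouds with no positives

    # Penalize affordance mass predicted far from the annotated region.
    spatial = (probs * dist_to_gt).mean()
    return bce + dist_weight * spatial


# Example call on random stand-in tensors.
loss = spatial_aware_loss(torch.randn(2, 2048),
                          torch.randint(0, 2, (2, 2048)),
                          torch.rand(2, 2048, 3))
```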
Problem

Research questions and friction points this paper is trying to address.

3D affordance grounding
human-object interaction
dynamic interaction context
robotic manipulation
multimodal learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D affordance grounding
multimodal large language model
human-object interaction video
spatial-aware loss
dynamic interaction priors