HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of accurately localizing 3D affordance regions on novel objects based on human interaction intent. To this end, the authors propose a method that integrates multimodal large language models (MLLMs) with contact-aware embeddings to jointly encode interaction intent and geometric information. The approach features a hierarchical cross-modal fusion mechanism and a multi-granularity geometric enhancement module, enabling generalization to unseen objects without requiring explicit attribute annotations or 2D segmentation masks. As part of this contribution, the authors introduce the first 3D affordance benchmark incorporating occlusion challenges and demonstrate significant performance gains over existing methods on both this new benchmark and established public datasets. The code and trained models are publicly released.

📝 Abstract
Humans commonly identify 3D object affordance by observing interactions in images or videos, and once formed, such knowledge readily generalizes to novel objects. Inspired by this principle, we advocate a novel framework, HAMMER, that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we instead aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
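The abstract describes a three-part pipeline: hierarchical cross-modal integration of point features with MLLM features, geometry lifting of the intention embedding, and per-point affordance scoring. The sketch below illustrates one plausible reading of that pipeline in NumPy; all function names, dimensions, and the exact fusion and lifting rules are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the HAMMER-style pipeline; the specific
# attention form, "multi-granular" statistics, and scoring rule are
# assumed, not taken from the paper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head cross-attention: refine query tokens with MLLM tokens."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])   # (N, S)
    return softmax(scores) @ values                      # (N, D)

def geometry_lift(intent_emb, xyz, proj):
    """Infuse multi-granular spatial cues (here: global mean + spread of
    the point coordinates, an assumed form) into the intention embedding."""
    global_cue = proj @ xyz.mean(axis=0)                 # (D,)
    local_cue = proj @ xyz.std(axis=0)                   # (D,)
    return intent_emb + global_cue + local_cue

def ground_affordance(point_feats, xyz, mllm_feats, intent_emb, proj):
    """Two-stage ("hierarchical") cross-modal integration, then per-point
    affordance scores modulated by the lifted intention embedding."""
    x = cross_attend(point_feats, mllm_feats, mllm_feats)  # coarse stage
    x = cross_attend(x, mllm_feats, mllm_feats)            # fine stage
    intent = geometry_lift(intent_emb, xyz, proj)          # (D,)
    logits = x @ intent                                    # (N,)
    return 1.0 / (1.0 + np.exp(-logits))                   # scores in (0, 1)

# Usage with random stand-ins for the real encoders' outputs:
rng = np.random.default_rng(0)
N, S, D = 128, 8, 16          # points, MLLM tokens, feature dim (assumed)
scores = ground_affordance(
    rng.normal(size=(N, D)),  # per-point 3D features
    rng.normal(size=(N, 3)),  # point coordinates
    rng.normal(size=(S, D)),  # MLLM token features
    rng.normal(size=D),       # contact-aware intention embedding
    rng.normal(size=(D, 3)),  # spatial projection matrix
)
```

In this reading, the intention embedding acts as a learned query over the fused point features, so each point's score reflects both the MLLM's semantic cues and the object's geometry.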
Problem

Research questions and friction points this paper is trying to address.

3D affordance grounding
interaction intention
multimodal large language models
cross-modal integration
intention-driven
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models
3D affordance grounding
cross-modal integration
interaction intention
geometry lifting
👥 Authors
Lei Yao — The Hong Kong Polytechnic University
Yong Chen — Huazhong University of Science and Technology
Yuejiao Su — The Hong Kong Polytechnic University
Yi Wang — The Hong Kong Polytechnic University (Biomaterials)
Moyun Liu — Huazhong University of Science and Technology (Embodied AI, Computer Vision)
Lap-Pui Chau — The Hong Kong Polytechnic University (Visual Signal Processing)