HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of accurately localizing 3D affordance regions on novel objects based on human interaction intent. To this end, the authors propose a method that integrates multimodal large language models (MLLMs) with contact-aware embeddings to jointly encode interaction intent and geometric information. The approach features a hierarchical cross-modal fusion mechanism and a multi-granularity geometric enhancement module, enabling generalization to unseen objects without requiring explicit attribute annotations or 2D segmentation masks. As part of this contribution, the authors introduce the first 3D affordance benchmark incorporating occlusion challenges and demonstrate significant performance gains over existing methods on both this new benchmark and established public datasets. The code and trained models are publicly released.

📝 Abstract
Humans commonly identify 3D object affordance by observing interactions in images or videos, and once formed, such knowledge readily generalizes to novel objects. Inspired by this principle, we advocate a novel framework, HAMMER, that leverages emerging multimodal large language models (MLLMs) for interaction intention-driven 3D affordance grounding. Instead of generating explicit object attribute descriptions or relying on off-the-shelf 2D segmenters, we instead aggregate the interaction intention depicted in the image into a contact-aware embedding and guide the model to infer textual affordance labels, ensuring it thoroughly excavates object semantics and contextual cues. We further devise a hierarchical cross-modal integration mechanism to fully exploit the complementary information from the MLLM for 3D representation refinement, and introduce a multi-granular geometry lifting module that infuses spatial characteristics into the extracted intention embedding, thus facilitating accurate 3D affordance localization. Extensive experiments on public datasets and our newly constructed corrupted benchmark demonstrate the superiority and robustness of HAMMER compared to existing approaches. All code and weights are publicly available.
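The abstract describes a three-part pipeline: hierarchical cross-modal integration of point features with MLLM features, geometry lifting of the intention embedding, and per-point affordance scoring. The sketch below illustrates one plausible reading of that pipeline in NumPy; all function names, dimensions, and the exact fusion and lifting rules are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the HAMMER-style pipeline; the specific
# attention form, "multi-granular" statistics, and scoring rule are
# assumed, not taken from the paper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head cross-attention: refine query tokens with MLLM tokens."""
    scores = query @ keys.T / np.sqrt(keys.shape[-1])   # (N, S)
    return softmax(scores) @ values                      # (N, D)

def geometry_lift(intent_emb, xyz, proj):
    """Infuse multi-granular spatial cues (here: global mean + spread of
    the point coordinates, an assumed form) into the intention embedding."""
    global_cue = proj @ xyz.mean(axis=0)                 # (D,)
    local_cue = proj @ xyz.std(axis=0)                   # (D,)
    return intent_emb + global_cue + local_cue

def ground_affordance(point_feats, xyz, mllm_feats, intent_emb, proj):
    """Two-stage ("hierarchical") cross-modal integration, then per-point
    affordance scores modulated by the lifted intention embedding."""
    x = cross_attend(point_feats, mllm_feats, mllm_feats)  # coarse stage
    x = cross_attend(x, mllm_feats, mllm_feats)            # fine stage
    intent = geometry_lift(intent_emb, xyz, proj)          # (D,)
    logits = x @ intent                                    # (N,)
    return 1.0 / (1.0 + np.exp(-logits))                   # scores in (0, 1)

# Usage with random stand-ins for the real encoders' outputs:
rng = np.random.default_rng(0)
N, S, D = 128, 8, 16          # points, MLLM tokens, feature dim (assumed)
scores = ground_affordance(
    rng.normal(size=(N, D)),  # per-point 3D features
    rng.normal(size=(N, 3)),  # point coordinates
    rng.normal(size=(S, D)),  # MLLM token features
    rng.normal(size=D),       # contact-aware intention embedding
    rng.normal(size=(D, 3)),  # spatial projection matrix
)
```

In this reading, the intention embedding acts as a learned query over the fused point features, so each point's score reflects both the MLLM's semantic cues and the object's geometry.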
Problem

Research questions and friction points this paper is trying to address.

3D affordance grounding
interaction intention
multimodal large language models
cross-modal integration
intention-driven
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models
3D affordance grounding
cross-modal integration
interaction intention
geometry lifting
👥 Authors
Lei Yao — The Hong Kong Polytechnic University
Yong Chen — Huazhong University of Science and Technology
Yuejiao Su — The Hong Kong Polytechnic University
Yi Wang — The Hong Kong Polytechnic University (Biomaterials)
Moyun Liu — Huazhong University of Science and Technology (Embodied AI, Computer Vision)
Lap-Pui Chau — The Hong Kong Polytechnic University (Visual Signal Processing)