Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses dense hand contact estimation, which requires simultaneous reasoning about high-level semantics and fine-grained 3D geometry. Current multimodal large language models (MLLMs) lack this capability because they have no explicit geometric modeling and do not infer contact at the vertex level. To bridge this gap, we propose ContactPrompt, the first training-free, zero-shot method for dense hand contact estimation. Our approach encodes hand geometry through segmentation masks and vertex-based mesh representations, and introduces a multi-stage structured prompting mechanism that integrates global semantic context with local geometric details to elicit fine-grained reasoning from off-the-shelf MLLMs. Without using any training data or task-specific fine-tuning, ContactPrompt outperforms supervised methods that rely on large-scale, densely annotated contact labels.
📝 Abstract
Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning about human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to this task. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning scheme with part conditioning, progressively bridging global semantics and fine-grained geometry. As a result, our method leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, without requiring any training, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets. The code will be released.
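The abstract gives no implementation details, so the following is only a minimal sketch of how a part-wise vertex grid and part-conditioned, multi-stage querying could fit together. It is not the authors' code. Every name here (`build_part_grid`, `estimate_contact`, `fake_mllm`), the prompt wording, the JSON answer format, the normalized 2D vertex coordinates, and the grid binning rule are assumptions made for illustration. A real pipeline would also pass the image and segmentation overlay to the MLLM, which this text-only sketch omits.

```python
import json
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]
Grid = Dict[Cell, List[int]]  # grid cell -> indices of mesh vertices in that cell


def build_part_grid(coords: Dict[int, Tuple[float, float]], rows: int, cols: int) -> Grid:
    """Bin one hand part's vertices into a rows x cols grid.

    `coords` maps a vertex index to hypothetical normalized (u, v) in [0, 1].
    """
    grid: Grid = {}
    for vid, (u, v) in coords.items():
        r = min(int(v * rows), rows - 1)  # clamp v == 1.0 into the last row
        c = min(int(u * cols), cols - 1)  # clamp u == 1.0 into the last column
        grid.setdefault((r, c), []).append(vid)
    return grid


def estimate_contact(parts: Dict[str, Grid], ask: Callable[[str], str]) -> Set[int]:
    """Stage 1 asks which parts are in contact; stage 2 asks for cells per part."""
    stage1 = ("Which of these hand parts touch something in the image? "
              f"Answer with a JSON list drawn from {sorted(parts)}.")
    touched = [p for p in json.loads(ask(stage1)) if p in parts]

    contact: Set[int] = set()
    for part in touched:  # part conditioning: only query parts flagged in stage 1
        cells = [list(cell) for cell in sorted(parts[part])]
        stage2 = (f"For hand part '{part}', which grid cells are in contact? "
                  f"Answer with a JSON list of [row, col] pairs drawn from {cells}.")
        for r, c in json.loads(ask(stage2)):
            contact.update(parts[part].get((r, c), []))  # ignore unknown cells
    return contact


def fake_mllm(prompt: str) -> str:
    """Stand-in for a real MLLM call so the sketch runs offline."""
    if prompt.startswith("Which"):
        return '["index_tip"]'
    return "[[0, 1]]"


if __name__ == "__main__":
    parts = {
        "index_tip": build_part_grid({10: (0.1, 0.1), 11: (0.9, 0.2), 12: (0.8, 0.9)}, 2, 2),
        "palm": build_part_grid({20: (0.5, 0.5)}, 2, 2),
    }
    print(sorted(estimate_contact(parts, fake_mllm)))  # prints [11]
```

Injecting the model as a plain callable keeps the sketch independent of any particular MLLM API. Querying cells only for the parts flagged in the first stage is one plausible reading of how part conditioning keeps dense prediction efficient.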
Problem

Research questions and friction points this paper is trying to address.

dense hand contact estimation
multi-modal large language models
3D hand geometry
fine-grained geometric reasoning
zero-shot
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
dense hand contact estimation
multi-modal large language models
3D hand geometry
zero-shot