🤖 AI Summary
Estimating 3D hand-object interaction (HOI) poses from monocular RGB images remains challenging due to inherent geometric ambiguity and severe mutual occlusion. To address these issues, this paper proposes a masked autoencoder (MAE)-driven, geometry-aware pretraining framework. The approach introduces two key innovations: (1) region-specific masking ratio allocation and skeleton-guided hand masking, which improve modeling fidelity for critical anatomical structures; and (2) a signed distance field (SDF)-based multimodal reconstruction objective that jointly optimizes fine-grained hand geometry and global object shape. Evaluated on the HOI-3D and InterHand2.6M benchmarks, the method achieves significant improvements over state-of-the-art approaches, demonstrating superior robustness to occlusion and stronger generalization across diverse scenes and object categories.
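The first innovation, region-specific masking ratio allocation, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the function name `region_specific_mask`, the 14×14 patch grid, and the 0.5/0.75 ratios are all assumptions; the key idea it shows is simply that hand patches receive a lower masking ratio than object/background patches.

```python
import numpy as np

def region_specific_mask(region_ids, hand_ratio=0.5, object_ratio=0.75, seed=0):
    """Sample a per-patch MAE mask with a lower masking ratio on hand patches.

    region_ids: 1-D int array over image patches; 1 = hand, 0 = object/background.
    Ratios and names are illustrative, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros(region_ids.shape, dtype=bool)
    for region, ratio in ((1, hand_ratio), (0, object_ratio)):
        idx = np.flatnonzero(region_ids == region)
        n_masked = int(round(ratio * idx.size))
        # Mask a region-specific fraction of this region's patches.
        mask[rng.choice(idx, size=n_masked, replace=False)] = True
    return mask

# Hypothetical 14x14 ViT patch grid (196 patches), first 60 labelled as hand.
region_ids = np.zeros(196, dtype=int)
region_ids[:60] = 1
mask = region_specific_mask(region_ids)
```

The skeleton-guided variant described in the summary would replace the uniform `rng.choice` over hand patches with sampling concentrated on structurally critical parts (e.g., all patches covering a chosen fingertip or finger).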
📄 Abstract
In 3D hand-object interaction (HOI) tasks, estimating precise joint poses of hands and objects from monocular RGB input remains highly challenging due to the inherent geometric ambiguity of RGB images and the severe mutual occlusions that arise during interaction. To address these challenges, we propose MaskHOI, a novel Masked Autoencoder (MAE)-driven pretraining framework for enhanced HOI pose estimation. Our core idea is to leverage the masking-then-reconstruction strategy of MAE to encourage the feature encoder to infer missing spatial and structural information, thereby facilitating geometry-aware and occlusion-robust representation learning. Specifically, we observe that human hands exhibit far greater geometric complexity than rigid objects, so conventional uniform masking fails to effectively guide the reconstruction of fine-grained hand structures. To overcome this limitation, we introduce Region-specific Mask Ratio Allocation, which comprises region-specific masking assignment and skeleton-driven hand masking guidance. The former adaptively assigns lower masking ratios to hand regions than to rigid objects, balancing their feature-learning difficulty, while the latter prioritizes masking critical hand parts (e.g., fingertips or entire fingers) to realistically simulate the occlusion patterns of real-world interactions. Furthermore, to enhance the geometric awareness of the pretrained encoder, we introduce a novel masked Signed Distance Field (SDF)-driven multimodal learning mechanism. Through self-masked 3D SDF prediction, the learned encoder can perceive the global geometric structure of hands and objects beyond the 2D image plane, overcoming the inherent limitations of monocular input and alleviating self-occlusion. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches.
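The SDF-driven objective can be illustrated with a minimal sketch of the reconstruction loss. Everything here is an assumption for illustration: the function `sdf_reconstruction_loss`, the L1 error, and the per-part weighting are not taken from the paper; the sketch only shows the general form of supervising predicted signed distances at sampled 3D query points, with separate hand and object terms.

```python
import numpy as np

def sdf_reconstruction_loss(pred_sdf, gt_sdf, part_ids, w_hand=1.0, w_obj=1.0):
    """Illustrative L1 SDF loss with separate hand and object terms.

    pred_sdf, gt_sdf: (N,) signed distances at N sampled 3D query points.
    part_ids: (N,) labels; 1 = hand query point, 0 = object query point.
    Weights and the L1 choice are illustrative assumptions.
    """
    err = np.abs(pred_sdf - gt_sdf)
    hand = err[part_ids == 1].mean() if (part_ids == 1).any() else 0.0
    obj = err[part_ids == 0].mean() if (part_ids == 0).any() else 0.0
    return w_hand * hand + w_obj * obj

# Toy example: two hand points and two object points.
loss = sdf_reconstruction_loss(
    np.array([0.1, 0.2, -0.1, 0.0]),
    np.array([0.0, 0.2, 0.1, -0.1]),
    np.array([1, 1, 0, 0]),
)
```

In the pretraining described above, such a loss would be computed from features of the masked image, forcing the encoder to recover full 3D geometry from partial 2D evidence.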