Mask6D: Masked Pose Priors For 6D Object Pose Estimation

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited robustness of monocular RGB-based 6D object pose estimation in cluttered and heavily occluded scenes, this paper proposes a self-supervised pretraining framework that integrates 2D–3D geometric priors with visibility masks. The method introduces two structural priors, (1) a pose-aware 2D–3D correspondence map and (2) an object visibility mask, and uses both to guide representation learning. A multimodal pretraining objective is tailored to pose discrimination, and a Transformer architecture jointly processes RGB images, correspondence maps, and visibility masks, explicitly encoding geometric constraints while suppressing background interference. Evaluated on standard benchmarks including LM and BOP, the approach outperforms existing end-to-end methods, particularly under severe occlusion, with gains in pose accuracy and stability. These results support the claim that geometrically grounded priors improve pose perception when visual cues are sparse or ambiguous.

📝 Abstract
Robust 6D object pose estimation from monocular RGB images in cluttered or occluded scenes remains challenging. One reason is that current pose estimation networks struggle to extract discriminative, pose-aware features with 2D feature backbones, especially when the available RGB information is limited by occlusion of the target in cluttered scenes. To mitigate this, we propose Mask6D, a novel pre-training strategy specific to pose estimation. Our approach incorporates pose-aware 2D-3D correspondence maps and visible mask maps as additional modalities, which are combined with RGB images for reconstruction-based model pre-training. Essentially, the 2D-3D correspondence map associates a transformed 3D object model with 2D pixels, reflecting the pose of the target in the camera coordinate system. Meanwhile, the visible mask map guides the model to disregard cluttered background information. In addition, an object-focused pre-training loss function is designed to further help the network remove background interference. Finally, we fine-tune the pre-trained, pose prior-aware network with a conventional pose training strategy to obtain reliable pose predictions. Extensive experiments verify that our method outperforms previous end-to-end pose estimation methods.
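
To make the idea of a pose-aware 2D-3D correspondence map concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, an object pose given as a rotation matrix and translation, and a model represented as a sparse point set; the function name `render_correspondence_map` and all parameters are hypothetical. The mask it returns only accounts for self-occlusion via a z-buffer, whereas a visible mask in a real scene would also exclude pixels hidden by other objects.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def render_correspondence_map(
    model_points: List[Vec3],
    rotation: List[List[float]],
    translation: Vec3,
    fx: float, fy: float, cx: float, cy: float,
    height: int, width: int,
) -> Tuple[List[List[Optional[Vec3]]], List[List[int]]]:
    """Project model points into the image with a z-buffer.

    Returns (corr_map, mask): corr_map[v][u] holds the object-frame 3D
    coordinate seen at pixel (u, v), or None for background; mask[v][u]
    is 1 where the object is visible and 0 elsewhere.
    """
    corr_map = [[None] * width for _ in range(height)]
    mask = [[0] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]

    for p in model_points:
        # Object frame -> camera frame: X_cam = R p + t.
        x, y, z = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if z <= 0:
            continue  # point lies behind the camera
        # Pinhole projection to integer pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        # Keep only the nearest point per pixel (simple z-buffer).
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            corr_map[v][u] = p
            mask[v][u] = 1
    return corr_map, mask
```

Because each foreground pixel stores the object-frame coordinate it came from, changing the pose changes which coordinates land on which pixels, which is why such a map carries pose information that plain RGB may not expose under occlusion.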

Problem

Research questions and friction points this paper is trying to address.

Robust 6D pose estimation in occluded scenes
Improving pose-aware feature extraction from RGB images
Reducing background interference in cluttered environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pose-aware 2D-3D correspondence maps
Integrates visible mask maps for clutter removal
Employs object-focused pre-training loss function
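
The page does not give the exact form of the object-focused pre-training loss, so the following is only an illustrative sketch of one plausible variant: a per-pixel squared reconstruction error that is weighted by the visible mask so that background pixels contribute little or nothing. The function name `object_focused_l2` and the `background_weight` parameter are assumptions introduced for this example.

```python
from typing import List


def object_focused_l2(
    pred: List[List[float]],
    target: List[List[float]],
    visible_mask: List[List[int]],
    background_weight: float = 0.0,
) -> float:
    """Mean squared reconstruction error, weighted toward object pixels.

    Pixels inside the visible mask get weight 1; background pixels get
    `background_weight` (0 ignores them entirely).
    """
    weighted_error = 0.0
    total_weight = 0.0
    for pred_row, target_row, mask_row in zip(pred, target, visible_mask):
        for p, t, m in zip(pred_row, target_row, mask_row):
            w = 1.0 if m else background_weight
            weighted_error += w * (p - t) ** 2
            total_weight += w
    # Guard against an empty mask so the loss stays finite.
    return weighted_error / total_weight if total_weight > 0 else 0.0
```

With `background_weight=0.0`, a large reconstruction error on a cluttered background pixel does not affect the loss at all, which matches the stated goal of steering the network away from background interference.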
Yuechen Xie
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Haobo Jiang
Nanyang Technological University / Nanjing University of Science and Technology / EPFL
3D Computer Vision, Reinforcement Learning
Jin Xie
PCA Lab, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China