DeMo-Pose: Depth-Monocular Modality Fusion for Object Pose Estimation

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of category-level, CAD-model-free 9-degree-of-freedom object pose estimation by proposing a multimodal fusion architecture that effectively aligns RGB semantic features with depth-driven graph convolutional geometric representations. The core innovations include an efficient RGB-D fusion mechanism and the introduction of a Mesh-Point Loss, which leverages mesh structures during training to enhance geometric reasoning without incurring additional computational overhead at inference time. Evaluated on the REAL275 benchmark, the method achieves a 3.2% improvement in 3D IoU and an 11.1% gain in pose accuracy over state-of-the-art approaches such as GPV-Pose, while maintaining real-time inference capability.
📝 Abstract
Object pose estimation is a fundamental task in 3D vision with applications in robotics, AR/VR, and scene understanding. We address the challenge of category-level 9-DoF pose estimation (6D pose + 3D size) from RGB-D input, without relying on CAD models during inference. Existing depth-only methods achieve strong results but ignore semantic cues from RGB, while many RGB-D fusion models underperform due to suboptimal cross-modal fusion that fails to align semantic RGB cues with 3D geometric representations. We propose DeMo-Pose, a hybrid architecture that fuses monocular semantic features with depth-based graph convolutional representations via a novel multimodal fusion strategy. To further improve geometric reasoning, we introduce a novel Mesh-Point Loss (MPL) that leverages mesh structure during training without adding inference overhead. Our approach achieves real-time inference and significantly improves over state-of-the-art methods across object categories, outperforming the strong GPV-Pose baseline by 3.2% on 3D IoU and 11.1% on pose accuracy on the REAL275 benchmark. The results highlight the effectiveness of depth-RGB fusion and geometry-aware learning, enabling robust category-level 3D pose estimation for real-world applications.
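The abstract describes pairing per-point semantic features from the RGB branch with geometric features from the depth/graph-convolutional branch. The paper does not spell out the fusion operator, so the following is only a minimal illustrative sketch of one common choice, per-point concatenation followed by a learned projection; all shapes and the projection weights here are hypothetical, not DeMo-Pose's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N points back-projected from the depth map,
# each carrying a sampled RGB semantic feature and a geometric feature.
N = 1024
rgb_feat = rng.standard_normal((N, 32))   # from the monocular RGB branch (assumed dim)
geo_feat = rng.standard_normal((N, 64))   # from the depth/graph-conv branch (assumed dim)

# Naive fusion: concatenate per point, then project to a shared embedding.
# W would be learned; it is random here purely for illustration.
W = rng.standard_normal((32 + 64, 128)) * 0.01
fused = np.concatenate([rgb_feat, geo_feat], axis=1) @ W

print(fused.shape)
```

A learned cross-modal fusion (e.g. attention-based weighting) would replace the plain concatenation, but the per-point alignment of the two modalities is the structural point this sketch shows.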
Problem

Research questions and friction points this paper is trying to address.

object pose estimation
RGB-D fusion
category-level
9-DoF pose
monocular-depth fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal fusion
depth-monocular fusion
category-level pose estimation
Mesh-Point Loss
graph convolutional representation
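The Mesh-Point Loss is described only at a high level (mesh structure supervises point geometry during training, with no inference cost). One plausible instantiation of such a mesh-aware point loss is a one-directional chamfer distance from predicted points to points sampled on the mesh surface; the sampling scheme and function names below are illustrative assumptions, not the paper's actual MPL:

```python
import numpy as np

def sample_on_triangles(verts, faces, n_samples, rng):
    """Uniformly sample points on mesh triangles via barycentric coordinates."""
    tri = verts[faces]                          # (F, 3, 3) triangle vertices
    idx = rng.integers(0, len(faces), n_samples)
    u, v = rng.random((2, n_samples))
    flip = u + v > 1.0                          # fold samples back into the triangle
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    a, b, c = tri[idx, 0], tri[idx, 1], tri[idx, 2]
    return a + u[:, None] * (b - a) + v[:, None] * (c - a)

def mesh_point_loss(pred_points, verts, faces, n_samples=2048, rng=None):
    """One-directional chamfer: mean distance from each predicted point to its
    nearest surface sample. Illustrative sketch only, not DeMo-Pose's exact MPL."""
    rng = rng or np.random.default_rng(0)
    surf = sample_on_triangles(verts, faces, n_samples, rng)
    d = np.linalg.norm(pred_points[:, None, :] - surf[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Toy example: a unit right triangle and predicted points lying on it,
# so the loss should be close to zero.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])
on_surface = np.array([[0.25, 0.25, 0.0], [0.1, 0.2, 0.0]])
print(mesh_point_loss(on_surface, verts, faces))
```

Because the mesh is only consulted inside the loss, the trained network never needs it at test time, which is consistent with the abstract's claim of no added inference overhead.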
Rachit Agarwal
Samsung R&D Institute, Bangalore
Abhishek Joshi
Computer Science, Princeton University
Robotics, Deep Learning
Sathish Chalasani
Samsung R&D Institute, Bangalore
Woo Jin Kim
Samsung Electronics, Suwon, Republic of Korea