MonoSE(3)-Diffusion: A Monocular SE(3) Diffusion Framework for Robust Camera-to-Robot Pose Estimation

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses markerless, monocular visual SE(3) pose estimation for robots. The authors propose a novel conditional diffusion-model-based approach with three key contributions: (1) a visibility-constrained forward diffusion process that keeps intermediate poses within the camera's field of view; (2) a timestep-aware progressive denoising mechanism that enables coarse-to-fine pose refinement; and (3) a diffusion posterior sampling framework explicitly designed for the SE(3) manifold. Evaluated on the DREAM and RoboKeyGen benchmarks, the method achieves state-of-the-art performance; on the most challenging dataset it attains an AUC of 66.75, surpassing the previous best by 32.3%. The approach shows markedly improved robustness to occlusion, viewpoint variation, and domain shift, along with stronger cross-scenario generalization.
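The timestep-aware denoising mechanism described above can be sketched as an iterative loop in which a denoising network predicts a pose at each timestep and the estimate is then resampled from a posterior whose noise scale shrinks on a schedule. The sketch below is illustrative only: it treats the pose as a Euclidean vector rather than an element of SE(3), and `denoiser`, `reverse_refine`, and the geometric noise schedule are assumptions standing in for the paper's actual network and SE(3)-manifold posterior.

```python
import numpy as np

def reverse_refine(denoiser, pose_init, num_steps, sigma_max, sigma_min, rng):
    """Coarse-to-fine refinement sketch: at each timestep a (hypothetical)
    timestep-conditioned denoiser predicts a pose, then the estimate is
    resampled from a Gaussian posterior whose scale shrinks geometrically.
    Poses are plain vectors here, a Euclidean stand-in for SE(3)."""
    sigmas = np.geomspace(sigma_max, sigma_min, num_steps)
    pose = pose_init
    for t, sigma in enumerate(sigmas):
        pred = denoiser(pose, t)                    # timestep-aware prediction
        pose = pred + rng.normal(size=pred.shape) * sigma  # posterior sample
    return pred                                     # final denoised estimate
```

Because the noise scale decays across timesteps, early iterations explore coarsely while late iterations make only fine corrections, mirroring the scheduled coarse-to-fine procedure the paper describes.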

📝 Abstract
We propose MonoSE(3)-Diffusion, a monocular SE(3) diffusion framework that formulates markerless, image-based robot pose estimation as a conditional denoising diffusion process. The framework consists of two processes: a visibility-constrained diffusion process for diverse pose augmentation and a timestep-aware reverse process for progressive pose refinement. The diffusion process progressively perturbs ground-truth poses into noisy transformations for training a pose denoising network. Importantly, we integrate visibility constraints into this process, ensuring the transformations remain within the camera's field of view. Compared to the fixed-scale perturbations used in current methods, the diffusion process generates in-view and diverse training poses, thereby improving the network's generalization capability. Furthermore, the reverse process iteratively predicts poses with the denoising network and refines the estimates by sampling from the diffusion posterior of the current timestep, following a scheduled coarse-to-fine procedure. Moreover, the timestep indicates the transformation scale, guiding the denoising network toward more accurate pose predictions. The reverse process is more robust than direct prediction, benefiting from its timestep-aware refinement scheme. Our approach demonstrates improvements across two benchmarks (DREAM and RoboKeyGen), achieving a notable AUC of 66.75 on the most challenging dataset, a 32.3% gain over the state of the art.
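The visibility-constrained perturbation idea from the abstract can be illustrated with a simple rejection-sampling sketch: perturb a ground-truth pose with a random rotation and translation, project the pose origin through a pinhole camera, and resample if it leaves the image. This is a minimal stand-in, not the paper's formulation; the function names, the single-point visibility test, and the Gaussian perturbation scales are all assumptions for illustration.

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    # Rodrigues' formula: rotation matrix from a unit axis and an angle.
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def perturb_pose_in_view(R, t, sigma_rot, sigma_trans, K_cam, img_wh, rng,
                         max_tries=100):
    """Perturb an SE(3) pose (R, t) with Gaussian rotation/translation noise,
    rejecting samples whose projected origin falls outside the image -- a toy
    stand-in for the paper's visibility constraint."""
    w, h = img_wh
    for _ in range(max_tries):
        axis = rng.normal(size=3)
        angle = rng.normal() * sigma_rot
        dR = axis_angle_to_matrix(axis, angle)
        dt = rng.normal(size=3) * sigma_trans
        R_new, t_new = dR @ R, t + dt
        if t_new[2] <= 0:            # behind the camera: reject
            continue
        uvw = K_cam @ t_new          # project the pose origin (pinhole model)
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        if 0 <= u < w and 0 <= v < h:
            return R_new, t_new
    return R, t                      # fall back to the unperturbed pose
```

Compared with fixed-scale perturbation, tying `sigma_rot` and `sigma_trans` to the diffusion timestep would yield the diverse yet always in-view training poses the abstract describes.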
Problem

Research questions and friction points this paper is trying to address.

Markerless monocular camera-to-robot pose estimation is brittle under occlusion, viewpoint variation, and domain shift
Fixed-scale pose perturbations in current training pipelines yield limited pose diversity and weak generalization
Direct single-shot pose prediction lacks a progressive refinement mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monocular SE(3) diffusion for camera-to-robot pose estimation
Visibility-constrained diffusion process for diverse pose augmentation
Timestep-aware reverse process for progressive pose refinement
Kangjian Zhu
School of Computer Science and Engineering, Nanjing University of Science and Technology, China
Haobo Jiang
Nanyang Technological University / Nanjing University of Science and Technology / EPFL
3D Computer Vision · Reinforcement Learning
Yigong Zhang
College of Computer Science, Nankai University, China
Jianjun Qian
Nanjing University of Science and Technology
Pattern Recognition · Computer Vision · Face Recognition
Jian Yang
School of Computer Science and Engineering, Nanjing University of Science and Technology, China
Jin Xie
School of Intelligence Science and Technology, Nanjing University, China