DVMNet++: Rethinking Relative Pose Estimation for Unseen Objects

📅 2024-03-20
📈 Citations: 3
Influential: 1

🤖 AI Summary
This work addresses 6DoF relative pose estimation for previously unseen objects from two RGB views, without relying on ground-truth object bounding boxes or discrete rotation hypotheses. DVMNet++ computes the pose in a single pass: (1) open-set object detection, using image feature embeddings and natural language as reference, locates the object, approximates the translation, and crops the object from the query image; (2) the reference and cropped query images are mapped to voxelized 3D representations; and (3) a rotation estimation module aligns the voxels end to end by solving a least-squares problem, with a weighted closest voxel algorithm mitigating the impact of noisy voxels. Experiments on CO3D, Objaverse, LINEMOD, and LINEMOD-O show more accurate relative pose estimates for novel objects at a lower computational cost than state-of-the-art methods.

📝 Abstract
Determining the relative pose of a previously unseen object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically predict 3D translation utilizing the ground-truth object bounding box and approximate 3D rotation with a large number of discrete hypotheses. This strategy makes unrealistic assumptions about the availability of ground truth and incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we rethink the problem of relative pose estimation for unseen objects by presenting a Deep Voxel Matching Network (DVMNet++). Our method computes the relative object pose in a single pass, eliminating the need for ground-truth object bounding boxes and rotation hypotheses. We achieve open-set object detection by leveraging image feature embedding and natural language understanding as reference. The detection result is then employed to approximate the translation parameters and crop the object from the query image. For rotation estimation, we map the two RGB images, i.e., reference and cropped query, to their respective voxelized 3D representations. The resulting voxels are passed through a rotation estimation module, which aligns the voxels and computes the rotation in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, Objaverse, LINEMOD, and LINEMOD-O datasets, demonstrating that our approach delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at https://github.com/sailor-z/DVMNet/.
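The least-squares rotation step mentioned in the abstract can be illustrated with a generic weighted orthogonal Procrustes (Kabsch-style) solve. This is a minimal sketch, not the authors' implementation: it assumes that matched voxel pairs are already available as 3D vectors with one non-negative weight per pair, and the function name `weighted_rotation` and its arguments are hypothetical.

```python
import numpy as np

def weighted_rotation(src, dst, weights):
    """Solve min_R sum_i w_i * ||R @ src_i - dst_i||^2 over 3D rotations R.

    src, dst: (N, 3) arrays of matched 3D vectors (assumed inputs).
    weights:  (N,) non-negative per-pair weights (assumed inputs).
    """
    w = weights / weights.sum()
    # Weighted cross-covariance H = sum_i w_i * src_i * dst_i^T (3x3).
    H = (w[:, None] * src).T @ dst
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T
```

Because the solution is a closed-form SVD, it can be computed without scoring discrete rotation hypotheses; pairs with small weights contribute little to `H`, which is how down-weighting noisy matches would affect the estimate in this sketch.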
Problem

Research questions and friction points this paper is trying to address.

Estimating relative pose for unseen objects between images
Eliminating need for ground-truth bounding boxes and rotation hypotheses
Enhancing robustness and reducing computational cost in pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-pass relative pose estimation without ground-truth bounding boxes or rotation hypotheses
Voxelized 3D representation for rotation estimation
Weighted closest voxel algorithm for noise reduction
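The idea behind weighted closest-voxel matching can be sketched as follows. The summary does not specify how voxels are compared or how weights are formed, so this example makes loud assumptions: each voxel carries a feature vector and a confidence score, the closest voxel is chosen by cosine similarity, and the match weight is the product of the two confidences and the clipped similarity. The function `weighted_closest_voxel` and all of its parameters are hypothetical.

```python
import numpy as np

def weighted_closest_voxel(feat_ref, feat_qry, conf_ref, conf_qry, eps=1e-12):
    """For each reference voxel, pick the most similar query voxel and a weight.

    feat_ref: (N, C) reference voxel features (assumed representation).
    feat_qry: (M, C) query voxel features (assumed representation).
    conf_ref: (N,) reference confidences in [0, 1] (assumed).
    conf_qry: (M,) query confidences in [0, 1] (assumed).
    Returns (idx, weights): idx[i] is the matched query voxel for reference i.
    """
    # Normalize features so the dot product is cosine similarity.
    a = feat_ref / (np.linalg.norm(feat_ref, axis=1, keepdims=True) + eps)
    b = feat_qry / (np.linalg.norm(feat_qry, axis=1, keepdims=True) + eps)
    sim = a @ b.T                              # (N, M) similarity matrix
    idx = sim.argmax(axis=1)                   # closest query voxel per reference voxel
    best = sim[np.arange(sim.shape[0]), idx]   # similarity of each chosen match
    # Assumed weighting: low-confidence or dissimilar pairs get small weights.
    weights = conf_ref * conf_qry[idx] * np.clip(best, 0.0, None)
    return idx, weights
```

In this sketch, a noisy voxel with low confidence or no similar counterpart receives a weight near zero, so it would have little influence on any downstream weighted least-squares fit.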