🤖 AI Summary
To address the generalization bottlenecks of single-view RGB-based 6D pose estimation under depth ambiguity, occlusion, and clutter, this paper proposes AlignPose, a multi-view pose estimation framework that requires no object-specific fine-tuning, symmetry annotations, or single-view pose initialization. The core contribution is a multi-view feature-metric refinement formulated in the world coordinate frame: object features are rendered on the fly via differentiable rendering, and a single world-frame pose is optimized to minimize the discrepancy between these rendered features and the observed image features across all calibrated views simultaneously. This design enables zero-shot transfer to unseen objects, eliminating reliance on object priors or pre-estimated single-view poses. Evaluated on four BOP benchmarks (YCB-V, T-LESS, ITODD-MV, and HouseCat6D), the approach achieves state-of-the-art performance across all of them, with particularly large gains on ITODD-MV and HouseCat6D, settings where multiple calibrated views are readily available in practice.
📝 Abstract
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to resolve these issues, but existing approaches rely on precise single-view pose estimates or do not generalize to unseen objects. We address these challenges with the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and requires no object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement designed specifically for object pose: it optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
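To make the core idea concrete, the world-frame feature-metric objective can be sketched in a toy form. The sketch below is illustrative only and not AlignPose's implementation: it replaces learned feature maps and differentiable rendering with a smooth analytic per-pixel feature function and a pinhole projection, uses hand-picked cameras and model points, and refines only the object translation with finite-difference gradient descent. What it does share with the paper's formulation is the structure of the cost: a single world-frame pose parameter, scored by the summed feature discrepancy between rendered and observed features over all calibrated views.

```python
import numpy as np

# Toy stand-ins (all values illustrative, not from AlignPose): a few
# world-frame model points, two extrinsically calibrated pinhole views,
# and a smooth analytic "feature map" replacing learned features.
PTS = np.array([[0.00, 0.00, 0.00], [0.10, 0.00, 0.00],
                [0.00, 0.10, 0.00], [0.00, 0.00, 0.10],
                [0.10, 0.10, 0.05]])
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
c, s = np.cos(0.3), np.sin(0.3)
VIEWS = [(np.eye(3), np.array([0., 0., 1.0])),                 # view 1
         (np.array([[c, 0., s], [0., 1., 0.], [-s, 0., c]]),   # view 2
          np.array([0., 0., 1.2]))]

def project(x_world, R, t):
    """Pinhole projection of world points into one calibrated view."""
    x_cam = x_world @ R.T + t
    uv = x_cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def feat(uv):
    """Smooth per-pixel feature embedding (toy stand-in for a CNN map)."""
    return np.stack([np.sin(0.05 * uv[:, 0]),
                     np.cos(0.05 * uv[:, 1])], axis=1)

GT_SHIFT = np.array([0.004, -0.003, 0.008])  # unknown object translation
OBS = [feat(project(PTS + GT_SHIFT, R, t)) for R, t in VIEWS]

def cost(shift):
    """Feature discrepancy for one world-frame pose, summed over views."""
    return sum(np.sum((feat(project(PTS + shift, R, t)) - f_obs) ** 2)
               for (R, t), f_obs in zip(VIEWS, OBS))

def refine(shift, lr=5e-5, iters=500, eps=1e-5):
    """Translation-only refinement via finite-difference gradient descent
    (a crude stand-in for a differentiable-rendering-based optimizer)."""
    shift = shift.copy()
    for _ in range(iters):
        grad = np.array([(cost(shift + eps * e) - cost(shift - eps * e))
                         / (2 * eps) for e in np.eye(3)])
        shift -= lr * grad
    return shift

c0 = cost(np.zeros(3))   # cost at the identity initialization
est = refine(np.zeros(3))
```

Because one pose variable is shared by every view, depth errors that are unobservable in a single view are penalized by the other views' feature residuals, which is the intuition behind the multi-view formulation.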