🤖 AI Summary
Existing methods for single-image 3D model alignment rely on supervised training with category- and pose-level annotations, which are scarce and limit these methods to a narrow set of object categories. Method: We propose the first weakly supervised framework, constructing a foundation-model-driven joint geometric-semantic feature space; introducing multi-view consistency constraints and a self-supervised triplet loss to mitigate symmetry ambiguity; and designing a texture-invariant, normalized-coordinate-based dense alignment mechanism for precise 9-DoF pose estimation. Results: On ScanNet25k, our method outperforms the prior state-of-the-art weakly supervised approach by +4.3% mean alignment accuracy and, for the first time, surpasses the supervised method ROCA by +2.7% without any pose annotations. On our newly introduced cross-domain benchmark SUN2CAD, it achieves state-of-the-art performance across 20 unseen CAD categories, demonstrating strong zero-shot generalization.
📝 Abstract
One practical approach to inferring 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space from foundation features, enforcing multi-view consistency and overcoming the symmetry ambiguities inherent in those features via a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA, by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, on which our method achieves SOTA results without prior training on them.
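To make the symmetry-handling idea concrete: a triplet loss of the kind the abstract describes pulls an anchor feature toward a multi-view-consistent positive and pushes it away from the feature of a (near-)symmetric counterpart, so that, e.g., the front and back of a symmetric chair no longer map to the same descriptor. The following is a minimal NumPy sketch, not the paper's implementation; the Euclidean distance, the margin value, and the function name are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hedged sketch of a self-supervised triplet loss for symmetry
    disambiguation (margin and metric are illustrative assumptions):
    - anchor:   feature of an object point in one view
    - positive: feature of the same point in another view
                (multi-view consistency target)
    - negative: feature of its symmetric counterpart, which raw
                foundation features tend to confuse with the anchor
    All inputs are L2-normalized feature vectors."""
    d_pos = np.linalg.norm(anchor - positive)  # same point, other view
    d_neg = np.linalg.norm(anchor - negative)  # symmetric counterpart
    # Hinge: penalize only when the symmetric counterpart is not at
    # least `margin` farther from the anchor than the true match.
    return max(0.0, d_pos - d_neg + margin)
```

Minimizing this loss over many such triplets yields features that stay consistent across views while separating symmetric surface points, which is what later enables unambiguous dense alignment in normalized object coordinates.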