🤖 AI Summary
This work addresses single-view 3D shape retrieval under occlusion, a setting in which existing methods offer limited interpretability and robustness. The authors propose a pose-aware analysis-by-synthesis framework that formulates retrieval as a feature-level reconstruction task: a 3D encoder is trained by distilling knowledge from a 2D foundation model (DINOv3), and during inference, shape and pose are jointly optimized so that the reconstructed features align with local features extracted from the input image. The approach unifies interpretable and robust 3D shape retrieval, pose estimation, and category classification within a single framework. Experiments show substantial improvements over existing methods on both occluded and clean datasets, with consistent gains in retrieval accuracy, pose estimation, and classification.
📝 Abstract
Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those that use contrastive learning to map point cloud features into existing vision-language spaces, and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and their generalization in real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR outperforms existing methods by a wide margin on both clean and occluded 3D shape retrieval datasets. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
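The joint search over shape and pose described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that patch-level features for every candidate shape have already been synthesized at a small, discrete set of poses, and it replaces the test-time optimization with an exhaustive grid search. The function name `retrieve_shape_and_pose`, the `keep_fraction` parameter, and the trimmed-mean score are hypothetical choices made only to show how a patch-level reconstruction score can tolerate occluded patches.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_shape_and_pose(target, rendered, keep_fraction=0.7):
    """Pick the (shape, pose) pair whose features best reconstruct the image.

    target:   (N, D) patch features extracted from the input image.
    rendered: (S, P, N, D) synthesized patch features for S candidate shapes
              at P candidate poses (assumed to be precomputed).
    keep_fraction: share of best-matching patches used in the score
              (hypothetical robustness heuristic, not from the paper).
    """
    t = l2_normalize(target)
    r = l2_normalize(rendered)

    # Per-patch cosine similarity for every (shape, pose) candidate: (S, P, N).
    sim = np.einsum("spnd,nd->spn", r, t)

    # Trimmed mean: average only the k most similar patches, so patches
    # hidden by an occluder contribute little to the score.
    k = max(1, int(round(keep_fraction * sim.shape[-1])))
    top_k = np.sort(sim, axis=-1)[..., -k:]
    scores = top_k.mean(axis=-1)  # (S, P)

    shape_idx, pose_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return int(shape_idx), int(pose_idx), scores


# Toy data: 3 shapes, 4 poses, 16 patches, 8-dimensional features.
rng = np.random.default_rng(0)
rendered = rng.normal(size=(3, 4, 16, 8))

# The "image" shows shape 2 at pose 1, with 4 of 16 patches occluded.
target = rendered[2, 1].copy()
target[:4] = rng.normal(size=(4, 8))

best_shape, best_pose, scores = retrieve_shape_and_pose(target, rendered)
print(best_shape, best_pose)
```

In this toy setup the search recovers shape 2 at pose 1 even though a quarter of the patches are replaced by noise, because the unoccluded patches still match exactly. A continuous pose parameterization with gradient-based refinement would be closer to the test-time optimization the abstract describes, but the details of that procedure are not given here.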