🤖 AI Summary
Single-image 3D object pose and shape recovery faces fundamental challenges, including severe occlusion, depth ambiguity, large intra- and inter-class shape variation, and the scarcity of real-world 3D annotations. To address these, we propose SDFit, the first training-free iterative fitting framework based on a morphable signed distance field (mSDF). Our method initializes the implicit shape using cross-modal embeddings from open-world foundation models (e.g., OpenShape), then jointly refines pose and shape via render-and-compare optimization and pixel-level 2D/3D feature matching to achieve geometric and appearance alignment. Crucially, it eliminates reliance on supervised training, enabling zero-shot generalization to unseen categories and real-world scenes. Evaluated on Pix3D and Pascal3D+, our approach performs roughly on par with state-of-the-art supervised methods, making it the first work to jointly address pose estimation, shape reconstruction, and pixel-level alignment without any training.
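The pixel-level 2D/3D feature matching step can be illustrated with a toy sketch. The names and data here are hypothetical: in the actual method, per-pixel and per-vertex features would come from foundation models, whereas this example plants known correspondences in random features so the cosine-similarity matching is verifiable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: per-vertex shape features and per-pixel image
# features with planted ground-truth correspondences plus small noise.
d = 16
shape_feats = rng.normal(size=(100, d))
true_match = rng.integers(0, 100, size=30)        # ground-truth vertex per pixel
pixel_feats = shape_feats[true_match] + 0.05 * rng.normal(size=(30, d))

def match_2d_to_3d(pixel_feats, shape_feats):
    # Cosine similarity: L2-normalize both sets, then pick the
    # best-scoring 3D vertex for each 2D pixel feature.
    a = pixel_feats / np.linalg.norm(pixel_feats, axis=1, keepdims=True)
    b = shape_feats / np.linalg.norm(shape_feats, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)

pred = match_2d_to_3d(pixel_feats, shape_feats)
print((pred == true_match).mean())  # near-perfect recovery at this noise level
```

With low feature noise, nearest-neighbor matching in the shared embedding space recovers the planted correspondences, which is the property the fitting stage relies on.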
📝 Abstract
We focus on recovering 3D object pose and shape from single images. This is highly challenging due to strong (self-)occlusions, depth ambiguities, the enormous shape variance, and lack of 3D ground truth for natural images. Recent work relies mostly on learning from finite datasets, so it struggles to generalize, while it focuses mostly on the shape itself, largely ignoring the alignment with pixels. Moreover, it performs feed-forward inference, so it cannot refine estimates. We tackle these limitations with a novel framework, called SDFit. To this end, we make three key observations: (1) Learned signed-distance-function (SDF) models act as a strong morphable shape prior. (2) Foundation models embed 2D images and 3D shapes in a joint space, and (3) also infer rich features from images. SDFit exploits these as follows. First, it uses a category-level morphable SDF (mSDF) model, called DIT, to generate 3D shape hypotheses. This mSDF is initialized by querying OpenShape's latent space conditioned on the input image. Then, it computes 2D-to-3D correspondences by extracting and matching features from the image and mSDF. Lastly, it fits the mSDF to the image in a render-and-compare fashion, to iteratively refine estimates. We evaluate SDFit on the Pix3D and Pascal3D+ datasets of real-world images. SDFit performs roughly on par with state-of-the-art learned methods, but, uniquely, requires no re-training. Thus, SDFit is promising for generalizing in the wild, paving the way for future research. Code will be released.