CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

📅 2025-10-23

🤖 AI Summary

This paper addresses 3D reconstruction from a single 2D image, jointly estimating camera pose, 3D shape, and texture. It proposes Cupid, a two-stage flow-matching pipeline that represents both camera pose and 3D shape as a distribution in a shared 3D latent space. A coarse stage produces initial geometry together with the 2D projections used for pose recovery; a refinement stage integrates pose-aligned image features to improve structural fidelity and appearance detail. To support this, the method jointly generates voxels and pixel-voxel correspondences. The reported results are a PSNR gain of over 3 dB and a Chamfer Distance reduction of over 10% relative to leading 3D reconstruction methods, with pose accuracy on par with monocular estimators and better visual fidelity than baseline 3D generative models.
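To make the two reported metrics concrete, here is a minimal NumPy sketch of PSNR and a symmetric Chamfer Distance. It is not the paper's evaluation code: the function names, the [0, 1] image range, and the squared-distance Chamfer convention are assumptions chosen for illustration, and the paper may use a different variant.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixel values in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Assumed convention: mean squared nearest-neighbour distance, summed over
    both directions. Other conventions (unsquared, averaged) are also common.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Pairwise squared distances, shape (N, M). Fine for small point sets.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Hypothetical toy inputs, only to exercise the functions.
image_ref = np.zeros((8, 8))
image_out = np.full((8, 8), 0.1)
points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])

psnr_db = psnr(image_out, image_ref)
cd = chamfer_distance(points_a, points_b)
```

A constant error of 0.1 on a [0, 1] image gives an MSE of 0.01, which is 20 dB. Because PSNR is logarithmic in the MSE, a 3 dB gain corresponds to roughly halving the mean squared error.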

📝 Abstract

This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as conditional sampling from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both the input camera pose and the 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate that Cupid outperforms leading 3D reconstruction methods, with a PSNR gain of over 3 dB and a Chamfer Distance reduction of over 10%, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
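The "conditional sampling" in a flow matching pipeline can be pictured as integrating an ODE from noise (t = 0) to a sample (t = 1) along a velocity field. The sketch below is a toy illustration, not Cupid's model: `toy_velocity` is a hypothetical closed-form stand-in for a learned, image-conditioned network, the `target` vector stands in for whatever latent the condition selects, and the Euler integrator and step count are assumptions.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For straight-line paths x_t = (1 - t) * x0 + t * target, the velocity
    that carries the current point x to target by t = 1 is
    (target - x) / (1 - t).
    """
    return (target - x) / (1.0 - t)

def sample_flow(velocity, x0, target, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = np.array(x0, dtype=np.float64)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)             # starting point drawn from noise
target = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical conditioned latent
sample = sample_flow(toy_velocity, noise, target)
```

In a two-stage design like the one described, a loop of this kind would presumably run once per stage, each with its own velocity network and conditioning; that reading is an assumption based on the abstract, which does not specify the sampler.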

Problem

Research questions and friction points this paper is trying to address.

Infer camera pose, 3D shape, and texture from a single image

Jointly generate voxels and pixel-voxel correspondences for reconstruction

Enhance structural fidelity and appearance details through pose alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative 3D reconstruction from a single image

Two-stage flow matching for pose and shape

Joint voxel and correspondence generation framework
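
Once pixel-voxel correspondences are available, camera pose can be recovered with a classical 2D-3D solver. The sketch below uses the Direct Linear Transform (DLT) as one standard option; the source does not say which solver Cupid uses, so this is a swapped-in technique for illustration. The function names, the known intrinsics `K`, and the synthetic noise-free data are all assumptions.

```python
import numpy as np

def to_homogeneous(points):
    points = np.asarray(points, dtype=np.float64)
    return np.hstack([points, np.ones((points.shape[0], 1))])

def project(P, points_3d):
    """Project 3D points with a 3x4 matrix P and return pixel coordinates."""
    x = to_homogeneous(points_3d) @ P.T
    return x[:, :2] / x[:, 2:3]

def estimate_projection_dlt(points_3d, points_2d):
    """Estimate P (up to scale) from at least 6 non-coplanar correspondences."""
    Xh = to_homogeneous(points_3d)
    uv = np.asarray(points_2d, dtype=np.float64)
    A = np.zeros((2 * Xh.shape[0], 12))
    # Each correspondence yields two linear constraints on the entries of P:
    # p1.X - u * p3.X = 0 and p2.X - v * p3.X = 0.
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -uv[:, 0:1] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -uv[:, 1:2] * Xh
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 4)

def pose_from_projection(P, K):
    """Split P into rotation R and translation t, assuming K is known."""
    M = np.linalg.inv(K) @ P           # equals scale * [R | t]
    scale = np.cbrt(np.linalg.det(M[:, :3]))  # det(scale * R) = scale**3
    return M[:, :3] / scale, M[:, 3] / scale

# Synthetic, noise-free setup (all values hypothetical).
rng = np.random.default_rng(0)
voxels = rng.uniform(-1.0, 1.0, size=(20, 3))
K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([0.1, -0.2, 4.0])
pixels = project(K @ np.hstack([R_true, t_true[:, None]]), voxels)

P_est = estimate_projection_dlt(voxels, pixels)
R_est, t_est = pose_from_projection(P_est, K)
reproj_err = np.abs(project(P_est, voxels) - pixels).max()
```

With noisy correspondences, a practical pipeline would typically normalize the coordinates, wrap the solver in RANSAC, and project `R_est` back onto a valid rotation; those steps are omitted here to keep the sketch short.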
