🤖 AI Summary
This work addresses the limitations of conventional decoupled 3D generation methods, whose reliance on a canonical space leads to pose misalignment and spatial inconsistency. The authors propose PAD (Pose-Aware Diffusion), an end-to-end diffusion framework that dispenses with canonical space entirely and generates 3D geometry directly in the observation space. PAD back-projects monocular depth into a partial point cloud and injects it as a 3D geometric anchor, so the generated geometry is pose-aligned by construction rather than rotated into place afterward. The method supports high-fidelity, spatially consistent reconstruction of both single objects and multi-object scenes, and the authors report that it outperforms existing approaches in geometric alignment accuracy and image-to-3D correspondence.
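To make the back-projection step concrete, here is a minimal sketch of turning a depth map into a partial point cloud. It assumes a standard pinhole camera with known intrinsics (fx, fy, cx, cy) and an optional object mask; the function name and all parameters are hypothetical, and the source does not say how PAD obtains intrinsics or how the resulting points are injected into the diffusion model.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map of shape (H, W) into camera-space points (N, 3).

    Assumes a pinhole camera model; this is an illustrative sketch,
    not the PAD implementation.
    """
    h, w = depth.shape
    # u holds pixel column indices, v holds pixel row indices, both (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Keep only pixels with a finite, positive depth (and inside the mask).
    valid = np.isfinite(depth) & (depth > 0)
    if mask is not None:
        valid &= mask.astype(bool)
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```

Because the points stay in the camera frame, they carry the observed pose of the object directly, which is what lets them act as a spatial anchor without a separate canonical-to-camera transform.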
📝 Abstract
Generating pose-aligned 3D objects is challenging due to the spatial mismatches and transformation ambiguities inherent in decoupled canonical-then-rotate paradigms. To this end, we introduce Pose-Aware Diffusion (PAD), a novel end-to-end diffusion framework that synthesizes 3D geometry directly within the observation space. By unprojecting monocular depth into a partial point cloud and explicitly injecting it as a 3D geometric anchor, PAD abandons canonical assumptions to enforce rigorous spatial supervision. This native generation intrinsically resolves pose ambiguity, producing high-fidelity pose-aligned assets. Extensive experiments demonstrate that PAD achieves superior geometric alignment and image-to-3D correspondence compared to state-of-the-art methods. Additionally, PAD naturally extends to compositional 3D scene reconstruction via a simple union of independently generated objects, highlighting its robust ability to preserve precise spatial layouts.
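The "simple union" used for scene composition can be sketched as follows. The sketch assumes each generated object is available as an (N_i, 3) array of points already expressed in the shared observation (camera) frame; the function name, the point-set representation, and the per-point labels are hypothetical choices for illustration, since the abstract does not specify the output representation.

```python
import numpy as np

def compose_scene(object_points):
    """Merge per-object point sets into one scene.

    object_points: non-empty list of arrays, each of shape (N_i, 3),
    all in the same observation frame. No per-object rotation or
    translation is applied; the union is a plain concatenation.
    """
    points = np.concatenate(object_points, axis=0)
    # Integer label per point recording which object it came from.
    labels = np.concatenate(
        [np.full(len(p), i) for i, p in enumerate(object_points)]
    )
    return points, labels
```

The point of the sketch is what it omits: when every object is generated in the same observation space, no pose estimation or registration step is needed before merging, whereas a canonical-then-rotate pipeline would need a per-object transform at this stage.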