AI Summary
Current 3D-native generative models achieve notable progress in geometric modeling but suffer from suboptimal appearance fidelity due to the scarcity of high-fidelity real-world texture data, caused by limited scanning resolution, non-rigid deformations, and scene-scale variability. To address this, we propose a structure-aligned multi-view synthesis framework: (1) leveraging GPT-4o to synthesize high-quality, multi-view, semantically consistent images for constructing a detail-enhanced training set; (2) introducing perceptual feature adaptation and explicit semantic-structure matching to jointly optimize geometric consistency and texture realism. Our method supports both geometry-texture coupled and decoupled generation paradigms, ensuring strong generalization. Experiments demonstrate state-of-the-art performance across multiple 3D generation benchmarks, with significant improvements in texture richness and cross-view consistency.
Abstract
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short of realistic appearances. A key obstacle is the lack of diverse, high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, the non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, driven by image data generated by the GPT-4o-Image model. Because the generated images lack multi-view consistency and can therefore distort 3D structures, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on this dataset, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving structural consistency with the 3D-native geometry. Our scheme applies to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of both geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
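To make the detail enhancement scheme concrete, the snippet below is a minimal sketch of how a perceptual feature-adaptation term could be combined with a semantic structure-matching term during fine-tuning. It is not the authors' implementation: the class name, the frozen feature encoder, the L1/MSE distances, the weight `lambda_struct`, and the inputs (rendered views, detail-enhanced target views, and structure maps) are all illustrative assumptions.

```python
# Hypothetical sketch of a detail-enhancement objective combining
# perceptual feature adaptation with semantic structure matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailEnhancementLoss(nn.Module):
    def __init__(self, feature_extractor: nn.Module, lambda_struct: float = 1.0):
        super().__init__()
        # Frozen perceptual encoder (e.g., a pretrained vision backbone).
        self.encoder = feature_extractor.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.lambda_struct = lambda_struct

    def forward(self, rendered_views, target_views, rendered_struct, geometry_struct):
        # Perceptual feature adaptation: pull rendered appearance toward the
        # detail-enhanced multi-view targets in feature space.
        feat_render = self.encoder(rendered_views)
        feat_target = self.encoder(target_views)
        loss_percep = F.l1_loss(feat_render, feat_target)

        # Semantic structure matching: keep structure cues rendered from the
        # generator (e.g., normal or semantic maps) consistent with those of
        # the 3D-native geometry.
        loss_struct = F.mse_loss(rendered_struct, geometry_struct)

        return loss_percep + self.lambda_struct * loss_struct
```

In this sketch, the first term adapts the generator's appearance toward the detail-enhanced multi-view targets, while the second penalizes drift from the native geometry; the balance between the two is what lets texture realism improve without sacrificing structural consistency.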