AI Summary
Current 3D-native generative models achieve notable progress in geometric modeling but suffer from suboptimal appearance fidelity due to the scarcity of high-fidelity real-world texture data, caused by limited scanning resolution, non-rigid deformations, and scene-scale variability. To address this, we propose a structure-aligned multi-view synthesis framework: (1) leveraging GPT-4o to synthesize high-quality, multi-view, semantically consistent images for constructing a detail-enhanced training set; (2) introducing perceptual feature adaptation and explicit semantic-structure matching to jointly optimize geometric consistency and texture realism. Our method supports both geometry-texture coupled and decoupled generation paradigms, ensuring strong generalization. Experiments demonstrate state-of-the-art performance across multiple 3D generation benchmarks, with significant improvements in texture richness and cross-view consistency.
Abstract
Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short of realistic appearances. A key obstacle is the lack of diverse, high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, the non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, driven by image data generated by the GPT-4o-Image model. Because the generated images lack multi-view consistency and can therefore distort 3D structures, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on this dataset, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving structural consistency with the 3D-native geometry. Our scheme applies to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of both geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.
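To make the detail enhancement scheme concrete, the snippet below is a minimal sketch of how a perceptual feature-adaptation term could be combined with a semantic structure-matching term during fine-tuning. It is not the authors' implementation: the class name, the frozen feature encoder, the L1/MSE distances, the weight `lambda_struct`, and the inputs (rendered views, detail-enhanced target views, and structure maps) are all illustrative assumptions.

```python
# Hypothetical sketch of a detail-enhancement objective combining
# perceptual feature adaptation with semantic structure matching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailEnhancementLoss(nn.Module):
    def __init__(self, feature_extractor: nn.Module, lambda_struct: float = 1.0):
        super().__init__()
        # Frozen perceptual encoder (e.g., a pretrained vision backbone).
        self.encoder = feature_extractor.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.lambda_struct = lambda_struct

    def forward(self, rendered_views, target_views, rendered_struct, geometry_struct):
        # Perceptual feature adaptation: pull rendered appearance toward the
        # detail-enhanced multi-view targets in feature space.
        feat_render = self.encoder(rendered_views)
        feat_target = self.encoder(target_views)
        loss_percep = F.l1_loss(feat_render, feat_target)

        # Semantic structure matching: keep structure cues rendered from the
        # generator (e.g., normal or semantic maps) consistent with those of
        # the 3D-native geometry.
        loss_struct = F.mse_loss(rendered_struct, geometry_struct)

        return loss_percep + self.lambda_struct * loss_struct
```

In this sketch, the first term adapts the generator's appearance toward the detail-enhanced multi-view targets, while the second penalizes drift from the native geometry; the balance between the two is what lets texture realism improve without sacrificing structural consistency.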