🤖 AI Summary
This work addresses the “Janus problem” in text-to-3D generation, where frontal views appear plausible but geometry seen from other viewpoints is duplicated or distorted, through a training-free optimization applied at the sampling stage. The core idea is a structural energy function defined in the PCA subspace of intermediate U-Net features, whose gradients are injected into the denoising trajectory to enforce multi-view geometric consistency. Integrated into SDS/VSD (Score Distillation Sampling / Variational Score Distillation) frameworks, the method regulates 3D structure during diffusion sampling without modifying model weights. The authors report significant suppression of Janus artifacts, improved cross-view geometric alignment, and better structural fidelity. Because it operates entirely at sampling time and requires no fine-tuning, it offers a lightweight, plug-and-play route to more consistent text-to-3D synthesis.
📝 Abstract
Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.
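The mechanism described above (an energy defined in a PCA subspace of features, with its gradient used to steer the sample) can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the abstract does not give the exact form of the structural energy, so a simple quadratic distance to a reference projection is assumed here, and the names `pca_basis`, `structural_energy`, `energy_grad`, and `guided_step` are hypothetical. In SEGS the gradient would be taken through the U-Net with respect to the noisy latent; the sketch instead treats the feature matrix itself as the variable being updated.

```python
import numpy as np

def pca_basis(feats, k):
    """Return the feature mean and the top-k principal directions.

    feats: array of shape (n_tokens, n_channels), standing in for
    intermediate U-Net features (an assumption for this sketch).
    """
    mean = feats.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    return mean, vt[:k].T  # basis has shape (n_channels, k), orthonormal columns

def structural_energy(feats, mean, basis, target):
    """Assumed energy: half the squared distance to `target` in the PCA subspace."""
    resid = (feats - mean) @ basis - target
    return 0.5 * float(np.sum(resid ** 2))

def energy_grad(feats, mean, basis, target):
    """Analytic gradient of the assumed energy with respect to `feats`."""
    resid = (feats - mean) @ basis - target
    return resid @ basis.T

def guided_step(feats, mean, basis, target, step_size):
    """One gradient-injection step: move features downhill on the energy."""
    return feats - step_size * energy_grad(feats, mean, basis, target)

# Toy usage: random "reference" features define the subspace and the target.
rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 16))   # stand-in for features of the intended view
mean, basis = pca_basis(reference, k=4)
target = (reference - mean) @ basis     # reference structure in the subspace

current = rng.normal(size=(64, 16))     # stand-in for features of the current sample
energy_before = structural_energy(current, mean, basis, target)
for _ in range(10):
    current = guided_step(current, mean, basis, target, step_size=0.5)
energy_after = structural_energy(current, mean, basis, target)
print(energy_before, energy_after)
```

Because the basis columns are orthonormal, each step in this toy only changes the component of the features that lies inside the PCA subspace and leaves the orthogonal complement untouched. That loosely mirrors the stated goal of steering structure while preserving appearance, although the mapping from subspace to "structure" and complement to "appearance" is an assumption of the sketch, not a claim from the abstract.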