🤖 AI Summary
Problem: When training-free guidance techniques are applied within Score Distillation Sampling (SDS) for text-to-3D generation, tuning them involves trade-offs: settings that enhance texture fidelity tend to induce geometric distortions, while settings that improve surface smoothness tend to change the object's scale.
Method: We systematically investigate how two training-free guidance techniques, Classifier-Free Guidance (CFG) and FreeU, behave within the SDS framework, analyzing how their effects on scale consistency, surface smoothness, texture fidelity, and geometric accuracy vary over time. Based on this analysis, we propose a dynamic, temporally adaptive scaling strategy that adjusts guidance strength according to the diffusion timestep or optimization iteration.
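The summary does not spell out the schedule itself. As a minimal sketch, assuming a simple linear interpolation in the diffusion timestep, per-step scaling of both techniques could look like the following. `GuidanceScales`, `scales_for_timestep`, and all endpoint values are hypothetical placeholders rather than the paper's settings, and the direction of each ramp is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidanceScales:
    """Training-free guidance settings for one SDS step (hypothetical container)."""
    cfg: float      # classifier-free guidance scale
    freeu_b: float  # FreeU backbone-feature scale
    freeu_s: float  # FreeU skip-connection-feature scale


def lerp(start: float, end: float, frac: float) -> float:
    """Linearly interpolate from start (frac=0) to end (frac=1)."""
    return start + (end - start) * frac


def scales_for_timestep(t: int, t_min: int, t_max: int,
                        high_t: GuidanceScales,
                        low_t: GuidanceScales) -> GuidanceScales:
    """Map a diffusion timestep to guidance scales.

    At t == t_max the result equals `high_t`; at t == t_min it equals
    `low_t`; in between it is a linear blend. The linear shape and the
    endpoint values are illustrative assumptions, not the paper's schedule.
    """
    if not t_min <= t <= t_max:
        raise ValueError("timestep out of range")
    frac = (t_max - t) / (t_max - t_min)  # 0 at t_max, 1 at t_min
    return GuidanceScales(
        cfg=lerp(high_t.cfg, low_t.cfg, frac),
        freeu_b=lerp(high_t.freeu_b, low_t.freeu_b, frac),
        freeu_s=lerp(high_t.freeu_s, low_t.freeu_s, frac),
    )


# Placeholder endpoints (hypothetical); tune per model and prompt.
HIGH_T = GuidanceScales(cfg=100.0, freeu_b=1.0, freeu_s=1.0)
LOW_T = GuidanceScales(cfg=7.5, freeu_b=1.2, freeu_s=0.9)
scales = scales_for_timestep(500, 20, 980, HIGH_T, LOW_T)
```

Driving the schedule by optimization iteration instead, the other option the abstract mentions, would only change how `frac` is computed (for example, iteration index divided by the total number of iterations).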
Contribution/Results: Without additional training or architectural modifications, the method strikes a better balance between texture detail and surface smoothness. It preserves object scale while suppressing surface noise and geometric defects. Experiments on zero-shot text-to-3D synthesis show improved generation quality, offering an efficient, controllable, training-free approach to 3D generation.
📝 Abstract
Recent studies show that simple training-free techniques, e.g., Classifier-Free Guidance (CFG) or FreeU, can dramatically improve the quality of text-to-2D generation outputs. However, these training-free techniques remain underexplored through the lens of Score Distillation Sampling (SDS), a popular and effective technique for leveraging pretrained text-to-2D diffusion models in various tasks. In this paper, we aim to shed light on the effect these training-free techniques have on SDS, focusing on the application of text-to-3D generation via 2D lifting. Our findings show that varying the CFG scale presents a trade-off between object size and surface smoothness, while varying the FreeU scales presents a trade-off between texture details and geometric errors. Based on these findings, we provide insights into how to harness training-free techniques effectively for SDS: by scaling them dynamically with respect to the timestep or the optimization iteration step. We show that our proposed scheme strikes a favorable balance between texture details and surface smoothness in text-to-3D generation, while preserving the size of the output and mitigating the occurrence of geometric defects.
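To make concrete where the CFG scale enters SDS, here is a minimal NumPy sketch of the standard CFG-guided SDS gradient in the DreamFusion formulation. It assumes the denoiser's unconditional and text-conditional noise predictions are already available as arrays; the function names and the weighting w(t) = 1 - alpha_bar_t are illustrative choices, not the paper's code. FreeU is not modeled here: it rescales backbone and skip-connection features inside the U-Net, so it would alter `eps_uncond` and `eps_cond` before they reach these functions.

```python
import numpy as np


def cfg_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray,
              scale: float) -> np.ndarray:
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, beyond) the text-conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def sds_grad(eps_uncond: np.ndarray, eps_cond: np.ndarray,
             noise: np.ndarray, cfg_scale: float,
             alpha_bar_t: float) -> np.ndarray:
    """SDS gradient with respect to the rendered image or latent x.

    `noise` is the Gaussian noise that was added to x to form x_t.
    The gradient with respect to the 3D parameters is this array
    back-propagated through the renderer (dx/dtheta), which autodiff
    handles in practice. w(t) = 1 - alpha_bar_t is one common weighting
    (an assumption here); other choices exist.
    """
    w_t = 1.0 - alpha_bar_t
    return w_t * (cfg_noise(eps_uncond, eps_cond, cfg_scale) - noise)


# Toy usage: random arrays stand in for the U-Net's noise predictions.
rng = np.random.default_rng(0)
shape = (4, 64, 64)  # placeholder latent shape (C, H, W)
eps_u, eps_c, noise = (rng.standard_normal(shape) for _ in range(3))
grad = sds_grad(eps_u, eps_c, noise, cfg_scale=100.0, alpha_bar_t=0.5)
```

Because `cfg_scale` is an ordinary argument, a timestep-dependent schedule can supply a different value at each SDS step.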