🤖 AI Summary
Existing text-to-3D generation methods struggle to align their outputs with subjective human preferences, resulting in suboptimal aesthetic quality and limited controllability. This work introduces direct preference optimization (DPO) to text-to-3D generation for the first time. We design a fine-grained reward function, based on a large multimodal model (LMM), that operates on preference pairs rather than scalar scores to guide optimization. Integrated with differentiable 3D representations (NeRF and 3D Gaussian Splatting), our framework enables end-to-end preference alignment without relying on precise human annotations, which makes preference learning more natural and robust. Extensive evaluations across multiple benchmarks demonstrate significant improvements over state-of-the-art methods: the generated 3D models exhibit enhanced visual fidelity, stronger semantic consistency with the input text, and superior controllability for users. To foster reproducibility and further research, we release both code and pretrained models.
📝 Abstract
Text-to-3D generation automates 3D content creation from textual descriptions, offering transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process through direct preference optimization. Concretely, DreamDPO first constructs pairwise examples, then compares their alignment with human preferences using reward models or large multimodal models, and finally optimizes the 3D representation with a preference-driven loss function. By using pairwise comparisons to reflect preferences, DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves competitive results and produces higher-quality, more controllable 3D content than existing methods. The code and models will be open-sourced.
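The three-step loop the abstract describes (construct a pair of candidates, rank the pair with a reward signal, update toward the preferred candidate) can be illustrated with a toy sketch. Everything below is a simplification and an assumption, not DreamDPO's actual method: `reward` stands in for the reward model or LMM scorer, the vector `theta` stands in for a differentiable 3D representation such as NeRF or 3D Gaussian Splatting, and `preference_step` is a hypothetical helper showing only the pairwise-preference idea.

```python
import numpy as np

def reward(candidate, target):
    # Toy stand-in for a reward/LMM scorer: negative squared distance
    # to a "preferred" target embedding (higher is better).
    return -float(np.sum((candidate - target) ** 2))

def preference_step(theta, target, lr=0.1, sigma=0.5, rng=None):
    """One preference-driven update (illustrative only):
    perturb theta twice to form a pair, rank the pair with the reward,
    then move toward the winner and away from the loser."""
    rng = rng if rng is not None else np.random.default_rng()
    a = theta + sigma * rng.standard_normal(theta.shape)
    b = theta + sigma * rng.standard_normal(theta.shape)
    win, lose = (a, b) if reward(a, target) >= reward(b, target) else (b, a)
    # Pairwise signal: only the ranking of the two samples is used,
    # never an absolute pointwise quality score.
    return theta + lr * ((win - theta) - 0.5 * (lose - theta))

rng = np.random.default_rng(0)
theta = np.zeros(4)          # toy "3D representation" parameters
target = np.ones(4)          # toy preference target
for _ in range(200):
    theta = preference_step(theta, target, rng=rng)
```

Note how the update never needs a calibrated score for either candidate on its own; as the abstract argues, only their relative ranking matters, which is what reduces the reliance on precise pointwise quality evaluations.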