🤖 AI Summary
Current text-to-3D generation methods suffer from two key limitations: insufficient fine-grained semantic alignment, failing to capture prompt details, and weak 3D spatial understanding, leading to geometric inconsistencies and erroneous part assembly. To address these, we propose VLM3D, the first framework to repurpose large vision-language models (VLMs) as differentiable joint semantic-spatial critics. It introduces a dual-query Yes/No log-odds mechanism that jointly optimizes semantic fidelity and multi-view geometric consistency through a unified gradient signal. VLM3D is architecture-agnostic, integrating seamlessly into both optimization-based and feed-forward generation paradigms without modifying backbone networks. On standard benchmarks, it significantly outperforms state-of-the-art methods, effectively correcting part misalignment and structural anomalies. Our approach yields synergistic improvements in fine-grained semantic alignment and geometric consistency, establishing a new paradigm for semantic-aware 3D generation.
📝 Abstract
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, spanning both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they achieve only coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes/No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) as a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks; (2) as a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled, generalizable path for injecting the VLM's rich, language-grounded understanding of semantics and space into diverse 3D generative pipelines.
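The dual-query log-odds signal described above can be sketched in a few lines: for each query (one probing semantic fidelity, one probing spatial/geometric coherence), the VLM's logits for the "Yes" and "No" answer tokens are reduced to a log-odds score, and the two scores are combined into a single reward. This is a minimal illustrative sketch only; the function names, the equal weighting, and the plain-float interface are assumptions, not the paper's implementation. In practice the logits would come from a VLM forward pass on rendered views, and the reward would be backpropagated through a differentiable renderer to the 3D representation.

```python
def log_odds_reward(logit_yes: float, logit_no: float) -> float:
    """Log-odds of answering "Yes" vs. "No".

    Under a softmax restricted to the two answer tokens,
    log p(Yes) - log p(No) reduces exactly to the logit difference,
    so the score is differentiable w.r.t. the VLM logits.
    """
    return logit_yes - logit_no


def dual_query_reward(semantic_logits, spatial_logits,
                      w_sem: float = 0.5, w_spa: float = 0.5) -> float:
    """Combine the semantic-fidelity and spatial-coherence queries.

    Each argument is a (logit_yes, logit_no) pair; the equal default
    weights are an illustrative assumption.
    """
    r_sem = log_odds_reward(*semantic_logits)
    r_spa = log_odds_reward(*spatial_logits)
    return w_sem * r_sem + w_spa * r_spa
```

In an optimization-based pipeline this scalar would serve as the reward objective; in a feed-forward pipeline its gradient could steer each iterative sampling step.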