Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

📅 2025-11-18
🤖 AI Summary
Current text-to-3D generation methods suffer from two key limitations: insufficient fine-grained semantic alignment—failing to capture prompt details—and weak 3D spatial understanding, leading to geometric inconsistencies and erroneous part assembly. To address these, we propose VLM3D, the first framework that repurposes large vision-language models (VLMs) as differentiable semantic-spatial joint critics. It introduces a dual-query Yes/No log-odds mechanism to jointly optimize semantic fidelity and multi-view geometric consistency via unified gradient signals. VLM3D is architecture-agnostic, seamlessly integrating into both optimization-based and feed-forward generation paradigms without modifying backbone networks. On standard benchmarks, it significantly outperforms state-of-the-art methods, effectively correcting part misalignment and structural anomalies. Our approach achieves synergistic improvements in fine-grained semantic alignment and geometric consistency, establishing a new paradigm for semantic-aware 3D generation.

📝 Abstract
Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for optimization-based pipelines, VLM3D significantly outperforms existing methods on standard benchmarks. (2) As a test-time guidance module for feed-forward pipelines, it actively steers the iterative sampling process of SOTA native 3D models to correct severe spatial errors. VLM3D establishes a principled and generalizable path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
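The dual-query critic described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the stub `vlm_yes_no_logits`, and its toy logit values are all hypothetical. The core idea is that each query's reward is the log-odds of the VLM answering "Yes" versus "No", and the semantic and spatial rewards are combined into one scalar signal (which, with a real VLM, would be differentiable with respect to the rendered views).

```python
# Hypothetical sketch of VLM3D's dual-query Yes/No log-odds critic.
# All names and values here are illustrative assumptions, not the paper's API.

def log_odds(logit_yes: float, logit_no: float) -> float:
    """Log-odds of answering 'Yes' vs 'No'.
    Equals log(p_yes / p_no) after a softmax over the two answer tokens."""
    return logit_yes - logit_no

def vlm_yes_no_logits(rendered_view, query: str):
    """Stand-in for a real VLM forward pass that returns the logits of the
    'Yes' and 'No' answer tokens for a rendered view and a query.
    Toy values only, for illustration."""
    toy = {
        "semantic": (2.0, -1.0),  # e.g. "Does this view match the prompt?"
        "spatial":  (0.5,  0.8),  # e.g. "Is the geometry consistent?"
    }
    return toy[query]

def dual_query_reward(rendered_view, w_sem: float = 1.0, w_spa: float = 1.0) -> float:
    """Combine semantic and spatial log-odds into one scalar critic signal.
    With a differentiable VLM, gradients of this scalar would flow back
    through the renderer to the 3D representation."""
    r_sem = log_odds(*vlm_yes_no_logits(rendered_view, "semantic"))
    r_spa = log_odds(*vlm_yes_no_logits(rendered_view, "spatial"))
    return w_sem * r_sem + w_spa * r_spa

reward = dual_query_reward(rendered_view=None)
```

In an optimization-based pipeline this scalar would serve as a reward objective to maximize; in a feed-forward pipeline it would act as test-time guidance on the iterative sampling steps, as the abstract describes.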
Problem

Research questions and friction points this paper is trying to address.

Improving fine-grained semantic alignment beyond coarse prompt matching in text-to-3D generation
Resolving geometric inconsistencies in 3D spatial relationships
Correcting catastrophic failures in part assembly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposing VLMs as differentiable semantic and spatial critics
Deriving dual-query critic signals from the VLM's Yes/No log-odds
Applying guidance across optimization and feed-forward pipelines