StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

📅 2026-03-06

🤖 AI Summary

Existing reasoning methods for text-to-image (T2I) generation handle complex prompts poorly in one of two ways: text-only reasoning lacks visual context and therefore often drops critical spatial information, while approaches that generate intermediate images incur high computational costs and are constrained by the generator's capabilities. To overcome these issues, the authors propose StruVis, a framework in which a multimodal large language model (MLLM) uses structured visual representations, expressed as text, as intermediate reasoning states. The model thereby simulates visual perception within purely textual reasoning and improves its comprehension of intricate prompts without generating intermediate images. Because the reasoning stage is generator-agnostic, StruVis integrates with diverse T2I models and adds reasoning capability efficiently. The authors report performance improvements of 4.61% on T2I-ReasonBench and 4% on WISE.
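To make the "generator-agnostic" property concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Reasoner` and `Generator` type aliases, the `reason_then_generate` function, and the stub backends are all hypothetical names introduced for illustration. The only idea taken from the summary is that reasoning happens in text before any image is produced, so the T2I backend can be swapped freely.

```python
from typing import Callable

# Hypothetical aliases: a reasoner rewrites a prompt as text,
# and any T2I backend is modelled as "prompt text in, image bytes out".
Reasoner = Callable[[str], str]
Generator = Callable[[str], bytes]


def reason_then_generate(prompt: str, reasoner: Reasoner, generator: Generator) -> bytes:
    """Run text-only reasoning first, then call whichever generator was supplied."""
    refined_prompt = reasoner(prompt)   # no intermediate image is generated here
    return generator(refined_prompt)    # the backend only ever sees the final text


# Stub components standing in for a real MLLM and real T2I models (assumptions).
def stub_reasoner(prompt: str) -> str:
    return prompt + " | layout: cat left of lamp"


def stub_generator_a(prompt: str) -> bytes:
    return ("A:" + prompt).encode("utf-8")


def stub_generator_b(prompt: str) -> bytes:
    return ("B:" + prompt).encode("utf-8")


image_a = reason_then_generate("a cat beside a lamp", stub_reasoner, stub_generator_a)
image_b = reason_then_generate("a cat beside a lamp", stub_reasoner, stub_generator_b)
```

Both calls reuse the same reasoning step unchanged; only the final backend differs, which is the sense in which such a pipeline does not depend on a particular generator.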

📝 Abstract

Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While the latter approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To address these limitations, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively "perceive" visual structure within a purely text-based reasoning process. This structured-vision-guided reasoning unlocks the MLLM's reasoning potential for T2I generation. Additionally, as a generator-agnostic reasoning framework, StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhances their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
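The abstract does not specify what a text-based structured visual representation looks like, so the following Python sketch uses an assumed format: a list of named objects with normalized bounding boxes, serialized as plain text. The `SceneObject` class, the `serialize_layout` and `is_left_of` helpers, and the example scene are hypothetical and serve only to illustrate how spatial structure could be carried, and checked, inside a purely textual reasoning state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """One entity in a hypothetical text-form layout (not the paper's schema)."""
    name: str
    box: Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1), origin top-left
    attributes: Tuple[str, ...] = ()


def serialize_layout(objects: List[SceneObject]) -> str:
    """Render the layout as plain text that could sit inside a reasoning trace."""
    lines = []
    for obj in objects:
        x0, y0, x1, y1 = obj.box
        attrs = ", ".join(obj.attributes) if obj.attributes else "none"
        lines.append(
            f"{obj.name}: box=({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f}); attributes={attrs}"
        )
    return "\n".join(lines)


def is_left_of(a: SceneObject, b: SceneObject) -> bool:
    """True when a's right edge does not extend past b's left edge."""
    return a.box[2] <= b.box[0]


# Example scene for the prompt "a black cat to the left of a lamp" (illustrative values).
cat = SceneObject("cat", (0.05, 0.50, 0.40, 0.95), ("black",))
lamp = SceneObject("lamp", (0.60, 0.10, 0.85, 0.95))
layout_text = serialize_layout([cat, lamp])
```

A representation of this kind makes relations such as "left of" explicit in the text itself, so a spatial constraint from the prompt can be verified without rendering an image.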

Problem

Research questions and friction points this paper is trying to address.

text-to-image generation

reasoning-based T2I

visual grounding

structured vision

MLLM reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Vision

Reasoning-based T2I

Text-to-Image Generation