Spherical Dense Text-to-Image Synthesis

📅 2025-02-18
🤖 AI Summary
Current text-to-image (T2I) models face three challenges in panoramic image generation: weak layout control, missing or incorrect spherical distortions, and visible seams at the image borders. Dense T2I (DT2I) and spherical T2I (ST2I) models each address part of the problem, but no unified approach exists. This work proposes spherical dense text-to-image synthesis (SDT2I) and introduces two methods that integrate MultiDiffusion, a training-free DT2I approach, into finetuned panorama models: (i) MultiStitchDiffusion (MSTD), built on StitchDiffusion, and (ii) MultiPanFusion (MPF), built on PanFusion. It also introduces Dense-Synthetic-View (DSynView), a synthetic dataset of spherical layouts for evaluating SDT2I models. In the reported experiments, MSTD outperforms MPF in image quality, prompt adherence, and layout adherence, while MPF generates more diverse images but struggles to synthesize flawless foreground objects. To improve MPF, the authors propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground.

📝 Abstract
Recent advancements in text-to-image (T2I) have improved synthesis results, but challenges remain in layout control and generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues, but so far no unified approach exists. Trivial approaches, like prompting a DT2I model to generate panoramas, cannot generate proper spherical distortions and seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts to evaluate our models. Our results show that MSTD outperforms MPF in image quality as well as prompt and layout adherence. MultiPanFusion generates more diverse images but struggles to synthesize flawless foreground objects. We propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground as improvements to MPF.
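The abstract describes integrating MultiDiffusion into panorama models without giving implementation details. As a rough illustration only, the sketch below shows the kind of per-pixel, mask-weighted averaging that MultiDiffusion-style region fusion relies on: each prompt contributes a prediction, and a layout mask decides how much that prediction counts at each pixel. The function name `fuse_region_predictions`, the nested-list grid format, and the toy values are assumptions made for this example; they are not taken from the paper or from any library.

```python
def fuse_region_predictions(predictions, masks, eps=1e-8):
    """Per-pixel mask-weighted average of region-wise predictions.

    predictions: list of H x W grids (nested lists of floats), one per prompt.
    masks: list of H x W grids of non-negative weights, one per prompt.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    height = len(predictions[0])
    width = len(predictions[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_sum = 0.0
            weight_total = 0.0
            for pred, mask in zip(predictions, masks):
                weighted_sum += mask[y][x] * pred[y][x]
                weight_total += mask[y][x]
            # eps avoids division by zero where no mask covers the pixel.
            fused[y][x] = weighted_sum / max(weight_total, eps)
    return fused


# Toy 1 x 4 "latent": a background prompt covers every pixel,
# a foreground prompt covers only the right half.
background = [[0.0, 0.0, 0.0, 0.0]]
foreground = [[1.0, 1.0, 1.0, 1.0]]
background_mask = [[1, 1, 1, 1]]
foreground_mask = [[0, 0, 1, 1]]

fused = fuse_region_predictions(
    [background, foreground], [background_mask, foreground_mask]
)
print(fused)  # [[0.0, 0.0, 0.5, 0.5]]
```

In a real diffusion pipeline this averaging would be applied to latents at every denoising step rather than to a single toy grid.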
Problem

Research questions and friction points this paper is trying to address.

Challenges in layout control for text-to-image synthesis.
Difficulty in generating omnidirectional panoramic images.
Lack of a unified approach for spherical dense text-to-image synthesis.
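One concrete way to see the border problem: in an equirectangular panorama the leftmost and rightmost columns are neighbours on the sphere, so a seamless image should have nearly identical values there. The sketch below measures that mismatch. The helper name `seam_error` and the nested-list image format are assumptions for illustration; this is not a metric reported in the paper.

```python
def seam_error(image):
    """Mean absolute difference between the first and last column.

    image: H x W grid (nested lists of floats) in equirectangular layout.
    Hypothetical helper for illustration; a value near 0 suggests the
    left and right borders join without a visible seam.
    """
    return sum(abs(row[0] - row[-1]) for row in image) / len(image)


# Borders match on every row, so the wrap-around is seamless.
seamless = [[0.2, 0.5, 0.2],
            [1.0, 0.0, 1.0]]
# Borders differ, so a seam would be visible when the image is wrapped.
seamed = [[0.0, 0.5, 1.0]]

print(seam_error(seamless))  # 0.0
print(seam_error(seamed))    # 1.0
```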
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates training-free DT2I approaches into finetuned panorama models.
Proposes MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF).
Introduces the DSynView dataset for evaluating SDT2I models.