Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

📅 2024-07-17
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
High-fidelity, multi-scenario, arbitrarily scalable infinite image synthesis remains challenging due to the scarcity of high-resolution image-text paired data. Method: We propose a zero-shot image extrapolation framework that eliminates reliance on high-resolution image-text training pairs. For the first time, we employ a large language model (LLM) as a dynamic text generator, synergistically integrated with a diffusion model and a multi-granularity vision–text conditioning mechanism. The LLM autonomously produces both global semantic descriptions and fine-grained local details, guiding the diffusion model to extend images on demand while preserving cross-regional semantic consistency and enhancing local fidelity. Contribution/Results: Our method enables zero-shot, arbitrary-scale, and cross-domain image synthesis. Extensive experiments demonstrate significant improvements over state-of-the-art approaches in both quantitative metrics and qualitative assessment, achieving high-fidelity, contextually coherent large-scale infinite image generation.

πŸ“ Abstract
Text-guided image editing and generation methods have diverse real-world applications. However, text-guided infinite image synthesis faces several challenges. First, there is a lack of text-image paired datasets with high resolution and contextual diversity. Second, expanding images based on text requires both global coherence and rich local context understanding. Previous studies have mainly focused on limited categories, such as natural landscapes, and also required training on high-resolution images with paired text. To address these challenges, we propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding, without any high-resolution text-image paired training dataset. We train the diffusion model to expand an image conditioned on global and local captions generated from the LLM and visual features. At the inference stage, given an image and a global caption, we use the LLM to generate the next local caption for expanding the input image. We then expand the image using the global caption, the generated local caption, and the visual features, accounting for global consistency and spatial local context. In experiments, our model outperforms the baselines both quantitatively and qualitatively. Furthermore, our model demonstrates text-guided, arbitrary-sized image generation in a zero-shot manner with LLM guidance.
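The inference loop described in the abstract — the LLM proposes the next local caption, and the diffusion model extends the image conditioned on the global caption, the local caption, and visual features — can be sketched as below. This is a minimal illustration, not the paper's implementation: `generate_local_caption` and `outpaint` are hypothetical stand-ins for the LLM and the conditioned diffusion model, and the "visual feature" is reduced to a trivial border statistic.

```python
import numpy as np

def generate_local_caption(global_caption, history):
    """Stub for the LLM: propose the next local caption from the global
    caption and previously generated local captions (hypothetical interface)."""
    return f"{global_caption}, region {len(history) + 1}"

def outpaint(image, global_caption, local_caption, patch_width=64):
    """Stub for the diffusion step: extend the image rightward by one patch,
    conditioned on both captions and a visual feature. Here the 'feature'
    is just the mean of the current right border (placeholder logic)."""
    border_feature = image[:, -1:, :].mean()
    new_patch = np.full((image.shape[0], patch_width, 3), border_feature)
    return np.concatenate([image, new_patch], axis=1)

def expand(image, global_caption, steps):
    """Zero-shot expansion loop: caption generation and outpainting
    alternate until the requested number of extension steps is reached."""
    history = []
    for _ in range(steps):
        local = generate_local_caption(global_caption, history)
        image = outpaint(image, global_caption, local)
        history.append(local)
    return image, history

seed = np.zeros((64, 64, 3))
result, captions = expand(seed, "a coastal town at dusk", steps=4)
print(result.shape)  # image grown from 64 to 320 pixels wide
```

Because the loop only ever conditions on the current image and captions, the output width is bounded only by the number of steps, which is what enables arbitrary-sized generation.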
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Generation
High-Quality Image Synthesis
Limited Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Image Generation
Text-to-Image Synthesis