Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models rely on large-scale, pretrained text-to-vision prior networks, which incur substantial computational cost and demand extensive training data. To address these limitations, we propose Optimization-based Visual Inversion (OVI), a training-free and data-free method that directly maps text embeddings into the visual manifold of a pretrained diffusion decoder. Our key contributions are threefold: (1) the first training-free and data-free OVI framework; (2) a joint optimization objective combining cosine-similarity maximization, Mahalanobis-distance regularization, and a nearest-neighbor loss to enforce semantic fidelity and visual coherence; and (3) an empirical analysis revealing critical limitations of mainstream perceptual quality metrics. Experiments on Kandinsky 2.2 demonstrate that OVI achieves image quality comparable to or exceeding that of efficient prior-based methods, with notable gains in quantitative metrics, including lower FID and higher CLIP Score, despite requiring no additional training or external data.

📝 Abstract
Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior altogether by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input text-prompt embedding. We further propose two novel constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior and indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Eliminating computationally expensive trained diffusion priors for text-to-image generation
Replacing trained priors with training-free optimization-based visual inversion method
Addressing evaluation benchmark flaws that reward poor perceptual quality outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free visual inversion replaces diffusion prior
Mahalanobis and Nearest-Neighbor constraints regularize optimization
Optimization maximizes cosine similarity between visual and text embeddings
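The three-term objective in the bullets above can be sketched as plain gradient descent on a visual embedding. The following is a minimal NumPy illustration, not the paper's implementation: the function names, hyperparameter values, and the analytic gradients are assumptions, and a real system would optimize in the diffusion decoder's actual embedding space.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ovi_optimize(t, mu, cov, refs, steps=500, lr=0.05,
                 lam_m=0.01, lam_nn=0.01, seed=0):
    """Hypothetical OVI sketch: gradient descent on
         -cos(v, t)                                  (semantic fidelity)
       + lam_m  * (v - mu)^T cov^{-1} (v - mu)       (Mahalanobis regularizer)
       + lam_nn * ||v - nearest(refs, v)||^2         (nearest-neighbor loss)
    where t is the text embedding and refs are reference image embeddings."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=t.shape)          # random pseudo-token initialization
    cov_inv = np.linalg.inv(cov)
    for _ in range(steps):
        nv, nt = np.linalg.norm(v), np.linalg.norm(t)
        # gradient of cos(v, t) with respect to v
        g_cos = t / (nv * nt) - (v @ t) * v / (nv ** 3 * nt)
        # gradient of the squared Mahalanobis distance to the embedding mean
        g_mah = 2.0 * cov_inv @ (v - mu)
        # subgradient of the squared distance to the nearest reference embedding
        nn = refs[np.argmin(((refs - v) ** 2).sum(axis=1))]
        g_nn = 2.0 * (v - nn)
        v = v - lr * (-g_cos + lam_m * g_mah + lam_nn * g_nn)
    return v
```

With a random text embedding, a zero-mean identity-covariance prior, and Gaussian reference embeddings, the optimized `v` ends up nearly collinear with `t` while the two regularizers keep it from drifting arbitrarily far from the reference distribution.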