🤖 AI Summary
This work addresses a key limitation of existing pre-trained vision-language models: their reliance on discrete image tokenization constrains the fidelity of the images they generate. The authors propose a framework that trains only a diffusion decoder, leveraging the image-token logits produced by the frozen pre-trained model without modifying its architecture. Key innovations include a logit-to-code distributional mapping, a lightweight logit calibration mechanism, and a distribution-conditioned diffusion decoder that incorporates continuous representations from a VQ-VAE. Remarkably, with only brief training on ImageNet-1K, the method substantially enhances both VQ-VAE reconstruction quality and the visual fidelity of text-to-image generation.
📝 Abstract
Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby keeping the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an effective conditioning signal for diffusion decoding. A lightweight Logit Calibration aligns training-time proxy logits from the VQ-VAE encoder with VLM-generated logits, mitigating the train-inference gap. Conditioned on these representations, the Distribution-Conditioned Diffusion Decoder generates high-fidelity images. Trained only briefly on ImageNet-1K, our method consistently improves visual fidelity for both VQ-VAE reconstructions and text-to-image generations from VLM-predicted tokens.
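The logit-to-code mapping described above can be pictured as a softmax-weighted average over the VQ-VAE codebook, augmented with a per-token uncertainty feature. The sketch below is illustrative only: the function name, tensor shapes, temperature parameter, and the choice of entropy as the uncertainty feature are assumptions, not the paper's actual formulation.

```python
import numpy as np

def logits_to_code(logits, codebook, temperature=1.0):
    """Illustrative logit-to-code distributional mapping (shapes assumed).

    logits:   (num_tokens, vocab_size) image-token logits from the VLM
    codebook: (vocab_size, code_dim)   VQ-VAE codebook embeddings
    Returns distribution-weighted code vectors concatenated with a
    per-token entropy feature as a stand-in for "uncertainty features".
    """
    # Numerically stable softmax over the code vocabulary.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Expected (distribution-weighted) code vector per token:
    # a continuous relaxation of hard argmax token selection.
    weighted_codes = probs @ codebook  # (num_tokens, code_dim)

    # Per-token entropy: low when the VLM is confident about a token,
    # high when the code distribution is spread out.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1, keepdims=True)

    return np.concatenate([weighted_codes, entropy], axis=-1)
```

A decoder conditioned on such a signal sees not just the most likely code but the full shape of the VLM's belief, which is what distinguishes this from decoding hard-quantized tokens.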