VUGEN: Visual Understanding priors for GENeration

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the misalignment between understanding and generation representations in vision-language models (VLMs), as well as the architectural complexity of existing approaches, this paper proposes VUGEN—the first framework to explicitly leverage pre-trained VLM visual understanding priors to guide image generation. Methodologically, it reduces the dimensionality of the visual encoder's latent space through a semantic-preserving low-dimensional mapping; VLM-driven sampling is then performed within this compressed space, coupled with a VAE-free pixel-level diffusion decoder for reconstruction. This design eliminates conventional bridging mechanisms and auxiliary encoders, enabling end-to-end alignment between understanding and generation representations. On COCO, VUGEN achieves a DPG Bench score of 74.32 (+3.15) and an FID of 9.06 (−2.80), while fully preserving the original VLM's visual understanding capabilities.

📝 Abstract
Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages a VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find a VAE-free pixel diffusion decoder to be on par with or better than the commonly used, more complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.
Problem

Research questions and friction points this paper is trying to address.

Bridging visual understanding and generation in VLMs
Reducing misalignment between understanding and generation representations
Simplifying complex architectures for image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages VLM pretrained visual understanding priors
Transforms VLM latent space to lower-dimensional distribution
Uses VAE-free pixel diffusion decoder for generation
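The core idea behind the first two bullets—projecting the vision encoder's high-dimensional features into a compact latent space that preserves visual information, so the VLM can sample there—can be illustrated with a toy sketch. This is not the paper's method: PCA via SVD stands in for VUGEN's semantic-preserving mapping, the dimensions are made up, and a linear back-projection stands in for the pixel diffusion decoder.

```python
import numpy as np

# Hypothetical "vision encoder" features: 256 tokens with 1024-dim
# embeddings, constructed with low intrinsic rank (32) so that a small
# latent space can capture them. All shapes here are illustrative.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 32)) @ rng.standard_normal((32, 1024))

# Fit a linear down-projection with SVD (PCA without centering, for
# brevity). In VUGEN this role is played by a learned, semantic-
# preserving low-dimensional mapping.
d_low = 32                           # reduced latent dimensionality
_, _, vt = np.linalg.svd(tokens, full_matrices=False)
proj = vt[:d_low].T                  # (1024, d_low) projection matrix

z = tokens @ proj                    # the VLM would sample in this space
recon = z @ proj.T                   # linear stand-in for the decoder

# Because the features have rank 32, a 32-dim latent preserves them
# almost exactly (relative reconstruction error near machine precision).
err = np.linalg.norm(tokens - recon) / np.linalg.norm(tokens)
print(f"relative reconstruction error: {err:.2e}")
```

In the actual framework, a VAE-free pixel diffusion decoder—not a linear map—takes the sampled low-dimensional latents back to image space.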