Visual Generation Tuning

📅 2025-11-28
🤖 AI Summary
Pretrained large vision-language models (VLMs) are designed for visual understanding and lack explicit generative modeling capabilities. Method: We propose Visual Generation Tuning (VGT), a paradigm that unlocks implicit visual generation in VLMs by aligning their semantic encoders with lightweight pixel decoders in a continuous latent space, enabling efficient autoregressive image reconstruction and synthesis without resorting to complex VAE architectures. Contribution/Results: The resulting VGT-AE architecture unifies understanding and generation within a plug-and-play fine-tuning framework. Experiments report 26.67 PSNR and 0.50 rFID for image reconstruction at a 28x compression ratio; state-of-the-art generation scores among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench; and 20x faster convergence than dedicated VAE baselines, all while preserving the VLM's semantics.
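To make the VGT-AE design concrete, here is a minimal PyTorch sketch of the idea: a pretrained semantic encoder is projected to a compact continuous latent, and a lightweight pixel decoder maps that latent back to pixels. The module names, dimensions, patch layout, and loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the VGT-AE idea; dims and patch layout are assumptions.
import torch
import torch.nn as nn

class VGTAE(nn.Module):
    def __init__(self, vlm_vision_encoder: nn.Module, sem_dim: int = 1024, latent_dim: int = 32):
        super().__init__()
        # Pretrained VLM semantic encoder, already aligned with language.
        self.encoder = vlm_vision_encoder
        # Project high-dimensional semantic tokens to a compact continuous latent.
        self.to_latent = nn.Linear(sem_dim, latent_dim)
        # Lightweight pixel decoder: one 16x16 RGB patch per latent token (assumption).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.GELU(),
            nn.Linear(512, 3 * 16 * 16),
        )

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)   # (B, N, sem_dim) semantic tokens
        z = self.to_latent(feats)      # (B, N, latent_dim) continuous latents
        patches = self.decoder(z)      # (B, N, 768) reconstructed RGB patches
        return z, patches

def reconstruction_loss(patches: torch.Tensor, target_patches: torch.Tensor) -> torch.Tensor:
    # Plain L1 pixel loss; the paper likely adds perceptual or adversarial terms (assumption).
    return (patches - target_patches).abs().mean()
```

Because only the projector and decoder are new, training them against a reconstruction loss leaves the VLM's understanding pathway untouched, which is one plausible reading of the "plug-and-play" property.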

📝 Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying visual generation capabilities within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dispense with the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models: 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows significant scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities, paving a new avenue toward next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
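On the generation side, the abstract's "autoregressive modeling in the continuous space" can be pictured as next-latent prediction: the VLM emits continuous latent tokens conditioned on the prompt, and the pixel decoder renders them. The sketch below is schematic under that reading; `embed_text`, `backbone`, `embed_latent`, and `latent_head` are hypothetical interfaces, not the released API.

```python
import torch

@torch.no_grad()
def generate_image(vlm, latent_head, pixel_decoder, prompt_ids, num_tokens=256):
    """Schematic autoregressive synthesis in continuous latent space.

    vlm           -- pretrained VLM backbone (hypothetical interface)
    latent_head   -- maps hidden states to continuous latent tokens (hypothetical)
    pixel_decoder -- lightweight decoder from the VGT-AE sketch above
    """
    latents = []
    ctx = vlm.embed_text(prompt_ids)          # (B, T, hidden) prompt embeddings
    for _ in range(num_tokens):
        h = vlm.backbone(ctx)[:, -1]          # hidden state at the next position
        z = latent_head(h)                    # predict the next continuous latent
        latents.append(z)
        # feed the new latent back in as the next input token
        ctx = torch.cat([ctx, vlm.embed_latent(z).unsqueeze(1)], dim=1)
    z_seq = torch.stack(latents, dim=1)       # (B, num_tokens, latent_dim)
    return pixel_decoder(z_seq)               # decode latent tokens to pixels
```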
Problem

Research questions and friction points this paper is trying to address.

Unlocking visual generation potential in vision-language models
Reducing alignment costs and accelerating convergence of autoregressive modeling
Creating unified multimodal models for understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Generation Tuning (VGT) for pretrained vision language models
Aligns VLM semantic encoders with the latent representations of pixel decoders (VGT-AE)
Achieves 20x faster convergence for autoregressive modeling in continuous latent space (tuning setup sketched after this list)
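As a rough illustration of the plug-and-play tuning recipe referenced above, the step below freezes the pretrained encoder and trains only the new projector and pixel decoder. Whether the encoder stays fully frozen, and the optimizer settings, are assumptions.

```python
import torch

# Visual generation tuning setup (sketch): keep the VLM semantic encoder
# frozen and optimize only the newly added projector and pixel decoder.
model = VGTAE(vlm_vision_encoder=pretrained_vision_tower)  # encoder from any pretrained VLM
for p in model.encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    list(model.to_latent.parameters()) + list(model.decoder.parameters()),
    lr=1e-4,  # illustrative value, not the paper's schedule
)
```

Reusing an encoder that is already aligned with language, rather than training a VAE from scratch, is one plausible explanation for the reported 20x convergence speedup.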
Authors

Jiahao Guo, Huazhong University of Science and Technology (HUST)
Sinan Du, Tsinghua University
Jingfeng Yao, Huazhong University of Science and Technology (computer vision, generative models)
Wenyu Liu, Huazhong University of Science and Technology (HUST)
Bo Li, Kolors Team, Kuaishou Technology
Haoxiang Cao, School of Artificial Intelligence, South China Normal University
Kun Gai, Senior Director & Researcher, Alibaba Group (machine learning, computational advertising)
Chun Yuan, Tsinghua University
Kai Wu, Kolors Team, Kuaishou Technology
Xinggang Wang, Professor, Huazhong University of Science and Technology (artificial intelligence, computer vision, autonomous driving, object detection, object segmentation)