Visual Generation Tuning

📅 2025-11-28
🤖 AI Summary
Pretrained large vision-language models (VLMs) are designed for visual understanding and lack explicit generative modeling capabilities. Method: We propose Visual Generation Tuning (VGT), a paradigm that unlocks implicit visual generation in VLMs by aligning their semantic encoders with lightweight pixel decoders in a continuous latent space, enabling efficient autoregressive image reconstruction and synthesis without resorting to complex VAE architectures. Contribution/Results: The resulting VGT-AE architecture unifies understanding and generation within a plug-and-play fine-tuning framework. Experiments report 26.67 PSNR and 0.50 rFID for image reconstruction at a 28x compression ratio; state-of-the-art generation scores among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench; and 20x faster convergence than dedicated VAE baselines, all while preserving the VLM's semantics.
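To make the VGT-AE design concrete, here is a minimal PyTorch sketch of the idea: a pretrained semantic encoder is projected to a compact continuous latent, and a lightweight pixel decoder maps that latent back to pixels. The module names, dimensions, patch layout, and loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the VGT-AE idea; dims and patch layout are assumptions.
import torch
import torch.nn as nn

class VGTAE(nn.Module):
    def __init__(self, vlm_vision_encoder: nn.Module, sem_dim: int = 1024, latent_dim: int = 32):
        super().__init__()
        # Pretrained VLM semantic encoder, already aligned with language.
        self.encoder = vlm_vision_encoder
        # Project high-dimensional semantic tokens to a compact continuous latent.
        self.to_latent = nn.Linear(sem_dim, latent_dim)
        # Lightweight pixel decoder: one 16x16 RGB patch per latent token (assumption).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.GELU(),
            nn.Linear(512, 3 * 16 * 16),
        )

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)   # (B, N, sem_dim) semantic tokens
        z = self.to_latent(feats)      # (B, N, latent_dim) continuous latents
        patches = self.decoder(z)      # (B, N, 768) reconstructed RGB patches
        return z, patches

def reconstruction_loss(patches: torch.Tensor, target_patches: torch.Tensor) -> torch.Tensor:
    # Plain L1 pixel loss; the paper likely adds perceptual or adversarial terms (assumption).
    return (patches - target_patches).abs().mean()
```

Because only the projector and decoder are new, training them against a reconstruction loss leaves the VLM's understanding pathway untouched, which is one plausible reading of the "plug-and-play" property.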

📝 Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying visual generation capabilities within any vision language model. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dispense with the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models: 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, VGT shows significant scaling promise and can endow any VLM trained for multimodal understanding with visual generation capabilities, paving a new avenue toward next-generation unified multimodal foundation models. Models and code are available at https://github.com/hustvl/VGT.
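On the generation side, the abstract's "autoregressive modeling in the continuous space" can be pictured as next-latent prediction: the VLM emits continuous latent tokens conditioned on the prompt, and the pixel decoder renders them. The sketch below is schematic under that reading; `embed_text`, `backbone`, `embed_latent`, and `latent_head` are hypothetical interfaces, not the released API.

```python
import torch

@torch.no_grad()
def generate_image(vlm, latent_head, pixel_decoder, prompt_ids, num_tokens=256):
    """Schematic autoregressive synthesis in continuous latent space.

    vlm           -- pretrained VLM backbone (hypothetical interface)
    latent_head   -- maps hidden states to continuous latent tokens (hypothetical)
    pixel_decoder -- lightweight decoder from the VGT-AE sketch above
    """
    latents = []
    ctx = vlm.embed_text(prompt_ids)          # (B, T, hidden) prompt embeddings
    for _ in range(num_tokens):
        h = vlm.backbone(ctx)[:, -1]          # hidden state at the next position
        z = latent_head(h)                    # predict the next continuous latent
        latents.append(z)
        # feed the new latent back in as the next input token
        ctx = torch.cat([ctx, vlm.embed_latent(z).unsqueeze(1)], dim=1)
    z_seq = torch.stack(latents, dim=1)       # (B, num_tokens, latent_dim)
    return pixel_decoder(z_seq)               # decode latent tokens to pixels
```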
Problem

Research questions and friction points this paper is trying to address.

Unlocking visual generation potential in vision-language models
Reducing alignment costs and accelerating convergence of autoregressive modeling
Creating unified multimodal models for understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Generation Tuning (VGT) for pretrained vision language models
Aligns VLM semantic encoders with the latent representations of pixel decoders (VGT-AE)
Achieves 20x faster convergence for autoregressive modeling in continuous latent space (tuning setup sketched after this list)
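As a rough illustration of the plug-and-play tuning recipe referenced above, the step below freezes the pretrained encoder and trains only the new projector and pixel decoder. Whether the encoder stays fully frozen, and the optimizer settings, are assumptions.

```python
import torch

# Visual generation tuning setup (sketch): keep the VLM semantic encoder
# frozen and optimize only the newly added projector and pixel decoder.
model = VGTAE(vlm_vision_encoder=pretrained_vision_tower)  # encoder from any pretrained VLM
for p in model.encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    list(model.to_latent.parameters()) + list(model.decoder.parameters()),
    lr=1e-4,  # illustrative value, not the paper's schedule
)
```

Reusing an encoder that is already aligned with language, rather than training a VAE from scratch, is one plausible explanation for the reported 20x convergence speedup.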
Authors

Jiahao Guo, Huazhong University of Science and Technology (HUST)
Sinan Du, Tsinghua University
Jingfeng Yao, Huazhong University of Science and Technology (computer vision, generative models)
Wenyu Liu, Huazhong University of Science and Technology (HUST)
Bo Li, Kolors Team, Kuaishou Technology
Haoxiang Cao, School of Artificial Intelligence, South China Normal University
Kun Gai, Senior Director & Researcher, Alibaba Group (machine learning, computational advertising)
Chun Yuan, Tsinghua University
Kai Wu, Kolors Team, Kuaishou Technology
Xinggang Wang, Professor, Huazhong University of Science and Technology (artificial intelligence, computer vision, autonomous driving, object detection, object segmentation)