WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the *cross-task generalization collapse*—caused by fixed numbers of learnable queries when integrating pretrained vision-language models (VLMs) with diffusion models—this paper proposes the *Noisy Query Tokens (NQT)* mechanism. NQT employs end-to-end learnable, randomly initialized query tokens to establish a dynamic, distributed semantic alignment space between VLMs and diffusion priors. Additionally, a lightweight VAE branch jointly optimized with linear projection is introduced to recover fine-grained image details. This design significantly improves robustness in multimodal feature alignment and generation fidelity. Experiments demonstrate that NQT enables stable, continual learning across diverse downstream tasks—including text-to-image generation, open-vocabulary segmentation, and visual question answering—effectively mitigating generalization degradation. Crucially, it enhances cross-task transferability without compromising inference efficiency.

📝 Abstract
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
Problem

Research questions and friction points this paper is trying to address.

Bridging vision-language models with diffusion models efficiently
Overcoming task generalization collapse in multimodal learning
Enhancing continual learning across diverse vision-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noisy Query Tokens enhance VLM-Diffusion bridging
VAE branch recovers fine-grained image details
End-to-end optimization enables stable continual learning
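As a rough illustration of the design the bullets describe (not the authors' code), the sketch below models randomly initialized, learnable query tokens that cross-attend to VLM token features to produce diffusion conditioning, fused with a linear projection of a VAE latent for fine-grained detail. All class names, dimensions, and the fusion rule are hypothetical placeholders; weights are plain numpy arrays standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class NoisyQueryBridge:
    """Hypothetical sketch: noisy query tokens bridging a VLM and a
    diffusion model, with a lightweight VAE projection branch."""
    def __init__(self, n_queries=8, d_vlm=32, d_cond=16, d_vae=4):
        # End-to-end learnable queries, randomly ("noisily") initialized.
        self.queries = rng.normal(size=(n_queries, d_vlm))
        # Projection from attended VLM context to conditioning space.
        self.w_out = rng.normal(size=(d_vlm, d_cond)) / np.sqrt(d_vlm)
        # VAE branch: linear projection recovering fine-grained detail.
        self.w_vae = rng.normal(size=(d_vae, d_cond)) / np.sqrt(d_vae)

    def forward(self, vlm_feats, vae_latent):
        # Cross-attention: each query attends over all VLM token features.
        scores = self.queries @ vlm_feats.T / np.sqrt(vlm_feats.shape[1])
        attn = softmax(scores)                 # (n_queries, n_tokens)
        ctx = attn @ vlm_feats                 # (n_queries, d_vlm)
        cond = ctx @ self.w_out                # (n_queries, d_cond)
        detail = vae_latent @ self.w_vae       # (d_cond,)
        # Broadcast-add the detail signal onto every conditioning token.
        return cond + detail

bridge = NoisyQueryBridge()
vlm_feats = rng.normal(size=(10, 32))   # 10 VLM tokens of width 32
vae_latent = rng.normal(size=(4,))      # toy VAE latent
cond = bridge.forward(vlm_feats, vae_latent)
print(cond.shape)  # (8, 16): one conditioning vector per query token
```

In this toy form, "continual learning" would amount to updating `queries`, `w_out`, and `w_vae` end-to-end per task; the paper's point is that the distributed query space adapts rather than collapsing to the pre-training tasks.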
Authors

Jian Yang (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China)
Dacheng Yin (University of Science and Technology of China; speech enhancement, representation learning, speech editing)
Xiaoxuan He (Zhejiang University; deep learning)
Yong Li (The Hong Kong University of Science and Technology)
Fengyun Rao (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China)
Jing Lyu (Shanghai Jiao Tong University; power electronics, stability, renewable energy grid integration, high-voltage DC transmission)
Wei Zhai (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China)
Yang Cao (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China)
Zhengjun Zha (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China)