🤖 AI Summary
To address *cross-task generalization collapse*, which arises when pretrained vision-language models (VLMs) are bridged to diffusion models through a fixed number of learnable queries, this paper proposes the *Noisy Query Tokens (NQT)* mechanism. NQT uses end-to-end learnable, randomly initialized query tokens to establish a dynamic, distributed semantic alignment space between the VLM and the diffusion prior. A lightweight VAE branch, jointly optimized with a linear projection, is additionally introduced to recover fine-grained image details. This design improves the robustness of multimodal feature alignment and generation fidelity. Experiments show that NQT enables stable continual learning across diverse downstream tasks, including text-to-image generation, open-vocabulary segmentation, and visual question answering, effectively mitigating generalization degradation. Crucially, it enhances cross-task transferability without compromising inference efficiency.
📝 Abstract
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
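The core idea of query-token bridging can be sketched, very loosely, as cross-attention pooling with randomly initialized query tokens, followed by a linear projection into the diffusion model's conditioning space and an additive VAE-branch term for fine detail. This is a minimal illustrative sketch only: all dimensions, variable names, and the specific attention form are assumptions, not the paper's actual architecture, and the real method trains these parameters end-to-end rather than using fixed random values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (illustrative, not from the paper)
d_vlm, d_diff, n_queries, n_patches = 64, 32, 8, 16

# Noisy query tokens: randomly initialized; learnable in the real method
queries = rng.normal(scale=0.02, size=(n_queries, d_vlm))

# Stand-in for frozen VLM features of one image
vlm_feats = rng.normal(size=(n_patches, d_vlm))

# Cross-attention pooling: each query attends over all VLM features
attn = softmax(queries @ vlm_feats.T / np.sqrt(d_vlm))   # (n_queries, n_patches)
pooled = attn @ vlm_feats                                # (n_queries, d_vlm)

# Linear projection into the diffusion model's conditioning space
W_proj = rng.normal(scale=0.02, size=(d_vlm, d_diff))
cond_tokens = pooled @ W_proj                            # (n_queries, d_diff)

# VAE branch (sketch): a latent code projected linearly and added,
# standing in for the fine-grained-detail pathway
vae_latent = rng.normal(size=(n_queries, 4))
W_vae = rng.normal(scale=0.02, size=(4, d_diff))
cond_tokens = cond_tokens + vae_latent @ W_vae

print(cond_tokens.shape)
```

In this sketch the conditioning tokens have a fixed count (`n_queries`), which is exactly the constraint the paper associates with generalization collapse; the paper's contribution is in how these tokens are initialized and optimized, which this toy example does not capture.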