🤖 AI Summary
To address *cross-task generalization collapse*, which arises when pretrained vision-language models (VLMs) are bridged to diffusion models through a fixed number of learnable queries, this paper proposes the *Noisy Query Tokens (NQT)* mechanism. NQT uses end-to-end learnable, randomly initialized query tokens to establish a dynamic, distributed semantic alignment space between the VLM and the diffusion prior. A lightweight VAE branch, jointly optimized with a linear projection, is additionally introduced to recover fine-grained image details. This design improves the robustness of multimodal feature alignment and generation fidelity. Experiments show that NQT enables stable continual learning across diverse downstream tasks, including text-to-image generation, open-vocabulary segmentation, and visual question answering, effectively mitigating generalization degradation. Crucially, it enhances cross-task transferability without compromising inference efficiency.
📝 Abstract
Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
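The core idea of query-token bridging can be sketched, very loosely, as cross-attention pooling with randomly initialized query tokens, followed by a linear projection into the diffusion model's conditioning space and an additive VAE-branch term for fine detail. This is a minimal illustrative sketch only: all dimensions, variable names, and the specific attention form are assumptions, not the paper's actual architecture, and the real method trains these parameters end-to-end rather than using fixed random values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical dimensions (illustrative, not from the paper)
d_vlm, d_diff, n_queries, n_patches = 64, 32, 8, 16

# Noisy query tokens: randomly initialized; learnable in the real method
queries = rng.normal(scale=0.02, size=(n_queries, d_vlm))

# Stand-in for frozen VLM features of one image
vlm_feats = rng.normal(size=(n_patches, d_vlm))

# Cross-attention pooling: each query attends over all VLM features
attn = softmax(queries @ vlm_feats.T / np.sqrt(d_vlm))   # (n_queries, n_patches)
pooled = attn @ vlm_feats                                # (n_queries, d_vlm)

# Linear projection into the diffusion model's conditioning space
W_proj = rng.normal(scale=0.02, size=(d_vlm, d_diff))
cond_tokens = pooled @ W_proj                            # (n_queries, d_diff)

# VAE branch (sketch): a latent code projected linearly and added,
# standing in for the fine-grained-detail pathway
vae_latent = rng.normal(size=(n_queries, 4))
W_vae = rng.normal(scale=0.02, size=(4, d_diff))
cond_tokens = cond_tokens + vae_latent @ W_vae

print(cond_tokens.shape)
```

In this sketch the conditioning tokens have a fixed count (`n_queries`), which is exactly the constraint the paper associates with generalization collapse; the paper's contribution is in how these tokens are initialized and optimized, which this toy example does not capture.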