Transfer between Modalities with MetaQueries

📅 2025-04-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of unifying text understanding and pixel-level generation in multimodal large models. We propose MetaQueries, a mechanism that, while keeping the multimodal large language model (MLLM) backbone frozen, bridges MLLM semantic representations with diffusion model generation spaces via learnable query embeddings. Our approach requires no joint training, explicit modality alignment design, or additional data augmentation; it leverages only standard image-text paired data to achieve understanding-driven, high-fidelity image generation. Crucially, this is the first method to achieve strong generative capability under *fully frozen* MLLM weights, simultaneously ensuring semantic fidelity and generation controllability. Experiments demonstrate state-of-the-art performance on image editing and subject-driven generation tasks. By eliminating the need for parameter updates to the MLLM backbone, our method significantly simplifies multimodal modeling and establishes a new paradigm for efficient cross-modal knowledge transfer.

๐Ÿ“ Abstract
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
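The interface described above — a small set of learnable queries that attend over a frozen MLLM's latents and project the result into the diffusion decoder's conditioning space — can be illustrated with a minimal numpy sketch. All dimensions, the single-head attention form, and the variable names here are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative dimensions (not from the paper)
d_mllm, d_cond = 64, 32    # MLLM hidden size, diffusion conditioning size
n_ctx, n_query = 10, 4     # MLLM sequence length, number of MetaQueries

# Stand-in for frozen MLLM hidden states for one prompt;
# in the real system these come from the (unchanged) MLLM backbone.
mllm_latents = rng.normal(size=(n_ctx, d_mllm))

# The only trainable pieces in this sketch: the query embeddings
# and a linear connector into the diffusion model's conditioning space.
queries = rng.normal(size=(n_query, d_mllm))   # learnable MetaQueries
W_proj = rng.normal(size=(d_mllm, d_cond))     # learnable connector

# Queries attend over the frozen latents (single-head attention sketch),
# pooling the MLLM's semantics into a fixed number of tokens.
attn = softmax(queries @ mllm_latents.T / np.sqrt(d_mllm), axis=-1)
pooled = attn @ mllm_latents             # (n_query, d_mllm)
cond_tokens = pooled @ W_proj            # conditioning for the diffusion decoder

print(cond_tokens.shape)  # (4, 32)
```

Training then amounts to running the diffusion loss on images conditioned on `cond_tokens` and backpropagating only into `queries` and `W_proj`, which is consistent with the abstract's claim that paired image-caption data and standard diffusion objectives suffice.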
Problem

Research questions and friction points this paper is trying to address.

Aligning text and pixel modalities in unified models
Simplifying training for multimodal generative models
Enhancing image generation with MLLM reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

MetaQueries link autoregressive MLLMs to diffusion models
Training simplified with image-caption data and diffusion objectives
Frozen MLLM backbone maintains understanding while enabling generation
🔎 Similar Papers
No similar papers found.