AI Summary
This work addresses the challenge of unifying text understanding and pixel-level generation in multimodal large models. We propose MetaQueries, a mechanism that bridges MLLM semantic representations to a diffusion model's generation space via learnable query embeddings while keeping the multimodal large language model (MLLM) backbone frozen. The approach requires no joint training, explicit modality-alignment design, or additional data augmentation; it uses only standard image-text paired data to achieve understanding-driven, high-fidelity image generation. Crucially, this is the first method to achieve strong generative capability with *fully frozen* MLLM weights, preserving semantic fidelity while ensuring generation controllability. Experiments demonstrate state-of-the-art performance on image editing and subject-driven generation tasks. By eliminating parameter updates to the MLLM backbone, our method significantly simplifies multimodal modeling and establishes a new paradigm for efficient cross-modal knowledge transfer.
Abstract
Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen, thereby preserving its state-of-the-art multimodal understanding capabilities while achieving strong generative performance. Additionally, our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
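The interface described above (learnable queries appended to the caption tokens, run through the frozen MLLM, with the query positions' latents projected into the diffusion model's conditioning space) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the MLLM and the connector are hypothetical stand-ins (simple fixed and linear maps), and all dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
d_mllm, d_cond = 64, 32       # MLLM hidden size, diffusion conditioning size
n_text, n_queries = 10, 8     # caption tokens, learnable MetaQueries

# Frozen MLLM stand-in: a fixed nonlinear map over the token sequence.
W_mllm = rng.standard_normal((d_mllm, d_mllm)) / np.sqrt(d_mllm)

def frozen_mllm(tokens):
    """Stand-in for the frozen MLLM backbone: returns per-token latents."""
    return np.tanh(tokens @ W_mllm)

# The trainable parts: the MetaQueries themselves, plus a connector that
# maps their output latents into the diffusion decoder's conditioning space.
meta_queries = rng.standard_normal((n_queries, d_mllm)) * 0.02
W_connector = rng.standard_normal((d_mllm, d_cond)) / np.sqrt(d_mllm)

def encode_condition(text_embeds):
    # Append the learnable queries after the caption tokens and run the
    # frozen backbone over the joint sequence.
    seq = np.concatenate([text_embeds, meta_queries], axis=0)
    latents = frozen_mllm(seq)
    # Only the latents at the query positions condition the diffusion model.
    query_latents = latents[-n_queries:]
    return query_latents @ W_connector

cond = encode_condition(rng.standard_normal((n_text, d_mllm)))
print(cond.shape)  # (8, 32): one conditioning vector per MetaQuery
```

Under this setup, a standard diffusion loss backpropagates only into `meta_queries` and `W_connector` (and the diffusion decoder), which is why only image-caption pairs and the usual diffusion objective are needed.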