Multimodal Large Language Models for Multi-Subject In-Context Image Generation

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of subject omission and semantic drift in text-to-image generation when multiple specified subjects must co-occur. To tackle these issues, the authors propose MUSIC, the first multimodal large language model (MLLM) specifically designed for multi-subject in-context image generation. MUSIC incorporates a Vision Chain-of-Thought (Vision CoT) reasoning mechanism and a semantics-driven, test-time-scalable spatial layout planner. The model is further optimized through an automated data generation pipeline and a training strategy tailored for complex multi-subject compositions. Experimental results show that MUSIC significantly outperforms existing methods in both single- and multi-subject scenarios. To facilitate future research, the authors also introduce MSIC, the first dedicated benchmark for evaluating multi-subject in-context generation.
📝 Abstract
Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities increases, existing methods often suffer from subject omission and semantic drift. To address this problem, we propose MUSIC, the first MLLM specifically designed for **MU**lti-**S**ubject **I**n-**C**ontext image generation. To overcome data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. Furthermore, we enhance the model's understanding of multi-subject semantic relationships through a vision chain-of-thought (CoT) mechanism, guiding step-by-step reasoning from subject images to semantics and generation. To mitigate identity entanglement and manage visual complexity, we develop a novel semantics-driven spatial layout planning method and demonstrate its test-time scalability. By incorporating complex subject images during training, we improve the model's capacity for chained reasoning. In addition, we curate MSIC, a new benchmark tailored for multi-subject in-context generation. Experimental results demonstrate that MUSIC significantly surpasses other methods in both multi- and single-subject scenarios.
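The abstract only names the stages of the vision CoT mechanism (subject images → semantics → generation), not their implementation. The following is a purely illustrative sketch of such a staged pipeline; all functions, dataclasses, and the layout heuristic are hypothetical stand-ins, not MUSIC's actual API:

```python
# Hypothetical sketch of a vision chain-of-thought pipeline for
# multi-subject generation. Each stage mirrors one reasoning step
# the abstract names; none of these interfaces come from the paper.
from dataclasses import dataclass, field

@dataclass
class Subject:
    name: str
    image: str  # path or handle to the reference image

@dataclass
class CoTState:
    subjects: list
    semantics: list = field(default_factory=list)  # per-subject descriptions
    relations: list = field(default_factory=list)  # inter-subject relations
    layout: dict = field(default_factory=dict)     # planned spatial layout

def describe_subjects(state: CoTState) -> CoTState:
    # Step 1: reason about each reference image in isolation.
    state.semantics = [f"{s.name}: identity features of {s.image}"
                       for s in state.subjects]
    return state

def infer_relations(state: CoTState, prompt: str) -> CoTState:
    # Step 2: ground the text prompt in the per-subject semantics,
    # producing one relation per subject pair.
    state.relations = [(a.name, b.name, prompt)
                       for i, a in enumerate(state.subjects)
                       for b in state.subjects[i + 1:]]
    return state

def plan_layout(state: CoTState) -> CoTState:
    # Step 3: give each subject its own region before generation --
    # one way to mitigate identity entanglement (toy equal-width split).
    n = len(state.subjects)
    for i, s in enumerate(state.subjects):
        state.layout[s.name] = (i / n, 0.2, (i + 1) / n, 0.8)  # x0, y0, x1, y1
    return state

state = CoTState(subjects=[Subject("cat", "cat.png"), Subject("dog", "dog.png")])
state = plan_layout(infer_relations(describe_subjects(state),
                                    "a cat and a dog on a sofa"))
```

The point of the staged structure is that generation only happens after explicit semantics and layout are committed, rather than conditioning the image model on raw references directly.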
Problem

Research questions and friction points this paper is trying to address.

multi-subject image generation
subject omission
semantic drift
in-context generation
text-to-image synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Model
Multi-Subject Image Generation
Vision Chain-of-Thought
Semantics-Driven Layout Planning
In-Context Learning
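The abstract claims the layout planner is test-time scalable but does not spell out the mechanism. One plausible reading, sketched here as an assumption rather than the paper's algorithm, is sampling several candidate layouts at inference time and keeping the one with the least subject overlap (more samples, better expected layout):

```python
# Hypothetical test-time-scaled layout search: sample k candidate
# layouts and keep the one minimizing pairwise box overlap.
# This heuristic is illustrative only; MUSIC's planner is semantics-driven.
import random

def sample_layout(n_subjects, rng):
    # Place n random axis-aligned boxes inside the unit square.
    boxes = []
    for _ in range(n_subjects):
        w, h = rng.uniform(0.2, 0.5), rng.uniform(0.2, 0.5)
        x, y = rng.uniform(0.0, 1.0 - w), rng.uniform(0.0, 1.0 - h)
        boxes.append((x, y, x + w, y + h))
    return boxes

def overlap(a, b):
    # Intersection area of two (x0, y0, x1, y1) boxes.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def score(boxes):
    # Less total pairwise overlap => less identity entanglement.
    return -sum(overlap(a, b)
                for i, a in enumerate(boxes) for b in boxes[i + 1:])

def plan(n_subjects, n_candidates, seed=0):
    # Test-time scaling knob: raising n_candidates buys a better layout
    # at the cost of more inference-time compute.
    rng = random.Random(seed)
    return max((sample_layout(n_subjects, rng)
                for _ in range(n_candidates)), key=score)

best = plan(n_subjects=3, n_candidates=32)
```

With a fixed seed, the candidate stream is deterministic, so enlarging `n_candidates` can only keep or improve the best score, which is the sense in which this kind of planner "scales" at test time.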