Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing subject-driven image generation methods often struggle to preserve subject identity and follow textual instructions at the same time. Because they encode text and reference images separately, cross-modal reasoning is limited and copy-paste artifacts are common. This work proposes a diffusion-based generation framework that uses a multimodal large language model (MLLM) to jointly encode the textual prompt and the reference images, and adds identity conditioning extracted with a VAE. A Dual Layer Aggregation (DLA) module combines multi-level MLLM features into the conditioning signal, and a multi-stage denoising strategy progressively balances the semantic information from the MLLM against the fine-detail identity information from the VAE during inference. According to the authors' experiments, the method mitigates copy-paste artifacts and achieves superior human-preference results on subject-driven image generation.
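The summary does not say how the DLA module combines multi-level features, so the following is only a minimal sketch of one common way to aggregate features from several layers: a softmax-weighted sum. The function names `softmax` and `aggregate_layers`, the per-layer scores, and the weighted-sum design are all assumptions made for illustration, not the paper's actual module.

```python
import math


def softmax(scores):
    """Convert raw per-layer scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_scores):
    """Blend token features from several layers into one feature map.

    layer_features: list of layers, each a list of token vectors
                    (all layers share the same token count and width).
    layer_scores:   one raw score per layer (hypothetical; in a trained
                    model these would be learned parameters).
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")

    weights = softmax(layer_scores)
    num_tokens = len(layer_features[0])
    width = len(layer_features[0][0])

    fused = [[0.0] * width for _ in range(num_tokens)]
    for weight, features in zip(weights, layer_features):
        for t in range(num_tokens):
            for d in range(width):
                fused[t][d] += weight * features[t][d]
    return fused


# Toy stand-ins for a shallower and a deeper MLLM layer (one token, width 2).
shallow_layer = [[1.0, 2.0]]
deep_layer = [[3.0, 4.0]]
fused = aggregate_layers([shallow_layer, deep_layer], [0.0, 0.0])
```

With equal scores the two layers are simply averaged; unequal scores shift the fused features toward the layer with the higher score.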

📝 Abstract

Subject-driven image generation aims to synthesize new images that preserve the identity of a given subject while following textual instructions. Existing approaches often encode text and reference images separately, which limits cross-modal reasoning and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment them with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from the MLLM and the fine-detail identity information from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
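The abstract states that the multi-stage denoising strategy balances MLLM semantics against VAE identity detail across inference, but gives no schedule. The sketch below shows one hypothetical way to express such a schedule as a table of stages. The stage boundaries, the weight values, the ordering (semantics first, identity later), and the name `stage_weights` are all assumptions for illustration, not values from the paper.

```python
# Each stage: (end_fraction, semantic_weight, identity_weight).
# A stage applies while progress < end_fraction. All numbers are
# hypothetical placeholders, not the paper's settings.
EXAMPLE_STAGES = [
    (0.4, 1.0, 0.0),  # assumed early stage: rely on MLLM semantics only
    (0.8, 0.5, 0.5),  # assumed middle stage: mix both conditions
    (1.0, 0.2, 0.8),  # assumed late stage: emphasize VAE identity detail
]


def stage_weights(step, num_steps, stages=EXAMPLE_STAGES):
    """Return (semantic_weight, identity_weight) for one denoising step.

    step counts from 0 (first denoising step) to num_steps - 1 (last).
    """
    if num_steps <= 0 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")

    progress = step / num_steps  # always in [0, 1)
    for end_fraction, semantic_weight, identity_weight in stages:
        if progress < end_fraction:
            return semantic_weight, identity_weight
    # Only reached if the last stage ends before 1.0; reuse its weights.
    return stages[-1][1], stages[-1][2]


# Weights a sampler could use to scale the two conditioning signals
# at each of 10 denoising steps.
schedule = [stage_weights(step, 10) for step in range(10)]
```

Keeping the schedule as data rather than hard-coded branches makes it easy to try a different number of stages or a different ordering without touching the sampling loop.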

Problem

Research questions and friction points this paper is trying to address.

subject-driven generation

identity preservation

multimodal reasoning

copy-paste artifacts

text-to-image synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models

Subject-driven Generation

Dual Layer Aggregation