MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

📅 2025-12-08
🤖 AI Summary
Multi-Image Composition (MICo) faces two key challenges: a scarcity of high-quality training data and the difficulty of preserving identity consistency across composited images. To address these, the authors introduce MICo-150K, the first large-scale, high-fidelity MICo dataset, comprising 150K images spanning seven representative composition tasks. They build a Decomposition-and-Recomposition (De&Re) subset from real-world images, so that the data covers both real and synthetic compositions, and introduce a new evaluation metric, Weighted-Ref-VIEScore. They also design a human-in-the-loop filtering pipeline and establish MICo-Bench, a benchmark for systematic evaluation. Fine-tuned models show substantial improvements in MICo capability: Qwen-MICo matches Qwen-Image-2509's performance on three-image composition while supporting controllable composition with an arbitrary number of reference images. The work advances multi-source image generation toward greater flexibility, fidelity, and scalability.

📝 Abstract
In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To bridge this gap, we conduct a systematic study of MICo: we categorize it into 7 representative tasks, curate a large-scale collection of high-quality source images, and construct diverse MICo prompts. Leveraging powerful proprietary models, we synthesize a large, balanced set of composite images, followed by human-in-the-loop filtering and refinement, resulting in MICo-150K, a comprehensive dataset for MICo with identity consistency. We further build a Decomposition-and-Recomposition (De&Re) subset, where 11K real-world complex images are decomposed into components and recomposed, enabling both real and synthetic compositions. To enable comprehensive evaluation, we construct MICo-Bench with 100 cases per task and 300 challenging De&Re cases, and further introduce a new metric, Weighted-Ref-VIEScore, specifically tailored for MICo evaluation. Finally, we fine-tune multiple models on MICo-150K and evaluate them on MICo-Bench. The results show that MICo-150K effectively equips models that lack MICo capability and further enhances those that already have it. Notably, our baseline model, Qwen-MICo, fine-tuned from Qwen-Image-Edit, matches Qwen-Image-2509 in 3-image composition while supporting arbitrary multi-image inputs beyond the latter's limitation. Our dataset, benchmark, and baseline collectively offer valuable resources for further research on Multi-Image Composition.
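The benchmark sizes quoted in the abstract can be tallied with a few lines of Python. This is only an arithmetic sketch: the function name and dictionary keys are illustrative, and it assumes the 300 De&Re cases are counted in addition to the 100 cases for each of the 7 tasks, which the abstract implies but does not state outright.

```python
# Figures taken from the abstract; names below are illustrative, not from the paper.
NUM_TASKS = 7          # representative MICo tasks
CASES_PER_TASK = 100   # MICo-Bench cases per task
DE_RE_CASES = 300      # challenging De&Re cases


def micobench_case_counts(num_tasks=NUM_TASKS,
                          cases_per_task=CASES_PER_TASK,
                          de_re_cases=DE_RE_CASES):
    """Return per-group and total case counts.

    Assumes the De&Re cases are separate from the per-task cases.
    """
    task_cases = num_tasks * cases_per_task
    return {
        "task_cases": task_cases,
        "de_re_cases": de_re_cases,
        "total": task_cases + de_re_cases,
    }


counts = micobench_case_counts()
print(counts)
```

Under that assumption, the benchmark comes to 700 task cases plus 300 De&Re cases.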
Problem

Research questions and friction points this paper is trying to address.

Addresses the lack of high-quality training data for multi-image composition tasks
Solves the challenge of synthesizing coherent images from multiple reference inputs
Enables comprehensive evaluation of multi-image composition models and methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created large-scale dataset MICo-150K with human-filtered composite images
Built Decomposition-and-Recomposition subset using real-world image components
Developed MICo-Bench benchmark with tailored Weighted-Ref-VIEScore metric
Xinyu Wei
PolyU & PKU
Computer Vision, Deep Learning
Kangrui Cen
OPPO Research Institute
Hongyang Wei
Tsinghua University
Zhen Guo
Hong Kong Polytechnic University
Bairui Li
Hong Kong Polytechnic University
Zeqing Wang
Sun Yat-Sen University
Jinrui Zhang
Hong Kong Polytechnic University
Lei Zhang
Hong Kong Polytechnic University