MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

📅 2026-02-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks struggle to evaluate unified multimodal models on multi-image contextual generation tasks, particularly in terms of cross-image composition, contextual reasoning, and identity preservation. To address this gap, this work proposes MICON-Bench—the first systematic evaluation benchmark tailored for multi-image contextual generation—encompassing six core tasks. It introduces two key technical contributions: an automatic semantic and visual consistency verification framework based on multimodal large language models (MLLMs), and a training-free, plug-and-play Dynamic Attention Rebalancing (DAR) mechanism. Experimental results demonstrate that MICON-Bench effectively uncovers limitations in current models, while DAR significantly enhances the consistency and contextual coherence of generated outputs and reduces hallucinations.

📝 Abstract
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as the verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: https://github.com/Angusliuuu/MICON-Bench.
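The abstract does not specify how DAR adjusts attention at inference time. As a rough illustration only, here is a minimal sketch of one way a training-free attention-rebalancing step over multiple reference images could look; the per-image grouping, the linear interpolation toward a uniform share, and the `strength` parameter are assumptions for this sketch, not the authors' actual method:

```python
def rebalance_attention(row, groups, strength=0.5):
    """Rebalance one query's attention weights across reference-image groups.

    row:      attention weights for a single query (non-negative, sums to 1)
    groups:   list of key-index lists, one per reference image
    strength: 0.0 leaves weights unchanged; 1.0 forces each reference
              image to receive an equal share of attention mass
    """
    # Current attention mass assigned to each reference image.
    masses = [sum(row[i] for i in g) for g in groups]
    target = sum(masses) / len(masses)  # uniform share per image

    out = list(row)
    for g, mass in zip(groups, masses):
        # Interpolate each group's mass toward the uniform target,
        # then scale that group's weights to hit the new mass.
        new_mass = (1 - strength) * mass + strength * target
        scale = new_mass / mass if mass > 1e-12 else 0.0
        for i in g:
            out[i] *= scale

    # Renormalize so the row still sums to 1 (keys outside all groups,
    # e.g. text tokens, keep their original weights up to normalization).
    total = sum(out)
    return [w / total for w in out]
```

With `strength=1.0` and two reference images holding 0.8 and 0.2 of the attention mass, both groups end at 0.5, preventing one image from dominating the generated output; intermediate values trade off between the model's original attention and the uniform target.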
Problem

Research questions and friction points this paper is trying to address.

multi-image context generation
unified multimodal models
cross-image composition
contextual reasoning
identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

MICON-Bench
Dynamic Attention Rebalancing
Multi-Image Context Generation
MLLM-driven Evaluation
Unified Multimodal Models
Mingrui Wu
XMU
MLLMT2I
Hang Liu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Jiayi Ji
Rutgers University
Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.