MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing benchmarks focus primarily on generation from a single or a few reference images and lack systematic evaluation of multi-reference text-to-image generation; their task formulations are ill-defined and their difficulty dimensions ambiguous. Method: The authors introduce MultiBanana, a benchmark designed specifically for multi-reference generation, which defines five core challenge dimensions: varying numbers of references, domain mismatch among references, scale mismatch between reference and target scenes, rare concepts, and multilingual textual references. Using a multi-dimensional, controllable test set and standardized evaluation protocols, they systematically assess a variety of text-to-image models. Contribution/Results: The evaluation uncovers bottlenecks in semantic fusion, cross-domain alignment, and fine-grained control, and the experiments show substantial performance degradation under complex multi-reference conditions. MultiBanana provides a reproducible, comparable, and openly released basis for fair evaluation of multi-reference text-to-image generation.


📝 Abstract

Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or from pointing out model weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce $\textbf{MultiBanana}$, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals their strengths, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .
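To make the five axes concrete, here is a minimal Python sketch of how a test case could be tagged with the axes it exercises and how scores could be averaged per axis. Everything in it is an illustrative assumption: the `Axis`, `Case`, and `mean_score_per_axis` names, the record layout, the file names, and the scores are hypothetical and are not taken from the released MultiBanana data or code.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Dict, FrozenSet, List, Tuple


class Axis(Enum):
    # The five difficulty axes listed in the abstract.
    NUM_REFERENCES = "number of references"
    DOMAIN_MISMATCH = "domain mismatch among references"
    SCALE_MISMATCH = "scale mismatch between reference and target"
    RARE_CONCEPT = "rare concepts"
    MULTILINGUAL_TEXT = "multilingual textual references"


@dataclass(frozen=True)
class Case:
    # Hypothetical record layout, not the benchmark's actual format.
    prompt: str
    reference_paths: Tuple[str, ...]
    axes: FrozenSet[Axis]


def mean_score_per_axis(cases: List[Case], scores: List[float]) -> Dict[Axis, float]:
    """Average per-case scores separately for each axis a case is tagged with."""
    if len(cases) != len(scores):
        raise ValueError("cases and scores must have the same length")
    buckets = defaultdict(list)
    for case, score in zip(cases, scores):
        for axis in case.axes:
            buckets[axis].append(score)
    return {axis: mean(values) for axis, values in buckets.items()}


# Made-up cases and scores, purely for illustration.
cases = [
    Case("a red banana on a desk", ("ref_a.png",), frozenset({Axis.RARE_CONCEPT})),
    Case(
        "the anime character beside the photographed dog",
        ("ref_b.png", "ref_c.png"),
        frozenset({Axis.DOMAIN_MISMATCH, Axis.NUM_REFERENCES}),
    ),
]
scores = [0.4, 0.8]
report = mean_score_per_axis(cases, scores)
```

Reporting a separate mean per axis, rather than one pooled number, is what would let a reader see which multi-reference condition a model degrades under; how the benchmark itself scores and aggregates results is not described in this summary.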

Problem

Research questions and friction points this paper is trying to address.

Assesses multi-reference text-to-image generation model capabilities comprehensively.

Addresses gaps in existing benchmarks for multi-reference conditions and tasks.

Covers diverse challenges like domain mismatch and rare concepts systematically.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the MultiBanana benchmark for multi-reference generation.

Covers diverse multi-reference problems such as domain mismatch.

Enables standardized comparison of text-to-image models.

🔎 Similar Papers

Unified Text-to-Image Generation and Retrieval

2024-06-09 · arXiv.org · Citations: 3
