MultiRef: Controllable Image Generation with Multiple Visual References

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image generation methods typically rely on a single text prompt or a single reference image, limiting their ability to support designer-like integration of multi-source visual inspiration. To address this, the paper introduces MultiRef-bench, the first benchmark dedicated to multi-reference image generation, together with the high-quality MultiRef dataset and a systematic analysis of how prevailing models behave under multi-reference conditions. The RefBlend data engine synthesizes diverse, controllable multi-reference samples, enabling rigorous evaluation of three interleaved image-text models and six agentic frameworks across varied reference configurations. Experiments show that even the state-of-the-art OmniGen scores only 66.6% on synthetic samples and 79.0% on real-world cases relative to the golden answer, underscoring the substantial challenges that remain in multi-reference generation. This work establishes a benchmark, dataset, and analytical framework to advance controllable and flexible visual content creation.

📝 Abstract
Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset, MultiRef, containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model, OmniGen, achieving on average only 66.6% on synthetic samples and 79.0% on real-world cases relative to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Controllable image generation with multiple visual references
Evaluating models on multi-reference image synthesis tasks
Improving AI tools for integrating diverse visual inspirations
Innovation

Methods, ideas, or system contributions that make the work stand out.

MultiRef-bench for multi-reference evaluation
RefBlend engine synthesizes diverse reference combinations
Systematic evaluation showing that even the best model, OmniGen, struggles with multi-reference conditioning
Ruoxi Chen (Zhejiang University of Technology): Trustworthy AI, Multimodal Models
Dongping Chen (University of Washington, Seattle, USA)
Siyuan Wu (Huazhong University of Science and Technology, Wuhan, China)
Sinan Wang (Southern University of Science and Technology): Software Engineering, Software Testing, Software Analysis
Shiyun Lang (Huazhong University of Science and Technology, Wuhan, China)
Petr Sushko (Allen Institute for AI, Seattle, USA)
Gaoyang Jiang (Huazhong University of Science and Technology, Wuhan, China)
Yao Wan (Huazhong University of Science and Technology): NLP, Programming Languages, Software Engineering, Large Language Models
Ranjay Krishna (University of Washington, Allen Institute for AI): Computer Vision, Natural Language Processing, Machine Learning, Human Computer Interaction