OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

📅 2024-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) remain weak at interleaved image-text generation, i.e., open-ended generation in which images and text alternate in the output, and current benchmarks are too limited in scale and scenario diversity to evaluate it rigorously. Method: We introduce OpenING, a comprehensive benchmark for this task, comprising 5,400 human-annotated instances across 56 real-world tasks. We also present IntJudge, a judge model for evaluating open-ended multimodal generation, trained with a novel data pipeline. Results: IntJudge achieves an 82.42% agreement rate with human judgments, outperforming GPT-based evaluators by 11.34%. Experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement.

📝 Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to limitations in data size and diversity. To bridge this gap, we introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models.
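As a rough illustration of what an agreement rate between a judge model and human annotators measures, the following minimal sketch counts how often the two pick the same verdict in pairwise comparisons. The function name, the verdict labels ("A", "B", "tie"), and the sample votes are hypothetical and are not taken from the paper, whose exact protocol is not described on this page.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Return the fraction of comparisons where judge and human verdicts match."""
    if len(judge) != len(human):
        raise ValueError("judge and human verdict lists must have equal length")
    if not judge:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts: "A" = first output preferred, "B" = second, "tie" = no preference.
human_votes = ["A", "B", "A", "tie", "B"]
judge_votes = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_votes, human_votes)
print(f"{rate:.2%}")  # 80.00%
```

Under this reading, a reported rate of 82.42% would mean the judge matched the human verdict in roughly 82 of every 100 comparisons.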
Problem

Research questions and friction points this paper is trying to address.

Benchmarks for evaluating interleaved image-text generation are lacking
Existing benchmarks are limited in data size and diversity
Need robust evaluation for multimodal generation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces OpenING benchmark for interleaved generation
Develops IntJudge model for evaluation
Uses novel data pipeline for training
Pengfei Zhou
GATE Team, Shanghai Artificial Intelligence Laboratory
Xiaopeng Peng
Rochester Institute of Technology
Jiajun Song
Michigan Technological University
Wave Energy Converter
Chuanhao Li
GATE Team, Shanghai Artificial Intelligence Laboratory
Zhaopan Xu
GATE Team, Shanghai Artificial Intelligence Laboratory
Yue Yang
Shanghai Jiao Tong University, GATE Team, Shanghai Artificial Intelligence Laboratory
Ziyao Guo
National University of Singapore
Deep Learning
Hao Zhang
GATE Team, Shanghai Artificial Intelligence Laboratory
Yuqi Lin
Zhejiang University
Computer Vision, Multimodal Foundation Model
Yefei He
Zhejiang University
Computer Vision, Autoregressive Visual Generation, Model Quantization
Lirui Zhao
GATE Team, Shanghai Artificial Intelligence Laboratory
Shuo Liu
GATE Team, Shanghai Artificial Intelligence Laboratory
Tianhua Li
Shanghai Jiao Tong University, GATE Team, Shanghai Artificial Intelligence Laboratory
Yuxuan Xie
GATE Team, Shanghai Artificial Intelligence Laboratory
Xiaojun Chang
University of Science and Technology of China, MBZUAI
Yu Qiao
GATE Team, Shanghai Artificial Intelligence Laboratory
Wenqi Shao
Researcher at Shanghai AI Laboratory
Foundation Model Evaluation, LLM Compression, Efficient Adaptation, Multimodal Learning
Kaipeng Zhang
Shanghai AI Laboratory
LLM, Multimodal LLMs, AIGC