WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image understanding and generation benchmarks predominantly focus on single-turn interactions, failing to capture the multi-turn, context-dependent editing processes characteristic of real-world use. To address this gap, we propose WEAVE, the first evaluation suite supporting in-context interleaved cross-modal understanding and generation. It comprises a large-scale multi-turn dialogue dataset (WEAVE-100k) and a human-annotated evaluation benchmark (WEAVEBench). WEAVE introduces a hybrid Vision-Language Model (VLM) judging framework that performs context-aware evaluation by conditioning jointly on the original image and the edit instructions, and that combines reference-image comparison, automated VLM scoring, and human annotation for efficiency and reliability. Experiments provide the first systematic analysis of critical bottlenecks in current multimodal models, including visual memory retention, world-knowledge reasoning, and cross-turn collaborative generation. WEAVE establishes a reproducible, scalable foundation for the iterative advancement of unified multimodal models.
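
A minimal sketch of how such a hybrid, context-aware judging step could be wired together. The `EditTurn` fields, prompt wording, `query`-style judge callable, and score-fusion weights below are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

# A VLM judge is modeled here as a callable taking a text prompt plus a list
# of image paths and returning a scalar score in [0, 1]. The real WEAVE
# judger and its prompts are not specified in this summary; this is a stand-in.
VLMJudge = Callable[[str, list[str]], float]

@dataclass
class EditTurn:
    original_image: str   # image state before this edit turn
    instruction: str      # the edit instruction for this turn
    generated_image: str  # the model's output for this turn
    reference_image: str  # human-curated ground-truth result

def hybrid_judge_score(turn: EditTurn, judge: VLMJudge,
                       w_ref: float = 0.5, w_ctx: float = 0.5) -> float:
    """Score one edit turn by fusing two judgments (weights are assumed)."""
    # 1) Reference-based: does the output match the ground-truth image?
    ref_score = judge(
        "Rate from 0 to 1 how closely the second image matches the first.",
        [turn.reference_image, turn.generated_image],
    )
    # 2) Context-based: does the output correctly apply the instruction to
    #    the original image, independently of the single reference?
    ctx_score = judge(
        f"Rate from 0 to 1 whether the second image correctly applies this "
        f"edit to the first image: {turn.instruction!r}",
        [turn.original_image, turn.generated_image],
    )
    return w_ref * ref_score + w_ctx * ctx_score
```

Fusing a reference-based score with an instruction-conditioned score is what would let such a judger credit valid edits that differ cosmetically from the single reference image.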

📝 Abstract
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM-judger evaluation framework that scores outputs against both the reference image and the combination of the original image with the editing instructions, assessing models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k strengthens vision comprehension, image editing, and comprehension-generation collaboration capabilities, and that it facilitates the emergence of visual-memory capabilities in UMMs, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a perspective and a foundation for studying in-context interleaved comprehension and generation in the multimodal community.
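
To make "interleaved sample" concrete, here is one plausible record layout for a multi-turn dialogue in which later turns refer back to earlier images. The field names and example are illustrative assumptions, not WEAVE-100k's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueTurn:
    role: str                 # "user" or "model"
    text: str                 # instruction, question, or textual answer
    images: list[str] = field(default_factory=list)  # images attached to this turn

@dataclass
class InterleavedSample:
    task: str                  # e.g. "comprehension", "editing", "generation"
    turns: list[DialogueTurn]  # ordered history; later turns may depend on
                               # images or facts introduced in earlier turns

# A two-round editing dialogue where the second request can only be resolved
# by remembering the output of the first (the "visual memory" setting):
sample = InterleavedSample(
    task="editing",
    turns=[
        DialogueTurn("user", "Add a red hat to the dog.", ["dog.png"]),
        DialogueTurn("model", "Done.", ["dog_hat.png"]),
        DialogueTurn("user", "Now make the hat from the previous turn blue."),
        DialogueTurn("model", "Done.", ["dog_blue_hat.png"]),
    ],
)
```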
Problem

Research questions and friction points this paper is trying to address.

Existing multimodal datasets lack multi-turn context-dependent interactions
Current benchmarks fail to capture real-world image creation and editing complexity
There is no comprehensive evaluation for interleaved comprehension and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

WEAVE dataset enables interleaved multimodal comprehension and generation
WEAVEBench benchmark assesses multi-turn visual memory and reasoning
Training on WEAVE facilitates emergent visual-memory capabilities in models