UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing models lack unified understanding and generation capabilities for arbitrarily composed, interleaved multimodal inputs and outputs. To address this gap, the paper introduces UniM, the first any-to-any interleaved multimodal benchmark. It covers seven modalities (text, image, audio, video, document, code, and 3D), spans 30 domains, and comprises 31K high-quality samples, accompanied by a multidimensional evaluation framework that scores semantic correctness and generation quality, response structure integrity, and interleaved coherence. Building on a multimodal large language model architecture, the authors also propose UniMA, a baseline model that uses a traceable reasoning mechanism for structured interleaved generation. Experiments show that UniM is highly challenging for current models and expose critical bottlenecks in unified interleaved multimodal tasks, pointing to directions for future research.

📝 Abstract

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
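
To make the idea of an interleaved any-to-any sample concrete, here is a minimal Python sketch of how such an instance might be represented. It is an illustration only: the class names (`Modality`, `Segment`, `InterleavedSample`), the field layout, and the `structure_matches` helper are assumptions made for this example, not the UniM data format or the UniM Evaluation Suite. Only the seven modality names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Modality(Enum):
    """The seven modalities named in the abstract."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"
    CODE = "code"
    THREE_D = "3d"


@dataclass
class Segment:
    """One piece of an interleaved sequence (hypothetical layout)."""
    modality: Modality
    content: str  # inline text, or a path/URI for non-text payloads (assumption)


@dataclass
class InterleavedSample:
    """A hypothetical any-to-any instance: interleaved inputs and outputs."""
    domain: str
    inputs: List[Segment]
    outputs: List[Segment]


def modality_sequence(segments: List[Segment]) -> List[Modality]:
    """Return the ordered list of modalities in a segment sequence."""
    return [segment.modality for segment in segments]


def structure_matches(predicted: List[Segment], reference: List[Segment]) -> bool:
    """Toy structural check: same modalities in the same order.

    This is NOT the benchmark's Response Structure Integrity metric; it only
    illustrates what checking the structure of an interleaved response means.
    """
    return modality_sequence(predicted) == modality_sequence(reference)


# Placeholder sample; the domain name and file names are invented.
sample = InterleavedSample(
    domain="example-domain",
    inputs=[
        Segment(Modality.TEXT, "Describe this clip and sketch the scene."),
        Segment(Modality.VIDEO, "clip.mp4"),
    ],
    outputs=[
        Segment(Modality.TEXT, "The clip shows ..."),
        Segment(Modality.IMAGE, "sketch.png"),
    ],
)
```

A response that returns text followed by an image would pass this toy check against `sample.outputs`, while a text-only response would not. Judging semantic correctness or interleaved coherence would need far more than an ordering comparison.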

Problem

Research questions and friction points this paper is trying to address.

any-to-any multimodal learning

interleaved multimodal inputs

multimodal benchmark

unified multimodal understanding and generation

Multimodal Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

any-to-any multimodal learning

interleaved multimodal benchmark

multimodal large language models

structured interleaved generation

multimodal evaluation suite
