ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods assess text and image modalities in isolation, neglecting cross-modal collaborative reasoning. To address this gap, we propose ROVER, the first benchmark to establish a bidirectional cross-modal reasoning evaluation framework, comprising language-guided image generation and vision-augmented language reasoning, built on 1,312 human-annotated tasks and 1,876 images. Experiments on 17 unified multimodal models show that interleaved architectures significantly outperform non-interleaved ones in cross-modal reasoning, and that strong unimodal capabilities do not by themselves confer effective cross-modal reasoning. ROVER also reveals a systematic dissociation between physical and symbolic reasoning across models. This work advances generative multimodal AI from isolated modality processing toward genuinely synergistic cross-modal reasoning.

📝 Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation: tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. To address this gap, we introduce ROVER, which tests reciprocal cross-modal reasoning: the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark of 1,312 tasks grounded in 1,876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show a dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.
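
The two settings imply a simple evaluation loop: one branch scores an image generated under verbal guidance, the other scores an answer produced after the model sketches its own intermediate visualization. Below is a minimal sketch of what such a harness could look like; the `Task` schema, `UnifiedModel` interface, `judge_image` scorer, and exact-match answer check are hypothetical stand-ins for illustration, not ROVER's actual implementation.

```python
# Minimal sketch of ROVER's two evaluation settings. All names here
# (Task, UnifiedModel, judge_image) are hypothetical illustrations,
# not the paper's actual harness.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    prompt: str                   # verbal instruction or question
    images: list[bytes]           # grounding images (1,876 across the benchmark)
    setting: str                  # "verbal->visual" or "visual->verbal"
    reference: str | None = None  # gold answer for QA-style tasks


class UnifiedModel(Protocol):
    def generate_image(self, prompt: str, images: list[bytes]) -> bytes: ...
    def generate_text(self, prompt: str, images: list[bytes]) -> str: ...


def judge_image(image: bytes, task: Task) -> float:
    """Placeholder faithfulness scorer; ROVER relies on human/model judging."""
    raise NotImplementedError("plug in a human or VLM judge here")


def evaluate(model: UnifiedModel, tasks: list[Task]) -> dict[str, float]:
    scores: dict[str, list[float]] = {"verbal->visual": [], "visual->verbal": []}
    for task in tasks:
        if task.setting == "verbal->visual":
            # Verbally-augmented reasoning for visual generation: the prompt
            # and its reasoning chain must guide faithful image synthesis.
            image = model.generate_image(task.prompt, task.images)
            scores[task.setting].append(judge_image(image, task))
        else:
            # Visually-augmented reasoning for verbal generation: the model
            # draws an intermediate visualization, then answers conditioned
            # on its own sketch.
            sketch = model.generate_image(task.prompt, task.images)
            answer = model.generate_text(task.prompt, task.images + [sketch])
            scores[task.setting].append(float(answer.strip() == (task.reference or "")))
    return {k: sum(v) / len(v) for k, v in scores.items() if v}
```

The exact-match check in the second branch is a deliberate simplification; the paper's QA-style tasks may use richer answer matching or human judgment.
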
Problem

Research questions and friction points this paper is trying to address.

Evaluating reciprocal cross-modal reasoning in unified multimodal models
Testing whether verbal prompts and reasoning chains can guide faithful image synthesis
Assessing whether self-generated visualizations strengthen a model's own verbal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROVER benchmark tests reciprocal cross-modal reasoning
Evaluates verbal reasoning guiding faithful image synthesis
Assesses visual reasoning strengthening verbal question answering