XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether omni-modal large language models (OLLMs) possess modality-agnostic reasoning capabilities or exhibit modality-specific biases. To this end, we introduce XModBench, the first large-scale tri-modal benchmark encompassing audio, vision, and text across all six cross-modal compositions, featuring 60,828 multiple-choice questions spanning five task families. Leveraging a controlled-variable experimental design, we systematically evaluate modality invariance, disparity, and consistency. We further propose a fine-grained diagnostic framework that uncovers directional imbalances in cross-modal reasoning (e.g., markedly reduced consistency when vision serves as contextual input) and modality-specific performance gaps (e.g., Gemini 2.5 Pro achieves less than 60% accuracy on spatiotemporal reasoning and degrades sharply with audio inputs). XModBench and its analytical paradigm establish a reproducible, standardized evaluation infrastructure for advancing research on modality alignment in OLLMs.

📝 Abstract
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
Problem

Research questions and friction points this paper is trying to address.

Evaluating modality-invariant reasoning versus modality-specific biases in omni-modal language models
Measuring cross-modal consistency across audio, vision, and text modalities systematically
Diagnosing spatial-temporal reasoning weaknesses and directional imbalance in multimodal AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces XModBench benchmark for cross-modal consistency
Systematically tests six modality compositions in question-answer pairs
Diagnoses modality-invariant reasoning and directional imbalance issues
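The consistency diagnosis described above can be sketched in a few lines: the same question content is posed under each of the six (context modality, answer modality) compositions, and consistency is measured as the agreement rate of the model's predictions across compositions. This is a minimal illustrative sketch, not the paper's actual evaluation code; the composition names and the pairwise-agreement metric are assumptions for illustration.

```python
# Hypothetical sketch of XModBench-style cross-modal consistency scoring:
# pose one question under all six modality compositions, then measure how
# often the model's predicted choices agree across composition pairs.
from itertools import combinations

# The six (context modality, answer modality) compositions (assumed naming).
COMPOSITIONS = [
    ("text", "audio"), ("text", "vision"),
    ("audio", "text"), ("audio", "vision"),
    ("vision", "text"), ("vision", "audio"),
]

def pairwise_consistency(preds_by_composition):
    """Fraction of composition pairs whose predictions agree for one question.

    preds_by_composition: dict mapping a composition tuple to the model's
    predicted multiple-choice option (e.g. "A").
    """
    pairs = list(combinations(preds_by_composition.values(), 2))
    if not pairs:
        return 1.0
    agreeing = sum(a == b for a, b in pairs)
    return agreeing / len(pairs)

# Example: a model that answers "A" under four compositions but "B" when
# audio carries the answer would score 7 agreeing pairs out of 15.
preds = {c: ("A" if c[1] != "audio" else "B") for c in COMPOSITIONS}
score = pairwise_consistency(preds)
```

A per-question score like this, averaged over the benchmark and grouped by which modality serves as context, is one natural way to surface the directional imbalances the paper reports.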
👥 Authors
Xingrui Wang, Advanced Micro Devices
Jiang Liu, Advanced Micro Devices
Chao Huang, Advanced Micro Devices
Xiaodong Yu, Advanced Micro Devices
Ze Wang, Advanced Micro Devices
Ximeng Sun, Advanced Micro Devices
Jialian Wu, AMD GenAI
Alan Yuille, Johns Hopkins University
Emad Barsoum, AMD, Columbia University
Zicheng Liu, Advanced Micro Devices