OmniPlay: Benchmarking Omni-Modal Models on Omni-Modal Game Playing

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation paradigms inadequately capture the genuine intelligence of general multimodal foundation models (e.g., GPT-4o, Gemini) in dynamic, interactive settings: static benchmarks lack agency, while interactive benchmarks are typically single-modality and neglect auditory and temporal cues.

Method: The paper introduces the first multimodal game benchmark to integrate auditory, visual, and temporal modalities across five distinct dynamic interaction environments. Guided by a "modality interdependence" design paradigm, it systematically incorporates both cooperative and conflicting multimodal scenarios. Technical innovations include dynamic modality control and the decoupling of memory tasks from reasoning tasks.

Contribution/Results: Experiments reveal a counterintuitive fragility: models can perform better with *less* information, exposing fundamental flaws in current multimodal fusion mechanisms. While state-of-the-art models surpass humans on memory tasks, they consistently fail on cross-modal reasoning and long-horizon planning. The benchmark establishes a new standard for multimodal intelligence evaluation and yields critical theoretical insights into multimodal integration.
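The reported "less is more" paradox is easiest to picture as a modality-ablation sweep: the same agent plays the same episodes while input channels are selectively masked, and any masked condition that outscores the full-input condition signals a fusion failure. The sketch below is a minimal illustration of that probe, not OmniPlay's actual code; `env`, `agent`, and their methods are hypothetical stand-ins.

```python
import itertools

# Hypothetical modality names; OmniPlay's real channel set may differ.
MODALITIES = ("vision", "audio", "text")

def run_episode(env, agent, active):
    """Play one episode with only the `active` subset of modalities visible."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Dynamic modality control: mask the ablated channels at every step.
        masked = {m: (obs[m] if m in active else None) for m in MODALITIES}
        action = agent.act(masked)
        obs, reward, done = env.step(action)
        total_reward += reward
    return total_reward

def ablation_sweep(env, agent, episodes=20):
    """Score every modality subset. A 'less is more' result is any subset
    that outscores the full (vision, audio, text) condition."""
    scores = {}
    for k in range(1, len(MODALITIES) + 1):
        for subset in itertools.combinations(MODALITIES, k):
            scores[subset] = sum(run_episode(env, agent, set(subset))
                                 for _ in range(episodes)) / episodes
    return scores
```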

📝 Abstract
While generalist foundation models like Gemini and GPT-4o demonstrate impressive multi-modal competence, existing evaluations fail to test their intelligence in dynamic, interactive worlds. Static benchmarks lack agency, while interactive benchmarks suffer from a severe modal bottleneck, typically ignoring crucial auditory and temporal cues. To bridge this evaluation chasm, we introduce OmniPlay, a diagnostic benchmark designed not just to evaluate, but to probe the fusion and reasoning capabilities of agentic models across the full sensory spectrum. Built on a core philosophy of modality interdependence, OmniPlay comprises a suite of five game environments that systematically create scenarios of both synergy and conflict, forcing agents to perform genuine cross-modal reasoning. Our comprehensive evaluation of six leading omni-modal models reveals a critical dichotomy: they exhibit superhuman performance on high-fidelity memory tasks but suffer from systemic failures in challenges requiring robust reasoning and strategic planning. We demonstrate that this fragility stems from brittle fusion mechanisms, which lead to catastrophic performance degradation under modality conflict and uncover a counter-intuitive "less is more" paradox, where removing sensory information can paradoxically improve performance. Our findings suggest that the path toward robust AGI requires a research focus beyond scaling to explicitly address synergistic fusion. Our platform is available for anonymous review at https://github.com/fuqingbie/omni-game-benchmark.
Problem

Research questions and friction points this paper is trying to address.

Evaluating omni-modal models in dynamic, interactive worlds
Overcoming the modal bottleneck in which interactive benchmarks ignore auditory and temporal cues
Improving cross-modal reasoning and strategic planning on the path to robust AGI
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniPlay probes fusion and reasoning across the full sensory spectrum
Five game environments systematically create modality synergy and conflict (see the sketch below)
Evaluation reveals brittle fusion mechanisms in leading omni-modal models
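The conflict side of the "modality interdependence" design can be illustrated with a cue-following trial in which the visual and auditory channels disagree and a briefing fixes which channel to trust; an agent that genuinely fuses modalities must follow the designated channel rather than a dominant one. This is a hypothetical sketch under those assumptions, not the benchmark's implementation; `Trial`, `make_trial`, and `score` are illustrative names.

```python
import random
from dataclasses import dataclass

@dataclass
class Trial:
    visual_cue: str   # direction shown on screen
    audio_cue: str    # direction spoken in the audio track
    truth: str        # channel the briefing tells the agent to trust

def make_trial(conflict: bool) -> Trial:
    """Build one cue-following trial; under `conflict`, the channels disagree."""
    directions = ["left", "right", "up", "down"]
    v = random.choice(directions)
    a = v if not conflict else random.choice([d for d in directions if d != v])
    return Trial(visual_cue=v, audio_cue=a, truth=random.choice(["visual", "audio"]))

def score(agent_answer: str, trial: Trial) -> bool:
    """Correct only if the agent followed the briefed ground-truth channel."""
    target = trial.visual_cue if trial.truth == "visual" else trial.audio_cue
    return agent_answer == target

# Example: one conflicting trial, checked against the briefed channel.
trial = make_trial(conflict=True)
print(trial, score("left", trial))
```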