Evaluating Time Awareness and Cross-modal Active Perception of Large Models via 4D Escape Room Task

πŸ“… 2026-03-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current evaluation environments struggle to effectively assess the temporal perception and active sensing capabilities of multimodal large language models under time-dependent auditory signals and selective cross-modal integration. To address this gap, this work proposes EscapeCraft-4Dβ€”the first customizable four-dimensional benchmark that explicitly incorporates temporal irreversibility and mechanisms of modality complementarity and interference. By integrating triggered audio cues, transient evidence, and location-dependent signals, EscapeCraft-4D systematically evaluates models’ spatiotemporal reasoning and active cross-modal integration under strict temporal constraints. Experiments reveal a prevalent modality bias among state-of-the-art multimodal models and a significant deficiency in cross-modal coordination under time pressure, offering the first in-depth characterization of how multimodal interactions influence decision-making dynamics.

πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have recently made rapid progress toward unified Omni models that integrate vision, language, and audio. However, existing environments largely focus on 2D or 3D visual context and vision-language tasks, offering limited support for temporally dependent auditory signals and for selective cross-modal integration, in which different modalities may provide complementary or interfering information; both are essential capabilities for realistic multimodal reasoning. As a result, whether models can actively coordinate modalities and reason under time-varying, irreversible conditions remains underexplored. To this end, we introduce **EscapeCraft-4D**, a customizable 4D environment for assessing selective cross-modal perception and time awareness in Omni models. It incorporates trigger-based auditory sources, temporally transient evidence, and location-dependent cues, requiring agents to perform spatio-temporal reasoning and proactive multimodal integration under time constraints. Building on this environment, we curate a benchmark to evaluate the corresponding abilities across powerful models. Evaluation results suggest that models struggle with modality bias, and reveal significant gaps in current models' ability to integrate multiple modalities under time constraints. Further in-depth analysis uncovers how multiple modalities interact and jointly influence model decisions in complex multimodal reasoning environments.
Problem

Research questions and friction points this paper is trying to address.

time awareness
cross-modal perception
multimodal reasoning
temporal dependency
modality integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

4D environment
time awareness
cross-modal perception
multimodal integration
EscapeCraft-4D
Yurui Dong
Fudan University
NLP, Multimodal AI, LLM
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Shuyun Lu
School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing, China
Dairu Liu
College of Software, Nankai University, Tianjin, China
Xuechen Liu
National Institute of Informatics
speaker verification, speech recognition, spoofing detection
Fuwen Luo
Tsinghua University
Computer Science
Peng Li
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China