ChronusOmni: Improving Time Awareness of Omni Large Language Models

📅 2025-12-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from insufficient explicit and implicit cross-modal temporal localization capabilities in long-video understanding and complex temporal reasoning: existing methods primarily focus on explicit vision-language grounding while neglecting audio modalities and implicit audio-visual temporal correlations (e.g., “what appears visually when a person speaks”). To address this, we propose a unified temporal modeling framework featuring: (i) novel text-based timestamp tokens and interleaved frame-level audio-visual representations; (ii) a temporal consistency reinforcement learning reward tailored for fine-grained temporal reasoning; and (iii) ChronusAV—the first high-precision, fully modality-aligned benchmark for audio-visual temporal understanding. Experiments demonstrate over 30% performance gain on ChronusAV, state-of-the-art results across multiple temporal localization metrics, and no degradation in general audio-visual comprehension capability.

📝 Abstract
Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV, with more than 30% improvement, and top results on most metrics of other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.
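The interleaving scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the timestamp token format (`<t=k>`) and the `[VIS_k]`/`[AUD_k]` placeholders are assumptions standing in for frame-level visual and audio representations.

```python
# Hypothetical sketch: interleave text-based timestamp tokens with
# per-time-unit visual and audio tokens, so each modality is anchored
# to the same explicit time index. Token names are illustrative only.

def interleave_av_sequence(num_seconds):
    """Build a token sequence like:
    <t=0> [VIS_0] [AUD_0] <t=1> [VIS_1] [AUD_1] ...
    where <t=k> is a plain-text timestamp token and [VIS_k]/[AUD_k]
    stand in for the frame-level visual and audio features at time k."""
    seq = []
    for t in range(num_seconds):
        seq.append(f"<t={t}>")    # text-based timestamp token
        seq.append(f"[VIS_{t}]")  # placeholder: visual features at time t
        seq.append(f"[AUD_{t}]")  # placeholder: audio features at time t
    return seq
```

Because the timestamp appears as ordinary text, the model can answer both explicit queries ("what happens at t=3?") and implicit cross-modal ones ("what is visible when this line is spoken?") over one unified sequence.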
Problem

Research questions and friction points this paper is trying to address.

Enhancing temporal awareness in omni large language models
Addressing implicit cross-modal audiovisual temporal grounding
Improving fine-grained temporal reasoning across modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaving timestamp tokens with multimodal representations
Using reinforcement learning for temporal ordering and reasoning
Constructing a comprehensive cross-modal aligned dataset
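The reinforcement-learning contribution above rewards correct temporal ordering and fine-grained localization. The paper's exact reward design is not given in this summary, so the following is one plausible form, assumed for illustration: an ordering check combined with temporal IoU against a ground-truth span.

```python
# Hypothetical sketch of a temporal-consistency reward (not the paper's
# actual formulation): zero reward for an ill-ordered prediction, else
# the temporal IoU between the predicted and ground-truth intervals.

def temporal_iou(pred, gold):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def temporal_reward(pred, gold):
    """Return 0 if the predicted span violates temporal ordering
    (start > end); otherwise return its IoU with the gold span."""
    if pred[0] > pred[1]:
        return 0.0  # penalize inconsistent temporal ordering
    return temporal_iou(pred, gold)
```

A graded reward like IoU (rather than exact-match) gives the policy a smooth learning signal for fine-grained localization, which is the stated goal of the reward design.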
Yijing Chen
Gaoling School of Artificial Intelligence, Renmin University of China
Yihan Wu
Gaoling School of Artificial Intelligence, Renmin University of China
Kaisi Guan
Gaoling School of Artificial Intelligence, Renmin University of China
Yuchen Ren
Renmin University of China
Yuyue Wang
Gaoling School of Artificial Intelligence, Renmin University of China
Ruihua Song
Renmin University of China
AI-based creation, multi-modality, chitchat, natural language understanding, information retrieval, information extraction
Liyun Ru
Baichuan Inc.