Causality-Aware Temporal Projection for Video Understanding in Video-LLMs

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing video large language models often fail to preserve strict temporal order and causal consistency because unconstrained bidirectional temporal modeling lets later frames influence earlier representations. To resolve this, the authors propose the V-CORE framework, which incorporates Learnable Spatial Aggregation (LSA) to retain salient spatial interactions while reducing token redundancy, and introduces a Causality-Aware Temporal Projector (CATP) to enforce unidirectional temporal information flow, thereby ensuring causal consistency. The approach explicitly models temporal ordering constraints through block-wise causal attention and a dynamic causal sink token. Furthermore, it adopts 4-bit QLoRA fine-tuning with a frozen LLM backbone for computational efficiency. Evaluated on NExT-QA, the method achieves 61.2% accuracy, with notable improvements of 3.5% and 5.2% on the temporal and causal reasoning subsets, respectively, demonstrating the effectiveness of the proposed temporal constraint mechanism.

📝 Abstract
Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.
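The block-causal attention described above can be illustrated with a minimal mask-construction sketch. This is an assumption-laden illustration, not the authors' implementation: the function name `block_causal_mask`, the token layout (frame tokens in chronological order followed by a terminal sink token), and the `num_sink` parameter are all hypothetical; only the masking pattern (bidirectional within a frame, strictly causal across frames, a terminal summary token attending to everything) follows the paper's description.

```python
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int, num_sink: int = 1):
    """Sketch of a block-causal attention mask with a terminal causal sink.

    Tokens within the same frame attend to each other bidirectionally
    (preserving intra-frame spatial interactions), but a frame's tokens
    may only attend to tokens of earlier frames, never later ones. The
    trailing summary token(s) act as a "causal sink": they attend to all
    frame tokens, aggregating temporal information in strict order.

    Returns a boolean matrix where entry [i, j] = True means token i may
    attend to token j.
    """
    n_frame_tokens = num_frames * tokens_per_frame
    total = n_frame_tokens + num_sink
    mask = np.zeros((total, total), dtype=bool)
    for f in range(num_frames):
        start = f * tokens_per_frame
        end = start + tokens_per_frame
        # Frame f attends to every token up to and including its own frame.
        mask[start:end, :end] = True
    # Sink token(s) at the end attend to all tokens, including themselves.
    mask[n_frame_tokens:, :] = True
    return mask
```

In practice such a mask would be passed (as additive `-inf` biases or a boolean mask) to the attention layers of the temporal projector, so that information flows strictly forward in time before reaching the frozen LLM.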
Problem

Research questions and friction points this paper is trying to address.

temporal ordering
causal coherence
video understanding
Video-LLMs
bidirectional projector
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causality-Aware Temporal Projection
Block-Causal Attention
Learnable Spatial Aggregation
Video-LLMs
Temporal Ordering Constraints