Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders

📅 2025-10-29
🤖 AI Summary
Current video large language models (Video-LLMs) face significant bottlenecks in modeling complex temporal dynamics—such as action evolution and inter-frame dependencies—which hinders fine-grained temporal reasoning. To address this, the paper proposes an architecture that integrates stacked temporal attention modules directly into the vision encoder, enabling explicit modeling of temporal structure and action sequences at the early visual representation stage. This design works together with a multimodal fusion mechanism to strengthen inter-frame relational modeling and cross-modal alignment. Evaluated on the VITATECS, MVBench, and Video-MME benchmarks, the model achieves performance gains of up to +5.5%, with particularly strong improvements in action recognition and temporal question answering, outperforming state-of-the-art methods. The core contribution is shifting temporal modeling into the foundational layers of the vision encoder, overcoming the limitations of conventional Video-LLMs that rely solely on post-hoc fusion or lightweight temporal modules.

📝 Abstract
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
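The core idea described above—temporal self-attention applied across the frame axis inside the vision encoder, stacked in multiple layers before visual tokens reach the LLM—can be illustrated with a minimal NumPy sketch. All names, shapes, and the random weight initialization below are hypothetical; this is not the authors' implementation (see the project page for the official code).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, wq, wk, wv):
    """One temporal attention layer: each spatial patch attends across frames.

    x: (T, P, D) array of T frames, P patch tokens per frame, dim D.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    # Move the patch axis first so attention runs over the time axis: (P, T, D)
    q, k, v = (a.transpose(1, 0, 2) for a in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (P, T, T)
    out = softmax(scores) @ v                                 # (P, T, D)
    return x + out.transpose(1, 0, 2)  # residual connection, back to (T, P, D)

def stacked_temporal_attention(x, n_layers, rng):
    """Stack several temporal attention layers, as in the proposed encoder."""
    D = x.shape[-1]
    for _ in range(n_layers):
        wq, wk, wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
        x = temporal_attention(x, wq, wk, wv)
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 32))  # 8 frames, 16 patches, dim 32
out = stacked_temporal_attention(tokens, n_layers=2, rng=rng)
print(out.shape)
```

The key design choice the paper emphasizes is that this temporal mixing happens at the visual-representation stage, so the tokens handed to the LLM already encode inter-frame relationships, rather than relying on the LLM or a lightweight post-hoc module to recover them.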
Problem

Research questions and friction points this paper is trying to address.

Improving temporal understanding in Video-LLMs
Addressing limitations in action sequence comprehension
Enhancing temporal reasoning for video question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stacked temporal attention in vision encoder
Enhancing temporal reasoning in Video-LLMs
Improving action sequence understanding in videos
Ali Rasekh
Leibniz University Hannover, L3S Research Center
Erfan Bagheri Soula
Independent Researcher
Omid Daliran
Independent Researcher
Simon Gottschalk
L3S Research Center
Knowledge Graphs · Events · Semantic Analytics · Mobility
Mohsen Fayyaz
Microsoft