🤖 AI Summary
Current video large language models (Video-LLMs) struggle to model complex temporal dynamics, such as action evolution and inter-frame dependencies, which limits fine-grained temporal reasoning. To address this, the authors propose an architecture that integrates stacked temporal attention modules directly into the visual encoder, so that temporal structure and action sequences are modeled explicitly at the early visual representation stage, before visual tokens reach the LLM. Evaluated on the VITATECS, MVBench, and Video-MME benchmarks, the model achieves gains of up to +5.5%, with particularly strong improvements in action recognition and video question answering, outperforming existing models. The core contribution is moving temporal modeling into the visual encoder itself, rather than relying solely on post-hoc fusion or lightweight temporal modules as conventional Video-LLMs do.
📝 Abstract
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. Applying temporal attention inside the vision encoder enables the model to better capture the progression of actions and the relationships between frames before visual tokens are passed to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
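To make the idea of temporal attention inside a vision encoder concrete, the following is a minimal sketch and not the authors' implementation. It assumes each frame has already been split into patch features, and it lets every patch position attend across frames only, then stacks that step a few times. The function names (`temporal_attention`, `stacked_temporal_attention`) are hypothetical, and the learned query/key/value projections, multiple heads, and normalization that a real module would have are omitted. Plain Python lists are used to keep the example dependency-free; a real model would use a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames):
    """One temporal self-attention step with a residual connection.

    frames[t][p] is a D-dimensional feature vector for patch p of frame t.
    Each patch position attends over the T frames; patches never mix
    with other patch positions. Assumption: queries, keys, and values
    are the raw features (no learned projections).
    """
    num_frames = len(frames)
    num_patches = len(frames[0])
    dim = len(frames[0][0])
    scale = 1.0 / math.sqrt(dim)

    out = [[None] * num_patches for _ in range(num_frames)]
    for p in range(num_patches):
        # The sequence of this patch position through time.
        seq = [frames[t][p] for t in range(num_frames)]
        for t in range(num_frames):
            # Scaled dot-product scores between frame t and every frame.
            scores = [
                scale * sum(q * k for q, k in zip(seq[t], seq[s]))
                for s in range(num_frames)
            ]
            weights = softmax(scores)
            # Weighted average of the same patch across all frames.
            mixed = [
                sum(weights[s] * seq[s][d] for s in range(num_frames))
                for d in range(dim)
            ]
            # Residual connection: keep the original feature, add context.
            out[t][p] = [x + m for x, m in zip(seq[t], mixed)]
    return out


def stacked_temporal_attention(frames, num_layers=2):
    """Apply the temporal attention step num_layers times in sequence."""
    for _ in range(num_layers):
        frames = temporal_attention(frames)
    return frames


if __name__ == "__main__":
    # 3 frames, 2 patches per frame, 2-dimensional features.
    video = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5], [0.0, 1.0]],
        [[0.0, 1.0], [0.0, 1.0]],
    ]
    tokens = stacked_temporal_attention(video, num_layers=2)
    print(len(tokens), len(tokens[0]), len(tokens[0][0]))
```

The output has the same frames-by-patches-by-features layout as the input, so in this sketch the resulting tokens could be handed to a downstream language model in the usual way, now carrying information from other frames.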