🤖 AI Summary
This study investigates the effectiveness of temporal modeling in multimodal large language models (MLLMs) for video understanding, systematically comparing the explicit and implicit temporal modeling paradigms. We propose the Stackable Temporal Encoder (STE), a modular, explicit temporal modeling architecture that supports flexible configuration of the temporal receptive field and token compression ratio by integrating cross-frame attention with multi-granularity compression. Across three dimensions (task performance, compression efficiency, and long-sequence temporal reasoning), we empirically demonstrate for the first time that explicit temporal modeling delivers critical gains for video MLLMs, and we further validate STE's plug-and-play generalizability to image-based MLLMs. On mainstream video understanding benchmarks, STE achieves an average accuracy improvement of +4.2% with less than 1% additional parameters, while significantly enhancing action localization and long-range temporal reasoning.
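The summary above describes STE as a stackable layer with a configurable temporal receptive field (cross-frame attention window) and token compression ratio. The sketch below is an illustrative NumPy toy, not the paper's implementation: it assumes per-frame token grids, a symmetric attention window of `window` frames, and average-pooling of `ratio` consecutive frames as the compression step; all function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ste_layer(tokens, window, ratio):
    """One illustrative STE-style layer (toy sketch, not the paper's code).

    tokens: (T, N, D) array -- T frames, N visual tokens per frame, D dims.
    window: temporal receptive field, i.e. how many frames each frame attends to.
    ratio:  temporal compression ratio (consecutive frames merged per group).
    Returns an array of shape (T // ratio, N, D).
    """
    T, N, D = tokens.shape
    out = np.empty_like(tokens)
    for t in range(T):
        # Cross-frame attention: each frame's tokens attend to tokens
        # from a local temporal window around frame t.
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        ctx = tokens[lo:hi].reshape(-1, D)                 # ((hi-lo)*N, D)
        attn = softmax(tokens[t] @ ctx.T / np.sqrt(D))     # (N, (hi-lo)*N)
        out[t] = attn @ ctx
    # Temporal compression: average-pool groups of `ratio` frames,
    # shrinking the token sequence fed to the LLM by that factor.
    Tc = T // ratio
    return out[:Tc * ratio].reshape(Tc, ratio, N, D).mean(axis=1)

# Stacking two such layers compounds both the receptive field and the
# compression, e.g. (8, 4, 16) -> (4, 4, 16) -> (2, 4, 16).
frames = np.random.default_rng(0).normal(size=(8, 4, 16))
stage1 = ste_layer(frames, window=3, ratio=2)
stage2 = ste_layer(stage1, window=3, ratio=2)
```

Stacking layers is what makes the receptive field and compression ratio tunable: each additional layer widens the effective temporal context while further reducing the number of video tokens passed to the decoder.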
📝 Abstract
Applying Multimodal Large Language Models (MLLMs) to video understanding presents significant challenges due to the need to model temporal relations across frames. Existing approaches adopt either implicit temporal modeling, relying solely on the LLM decoder, or explicit temporal modeling, employing auxiliary temporal encoders. To investigate the trade-offs between these two paradigms, we propose the Stackable Temporal Encoder (STE). STE enables flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Using STE, we systematically compare implicit and explicit temporal modeling across overall performance, token compression effectiveness, and temporal-specific understanding. We also explore STE's design considerations and its broader impact as a plug-in module and in image modalities. Our findings underscore the critical role of explicit temporal modeling and provide actionable insights for advancing video MLLMs.