Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

📅 2026-03-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multimodal large language models suffer from inefficiency when processing long, high-resolution videos due to uniform pixel-wise processing. This work proposes AutoGaze, a module that introduces an autoregressive gaze mechanism to dynamically select a minimal set of multi-scale video patches at the front end of vision Transformers (ViTs) or multimodal large language models (MLLMs), reconstructing the input within a user-specified error threshold and enabling highly efficient visual token pruning. Trained via next-token prediction combined with reinforcement learning, AutoGaze supports processing of thousand-frame 4K videos, and the work introduces HLVid, the first high-resolution long-form video question-answering benchmark. Experiments demonstrate a 4–100× reduction in visual tokens and up to a 19× inference speedup, achieving 67.0% on VideoMME and, on HLVid, improving over the baseline by 10.1% and outperforming the best existing MLLM by 4.5%.

📝 Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that can reconstruct the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame 4K-resolution videos and achieving superior results on video benchmarks (e.g., 67.0% on VideoMME). Furthermore, we introduce HLVid: the first high-resolution, long-form video QA benchmark with 5-minute 4K-resolution videos, where an MLLM scaled with AutoGaze improves over the baseline by 10.1% and outperforms the previous best MLLM by 4.5%. Project page: https://autogaze.github.io/.
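The core idea of keeping only the patches needed to reconstruct the input within an error threshold can be caricatured with a toy greedy sketch. This is an illustrative simplification, not the paper's method: `select_patches`, the mean-pooled "coarse" approximation, and the per-patch error test are all assumptions for exposition (AutoGaze is autoregressive, multi-scale, and learned via next-token prediction and RL).

```python
import numpy as np

def select_patches(frame, patch=4, tol=0.05):
    """Toy error-thresholded selection (illustrative only): approximate each
    patch by its mean (one coarse token); keep the full-resolution patch only
    where that approximation exceeds the error tolerance `tol`."""
    h, w = frame.shape
    kept = []                               # (row, col, fine?) decisions
    recon = np.empty_like(frame, dtype=float)
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            block = frame[r:r + patch, c:c + patch].astype(float)
            coarse = np.full_like(block, block.mean())
            err = np.abs(block - coarse).mean()
            if err > tol:                   # detailed region: keep fine patch
                recon[r:r + patch, c:c + patch] = block
                kept.append((r, c, True))
            else:                           # smooth region: one token suffices
                recon[r:r + patch, c:c + patch] = coarse
                kept.append((r, c, False))
    return kept, recon

# Smooth frame with one detailed corner: only that corner stays fine-grained.
frame = np.zeros((8, 8))
frame[:4, :4] = np.arange(16).reshape(4, 4) / 15.0
kept, recon = select_patches(frame, patch=4, tol=0.05)
fine = sum(1 for *_, f in kept if f)
print(fine, len(kept))  # 1 of 4 patches kept at full resolution
```

In this toy setting the reconstruction is exact (smooth patches are constant), mirroring how pruning spatiotemporally redundant regions can shrink the token budget without losing information.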
Problem

Research questions and friction points this paper is trying to address.

video understanding
spatiotemporal redundancy
multimodal large language models
high-resolution video
long-form video
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoGaze
video understanding
token reduction
autoregressive gazing
multimodal LLMs