🤖 AI Summary
This work addresses an inefficiency of multimodal large language models (MLLMs): they attend to all visual tokens at every generation step, which disperses attention and wastes computation. The authors propose Gaze Attention, a novel mechanism that brings human-like, dynamic local attention to MLLMs. It spatially groups visual embeddings into compact gaze regions and, at each decoding step, selects the task-relevant regions and restricts attention to them, while learnable context tokens preserve global awareness. The approach uses up to 90% fewer visual key-value (KV) entries in the attention computation while matching or surpassing dense-attention baselines on both image and video understanding benchmarks.
📝 Abstract
When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings, stored as key-value caches, into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
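The decoding-step procedure described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. This is not the authors' implementation. Several details are assumptions made only for illustration: regions are contiguous chunks of a flat token list (a 1D stand-in for spatial grouping), each descriptor is the mean of its region's keys, regions are scored by a dot product with the query, and the context tokens are passed in as fixed vectors rather than learned. The function names (`group_into_regions`, `select_regions`, `gaze_attention_step`) are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def group_into_regions(keys, values, region_size):
    """Split cached visual keys/values into contiguous regions.

    Assumption: the descriptor is the mean key of the region; the
    abstract only says each region has a "lightweight descriptor".
    """
    regions = []
    for start in range(0, len(keys), region_size):
        k = keys[start:start + region_size]
        v = values[start:start + region_size]
        regions.append({"keys": k, "values": v, "descriptor": mean_vector(k)})
    return regions


def select_regions(query, regions, top_k):
    """Return indices of the top_k regions by query-descriptor score."""
    scores = [dot(query, r["descriptor"]) for r in regions]
    ranked = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def gaze_attention_step(query, regions, context_keys, context_values, top_k):
    """One decoding step: attend to context tokens plus selected regions.

    Returns (output vector, chosen region indices, number of KV entries used).
    """
    chosen = select_regions(query, regions, top_k)
    # Context tokens are always attended to, to retain global information.
    keys = list(context_keys)
    values = list(context_values)
    for i in chosen:
        keys.extend(regions[i]["keys"])
        values.extend(regions[i]["values"])
    # Scaled dot-product attention with a numerically stable softmax.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, k) * scale for k in keys]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) / total
              for d in range(dim)]
    return output, chosen, len(keys)
```

In this sketch the number of visual KV entries touched per step is `top_k * region_size` plus the context tokens, rather than the full cache, which is where the reduction reported in the abstract would come from. How the actual method forms regions, scores them, and trains the context tokens is not specified in the text above.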