LLaVA-MLB: Mitigating and Leveraging Attention Bias for Training-Free Video LLMs

📅 2025-03-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Frozen image-based large language models (LLMs) suffer from early-frame information loss and inadequate spatiotemporal modeling in video understanding due to fixed token limits. Method: a training-free, two-stage attention-driven token selection framework: (1) Gridded Attention Pooling compresses the video sequence while preserving spatiotemporal structure and mitigating positional attention bias; (2) a Visual Summarization Tail appends query-relevant summary tokens at the end of the sequence, leveraging that same bias during sequence expansion. The method relies solely on attention-score analysis from a pretrained Image LLM, introducing no additional parameters or fine-tuning. Contribution/Results: the approach outperforms state-of-the-art training-free methods across multiple video understanding benchmarks, excelling in both fine-grained question answering and long-horizon reasoning, and improves both accuracy and inference efficiency without architectural or parametric modifications.

๐Ÿ“ Abstract
Training-free video large language models (LLMs) leverage pretrained Image LLMs to process video content without further training. A key challenge in such approaches is retaining essential visual and temporal information within the token limits of Image LLMs. To address this, we propose a two-stage method for selecting query-relevant tokens based on LLM attention scores: first compressing the video sequence, then expanding it. However, during the compression stage, Image LLMs often exhibit a positional attention bias in video sequences, where attention is overly concentrated on later frames, leaving early-frame information underutilized. To alleviate this attention bias during sequence compression, we propose Gridded Attention Pooling, which preserves spatiotemporal structure. Additionally, we introduce the Visual Summarization Tail to effectively exploit this bias, facilitating overall video understanding during sequence expansion. In this way, our method effectively Mitigates and Leverages attention Bias (LLaVA-MLB), enabling a frozen Image LLM to perform detailed video understanding. Experiments on several benchmarks demonstrate that our approach outperforms state-of-the-art methods in both efficiency and accuracy. Our code will be released.
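The compression stage described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, array shapes, and the use of mean pooling over a `(T, H, W)` attention map are assumptions made for illustration. The idea shown is grid-wise pooling of per-token attention scores followed by top-k selection that keeps the surviving cells in their original frame/row/column order, so the compressed sequence retains spatiotemporal structure.

```python
import numpy as np

def grid_attention_pool(attn, grid=2):
    """Pool per-token attention scores over non-overlapping spatial cells.

    attn: array of shape (T, H, W), hypothetical query-to-visual-token
          attention scores taken from a frozen Image LLM.
    grid: side length of each pooling cell.
    Returns pooled scores of shape (T, H // grid, W // grid).
    """
    T, H, W = attn.shape
    # Split each H x W frame into (H//grid) x (W//grid) cells and average
    # the attention inside every cell.
    return attn.reshape(T, H // grid, grid, W // grid, grid).mean(axis=(2, 4))

def select_grid_tokens(attn, keep_ratio=0.25, grid=2):
    """Keep the top-scoring grid cells, preserving (frame, row, col) order."""
    pooled = grid_attention_pool(attn, grid)
    flat = pooled.reshape(-1)
    k = max(1, int(keep_ratio * flat.size))
    top = np.argsort(flat)[-k:]  # flat indices of the highest pooled scores
    top.sort()                   # restore temporal/spatial ordering
    return np.unravel_index(top, pooled.shape)
```

Sorting the selected indices before unraveling is the step that matters here: top-k selection alone would scramble the sequence, while re-sorting keeps tokens in their original spatiotemporal layout for the LLM.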
Problem

Research questions and friction points this paper is trying to address.

Retain visual and temporal information in video LLMs
Mitigate positional attention bias in video sequences
Enhance video understanding without additional training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage token selection using attention scores
Gridded Attention Pooling for spatiotemporal structure
Visual Summarization Tail for leveraging attention bias
Authors
Leqi Shen, Tsinghua University
Tao He, GRG Banking Equipment Co., Ltd., South China University of Technology
Guoqiang Gong, JD.com
Fan Yang, School of Software, Tsinghua University; BNRist, Tsinghua University
Yifeng Zhang, JD.com
Pengzhang Liu, JD.com
Sicheng Zhao, Tsinghua University (Affective Computing, Multimedia, Domain Adaptation, Computer Vision)
Guiguang Ding, Tsinghua University (Computer Vision, Multimedia Retrieval)