Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

📅 2025-11-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) face prohibitive computational overhead and latency in long-video understanding due to linear growth of visual tokens with video length. To address this, we propose QTSplus, a query-aware lightweight visual token selection module. Its core contributions are: (1) cross-attention-driven relevance scoring conditioned on the textual query; (2) instance-level retention budget prediction adaptive to query complexity; (3) differentiable Top-n selection during training coupled with hard gating at inference; and (4) a compact re-encoder augmented with absolute temporal positional encoding to preserve temporal structure. Integrated into Qwen2.5-VL, QTSplus achieves an 89% visual stream compression rate and reduces end-to-end latency by 28%, while maintaining near-original accuracy across eight benchmarks. Notably, it improves directional and sequential accuracy on the TempCompass benchmark by +20.5 and +5.6 points, respectively.

📝 Abstract
Despite recent advances in the video understanding ability of multimodal large language models (MLLMs), long-video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To address this challenge, we present the Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and the LLM. Given a text query and video tokens, QTSplus dynamically selects the visual evidence most important to the query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget from the complexity of the query, and (iii) selecting the Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. Evaluation on eight long-video understanding benchmarks shows near-parity accuracy with the original Qwen models overall, with gains of +20.5 and +5.6 points on TempCompass direction and order accuracies, respectively. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence. We will make all code, data, and trained model weights publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addressing computational explosion in attention cost for long video understanding
Selecting most relevant visual tokens based on text query complexity
Enabling efficient multimodal language models for real-world long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Query-aware token selector dynamically selects important visual tokens
Predicts retention budget based on query complexity for token selection
Uses re-encoder with temporal information for second-level localization
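The three-step selection described above (cross-attention relevance scoring, retention-budget prediction, hard Top-n gating) can be sketched compactly. The following is an illustrative NumPy mock, not the authors' implementation: the fixed `budget_frac` stands in for the paper's learned, query-conditioned budget predictor, and the hard gate shown here would be paired with a straight-through estimator during training so gradients reach the scorer.

```python
import numpy as np

def select_tokens(query_vec, video_tokens, budget_frac):
    """Query-aware Top-n visual token selection (illustrative sketch).

    query_vec:    (d,) pooled text-query embedding (assumed shape)
    video_tokens: (T, d) visual token embeddings
    budget_frac:  stand-in for the predicted retention budget
    """
    d = query_vec.shape[-1]
    # (i) cross-attention relevance: scaled dot-product of each
    # visual token against the query, normalized with softmax
    logits = video_tokens @ query_vec / np.sqrt(d)
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()
    # (ii) instance-specific retention budget; the paper predicts
    # this from query complexity, here it is a fixed fraction
    n = max(1, int(budget_frac * len(video_tokens)))
    # (iii) hard Top-n gate; kept indices are re-sorted so the
    # retained tokens keep their original temporal order
    keep = np.sort(np.argsort(scores)[::-1][:n])
    return keep, scores

rng = np.random.default_rng(0)
tokens = rng.standard_normal((100, 32))
query = rng.standard_normal(32)
kept, scores = select_tokens(query, tokens, budget_frac=0.11)
print(len(kept))  # 11 of 100 tokens retained, i.e. ~89% compression
```

Keeping the surviving indices in temporal order matters because the re-encoder downstream relies on absolute time information; discarding order at this stage would make second-level localization impossible.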
Siyou Li
Queen Mary University of London, London, UK
Huanan Wu
University of Sheffield, Sheffield, UK
Juexi Shao
Queen Mary University of London, London, UK
Yinghao Ma
PhD candidate, Centre for Digital Music (C4DM), Queen Mary University of London
Music Information Retrieval, Large Language Models, Multimodal Learning, Audio Signal Processing
Yujian Gan
Queen Mary University of London, London, UK
Yihao Luo
Imperial College London
Yuwei Wang
Pengcheng Laboratory, Shenzhen, China
Dong Nie
unc
Computational Neuroscience, Machine Learning, Large Models
Lu Wang
Meituan Inc, China
Wengqing Wu
Queen Mary University of London, London, UK; Nanjing University of Science, Nanjing, China
Le Zhang
Queen Mary University of London, London, UK; University of Birmingham, Birmingham, UK
Massimo Poesio
Professor of Comp. Linguistics, Queen Mary University / Professor of NLP, University of Utrecht
Computational Linguistics / NLP, Games and NLP, Anaphora / Coreference, Disagreement and NLP, Brain Data
Juntao Yu
Queen Mary University of London
Natural Language Processing, Artificial Intelligence