🤖 AI Summary
Existing long-video understanding methods prune low-activation visual tokens during decoder-side post-processing, neglecting the semantic alignment between input-layer visual tokens and the instruction (query)—which leads to inefficient use of the token budget and loss of critical semantics. This paper proposes QuoTA, a plug-and-play, training-free, pre-fusion module that performs query-oriented frame-level importance scoring and dynamic token pre-allocation before cross-modal interaction. Its core contributions are: (1) a Chain-of-Thought (CoT)-guided frame importance scoring mechanism that decouples the query for more precise assessment; (2) an ante-hoc token allocation paradigm that explicitly models visual-token–instruction semantic alignment at the input layer; and (3) a lightweight, fine-tuning-free design that extends existing large video-language models. Integrated into LLaVA-Video-7B, QuoTA achieves an average +3.2% improvement across six major benchmarks—including Video-MME and MLVU—without increasing the visual token budget. Code is publicly available.
📝 Abstract
Recent advances in long video understanding typically mitigate visual redundancy through visual token pruning based on attention distribution. However, existing methods employ post-hoc low-response token pruning in decoder layers, overlooking the input-level semantic correlation between visual tokens and the instruction (query). In this paper, we propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models (LVLMs) with visual token assignment based on query-oriented frame-level importance assessment. Query-oriented token selection is crucial because it aligns visual processing with task-specific requirements, optimizing token budget utilization while preserving semantically relevant content. Specifically, (i) QuoTA strategically allocates frame-level importance scores based on query relevance, enabling one-time visual token assignment before cross-modal interactions in decoder layers; (ii) we decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring; and (iii) QuoTA offers plug-and-play functionality that extends to existing LVLMs. Extensive experimental results demonstrate that implementing QuoTA with LLaVA-Video-7B yields an average performance improvement of 3.2% across six benchmarks (including Video-MME and MLVU) while operating within an identical visual token budget as the baseline. Code is open-sourced at https://github.com/MAC-AutoML/QuoTA.
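The allocation step described above—distributing a fixed visual-token budget across frames in proportion to their query-relevance scores—can be sketched as follows. This is a minimal illustrative helper, not the paper's implementation: `allocate_tokens` and its signature are hypothetical, and in QuoTA the per-frame scores would come from LVLM-based CoT-guided assessment rather than being supplied directly.

```python
def allocate_tokens(frame_scores, total_budget, min_tokens=1):
    """Split a fixed token budget across frames proportionally to their
    query-relevance scores (hypothetical sketch, not QuoTA's actual code)."""
    n = len(frame_scores)
    assert total_budget >= min_tokens * n, "budget too small for per-frame minimum"
    total = sum(frame_scores)
    if total == 0:
        # No frame is query-relevant: fall back to a uniform split.
        base = total_budget // n
        alloc = [base] * n
    else:
        # Proportional allocation, floored, with a per-frame minimum
        # so no frame is dropped entirely.
        alloc = [max(min_tokens, int(total_budget * s / total))
                 for s in frame_scores]
    # Correct rounding drift so the budget is matched exactly,
    # adjusting the highest-scoring frames first.
    drift = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: -frame_scores[i])
    i = 0
    while drift != 0:
        j = order[i % n]
        step = 1 if drift > 0 else -1
        if alloc[j] + step >= min_tokens:
            alloc[j] += step
            drift -= step
        i += 1
    return alloc
```

For example, with scores `[0.9, 0.1, 0.0]` and a budget of 100 tokens, the highly relevant first frame receives the bulk of the budget while the irrelevant frame keeps only the minimum, matching the idea of assigning tokens once, before any cross-modal interaction in the decoder.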