OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

📅 2026-05-12

🤖 AI Summary

This work addresses the high computational cost of long-video large language models, which stems from the large number of visual tokens accumulated across frames. Existing training-free compression methods overlook each token's semantic importance within its frame and do not adapt compression strength to individual frame pairs. The authors propose a two-stage temporal token compression framework grounded in optimal transport (OT): first, spatial pruning preserves semantically critical content within each frame; then OT is solved between neighboring frames, using non-uniform token mass and a locality-aware cost that combines feature and spatial disparities, to estimate inter-frame compressibility and allocate compression budgets accordingly. Non-uniform mass protects semantically important tokens from aggressive compression, which makes the compression both semantics-aware and adaptive. With only 10% of tokens retained, the method preserves 95.8% of the original performance on video question answering and 73.9% on video temporal grounding across six benchmarks, outperforming existing training-free alternatives.
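The budget-allocation idea can be illustrated with a minimal sketch. The summary does not give the paper's allocation rule, so everything here is an assumption made for illustration: the function name `allocate_budgets`, the per-pair `floor`, and the proportional rule itself, which gives frame pairs with a higher transport cost (harder to compress) a larger share of the retained tokens.

```python
def allocate_budgets(costs, total_budget, floor=1):
    """Split `total_budget` retained tokens across frame pairs.

    Hypothetical rule (not taken from the paper): every pair keeps at
    least `floor` tokens, and the remainder is shared in proportion to
    each pair's transport cost.
    """
    n = len(costs)
    if total_budget < floor * n:
        raise ValueError("total_budget is too small for the per-pair floor")

    spare = total_budget - floor * n
    total_cost = sum(costs)
    if total_cost == 0:
        # All pairs are equally easy to transport: share evenly.
        shares = [spare / n] * n
    else:
        shares = [spare * c / total_cost for c in costs]

    # Round down, then hand out the leftover tokens to the pairs with
    # the largest fractional parts (largest-remainder rounding).
    budgets = [floor + int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Three frame pairs with increasing transport difficulty, 13 tokens to keep.
budgets = allocate_budgets([1.0, 3.0, 6.0], total_budget=13)
```

Largest-remainder rounding is used so that the integer budgets always sum exactly to the requested total; a real implementation might instead clip budgets to the number of tokens available in each frame.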

📝 Abstract

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
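To make the transport step concrete, the following is a minimal sketch of OT between two neighboring frames with non-uniform token mass and a cost that mixes feature and spatial distance. The abstract does not specify the exact cost, the mass definition, or the solver, so these are all assumptions: cosine distance for features, Euclidean distance for positions, the weight `lam`, and entropic regularization solved with Sinkhorn iterations (`eps`, `iters`). The function names are hypothetical.

```python
import math


def locality_aware_cost(feats_a, pos_a, feats_b, pos_b, lam=0.5):
    """cost[i][j] = cosine distance of features + lam * distance of positions.

    This particular combination is an assumption for illustration.
    """
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / (norm + 1e-12)

    return [
        [(1.0 - cosine(fa, fb)) + lam * math.dist(pa, pb)
         for fb, pb in zip(feats_b, pos_b)]
        for fa, pa in zip(feats_a, pos_a)
    ]


def normalize(importance):
    """Turn per-token importance scores into a mass vector that sums to 1."""
    total = sum(importance)
    return [w / total for w in importance]


def sinkhorn(cost, mass_a, mass_b, eps=0.1, iters=200):
    """Entropic OT via Sinkhorn scaling; returns (plan, total transport cost)."""
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two mass vectors.
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


# Two tokens per frame; the second token is treated as more important.
feats = [[1.0, 0.0], [0.0, 1.0]]
pos = [(0.0, 0.0), (0.0, 1.0)]
mass = normalize([1.0, 3.0])
cost = locality_aware_cost(feats, pos, feats, pos)
plan, difficulty = sinkhorn(cost, mass, mass)
```

In this sketch the returned total cost plays the role of the transport difficulty of a frame pair: a pair of near-identical frames yields a small value, while frames whose content has changed or moved yield a larger one. Tokens with more mass contribute more to that total, which is one way to read the abstract's statement that non-uniform mass protects important tokens.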

Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models

token compression

temporal compression

inference cost

visual tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimal Transport

Temporal Token Compression

Video Large Language Models