TTF: Temporal Token Fusion for Efficient Video-Language Model

📅 2026-05-08

🤖 AI Summary

This work addresses the computational bottleneck in video-language models caused by the growth of visual token counts in long videos, which slows inference during the large language model's prefill phase. The authors propose a training-free, plug-and-play preprocessing framework that compresses the visual token sequence before it reaches the LLM. The approach uses a temporal token fusion mechanism based on local window similarity: it automatically selects an anchor frame, merges redundant tokens in subsequent frames, and realigns coordinates so that positions stay consistent, giving near-lossless compression. Evaluated on Qwen3-VL-8B, the method removes about 67% of visual tokens with approximately 0.16 GFLOPs of matching overhead while retaining 99.5% of the baseline accuracy, which reduces the prefill cost that dominates inference throughput.

📝 Abstract

Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at $448 \times 448$ resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame and then, for each subsequent frame, performs a local window similarity search (e.g., $3 \times 3$), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold $t = 0.70$, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only $\approx 0.16$ GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf.
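
The local window search described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fuse_temporal_tokens`, the choice of the first frame as anchor, the use of cosine similarity, and the decision to drop a fused token in favor of its anchor match are all assumptions made for illustration. The abstract only states that an anchor is selected automatically and that tokens exceeding a threshold are fused.

```python
import numpy as np


def fuse_temporal_tokens(frames, threshold=0.70, window=3):
    """Hypothetical sketch of local-window temporal token fusion.

    frames: array of shape (T, H, W, D) holding visual token embeddings.
    Returns the kept tokens, shape (N, D), and their original (t, h, w)
    coordinates, shape (N, 3).
    """
    T, H, W, D = frames.shape
    r = window // 2

    # Normalize so that a dot product equals cosine similarity (assumed metric).
    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    unit = frames / np.maximum(norms, 1e-8)

    # Assumption: the first frame is the anchor; the paper selects it automatically.
    anchor = unit[0]
    # Zero-pad spatially so out-of-bounds neighbors yield similarity 0.
    padded = np.pad(anchor, ((r, r), (r, r), (0, 0)))

    keep = np.ones((T, H, W), dtype=bool)
    for t in range(1, T):
        best = np.full((H, W), -np.inf)
        # Compare each token with the anchor tokens in its local window.
        for dy in range(window):
            for dx in range(window):
                neighbor = padded[dy:dy + H, dx:dx + W]
                sim = np.sum(unit[t] * neighbor, axis=-1)
                best = np.maximum(best, sim)
        # Tokens whose best match exceeds the threshold are treated as fused
        # into the anchor and removed from the sequence.
        keep[t] = best <= threshold

    # Kept tokens retain their original coordinates, a simplified stand-in
    # for the coordinate realignment mentioned in the abstract.
    positions = np.argwhere(keep)
    tokens = frames[keep]
    return tokens, positions
```

Because each token is compared only against a small window of anchor tokens, the matching cost in this sketch grows linearly with the number of tokens per frame, in contrast to a global similarity search, whose cost grows quadratically.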

Problem

Research questions and friction points this paper is trying to address.

video-language models

visual tokens

inference cost

temporal redundancy

token compression

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Token Fusion

video-language model

token compression