🤖 AI Summary
To address the high computational overhead and deployment bottlenecks that temporal redundancy across video frames creates in text-video retrieval, this paper proposes **Temporal Token Merging (TempMe)**, a progressive multi-granularity temporal token merging mechanism designed specifically for the video modality. Unlike image-level token compression methods, which fail to model cross-frame temporal redundancy, this approach merges temporal tokens from neighboring video clips, enabling holistic video-level representation learning alongside efficient token compression. The mechanism is compatible with diverse fine-tuning paradigms. Experiments demonstrate that it reduces the output token count by 95%, cuts GFLOPs by 51%, accelerates inference by 1.8×, and improves R-Sum by 4.4% under parameter-efficient fine-tuning and by 7.9% under full fine-tuning, where training is also 1.57× faster and GPU memory usage drops to 75.2% of the baseline.
📝 Abstract
Most text-video retrieval methods utilize the text-image pre-trained CLIP as a backbone, incorporating complex modules that result in high computational overhead. As a result, many studies focus on efficient fine-tuning. The primary challenge in efficient adaptation arises from the inherent differences between the image and video modalities: each sampled video frame must be processed by the image encoder independently, which increases complexity and complicates practical deployment. Although existing efficient methods fine-tune only a small number of trainable parameters, they still incur high inference costs due to the large number of tokens. In this work, we argue that temporal redundancy, caused by repeated information across consecutive frames, contributes significantly to the model's high complexity. Existing token compression methods designed for image models fail to address this unique challenge, as they overlook temporal redundancy across frames. To tackle these problems, we propose Temporal Token Merging (TempMe) to reduce temporal redundancy. Specifically, we introduce a progressive multi-granularity framework: by gradually combining neighboring clips, we merge temporal tokens across different frames and learn video-level features, leading to lower complexity and better performance. Extensive experiments validate the superiority of our TempMe. Compared to previous efficient text-video retrieval methods, TempMe significantly reduces output tokens by 95% and GFLOPs by 51%, while achieving a 1.8X speedup and a 4.4% R-Sum improvement. Additionally, TempMe exhibits robust generalization capabilities, integrating effectively with both efficient and full fine-tuning methods. With full fine-tuning, TempMe achieves a significant 7.9% R-Sum improvement, trains 1.57X faster, and uses only 75.2% of the GPU memory. Our code will be released.
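To make the progressive multi-granularity idea concrete, here is a minimal PyTorch sketch of how one *could* pair neighboring single-frame clips stage by stage and merge the most similar cross-frame tokens via ToMe-style bipartite soft matching. This is an illustration under stated assumptions, not the paper's released implementation: the function names (`bipartite_merge`, `progressive_temporal_merge`), the fixed `keep_ratio`, and the power-of-two frame count are all hypothetical choices for this sketch.

```python
import torch
import torch.nn.functional as F


def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most redundant tokens via bipartite soft matching.

    x: (B, N, C) tokens with N even; returns (B, N - r, C).
    Even-position tokens are candidates to be absorbed (by averaging)
    into their most similar odd-position partner, as in ToMe.
    """
    a, b = x[:, ::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)
    score, match = sim.max(dim=-1)              # best partner in b per a-token
    order = score.argsort(dim=-1, descending=True)
    merged, kept = order[:, :r], order[:, r:]   # drop the r most redundant a-tokens

    C = x.size(-1)
    kept_a = a.gather(1, kept.unsqueeze(-1).expand(-1, -1, C))
    src = a.gather(1, merged.unsqueeze(-1).expand(-1, -1, C))
    dst = match.gather(1, merged).unsqueeze(-1).expand(-1, -1, C)
    b = b.scatter_reduce(1, dst, src, reduce="mean", include_self=True)
    return torch.cat([kept_a, b], dim=1)


def progressive_temporal_merge(clips: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Progressively fuse neighboring clips while merging temporal tokens.

    clips: (B, T, N, C) -- T single-frame clips with N patch tokens each;
    T is assumed to be a power of two. At each stage, neighboring clips
    are concatenated along the token axis and redundant cross-frame
    tokens are merged away, so the token count shrinks instead of
    doubling. In a real model, transformer blocks would run between stages.
    """
    B, T, N, C = clips.shape
    while T > 1:
        clips = clips.reshape(B, T // 2, 2 * N, C)   # pair neighboring clips
        r = int(2 * N * (1 - keep_ratio))            # tokens removed per pair
        clips = torch.stack(
            [bipartite_merge(clips[:, t], r) for t in range(T // 2)], dim=1
        )
        B, T, N, C = clips.shape
    return clips.squeeze(1)                          # final video-level tokens


# Example: 8 frames x 196 patch tokens -> one set of 196 video-level tokens
tokens = torch.randn(2, 8, 196, 768)
print(progressive_temporal_merge(tokens).shape)      # torch.Size([2, 196, 768])
```

With `keep_ratio=0.5`, the token count halves at every stage, so 8 frames of 196 tokens collapse into a single set of 196 video-level tokens (an 87.5% reduction in this toy setup; the 95% reduction reported above reflects the paper's own merging schedule and granularities).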