MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video understanding with vision-language models (VLMs) incurs prohibitive computational overhead because of high frame rates and long temporal durations, and existing training-free token compression methods commonly lose information. To address both problems, this paper proposes a memory-augmented, reinforcement learning-based token compression framework. Its core innovations are: (1) a Visual Memory Retriever (VMR) that enables structured retrieval of salient video segments, and (2) a coupled C-GRPO reinforcement learning policy that jointly optimizes the two-stage distillation process (retrieval followed by compression). Evaluated on six standard video benchmarks, the method achieves near-baseline performance using only a single frame's worth of visual tokens, reducing visual tokens by 95%, GPU memory consumption by 72%, and inference latency by 23.9%.

📝 Abstract

The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
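The retrieve-then-compress idea can be illustrated with a toy sketch. The abstract does not specify how the VMR scores clips or how tokens are compressed, so everything below is an assumption made for illustration: cosine similarity as the retrieval score, average pooling as the compression step, and the hypothetical function names `retrieve_clips` and `compress_tokens`. It is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_clips(query, clip_embeddings, k):
    """Assumed retrieval step: indices of the k clips most similar to the
    query embedding, returned in temporal (ascending index) order."""
    ranked = sorted(
        range(len(clip_embeddings)),
        key=lambda i: cosine(query, clip_embeddings[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def compress_tokens(tokens, budget):
    """Assumed compression step: average-pool consecutive token vectors so
    that at most `budget` vectors remain."""
    if len(tokens) <= budget:
        return [list(t) for t in tokens]
    group = math.ceil(len(tokens) / budget)  # tokens merged per output vector
    pooled = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        pooled.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy data: four clip embeddings and a query embedding (all made up).
clips = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = retrieve_clips([1.0, 0.0], clips, k=2)

# Toy token sequence pooled down to a budget of one vector.
tokens = [[float(i), float(i) + 1.0] for i in range(20)]
compressed = compress_tokens(tokens, budget=1)
reduction = 1 - len(compressed) / len(tokens)
```

In this toy setup, pooling 20 token vectors into 1 gives a reduction of 0.95. The count of 20 is a made-up value chosen for the example and says nothing about the frame or token counts used in the paper.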

Problem

Research questions and friction points this paper is trying to address.

Reduces computational costs in video understanding models

Minimizes information loss during token compression

Enables efficient real-time video processing for constrained devices

Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-augmented reinforcement learning for token compression

Retrieve-then-compress strategy with visual memory retriever

Compression group relative policy optimization for distillation
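For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows the group-relative advantage that standard GRPO computes: each sampled response's reward is normalized by the mean and standard deviation of its own group. How C-GRPO defines its reward, and how the teacher model enters it, is not stated here, so the reward values and the function name `group_relative_advantages` are illustrative assumptions. The choice of population standard deviation and the `eps` term are also assumptions, since implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group:
    A_i = (r_i - mean(r)) / (std(r) + eps).

    `eps` guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same video question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above their group's average receive positive advantages and those below receive negative ones, which removes the need for a separate learned value function.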

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs