Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces dual challenges: prohibitive computational overhead and loss of critical temporal information. To address these, we propose ViLaMP, a hierarchical differential distillation framework—the first to introduce differential distillation into long-video vision-language models (VLMs). ViLaMP employs query-driven keyframe selection to retain the most discriminative frames, while performing patch-level feature merging on non-keyframes, enabling efficient compression with semantic fidelity. The method integrates hierarchical video encoding, differential keyframe selection, differential feature merging, and hybrid-granularity (frame- and patch-level) information compression. Evaluated on four video understanding benchmarks, ViLaMP achieves state-of-the-art performance, excelling in particular on long-video tasks with substantial gains in both accuracy and inference efficiency. Notably, it processes hour-long videos (up to 10K frames) on a single NVIDIA A100 GPU.
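
The frame-level idea can be pictured as greedy scoring against the query with a penalty for redundancy among frames already picked. The PyTorch sketch below is illustrative only: the function name, the MMR-style greedy objective, and the `alpha` weight are our assumptions, not the paper's exact formulation.

```python
import torch

def select_keyframes(frame_feats: torch.Tensor,
                     query_feat: torch.Tensor,
                     k: int,
                     alpha: float = 0.5) -> list[int]:
    """Greedy keyframe selection balancing query relevance against
    temporal redundancy (an MMR-style heuristic, assumed here).

    frame_feats: [T, d] L2-normalized frame embeddings
    query_feat:  [d]    L2-normalized query embedding
    """
    relevance = frame_feats @ query_feat              # [T] similarity to query
    selected: list[int] = []
    for _ in range(k):
        if selected:
            # max similarity to any already-selected frame = redundancy
            redundancy = (frame_feats @ frame_feats[selected].T).max(dim=1).values
        else:
            redundancy = torch.zeros_like(relevance)
        score = alpha * relevance - (1 - alpha) * redundancy
        score[selected] = float("-inf")               # never re-pick a frame
        selected.append(int(score.argmax()))
    return sorted(selected)
```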

📝 Abstract
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLaMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLaMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLaMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLaMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance.
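
The patch-level mechanism can be read as query-conditioned compression of each non-keyframe. The sketch below is an assumption-laden proxy: `merge_nonkeyframe`, the top-m selection rule, and the shapes are ours, chosen only to make the mixed-precision analogy concrete, not taken from the paper's released code.

```python
import torch

def merge_nonkeyframe(patch_feats: torch.Tensor,
                      query_feat: torch.Tensor,
                      m: int = 1) -> torch.Tensor:
    """Compress one non-keyframe to its m most query-salient patch
    features (a top-m proxy for differential feature merging).

    patch_feats: [N, d] patch embeddings of one non-keyframe
    query_feat:  [d]    query embedding
    returns:     [m, d] compressed representation
    """
    saliency = patch_feats @ query_feat       # [N] per-patch relevance
    top = saliency.topk(m).indices            # keep the most salient patches
    return patch_feats[top]

# Keyframes keep all N patch tokens ("full precision"); non-keyframes
# are reduced to m << N tokens ("low precision") — the mixed-precision
# analogy drawn in the abstract.
```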
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs for long-form video processing in VLMs
Preserving critical temporal dependencies and semantic information
Achieving efficient processing of ultra-long videos (up to 10K frames)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical differential distillation for efficiency (see the end-to-end sketch after this list)
Differential keyframe selection for relevance
Differential feature merging for salience
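
Putting the pieces together, a hedged end-to-end sketch (same assumptions as the two sketches above; not the paper's released code) shows how frame-level selection and patch-level merging combine into one compressed token sequence:

```python
import torch

def vilamp_compress(frame_feats: torch.Tensor,   # [T, d] pooled frame embeddings
                    patch_feats: torch.Tensor,   # [T, N, d] per-frame patch embeddings
                    query_feat: torch.Tensor,    # [d] query embedding
                    k: int, m: int) -> torch.Tensor:
    """Mixed-precision pipeline sketch: k keyframes keep full patch
    resolution; every other frame is squeezed to its m most
    query-salient patches. Illustrative assumptions throughout."""
    T = frame_feats.size(0)
    # Frame level: greedy relevance-vs-redundancy selection (as above).
    relevance = frame_feats @ query_feat
    selected: list[int] = []
    for _ in range(k):
        red = ((frame_feats @ frame_feats[selected].T).max(dim=1).values
               if selected else torch.zeros_like(relevance))
        score = relevance - red
        score[selected] = float("-inf")
        selected.append(int(score.argmax()))
    keep = set(selected)
    # Patch level: keyframes keep all N tokens, others keep top-m.
    tokens = []
    for t in range(T):
        if t in keep:
            tokens.append(patch_feats[t])                        # [N, d]
        else:
            sal = patch_feats[t] @ query_feat                    # [N]
            tokens.append(patch_feats[t][sal.topk(m).indices])   # [m, d]
    return torch.cat(tokens, dim=0)  # token sequence fed to the LLM backbone
```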
Chuanqi Cheng
Gaoling School of Artificial Intelligence, Renmin University of China
Dialogue System · Vision-Language Models
Jian Guan
Ant Group
Wei Wu
Ant Group
Rui Yan
Gaoling School of Artificial Intelligence, Renmin University of China