AI Summary
This work addresses the challenges of long-form video understanding, particularly the memory bottleneck caused by frame redundancy and the difficulty of extracting critical information. To this end, the authors propose an end-to-end framework that combines information-density-driven adaptive sampling with autoencoder-guided spatiotemporal compression, effectively preserving discriminative content while significantly reducing computational overhead. The approach seamlessly integrates with multimodal large language models, achieving high compression ratios without compromising semantic fidelity. Extensive experiments demonstrate state-of-the-art performance on both long-video and general video understanding benchmarks, underscoring the method's efficiency, versatility, and scalability.
Abstract
With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end scheme for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
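To make the sampling idea concrete, here is a minimal sketch of information-density-driven adaptive frame selection. The abstract does not specify the density measure, so this sketch assumes inter-frame change as a stand-in proxy: frames are drawn by uniformly sampling the cumulative density, so fast-changing segments receive more of the frame budget. The function name `adaptive_sample` and all details below are illustrative, not the authors' implementation.

```python
import numpy as np

def adaptive_sample(frames: np.ndarray, budget: int) -> np.ndarray:
    """Select up to `budget` frame indices, biased toward high-information spans.

    `frames` has shape (T, H, W, C). Information density is approximated
    by the mean absolute difference between consecutive frames; the
    paper's actual density measure is not given in the abstract.
    """
    t = frames.shape[0]
    if budget >= t:
        return np.arange(t)
    # Per-frame "density": how much the frame changed from its predecessor.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    density = np.concatenate([[diffs.mean()], diffs])  # frame 0 gets the average
    # Cumulative density defines a warped timeline; sampling it uniformly
    # places more frames where the content changes quickly.
    cdf = np.cumsum(density)
    cdf /= cdf[-1]
    targets = (np.arange(budget) + 0.5) / budget
    idx = np.searchsorted(cdf, targets)
    return np.unique(np.clip(idx, 0, t - 1))
```

Note that `np.unique` may return slightly fewer than `budget` indices when several targets collapse onto one frame (e.g., a long static shot); a production sampler would redistribute the leftover budget.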