FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

📅 2025-04-29

📈 Citations: 0

✨ Influential: 0

career value

228K/year

🤖 AI Summary

To address the high computational cost and frequent loss of critical spatiotemporal information in long-video understanding, this paper proposes a lightweight dynamic weighted multi-frame fusion framework. Our method introduces, for the first time, semantic-aware keyframe ranking coupled with an adaptive dynamic weighting fusion mechanism to achieve efficient spatiotemporal feature compression. Additionally, we employ a synthetic data generation strategy that requires no manual annotation to enhance model generalization. By eliminating costly modules—relying solely on lightweight dynamic fusion and optimized frame sampling—the framework achieves significant performance gains: across multiple long-video understanding benchmarks, it reduces token input volume by over 40%, improves average accuracy by 5.2%, and simultaneously maintains high inference efficiency and faithful preservation of discriminative content.

Technology Category

Application Category

📝 Abstract

Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules.To address these issues, we propose FiLA(Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.

Problem

Research questions and friction points this paper is trying to address.

Reduces redundant inter-frame information in long videos

Adaptively integrates frames while preserving key information

Improves long-video comprehension efficiency and accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight dynamic-weight multi-frame fusion strategy

Keyframe selection for improved summarization

Simple long-video training data generation

🔎 Similar Papers

HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics