FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding

📅 2025-04-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational cost and frequent loss of critical spatiotemporal information in long-video understanding, this paper proposes a lightweight dynamic weighted multi-frame fusion framework. Our method introduces, for the first time, semantic-aware keyframe ranking coupled with an adaptive dynamic weighting fusion mechanism to achieve efficient spatiotemporal feature compression. Additionally, we employ a synthetic data generation strategy that requires no manual annotation to enhance model generalization. By eliminating costly modules—relying solely on lightweight dynamic fusion and optimized frame sampling—the framework achieves significant performance gains: across multiple long-video understanding benchmarks, it reduces token input volume by over 40%, improves average accuracy by 5.2%, and simultaneously maintains high inference efficiency and faithful preservation of discriminative content.

Technology Category

Application Category

📝 Abstract
Recent advancements in video understanding within visual large language models (VLLMs) have led to notable progress. However, the complexity of video data and contextual processing limitations still hinder long-video comprehension. A common approach is video feature compression to reduce token input to large language models, yet many methods either fail to prioritize essential features, leading to redundant inter-frame information, or introduce computationally expensive modules.To address these issues, we propose FiLA(Fine-grained Vision Language Model)-Video, a novel framework that leverages a lightweight dynamic-weight multi-frame fusion strategy, which adaptively integrates multiple frames into a single representation while preserving key video information and reducing computational costs. To enhance frame selection for fusion, we introduce a keyframe selection strategy, effectively identifying informative frames from a larger pool for improved summarization. Additionally, we present a simple yet effective long-video training data generation strategy, boosting model performance without extensive manual annotation. Experimental results demonstrate that FiLA-Video achieves superior efficiency and accuracy in long-video comprehension compared to existing methods.
Problem

Research questions and friction points this paper is trying to address.

Reduces redundant inter-frame information in long videos
Adaptively integrates frames while preserving key information
Improves long-video comprehension efficiency and accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight dynamic-weight multi-frame fusion strategy
Keyframe selection for improved summarization
Simple long-video training data generation
🔎 Similar Papers
No similar papers found.