Dynamic-Aware Video Distillation: Optimizing Temporal Resolution Based on Video Semantics

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video dataset distillation methods overlook semantic diversity and rigidly assume uniform temporal redundancy across classes, limiting both compression efficiency and fidelity. This paper proposes a dynamic-aware video distillation framework that, for the first time, adapts temporal resolution to video semantics, dynamically adjusting keyframe density according to the content of each class. It further designs a teacher-in-the-loop reinforcement learning strategy that uses downstream task performance as the reward signal to optimize temporal compression decisions end to end. Across multiple benchmarks, the method reaches higher downstream accuracy (+3.1% top-1) with significantly fewer frames (42% reduction on average), showing consistent advantages in compactness, fidelity, and generalization over prior approaches.

📝 Abstract
With the rapid development of vision tasks and the scaling of datasets and models, redundancy reduction in vision datasets has become a key area of research. To address this issue, dataset distillation (DD) has emerged as a promising approach to generating highly compact synthetic datasets with significantly less redundancy while preserving essential information. However, while DD has been extensively studied for image datasets, DD on video datasets remains underexplored. Video datasets present unique challenges due to the presence of temporal information and varying levels of redundancy across different classes. Existing DD approaches assume a uniform level of temporal redundancy across all video semantics, which limits their effectiveness on video datasets. In this work, we propose Dynamic-Aware Video Distillation (DAViD), a Reinforcement Learning (RL) approach to predict the optimal temporal resolution of the synthetic videos. A teacher-in-the-loop reward function is proposed to update the RL agent policy. To the best of our knowledge, this is the first study to introduce adaptive temporal resolution based on video semantics in video dataset distillation. Our approach significantly outperforms existing DD methods, and paves the way for future work on more efficient, semantics-adaptive video dataset distillation.
Problem

Research questions and friction points this paper is trying to address.

Reducing redundancy in video datasets while preserving essential information
Addressing varying temporal redundancy across different video semantics
Optimizing temporal resolution adaptively based on video content
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL predicts optimal video temporal resolution
Teacher-in-the-loop reward updates RL policy
Adaptive resolution based on video semantics
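The innovation bullets above can be sketched as a toy REINFORCE loop: a per-class categorical policy picks one of several candidate temporal resolutions, and a stand-in teacher reward favors just enough frames to cover that class's motion complexity. The candidate resolutions, the reward shape, and all names below are illustrative assumptions, not DAViD's actual implementation, which scores candidates with a pretrained teacher network on the downstream task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate temporal resolutions (frames per synthetic video); values are assumed.
RESOLUTIONS = np.array([1, 2, 4, 8, 16])

def teacher_reward(semantic_dynamics, n_frames):
    """Stand-in for the teacher-in-the-loop reward. Downstream utility is imagined
    to saturate once n_frames covers the class's motion complexity, minus a small
    per-frame cost; the paper instead uses a pretrained teacher as the scorer."""
    needed = max(1, round(semantic_dynamics * 16))
    coverage = min(n_frames / needed, 1.0)
    return coverage - 0.03 * n_frames

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def train_policy(semantic_dynamics, steps=3000, lr=0.5):
    """REINFORCE with an EMA baseline over a categorical policy on RESOLUTIONS."""
    logits = np.zeros(len(RESOLUTIONS))
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(RESOLUTIONS), p=probs)
        reward = teacher_reward(semantic_dynamics, RESOLUTIONS[action])
        baseline = 0.9 * baseline + 0.1 * reward   # variance-reducing baseline
        grad_log_p = -probs
        grad_log_p[action] += 1.0                  # gradient of log p(action)
        logits += lr * (reward - baseline) * grad_log_p
    return int(RESOLUTIONS[np.argmax(logits)])

# Low-dynamics classes should be assigned fewer frames than high-dynamics ones.
static_frames = train_policy(semantic_dynamics=0.1)
dynamic_frames = train_policy(semantic_dynamics=0.9)
print(static_frames, dynamic_frames)
```

The EMA baseline keeps the policy-gradient updates low-variance; in the paper, the agent's decisions are coupled to the synthetic-video optimization rather than to a closed-form reward as here.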
Yinjie Zhao
PhD Student in AI at EEE, NTU, Singapore
Deep Learning, Multimodal, AI
Heng Zhao
The Rockefeller University
Image Restoration, Deep Learning, Inverse Problems
Bihan Wen
Associate Professor, Nanyang Technological University
Machine Learning, Image Processing, Computational Imaging, Computer Vision, Trustworthy AI
Y. Ong
CFAR, Agency for Science, Technology and Research (A*STAR), Singapore; CCDS, Nanyang Technological University, Singapore
J. Zhou
CFAR, Agency for Science, Technology and Research (A*STAR), Singapore