Make Your Training Flexible: Towards Deployment-Efficient Video Models

📅 2025-03-18
🤖 AI Summary
Existing video models employ fixed spatiotemporal sampling, leading to suboptimal accuracy–computation trade-offs and poor adaptability to varying inference budgets. To address this, we propose the Token Optimization inference paradigm and Flux—a plug-and-play, flexible sampling enhancement framework—featuring dynamic spatiotemporal sampling, token-level importance modeling via contrastive learning, and a ViT-compatible large-scale pretraining pipeline. This unifies pretraining generality with inference budget controllability. Our FluxViT achieves new state-of-the-art (SOTA) performance across multiple video understanding benchmarks under standard computational cost; remarkably, it matches prior SOTA using only 25% of input tokens, reducing computation by nearly 90%. The core innovations are: (i) the first introduction of token-importance-driven dynamic sampling into video modeling, and (ii) a novel deployment-oriented evaluation paradigm emphasizing inference adaptability.

📝 Abstract
Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
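The abstract's Token Optimization setting boils down to selecting a size-limited set of input tokens by importance under a compute budget. A minimal sketch of that idea as plain top-k selection follows; the function name, the toy scores, and the scoring scheme are all illustrative assumptions, not the paper's actual selection mechanism (which learns token importance during training).

```python
def token_optimization(tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens, preserving their
    original spatiotemporal order (hypothetical sketch only)."""
    # Rank token indices by importance score, keep the top `budget`
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:budget]
    # Re-sort kept indices so selected tokens stay in input order
    return [tokens[i] for i in sorted(top)]

# Toy example: 8 "tokens" (labels standing in for patch embeddings)
tokens = [f"tok{i}" for i in range(8)]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4]
print(token_optimization(tokens, scores, budget=3))  # ['tok1', 'tok3', 'tok5']
```

In the paper's setting the budget would correspond to the token count a given inference budget allows, and the candidate pool comes from a denser, flexibly sampled spatiotemporal grid rather than a fixed one.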
Problem

Research questions and friction points this paper is trying to address.

Optimize video model training for better accuracy-computation trade-offs.
Enhance adaptability to varying computational budgets in downstream tasks.
Make flexible sampling and token selection easy to adopt in existing video training frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token Optimization maximizes input information efficiency under a fixed token budget.
The Flux augmentation tool enables flexible video sampling grids at nearly no extra training cost.
FluxViT achieves state-of-the-art results with a fraction of the input tokens.
Authors

Chenting Wang — Shanghai Jiao Tong University (Computer Vision, Video Understanding)
Kunchang Li — ByteDance Seed (Video Understanding, Multimodal Learning)
Tianxiang Jiang — Shanghai AI Laboratory, University of Science and Technology of China
Xiangyun Zeng — Shanghai AI Laboratory
Yi Wang — Shanghai AI Laboratory
Limin Wang — Shanghai AI Laboratory, State Key Laboratory for Novel Software Technology, Nanjing University