AdaVid: Adaptive Video-Language Pretraining

📅 2025-04-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of deploying video-language pretraining models on edge devices, namely high computational overhead and strict frame-count limitations (typically ≤64 frames). Methodologically, it proposes an adaptive video encoding framework featuring: (i) a Matryoshka-inspired adaptive Transformer block that enables dynamic hidden-dimension scaling at inference time; (ii) a lightweight hierarchical network for efficient long-video modeling; and (iii) joint video-text contrastive learning with hierarchical feature aggregation, trained end-to-end on Ego4D. The key contribution is a computationally scalable joint video-language model tailored to resource-constrained edge scenarios. Experiments show that the method matches EgoVLP's performance on short-video benchmarks using only half the compute, and surpasses EgoVLP under an equal compute budget. It also enables more favorable trade-offs between frame count and accuracy on Diving48 and on long-video benchmarks.

📝 Abstract
Contrastive video-language pretraining has demonstrated great success in learning rich and robust video representations. However, deploying such video encoders on compute-constrained edge devices remains challenging due to their high computational demands. Additionally, existing models are typically trained to process only short video clips, often limited to 4 to 64 frames. In this paper, we introduce AdaVid, a flexible architectural framework designed to learn efficient video encoders that can dynamically adapt their computational footprint based on available resources. At the heart of AdaVid is an adaptive transformer block, inspired by Matryoshka Representation Learning, which allows the model to adjust its hidden embedding dimension at inference time. We show that AdaVid-EgoVLP, trained on video-narration pairs from the large-scale Ego4D dataset, matches the performance of the standard EgoVLP on short video-language benchmarks using only half the compute, and even outperforms EgoVLP when given equal computational resources. We further explore the trade-off between frame count and compute on the challenging Diving48 classification benchmark, showing that AdaVid enables the use of more frames without exceeding computational limits. To handle longer videos, we also propose a lightweight hierarchical network that aggregates short clip features, achieving a strong balance between compute efficiency and accuracy across several long video benchmarks.
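The core idea of the adaptive transformer block is that, as in Matryoshka Representation Learning, the first d hidden dimensions form a usable smaller model nested inside the full one, so compute can be cut at inference by slicing weights. The sketch below is a minimal illustration of that slicing property on a single linear projection, not the paper's actual architecture; all names (`adaptive_linear`, the dimensions) are hypothetical.

```python
import numpy as np

def adaptive_linear(x, W, b, d):
    """Apply a linear layer using only the first d output dimensions.

    Slicing the columns of the full weight matrix W gives a cheaper
    projection whose output is a prefix of the full-width output,
    which is the nesting property Matryoshka-style training exploits.
    """
    return x @ W[:, :d] + b[:d]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))   # batch of 2 tokens, input dim 8
W = rng.normal(size=(8, 16))  # full hidden width D_max = 16
b = rng.normal(size=16)

full = adaptive_linear(x, W, b, 16)
half = adaptive_linear(x, W, b, 8)  # roughly half the compute

# Nesting property: the truncated output is a prefix of the full one.
assert np.allclose(half, full[:, :8])
```

In AdaVid this kind of width adaptation is applied inside transformer blocks, and the training objective ensures the truncated representations remain useful on their own.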
Problem

Research questions and friction points this paper is trying to address.

Adapt video encoders for compute-constrained edge devices
Handle longer videos beyond short clip limitations
Balance compute efficiency and accuracy dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive transformer block for dynamic computation
Hierarchical network for efficient long video processing
Matryoshka-inspired embedding dimension adjustment
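The long-video path can be pictured as a two-level scheme: a short-clip encoder runs over fixed-length windows, and a lightweight aggregator pools the resulting clip features. The sketch below is an assumption-laden toy version, with the clip encoder stubbed as a mean over frame features and the aggregator as softmax-weighted pooling; the paper's hierarchical network is not specified here, and `encode_clip` and `hierarchical_encode` are hypothetical names.

```python
import numpy as np

def encode_clip(frames):
    # Stand-in for a short-clip video encoder: mean over frame features.
    return frames.mean(axis=0)

def hierarchical_encode(video, clip_len=8):
    """Encode a long video by splitting it into short clips, encoding
    each clip, then aggregating clip features with a cheap
    softmax-weighted pooling step (toy aggregator)."""
    clips = [video[i:i + clip_len] for i in range(0, len(video), clip_len)]
    feats = np.stack([encode_clip(c) for c in clips])  # (n_clips, d)
    scores = feats @ feats.mean(axis=0)                # relevance to mean clip
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ feats                             # single (d,) video feature

rng = np.random.default_rng(0)
video = rng.normal(size=(64, 32))  # 64 frames of 32-dim frame features
feat = hierarchical_encode(video)
assert feat.shape == (32,)
```

The design point this illustrates is that only the cheap aggregator ever sees the full video length, so compute grows with the number of clips rather than with dense frame-level attention over all frames.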