Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

📅 2024-07-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the prohibitively high parameter cost of fully fine-tuning large vision-language models for first-person video understanding, this paper proposes Ego-VPA, a parameter-efficient adaptation framework built on egocentric video foundation models (Ego-VFMs). Its core idea is a set of basis prompts shared across frames and modalities: each video frame or text feature is locally approximated as a sparse combination of these bases, and the selected bases are used to synthesize video and text prompts, modeling cross-frame context fusion and cross-modal transfer in one mechanism. By optimizing only 0.84% of the model's parameters, Ego-VPA reaches the performance of full fine-tuning across multiple egocentric video benchmarks, including action recognition, moment localization, and video question answering, while substantially outperforming existing lightweight adapters (e.g., LoRA, prefix tuning). This makes Ego-VPA a practical route for transferring video-language models under tight parameter budgets.

📝 Abstract
Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.
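The abstract's basis-prompt mechanism can be illustrated with a minimal sketch. The following is not the paper's implementation; it assumes a hypothetical top-k similarity selection rule and softmax weighting over the selected bases, and made-up dimensions, purely to show how a shared basis can locally and sparsely approximate per-frame (or per-token) features into synthesized prompts:

```python
import numpy as np

def synthesize_prompts(features, basis, k=2):
    """Local sparse approximation (illustrative sketch, not the paper's code):
    each frame/text feature selects its top-k most similar basis prompts and
    a prompt is synthesized as their similarity-weighted combination."""
    # features: (T, D) frame or text token features
    # basis:    (M, D) basis prompts, shared across frames and modalities
    sim = features @ basis.T                      # (T, M) similarity scores
    topk = np.argsort(-sim, axis=1)[:, :k]        # top-k basis indices per feature
    prompts = np.zeros_like(features)
    for t in range(features.shape[0]):
        idx = topk[t]
        w = sim[t, idx]
        w = np.exp(w - w.max()); w /= w.sum()     # softmax over the selected bases
        prompts[t] = w @ basis[idx]               # sparse weighted combination
    return prompts

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))   # 8 frames, 16-dim features (toy sizes)
basis  = rng.normal(size=(4, 16))   # 4 shared basis prompts
p = synthesize_prompts(frames, basis, k=2)
print(p.shape)  # (8, 16): one synthesized prompt per frame
```

Because the same small basis serves every frame and both modalities, the learnable state is the basis itself rather than per-layer adapters, which is what keeps the parameter budget tiny.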
Problem

Research questions and friction points this paper is trying to address.

Adapting large video-language backbones to new egocentric domains typically requires costly full fine-tuning.
How to sharply reduce learnable parameters without sacrificing accuracy.
How to fuse temporal context and transfer across modalities efficiently.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameter-efficient adaptation
Local sparse approximation
Cross-modal transfer