Learning to Generalize without Bias for Open-Vocabulary Action Recognition

📅 2025-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In open-vocabulary action recognition, CLIP's static bias causes models to over-rely on frame-level static features, which severely limits generalization, especially to out-of-context novel actions. To address this, we propose Open-MeDe, a framework that integrates meta-optimization with static bias mitigation. We design a cross-batch virtual evaluation strategy that provides rapid, label-free generalization guidance, and we introduce trajectory self-ensembling optimization, which allows regularization-free training initialized from CLIP and improves parameter robustness. Extensive experiments demonstrate that Open-MeDe significantly outperforms state-of-the-art methods in both in-context and out-of-context settings. Notably, it achieves substantial gains in zero-shot action recognition accuracy, validating its effectiveness and generalization capability for open-ended, dynamic action understanding.

📝 Abstract

Leveraging the effective visual-text alignment and static generalizability of CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for in-context generalization in open-vocabulary action recognition. However, due to the static bias of CLIP, such video learners tend to overfit to shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalization and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering optimization toward a smoother landscape. In effect, the absence of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensembling over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.
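
The cross-batch idea described in the abstract (take a virtual step on the current batch, then evaluate that step on the subsequent batch) can be illustrated with a toy example. The sketch below is not the paper's update rule: it assumes a first-order, MAML-style approximation on a one-parameter regression model, and the names `cross_batch_meta_step`, `inner_lr`, and `outer_lr` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A batch is a list of (x, y) pairs for the toy model y_hat = w * x.
Batch = Sequence[Tuple[float, float]]


def mse_grad(w: float, batch: Batch) -> float:
    """Gradient of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def cross_batch_meta_step(
    w: float,
    train_batch: Batch,
    next_batch: Batch,
    inner_lr: float = 0.05,  # hypothetical step size for the virtual update
    outer_lr: float = 0.05,  # hypothetical step size for the real update
) -> float:
    """One first-order meta step (an assumed simplification)."""
    # Virtual training step on the current batch; w itself is not changed yet.
    w_virtual = w - inner_lr * mse_grad(w, train_batch)
    # Virtual evaluation: how well does the virtual step transfer to the
    # subsequent batch, which stands in for unseen data?
    eval_grad = mse_grad(w_virtual, next_batch)
    # Real update driven by the evaluation gradient (first-order: the
    # dependence of w_virtual on w is ignored).
    return w - outer_lr * eval_grad


def run_meta_training(batches: List[Batch], steps: int, w0: float = 0.0) -> float:
    """Cycle through batches, pairing each with its successor."""
    w = w0
    for step in range(steps):
        train_batch = batches[step % len(batches)]
        next_batch = batches[(step + 1) % len(batches)]
        w = cross_batch_meta_step(w, train_batch, next_batch)
    return w


if __name__ == "__main__":
    # Both toy batches are generated from y = 3 * x.
    toy_batches = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]
    print(run_meta_training(toy_batches, steps=200))
```

In a real video learner the scalar `w` would be the model parameters and the loss would be a video-text matching objective; only the train-then-evaluate-on-the-next-batch structure is meant to carry over.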

Problem

Research questions and friction points this paper is trying to address.

Static bias in CLIP

Overfitting on shortcut features

Generalization to novel actions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Meta-optimization framework with static debiasing

Cross-batch meta-optimization scheme

Self-ensemble over optimization trajectory
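
As a rough illustration of self-ensembling over an optimization trajectory, the sketch below keeps a running average of the parameter vectors visited during training. Uniform weighting is an assumption made here for simplicity; the source does not state how the paper weights trajectory points, and the names `update_ensemble` and `self_ensemble` are hypothetical.

```python
from typing import List, Sequence


def update_ensemble(
    ensemble: Sequence[float], params: Sequence[float], num_averaged: int
) -> List[float]:
    """Fold one more parameter vector into a running uniform average.

    num_averaged is the number of vectors already included in ensemble.
    """
    return [e + (p - e) / (num_averaged + 1) for e, p in zip(ensemble, params)]


def self_ensemble(trajectory: Sequence[Sequence[float]]) -> List[float]:
    """Average all parameter vectors along a (non-empty) trajectory."""
    ensemble = list(trajectory[0])
    for count, params in enumerate(trajectory[1:], start=1):
        ensemble = update_ensemble(ensemble, params, count)
    return ensemble


if __name__ == "__main__":
    # Three hypothetical checkpoints of a two-parameter model.
    checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(self_ensemble(checkpoints))
```

The incremental form avoids storing every checkpoint: only the current average and a counter are needed, which matters when the parameters are those of a CLIP-scale model.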
