Can Masking Background and Object Reduce Static Bias for Zero-Shot Action Recognition?

📅 2025-01-22

🏛️ Conference on Multimedia Modeling

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In zero-shot action recognition, models are prone to static biases, such as background and object cues, that distort action semantic modeling. To address this, we propose a dual-mask intervention mechanism: for the first time, learnable foreground and background mask modules explicitly decouple scene-redundant information from action semantics within a CLIP-based transfer framework, enabling unbiased vision–language alignment. We also introduce contrastive action semantic distillation to make action representations more discriminative. On UCF101 and HMDB51, our method improves zero-shot accuracy by +5.2% and +4.8%, respectively, while significantly reducing background confusion. This work offers an interpretable and scalable approach to mitigating static biases in action recognition, improving both the robustness and the generalizability of zero-shot models.
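
To make the two ideas in the summary concrete, here is a minimal sketch of (a) hiding either the background or the foreground of a frame with a mask and (b) zero-shot classification by cosine similarity between a video embedding and class-name text embeddings. It is not the authors' implementation. The masks here are fixed binary grids rather than learnable modules, the embeddings are hand-written stand-ins for CLIP outputs, and the names `apply_mask`, `cosine`, and `zero_shot_predict` are hypothetical.

```python
import math


def apply_mask(frame, mask, keep_foreground=True, fill=0.0):
    """Blank out part of a 2D frame using a binary mask.

    Assumption: mask[i][j] == 1 marks foreground (actor/object) pixels
    and 0 marks background pixels. With keep_foreground=True the
    background is replaced by `fill`; with False the foreground is.
    """
    keep = 1 if keep_foreground else 0
    return [
        [px if m == keep else fill for px, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_predict(video_emb, text_embs):
    """Return the label whose text embedding is closest to video_emb."""
    return max(text_embs, key=lambda label: cosine(video_emb, text_embs[label]))


# Toy 2x2 frame: the right column is treated as foreground.
frame = [[0.9, 0.8], [0.1, 0.7]]
mask = [[0, 1], [0, 1]]

background_hidden = apply_mask(frame, mask)
foreground_hidden = apply_mask(frame, mask, keep_foreground=False)
print(background_hidden)  # [[0.0, 0.8], [0.0, 0.7]]
print(foreground_hidden)  # [[0.9, 0.0], [0.1, 0.0]]

# Hand-written stand-ins for text and video embeddings.
text_embs = {"climbing": [1.0, 0.0], "swimming": [0.0, 1.0]}
print(zero_shot_predict([0.9, 0.2], text_embs))  # climbing
```

In a real pipeline the masked frames would be passed through a video encoder before the similarity step. Comparing predictions on the original, background-hidden, and foreground-hidden inputs is one simple way to see how much a model leans on static cues.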