🤖 AI Summary
Zero-shot action recognition models are prone to static biases, such as background and object cues, that distort how action semantics are modeled. To address this, we propose a dual-mask intervention mechanism: learnable foreground and background mask modules that, for the first time within a CLIP-based transfer framework, explicitly decouple scene-redundant information from action semantics, enabling unbiased vision–language alignment. We further introduce contrastive action semantic distillation to make action representations more discriminative. On UCF101 and HMDB51, our method improves zero-shot accuracy by 5.2% and 4.8%, respectively, while significantly reducing background confusion. This work establishes an interpretable and scalable paradigm for mitigating static biases in action recognition, improving both the robustness and the generalizability of zero-shot models.
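
The summary does not say how the masks are computed or applied, so the following is only an illustrative sketch of the general idea of separating foreground from background features before matching against class-text embeddings. It assumes complementary soft masks obtained from a sigmoid over per-patch logits, and it uses random arrays in place of real CLIP patch and text features. The names `dual_mask_pool`, `zero_shot_scores`, and `mask_logits` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products act as cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def dual_mask_pool(patch_feats, mask_logits, eps=1e-8):
    """Pool patch features into foreground and background embeddings.

    patch_feats: (N, D) patch features (stand-ins for CLIP visual tokens).
    mask_logits: (N,) per-patch scores; in a real model these would come
                 from a learnable mask module (assumption, not the paper's design).
    """
    fg_mask = sigmoid(mask_logits)      # soft foreground weight per patch
    bg_mask = 1.0 - fg_mask             # complementary background weight
    fg = (fg_mask[:, None] * patch_feats).sum(axis=0) / (fg_mask.sum() + eps)
    bg = (bg_mask[:, None] * patch_feats).sum(axis=0) / (bg_mask.sum() + eps)
    return l2_normalize(fg), l2_normalize(bg)

def zero_shot_scores(video_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one video and K class texts."""
    logits = l2_normalize(text_embs) @ video_emb / temperature
    logits = logits - logits.max()      # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy data: 16 patches, 8-dim features, 5 candidate action classes.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 8))
mask_logits = rng.normal(size=16)
text_embs = rng.normal(size=(5, 8))

fg, bg = dual_mask_pool(patch_feats, mask_logits)
probs = zero_shot_scores(fg, text_embs)  # classify from foreground features only
print(probs.argmax(), probs.shape)
```

In this sketch only the foreground embedding is used for classification. The background embedding is returned so that a training objective could penalize its similarity to the action texts, which is one plausible reading of "intervention" but is not confirmed by the summary.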