🤖 AI Summary
Action recognition models often suffer from “static bias,” wherein they over-rely on static scene cues (e.g., background, objects), leading to poor generalization—especially in zero-shot settings. To address this, we propose a dual-stream disentanglement framework that explicitly separates static (biased) and dynamic (unbiased) representations. We enforce statistical independence between the two streams via an independence loss and further constrain the static stream to encode only scene information using a scene prediction loss, thereby suppressing its interference with action classification. The method requires no additional annotations and is plug-and-play compatible with mainstream architectures. Experiments across multiple benchmarks demonstrate substantial mitigation of static bias: on zero-shot action recognition, our approach achieves an average accuracy improvement of 8.2%. Moreover, it enhances robustness in real-world scenarios and improves model interpretability.
📝 Abstract
Action recognition models rely excessively on static cues rather than dynamic human motion, which is known as static bias. This bias leads to poor performance in real-world applications and zero-shot action recognition. In this paper, we propose a method to reduce static bias by separating temporal dynamic information from static scene information. Our approach uses a statistical independence loss between biased and unbiased streams, combined with a scene prediction loss. Our experiments demonstrate that this method effectively reduces static bias and confirm the importance of scene prediction loss.