🤖 AI Summary
This work addresses the challenge that existing data augmentation methods for video action recognition often introduce uncontrolled perturbations, which disrupt intra-class structure and induce representation drift. To mitigate this, the authors propose ReMA, a plug-and-play, training-free hybrid augmentation strategy that enhances representational diversity through controlled replacement while preserving class-conditional stability. ReMA integrates a Representation Alignment Mechanism (RAM) for structured intra-class mixing with a Dynamic Selection Mechanism (DSM) that generates motion-aware spatiotemporal masks to localize perturbation regions precisely. Extensive experiments demonstrate that ReMA consistently improves generalization and robustness across multiple video action recognition benchmarks and across varying spatiotemporal granularities.
📝 Abstract
Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for video remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, ultimately weakening intra-class distributional structure and inducing representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process, expanding representational diversity while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Second, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks that localize perturbations, steering them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
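The abstract describes RAM and DSM only at a high level, so the paper's exact formulation is not reproduced here. The sketch below shows one plausible reading in PyTorch, assuming RAM amounts to blending pairs of same-class clips with a bounded ratio (so the label and intra-class distribution are preserved), and DSM amounts to a frame-difference motion mask that confines replacement to low-motion regions. All names and parameters (`rema_augment`, `motion_mask`, `low_motion_quantile`, `lam`) are hypothetical illustrations, not the authors' API.

```python
import torch

def motion_mask(clip: torch.Tensor, low_motion_quantile: float = 0.5) -> torch.Tensor:
    """Hypothetical motion-aware mask: mark low-motion spatiotemporal
    positions (small frame-to-frame differences) as replaceable, so
    perturbations stay away from motion-salient, discrimination-sensitive
    regions. clip: (T, C, H, W). Returns a binary mask of shape (T, 1, H, W)."""
    # Per-pixel motion energy from temporal frame differences.
    diff = (clip[1:] - clip[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    energy = torch.cat([diff, diff[-1:]], dim=0)                   # pad back to T frames
    thresh = torch.quantile(energy, low_motion_quantile)
    return (energy <= thresh).float()                              # 1 = low motion, safe to replace

def rema_augment(clip_a: torch.Tensor, clip_b: torch.Tensor, lam: float = 0.8) -> torch.Tensor:
    """Hypothetical ReMA-style step: within low-motion regions of clip_a,
    blend in the same-class clip_b with ratio lam. Because both clips
    share a label, the label is left unchanged, unlike cross-class mixup."""
    m = motion_mask(clip_a)  # where replacement is allowed
    return clip_a * (1 - m) + m * (lam * clip_a + (1 - lam) * clip_b)

# Usage: two clips drawn from the same class; output keeps clip_a's label.
a = torch.rand(16, 3, 112, 112)
b = torch.rand(16, 3, 112, 112)
aug = rema_augment(a, b, lam=0.8)  # same shape and label as a
```

In this reading, the mask plays the role of DSM (controlling *where* mixing happens) and the same-class pairing with a bounded blend ratio plays the role of RAM (controlling *how* it happens); since no labels change and nothing is learned, the step stays training-free and plug-and-play, consistent with the abstract's claims.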