ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing data augmentation methods for video action recognition often introduce uncontrolled perturbations, which disrupt intra-class structure and induce representation drift. To mitigate this, the authors propose ReMA, a plug-and-play, training-free hybrid augmentation strategy that enhances representational diversity through controlled replacement while preserving class-conditional stability. ReMA integrates a Representation Alignment Mechanism (RAM) for structured intra-class mixing and a Dynamic Selection Mechanism (DSM) that generates motion-aware spatiotemporal masks to precisely localize perturbation regions. Extensive experiments demonstrate that ReMA improves model generalization and robustness across multiple video action recognition benchmarks, with consistent effectiveness across varying spatiotemporal granularities.

📝 Abstract
Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, ultimately weakening intra-class distributional structure and inducing representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Second, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
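The abstract's core idea, mixing as controlled replacement in low-motion regions between same-class clips, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the frame-differencing mask, the threshold `thresh`, and the mixing ratio `lam` are all assumptions standing in for the paper's RAM/DSM formulations, which are not detailed in this listing.

```python
import numpy as np

def motion_mask(video, thresh=0.1):
    """Hypothetical motion-aware mask via simple frame differencing.

    video: float array of shape (T, H, W, C) with values in [0, 1].
    Returns a (T, H, W, 1) mask where 1 marks low-motion regions,
    i.e. regions eligible for replacement.
    """
    diff = np.abs(np.diff(video, axis=0)).mean(axis=-1)  # (T-1, H, W)
    motion = np.concatenate([diff[:1], diff], axis=0)    # pad back to T frames
    return (motion < thresh).astype(video.dtype)[..., None]

def intra_class_mix(video_a, video_b, lam=0.7, thresh=0.1):
    """Replace low-motion regions of video_a with a blend of a
    same-class clip video_b, leaving motion-salient regions untouched."""
    mask = motion_mask(video_a, thresh)  # perturb only where motion is low
    blended = lam * video_a + (1.0 - lam) * video_b
    return video_a * (1.0 - mask) + blended * mask
```

Because the mask gates the blend, discrimination-sensitive (high-motion) regions pass through unchanged, which loosely mirrors the "where" control the abstract attributes to DSM; the same-class pairing mirrors the "how" control attributed to RAM.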
Problem

Research questions and friction points this paper is trying to address.

video behavior recognition
data augmentation
representation drift
spatiotemporal variations
intra-class distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation-aware Mixing Augmentation
plug-and-play augmentation
representation alignment
dynamic selection mechanism
video behavior recognition
Feng-Qi Cui
University of Science and Technology of China
Multimedia, Trustworthy AI, LLM, AI4S
Jinyang Huang
Hefei University of Technology, Hefei, China
Sirui Zhao
University of Science and Technology of China
Affective Computing, MLLM, HCI
Jinglong Guo
Hefei University of Technology, Hefei, China
Qifan Cai
Hefei University of Technology, Hefei, China; Hefei Xiaosheng Intelligent Technology Co., Ltd., Hefei, China
Xin Yan
Missouri University of S&T, Google
Zhi Liu
The University of Electro-Communications, Tokyo, Japan