Fast Adversarial Training with Weak-to-Strong Spatial-Temporal Consistency in the Frequency Domain on Videos

📅 2025-04-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Video recognition faces two key challenges in adversarial training: existing video adversarial training methods are computationally costly, and simultaneously achieving high clean accuracy and strong adversarial robustness remains difficult. To address these, the paper proposes VFAT-WS, the first fast adversarial training framework for video data. Its core contributions are: (1) a temporal frequency augmentation (TF-AUG) and its spatial-temporal enhanced form (STF-AUG), combined with a single-step PGD attack to boost training efficiency and robustness; and (2) a weak-to-strong spatial-temporal consistency regularization that steers learning from the simpler TF-AUG to the more complex STF-AUG. Evaluated on UCF-101 and HMDB-51 with both CNN- and Transformer-based models, VFAT-WS markedly improves adversarial and corruption robustness while accelerating training by nearly 4.9× and achieving a better trade-off with clean accuracy.
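The summary names a temporal frequency augmentation (TF-AUG) but does not spell out its recipe. A minimal sketch of the general idea, assuming a plausible scheme (FFT along the temporal axis, random rescaling of the amplitude spectrum, inverse FFT) rather than the paper's exact TF-AUG:

```python
import numpy as np

def tf_aug(clip, strength=0.1, seed=None):
    """Illustrative temporal-frequency augmentation (NOT the paper's
    exact TF-AUG; the perturbation scheme here is a hypothetical choice).

    clip: video array of shape (T, H, W, C) with values in [0, 1].
    """
    rng = np.random.default_rng(seed)
    # Real FFT along the temporal axis: one spectrum per pixel/channel.
    spec = np.fft.rfft(clip, axis=0)
    amp, phase = np.abs(spec), np.angle(spec)
    # Randomly rescale each temporal frequency bin's amplitude.
    scale = 1.0 + strength * rng.uniform(-1, 1, size=(spec.shape[0], 1, 1, 1))
    # Back to the time domain, then clamp to the valid pixel range.
    aug = np.fft.irfft(amp * scale * np.exp(1j * phase), n=clip.shape[0], axis=0)
    return np.clip(aug, 0.0, 1.0)

clip = np.random.default_rng(0).uniform(size=(8, 4, 4, 3))  # 8-frame toy clip
out = tf_aug(clip, strength=0.1, seed=1)
print(out.shape)  # (8, 4, 4, 3)
```

A spatial-temporal enhanced form in the spirit of STF-AUG would additionally compose this with spatial augmentations of the frames; the exact combination used in the paper is not given here.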

📝 Abstract
Adversarial Training (AT) has been shown to significantly enhance adversarial robustness via a min-max optimization approach. However, its effectiveness in video recognition tasks is hampered by two main challenges. First, fast adversarial training for video models remains largely unexplored, which severely impedes its practical applications. Specifically, most video adversarial training methods are computationally costly, with long training times and high expenses. Second, existing methods struggle with the trade-off between clean accuracy and adversarial robustness. To address these challenges, we introduce Video Fast Adversarial Training with Weak-to-Strong consistency (VFAT-WS), the first fast adversarial training method for video data. Specifically, VFAT-WS incorporates the following key designs: First, it integrates a straightforward yet effective temporal frequency augmentation (TF-AUG), and its spatial-temporal enhanced form STF-AUG, along with a single-step PGD attack to boost training efficiency and robustness. Second, it devises a weak-to-strong spatial-temporal consistency regularization, which seamlessly integrates the simpler TF-AUG and the more complex STF-AUG. Leveraging the consistency regularization, it steers the learning process from simple to complex augmentations. Both of them work together to achieve a better trade-off between clean accuracy and robustness. Extensive experiments on UCF-101 and HMDB-51 with both CNN and Transformer-based models demonstrate that VFAT-WS achieves great improvements in adversarial robustness and corruption robustness, while accelerating training by nearly 490%.
Problem

Research questions and friction points this paper is trying to address.

Fast adversarial training for video models is unexplored and costly
Existing methods struggle with clean accuracy vs robustness trade-off
Propose VFAT-WS to improve efficiency and robustness in video recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-step PGD attack for efficiency
Weak-to-strong spatial-temporal consistency regularization
Temporal frequency augmentation (TF-AUG) for robustness
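The weak-to-strong consistency regularization couples the two augmented views, but the page does not give its exact form. A common instantiation, shown here as an assumption, is a KL divergence that treats the weakly augmented (TF-AUG) prediction as the target for the strongly augmented (STF-AUG) view:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weak_to_strong_consistency(logits_weak, logits_strong):
    """KL(p_weak || p_strong): the weak-view prediction serves as the
    target for the strong view. This is a common choice for consistency
    regularization; the paper's exact regularizer may differ.
    """
    p_w = softmax(logits_weak)
    p_s = softmax(logits_strong)
    return float(np.sum(p_w * (np.log(p_w) - np.log(p_s))))

# Identical predictions incur no penalty; diverging ones are pulled together.
print(weak_to_strong_consistency(np.array([2.0, 0.5, -1.0]),
                                 np.array([2.0, 0.5, -1.0])))  # 0.0
```

In a training loop this term would typically be added to the adversarial cross-entropy loss with a weighting coefficient, so the gradient steers the harder STF-AUG view toward the easier TF-AUG view, i.e. from simple to complex augmentations.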
Authors

Songping Wang
School of Intelligence Science and Technology, Nanjing University, Suzhou 215163, China
Hanqing Liu
School of Software, Beihang University, Beijing 100098, China
Yueming Lyu
School of Intelligence Science and Technology, Nanjing University, Suzhou 215163, China
Xiantao Hu
Nanjing University of Science & Technology (Computer Vision)
Ziwen He
Nanjing University of Information Science and Technology
Wei Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Center for Research on Intelligent Perception and Computing (CRIPAC), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China
Caifeng Shan
Philips Research (Computer Vision, Pattern Recognition, Machine Learning, Image/Video Analysis)
Liang Wang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Center for Research on Intelligent Perception and Computing (CRIPAC), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China