Robustness-Congruent Adversarial Training for Secure Machine Learning Model Updates

📅 2024-02-27
🏛️ IEEE Transactions on Pattern Analysis and Machine Intelligence
📈 Citations: 2
Influential: 0
🤖 AI Summary
When updating machine learning models, newly trained models often improve average accuracy and robustness but may suffer from "robustness negative flips": a loss of adversarial robustness on specific samples that degrades the perceived security of the system. Method: This paper proposes the first non-regression adversarial training paradigm explicitly designed for robustness consistency. It introduces a robustness-preserving constraint during fine-tuning, yielding a theoretically grounded adversarial training framework that discourages the updated model's robustness from falling below the original model's on samples that were previously robust. The approach integrates robustness-constrained optimization, fine-grained robustness evaluation, and gradient-guided adversarial sample selection. Contribution/Results: Evaluated on standard computer vision benchmarks, the method reduces the robustness negative flip rate by 38% on average while simultaneously improving both clean accuracy and overall robustness, outperforming standard adversarial training and transfer-alignment baselines.

📝 Abstract
Machine-learning models demand periodic updates to improve their average accuracy, exploiting novel architectures and additional data. However, a newly updated model may commit mistakes the previous model did not make. Such misclassifications are referred to as negative flips, experienced by users as a regression of performance. In this work, we show that this problem also affects robustness to adversarial examples, hindering the development of secure model update practices. In particular, when updating a model to improve its adversarial robustness, previously ineffective adversarial attacks on some inputs may become successful, causing a regression in the perceived security of the system. We propose a novel technique, named robustness-congruent adversarial training, to address this issue. It amounts to fine-tuning a model with adversarial training, while constraining it to retain higher robustness on the samples for which no adversarial example was found before the update. We show that our algorithm and, more generally, learning with non-regression constraints, provides a theoretically-grounded framework to train consistent estimators. Our experiments on robust models for computer vision confirm that both accuracy and robustness, even if improved after model update, can be affected by negative flips, and our robustness-congruent adversarial training can mitigate the problem, outperforming competing baseline methods.
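The constrained fine-tuning described above can be illustrated with a minimal sketch of the objective: a standard adversarial-training term plus a non-regression penalty that fires only on samples where the previous model resisted the attack but the updated model's adversarial loss has grown. The function and parameter names (`rcat_objective`, `beta`) are hypothetical, not the paper's notation; this is an assumed simplification of the idea, not the authors' implementation.

```python
import numpy as np

def rcat_objective(adv_loss_new, adv_loss_old, old_robust_mask, beta=1.0):
    """Sketch of a robustness-congruent training objective (illustrative).

    adv_loss_new    : per-sample adversarial loss of the updated model
    adv_loss_old    : per-sample adversarial loss of the previous model
    old_robust_mask : 1.0 where the previous model resisted the attack, else 0.0
    beta            : weight of the non-regression penalty (hypothetical name)
    """
    # Standard adversarial-training term: average adversarial loss.
    base = adv_loss_new.mean()
    # Non-regression term: penalize only samples that were robust before
    # the update and whose adversarial loss now exceeds the old model's.
    regression = np.maximum(adv_loss_new - adv_loss_old, 0.0)
    penalty = (regression * old_robust_mask).mean()
    return base + beta * penalty

# Example: sample 0 was robust before the update and has regressed;
# sample 1 was not robust before, so it contributes no penalty.
loss = rcat_objective(
    adv_loss_new=np.array([1.0, 0.5]),
    adv_loss_old=np.array([0.4, 0.6]),
    old_robust_mask=np.array([1.0, 0.0]),
    beta=1.0,
)
print(loss)  # 0.75 (base) + 0.3 (penalty) = 1.05
```

Setting `beta=0` recovers plain adversarial fine-tuning, so the penalty weight controls the trade-off between improving average robustness and avoiding robustness negative flips.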
Problem

Research questions and friction points this paper is trying to address.

Preventing negative flips in model updates
Maintaining robustness against adversarial attacks
Ensuring consistent accuracy and security improvements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training with robustness constraints
Prevents regression in adversarial robustness
Provides a theoretical grounding for training consistent estimators