FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Adversarial training enhances model robustness but often induces class-level robustness imbalance, compromising prediction fairness and degrading clean-sample accuracy. To address this, we propose the first fairness-aware targeted adversarial training framework: (1) it introduces targeted PGD attacks into adversarial training to explicitly strengthen robustness for underperforming classes; (2) it incorporates a class-weighted robustness loss and a fairness regularization term to balance inter-class robustness while preserving overall accuracy. Experiments on CIFAR-10, CIFAR-100, and an ImageNet subset show that our method improves worst-class robust accuracy by +12.3%, incurs no clean-accuracy degradation, and generalizes effectively to unseen perturbations and common image degradations, thereby advancing both robustness and fairness in adversarial learning.
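The key ingredient described above is the targeted attack: instead of pushing an input away from its true class (untargeted PGD), the attacker descends toward a chosen target class, which lets training focus perturbations on the classes that lag in robustness. The following is a minimal numpy sketch of targeted PGD on a toy linear softmax classifier; it is not the paper's implementation, and the linear model, step size, and function names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over logits z.
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_pgd(x, W, b, target, eps=0.1, alpha=0.02, steps=10):
    """Targeted PGD sketch on a linear softmax classifier (illustrative only).

    Descends the cross-entropy loss toward class `target`, projecting each
    iterate back into an L-infinity ball of radius `eps` around the clean
    input `x`. For logits z = W x + b, the input gradient of softmax
    cross-entropy is W^T (p - y_target), computed in closed form below.
    """
    x_adv = x.copy()
    y = np.zeros(W.shape[0])
    y[target] = 1.0
    for _ in range(steps):
        p = softmax(W @ x_adv + b)
        grad = W.T @ (p - y)                      # dCE(target)/dx
        x_adv = x_adv - alpha * np.sign(grad)     # step toward target class
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
    return x_adv
```

In the fairness-aware training loop sketched by the summary, `target` would be drawn preferentially from the classes with the lowest current robust accuracy, so the model sees more adversarial examples aimed at its weakest classes.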

📝 Abstract
Deep neural networks are susceptible to adversarial attacks and common corruptions, which undermine their robustness. To enhance model resilience against such challenges, Adversarial Training (AT) has emerged as a prominent solution. Nevertheless, adversarial robustness is often attained at the expense of model fairness during AT, i.e., a disparity in the class-wise robustness of the model. While easily distinguishable classes become more robust to such adversaries, hard-to-detect classes suffer. Recent research has focused on improving model fairness specifically for perturbed images, overlooking accuracy on clean, non-perturbed data, which is the most likely input in practice. Additionally, despite their robustness against the adversaries encountered during training, state-of-the-art adversarially trained models have difficulty maintaining robustness and fairness when confronted with diverse adversarial threats or common corruptions. In this work, we address the above concerns by introducing a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show that using targeted adversarial attacks for adversarial training (instead of untargeted attacks) allows for more favorable trade-offs with respect to adversarial fairness. Empirical results validate the efficacy of our approach.
Problem

Research questions and friction points this paper is trying to address.

Adversarial Attacks
Model Robustness
Fairness in Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

FAIR-TAT
Adversarial Training
Fairness Enhancement