🤖 AI Summary
While transfer learning improves large language models’ (LLMs) performance on standard NLP tasks, its impact on adversarial robustness remains underexplored and poorly understood.
Method: We systematically evaluate multiple architectures—including BERT, RoBERTa, GPT-2, Gemma, and Phi—across the MBIB benchmark suite (covering hate speech, political bias, and gender bias) to assess adversarial robustness under transfer-based fine-tuning.
Contribution/Results: We show that transfer learning, while improving standard task performance, often increases vulnerability to adversarial attacks, challenging the assumption that fine-tuning on downstream tasks preserves robustness. We also uncover a strong positive correlation between model scale and adversarial robustness: scaling up model parameters mitigates the robustness degradation induced by transfer learning, revealing intricate couplings among model size, architecture, and adaptation strategy. Our work establishes a reproducible adversarial evaluation framework and offers design guidance for deploying LLMs that must balance task performance with security-critical robustness.
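To make the evaluation concrete, the sketch below shows the kind of metrics such a framework reports: clean accuracy, accuracy under attack, and attack success rate. This is a minimal illustration, not the authors' code; the function names and toy data are hypothetical.

```python
# Illustrative robustness metrics for an adversarial evaluation.
# Function names and toy data are hypothetical, not from the paper.

def accuracy(preds, labels):
    """Fraction of predictions matching the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def robustness_report(clean_preds, adv_preds, labels):
    """Summarize clean vs. adversarial performance.

    attack_success_rate counts examples that were classified
    correctly on the clean input but flipped by the perturbation.
    """
    clean_acc = accuracy(clean_preds, labels)
    adv_acc = accuracy(adv_preds, labels)
    flipped = sum(
        c == y and a != y
        for c, a, y in zip(clean_preds, adv_preds, labels)
    )
    correct = sum(c == y for c, y in zip(clean_preds, labels))
    return {
        "clean_accuracy": clean_acc,
        "adversarial_accuracy": adv_acc,
        "robustness_drop": clean_acc - adv_acc,
        "attack_success_rate": flipped / correct if correct else 0.0,
    }

# Toy example: 5 predictions before and after an attack.
labels      = [1, 0, 1, 1, 0]
clean_preds = [1, 0, 1, 1, 1]   # 4/5 correct on clean text
adv_preds   = [0, 0, 1, 0, 1]   # two correct predictions flipped

report = robustness_report(clean_preds, adv_preds, labels)
```

A larger robustness drop for a fine-tuned model than for its pretrained counterpart, at matched clean accuracy, is the degradation pattern the paper describes.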
📝 Abstract
We investigate the adversarial robustness of LLMs in transfer learning scenarios. Through comprehensive experiments on multiple datasets (MBIB Hate Speech, MBIB Political Bias, MBIB Gender Bias) and various model architectures (BERT, RoBERTa, GPT-2, Gemma, Phi), we reveal that transfer learning, while improving standard performance metrics, often leads to increased vulnerability to adversarial attacks. Our findings demonstrate that larger models exhibit greater resilience to this phenomenon, suggesting a complex interplay between model size, architecture, and adaptation method. Our work underscores the need to consider adversarial robustness in transfer learning scenarios and provides insights into maintaining model security without compromising performance. These findings have significant implications for the development and deployment of LLMs in real-world applications where both performance and robustness are paramount.