Information Theoretic Adversarial Training of Large Language Models

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: large language models still exhibit harmful behaviors under novel adversarial prompts, and existing adversarial training methods scale poorly because of their high computational cost. The authors propose the WARDEN framework, which takes an information-theoretic perspective and formulates a dual objective based on f-divergence ambiguity sets. The method optimizes the worst-case loss within a divergence ball around the empirical data distribution and dynamically reweights samples so that training automatically focuses on harder adversarial examples. Combining continuous adversarial training, gradient-based perturbations in the embedding space, and a log-sum-exp objective, WARDEN significantly reduces attack success rates across diverse models and attack settings while keeping computational and utility costs comparable to strong baselines such as CAT and CAPO, thereby achieving scalable robust alignment.

📝 Abstract

Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous Adversarial Training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f-divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within a divergence ball around the empirical data distribution, automatically emphasizing harder adversarial examples. Using the convex dual formulation, the objective reduces to a log-sum-exp form under the KL divergence, with a dynamical parameter controlling the strength of reweighting. This study leads to a new class of information-theoretic objectives that significantly reduce attack success rates while maintaining model utility. Across multiple LLMs and attack settings, WARDEN substantially reduces attack success rates with computational and utility costs comparable to CAT-, CAPO-, and MixAT-based baselines, making it a practical approach for scalable robust alignment.
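
The log-sum-exp form mentioned in the abstract can be illustrated with the standard dual of a KL-constrained worst-case loss. The sketch below is not WARDEN's actual objective, which this page does not spell out. It assumes a fixed temperature `lam` standing in for the parameter that controls reweighting strength, and the function names `kl_dro_loss` and `kl_dro_weights` are hypothetical. It shows how per-example adversarial losses collapse into one log-sum-exp value, and how the implied sample weights grow with the loss.

```python
import math


def kl_dro_loss(losses, lam):
    """Log-sum-exp surrogate: lam * log(mean(exp(loss_i / lam))).

    Assumption: lam > 0 is a fixed temperature. In the KL dual it also
    appears in an additive lam * rho term, omitted here because it does
    not depend on the model parameters.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    scaled = [loss / lam for loss in losses]
    peak = max(scaled)  # subtract the max for numerical stability
    lse = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return lam * (lse - math.log(len(losses)))


def kl_dro_weights(losses, lam):
    """Implied sample weights: a softmax over loss_i / lam."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    scaled = [loss / lam for loss in losses]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Illustrative per-example adversarial losses (made-up numbers).
    example_losses = [0.2, 0.5, 3.0]
    for lam in (100.0, 1.0, 0.1):
        print(lam, kl_dro_loss(example_losses, lam),
              kl_dro_weights(example_losses, lam))
```

With a large `lam` the surrogate approaches the plain average loss and the weights are close to uniform. With a small `lam` it approaches the maximum loss and nearly all weight lands on the hardest example, which is the sense in which the objective emphasizes harder adversarial examples.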

Problem

Research questions and friction points this paper is trying to address.

adversarial prompting

large language models

adversarial training

robustness

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

distributionally robust optimization

f-divergence

adversarial training