Closing the Distribution Gap in Adversarial Training for LLMs

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing adversarial training methods: they struggle to defend against simple in-distribution attacks, such as past-tense rewrites or translation into other languages, because they insufficiently cover the true data distribution. To overcome this, the authors propose Distributional Adversarial Training (DAT), which, for the first time, leverages a diffusion-based large language model to approximate the joint distribution of prompts and responses. DAT generates high-likelihood, diverse samples to enrich the training distribution and iteratively refines model robustness through continuous adversarial optimization. Experiments show that DAT substantially outperforms current approaches, mitigating generalization failures against in-distribution attacks while preserving overall model performance.

📝 Abstract
Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation of current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, leaving models vulnerable to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training (DAT). We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address these generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.
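The abstract describes a two-part recipe: sample diverse, high-likelihood data from a generative model of the data distribution, then run continuous (embedding-space) adversarial optimization on those samples. The toy sketch below illustrates that loop on a linear classifier; it is a minimal illustration of the general idea, not the paper's method. The `sample_batch` stub stands in for the diffusion LLM sampler, and the FGSM-style inner step stands in for continuous adversarial optimization; all names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(n=64):
    """Stand-in for the diffusion-model sampler: draws diverse,
    high-likelihood (x, y) pairs from an assumed data distribution."""
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(float)  # toy labeling rule
    return x, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, x, y):
    """Logistic loss with gradients w.r.t. both weights and inputs."""
    p = sigmoid(x @ w)
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad_w = x.T @ (p - y) / len(y)
    grad_x = np.outer(p - y, w) / len(y)  # input gradient for the attack
    return loss, grad_w, grad_x

w = np.zeros(2)
eps, lr = 0.3, 0.5
for step in range(200):
    x, y = sample_batch()
    # Inner maximization: continuous perturbation of the inputs
    # (FGSM-style worst-case step, standing in for embedding attacks).
    _, _, gx = loss_and_grads(w, x, y)
    x_adv = x + eps * np.sign(gx)
    # Outer minimization: train on the adversarially perturbed samples.
    loss, gw, _ = loss_and_grads(w, x_adv, y)
    w -= lr * gw
```

The key difference from plain adversarial training is where the batches come from: here each batch is freshly drawn from a generative approximation of the data distribution rather than replayed from a fixed training set, which is the coverage gap the paper targets.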
Problem

Research questions and friction points this paper is trying to address.

adversarial training
distribution gap
large language models
robustness
in-distribution attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional Adversarial Training
Diffusion LLMs
adversarial robustness
data distribution coverage
in-distribution attacks