MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit insufficient robustness against discrete adversarial attacks, and existing continuous relaxation methods fail to generalize to realistic discrete threat models. Method: We propose MixAT, the first hybrid adversarial training framework that unifies continuous and discrete adversarial training paradigms. MixAT integrates efficient continuous attacks with potent discrete attacks, combines continuous relaxation-based optimization with discrete gradient approximation, and introduces the ALO-ASR metric to quantify worst-case vulnerability. Results: MixAT reduces ALO-ASR below 20%—a drop of over 30 percentage points versus baselines—while incurring training overhead comparable to pure continuous methods and preserving generation quality. Furthermore, we uncover previously overlooked impacts of deployment factors—including prompt templates, quantization, and LoRA adaptation—on LLM robustness. Our code and models are publicly released.
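The core training idea is to mix the two attack families within the same run: cheap continuous (embedding-space) perturbations for most examples, interleaved with stronger but slower discrete attacks. A minimal sketch of such a per-example mixing schedule is below; the function name, the `p_discrete` ratio, and the string labels are illustrative assumptions, not the paper's actual implementation.

```python
import random

def make_mixed_batch(prompts, p_discrete=0.5, seed=0):
    """Assign each training prompt a hypothetical attack type.

    With probability p_discrete a prompt is earmarked for a discrete
    attack (e.g., an adversarial suffix or paraphrase); otherwise it
    receives a continuous embedding-space perturbation. A seeded RNG
    keeps the schedule reproducible across runs.
    """
    rng = random.Random(seed)
    return [
        (prompt, "discrete" if rng.random() < p_discrete else "continuous")
        for prompt in prompts
    ]

# Example: split a small batch between the two attack families.
batch = make_mixed_batch(["p1", "p2", "p3", "p4"], p_discrete=0.5, seed=0)
```

In an actual training loop, the "discrete" examples would be passed through a token-level attack before the forward pass, while the "continuous" ones would be perturbed directly in embedding space, so the model sees both threat models in every epoch.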

📝 Abstract
Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR<20%) compared to prior defenses (ALO-ASR>50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM robustness against diverse adversarial attacks
Bridging gap between discrete and continuous adversarial training
Reducing computational cost while enhancing safety of LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines discrete and continuous adversarial training
Introduces ALO-ASR metric for worst-case vulnerability
Maintains robustness with minimal computational overhead
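The ALO-ASR metric counts a prompt as compromised if at least one attack in the evaluated suite succeeds on it, so it upper-bounds any single attack's success rate. A minimal sketch, assuming a simple boolean-matrix layout for attack results (the function name and data format are illustrative, not from the paper):

```python
def alo_asr(results):
    """At Least One Attack Success Rate.

    results: dict mapping attack name -> list of per-prompt booleans,
    where True means that attack forced a harmful generation on that
    prompt. All lists must cover the same prompt set, in the same order.
    Returns the fraction of prompts broken by AT LEAST ONE attack.
    """
    per_attack = list(results.values())
    n_prompts = len(per_attack[0])
    broken = sum(
        any(outcomes[i] for outcomes in per_attack)
        for i in range(n_prompts)
    )
    return broken / n_prompts

# Two attacks, four prompts: each attack alone succeeds on 25% of
# prompts, but they break different prompts, so ALO-ASR is 50%.
suite = {
    "GCG":  [True,  False, False, False],
    "PAIR": [False, True,  False, False],
}
print(alo_asr(suite))  # → 0.5
```

This illustrates why ALO-ASR is stricter than reporting the best single attack: a defense can look robust against each attack individually while still being broken on many prompts by the union of attacks.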