AI Summary
Large language models (LLMs) remain vulnerable to adversarial jailbreaking prompts, often generating harmful content despite safety alignment efforts.
Method: This paper proposes a red-teaming method driven by target toxic answers: it jointly optimizes a malicious user query and a misleading answer prefix to elicit a specified harmful response. Within a reinforcement learning framework, it directly models the probability of generating the target toxic answer as a differentiable reward signal, enabling precise and controllable risk detection.
Contribution/Results: The approach is the first to support unified evaluation across both open-source and black-box models, including proprietary systems like GPT-4o. It achieves state-of-the-art performance on benchmarks such as AdvBench and HH-Harmless, demonstrating significantly higher toxicity activation rates and strong generalization across diverse model families. By shifting focus from prompt optimization to targeted answer generation, this work establishes a novel paradigm for rigorous, scalable, and model-agnostic LLM safety evaluation.
Abstract
Despite substantial advances in artificial intelligence, large language models (LLMs) still face challenges in generation safety. With adversarial jailbreaking prompts, one can effortlessly induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the need for robust LLM red-teaming strategies that identify and mitigate such risks before large-scale deployment. To detect specific types of risks, we propose a novel red-teaming method that $\textbf{A}$ttacks LLMs with $\textbf{T}$arget $\textbf{Toxi}$c $\textbf{A}$nswers ($\textbf{Atoxia}$). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to probe the internal defects of a given LLM. The attacker is trained with reinforcement learning, using the probability that the LLM outputs the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can successfully detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
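The reward described in the abstract can be illustrated with a minimal sketch. It assumes the reward is the log-probability that the evaluated model assigns to the target answer, conditioned on the generated query and answer opening, i.e. $\log p(a \mid q, o) = \sum_t \log p(a_t \mid q, o, a_{<t})$. The abstract does not say whether a raw or log probability is used, so this choice is an assumption. The names `target_answer_log_prob` and `toy_model` are hypothetical, and `toy_model` is a three-token stand-in for a real LLM, not part of the paper.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: maps a token context to a next-token distribution.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]


def target_answer_log_prob(
    next_token_dist: NextTokenDist,
    query: Sequence[str],
    opening: Sequence[str],
    target: Sequence[str],
) -> float:
    """Sum of log p(target_t | query, opening, target_<t) over target tokens."""
    context: List[str] = list(query) + list(opening)
    total = 0.0
    for token in target:
        p = next_token_dist(context).get(token, 0.0)
        if p <= 0.0:
            # The model assigns zero probability to this token.
            return float("-inf")
        total += math.log(p)
        context.append(token)  # teacher forcing: condition on the target so far
    return total


def toy_model(context: Sequence[str]) -> Dict[str, float]:
    """Stand-in for an LLM: gives the most recent token twice the weight."""
    vocab = ["a", "b", "c"]
    last = context[-1] if context else None
    weights = {tok: (2.0 if tok == last else 1.0) for tok in vocab}
    norm = sum(weights.values())
    return {tok: w / norm for tok, w in weights.items()}


# Placeholder tokens only; a real attacker would propose the query and opening.
reward = target_answer_log_prob(toy_model, ["a"], ["b"], ["b", "c"])
```

In a reinforcement learning loop, a scalar of this kind would be returned to the attacker policy after it proposes a query and answer opening; the policy update itself is omitted here because the abstract does not specify it.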