Atoxia: Red-teaming Large Language Models with Target Toxic Answers

2024-08-27
Citations: 0
Influential: 0

AI Summary
Large language models (LLMs) remain vulnerable to adversarial jailbreaking prompts and often generate harmful content despite safety alignment efforts. Method: This paper proposes a red-teaming method driven by target toxic answers: it jointly optimizes a malicious user query and a misleading answer opening to elicit a specified harmful response. The attacker is trained with reinforcement learning, using the probability that the evaluated LLM generates the target toxic answer as the reward signal, which allows specific risks to be probed in a controlled way. Contribution/Results: The approach is evaluated on both open-source and black-box models, including proprietary systems such as GPT-4o. On benchmarks such as AdvBench and HH-Harmless, it successfully detects safety risks across these model families. By shifting the focus from optimizing prompts alone to eliciting a specified target answer, the work offers a targeted approach to LLM safety evaluation.

๐Ÿ“ Abstract
Despite substantial advances in artificial intelligence, large language models (LLMs) still face challenges in generation safety. With adversarial jailbreaking prompts, one can readily induce LLMs to output harmful content, causing unexpected negative social impacts. This vulnerability highlights the need for robust LLM red-teaming strategies that identify and mitigate such risks before large-scale deployment. To detect specific types of risks, we propose a novel red-teaming method that $\textbf{A}$ttacks LLMs with $\textbf{T}$arget $\textbf{Toxi}$c $\textbf{A}$nswers ($\textbf{Atoxia}$). Given a particular harmful answer, Atoxia generates a corresponding user query and a misleading answer opening to probe the internal defects of a given LLM. The attacker is trained with reinforcement learning, using the probability that the LLM outputs the target answer as the reward. We verify the effectiveness of our method on various red-teaming benchmarks, such as AdvBench and HH-Harmless. The empirical results demonstrate that Atoxia can detect safety risks not only in open-source models but also in state-of-the-art black-box models such as GPT-4o.
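The reward described in the abstract is the probability that the evaluated LLM outputs the target answer. A minimal sketch of how such a sequence-level score is commonly computed, as a sum of per-token log-probabilities, is given below. The `NextTokenDist` interface, the `target_log_prob` helper, and the two-token toy model are assumptions made for illustration and are not taken from the paper; `context` stands in for the generated query plus answer opening.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface (an assumption, not from the paper): given the tokens
# seen so far, return a probability for each candidate next token.
NextTokenDist = Callable[[Sequence[str]], Dict[str, float]]


def target_log_prob(model: NextTokenDist,
                    context: Sequence[str],
                    target: Sequence[str],
                    floor: float = 1e-12) -> float:
    """Return sum_i log p(target[i] | context + target[:i]) under `model`."""
    history: List[str] = list(context)
    total = 0.0
    for token in target:
        prob = model(history).get(token, 0.0)
        total += math.log(max(prob, floor))  # floor avoids log(0)
        history.append(token)
    return total


def toy_model(history: Sequence[str]) -> Dict[str, float]:
    """Stand-in model over two tokens: favors 'b' right after 'a'."""
    if history and history[-1] == "a":
        return {"a": 0.2, "b": 0.8}
    return {"a": 0.5, "b": 0.5}


# Score the placeholder target ["b", "a"] after the placeholder context ["a"].
reward = target_log_prob(toy_model, context=["a"], target=["b", "a"])
```

Working in log space keeps the score numerically stable for long targets. Whether the paper uses the raw probability, its logarithm, or a normalized variant is not stated in this summary.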
Problem

Research questions and friction points this paper is trying to address.

Detect harmful content in LLMs
Develop robust red-teaming strategies
Improve safety in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning attacker
Target toxic answers method
Red-teaming LLM safety
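To illustrate how a reward can steer an attacker policy, the sketch below runs an exact (expected) policy-gradient update on a softmax policy over three abstract candidates. This is a generic textbook technique swapped in for illustration; the summary does not say which reinforcement learning algorithm the paper uses. The reward values are placeholders standing in for target-answer probabilities, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits: List[float],
                         rewards: List[float],
                         lr: float) -> List[float]:
    """One ascent step on J = sum_i p_i * r_i, using dJ/dz_i = p_i * (r_i - J)."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - expected)
            for z, p, r in zip(logits, probs, rewards)]


# Placeholder rewards for three abstract candidates (assumed values).
rewards = [0.1, 0.9, 0.3]
logits = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(500):
    logits = policy_gradient_step(logits, rewards, lr=1.0)

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
```

In this toy setting the policy concentrates on the candidate with the highest placeholder reward. A practical attacker would sample candidates and estimate the gradient from observed rewards instead of enumerating every option.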
Authors
Yuhao Du
Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
Zhuo Li
Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
Pengyu Cheng
Alibaba Group
machine learning, natural language processing
Xiang Wan
Shenzhen Research Institute of Big Data
Bioinformatics, Data Mining, Big Data Analysis
Anningzhe Gao
Shenzhen Research Institute of Big Data