Evolving Prompts for Toxicity Search in Large Language Models

πŸ“… 2025-11-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Adversarial prompts remain effective at eliciting harmful outputs from safety-aligned large language models (LLMs). To address this, we propose ToxSearch, a black-box evolutionary framework that evolves toxic prompts in a steady-state loop, combining lexical substitution, negation, back-translation, paraphrasing, and semantic crossover operators under the guidance of a moderation oracle. Operator-level analysis uncovers the heterogeneous contributions of different perturbations to toxicity induction. Empirically, small perturbations transfer across models and prompt reuse substantially undermines defenses: ToxSearch achieves high attack success rates on the LLaMA family, while cross-model transfer experiments show an average decay in toxicity of roughly 50%, with smaller models exhibiting greater robustness but certain architecture-divergent models remaining highly vulnerable.

πŸ“ Abstract
Large Language Models remain vulnerable to adversarial prompts that elicit toxic content even after safety alignment. We present ToxSearch, a black-box evolutionary framework that tests model safety by evolving prompts in a synchronous steady-state loop. The system employs a diverse set of operators, including lexical substitutions, negation, back-translation, paraphrasing, and two semantic crossover operators, while a moderation oracle provides fitness guidance. Operator-level analysis shows heterogeneous behavior: lexical substitutions offer the best yield-variance trade-off, semantic-similarity crossover acts as a precise low-throughput inserter, and global rewrites exhibit high variance with elevated refusal costs. Using elite prompts evolved on LLaMA 3.1 8B, we observe practically meaningful but attenuated cross-model transfer, with toxicity roughly halving on most targets, smaller LLaMA 3.2 variants showing the strongest resistance, and some cross-architecture models retaining higher toxicity. These results suggest that small, controllable perturbations are effective vehicles for systematic red-teaming and that defenses should anticipate cross-model reuse of adversarial prompts rather than focusing only on single-model hardening.
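The synchronous steady-state loop from the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the oracle here is a trivial keyword heuristic standing in for a real moderation API, the mutation operator is a placeholder for the paper's lexical/negation/back-translation/paraphrase/crossover operators, and the binary-tournament selection policy is an assumption.

```python
import random

def toxicity_oracle(prompt: str) -> float:
    """Stand-in moderation oracle returning a toxicity score in [0, 1].
    In practice this would score the target model's response to `prompt`
    via a moderation service; here it is a toy keyword heuristic."""
    return min(1.0, 0.2 * sum(w in prompt.lower() for w in ("attack", "harm")))

def mutate(prompt: str) -> str:
    """Stand-in variation operator (the paper uses lexical substitution,
    negation, back-translation, paraphrasing, and semantic crossover)."""
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = words[i] + "?"  # trivial perturbation, for illustration only
    return " ".join(words)

def steady_state_search(seeds, generations=100, rng_seed=0):
    """Steady-state loop: each iteration produces one child and replaces
    the current worst individual if the child scores at least as high."""
    random.seed(rng_seed)
    population = [(p, toxicity_oracle(p)) for p in seeds]
    for _ in range(generations):
        # Binary tournament selection over the current population.
        parent, _ = max(random.sample(population, k=min(2, len(population))),
                        key=lambda pf: pf[1])
        child = mutate(parent)
        child_fit = toxicity_oracle(child)
        worst = min(range(len(population)), key=lambda i: population[i][1])
        if child_fit >= population[worst][1]:
            population[worst] = (child, child_fit)
    # Return the elite prompt and its oracle score.
    return max(population, key=lambda pf: pf[1])

best, score = steady_state_search(["how to attack a system", "tell me a story"])
```

A steady-state scheme replaces one individual at a time rather than a whole generation, which keeps elite prompts in the population while the oracle budget is spent incrementally.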
Problem

Research questions and friction points this paper is trying to address.

Evolving adversarial prompts to test safety vulnerabilities in aligned language models
Analyzing cross-model transfer of toxicity between different LLM architectures
Developing black-box evolutionary methods to systematically evaluate model robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary framework evolves prompts for toxicity testing
Diverse operators include lexical substitutions and semantic crossover
Moderation oracle guides fitness for black-box safety evaluation
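Of the operators above, the abstract singles out lexical substitution as offering the best yield-variance trade-off. A minimal sketch of such an operator is below; the hand-rolled synonym table is a stand-in (a real implementation might draw substitutions from WordNet or a masked language model), and the `rate` parameter is an illustrative assumption.

```python
import random

# Toy synonym table; purely illustrative, not the paper's substitution source.
SYNONYMS = {
    "make": ["create", "produce"],
    "explain": ["describe", "detail"],
}

def lexical_substitution(prompt: str, rate: float = 1.0, seed: int = 0) -> str:
    """Replace eligible words with synonyms, leaving the rest of the prompt
    untouched; small edits like this largely preserve prompt semantics."""
    rng = random.Random(seed)
    out = []
    for word in prompt.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() <= rate:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

print(lexical_substitution("explain how to make a device"))
```

Because only individual tokens change, the perturbation is controllable: the edit distance to the parent prompt is bounded by the number of substituted words.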
πŸ”Ž Similar Papers
No similar papers found.