🤖 AI Summary
This work addresses a challenge in red-teaming large language models: finding adversarial prompts that are both effective and diverse. Existing approaches based on Generative Flow Networks (GFNs) suffer from training instability and mode collapse. To overcome these limitations, the authors propose a pairwise (contrastive) trajectory balance mechanism that eliminates the need to estimate the partition function $Z$, complemented by a robust reward masking strategy that mitigates noisy rewards and a fluency regularization term that keeps the model from collapsing into gibberish. The authors report that the method improves both attack success rate and diversity while training more stably and avoiding convergence to low-quality local optima across multiple red-teaming settings.
📝 Abstract
Large Language Model (LLM) red-teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Red-teaming should find attacks that are both effective and diverse, but achieving both at once is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates estimation of the partition function $Z$ in GFN and reduces training instability. S-GFN avoids $Z$ estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate that S-GFN achieves strong attack performance and diversity across various settings.
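To illustrate why pairwise comparisons can remove the need for $Z$, the following is a minimal sketch and not the S-GFN objective itself. It assumes the standard trajectory balance residual for an autoregressive sampler, where each prompt has a single generation path so the backward log-probability is zero. The function names (`tb_residual`, `pairwise_loss`) are hypothetical and chosen only for this example. Subtracting the residuals of two trajectories cancels $\log Z$, and the resulting loss is zero whenever the policy samples in proportion to reward.

```python
import math

def tb_residual(log_z, log_pf, log_reward):
    """Standard trajectory balance residual: log Z + log P_F(tau) - log R(x).

    Assumes an autoregressive sampler, so the backward term log P_B is 0.
    """
    return log_z + log_pf - log_reward

def pairwise_loss(log_pf_a, log_reward_a, log_pf_b, log_reward_b):
    """Squared difference of two residuals; log Z cancels and is never needed."""
    return ((log_pf_a - log_reward_a) - (log_pf_b - log_reward_b)) ** 2

# The pairwise loss is the same for every value of log Z.
for log_z in (-3.0, 0.0, 7.5):
    diff = tb_residual(log_z, -2.0, 0.5) - tb_residual(log_z, -1.0, -0.3)
    assert abs(diff ** 2 - pairwise_loss(-2.0, 0.5, -1.0, -0.3)) < 1e-9

# A policy with P_F(x) = R(x) / Z has zero pairwise loss on any pair.
rewards = [1.0, 2.0, 5.0]
z = sum(rewards)
log_pf = [math.log(r / z) for r in rewards]
log_r = [math.log(r) for r in rewards]
assert pairwise_loss(log_pf[0], log_r[0], log_pf[2], log_r[2]) < 1e-12
```

The second check reflects the abstract's claim that the optimal GFN policy is preserved: a reward-proportional sampler remains a minimizer of the pairwise form, at least in this simplified setting.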