Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a challenge in red-teaming large language models: adversarial prompts need to be both effective and diverse, and achieving both at once is difficult. Existing approaches based on Generative Flow Networks (GFNs) suffer from training instability and mode collapse, which noisy red-teaming rewards make worse. To overcome these limitations, the authors propose Stable-GFN (S-GFN), whose contrastive trajectory balance objective uses pairwise comparisons and therefore does not need to estimate the partition function $Z$. It is complemented by a robust masking strategy against noisy rewards and a fluency stabilizer that keeps the model from settling into local optima that produce gibberish. According to the authors, the method trains more stably while preserving the optimal policy of the GFN, and it improves both attack performance and diversity across various red-teaming settings.

📝 Abstract

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs) that perform distribution matching are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function $Z$ estimation in GFN and reduces training instability. S-GFN avoids Z-estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.
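
The abstract says that pairwise comparisons remove the need to estimate $Z$. The sketch below illustrates why that can work. It is not the paper's loss, which is not given on this page. It assumes the standard trajectory balance residual for an autoregressive sampler, log Z + log P_F(τ) − log R(x), and shows that the difference of two such residuals no longer contains log Z. All function names and the toy rewards are hypothetical.

```python
import math


def tb_residual(log_pf, log_reward, log_z):
    """Standard trajectory-balance residual (assumed form, backward policy = 1)."""
    return log_z + log_pf - log_reward


def pairwise_residual(log_pf_a, log_reward_a, log_pf_b, log_reward_b):
    """Difference of two TB residuals; the shared log Z term cancels."""
    return (log_pf_a - log_reward_a) - (log_pf_b - log_reward_b)


def pairwise_loss(log_pfs, log_rewards):
    """Mean squared pairwise residual over all unordered pairs in a batch."""
    n = len(log_pfs)
    if n < 2:
        raise ValueError("need at least two trajectories to form a pair")
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            r = pairwise_residual(log_pfs[i], log_rewards[i],
                                  log_pfs[j], log_rewards[j])
            total += r * r
            pairs += 1
    return total / pairs


# Toy check (hypothetical rewards): a sampler with p(x) = R(x) / Z.
rewards = [1.0, 2.0, 5.0]
z = sum(rewards)
log_rewards = [math.log(r) for r in rewards]
log_pfs_optimal = [math.log(r / z) for r in rewards]
log_pfs_uniform = [math.log(1.0 / 3.0)] * 3

loss_optimal = pairwise_loss(log_pfs_optimal, log_rewards)
loss_uniform = pairwise_loss(log_pfs_uniform, log_rewards)

# The pairwise residual is the same whatever value of log Z is plugged in.
diff_for_any_z = (tb_residual(log_pfs_uniform[0], log_rewards[0], 3.7)
                  - tb_residual(log_pfs_uniform[1], log_rewards[1], 3.7))
```

In this toy setting the reward-proportional sampler drives the pairwise loss to zero without ever knowing Z, while the uniform sampler does not. This is consistent with the abstract's claim that the optimal GFN policy is maintained, although the paper's actual objective, masking, and fluency term may differ from this sketch.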

Problem

Research questions and friction points this paper is trying to address.

LLM Red-Teaming

Diverse Attacks

Training Instability

Mode Collapse

Reward Noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stable-GFlowNet

Red-Teaming

Contrastive Trajectory Balance