Don't Walk the Line: Boundary Guidance for Filtered Generation

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
When generative models are deployed alongside safety classifiers, fine-tuning often drives generated samples toward the classifier’s decision boundary, increasing both false positives and false negatives. To address this, we propose a reinforcement learning–based boundary-avoidance fine-tuning method: the reward function explicitly penalizes proximity to the decision boundary, and an LLM-as-a-Judge framework enables fine-grained, joint assessment of safety and generation quality. Crucially, our approach requires no modification to the safety classifier—instead, it optimizes the generation policy to steer outputs away from boundary regions while preserving fidelity and fluency. Experiments demonstrate significant reductions in filtering misclassification rates on jailbreaking and fuzzy-prompt benchmarks. Ablation studies across model scales and reward configurations confirm the method’s generalizability and robustness.
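The summary above can be made concrete with a minimal sketch of what a boundary-penalizing reward might look like. This is a hypothetical illustration, not the paper's actual formulation: the function names, the 0.5 threshold, the margin width, and the combination with a judge-assigned quality score are all assumptions.

```python
def boundary_avoidance_reward(p_unsafe: float,
                              quality: float,
                              threshold: float = 0.5,
                              margin: float = 0.2,
                              penalty_weight: float = 1.0) -> float:
    """Hypothetical reward combining generation quality with a penalty
    for proximity to a safety classifier's decision boundary.

    p_unsafe:  classifier's estimated probability the sample is unsafe (0..1)
    quality:   judge-assigned generation-quality score (e.g. 0..1)
    margin:    half-width of the boundary band to steer away from
    """
    # Samples the classifier deems unsafe receive no reward.
    if p_unsafe >= threshold:
        return 0.0
    # Distance from the decision threshold; small distance = near the boundary.
    dist = abs(p_unsafe - threshold)
    # Linear penalty that is 1.0 exactly at the boundary and 0 outside the band.
    boundary_penalty = max(0.0, margin - dist) / margin
    return quality - penalty_weight * boundary_penalty

# A confidently safe sample earns more reward than an equally
# high-quality sample sitting just inside the boundary band.
r_far = boundary_avoidance_reward(0.05, quality=0.9)   # well clear of the boundary
r_near = boundary_avoidance_reward(0.45, quality=0.9)  # inside the margin band
```

Under this sketch, an RL fine-tuning loop would maximize the expected reward, pushing the policy toward samples the classifier scores confidently safe rather than merely just-below-threshold.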

📝 Abstract
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

Problem

Research questions and friction points this paper is trying to address.

Reducing false positives and negatives in filtered generative models
Steering generation away from classifier decision boundaries
Improving safety and utility for jailbreak and ambiguous prompts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning fine-tuning that steers the generator away from the classifier's decision boundary
Reward design that explicitly penalizes proximity to the classifier's margin
LLM-as-a-Judge evaluation showing joint gains in safety and utility