Backtracking for Safety

๐Ÿ“… 2025-03-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing safety alignment methods for large language models (LLMs) struggle to detect and suppress fine-grained toxic content emerging mid-generation; they are vulnerable to adversarial attacks and often rely on initial-segment filtering or a global reset, compromising contextual coherence and inference efficiency. This paper proposes a dynamic generative backtracking mechanism, the first to introduce token-level backtracking at non-initial positions into safety alignment. A lightweight toxicity detection module computes real-time safety scores during decoding, enabling precise rollback to the most recent safe state when a risk is detected, achieving localized correction while preserving context. As a plug-in module, it integrates with mainstream decoding pipelines and incurs less than 3% additional inference overhead. Evaluated across multiple benchmarks, the method reduces end-to-end toxicity by an average of 62% while maintaining generation fluency and task performance.
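The summary describes a decode-time loop: score the partial generation after each token, and on a violation roll back only to the most recent safe state rather than restarting from scratch. A minimal sketch of that control flow is below; `toy_next_token` and `toxicity_score` are hypothetical stand-ins (a toy sampler and a keyword detector), not the paper's actual model or classifier.

```python
import random

UNSAFE = {"slur"}  # toy stand-in for content the detector flags


def toy_next_token(context, rng):
    # Hypothetical placeholder for an LLM sampling step.
    vocab = ["hello", "world", "slur", "friend"]
    return rng.choice(vocab)


def toxicity_score(tokens):
    # Hypothetical lightweight detector: 1.0 if any unsafe token
    # appears, 0.0 otherwise. The paper's module is a learned scorer.
    return 1.0 if any(t in UNSAFE for t in tokens) else 0.0


def generate_with_backtracking(max_len=10, threshold=0.5, seed=0):
    rng = random.Random(seed)
    tokens = []
    last_safe = 0  # index marking the most recent safe state
    steps = 0
    while len(tokens) < max_len and steps < 100:
        steps += 1
        tokens.append(toy_next_token(tokens, rng))
        if toxicity_score(tokens) >= threshold:
            # Roll back to the most recent safe state and resample,
            # instead of discarding the whole generation (the
            # "reset" baseline criticized in the abstract).
            del tokens[last_safe:]
        else:
            last_safe = len(tokens)
    return tokens
```

Because `last_safe` advances after every accepted token, a rollback here discards only the offending span, which is why the approach preserves prior context and keeps the added inference cost small.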

๐Ÿ“ Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial. Current safety alignment methods, such as supervised fine-tuning and reinforcement learning-based approaches, can exhibit vulnerabilities to adversarial attacks and often result in shallow safety alignment, primarily preventing harmful content only in the initial tokens of the generated output. While methods like resetting can recover from unsafe generations by discarding previous tokens and restarting the generation process, they are not well-suited to nuanced safety violations, such as toxicity, that may arise within otherwise benign and lengthy generations. In this paper, we propose a novel backtracking method designed to address these limitations. Our method allows the model to revert to a safer generation state, not necessarily at the beginning, when safety violations occur during generation. This approach enables targeted correction of problematic segments without discarding the entire generated text, thereby preserving efficiency. We demonstrate that our method dramatically reduces toxicity arising during the generation process with minimal impact on efficiency.
Problem

Research questions and friction points this paper is trying to address.

Address vulnerabilities in LLM safety alignment methods
Enable targeted correction of safety violations mid-generation
Reduce toxicity without discarding entire generated text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level backtracking to the most recent safe generation state, not just the beginning
Targeted correction of problematic segments without a full restart
Plug-in design with low additional inference overhead
๐Ÿ”Ž Similar Papers
No similar papers found.