🤖 AI Summary
To address the ethical risks posed by harmful content generation from large language models (LLMs), existing safety interventions rely on auxiliary control models or runtime modifications, often degrading output quality and increasing inference overhead. This paper proposes LLMSafeGuard, a lightweight, fine-tuning-free real-time safety framework that dynamically integrates an external validator during decoding for immediate safety intervention. Its key contributions are: (1) a similarity-driven, training-free validation mechanism that eliminates the need for model retraining; and (2) a context-aware intervention timing strategy that balances safety guarantees with generation fluency. Experiments demonstrate that LLMSafeGuard reduces toxic output by at least 38.6% in detoxification tasks while preserving linguistic quality, and achieves at least 24.2% lower inference latency than state-of-the-art methods.
📄 Abstract
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks, but their propensity to generate harmful content also poses ethical and societal risks. Existing methods have limitations, including the need to train dedicated control models and to intervene proactively during text generation, which lead to quality degradation and increased computational overhead. To mitigate these limitations, we propose LLMSafeGuard, a lightweight real-time framework that integrates an external validator into decoding, rejecting unsafe outputs while allowing valid ones. We introduce a similarity-based validation approach that simplifies the introduction of constraints and eliminates the need to train control models. Additionally, LLMSafeGuard employs a context-wise timing selection strategy, intervening in LLMs only when necessary. We evaluate LLMSafeGuard on detoxification and copyright safeguarding, demonstrating its superiority over state-of-the-art baselines. In detoxification, LLMSafeGuard reduces toxic output by at least 38.6% while preserving linguistic quality. Moreover, its context-wise timing selection cuts inference time by at least 24.2% without compromising effectiveness.