Adaptive Content Restriction for Large Language Models via Suffix Optimization

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of lightweight, adaptive suppression of restricted content generation by large language models (LLMs) without fine-tuning. We propose the Adaptive Content Restriction (AdaCoRe) task and introduce Suffix Optimization (SOP), a novel method that appends a short, learnable suffix to user prompts and optimizes it via gradient-based updates and targeted token masking to suppress diverse restricted terms. SOP requires no model modification or parameter updates, enabling transfer across architectures (e.g., Gemma, Mistral, Llama) and deployment on online platforms (e.g., Poe). Evaluated on our curated benchmark CoReBench, SOP achieves 6–17% higher average restriction rates than system-level suffix baselines across five open models while preserving generation quality. To the best of our knowledge, this is the first framework achieving zero-fine-tuning, high adaptability, and strong generalization for controllable content restriction.
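The paper's SOP optimizes a short suffix so the model's probability of emitting restricted terms drops. The paper works with real LM gradients over discrete tokens; the following is only a minimal numpy toy illustrating the core loop, with an invented one-layer "LM head" (`W`), made-up dimensions, and finite-difference gradients standing in for backpropagation. It is a sketch of the idea, not the authors' algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, SUF_LEN = 50, 16, 4
W = rng.normal(size=(VOCAB, DIM))        # toy stand-in for a frozen LM head
prompt_emb = rng.normal(size=(6, DIM))   # fixed prompt embeddings
restricted = [3, 7]                      # token ids we want to suppress

def next_token_probs(suffix_emb):
    """Toy 'LM': mean-pool prompt+suffix embeddings, project to vocab."""
    h = np.concatenate([prompt_emb, suffix_emb]).mean(axis=0)
    logits = W @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()

def restricted_mass(suffix_emb):
    """Total probability assigned to restricted tokens (the loss)."""
    return next_token_probs(suffix_emb)[restricted].sum()

suffix = rng.normal(size=(SUF_LEN, DIM)) * 0.1
lr, eps = 0.5, 1e-4
before = restricted_mass(suffix)
for _ in range(200):
    # Finite-difference gradient of the loss w.r.t. the suffix embeddings;
    # a real implementation would backpropagate through the LM instead.
    grad = np.zeros_like(suffix)
    for i in range(SUF_LEN):
        for j in range(DIM):
            pert = suffix.copy()
            pert[i, j] += eps
            grad[i, j] = (restricted_mass(pert) - restricted_mass(suffix)) / eps
    suffix -= lr * grad
after = restricted_mass(suffix)
print(f"restricted-token mass: {before:.4f} -> {after:.4f}")
```

The "targeted token masking" component the summary mentions would additionally zero out restricted-token logits at decoding time; here the gradient step alone already drives their probability mass down.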

📝 Abstract
Large Language Models (LLMs) have demonstrated significant success across diverse applications. However, enforcing content restrictions remains a significant challenge due to their expansive output space. One aspect of content restriction is preventing LLMs from generating harmful content via model alignment approaches such as supervised fine-tuning (SFT). Yet, the need for content restriction may vary significantly across user groups, change rapidly over time, and not always align with general definitions of harmfulness. Applying SFT to each of these specific use cases is impractical due to the high computational, data, and storage demands. Motivated by this need, we propose a new task called Adaptive Content Restriction (AdaCoRe), which focuses on lightweight strategies -- methods without model fine-tuning -- to prevent deployed LLMs from generating restricted terms for specific use cases. We propose the first method for AdaCoRe, named Suffix Optimization (SOP), which appends a short, optimized suffix to any prompt to a) prevent a target LLM from generating a set of restricted terms, while b) preserving the output quality. To evaluate AdaCoRe approaches, including our SOP, we create a new Content Restriction Benchmark (CoReBench), which contains 400 prompts for 80 restricted terms across 8 carefully selected categories. We demonstrate the effectiveness of SOP on CoReBench, which outperforms system-level baselines such as system suffix by 15%, 17%, 10%, 9%, and 6% on average restriction rates for Gemma2-2B, Mistral-7B, Vicuna-7B, Llama3-8B, and Llama3.1-8B, respectively. We also demonstrate that SOP is effective on POE, an online platform hosting various commercial LLMs, highlighting its practicality in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Prevent LLMs from generating harmful content adaptively
Avoid impractical fine-tuning for diverse content restrictions
Maintain output quality while restricting specific terms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Suffix Optimization prevents restricted terms.
Lightweight strategy avoids model fine-tuning.
Benchmark evaluates content restriction effectiveness.
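The abstract reports "average restriction rates" on CoReBench. The paper does not spell out the metric here, but a natural reading is the fraction of generations that avoid all restricted terms. The sketch below assumes that definition (case-insensitive substring matching); the function name, the matching rule, and the example outputs are all illustrative, not taken from the paper.

```python
def restriction_rate(generations, restricted_terms):
    """Fraction of generations containing none of the restricted terms.

    Assumed metric: case-insensitive substring match, which is one
    plausible way to score CoReBench-style outputs.
    """
    def is_clean(text):
        low = text.lower()
        return not any(term.lower() in low for term in restricted_terms)

    return sum(is_clean(g) for g in generations) / len(generations)

# Hypothetical model outputs for a prompt with restricted term "gunpowder":
outs = [
    "Here is a recipe for fireworks...",
    "The key ingredient is gunpowder.",   # violation: term appears
    "I can suggest a safe alternative.",
]
print(restriction_rate(outs, ["gunpowder"]))
```

Under this scoring, a baseline and SOP would each be run over the benchmark's 400 prompts and compared on the resulting averages.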