AI Summary
This work addresses the challenge that large language models, constrained by static safety policies, struggle to balance safety and utility at runtime. The authors propose PACT, a framework introducing the first hierarchical, configurable safety mechanism. PACT enforces an immutable global policy to preserve core safety boundaries while allowing user-defined policies to address domain-specific risks. It further incorporates a risk-aware chain-of-thought process that decomposes safety decisions into interpretable "classify-then-act" pathways. This approach enables dynamic, transparent, and customizable safety control, achieving near state-of-the-art performance on comprehensive safety benchmarks while significantly enhancing controllability in user-specified scenarios, thereby effectively mitigating the trade-off between safety and helpfulness. The code, data, and evaluation protocols are publicly released.
Abstract
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off: static, one-size-fits-all safety policies lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specified policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
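To make the hierarchical Classify$\rightarrow$Act routing concrete, here is a minimal sketch of how a global policy could take precedence over user-defined label-to-action mappings. All names here (GLOBAL_POLICY, UserPolicy, resolve_action, and the example risk labels) are illustrative assumptions, not the authors' actual implementation or API.

```python
# Hypothetical sketch of PACT-style hierarchical policy routing.
# Assumption: queries have already been classified into a risk label
# (the "Classify" step); this shows only the "Act" routing step.
from dataclasses import dataclass, field

# Non-overridable global policy: critical risk categories always reject.
GLOBAL_POLICY = {
    "child_safety": "reject",
    "violent_extremism": "reject",
}

@dataclass
class UserPolicy:
    # Domain-specific (non-global) categories mapped to one of the
    # three actions: "comply", "guide", or "reject".
    label_to_action: dict = field(default_factory=dict)

def resolve_action(risk_label: str, user_policy: UserPolicy) -> str:
    """Route a classified query to an action.

    The global policy cannot be overridden; the user policy governs its
    own categories; unlisted (benign) labels comply by default.
    """
    if risk_label in GLOBAL_POLICY:              # immutable boundary
        return GLOBAL_POLICY[risk_label]
    if risk_label in user_policy.label_to_action:
        return user_policy.label_to_action[risk_label]
    return "comply"                              # default: be helpful

# Example: a medical-domain deployment guides on self-medication queries;
# its attempt to relax a global category is simply ignored.
policy = UserPolicy({"self_medication": "guide",
                     "child_safety": "comply"})   # override has no effect
print(resolve_action("self_medication", policy))  # guide
print(resolve_action("child_safety", policy))     # reject (global wins)
print(resolve_action("cooking_tips", policy))     # comply
```

The lookup order encodes the hierarchy described above: global boundaries first, user customization second, helpful compliance as the fallback.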