Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

📅 2026-02-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that large language models, constrained by static safety policies, struggle to balance safety and utility at runtime. The authors propose PACT, a novel framework introducing the first hierarchical and configurable safety mechanism. PACT enforces immutable global policies to preserve core safety boundaries while allowing user-defined policies to address domain-specific risks. It further incorporates a risk-aware chain-of-thought process that decomposes safety decisions into interpretable “classify-then-act” pathways. This approach enables dynamic, transparent, and customizable safety control, achieving near state-of-the-art performance on comprehensive safety benchmarks while significantly enhancing controllability in user-specified scenarios, thereby effectively mitigating the trade-off between safety and usefulness. The code, data, and evaluation protocol are publicly released.

πŸ“ Abstract
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllabilityxf, making it difficult to tailor responses to diverse application needs. %As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
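The hierarchical precedence described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: in PACT the classification and routing are carried out by the model's own chain-of-thought, whereas the code below uses a plain lookup table to show only the precedence rule (global policy first, then user policy). The names `GLOBAL_POLICY`, `UserPolicy`, `Decision`, and `decide`, the `medical_dosage` and `small_talk` labels, the choice of "reject" for the global categories, and the fallback to "comply" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Assumed global policy table. The two categories are the abstract's examples;
# mapping them to REJECT is an assumption of this sketch.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}


@dataclass
class UserPolicy:
    # Hypothetical user-configured mapping from domain-specific labels to actions.
    label_to_action: dict[str, Action] = field(default_factory=dict)
    # Assumed fallback for labels that neither policy mentions.
    default_action: Action = Action.COMPLY


@dataclass
class Decision:
    label: str
    source: str  # "global", "user", or "default"
    action: Action


def decide(label: str, user_policy: UserPolicy) -> Decision:
    """Act step: map an already-classified risk label to an action."""
    # Global entries are checked first, so a user policy cannot override them.
    if label in GLOBAL_POLICY:
        return Decision(label, "global", GLOBAL_POLICY[label])
    if label in user_policy.label_to_action:
        return Decision(label, "user", user_policy.label_to_action[label])
    return Decision(label, "default", user_policy.default_action)


policy = UserPolicy({
    "medical_dosage": Action.GUIDE,  # hypothetical domain-specific category
    "child_safety": Action.COMPLY,   # attempted override, ignored by decide()
})
for label in ("child_safety", "medical_dosage", "small_talk"):
    print(decide(label, policy))
```

Recording the `source` alongside the action is one simple way to keep the decision path inspectable, which loosely mirrors the transparency goal stated in the abstract.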
Problem

Research questions and friction points this paper is trying to address.

safety-helpfulness trade-off
static safety policies
runtime controllability
risk-aware reasoning
hierarchical policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical policy control
risk-aware chain-of-thought
dynamic safety alignment
controllable LLM safety
Classify-to-Act framework