Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

πŸ“… 2026-04-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing hard safety-filtering mechanisms often lead to excessive refusals and overlook model robustness and honesty, resulting in systems that are ostensibly safe yet practically limited. This work proposes Guardian-as-an-Advisor (GaaA), a soft-gating framework that positions the guard model as an advisor rather than an interceptor: it predicts risk labels and generates explanatory rationales, which are prepended to the input to guide the base model’s autonomous decision-making within its original operational policy. The authors introduce GuardSet, a multi-domain dataset of over 208,000 samples, and train GuardAdvisor via supervised fine-tuning followed by reinforcement learning. Experiments demonstrate that GaaA maintains high detection accuracy while significantly reducing over-refusal, yields higher-quality responses than the original prompt alone, and incurs only a 2–10% increase in end-to-end latency with computational overhead under 5% of the base model’s cost.
πŸ“ Abstract
Hard-gated safety checkers often over-refuse and misalign with a vendor's model spec; prevailing taxonomies also neglect robustness and honesty, yielding safer-on-paper yet less useful systems. This work introduces Guardian-as-an-Advisor (GaaA), a soft-gating pipeline in which a guardian predicts a binary risk label plus a concise explanation and prepends this advice to the original query for re-inference, keeping the base model operating under its original spec. To support training and evaluation, we construct GuardSet, a 208k+ multi-domain dataset unifying harmful and harmless cases with targeted robustness and honesty slices. GuardAdvisor is trained via SFT followed by RL to enforce label-explanation consistency. GuardAdvisor attains competitive detection accuracy while enabling the advisory workflow; when used to augment inputs, responses improve over unaugmented prompts. A latency study shows advisor inference uses below 5% of base-model compute and adds only 2–10% end-to-end overhead under realistic harmful-input rates. Overall, GaaA steers models to comply with the model spec, maintaining safety while reducing over-refusal.
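The advisory workflow the abstract describes can be sketched in a few lines. This is a minimal illustration only: `guardian_advise`, `build_advised_prompt`, and `base_model` are hypothetical stand-ins, not the paper's GuardAdvisor or any real API, and the keyword check plays the role of the trained risk classifier.

```python
def guardian_advise(query: str) -> tuple[str, str]:
    """Stub guardian: predict a binary risk label plus a short rationale.

    The real GuardAdvisor is a trained model; this keyword check is only
    a placeholder for its classification step.
    """
    risky = any(word in query.lower() for word in ("explosive", "malware"))
    label = "risky" if risky else "safe"
    rationale = (
        "The request may seek harmful instructions."
        if risky
        else "The request appears benign."
    )
    return label, rationale


def build_advised_prompt(query: str) -> str:
    """Soft gating: prepend the guardian's advice instead of blocking.

    The base model then decides autonomously under its original spec.
    """
    label, rationale = guardian_advise(query)
    advice = f"[Guardian advice] risk={label}; reason={rationale}"
    return f"{advice}\n\n{query}"


def respond(query: str) -> str:
    """Stub base model call: re-infer on the advice-augmented prompt."""
    advised_prompt = build_advised_prompt(query)
    # A real deployment would send advised_prompt to the base LLM here.
    return f"(response conditioned on) {advised_prompt.splitlines()[0]}"
```

The key design point, per the abstract, is that the guardian never hard-refuses: even a `risky` label reaches the base model as in-context advice, which is what keeps the system within its original spec while reducing over-refusal.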
Problem

Research questions and friction points this paper is trying to address.

over-refusal
model specification alignment
robustness
honesty
trustworthy LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Guardian-as-an-Advisor
soft-gating
trustworthy LLMs
over-refusal mitigation
GuardSet