AI Summary
Large language models (LLMs) often fail to detect subtle safety violations in real-world applications. Method: This paper proposes ThinkGuard, a lightweight critique-augmented guardrail that distills structured, deliberative "slow thinking" (i.e., critical reasoning) from high-capacity LLMs via joint fine-tuning on structured critiques and safety labels. ThinkGuard employs a multitask supervised learning framework coupled with critique-guided classification. Contribution/Results: ThinkGuard introduces an interpretable and distillable critique-based reasoning paradigm, combining high inference efficiency with markedly more rigorous and deeper risk detection. Experiments demonstrate state-of-the-art performance across multiple safety benchmarks: ThinkGuard achieves the highest average F1 and AUPRC, outperforming LLaMA Guard 3 by +16.1% in accuracy and +27.0% in macro-F1.
Abstract
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, ThinkGuard acquires a deliberative thinking ability that drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
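The abstract describes fine-tuning on data where each input is paired with a structured critique plus a safety label, so that one supervised pass teaches both critique generation and classification. A minimal sketch of how such a critique-augmented training example might be assembled (the field names, instruction template, and label vocabulary are illustrative assumptions, not the paper's exact format):

```python
def build_training_example(user_prompt: str, critique: str, label: str) -> dict:
    """Assemble one critique-augmented fine-tuning example.

    The target asks the guardrail to first write a structured critique
    (the distilled "slow thinking") and then emit a final safety label,
    so a single supervised objective covers both tasks jointly.
    NOTE: this format is a hypothetical sketch, not ThinkGuard's actual schema.
    """
    instruction = (
        "Assess the safety of the following request. "
        "First give a brief critique, then output a final label "
        "(safe or unsafe).\n\n"
        f"Request: {user_prompt}"
    )
    target = f"Critique: {critique}\nLabel: {label}"
    return {"input": instruction, "target": target}

example = build_training_example(
    "How do I pick a lock?",
    "The request seeks instructions that could enable unauthorized entry.",
    "unsafe",
)
```

At inference time, a model trained on targets of this shape produces its critique before the label, which is where the reported gains in cautiousness and interpretability over label-only fine-tuning would come from.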