AprielGuard

📅 2025-12-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM safety frameworks typically model content harms (e.g., toxicity, bias) and adversarial threats (e.g., prompt injection, jailbreaking) in isolation, limiting holistic risk mitigation in dialogue and agentic settings. Method: We propose the first unified 8B-parameter safety guardian, introducing a fused classification taxonomy and a multi-stage supervised fine-tuning framework to enable end-to-end joint detection of both risk categories. We incorporate structured reasoning traces for enhanced interpretability and design an input encoding mechanism adaptable to single-turn/multi-turn dialogues and agentic workflows. Contribution/Results: Trained on a hybrid dataset of open-source and synthetically generated examples, our model significantly outperforms baselines—including Llama-Guard and Granite Guardian—across multiple public and internal benchmarks. Notably, it achieves marked improvements in detection accuracy for multi-step adversarial scenarios, demonstrating superior robustness and generalization in complex, realistic deployment settings.
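The input encoding mechanism mentioned above (covering single-turn prompts, multi-turn dialogues, and agentic workflows) can be pictured as a prompt-formatting step in front of the classifier. The sketch below is a hypothetical illustration of that idea; the role tags, mode names, and label set are assumptions, not AprielGuard's actual template.

```python
# Hypothetical sketch of encoding a conversation for a safety-guard
# classifier. The "[SAFETY CHECK]" header, role tags, and label list
# are illustrative assumptions, not AprielGuard's real prompt format.

def encode_for_guard(messages, mode="conversation"):
    """Flatten a list of (role, content) pairs into one guard prompt.

    mode: "prompt" (single-turn), "conversation" (multi-turn), or
    "agent" (tool-calling trace) -- mirroring the three settings the
    paper targets.
    """
    header = f"[SAFETY CHECK | mode={mode}]"
    body = "\n".join(f"<{role}> {content}" for role, content in messages)
    return f"{header}\n{body}\n[CLASSIFY: safe | content-harm | adversarial]"

dialogue = [
    ("user", "How do I reset my router password?"),
    ("assistant", "Hold the reset button for 10 seconds."),
]
print(encode_for_guard(dialogue))
```

A single encoder like this is what lets one model jointly score both content harms and adversarial manipulations, rather than routing each risk type to a separate tool.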

📝 Abstract
Safeguarding large language models (LLMs) against unsafe or adversarial behavior is critical as they are increasingly deployed in conversational and agentic settings. Existing moderation tools often treat safety risks (e.g., toxicity, bias) and adversarial threats (e.g., prompt injections, jailbreaks) as separate problems, limiting their robustness and generalizability. We introduce AprielGuard, an 8B-parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard is trained on a diverse mix of open and synthetic data covering standalone prompts, multi-turn conversations, and agentic workflows, augmented with structured reasoning traces to improve interpretability. Across multiple public and proprietary benchmarks, AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations, outperforming existing open-source guardrails such as Llama-Guard and Granite Guardian, particularly in multi-step and reasoning-intensive scenarios. By releasing the model, we aim to advance transparent and reproducible research on reliable safeguards for LLMs.
Problem

Research questions and friction points this paper is trying to address.

Unifies safety risks and adversarial threats in a single framework
Detects harmful content and adversarial manipulations in LLMs
Improves robustness in multi-turn conversations and agentic workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified safety taxonomy for diverse threats
Training with structured reasoning traces
Strong performance in multi-step scenarios
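The structured reasoning traces listed above imply that the guard's output carries both an explanation and a final label. The parser below is a minimal sketch under an assumed `REASONING:`/`VERDICT:` output schema; the paper's actual trace format may differ.

```python
# Hypothetical parser for a guard model's structured output: one or
# more reasoning lines followed by a final verdict. The
# "REASONING:/VERDICT:" layout is an assumed schema for illustration,
# not AprielGuard's documented format.

def parse_guard_output(text):
    reasoning, verdict = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("REASONING:"):
            reasoning.append(line[len("REASONING:"):].strip())
        elif line.startswith("VERDICT:"):
            verdict = line[len("VERDICT:"):].strip().lower()
    return {"reasoning": reasoning, "verdict": verdict}

sample = """REASONING: The user asks the model to ignore its system prompt.
REASONING: This matches the prompt-injection category.
VERDICT: adversarial"""
result = parse_guard_output(sample)
print(result["verdict"])  # -> adversarial
```

Keeping the trace separate from the verdict makes the decision auditable: a reviewer can check whether the cited reasoning actually supports the label, which is the interpretability benefit the paper attributes to reasoning traces.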
Jaykumar Kasundra
SLAM Lab, ServiceNow
Anjaneya Praharaj
SLAM Lab, ServiceNow
Sourabh Surana
SLAM Lab, ServiceNow
Lakshmi Sirisha Chodisetty
SLAM Lab, ServiceNow
Sourav Sharma
SLAM Lab, ServiceNow
Abhigya Verma
SLAM Lab, ServiceNow
Abhishek Bhardwaj
Assistant Professor of Finance, Tulane University
Debasish Kanhar
SLAM Lab, ServiceNow
Aakash Bhagat
SLAM Lab, ServiceNow
Khalil Slimi
SLAM Lab, ServiceNow
Seganrasan Subramanian
SLAM Lab, ServiceNow
Sathwik Tejaswi Madhusudhan
SLAM Lab, ServiceNow
Ranga Prasad Chenna
SLAM Lab, ServiceNow
Srinivas Sunkara
Google DeepMind