A Lightweight Explainable Guardrail for Prompt Safety

📅 2026-01-24

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the detection of unsafe prompts for large language models (LLMs) by proposing a lightweight, interpretable guardrail. The approach uses multi-task learning to jointly train a prompt classifier and an explanation classifier that labels the prompt words supporting the safe/unsafe decision. The explainability training data is synthetic, generated with a strategy designed to counteract LLMs' confirmation bias. Training uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. Evaluated on three datasets in both in-domain and out-of-domain settings, the method matches or exceeds the state of the art in prompt classification and explainability, with a considerably smaller model than current approaches.
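The loss described above can be illustrated with a minimal sketch. It is not the paper's formulation: it assumes the widely used learned log-variance form of uncertainty weighting, assumes cross-entropy for the prompt-level task and focal loss for the word-level task (the summary does not say which loss goes with which task), and leaves out the global explanation signal entirely. All function names and numeric values are hypothetical.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p)^gamma, which down-weights easy examples."""
    return ((1.0 - p_true) ** gamma) * -math.log(p_true)


def uncertainty_weighted(losses: list[float], log_vars: list[float]) -> float:
    """Combine task losses as sum(exp(-s_i) * L_i + s_i).

    Each s_i is a log-variance that would be a learnable parameter in training;
    this weighting scheme is an assumption, not taken from the paper.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical model outputs: probability assigned to the true prompt label,
# and probabilities assigned to the true explanation label of each word.
prompt_loss = cross_entropy(0.9)
word_probs = [0.95, 0.6, 0.8]
explanation_loss = sum(focal_loss(p) for p in word_probs) / len(word_probs)

# With both log-variances at 0, the combined loss is the plain sum of the two.
total = uncertainty_weighted([prompt_loss, explanation_loss], log_vars=[0.0, 0.0])
```

Under this assumed scheme, raising a task's log-variance shrinks that task's contribution while the additive `s_i` term penalizes inflating it without bound.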

📝 Abstract

We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
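To make the two jointly learned targets concrete, here is a rough sketch of what a single training example might look like: one safe/unsafe label for the whole prompt, plus one binary label per word marking whether that word explains the decision. The class name, field names, and the example prompt are hypothetical and are not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass
class GuardrailExample:
    """One hypothetical training example with both supervision signals."""

    words: list[str]
    prompt_label: int       # 1 = unsafe, 0 = safe (prompt classifier target)
    word_labels: list[int]  # 1 = word explains the decision (explanation target)

    def __post_init__(self) -> None:
        # The explanation task needs exactly one label per prompt word.
        if len(self.words) != len(self.word_labels):
            raise ValueError("words and word_labels must have the same length")

    def rationale(self) -> list[str]:
        """Return the words marked as explaining the prompt-level label."""
        return [w for w, y in zip(self.words, self.word_labels) if y == 1]


# Illustrative example, invented for this sketch.
example = GuardrailExample(
    words="Explain how to bypass a software license check".split(),
    prompt_label=1,
    word_labels=[0, 0, 0, 1, 0, 0, 1, 1],
)
highlighted = example.rationale()
```

In a multi-task setup of this kind, both label sets would typically be predicted from a shared encoder, so the word-level supervision can also shape the representation used for the prompt-level decision.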

Problem

Research questions and friction points this paper is trying to address.

prompt safety

explainable AI

unsafe prompt detection

guardrail

model interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

lightweight explainable guardrail

multi-task learning

synthetic explanation data