🤖 AI Summary
This work addresses the detection of unsafe prompts for large language models (LLMs) by proposing a lightweight explainable guardrail (LEG). The approach uses multi-task learning to jointly train a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. The explanation training data is synthetic and is generated with a strategy designed to counteract LLMs' confirmation bias. Training uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. Evaluated on three datasets in both in-domain and out-of-domain settings, the method matches or exceeds state-of-the-art performance in both prompt classification and explanation quality, while using a considerably smaller model than current approaches.
📝 Abstract
We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, even though its model is considerably smaller than those of current approaches. If accepted, we will release all models and the annotated dataset publicly.
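The abstract does not give the exact form of the loss, so the following is only a minimal sketch of how cross-entropy and focal terms could be combined with uncertainty-based weighting. Everything specific here is an assumption: the assignment of cross-entropy to the prompt task and focal loss to the word-level explanation task, the weighting form `exp(-s) * L + s` with a learnable log-variance `s` per task, the function names, and the toy probabilities. The "global explanation signals" mentioned above are not modeled.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def focal_loss(probs, label, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma, which
    down-weights examples the model already classifies confidently."""
    p_t = probs[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def uncertainty_weighted(losses, log_vars):
    """Assumed weighting form: sum_i exp(-s_i) * L_i + s_i, where each
    s_i is a per-task log-variance that would be learned in training."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

# Hypothetical toy example: one prompt consisting of three words.
prompt_probs = [0.2, 0.8]                          # [safe, unsafe]
prompt_label = 1                                   # unsafe
word_probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]]  # [not explanatory, explanatory]
word_labels = [0, 1, 1]

# Assumption: cross-entropy on the prompt task, focal loss on the word task.
prompt_loss = cross_entropy(prompt_probs, prompt_label)
word_loss = sum(
    focal_loss(p, y) for p, y in zip(word_probs, word_labels)
) / len(word_labels)

# With both log-variances at 0, the tasks are weighted equally.
total = uncertainty_weighted([prompt_loss, word_loss], log_vars=[0.0, 0.0])
print(f"prompt={prompt_loss:.4f} word={word_loss:.4f} total={total:.4f}")
```

In this sketch, setting `gamma=0` reduces the focal term to plain cross-entropy, and raising a task's log-variance lowers that task's weight while the added `s` term penalizes ignoring the task entirely.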