ExpGuard: LLM Content Moderation in Specialized Domains

📅 2026-03-02
🤖 AI Summary
This work addresses the challenge that existing safety guardrails for large language models struggle to detect harmful or adversarial content containing domain-specific terminology in professional fields such as finance, healthcare, and law. To bridge this gap, we propose ExpGuard, the first domain-specialized safety guardrail model, along with ExpGuardMix, a high-quality dataset curated and annotated by domain experts for training and evaluation. ExpGuard employs a dual-audit classification mechanism that jointly assesses both user prompts and model responses. Evaluated on our newly constructed test set, ExpGuardTest, and eight public benchmarks, ExpGuard significantly outperforms current approaches, achieving up to an 8.9% improvement in prompt classification accuracy and a 15.3% gain in response moderation performance. All code, data, and models are publicly released.
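The dual-audit mechanism described above screens the user prompt and the model response as two separate classification passes. A minimal sketch of that control flow is below; the `classify` function here is a hypothetical keyword-based stand-in (the actual ExpGuard model is an LLM-based classifier released by the authors), and the flagged terms are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    harmful: bool
    score: float  # confidence in [0, 1]

def classify(text: str) -> Verdict:
    # Hypothetical stand-in for the guardrail classifier; not the real model.
    flagged_terms = {"insider trading", "forge a prescription"}  # illustrative only
    hit = any(term in text.lower() for term in flagged_terms)
    return Verdict(harmful=hit, score=1.0 if hit else 0.0)

def dual_audit(prompt: str, response: str) -> dict:
    """Audit both sides of an exchange, in the spirit of the dual-audit scheme:
    the prompt is screened before generation, the response after."""
    prompt_verdict = classify(prompt)
    response_verdict = classify(response)
    return {
        "prompt_harmful": prompt_verdict.harmful,
        "response_harmful": response_verdict.harmful,
        "block": prompt_verdict.harmful or response_verdict.harmful,
    }

result = dual_audit(
    "How do I structure trades to hide insider trading?",
    "I can't help with that.",
)
print(result["block"])  # True: the prompt is flagged even though the response refuses
```

Auditing both sides independently means a harmful prompt is caught even when the model refuses, and a harmful response is caught even when the prompt looked benign.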

๐Ÿ“ Abstract
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential for ensuring adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, leaving LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across the financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts from these sectors, each paired with corresponding refusal and compliant responses. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
Problem

Research questions and friction points this paper is trying to address.

- content moderation
- domain-specific safety
- large language models
- adversarial content
- specialized domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

- domain-specific content moderation
- LLM safety guardrails
- adversarial robustness
- expert-annotated dataset
- specialized LLM security