ExpGuard: LLM Content Moderation in Specialized Domains

📅 2026-03-02
🤖 AI Summary
This work addresses the challenge that existing safety guardrails for large language models struggle to detect harmful or adversarial content containing domain-specific terminology in professional fields such as finance, healthcare, and law. To bridge this gap, we propose ExpGuard, the first domain-specialized safety guardrail model, along with ExpGuardMix, a high-quality dataset curated and annotated by domain experts for training and evaluation. ExpGuard employs a dual-audit classification mechanism that jointly assesses both user prompts and model responses. Evaluated on our newly constructed test set, ExpGuardTest, and eight public benchmarks, ExpGuard significantly outperforms current approaches, achieving up to an 8.9% improvement in prompt classification accuracy and a 15.3% gain in response moderation performance. All code, data, and models are publicly released.
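The dual-audit mechanism described above screens the user prompt and the model response as two separate classification passes. A minimal sketch of that control flow is below; the `classify` function here is a hypothetical keyword-based stand-in (the actual ExpGuard model is an LLM-based classifier released by the authors), and the flagged terms are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    harmful: bool
    score: float  # confidence in [0, 1]

def classify(text: str) -> Verdict:
    # Hypothetical stand-in for the guardrail classifier; not the real model.
    flagged_terms = {"insider trading", "forge a prescription"}  # illustrative only
    hit = any(term in text.lower() for term in flagged_terms)
    return Verdict(harmful=hit, score=1.0 if hit else 0.0)

def dual_audit(prompt: str, response: str) -> dict:
    """Audit both sides of an exchange, in the spirit of the dual-audit scheme:
    the prompt is screened before generation, the response after."""
    prompt_verdict = classify(prompt)
    response_verdict = classify(response)
    return {
        "prompt_harmful": prompt_verdict.harmful,
        "response_harmful": response_verdict.harmful,
        "block": prompt_verdict.harmful or response_verdict.harmful,
    }

result = dual_audit(
    "How do I structure trades to hide insider trading?",
    "I can't help with that.",
)
print(result["block"])  # True: the prompt is flagged even though the response refuses
```

Auditing both sides independently means a harmful prompt is caught even when the model refuses, and a harmful response is caught even when the prompt looked benign.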

๐Ÿ“ Abstract
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential for ensuring adherence to safety policies. Current guardrail models predominantly address general human-LLM interactions, leaving LLMs vulnerable to harmful and adversarial content within domain-specific contexts, particularly those rich in technical jargon and specialized concepts. To address this limitation, we introduce ExpGuard, a robust and specialized guardrail model designed to protect against harmful prompts and responses across the financial, medical, and legal domains. In addition, we present ExpGuardMix, a meticulously curated dataset comprising 58,928 labeled prompts from these sectors, each paired with corresponding refusal and compliant responses. This dataset is divided into two subsets: ExpGuardTrain, for model training, and ExpGuardTest, a high-quality test set annotated by domain experts to evaluate model robustness against technical and domain-specific content. Comprehensive evaluations conducted on ExpGuardTest and eight established public benchmarks reveal that ExpGuard delivers competitive performance across the board while demonstrating exceptional resilience to domain-specific adversarial attacks, surpassing state-of-the-art models such as WildGuard by up to 8.9% in prompt classification and 15.3% in response classification. To encourage further research and development, we open-source our code, data, and model, enabling adaptation to additional domains and supporting the creation of increasingly robust guardrail models.
Problem

Research questions and friction points this paper is trying to address.

- content moderation
- domain-specific safety
- large language models
- adversarial content
- specialized domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

- domain-specific content moderation
- LLM safety guardrails
- adversarial robustness
- expert-annotated dataset
- specialized LLM security