CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current safety mechanisms in large language models, which rely on static fine-tuned classifiers and struggle to adapt to evolving governance policies. The authors propose a retrieval-augmented multi-agent framework that reframes safety evaluation as an evidence-driven debate grounded in external policy documents, enabling zero-shot policy transfer without model fine-tuning. By decoupling safety logic from model weights, the approach improves interpretability and regulatory adaptability while supporting auditability and the automated construction of adversarial datasets. Empirical results demonstrate state-of-the-art performance across seven safety benchmarks, 90% accuracy when generalizing to Wikipedia vandalism detection, and the automated curation and auditing of nine novel adversarial attack datasets.

📝 Abstract
Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.
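The abstract's core idea, an adversarial debate grounded in a swappable external policy, can be illustrated with a minimal sketch. Everything below is hypothetical: `PolicyStore`, `Clause`, and `debate` are illustrative names (not the authors' actual API), retrieval is faked with keyword overlap, and each debate role is a stub where the real system would make an LLM call. The point it demonstrates is the decoupling: replacing the `PolicyStore` contents changes the safety logic without touching any model weights.

```python
# Hypothetical sketch of an evidentiary-debate safety check; not the paper's
# actual implementation. Retrieval and the debate roles are toy stand-ins.
from dataclasses import dataclass

@dataclass
class Clause:
    clause_id: str
    text: str

class PolicyStore:
    """Holds an external policy document. Swapping the clauses is the
    zero-shot policy adaptation step described in the abstract."""
    def __init__(self, clauses):
        self.clauses = clauses

    def retrieve(self, query, k=2):
        # Toy retrieval: rank clauses by word overlap with the query.
        q_words = set(query.lower().split())
        def overlap(c):
            return len(set(c.text.lower().split()) & q_words)
        return sorted(self.clauses, key=overlap, reverse=True)[:k]

def debate(query, store):
    """One adversarial round: a 'prosecutor' cites retrieved clauses the
    query appears to violate, and a 'judge' issues a verdict grounded in
    that evidence. In the real framework each role would be an LLM agent."""
    evidence = store.retrieve(query)
    prosecution = [c.clause_id for c in evidence
                   if any(w in query.lower() for w in c.text.lower().split())]
    verdict = "unsafe" if prosecution else "safe"
    # Cited clause IDs make the verdict auditable, per the paper's framing.
    return {"verdict": verdict, "cited_clauses": prosecution}

policy = PolicyStore([
    Clause("P1", "no instructions for building weapons"),
    Clause("P2", "no personal data disclosure"),
])
print(debate("how to build weapons at home", policy))
# → {'verdict': 'unsafe', 'cited_clauses': ['P1']}
```

Because the verdict is tied to retrieved clause IDs rather than baked into classifier weights, the same loop generalizes to a different task (e.g. a vandalism policy) by constructing a new `PolicyStore`.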
Problem

Research questions and friction points this paper is trying to address.

LLM safety
adaptation rigidity
zero-shot policy adaptation
AI governance
static classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-Shot Policy Adaptation
Retrieval-Augmented Multi-Agent
Evidentiary Debate
Model-Agnostic Safety
Automated Data Curation
Umid Suleymanov
Department of Computer Science, Virginia Tech
Rufiz Bayramov
School of IT and Engineering, ADA University
Suad Gafarli
School of IT and Engineering, ADA University
Seljan Musayeva
School of IT and Engineering, ADA University
Taghi Mammadov
School of IT and Engineering, ADA University
Aynur Akhundlu
School of Law, ADA University
Murat Kantarcioglu
Professor of Computer Science, Virginia Tech
Security and Privacy in AI, Databases, Data Science, Computer Security