🤖 AI Summary
To address the poor interpretability, weak generalization, and heavy data dependency of large-model safety detection methods for low-resource languages, this paper proposes ConsistentGuard, a few-shot multilingual safety defense framework that combines chain-of-thought (CoT) reasoning enhancement with cross-lingual representation alignment. ConsistentGuard integrates CoT-driven interpretable classification, multilingual semantic alignment, and meta-learning to enable cross-lingual malicious-request detection under extremely limited supervision (only 1,000 annotated samples). We also introduce the first multilingual safety-evaluation benchmark extension covering six low-resource languages across three established benchmarks. Experiments show that ConsistentGuard consistently outperforms fully supervised baselines in detection accuracy, decision interpretability, and zero-/few-shot cross-lingual transfer. The framework offers an efficient, transparent, and scalable path toward trustworthy AI deployment in low-resource settings without compromising robustness.
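The cross-lingual representation alignment idea can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: an alignment objective that pulls embeddings of parallel prompts in different languages toward each other via cosine similarity, so a classifier trained on one language transfers to others.

```python
import numpy as np

def cosine_alignment_loss(src, tgt):
    """Mean (1 - cosine similarity) over aligned embedding pairs.

    src, tgt: (n, d) arrays of embeddings for parallel prompts in two
    languages (row i of src and row i of tgt describe the same request).
    Returns 0.0 when aligned pairs point in identical directions.
    """
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = np.sum(src_n * tgt_n, axis=1)  # per-pair cosine similarity
    return float(np.mean(1.0 - cos))

# Toy check: identical embeddings incur zero alignment loss,
# orthogonal embeddings incur maximal-direction mismatch (loss 1.0).
e = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_alignment_loss(e, e))
```

In a real system this term would be added to the classification loss during fine-tuning; the embedding model, pairing data, and loss weighting here are all left unspecified and are assumptions of this sketch.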
📝 Abstract
Recent advances in large language models (LLMs) have expanded AI capabilities but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards that detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard that enhances explainability through reasoning and improves knowledge transfer between languages through alignment. With only 1,000 training samples, our method achieves superior performance on three datasets across six languages, outperforming larger models trained on significantly more data, while exhibiting strong interpretability and generalization. We also contribute a multilingual benchmark extension and release our code to support future research.