MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual large language models (LLMs) are vulnerable to cross-lingual jailbreaking attacks, and the scarcity of multilingual safety-aligned data further weakens defense capabilities. To address this, we propose a reasoning-based safety guard tailored for multilingual settings. Our method comprises three key components: (1) a novel synthetic data generation approach that incorporates culturally and linguistically grounded characteristics to alleviate annotation scarcity; (2) a curriculum-guided Group Relative Policy Optimization (GRPO) framework to enhance multilingual robustness and generalization; and (3) support for multilingual interpretable outputs that expose language-specific safety risks. Experiments demonstrate that our approach significantly outperforms existing baselines across both in-domain and out-of-domain languages, while also generating faithful, multilingual safety explanations—thereby improving transparency and debuggability in content moderation.

📝 Abstract
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data are often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we propose an approach to build a multilingual guardrail with reasoning. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-guided Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail consistently outperforms recent baselines across both in-domain and out-of-domain languages. The multilingual reasoning capability of our guardrail enables it to generate multilingual explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
Problem

Research questions and friction points this paper is trying to address.

Detect and filter unsafe multilingual LLM content
Address limited multilingual safety-aligned data
Improve reasoning for language-specific risk analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic multilingual data generation with nuanced variants
Supervised fine-tuning for enhanced performance
Curriculum-guided GRPO framework for multilingual reasoning
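The two core training ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: GRPO scores each sampled response against the statistics of its own group, and a curriculum orders training examples easiest-first. The `difficulty` scoring function is an assumption for illustration; the paper's actual difficulty criterion is not given in this summary.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO's core idea: the advantage of each sampled response is its
    reward relative to the mean of its group, normalized by the group's
    standard deviation (no separate value/critic model needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def curriculum_order(examples, difficulty):
    """A simple curriculum schedule: present examples easiest-first.
    `difficulty` is a hypothetical per-example scoring function."""
    return sorted(examples, key=difficulty)

# Example: a group of 4 sampled guardrail verdicts, rewarded 1.0 if
# the safety label is correct and 0.0 otherwise.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In the curriculum-guided variant, `curriculum_order` would be applied to the synthetic multilingual training pool so that early GRPO updates see easier (e.g. unambiguous) safety examples before harder, culturally nuanced ones.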