Qwen3Guard Technical Report

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing safety guardrail models suffer from two key limitations: (1) they produce only binary safety labels, lacking the flexibility to accommodate varying safety tolerance thresholds across application scenarios; and (2) they rely on post-hoc detection, rendering them incompatible with streaming inference and real-time intervention. This paper introduces Qwen3Guard, a multilingual, scalable generative safety guardrail framework with a dual-mode architecture: (i) ternary classification ("safe," "controversial," or "unsafe") and (ii) token-level streaming detection. Fine-tuned via instruction tuning, the framework enables fine-grained policy adaptation, while an integrated lightweight classifier head ensures low-latency online monitoring. The model family comes in three sizes (0.6B, 4B, and 8B parameters) and supports 119 languages. It achieves state-of-the-art performance on English, Chinese, and multilingual safety benchmarks. All models are released under the Apache 2.0 license.
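The generative variant's ternary judgment amounts to parsing an instruction-tuned model's free-text reply into one of three labels. A minimal sketch of that parsing layer, where `guard_generate` is a hypothetical stub standing in for the actual Qwen3Guard model call:

```python
import re

# Minimal sketch of tri-class moderation with a generative guard model.
# `guard_generate` is a hypothetical stand-in for Qwen3Guard's
# instruction-tuned generation, which replies with a safety label.

LABELS = ("unsafe", "controversial", "safe")  # checked most-severe first

def guard_generate(prompt: str) -> str:
    """Stub guard model: a real deployment would run Qwen3Guard here."""
    return "Safety: controversial"

def classify(prompt: str) -> str:
    """Parse the guard model's free-text reply into a ternary label."""
    reply = guard_generate(prompt).lower()
    for label in LABELS:
        # Word-boundary match so "safe" does not fire on "unsafe".
        if re.search(rf"\b{label}\b", reply):
            return label
    return "unsafe"  # conservative fallback when no label is found

print(classify("Example user prompt"))  # -> controversial
```

Checking the most severe label first and falling back to "unsafe" when no label is recognized is one conservative design choice; the paper itself does not prescribe this parsing logic.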

📝 Abstract
As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.
Problem

Research questions and friction points this paper is trying to address.

Binary safety labels cause inconsistent policy interpretation across domains
Existing guardrails require complete outputs, preventing streaming intervention
Current models lack real-time safety monitoring during text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative variant enables fine-grained tri-class safety judgments
Stream variant performs real-time token-level safety monitoring
Multilingual models support 119 languages with scalable sizes
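The streaming variant's real-time intervention can be pictured as a loop that scores every partial output with a lightweight per-token classifier head and halts generation once the unsafe score crosses a threshold. A minimal sketch, where `head_scores` is a hypothetical stub for Qwen3Guard-Stream's classifier head (which in reality attaches to the model's hidden states):

```python
# Sketch of token-level streaming moderation: after each generated token,
# a lightweight classifier head scores the partial output, and generation
# stops as soon as the unsafe score crosses a threshold.

def head_scores(tokens):
    """Stub classifier head: returns (safe, controversial, unsafe) scores.

    A real head would score the guard model's hidden state at this token.
    """
    unsafe = 0.9 if "bomb" in tokens else 0.05
    return (1.0 - unsafe, 0.0, unsafe)

def stream_with_guard(token_stream, unsafe_threshold=0.5):
    """Emit tokens until the per-token unsafe score exceeds the threshold."""
    emitted = []
    for token in token_stream:
        emitted.append(token)
        _, _, unsafe = head_scores(emitted)
        if unsafe >= unsafe_threshold:
            break  # real-time intervention before more harm is emitted
    return emitted

tokens = ["how", "to", "build", "a", "bomb", "safely"]
print(stream_with_guard(tokens))  # -> ['how', 'to', 'build', 'a', 'bomb']
```

Because the check runs per token rather than on the completed output, the loop can cut off generation mid-stream, which is exactly the incompatibility with post-hoc guardrails that the paper highlights.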