🤖 AI Summary
This work addresses the challenge of balancing low latency and high accuracy in safety guardrails for streaming large language models, which typically rely on precise token-level annotations of unsafe boundaries. The authors propose StreamGuard, the first framework to formulate streaming content moderation as a risk prediction problem. By leveraging Monte Carlo rollouts, StreamGuard predicts the potential harm of partially generated text, enabling early intervention without requiring token-level labels. The approach features a model-agnostic architecture and a unified streaming moderation framework, facilitating knowledge transfer across tokenizers and model families. Experiments show that at the 8B scale, StreamGuard reaches an aggregated streaming output-moderation F1 of 81.9%, with a 92.6% timely intervention rate and a 4.9% false negative rate on the QWENGUARDTEST response_loc benchmark; when the supervision targets are transferred to a 1B-parameter model, it achieves a streaming F1 of 98.2% with only a 3.5% false negative rate.
📝 Abstract
In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations.
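The Monte Carlo rollout supervision can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_continuation` and `harm_score` are hypothetical stand-ins for a sampling-based generator and a full-text safety classifier, and the toy versions below exist only so the sketch runs end to end.

```python
import random

def rollout_risk_target(prefix, generate_continuation, harm_score, n_rollouts=8):
    """Estimate the expected harmfulness of a partial prefix by sampling
    n_rollouts continuations and averaging a full-text harm score.
    Instead of labeling the exact token at which text becomes unsafe,
    this scores where the generation is likely headed."""
    scores = []
    for _ in range(n_rollouts):
        completion = generate_continuation(prefix)      # sample one continuation
        scores.append(harm_score(prefix + completion))  # judge the completed text
    return sum(scores) / len(scores)                    # expected future harm

# --- Toy stand-ins (assumptions for illustration, not the paper's components) ---
def toy_generator(prefix):
    # pretend sampler: randomly continues safely or unsafely
    return random.choice([" ...benign text.", " ...harmful text."])

def toy_harm_score(text):
    # pretend classifier: 1.0 if the completed text contains a flagged phrase
    return 1.0 if "harmful" in text else 0.0

random.seed(0)
risk = rollout_risk_target("How do I", toy_generator, toy_harm_score)
print(round(risk, 2))  # a risk estimate in [0, 1]
```

The averaged score then serves as a regression target for the guardrail at that prefix position, which is what lets the model intervene early without exact boundary annotations.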
Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.