AI Summary
This work addresses the lack of efficient safety monitoring for diffusion-based large language models (D-LLMs), whose multi-step denoising process exposes intermediate representations that existing monitors do not exploit. The authors propose a two-tier dynamic safety monitoring framework that introduces "safety hesitation" as a proxy metric for sample difficulty. Building on this concept, they design a hesitation-aware dynamic routing mechanism: a lightweight probe continuously monitors the generation trajectory and triggers a heavyweight probe only when hesitation exceeds a predefined threshold, so that the costlier probe runs only on demand. Evaluated on three datasets and four D-LLMs, the method achieves state-of-the-art performance with at most 0.85M probe parameters and offers the best trade-off between safety detection effectiveness and computational efficiency among eight baselines.
Abstract
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor uses a lightweight probe as an always-on monitor that jointly estimates hesitation and performs base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters) and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
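The hesitation-aware routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the lightweight probe emits one unsafe-probability per denoising step, that the decision boundary sits at 0.5, and that the light probe's base classification is the mean of its per-step scores. The names `count_hesitation_steps`, `route`, `margin`, and `max_hesitation`, as well as the default values, are hypothetical choices made for illustration.

```python
from typing import Callable, Sequence, Tuple


def count_hesitation_steps(
    probs: Sequence[float], margin: float = 0.1, boundary: float = 0.5
) -> int:
    """Count denoising steps whose light-probe score lies within
    `margin` of the decision boundary (assumed to be 0.5 here)."""
    return sum(1 for p in probs if abs(p - boundary) < margin)


def route(
    light_probs: Sequence[float],
    heavy_probe: Callable[[], bool],
    margin: float = 0.1,
    max_hesitation: int = 2,
) -> Tuple[bool, str, int]:
    """Return (is_unsafe, probe_used, hesitation_count).

    The heavy probe is called only when the hesitation count exceeds
    `max_hesitation`; otherwise the light probe's own verdict is used.
    """
    hesitation = count_hesitation_steps(light_probs, margin)
    if hesitation > max_hesitation:
        # Hard sample: escalate to the more expressive, costlier probe.
        return heavy_probe(), "heavy", hesitation
    # Easy sample: aggregate the light probe's per-step scores.
    # Mean aggregation is an assumption, not taken from the paper.
    mean_score = sum(light_probs) / len(light_probs)
    return mean_score >= 0.5, "light", hesitation


if __name__ == "__main__":
    # Hypothetical per-step scores from a light probe.
    confident = [0.90, 0.95, 0.92, 0.97]
    hesitant = [0.45, 0.55, 0.52, 0.48, 0.90]

    # Stand-in for an expensive probe; a real one would read hidden states.
    def stub_heavy_probe() -> bool:
        return True

    print(route(confident, stub_heavy_probe))
    print(route(hesitant, stub_heavy_probe))
```

In this sketch the confident trajectory never comes near the boundary, so only the light probe runs, while the hesitant trajectory crosses the hesitation threshold and is escalated. Raising `max_hesitation` or narrowing `margin` sends fewer samples to the heavy probe, which is the effectiveness versus efficiency trade-off the abstract refers to.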