🤖 AI Summary
Existing evaluation methods rely on dataset-level metrics, which fail to capture fine-grained, sample-level differences in the non-deterministic behavior of diffusion language models. This work proposes Factor Variance Attribution (FVA), a cross-factor analysis metric that decomposes observed non-determinism into variance attributable to different evaluation factor settings. By combining fine-grained, sample-level evaluation with controlled multi-factor experiments (varying guidance scale, diffusion steps, Monte Carlo sampling, batch size, hardware, and numerical precision), the authors systematically analyze how model behavior shifts under different conditions. The findings show that non-determinism in diffusion language models is both pervasive and structured, with code generation markedly more sensitive to configuration choices than question answering. Moreover, conventional dataset-level metrics systematically understate the true extent of behavioral variability, because aggregation hides sample-level differences across runs.
📝 Abstract
Diffusion language models (DLMs) have emerged as a promising paradigm for large language models (LLMs), yet their non-deterministic behavior remains poorly understood. Existing non-determinism evaluations for LLMs rely predominantly on dataset-level metrics under fixed inference configurations, providing limited insight into how model behavior varies across runs and evaluation conditions. In this work, we show that dataset-level metrics systematically attenuate non-determinism in diffusion language models by aggregating sample-level prediction quality across different runs. As a result, configurations with similar aggregate performance can exhibit substantially different behaviors on individual inputs, leaving fine-grained instability and distinct error patterns uncharacterized. To address this limitation, we conduct a fine-grained evaluation of non-determinism based on sample-level prediction differences across a range of model-related factors, including guidance scale, diffusion steps, and Monte Carlo sampling, as well as system-related factors such as batch size, hardware, and numerical precision. Our analysis reveals that non-determinism in DLMs is pervasive and structured, with code generation exhibiting markedly higher sensitivity to factor-level choices than question answering. To attribute the sources of non-determinism, we introduce Factor Variance Attribution (FVA), a cross-factor analysis metric that decomposes observed non-determinism into variance attributable to different evaluation factor settings. Our findings highlight the need for fine-grained, factor-aware evaluation to enable reliable non-determinism assessment of diffusion language models.
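The gap between dataset-level and sample-level views can be illustrated with a small sketch. The abstract does not give the formula for FVA, so the code below is not the paper's metric: `flip_rate` and `variance_shares` are hypothetical helpers, and the per-sample correctness values are invented toy data. The sketch assumes each run is a list of 0/1 correctness flags over the same inputs, and it normalizes the per-factor variance into shares purely for illustration.

```python
from statistics import mean, pvariance

def accuracy(run):
    """Dataset-level metric: fraction of correct samples in one run."""
    return mean(run)

def flip_rate(run_a, run_b):
    """Sample-level metric: fraction of inputs whose correctness differs."""
    return mean([int(a != b) for a, b in zip(run_a, run_b)])

def variance_shares(results):
    """Illustrative variance attribution (an assumption, not the paper's FVA).

    results maps factor name -> {setting: per-sample 0/1 correctness list}.
    For each factor, take the variance of each sample's correctness across
    that factor's settings, average over samples, then normalize so the
    shares across factors sum to 1.
    """
    raw = {}
    for factor, by_setting in results.items():
        runs = list(by_setting.values())
        per_sample = [float(pvariance(col)) for col in zip(*runs)]
        raw[factor] = mean(per_sample)
    total = sum(raw.values())
    return {f: (v / total if total else 0.0) for f, v in raw.items()}

# Toy data: two runs with identical accuracy but opposite per-sample outcomes.
run_a = [1, 1, 0, 0]
run_b = [0, 0, 1, 1]
print(accuracy(run_a), accuracy(run_b))  # both 0.5
print(flip_rate(run_a, run_b))           # 1.0: every sample flipped

toy_results = {
    "guidance_scale": {"1.0": run_a, "2.0": run_b},
    "batch_size": {"1": [1, 1, 0, 0], "8": [1, 1, 0, 1]},
}
shares = variance_shares(toy_results)
print(shares)
```

In this toy case both runs score 0.5 at the dataset level while disagreeing on every input, which is the kind of instability an aggregate metric cannot show. The variance shares then indicate which factor accounts for more of the sample-level disagreement.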