$S^3$: Stratified Scaling Search for Test-Time Scaling in Diffusion Language Models

πŸ“… 2026-04-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing diffusion language models struggle to improve generation quality at test time through increased inference compute, primarily because conventional best-of-$K$ sampling repeatedly draws from the same base distribution. This work proposes Stratified Scaling Search ($S^3$), which introduces verifier-guided, trajectory-level search into diffusion language models for the first time. During the denoising process, $S^3$ dynamically expands multiple candidate trajectories and employs a lightweight, reference-free verifier to evaluate and selectively resample high-potential paths while preserving diversity. Notably, this approach approximates a reward-weighted distribution without modifying the base model or decoding schedule. Experiments show that $S^3$ substantially improves performance on mathematical reasoning benchmarks such as MATH-500 and GSM8K when applied to LLaDA-8B-Instruct, validating the efficacy of test-time scaling.
πŸ“ Abstract
Test-time scaling investigates whether a fixed diffusion language model (DLM) can generate better outputs when given more inference compute, without additional training. However, naive best-of-$K$ sampling is fundamentally limited because it repeatedly draws from the same base diffusion distribution, whose high-probability regions are often misaligned with high-quality outputs. We propose $S^3$ (Stratified Scaling Search), a classical verifier-guided search method that improves generation by reallocating compute during the denoising process rather than only at the final output stage. At each denoising step, $S^3$ expands multiple candidate trajectories, evaluates them with a lightweight reference-free verifier, and selectively resamples promising candidates while preserving diversity within the search frontier. This procedure effectively approximates a reward-tilted sampling distribution that favors higher-quality outputs while remaining anchored to the model prior. Experiments with LLaDA-8B-Instruct on MATH-500, GSM8K, ARC-Challenge, and TruthfulQA demonstrate that $S^3$ consistently improves performance across benchmarks, achieving the largest gains on mathematical reasoning tasks while leaving the underlying model and decoding schedule unchanged. These results show that classical search over denoising trajectories provides a practical mechanism for test-time scaling in DLMs.
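The abstract's loop (expand candidate trajectories at each denoising step, score them with a reference-free verifier, then resample promising ones) can be sketched in toy form. This is a minimal illustration, not the paper's implementation: `denoise_step`, `verifier`, and all hyperparameters below are hypothetical stand-ins, and the softmax resampling is one simple way to realize the reward-tilted distribution the abstract describes.

```python
import math
import random

random.seed(0)

def denoise_step(seq):
    """Toy base model: one reverse-diffusion step that unmasks one position
    with a random token. (Hypothetical stand-in for a real DLM step.)"""
    seq = list(seq)
    masked = [i for i, t in enumerate(seq) if t is None]
    if masked:
        seq[random.choice(masked)] = random.randint(0, 9)
    return seq

def verifier(seq):
    """Toy reference-free verifier: prefers larger unmasked tokens,
    standing in for a learned quality score."""
    return sum(t for t in seq if t is not None)

def s3_search(length=8, frontier=4, expand=3, steps=8, temp=1.0):
    """Verifier-guided trajectory search in the spirit of S^3:
    at each denoising step, branch every frontier trajectory into
    several candidates, score them, and resample proportionally to
    exp(reward / temp) -- a reward-tilted distribution that favors
    high-scoring paths while the stochastic resampling preserves
    diversity within the frontier."""
    trajs = [[None] * length for _ in range(frontier)]
    for _ in range(steps):
        # Expansion: branch each trajectory into `expand` candidates.
        cands = [denoise_step(t) for t in trajs for _ in range(expand)]
        # Verification: score every candidate with the lightweight verifier.
        scores = [verifier(c) for c in cands]
        # Resampling: softmax weights approximate reward tilting
        # without modifying the base model or decoding schedule.
        m = max(scores)
        weights = [math.exp((s - m) / temp) for s in scores]
        trajs = random.choices(cands, weights=weights, k=frontier)
    # Return the best fully denoised trajectory on the frontier.
    return max(trajs, key=verifier)

best = s3_search()
```

Because each step unmasks exactly one position per trajectory, after `steps == length` iterations the frontier holds fully denoised sequences, and the temperature `temp` interpolates between pure base-model sampling (large `temp`) and greedy verifier-following (small `temp`).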
Problem

Research questions and friction points this paper is trying to address.

- test-time scaling
- diffusion language models
- generation quality
- inference compute
- sampling distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

- test-time scaling
- diffusion language models
- verifier-guided search
- denoising trajectories
- reward-tilted sampling