🤖 AI Summary
Diffusion language models (dLLMs) implicitly learn a semi-autoregressive mixture-of-experts structure during training, yet existing inference methods employ a single fixed scheduling strategy, failing to exploit this inherent diversity and thus limiting performance.
Method: We propose HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free test-time scaling method that runs generation in parallel across multiple block sizes and aggregates the resulting predictions by majority vote, enabling multi-path ensemble inference.
Contribution/Results: HEX is the first approach to explicitly uncover and leverage the intrinsic semi-autoregressive expert structure of dLLMs, mitigating the failure modes associated with rigid scheduling. Evaluated on GSM8K, MATH, ARC-C, and TruthfulQA, HEX achieves new state-of-the-art accuracies of 88.10%, 40.00%, 87.80%, and 57.46%, respectively, significantly outperforming existing inference paradigms.
📝 Abstract
Diffusion-based large language models (dLLMs) are trained to flexibly model dependencies in the data distribution; however, how to best utilize this flexibility at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single fixed inference-time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By taking a majority vote over generation paths with diverse block sizes, HEX robustly avoids the failure modes of any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX also yields significant gains on the MATH benchmark (from 16.40% to 40.00%), on scientific reasoning with ARC-C (from 54.18% to 87.80%), and on TruthfulQA (from 28.36% to 57.46%). Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the order in which unmasking is performed plays a critical role in determining inference performance.
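The ensembling step described above can be sketched in a few lines. This is a minimal illustration only: `generate` is a hypothetical stand-in for a dLLM decoder run under a given semi-autoregressive block size, and the block sizes and stubbed answers are invented for demonstration, not taken from the paper.

```python
from collections import Counter

def generate(prompt: str, block_size: int) -> str:
    """Hypothetical stand-in for a dLLM decoder that unmasks tokens in
    semi-autoregressive blocks of `block_size` and returns the final answer.
    Stubbed here with fixed outputs purely for illustration."""
    return {4: "42", 8: "42", 16: "17", 32: "42"}[block_size]

def hex_vote(prompt: str, block_sizes=(4, 8, 16, 32)) -> str:
    """HEX-style inference sketch: run the same prompt under several
    block schedules and take a majority vote over the answers."""
    answers = [generate(prompt, b) for b in block_sizes]
    return Counter(answers).most_common(1)[0][0]

print(hex_vote("What is 6 * 7?"))  # majority answer across schedules
```

In practice the vote would be taken over extracted final answers (e.g. the boxed number on a math benchmark), so that paths which reason differently but agree on the answer reinforce each other.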