🤖 AI Summary
To address the severe performance degradation of diffusion large language models (dLLMs) under 2-bit post-training quantization (PTQ), this paper proposes Quant-dLLM, the first quantization framework dedicated to ultra-low-bit dLLMs. The method comprises three key components: (1) Masked Calibration Simulation (MCS), which explicitly emulates the dLLM-specific masked denoising process during calibration so that calibration statistics match inference-time inputs; (2) the Data-aware Any-order Quantizer (DAQ), which learns ultra-low-bit weight representations by iteratively approximating the original weights under the simulated calibration data; and (3) Adaptive Blockwise Mixed Precision (ABMP), which allocates bit width at fine granularity via channel-group sensitivity analysis. Experiments demonstrate that the approach consistently outperforms state-of-the-art PTQ methods transferred from autoregressive LLMs under a strict 2-bit budget, while drastically reducing model size and computational overhead. To the best of the authors' knowledge, this is the first work to enable efficient and practical ultra-low-bit deployment of dLLMs.
📄 Abstract
Diffusion large language models (dLLMs), which offer bidirectional context and flexible masked-denoising generation, are emerging as a compelling alternative to autoregressive (AR) LLMs. However, like AR LLMs, their model sizes continue to grow, motivating weight compression for deployment. Although post-training quantization (PTQ) is effective for AR LLMs, directly transferring it to dLLMs at 2 bits leads to unsatisfactory performance. To tackle this challenge, we propose Quant-dLLM, an ultra-low-bit PTQ framework tailored to dLLMs. Since masked-denoising activations in dLLMs differ from the fully visible signals assumed by standard PTQ methods, we introduce Masked Calibration Simulation (MCS) to align calibration with the timestep-dependent masking, which yields more reliable calibration data. Moreover, we propose a Data-aware Any-order Quantizer (DAQ) that learns ultra-low-bit weight representations through iterative approximation guided by our simulated calibration data. In addition, under a strict 2-bit budget, we introduce Adaptive Blockwise Mixed Precision (ABMP), a sensitivity-based precision allocation scheme that adaptively assigns bit widths across channel groups. When restricted to 2-bit precision, Quant-dLLM consistently achieves higher accuracy than state-of-the-art (SOTA) PTQ methods transferred from AR LLMs. The code and models will be available at: https://github.com/ZTA2785/Quant-dLLM.
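To make MCS concrete, the sketch below shows one way timestep-dependent masking could be simulated on calibration inputs. It is a minimal illustration under assumptions, not the paper's implementation: the function name, the linear mask schedule (a fraction t/T of tokens masked at step t), and the single `mask_token_id` are all hypothetical.

```python
import torch

def simulate_masked_calibration(input_ids, mask_token_id, num_timesteps=8, generator=None):
    """Illustrative MCS-style masking (hypothetical): produce calibration
    batches whose masking statistics follow a denoising schedule."""
    batches = []
    for t in range(1, num_timesteps + 1):
        mask_ratio = t / num_timesteps                       # assumed linear schedule
        probs = torch.full(input_ids.shape, mask_ratio)
        mask = torch.bernoulli(probs, generator=generator).bool()
        ids = input_ids.clone()
        ids[mask] = mask_token_id                            # replace selected tokens with [MASK]
        batches.append(ids)
    return batches
```

Running the frozen model on such batches and recording per-layer inputs would then yield calibration statistics that reflect masked-denoising activations rather than fully visible text.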
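The abstract describes DAQ as an iterative, data-guided optimization. A common way to realize "data-aware" weight quantization is to choose quantization parameters that minimize the layer's output error on calibration activations rather than the raw weight error; the sketch below illustrates that principle with a simple per-channel scale search. It is not the paper's DAQ algorithm, and the function name and grid-search heuristic are assumptions.

```python
import torch

def data_aware_quantize(W, X, bits=2, grid=20):
    """Sketch of data-aware weight quantization (not the paper's DAQ).
    W: (out_features, in_features) weights; X: (n_samples, in_features)
    calibration activations, e.g. collected under MCS-style masking."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    base = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp_min(1e-8)  # max-based per-channel scale
    best_err = torch.full((W.shape[0], 1), float("inf"))
    best_Wq = W.clone()
    for i in range(1, grid + 1):
        s = base * i / grid                               # candidate (shrunken) scale
        Wq = (W / s).round().clamp(qmin, qmax) * s        # fake-quantize at `bits` bits
        err = ((X @ W.T - X @ Wq.T) ** 2).sum(dim=0, keepdim=True).T  # output error per channel
        keep = err < best_err
        best_err = torch.where(keep, err, best_err)
        best_Wq = torch.where(keep, Wq, best_Wq)
    return best_Wq
```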
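Finally, ABMP is described as sensitivity-based bit allocation across channel groups under a strict 2-bit budget. The sketch below shows one possible scheme of this kind: it funds higher precision for sensitive groups by demoting insensitive ones, keeping the size-weighted average at the budget. The bit choices {1, 2, 3}, the trade fraction, and all names are illustrative assumptions, not the paper's rule; `sensitivity` could be, for example, the per-group output error incurred at the lowest bit width.

```python
import torch

def allocate_bits(sensitivity, group_sizes, low=1, mid=2, high=3, trade_frac=0.25):
    """Sketch of ABMP-style allocation (hypothetical). Groups start at `mid`
    bits; demoting the least sensitive groups to `low` frees bit-capacity
    that is spent promoting the most sensitive groups to `high`."""
    order = torch.argsort(sensitivity)                    # ascending sensitivity
    bits = torch.full_like(sensitivity, float(mid))
    n_demote = int(trade_frac * sensitivity.numel())
    demoted = order[:n_demote]
    bits[demoted] = float(low)
    budget = ((mid - low) * group_sizes[demoted]).sum()   # freed capacity in weight-bits
    for g in order.flip(0).tolist():                      # most sensitive first
        if bits[g] != mid:
            continue                                      # skip already-demoted groups
        cost = (high - mid) * group_sizes[g]
        if cost <= budget:                                # promote only if budget allows
            bits[g] = float(high)
            budget -= cost
    return bits
```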