🤖 AI Summary
To address two bottlenecks at once, the slow sequential inference of autoregressive models and the high computational cost and incoherent generation of masked diffusion models, this paper proposes a slot-level parallel decoding framework. The method elevates the diffusion modeling unit from individual tokens to fixed-length contiguous subsequences ("slots") and introduces, for the first time in diffusion language modeling, KV cache reuse under a unified local causal constraint. The approach integrates slot-wise sequence modeling, an iterative "plan-and-infill" mechanism, and local causal attention. Evaluated on seven benchmarks, the model achieves an average 34% performance gain over prior diffusion language models while accelerating inference by 18×; compared to strong autoregressive baselines, it matches their performance while sustaining a 2.33× speedup, yielding substantial improvements in both generation efficiency and controllability.
📝 Abstract
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative "plan-and-infill" decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
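The iterative plan-and-infill loop from the abstract can be sketched in miniature. This is an illustrative sketch only: the names (`plan_slots`, `infill_slot`, `SLOT_LEN`) and the trivial planner are assumptions for exposition, not ReFusion's actual API; a real planner would use the diffusion model to score which masked slots are weakly dependent, and the selected slots would be infilled in parallel with cached KV states.

```python
# Minimal sketch of iterative "plan-and-infill" slot decoding, as described
# in the abstract. All function names and the selection heuristic are
# illustrative assumptions, not the paper's implementation.
from typing import List, Optional

SLOT_LEN = 4   # each slot is a fixed-length, contiguous sub-sequence
Slot = Optional[List[str]]  # None = still masked

def plan_slots(slots: List[Slot], k: int = 2) -> List[int]:
    """Planning step: choose up to k still-masked slots to decode next.
    ReFusion uses a diffusion-based planner to find weakly dependent
    slots; here we simply take the first k masked ones."""
    masked = [i for i, s in enumerate(slots) if s is None]
    return masked[:k]

def infill_slot(slots: List[Slot], i: int) -> List[str]:
    """Infilling step: autoregressively decode one slot conditioned on the
    partially decoded sequence. Stubbed with placeholder tokens."""
    return [f"tok{i}_{j}" for j in range(SLOT_LEN)]

def decode(num_slots: int) -> List[str]:
    slots: List[Slot] = [None] * num_slots
    while any(s is None for s in slots):
        chosen = plan_slots(slots)      # diffusion-based planning step
        for i in chosen:                # decoded in parallel in the paper
            slots[i] = infill_slot(slots, i)
    return [tok for s in slots for tok in s]

print(len(decode(4)))  # 16 tokens: 4 slots x SLOT_LEN
```

The key efficiency claim rests on this structure: because each slot is decoded left-to-right internally and slots are committed in a planned order, previously decoded slots' KV states can be cached and reused, unlike token-level MDMs where any position may change at any step.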