🤖 AI Summary
Diffusion large language models (dLLMs) fine-tuned on instruction data suffer from "<eos> overflow": as the allocated sequence length grows, generated responses paradoxically shorten, collapsing into early termination or degenerating into streams of repeated <eos> tokens. The root cause is the dual role of <eos> as both end-of-sequence marker and padding symbol, which concentrates probability mass on <eos> at later positions and propagates backward to trigger premature termination. This work is the first to systematically characterize this mechanism and proposes "Rainbow Padding": replacing the run of identical <eos> padding tokens with a repeating cycle of distinct, dedicated padding tokens, which redistributes probability mass and breaks <eos> dominance. The method requires only a single epoch of LoRA-based fine-tuning, introduces no architectural changes, and remains fully compatible with existing dLLM systems. Experiments show that as few as seven distinct padding tokens substantially improve robustness to the allocated generation length, with strong results even from minimal training data. Code is publicly available.
📝 Abstract
Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs exhibit a critical vulnerability we term `<eos>` overflow: as allocated sequence length increases, responses paradoxically become shorter, collapsing into early termination or degenerating into streams of `<eos>` tokens. Although noticed in practice, this issue has not been systematically analyzed. We trace its root cause to the dual role of `<eos>` as both termination and padding, which concentrates probability mass on `<eos>` at later positions and propagates backward to trigger early termination. To address this, we introduce Rainbow Padding, a simple remedy that replaces repeated `<eos>` placeholders with a repeating cycle of distinct padding tokens, distributing probability mass and breaking `<eos>` dominance. Experiments show that Rainbow Padding substantially improves length robustness and output quality, with as few as seven padding tokens sufficient to prevent early termination. Moreover, the method integrates efficiently into existing instruction-tuned models: LoRA fine-tuning for a single epoch on minimal data yields significant improvements, making this solution highly practical. The code is publicly available at https://github.com/quasar529/rainbow-padding.
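To make the padding scheme concrete, here is a minimal sketch of the core idea: the unused tail of a fixed-length sequence is filled with a repeating cycle of distinct padding tokens after a single `<eos>`, instead of repeating `<eos>` itself. The token names (`<pad_0>` … `<pad_6>`) and the helper function are illustrative assumptions, not the paper's released implementation.

```python
def rainbow_pad(tokens, max_len, eos="<eos>", num_pads=7):
    """Pad a token list to max_len: one <eos>, then a cycle of
    distinct padding tokens (the 'rainbow') instead of repeated <eos>."""
    pad_cycle = [f"<pad_{i}>" for i in range(num_pads)]
    padded = list(tokens) + [eos]
    # Fill the remaining slots by cycling through the distinct pad tokens,
    # so no single token dominates the tail of the sequence.
    while len(padded) < max_len:
        padded.append(pad_cycle[(len(padded) - len(tokens) - 1) % num_pads])
    return padded[:max_len]

demo = rainbow_pad(["Hello", ",", "world"], max_len=12)
print(demo)
# → ['Hello', ',', 'world', '<eos>', '<pad_0>', '<pad_1>', '<pad_2>',
#    '<pad_3>', '<pad_4>', '<pad_5>', '<pad_6>', '<pad_0>']
```

In training, these padding tokens would be added to the tokenizer's vocabulary as special tokens, so the model learns to place them after `<eos>` rather than piling probability mass onto `<eos>` at every late position.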