🤖 AI Summary
This work proposes REPR-ALIGN, a method that avoids relearning language representations when converting an autoregressive (AR) language model into a diffusion language model (DLM). It aligns the DLM's hidden states with those of a frozen pretrained AR model at every layer using cosine similarity, so the semantic structure learned during AR pretraining is preserved and training mainly has to relearn the decoding path. The approach uses no adapters and no architectural changes beyond the attention mask, and the results suggest that language representations can transfer across generation orders. With the AR and diffusion models sharing an identical architecture, REPR-ALIGN achieves up to a 4× training speedup in the authors' setting and is particularly effective in low-data regimes.
📝 Abstract
Diffusion language models (DLMs) have recently demonstrated capabilities that complement standard autoregressive (AR) models, particularly in non-sequential generation and bidirectional editing. Although recent work has shown that pretrained autoregressive checkpoints can be converted into diffusion language models, existing recipes primarily transfer parameters through continued denoising training with objective- and attention-level modifications. We instead ask whether the internal representation geometry learned by next-token prediction can be explicitly preserved during AR-to-DLM conversion. We hypothesize that much of the semantic structure learned by AR pretraining can transfer across generation orders, and thus DLM training should be viewed as relearning the decoding path rather than relearning language representations. To investigate this, we introduce REPR-ALIGN, a representation alignment objective that adapts a bidirectional masked diffusion model to reuse representations from a pretrained AR model of identical architecture. Concretely, we align the hidden states of the DLM to the frozen AR model at every layer using cosine similarity, while optimizing the standard masked denoising objective. This simple alignment, with no adapters and no architectural changes beyond the attention mask, yields up to 4x training acceleration in our setting and is particularly effective in low-data regimes. Our results suggest that linguistic representations can transfer across generation order, and that representation alignment provides a simple and effective technique for training diffusion language models. Code is available at https://github.com/pengzhangzhi/Open-dLLM.
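The objective described in the abstract, per-layer cosine alignment to a frozen AR model added to the masked denoising loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `align_weight` coefficient, the choice to average over all layers and token positions, and the use of `1 - cosine similarity` as the per-token penalty are assumptions made for illustration. Hidden states are plain nested lists indexed as `[layer][token][feature]` so the example runs with the standard library only.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def alignment_loss(dlm_hidden, ar_hidden):
    """Mean of (1 - cosine similarity) over every layer and token position.

    Both arguments are indexed as [layer][token][feature]. The AR hidden
    states act as fixed targets, mirroring the frozen AR model.
    Assumption: all layers and tokens are weighted equally.
    """
    total, count = 0.0, 0
    for dlm_layer, ar_layer in zip(dlm_hidden, ar_hidden):
        for h_dlm, h_ar in zip(dlm_layer, ar_layer):
            total += 1.0 - cosine_similarity(h_dlm, h_ar)
            count += 1
    return total / count


def total_loss(denoise_loss, dlm_hidden, ar_hidden, align_weight=1.0):
    """Masked denoising loss plus the weighted alignment term.

    Assumption: the two terms are combined by a simple weighted sum;
    align_weight is a hypothetical hyperparameter.
    """
    return denoise_loss + align_weight * alignment_loss(dlm_hidden, ar_hidden)


if __name__ == "__main__":
    # One layer, two tokens, two features (toy values).
    ar_states = [[[1.0, 0.0], [0.0, 2.0]]]
    dlm_states = [[[0.0, 1.0], [3.0, 0.0]]]  # orthogonal to the AR states
    print(alignment_loss(ar_states, ar_states))
    print(total_loss(0.5, dlm_states, ar_states, align_weight=0.2))
```

In a real training loop the same computation would run on framework tensors, with the AR hidden states computed without gradients so that only the diffusion model is updated.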