🤖 AI Summary
This work addresses a limitation of continuous diffusion language models: diffusion is typically applied in continuous spaces that are poorly suited to language denoising and token recovery. The authors propose DiHAL, a geometry-guided diffusion-transformer hybrid that uses the geometry of a pretrained Transformer's hidden states to decide where diffusion should enter the network. DiHAL replaces only the lower layers (the transformer prefix) with a diffusion bridge and keeps the upper layers and the original language model head, so the bridge reconstructs the hidden state at the selected layer rather than tokens. Layers are ranked with geometry-based proxy scores. On 8B-scale backbones, the score predicts effective shallow insertion layers under a fixed bridge-training protocol, and hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison that matches the diffusion/recovery training budget.
📝 Abstract
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.
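The pipeline described in the abstract (score layers, pick an interface, reconstruct the hidden state, reuse the upper layers and LM head) can be illustrated with a minimal sketch. The abstract does not define the geometry score or the bridge, so everything below is an assumption for illustration: `participation_ratio` is a generic effective-dimensionality measure used as a stand-in proxy, "higher score is better" is an assumed direction, and `select_layer`, `hybrid_forward`, and the toy bridge, upper layer, and head are hypothetical names, not DiHAL's implementation.

```python
import numpy as np

def participation_ratio(h):
    """Effective dimensionality of hidden states h with shape (n_tokens, d).

    Stand-in proxy only (assumption): the paper's geometry score is not given here.
    """
    centered = h - h.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(h) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # clip tiny negative values
    return float(eig.sum() ** 2 / (np.square(eig).sum() + 1e-12))

def select_layer(states_by_layer, score_fn=participation_ratio):
    """Pick the layer with the highest proxy score (assumed direction)."""
    scores = {k: score_fn(h) for k, h in states_by_layer.items()}
    return max(scores, key=scores.get), scores

def hybrid_forward(noisy_state, bridge, upper_layers, lm_head):
    """Bridge reconstructs the selected-layer hidden state; the retained
    upper layers and LM head then map it to logits."""
    h = bridge(noisy_state)
    for layer in upper_layers:
        h = layer(h)
    return lm_head(h)

# Toy data: one layer with variance spread evenly, one collapsed onto a line.
d = 4
iso = np.vstack([np.eye(d), -np.eye(d)])           # covariance proportional to identity
rank1 = np.outer(np.arange(1.0, 9.0), np.ones(d))  # all variance along one direction
best, scores = select_layer({1: rank1, 2: iso})

# Toy hybrid pass: identity bridge, one upper layer, linear head to 5 "tokens".
logits = hybrid_forward(
    np.zeros((3, d)),
    bridge=lambda x: x,
    upper_layers=[lambda h: h + 1.0],
    lm_head=lambda h: h @ np.ones((d, 5)),
)
```

In this toy setup the proxy prefers layer 2, whose states occupy all four dimensions, over layer 1, whose states lie on a single direction. The point of the sketch is the division of labor: the diffusion component only has to produce a continuous hidden state, and the mapping to discrete tokens is left to the retained upper layers and LM head.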