🤖 AI Summary
This study investigates the fundamental differences in internal representation structures between diffusion language models and autoregressive language models, and their implications for inference efficiency. Through layer-wise and token-wise representation analysis, the authors find that the diffusion training objective induces greater hierarchical structure and redundancy in early-layer representations. Leveraging this insight, they propose a static, architecture-agnostic layer-skipping strategy at inference time that achieves substantial efficiency gains without relying on KV cache optimization. Experiments show that skipping redundant layers in native diffusion models reduces FLOPs by 18.75% while preserving over 90% of performance on reasoning and code generation tasks. In contrast, applying the same strategy to autoregressive models leads to significant performance degradation, confirming the unique suitability of the approach for diffusion-based architectures.
📝 Abstract
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives yield more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing a persistent initialization bias. Leveraging this representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% of performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
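The static layer-skipping idea can be illustrated with a minimal sketch. The paper does not publish an implementation here, so all names below (`make_model`, `forward`, the `skip` set) are illustrative assumptions; the toy "layers" simply stand in for transformer blocks, and the skipped indices stand in for early layers identified as redundant:

```python
# Minimal sketch of static inference-time layer skipping (illustrative only;
# names and structure are assumptions, not the authors' implementation).

def make_model(num_layers):
    # Toy stand-in for a stack of transformer blocks: each "layer"
    # appends its index to the running state.
    return [lambda x, i=i: x + [i] for i in range(num_layers)]

def forward(layers, x, skip=frozenset()):
    # Static skipping: the skip set is fixed before inference,
    # independent of the input and the task, with no change to the
    # architecture and no KV-cache sharing.
    for i, layer in enumerate(layers):
        if i in skip:
            continue  # layer flagged as redundant: omit its computation
        x = layer(x)
    return x

layers = make_model(8)
full = forward(layers, [])
# Skipping, say, 2 of 8 layers removes a proportional share of FLOPs;
# the paper's claim is that in native dLLMs the early-layer redundancy
# makes the output change little, unlike in AR models.
reduced = forward(layers, [], skip={1, 2})
```

The key property the sketch captures is that the skip set is decided once, offline, rather than per token or per input, which is what makes the method architecture-agnostic and orthogonal to KV-cache optimizations.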