Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the fundamental differences in internal representation structures between diffusion language models and autoregressive language models, and their implications for inference efficiency. Through layer-wise and token-wise representation analysis, the authors find that the diffusion training objective induces greater hierarchical structure and redundancy in early-layer representations. Leveraging this insight, they propose a static, architecture-agnostic layer-skipping strategy at inference time that achieves substantial efficiency gains without relying on KV cache optimization. Experiments show that skipping redundant layers in native diffusion models reduces FLOPs by 18.75% while preserving over 90% of performance on reasoning and code generation tasks. In contrast, applying the same strategy to autoregressive models leads to significant performance degradation, confirming the unique suitability of the approach for diffusion-based architectures.
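The layer-skipping strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer indices, model depth, and uniform per-layer cost are assumptions. Note that 18.75% corresponds exactly to skipping 6 of 32 layers under a uniform-cost assumption.

```python
# Hypothetical sketch of static inference-time layer skipping.
# The skip set and layer count are illustrative; the paper reports
# up to 18.75% FLOPs reduction, which matches skipping 6 of 32 layers
# if each layer costs roughly the same.

def forward_with_skipping(hidden, layers, skip_set):
    """Run a layer stack, statically skipping the indices in skip_set.

    Skipped layers act as identity: the incoming hidden state is
    passed through unchanged.
    """
    for i, layer in enumerate(layers):
        if i in skip_set:
            continue
        hidden = layer(hidden)
    return hidden

def flops_reduction(num_layers, num_skipped):
    """Approximate FLOPs saved, assuming uniform per-layer cost."""
    return num_skipped / num_layers

# Toy "layers" that each double the input, to show the skip behavior.
layers = [lambda h: 2 * h for _ in range(4)]
out = forward_with_skipping(1, layers, skip_set={0, 1})  # only layers 2, 3 run
print(out)                                # 4
print(f"{flops_reduction(32, 6):.2%}")    # 18.75%
```

Because the skip set is fixed ahead of time (static and task-agnostic), no architectural changes, routing networks, or KV-cache modifications are required at inference.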

📝 Abstract
Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
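The "early-layer redundancy" the abstract identifies is the kind of pattern a simple layer-wise similarity probe can surface: if adjacent layers produce nearly identical hidden states, the later layer is a skipping candidate. The sketch below illustrates this idea; the similarity threshold, toy vectors, and the use of plain cosine similarity (rather than whatever metric the authors use) are assumptions for illustration.

```python
# Minimal sketch of layer-wise redundancy analysis: flag a layer as
# redundant when its output is nearly identical to the previous
# layer's output. Threshold and toy states are illustrative.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def redundant_layers(layer_states, threshold=0.99):
    """Compare each layer's hidden state to the previous layer's.

    Returns (layer_index, is_redundant) pairs for layers 1..L-1.
    """
    flags = []
    for i in range(1, len(layer_states)):
        sim = cosine_similarity(layer_states[i - 1], layer_states[i])
        flags.append((i, sim >= threshold))
    return flags

# Toy hidden states for a 4-layer stack (one vector per layer).
states = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],   # barely changed -> redundant
    [0.5, 0.5, 0.7],     # large change  -> keep
    [0.5, 0.51, 0.69],   # barely changed -> redundant
]
print(redundant_layers(states))  # [(1, True), (2, False), (3, True)]
```

In practice such states would come from a real model's per-layer hidden outputs; the point of the sketch is only the shape of the analysis, which is what motivates a fixed, precomputed skip set rather than dynamic routing.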
Problem

Research questions and friction points this paper is trying to address.

diffusion language models
autoregressive models
representational structure
layer skipping
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion language models
layer skipping
representational analysis
inference efficiency
recency bias