Learning to Skip the Middle Layers of Transformers

📅 2025-06-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address redundant computation in the middle layers of Transformers, this paper proposes a dynamic symmetric skip-layer architecture: (1) a learnable gating mechanism adaptively skips symmetric blocks of layers outward from the network's center; (2) gated attention masks out invalid attention to already-skipped token positions; and (3) residual connections use sandwich or per-layer normalization, coupled with an adaptive regularization loss that controls gate sparsity. The method aims to reduce FLOPs while preserving representational capacity through conditional computation and hierarchical skipping. Although it does not surpass shallower dense baselines on the validation cross-entropy–FLOPs trade-off curve, it introduces a structurally simple, conceptually transparent, and fully learnable skip-layer framework for efficient Transformers.
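The symmetric skipping described above can be sketched as follows. This is a minimal illustration with hard-thresholded gates and plain callables standing in for Transformer blocks; the paper instead learns soft, input-dependent gates, so `forward_skip_middle`, the `gates` list, and the threshold are assumptions for exposition only.

```python
def forward_skip_middle(x, layers, gates, threshold=0.5):
    """Apply a symmetric stack of layers, bypassing a variable span of
    central layers based on gate values (a minimal sketch; the paper
    learns soft, input-dependent gates rather than fixed thresholds).

    layers: list of callables, paired symmetrically around the center.
    gates:  one value in [0, 1] per symmetric pair, outermost first.
    """
    n = len(layers)
    depth = n // 2                      # number of symmetric layer pairs
    keep = depth
    for i, g in enumerate(gates):       # nested gates: the first closed
        if g < threshold:               # gate skips everything inside it
            keep = i
            break
    for i in range(keep):               # outer "entry" layers
        x = layers[i](x)
    if n % 2 == 1 and keep == depth:    # middle layer, only if nothing skipped
        x = layers[depth](x)
    for i in range(n - keep, n):        # matching "exit" layers
        x = layers[i](x)
    return x
```

Because the skipped span always grows outward from the center, closing an outer gate necessarily bypasses every layer nested inside it, which matches the hierarchical-skipping framing above.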

๐Ÿ“ Abstract
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
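The "adaptive regularization loss" that controls gate sparsity can be sketched as a simple feedback controller on the regularization coefficient; the proportional update rule, step size, and function name below are illustrative assumptions, not the paper's exact loss.

```python
def adapt_sparsity_coeff(lam, mean_gate, target_open, step=0.01):
    """Nudge the gate-sparsity coefficient so the average gate openness
    tracks a target (hypothetical proportional controller; the paper's
    adaptive regularization may differ).

    lam:         current regularization coefficient (>= 0).
    mean_gate:   average gate value over the batch, in [0, 1].
    target_open: desired average gate openness.
    """
    # Gates more open than the target -> strengthen the sparsity penalty;
    # gates more closed -> relax it. Clamp at zero.
    return max(0.0, lam + step * (mean_gate - target_open))
```

Adapting the coefficient rather than fixing it lets training hold a roughly constant fraction of skipped layers even as the gate distribution drifts.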
Problem

Research questions and friction points this paper is trying to address.

Reducing redundancy in middle Transformer layers
Dynamic skipping of central Transformer blocks
Improving efficiency via conditional computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic skipping of middle Transformer layers
Learned gating mechanism for layer skipping
Gated attention prevents attending to skipped tokens
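The gated attention idea in the last bullet can be illustrated as an additive mask over key positions: queries simply cannot attend to tokens whose representations were bypassed at the current depth. `gated_attention_weights` and the boolean `active` vector are assumptions for illustration (each row is assumed to have at least one active key).

```python
import numpy as np

def gated_attention_weights(q, k, active):
    """Attention weights where queries cannot attend to key positions
    whose tokens were skipped (inactive) at this layer.

    q, k:   (T, d) arrays of queries and keys.
    active: (T,) boolean array, False where the token was skipped.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])              # (T, T)
    scores = np.where(active[None, :], scores, -np.inf)  # mask skipped keys
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    w = np.exp(scores)                                   # exp(-inf) -> 0
    return w / w.sum(axis=-1, keepdims=True)
```

Masked positions receive exactly zero weight, so skipped tokens contribute nothing to later layers' attention outputs while the remaining weights still sum to one.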