Transformers with Selective Access to Early Representations

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional Transformers: low-level features computed in the earliest layers become harder to recover with depth, and existing remedies are either static (applied uniformly across tokens and heads) or costly in memory and throughput. The authors propose the Selective Access Transformer (SATFormer), which frames the reuse of early representations as a retrieval problem and uses a context-dependent gate to control access to the first-layer value vectors (V_1) while preserving the first-layer value pathway. Evaluated at model scales from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero-shot accuracy over both the standard Transformer and static value-residual baselines. Its largest gains appear on retrieval-intensive benchmarks, where it improves over static value residuals by approximately 1.5 points on average, with throughput and memory usage close to the baseline Transformer.

📝 Abstract

Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput. The usefulness of V_1 is unlikely to be constant across tokens, heads, and contexts; different positions plausibly require different amounts of access to early lexical or semantic information. We therefore treat early-representation reuse as a retrieval problem rather than a connectivity problem, and introduce Selective Access Transformer (SATFormer), which preserves the first-layer value pathway while controlling access with a context-dependent gate. Across models from 130M to 1.3B parameters, SATFormer consistently improves validation loss and zero-shot accuracy over the static value-residual and Transformer baselines. Its strongest gains appear on retrieval-intensive benchmarks, where it improves over static value residuals by approximately 1.5 average points, while maintaining throughput and memory usage close to the baseline Transformer. Gate analyses suggest sparse, depth-dependent, head-specific, and category-sensitive access patterns, supporting the interpretation that SATFormer learns selective reuse of early representations rather than uniform residual copying. Our code is available at https://github.com/SkyeGunasekaran/SATFormer.
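The contrast the abstract draws between a static value residual and context-dependent access can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the gate's form, so the sigmoid gate computed from the hidden state, the additive mixing rule, the negative bias initialization, and all names (`gated_value_access`, `w_gate`, `b_gate`, `lam`) are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def static_value_residual(v_l, v_1, lam):
    """Static baseline: one learned coefficient exposes V_1
    uniformly across all tokens and heads."""
    return v_l + lam * v_1

def gated_value_access(x, v_l, v_1, w_gate, b_gate):
    """Hypothetical context-dependent gate (assumed form, not from the paper).

    x:      (T, d_model)   hidden states at the current layer
    v_l:    (T, H, d_head) current-layer values
    v_1:    (T, H, d_head) first-layer values
    w_gate: (d_model, H)   gate projection, one logit per head
    b_gate: (H,)           gate bias
    """
    gate = sigmoid(x @ w_gate + b_gate)      # (T, H): one gate per token and head
    mixed = v_l + gate[..., None] * v_1      # broadcast the gate over d_head
    return mixed, gate

rng = np.random.default_rng(0)
T, H, d_head, d_model = 4, 2, 3, 6
x = rng.normal(size=(T, d_model))
v_1 = rng.normal(size=(T, H, d_head))
v_l = rng.normal(size=(T, H, d_head))
w_gate = rng.normal(size=(d_model, H))
b_gate = np.full(H, -2.0)  # assumption: negative bias so gates start mostly closed

mixed, gate = gated_value_access(x, v_l, v_1, w_gate, b_gate)
static = static_value_residual(v_l, v_1, lam=0.5)
```

In this sketch the static method is the special case in which the gate ignores context: with `w_gate` set to zero and `b_gate` set to zero, every gate equals 0.5 and the output matches `static_value_residual` with `lam=0.5`. The gated variant differs only in letting that coefficient vary per token and per head.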

Problem

Research questions and friction points this paper is trying to address.

Transformers

early representations

value residuals

context-dependent access

representation reuse

Innovation

Methods, ideas, or system contributions that make the work stand out.

Selective Access

Early Representations

Context-Dependent Gating