🤖 AI Summary
Standard Transformers compute each layer's representations solely from the hidden states of the immediately preceding layer, which limits inter-layer representation reuse and can cause representation collapse and suboptimal performance. This work identifies and analyzes that limitation and proposes Layer-Integrated Memory (LIMe), a simple mechanism that caches hidden states from earlier layers and lets each layer selectively retrieve them, preserving the model's overall memory footprint while expanding its representational capacity. Experiments across multiple architectures and lookup mechanisms show consistent improvements on a wide range of tasks, and analysis of the learned representation dynamics and depthwise circuits shows that LIMe increases representation diversity and mitigates representation collapse.
📝 Abstract
In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
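The abstract describes layers retrieving hidden states from earlier layers rather than only the preceding one. As a minimal illustrative sketch (not the paper's actual implementation), one simple lookup mechanism is a learned softmax-weighted mixture over all cached per-layer states; the class and weight parameterization below are hypothetical:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class CrossLayerMixer:
    """Hypothetical sketch of a LIMe-style lookup: each layer mixes
    hidden states cached from all earlier layers with learned weights."""

    def __init__(self, num_prev_layers):
        # Learned logits over {embedding output, layer 1, ..., layer k-1};
        # initialized uniformly here for illustration.
        self.logits = [0.0] * num_prev_layers

    def __call__(self, cached_states):
        # cached_states: one hidden vector per earlier layer, equal dims.
        weights = softmax(self.logits)
        dim = len(cached_states[0])
        # Convex combination of earlier-layer states, per coordinate.
        return [
            sum(w * h[i] for w, h in zip(weights, cached_states))
            for i in range(dim)
        ]

# With uniform logits the mixer averages the cached states.
mixer = CrossLayerMixer(num_prev_layers=2)
mixed = mixer([[1.0, 2.0], [3.0, 4.0]])
```

Because the cache holds states the forward pass already produced, this kind of lookup reuses rather than duplicates memory, consistent with the abstract's claim that the overall memory footprint is preserved.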