You Do Not Fully Utilize Transformer's Representation Capacity

📅 2025-02-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard Transformers build each layer's representations solely from the immediately preceding layer's hidden states, limiting cross-layer information reuse; the paper argues that this design choice causes representation collapse and suboptimal performance. It introduces Layer-Integrated Memory (LIMe), a simple mechanism that lets the model retrieve hidden states cached from earlier layers, expanding representational capacity while preserving the overall memory footprint. Experiments across architectures and lookup mechanisms show consistent gains on a wide range of tasks, and analyses of representation dynamics and depthwise circuits indicate that LIMe mitigates representation collapse by integrating information across depth.

📝 Abstract
In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model's overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
Problem

Research questions and friction points this paper is trying to address.

Transformers underutilize the representations built in earlier layers
Exclusive reliance on the immediately preceding layer's hidden states causes representation collapse
How can earlier layers' states be made accessible without growing the memory footprint?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-Integrated Memory (LIMe)
Learned lookup over hidden states cached from earlier layers
Expands representational capacity without increasing the overall memory footprint
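To make the idea concrete, here is a minimal sketch of the cross-layer retrieval the abstract describes: instead of feeding a layer only the previous layer's output, hidden states from all earlier layers are cached and blended with learned per-layer weights. The function name `lime_mix` and the softmax-router parameterization are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lime_mix(cached_states, router_logits):
    """Blend hidden states from earlier layers into one representation,
    a sketch of LIMe-style cross-layer retrieval.

    cached_states: list of (seq_len, d_model) arrays, one per earlier layer.
    router_logits: (num_cached,) learned mixing logits (hypothetical
                   parameterization used here for illustration).
    """
    stacked = np.stack(cached_states)       # (num_cached, seq_len, d_model)
    weights = softmax(router_logits)        # convex combination over layers
    # Weighted sum over the layer axis: each position now sees a
    # depth-aware fusion of earlier representations, not just layer l-1.
    return np.tensordot(weights, stacked, axes=(0, 0))  # (seq_len, d_model)

# Toy usage: three cached layers, four tokens, model width 8.
rng = np.random.default_rng(0)
cache = [rng.standard_normal((4, 8)) for _ in range(3)]
logits = np.array([0.1, 0.5, 2.0])  # deeper layers weighted more heavily here
mixed = lime_mix(cache, logits)
print(mixed.shape)  # (4, 8)
```

Because the mixture is a convex combination over already-computed states, the cache adds no new token representations, which is consistent with the abstract's claim that the overall memory footprint is preserved while access to depth is expanded.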