🤖 AI Summary
Deep Transformer networks suffer from token-level information decay: standard residual connections fail to preserve initial token representations effectively across deep layers. To address this, we propose ResFormer, which introduces residual connections directly in the value (V) space of self-attention, the first explicit extension of residual learning to the value dimension, thereby significantly enhancing cross-layer information propagation. Building on this, we design SVFormer, which shares the first layer's value embeddings across all layers to compress the key-value (KV) cache. Experiments show that ResFormer matches the vanilla Transformer's validation loss with 13.3% fewer model parameters and 15.4% less training data. SVFormer reduces KV cache memory by approximately 47% with negligible performance degradation and remains fully compatible with existing KV-efficient optimization techniques. Critically, both methods preserve the standard Transformer architecture, requiring no modifications to the attention or feed-forward network modules.
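The value residual described above can be sketched in a few lines. Below is a minimal single-head NumPy illustration; the function name, the unmasked single-head setup, and the omission of output projections are simplifying assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_value_residual(x, wq, wk, wv, v_first=None):
    """Single-head self-attention with a ResFormer-style value residual:
    deeper layers add the first layer's value matrix to their own values.
    Simplified sketch: no masking, no multi-head split, no output projection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    if v_first is not None:
        v = v + v_first  # residual connection in the value (V) space
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return out, v

# Layer 1 produces v1; a deeper layer reuses it as the residual anchor.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, hidden dim 8
w = [rng.normal(size=(8, 8)) for _ in range(3)]  # toy Q/K/V projections
out1, v1 = attention_with_value_residual(x, *w)                 # layer 1
out2, _ = attention_with_value_residual(out1, *w, v_first=v1)   # deeper layer
```

The key point is that the residual is applied to V before the attention-weighted sum, so every layer's output mixes in the first layer's token-level values directly.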
📝 Abstract
While Transformer models have achieved remarkable success across domains, effective information propagation through deep networks remains a critical challenge: standard hidden-state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by adding value residual connections alongside hidden-state residuals. We also present a variant, SVFormer, in which all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 13.3% fewer model parameters and 15.4% less training data than the vanilla Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty, and it can be combined with other KV-efficient methods for further cache reductions, with performance influenced by sequence length and cumulative learning rate.
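To see where the roughly-half KV-cache saving comes from, note that a standard decoder caches one K and one V tensor per layer, while SVFormer-style sharing caches K per layer but only the first layer's V. A back-of-the-envelope count (assuming equal-sized K and V caches, which is a simplifying assumption):

```python
def cached_tensors_per_token(num_layers, share_values=False):
    """Count of cached K/V tensors per token.
    Standard: K and V for every layer.
    SVFormer-style sharing: K for every layer, but a single shared V."""
    num_k = num_layers
    num_v = 1 if share_values else num_layers
    return num_k + num_v

layers = 24
standard = cached_tensors_per_token(layers)        # 48 tensors per token
shared = cached_tensors_per_token(layers, True)    # 25 tensors per token
saving = 1 - shared / standard                     # about 0.48 for 24 layers
```

As the layer count grows, the saving approaches 50%, consistent with the "nearly half" reduction reported in the abstract.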