🤖 AI Summary
Deep Transformer networks suffer from token-level information decay: standard residual connections fail to preserve initial token representations effectively across deep layers. To address this, we propose ResFormer, which introduces residual connections directly in the value (V) space of self-attention, the first explicit extension of residual learning to the value dimension, thereby significantly enhancing cross-layer information propagation. Building on this, we design SVFormer, which shares the first layer's value embeddings across all layers to compress the key-value (KV) cache. Experiments show that ResFormer matches the vanilla Transformer's validation loss with 13.3% fewer model parameters and 15.4% less training data. SVFormer reduces KV cache memory by approximately 47% with negligible performance degradation and remains fully compatible with existing KV-efficient optimization techniques. Critically, both methods preserve the standard Transformer architecture, requiring no modifications to the attention or feed-forward network modules.
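The value residual described above can be sketched in a few lines. Below is a minimal single-head NumPy illustration; the function name, the unmasked single-head setup, and the omission of output projections are simplifying assumptions for exposition, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_value_residual(x, wq, wk, wv, v_first=None):
    """Single-head self-attention with a ResFormer-style value residual:
    deeper layers add the first layer's value matrix to their own values.
    Simplified sketch: no masking, no multi-head split, no output projection."""
    q, k, v = x @ wq, x @ wk, x @ wv
    if v_first is not None:
        v = v + v_first  # residual connection in the value (V) space
    out = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return out, v

# Layer 1 produces v1; a deeper layer reuses it as the residual anchor.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # 4 tokens, hidden dim 8
w = [rng.normal(size=(8, 8)) for _ in range(3)]  # toy Q/K/V projections
out1, v1 = attention_with_value_residual(x, *w)                 # layer 1
out2, _ = attention_with_value_residual(out1, *w, v_first=v1)   # deeper layer
```

The key point is that the residual is applied to V before the attention-weighted sum, so every layer's output mixes in the first layer's token-level values directly.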
📝 Abstract
While Transformer models have achieved remarkable success across domains, effective information propagation through deep networks remains a critical challenge: standard hidden-state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by adding value residual connections alongside hidden-state residuals. We also present a variant, SVFormer, in which all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 13.3% fewer model parameters and 15.4% less training data than the vanilla Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty, and it can be combined with other KV-efficient methods for further cache reductions, with performance influenced by sequence length and cumulative learning rate.
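To see where the roughly-half KV-cache saving comes from, note that a standard decoder caches one K and one V tensor per layer, while SVFormer-style sharing caches K per layer but only the first layer's V. A back-of-the-envelope count (assuming equal-sized K and V caches, which is a simplifying assumption):

```python
def cached_tensors_per_token(num_layers, share_values=False):
    """Count of cached K/V tensors per token.
    Standard: K and V for every layer.
    SVFormer-style sharing: K for every layer, but a single shared V."""
    num_k = num_layers
    num_v = 1 if share_values else num_layers
    return num_k + num_v

layers = 24
standard = cached_tensors_per_token(layers)        # 48 tensors per token
shared = cached_tensors_per_token(layers, True)    # 25 tensors per token
saving = 1 - shared / standard                     # about 0.48 for 24 layers
```

As the layer count grows, the saving approaches 50%, consistent with the "nearly half" reduction reported in the abstract.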