Value Residual Learning

📅 2024-10-23
📈 Citations: 1
Influential: 1
🤖 AI Summary
Deep Transformer networks suffer from token-level information decay, where standard residual connections fail to preserve initial token representations effectively across deep layers. To address this, we propose ResFormer, which introduces residual connections directly in the value (V) space of self-attention—marking the first explicit extension of residual learning to the V dimension—thereby significantly enhancing cross-layer information propagation. Building upon this, we design SVFormer, which shares the value embeddings from the first layer to compress the key-value (KV) cache. Experiments show that ResFormer achieves comparable validation loss while reducing model parameters by 13.3% and training data requirements by 15.4%. SVFormer reduces KV cache memory by approximately 47%, with negligible performance degradation, and remains fully compatible with existing KV-efficient optimization techniques. Critically, both methods preserve the standard Transformer architecture—requiring no modifications to attention or feed-forward network modules.
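The value residual described above can be sketched in a few lines. The single-head, unmasked attention and the plain additive combination `v + v_first` below are simplifying assumptions for illustration; the paper's full model applies this per head inside an otherwise standard Transformer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_value_residual(q, k, v, v_first):
    """Single-head attention where the current layer's values receive a
    residual connection from the first layer's values (ResFormer-style).
    The additive rule v + v_first is an illustrative assumption."""
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d))
    v_res = v + v_first  # value residual: V_l + V_1
    return scores @ v_res
```

Because attention is linear in the values, this is equivalent to adding `scores @ v_first` on top of standard attention, which is how first-layer token information reaches every deeper layer directly.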

📝 Abstract
While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. A variant, SVFormer, has all layers share the first layer's value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 13.3% fewer model parameters and 15.4% less training data compared to the Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be integrated with other KV-efficient methods for further KV cache reductions, with performance influenced by sequence length and cumulative learning rate.
Problem

Research questions and friction points this paper is trying to address.

How to preserve token-level information across deep Transformer layers
How to match baseline quality with fewer parameters and less training data
How to shrink the KV cache without degrading performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

ResFormer enhances Transformer with value residuals
SVFormer shares first layer value embedding
Matches baseline validation loss with 13.3% fewer parameters and 15.4% less training data
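SVFormer's cache saving follows directly from sharing one value stream across all layers. The toy cache below is a hypothetical sketch (class name, shapes, and API are illustrative, not from the paper): it stores per-layer keys but a single first-layer value cache, so per token it holds L key tensors plus 1 value tensor instead of the standard 2L.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedValueCache:
    """Toy KV cache where every layer reuses the first layer's values
    (SVFormer-style), so only keys are cached per layer."""
    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]  # per-layer key cache
        self.shared_values = []                    # single shared value cache

    def append(self, layer, k, v):
        self.keys[layer].append(k)
        if layer == 0:               # only layer 0 writes values;
            self.shared_values.append(v)  # deeper layers' v is discarded

    def attend(self, layer, q):
        K = np.stack(self.keys[layer])
        V = np.stack(self.shared_values)  # every layer reads layer 0's V
        w = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return w @ V
```

For L layers and T cached tokens this stores L·T keys plus T values, i.e. (L+1)·T tensors versus 2·L·T for a standard per-layer KV cache, approaching the roughly 50% saving reported above as L grows.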
Zhanchao Zhou
Ph.D. student, Westlake University & Zhejiang University
Natural Language Processing
Tianyi Wu
University of Electronic Science and Technology of China
Zhiyun Jiang
China University of Mining and Technology
Fares Obeid
Zhenzhong Lan
School of Engineering, Westlake University
NLP, Computer Vision, Multimedia