Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Transformer decoders face a critical bottleneck in long-sequence inference due to the memory overhead of KV caches. While existing cross-layer sharing methods (e.g., YOCO, CLA) reduce memory usage, they underperform intra-layer approaches such as grouped-query attention (GQA). We observe that top-layer KV states primarily depend on salient information from the bottom and middle layers. Building on this, we propose FusedKV, which reconstructs top-layer KV caches as a learnable fusion of the most informative keys and values from lower layers, applied after RoPE encoding so that relative positional awareness is preserved without re-applying rotary embeddings. We further introduce FusedKV-Lite, a lightweight variant that derives top-layer caches directly from lower layers, trading a slight perplexity increase for lower I/O overhead. Evaluated on LLMs ranging from 332M to 4B parameters, FusedKV reduces KV cache memory by 50% while attaining lower validation perplexity than the standard Transformer decoder, demonstrating a superior trade-off between inference efficiency and model performance.
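
A minimal PyTorch sketch of the fusion idea described above. The gating scheme and all names (`FusedKVCache`, `k_gate`, `v_gate`) are hypothetical illustrations of a learnable cross-layer fusion over post-RoPE caches, not the paper's actual parameterization.

```python
import torch
import torch.nn as nn

class FusedKVCache(nn.Module):
    """Reconstructs a top layer's KV cache as a learnable fusion of
    post-RoPE keys/values cached at a bottom and a middle layer.
    Hypothetical sketch; the paper's architecture may differ."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Learnable per-dimension gates blending the two source layers.
        self.k_gate = nn.Parameter(torch.zeros(head_dim))
        self.v_gate = nn.Parameter(torch.zeros(head_dim))

    def forward(self, bottom_k, mid_k, bottom_v, mid_v):
        # Inputs: [batch, heads, seq, head_dim], already RoPE-encoded,
        # so relative positional information is preserved without
        # re-applying rotary embeddings.
        gk = torch.sigmoid(self.k_gate)  # keys draw on bottom AND middle
        gv = torch.sigmoid(self.v_gate)  # values lean on the bottom layer
        fused_k = gk * mid_k + (1.0 - gk) * bottom_k
        fused_v = gv * mid_v + (1.0 - gv) * bottom_v
        return fused_k, fused_v
```

Because the inputs are already RoPE-encoded, the fused keys keep their relative positional structure, which is the property the summary highlights.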

📝 Abstract
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow into the keys and values of the top layers. Our preliminary analysis reveals a clear division: values are predominantly derived from the bottom layer, while keys draw information from both the bottom and middle layers. Building on this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces KV cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
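
For contrast with the learned fusion above, a hedged sketch of the Lite variant's direct sharing scheme; the function name and tensor layout are illustrative assumptions. Top layers store no KV cache of their own: they attend using the middle layer's cached post-RoPE keys and the bottom layer's cached values.

```python
import torch.nn.functional as F

def top_layer_attention_lite(q, mid_k, bottom_v):
    """Attention for a top layer under direct cross-layer sharing
    (FusedKV-Lite-style); all tensors are [batch, heads, seq, head_dim].

    q:        queries computed at the current top layer
    mid_k:    post-RoPE keys cached once at the middle layer
    bottom_v: values cached once at the bottom layer
    """
    # No fusion parameters and no extra cache reads per top layer:
    # the whole upper stack reuses two layers' caches, which is where
    # the I/O savings over FusedKV come from.
    return F.scaled_dot_product_attention(q, mid_k, bottom_v, is_causal=True)
```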
Problem

Research questions and friction points this paper is trying to address.

Excessive KV cache memory makes long-sequence inference in Transformer decoders prohibitive
Existing cross-layer KV cache sharing (e.g., YOCO, CLA) underperforms within-layer methods like GQA
How to cut cache memory and computational overhead without sacrificing model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

FusedKV reconstructs top-layer KV caches as a learnable fusion of bottom- and middle-layer caches
FusedKV-Lite derives top-layer caches directly from bottom-layer values and middle-layer keys, reducing I/O
Both variants cut KV cache memory by 50% while achieving lower validation perplexity
👥 Authors
Hongzhan Lin
Hong Kong Baptist University
Natural Language Processing, Multimodal Reasoning, Social Computing
Zhiqi Bai
Taobao & Tmall Group of Alibaba
Xinmiao Zhang
The Ohio State University
Sen Yang
Taobao & Tmall Group of Alibaba
Xiang Li
Taobao & Tmall Group of Alibaba
Siran Yang
Taobao & Tmall Group of Alibaba
Yunlong Xu
Alibaba
GPGPU, AI, Recommender System, Parallel Programming, Heterogeneous Computing
Jiaheng Liu
Nanjing University
Yongchi Zhao
Taobao & Tmall Group of Alibaba
Jiamang Wang
Taobao & Tmall Group of Alibaba
Yuchi Xu
Taobao & Tmall Group of Alibaba
Wenbo Su
Taobao & Tmall Group of Alibaba
Bo Zheng
Taobao & Tmall Group of Alibaba