Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Transformer decoders face a critical bottleneck in long-sequence inference due to the memory overhead of KV caches. While existing cross-layer sharing methods (e.g., YOCO, CLA) reduce memory usage, they underperform intra-layer approaches such as grouped-query attention (GQA). We observe that top-layer KV states primarily depend on salient information from the bottom and middle layers. Building on this, we propose FusedKV, which reconstructs top-layer KV caches as a learnable fusion of the most informative keys and values from lower layers, applied after RoPE encoding so that relative positional awareness is preserved without re-applying rotary embeddings. We further introduce FusedKV-Lite, a lightweight variant that derives top-layer caches directly from lower layers, trading a slight perplexity increase for lower I/O overhead. Evaluated on LLMs ranging from 332M to 4B parameters, FusedKV reduces KV cache memory by 50% while attaining lower validation perplexity than the standard Transformer decoder, demonstrating a superior trade-off between inference efficiency and model performance.
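
A minimal PyTorch sketch of the fusion idea described above. The gating scheme and all names (`FusedKVCache`, `k_gate`, `v_gate`) are hypothetical illustrations of a learnable cross-layer fusion over post-RoPE caches, not the paper's actual parameterization.

```python
import torch
import torch.nn as nn

class FusedKVCache(nn.Module):
    """Reconstructs a top layer's KV cache as a learnable fusion of
    post-RoPE keys/values cached at a bottom and a middle layer.
    Hypothetical sketch; the paper's architecture may differ."""

    def __init__(self, head_dim: int):
        super().__init__()
        # Learnable per-dimension gates blending the two source layers.
        self.k_gate = nn.Parameter(torch.zeros(head_dim))
        self.v_gate = nn.Parameter(torch.zeros(head_dim))

    def forward(self, bottom_k, mid_k, bottom_v, mid_v):
        # Inputs: [batch, heads, seq, head_dim], already RoPE-encoded,
        # so relative positional information is preserved without
        # re-applying rotary embeddings.
        gk = torch.sigmoid(self.k_gate)  # keys draw on bottom AND middle
        gv = torch.sigmoid(self.v_gate)  # values lean on the bottom layer
        fused_k = gk * mid_k + (1.0 - gk) * bottom_k
        fused_v = gv * mid_v + (1.0 - gv) * bottom_v
        return fused_k, fused_v
```

Because the inputs are already RoPE-encoded, the fused keys keep their relative positional structure, which is the property the summary highlights.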

📝 Abstract
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigating the KV cache bottleneck, it typically underperforms within-layer methods such as GQA. To understand the root cause, we investigate the information flow into the keys and values of the top layers. Our preliminary analysis reveals a clear division: values are predominantly derived from the bottom layer, while keys draw information from both the bottom and middle layers. Building on this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are derived directly from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed method reduces KV cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing it as a memory-efficient, high-performance architectural alternative.
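
For contrast with the learned fusion above, a hedged sketch of the Lite variant's direct sharing scheme; the function name and tensor layout are illustrative assumptions. Top layers store no KV cache of their own: they attend using the middle layer's cached post-RoPE keys and the bottom layer's cached values.

```python
import torch.nn.functional as F

def top_layer_attention_lite(q, mid_k, bottom_v):
    """Attention for a top layer under direct cross-layer sharing
    (FusedKV-Lite-style); all tensors are [batch, heads, seq, head_dim].

    q:        queries computed at the current top layer
    mid_k:    post-RoPE keys cached once at the middle layer
    bottom_v: values cached once at the bottom layer
    """
    # No fusion parameters and no extra cache reads per top layer:
    # the whole upper stack reuses two layers' caches, which is where
    # the I/O savings over FusedKV come from.
    return F.scaled_dot_product_attention(q, mid_k, bottom_v, is_causal=True)
```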
Problem

Research questions and friction points this paper is trying to address.

Excessive KV cache memory makes long-sequence inference in Transformer decoders prohibitive
Existing cross-layer KV cache sharing (e.g., YOCO, CLA) underperforms within-layer methods like GQA
How to cut cache memory and computational overhead without sacrificing model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

FusedKV reconstructs top-layer KV caches as a learnable fusion of bottom- and middle-layer caches
FusedKV-Lite derives top-layer caches directly from bottom-layer values and middle-layer keys, reducing I/O
Both variants cut KV cache memory by 50% while achieving lower validation perplexity
👥 Authors
Hongzhan Lin
Hong Kong Baptist University
Natural Language Processing, Multimodal Reasoning, Social Computing
Zhiqi Bai
Taobao & Tmall Group of Alibaba
Xinmiao Zhang
The Ohio State University
Sen Yang
Taobao & Tmall Group of Alibaba
Xiang Li
Taobao & Tmall Group of Alibaba
Siran Yang
Taobao & Tmall Group of Alibaba
Yunlong Xu
Alibaba
GPGPU, AI, Recommender System, Parallel Programming, Heterogeneous Computing
Jiaheng Liu
Nanjing University
Yongchi Zhao
Taobao & Tmall Group of Alibaba
Jiamang Wang
Taobao & Tmall Group of Alibaba
Yuchi Xu
Taobao & Tmall Group of Alibaba
Wenbo Su
Taobao & Tmall Group of Alibaba
Bo Zheng
Taobao & Tmall Group of Alibaba