YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing cross-layer key-value (KV) compression methods reduce KV cache memory consumption during large language model inference, but they often incur significant performance degradation. This work proposes YOCO++, which introduces weighted residual connections into the cross-layer KV sharing architecture of YOCO for the first time. By fusing the KVs of each bottom-half layer with those of the bottom layer through a weighted residual connection, YOCO++ increases model capacity while keeping the same training and inference efficiency as YOCO. Combined with an efficient fine-tuning strategy, YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate, outperforming even the standard Transformer, and substantially improves inference quality at high compression ratios.
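To make the 50% figure concrete, the sketch below applies the usual KV cache size formula (keys plus values, per cached layer, per KV head, per token). The model dimensions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(cached_layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of the KV cache in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return 2 * cached_layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical configuration (an assumption, not the paper's setup).
n_layers, kv_heads, head_dim, seq_len = 32, 8, 128, 4096

# Standard Transformer: every layer keeps its own KVs.
full = kv_cache_bytes(n_layers, kv_heads, head_dim, seq_len)

# YOCO-style sharing: only the bottom-half layers keep KVs; the top half
# reuses the middle layer's KVs, so half as many layers are cached.
shared = kv_cache_bytes(n_layers // 2, kv_heads, head_dim, seq_len)

print(full // 2**20, "MiB vs", shared // 2**20, "MiB")
```

Under these assumed dimensions, caching half the layers halves the cache, which is what a 50% compression rate refers to here.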

📝 Abstract

Cross-layer key-value (KV) compression has been found to be effective for efficient inference of large language models (LLMs). Although such methods reduce the memory consumption of the KV cache, they usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.
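The two mechanisms described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, KVs are flattened to plain lists of floats, and the residual is assumed to take the form `kv + weight * kv_bottom` with one scalar weight per layer. The abstract does not give the exact formula, so that form is an assumption.

```python
def kv_source_layer(layer, n_layers):
    """Return the 0-indexed layer whose KVs `layer` attends to.

    Bottom-half layers use their own KVs; top-half layers reuse the KVs
    of the middle layer (the last layer of the bottom half), as in YOCO.
    """
    middle = n_layers // 2 - 1
    return layer if layer <= middle else middle


def add_kv_residual(kv, kv_bottom, weight):
    """Assumed residual form: kv + weight * kv_bottom, elementwise."""
    return [x + weight * b for x, b in zip(kv, kv_bottom)]


def bottom_half_kvs(raw_kvs, weights):
    """Fuse each bottom-half layer's KVs with the bottom layer's KVs.

    raw_kvs: one flat list of floats per bottom-half layer, bottom first.
    weights: one scalar per layer above the bottom layer (assumed to be
             learned; the bottom layer itself is left unchanged here).
    """
    bottom = raw_kvs[0]
    fused = [bottom]
    for kv, w in zip(raw_kvs[1:], weights):
        fused.append(add_kv_residual(kv, bottom, w))
    return fused


# Usage with toy values: two bottom-half layers, one residual weight.
fused = bottom_half_kvs([[1.0, 2.0], [0.5, 0.5]], [0.5])
print(fused)
print([kv_source_layer(i, 8) for i in range(8)])
```

In this sketch the fused KVs replace each layer's own KVs rather than being stored alongside them, so the number of cached layers is the same as in the YOCO-style sharing scheme.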

Problem

Research questions and friction points this paper is trying to address.

KV compression

LLM inference

performance degradation

cross-layer compression

KV cache

Innovation

Methods, ideas, or system contributions that make the work stand out.

KV residual connections

cross-layer KV compression

efficient LLM inference