AI Summary
This work addresses two limitations of current large language model serving: the prefix-dependent key-value (KV) cache, which makes processing retrieval contexts that arrive in arbitrary order inefficient, and the significant accuracy degradation commonly observed in existing position-independent caching (PIC) approaches. To overcome these limitations, the authors propose the first native PIC method within a decoder-only architecture by reintroducing an encoder and training it jointly with the decoder. They further build COMB, a PIC-aware caching system that enables efficient and accurate cache reuse. The approach integrates seamlessly with mainstream inference frameworks and applies broadly across models such as DeepSeek-V2-Lite-Chat, achieving a 51-94% reduction in first-token latency, up to 3× higher throughput, and comparable accuracy, thereby overcoming the practicality bottleneck of prior PIC methods.
Abstract
The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3× with comparable accuracy. Furthermore, the quality improvement when using DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at https://github.com/shijuzhao/Comb.
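The contrast between prefix-based and position-independent caching can be illustrated with a minimal sketch. This is not the paper's implementation (COMB operates on real KV tensors inside an inference engine); it only shows, with content-keyed dictionary entries standing in for KV caches, why keying each retrieved chunk by its own content rather than by the full token prefix lets cache entries be reused when chunks arrive in a different order. All names (`PICCache`, `chunk_key`, the `KV(...)` placeholder) are illustrative assumptions.

```python
import hashlib

def chunk_key(chunk: str) -> str:
    # Position-independent: the key depends only on the chunk's own
    # content, not on what precedes it in the prompt.
    return hashlib.sha256(chunk.encode()).hexdigest()

class PICCache:
    """Toy content-keyed cache; values stand in for per-chunk KV tensors."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_kv(self, chunk: str) -> str:
        key = chunk_key(chunk)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = f"KV({chunk})"  # placeholder, not a real tensor
        return self.store[key]

cache = PICCache()
docs = ["doc A", "doc B", "doc C"]
for d in docs:            # first request: every chunk is a miss
    cache.get_kv(d)
for d in reversed(docs):  # reordered request: every chunk is a hit
    cache.get_kv(d)
print(cache.hits, cache.misses)  # -> 3 3
```

A prefix-based cache, by contrast, would key on the concatenation of all preceding tokens, so the reversed request above would miss on every chunk; recovering accuracy under this content-only keying is exactly the problem the encoder-augmented training in the paper targets.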