AI Summary
This work addresses two limitations of current large language model serving: the prefix-dependent key-value (KV) cache, which makes processing retrieval contexts that arrive in arbitrary order inefficient, and the significant accuracy degradation commonly observed in existing position-independent caching (PIC) approaches. To overcome these limitations, the authors propose the first native PIC method within a decoder-only architecture by reintroducing an encoder and training it jointly with the decoder. They further build COMB, a PIC-aware caching system that enables efficient and accurate cache reuse. The approach integrates seamlessly with mainstream inference frameworks and applies broadly across models such as DeepSeek-V2-Lite-Chat, achieving a 51-94% reduction in first-token latency, up to 3× higher throughput, and comparable accuracy, thereby overcoming the practicality bottleneck of prior PIC methods.
Abstract
The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3× with comparable accuracy. Furthermore, the quality improvement when using DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at https://github.com/shijuzhao/Comb.
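The contrast between prefix-based and position-independent caching can be illustrated with a minimal sketch. This is not the paper's implementation (COMB operates on real KV tensors inside an inference engine); it only shows, with content-keyed dictionary entries standing in for KV caches, why keying each retrieved chunk by its own content rather than by the full token prefix lets cache entries be reused when chunks arrive in a different order. All names (`PICCache`, `chunk_key`, the `KV(...)` placeholder) are illustrative assumptions.

```python
import hashlib

def chunk_key(chunk: str) -> str:
    # Position-independent: the key depends only on the chunk's own
    # content, not on what precedes it in the prompt.
    return hashlib.sha256(chunk.encode()).hexdigest()

class PICCache:
    """Toy content-keyed cache; values stand in for per-chunk KV tensors."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_kv(self, chunk: str) -> str:
        key = chunk_key(chunk)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = f"KV({chunk})"  # placeholder, not a real tensor
        return self.store[key]

cache = PICCache()
docs = ["doc A", "doc B", "doc C"]
for d in docs:            # first request: every chunk is a miss
    cache.get_kv(d)
for d in reversed(docs):  # reordered request: every chunk is a hit
    cache.get_kv(d)
print(cache.hits, cache.misses)  # -> 3 3
```

A prefix-based cache, by contrast, would key on the concatenation of all preceding tokens, so the reversed request above would miss on every chunk; recovering accuracy under this content-only keying is exactly the problem the encoder-augmented training in the paper targets.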