LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory and latency bottlenecks caused by the rapid growth of the key-value (KV) cache during long-context inference in large language models. The authors propose a fine-grained hybrid-head attention mechanism: a small set of retrieval heads dynamically selects critical tokens with a hardware-efficient top-k strategy, and the remaining sparse heads reuse those tokens, which preserves both computational efficiency and the functional diversity of attention heads. A HardKuma-based mechanism determines how heads are partitioned between the two roles. Unlike coarse-grained approaches that share a single set of critical tokens across layers, this design avoids the quality loss that comes from treating all heads alike. Experiments on models such as Llama3 and Qwen3 show generation quality comparable to, and at times better than, full attention, with up to a 2.7× speedup at a context length of 128K.

📝 Abstract
The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, that of the full-attention baseline. Crucially, this is accomplished with up to a 2.7× speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
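
The head split described in the abstract can be illustrated with a small sketch. This is a pure-Python toy under stated assumptions, not the authors' implementation: the function names (`retrieval_head`, `sparse_head`), the toy sizes, the shared key/value cache, and the choice to rank tokens by scaled dot-product score are all hypothetical. A retrieval head scores every cached position and keeps the top-k; a sparse head then attends only to those shared positions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, idx):
    """Softmax attention for one query, restricted to cached positions in idx."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]

def retrieval_head(q, keys, values, k):
    """Attend over the full cache and report the k highest-scoring positions."""
    n = len(keys)
    scores = [dot(q, keys[i]) for i in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return attend(q, keys, values, list(range(n))), sorted(top)

def sparse_head(q, keys, values, shared_idx):
    """Attend only over positions selected by a retrieval head."""
    return attend(q, keys, values, shared_idx)

# Toy cache (assumed sizes): 64 cached tokens, head dimension 8, budget k = 8.
random.seed(0)
n_tokens, dim, k = 64, 8, 8
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
q_retrieval = [random.gauss(0, 1) for _ in range(dim)]
q_sparse = [random.gauss(0, 1) for _ in range(dim)]

full_out, shared_idx = retrieval_head(q_retrieval, keys, values, k)
sparse_out = sparse_head(q_sparse, keys, values, shared_idx)
print(len(shared_idx), "of", n_tokens, "positions read by the sparse head")
```

In this toy setting the sparse head touches one eighth of the cache. The speedup reported above comes from the real system and its kernels, not from anything this sketch measures.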
Problem

Research questions and friction points this paper is trying to address.

long-context LLM
key-value cache
attention head diversity
decoding efficiency
memory bottleneck
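
To make the memory bottleneck concrete, the KV cache size can be estimated with simple arithmetic. The sketch below assumes a model shape resembling Llama-3-8B (32 layers, 8 key-value heads, head dimension 128) stored in 16-bit precision; these numbers are assumptions for illustration and do not come from this page.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per position, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed shape: 32 layers, 8 KV heads, head dimension 128, fp16 (2 bytes).
for context_k in (8, 32, 128):
    size = kv_cache_bytes(32, 8, 128, context_k * 1024)
    print(f"{context_k}K tokens: {size / 2**30:.1f} GiB")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of cache, and full attention reads all of it at every decoding step, which is the cost that sparse decoding aims to cut.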
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid-head attention
sparse decoding
key-value cache optimization
long-context LLM inference
HardKuma-based selection
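
For readers unfamiliar with HardKuma, the following sketch samples from it. HardKuma is a Kumaraswamy distribution stretched slightly beyond [0, 1] and then clamped, so it places nonzero probability on exactly 0 and exactly 1 while staying continuous in between. The stretch bounds (-0.1, 1.1), the shape parameters, and the reading of a gate value of 1 as "retrieval head" and 0 as "sparse head" are assumptions for illustration; this page does not say how LycheeDecode parameterizes or trains the gate.

```python
import random

def sample_hardkuma(a, b, lo=-0.1, hi=1.1, rng=random):
    """Draw one HardKuma sample in [0, 1] with point masses at 0 and 1."""
    u = rng.random()
    # Inverse CDF of Kumaraswamy(a, b), whose CDF is 1 - (1 - x**a)**b.
    kuma = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)
    # Stretch the support to (lo, hi), then clamp back to [0, 1].
    stretched = lo + (hi - lo) * kuma
    return min(1.0, max(0.0, stretched))

# Hypothetical per-head gate: a = b = 0.5 gives a U-shaped density.
rng = random.Random(0)
gates = [sample_hardkuma(0.5, 0.5, rng=rng) for _ in range(10_000)]
n_zero = sum(g == 0.0 for g in gates)  # read here as "sparse head"
n_one = sum(g == 1.0 for g in gates)   # read here as "retrieval head"
print(n_zero, n_one, len(gates) - n_zero - n_one)
```

Because the clamp produces exact zeros and ones, a gate like this can yield a discrete head assignment while the sampled value remains a differentiable function of `a` and `b` away from the clamped regions.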
Gang Lin
Harbin Institute of Technology, Shenzhen
Dongfang Li
Harbin Institute of Technology, Shenzhen
Natural Language Processing, Large Language Models
Zhuoen Chen
Harbin Institute of Technology, Shenzhen
Yukun Shi
Harbin Institute of Technology, Shenzhen
Xuhui Chen
San Francisco State University
Computer Science
Baotian Hu
Harbin Institute of Technology (Shenzhen)
LLM, MLLM, NLP
Min Zhang
Harbin Institute of Technology, Shenzhen