ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformer-based long-context inference is hindered by the quadratic computational complexity of self-attention. Existing compression methods often compromise semantic fidelity or training/inference efficiency. This paper proposes a lightweight, plug-and-play framework featuring two novel, synergistic adapters: (1) a QK Adapter that compresses query and key representations while distilling attention distributions; and (2) a Chunk Adapter that dynamically identifies semantically coherent chunk boundaries and selectively activates chunking only when necessary. Crucially, the backbone model remains entirely frozen during adaptation, ensuring parameter efficiency and preservation of pre-trained knowledge. Experiments demonstrate that our method maintains baseline performance on short-text tasks, retains 98.64% of original accuracy on 120K-token sequences, reduces KV cache usage by 51.42%, and achieves up to 4.48× inference speedup—substantially enhancing both efficiency and practicality of long-context reasoning.
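The summary's key inference-time idea is that chunk selection runs only when the current token is detected as a chunk boundary; between boundaries the previous selection is reused. A minimal sketch of that control flow, assuming a hypothetical boundary predicate and chunk-scoring function (the function names, scoring rule, and top-k selection are our illustrative assumptions, not the paper's released code):

```python
# Sketch of boundary-triggered chunk selection during decoding.
# `is_boundary` stands in for the Chunk Adapter's boundary detector;
# `chunk_scores_at` stands in for chunk attention from the QK Adapter.

def select_top_chunks(chunk_scores, k):
    """Pick indices of the k highest-scoring chunks (assumed relevance scores)."""
    return sorted(range(len(chunk_scores)), key=lambda i: -chunk_scores[i])[:k]

def generate(tokens, is_boundary, chunk_scores_at, k=2):
    """Simulate decoding: re-select chunks only at boundary tokens."""
    active_chunks = []          # chunk indices currently kept for attention
    selections = 0              # how many times selection was triggered
    for step, tok in enumerate(tokens):
        if is_boundary(tok):    # boundary detected -> refresh the selection
            active_chunks = select_top_chunks(chunk_scores_at(step), k)
            selections += 1
        # ...attention would run over `active_chunks` only, shrinking the
        # effective KV cache and skipping selection on non-boundary steps...
    return active_chunks, selections
```

Because selection is skipped on non-boundary tokens, most decoding steps pay no selection cost, which is where the reported speedup on long inputs comes from.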

📝 Abstract
Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to self-attention's quadratic complexity in the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they suffer either from semantic incompleteness or from poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: a QK Adapter (Q-Adapter and K-Adapter) and a Chunk Adapter. The former is attached to each Transformer layer, serving the dual purposes of feature compression and chunk-attention acquisition. The latter operates at the bottommost layer of the model, detecting chunk boundaries by leveraging contextual semantic information. During training, the backbone's parameters remain frozen, with only the QK Adapter and Chunk Adapter being updated. Notably, we design an attention distillation method for training the QK Adapter, which improves the recall rate of key chunks. During inference, chunk selection is triggered only when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of baseline performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token texts.
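The abstract describes an attention-distillation objective for training the QK Adapter: the adapter's chunk-attention distribution is pushed toward the frozen backbone's. A minimal sketch of one plausible such objective, a KL divergence between the two distributions (the paper does not spell out the exact formula here, so the softmax/KL form and all names below are our assumptions):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, eps=1e-12):
    """KL(teacher || student) over chunk-attention distributions.

    teacher_logits -- frozen backbone's attention logits over chunks
    student_logits -- QK Adapter's compressed attention logits
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

Only the adapter parameters would receive gradients from this loss; the teacher distribution comes from the frozen backbone, which is what keeps pre-trained knowledge intact.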
Problem

Research questions and friction points this paper is trying to address.

Addresses computational inefficiency in Transformer self-attention mechanisms
Solves semantic incompleteness and poor training-inference efficiency issues
Accelerates LLM inference while maintaining performance on long contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses QK Adapter for feature compression and chunk attention
Employs Chunk Adapter to detect chunk boundaries semantically
Accelerates inference via chunk selection and attention distillation
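The reported 48.58% KV-cache retention (i.e., a 51.42% reduction) follows directly from keeping key/value entries only for tokens inside selected chunks. A tiny illustrative helper, with hypothetical names, showing that relationship:

```python
def kv_retention(chunk_lengths, selected):
    """Fraction of the KV cache kept when only `selected` chunk indices survive.

    chunk_lengths -- token count of each chunk in the context
    selected      -- indices of chunks retained after selection
    """
    total = sum(chunk_lengths)
    kept = sum(chunk_lengths[i] for i in selected)
    return kept / total
```

For example, keeping one 20-token chunk out of a 40-token context yields a retention rate of 0.5, mirroring how the paper's ~48.6% figure aggregates over its benchmarks.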
Haojie Ouyang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Jianwei Lv
Li Auto
Lei Ren
Li Auto
NLP, LLM, VLM
Chen Wei
Li Auto
Xiaojie Wang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications
Fangxiang Feng
Beijing University of Posts and Telecommunications
Multimodal Learning, Image Synthesis