APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs

📅 2025-02-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the computational and communication bottlenecks in the prefill phase of long-context LLM inference, this work proposes a KV-block compression-and-distribution mechanism under sequence parallelism: critical compressed context blocks are transmitted collaboratively across GPUs under approximate attention, jointly optimizing compute reduction and communication efficiency. The method integrates multi-host approximate attention, customized FlashAttention kernels, sequence-parallel scheduling, and selective KV-block compression. Across diverse models and parallel configurations, experiments show no observable task-performance degradation while accelerating prefill by up to 9.2×, 4.2×, and 1.6× over FlashAttention, RingAttention, and StarAttention, respectively. The core innovation is the cross-GPU communication mechanism for compressed KV blocks and its co-optimization with approximate attention.
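The summary describes the mechanism only at a high level. Below is a minimal single-process simulation of the idea as stated, not the paper's actual API: `compress_kv`, `attend`, the query-aware top-k selection, and all sizes are illustrative assumptions. Each host attends over its local block plus the compressed KV blocks received from earlier hosts instead of over the full sequence; the serial loop stands in for what runs in parallel across GPUs.

```python
# Single-process sketch of APB-style prefill under sequence parallelism.
# All names and shapes are illustrative assumptions, not the paper's API.
import torch

torch.manual_seed(0)
NUM_HOSTS, BLOCK, D, KEEP = 4, 128, 64, 16  # KEEP = compressed KV entries kept per block

def compress_kv(k, v, q):
    # Illustrative query-aware selection: keep the KEEP keys receiving the
    # highest attention mass from this block's queries (a stand-in for the
    # paper's selective KV-block compression).
    scores = (q @ k.transpose(-1, -2)).softmax(-1).sum(0)  # [BLOCK]
    idx = scores.topk(KEEP).indices
    return k[idx], v[idx]

def attend(q, k, v):
    # Plain scaled dot-product attention over (received + local) KV;
    # within-block causal masking omitted for brevity.
    w = (q @ k.transpose(-1, -2) / D**0.5).softmax(-1)
    return w @ v

# Each "host" owns one contiguous context block, projected to Q/K/V here as
# random tensors (projection weights omitted).
qkv = [tuple(torch.randn(BLOCK, D) for _ in range(3)) for _ in range(NUM_HOSTS)]

passed = []   # compressed KV blocks "communicated" onward to later hosts
outputs = []
for h, (q, k, v) in enumerate(qkv):
    # Prepend every compressed block received from earlier hosts.
    ks = torch.cat([pk for pk, _ in passed] + [k])
    vs = torch.cat([pv for _, pv in passed] + [v])
    outputs.append(attend(q, ks, vs))
    passed.append(compress_kv(k, v, q))  # pass a compressed block onward

print(outputs[-1].shape)  # torch.Size([128, 64]); the last host saw all earlier blocks compressed
```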

📝 Abstract
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB at https://github.com/thunlp/APB.
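For a rough sense of why passing compressed blocks reduces prefill compute, here is a back-of-the-envelope attention-cost comparison. All numbers are our own illustrative assumptions, not from the paper; c denotes the assumed size of each compressed passed block.

```python
# Back-of-the-envelope attention-FLOP comparison (illustrative numbers only;
# real speedups also depend on kernels, communication, and non-attention cost).
n, p, c, d = 131_072, 8, 2_048, 128  # context length, hosts, compressed KV per block, head dim

full = n * n * d                 # dense prefill attention on one device (FlashAttn-style)
ring = full // p                 # RingAttn: same total work, split evenly across p hosts
b = n // p                       # local block length per host under sequence parallelism
apb = b * (b + c * (p - 1)) * d  # worst case (last host): local block + p-1 compressed blocks

print(f"per-host attention FLOPs: ring={ring:.2e}, apb={apb:.2e}, ratio={ring / apb:.1f}x")
```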
Problem

Research questions and friction points this paper is trying to address.

Enhance long-context inference speed
Reduce compute in prefill phase
Improve parallelism across GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-host approximate attention
Cross-GPU communication of compressed context (KV) blocks
Tailored FlashAttn kernel (see the stand-in sketch after this list)
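The tailored FlashAttn kernel named above is specific to APB's codebase and fuses attention over passed-plus-local KV. As a stand-in, PyTorch's built-in scaled_dot_product_attention (which dispatches to FlashAttention on supported GPUs) can illustrate the call shape: compressed passed KV is prepended to the local block, and causality is enforced only within the local block. All sizes below are assumptions.

```python
# Stand-in for APB's tailored FlashAttn kernel: PyTorch SDPA is used purely
# to illustrate the call shape; the paper ships its own fused kernel.
import torch
import torch.nn.functional as F

B, H, LOCAL, PASSED, D = 1, 8, 1024, 128, 128  # illustrative sizes
q = torch.randn(B, H, LOCAL, D)
k_local, v_local = torch.randn(2, B, H, LOCAL, D)
k_pass, v_pass = torch.randn(2, B, H, PASSED, D)  # compressed KV from earlier hosts

# Prepend the passed compressed KV; local queries may attend to all of it,
# while causality is enforced only within the local block.
k = torch.cat([k_pass, k_local], dim=2)
v = torch.cat([v_pass, v_local], dim=2)
mask = torch.ones(LOCAL, PASSED + LOCAL, dtype=torch.bool)
mask[:, PASSED:] = torch.tril(torch.ones(LOCAL, LOCAL, dtype=torch.bool))

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 1024, 128])
```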
👥 Authors

Yuxiang Huang
Tsinghua University
Efficient AI · Natural Language Processing · Machine Learning System

Mingye Li
Department of CS&T, Central South University, Changsha, China

Xu Han
NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China

Chaojun Xiao
Postdoctoral Researcher, Tsinghua University
Large Language Model

Weilin Zhao
Tsinghua University
Natural Language Processing · Artificial Intelligence · Efficient LLM

Ao Sun
BUPT, Beijing, China

Hao Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc.

Jie Zhou
Pattern Recognition Center, WeChat AI, Tencent Inc.

Zhiyuan Liu
NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China

Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing · Artificial Intelligence · Social Computing