Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

📅 2025-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the GPU memory bottleneck caused by KV cache growth across dialogue rounds in large language model (LLM) inference, this paper proposes a round-level attention mechanism. The authors first identify a “watershed layer” in LLM inference, after which the distribution of round-level attention shows notable similarity, a regularity the summary describes as previously unrecognized. Building on this observation, they introduce a round-aware relevance modeling and dynamic cache pruning framework, described as the first of its kind, that selectively reuses the KV cache of the most relevant historical rounds and applies lightweight reweighting. Experiments on multiple long-context benchmarks show that the method preserves the original model's accuracy (within ±0.3% absolute change), reduces GPU memory consumption by 55%, and decreases end-to-end inference latency by 32%.
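To make the memory bottleneck concrete, the following is a minimal sketch of how KV cache size scales with the number of dialogue rounds. The model shape (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) and the 500 tokens per round are hypothetical values chosen for illustration. They are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape and dialogue length (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TOKENS_PER_ROUND = 500

for rounds in (1, 10, 50):
    size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, rounds * TOKENS_PER_ROUND)
    print(f"{rounds:>3} rounds: {size / 2**20:.1f} MiB")
```

The cache grows linearly with the total number of tokens kept, so retaining only a subset of rounds shrinks it proportionally.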

📝 Abstract

The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as conversation rounds accumulate, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency, and even the availability, of model serving systems. This paper analyzes dialogue data from real users and finds that LLM inference exhibits a watershed layer, after which the distribution of round-level attention shows notable similarity. We propose Round Attention, a novel round-level attention mechanism that recalls and computes only the KV cache of the most relevant rounds. Experiments show that our method reduces memory usage by 55% without compromising model performance.
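The idea of recalling only the most relevant rounds can be illustrated with a small sketch. The abstract does not say how relevance is scored, so the scoring rule here (mean dot product between the current query and each round's cached keys) is an assumption, as are the function names `score_rounds` and `select_top_rounds`. The watershed layer is not modeled.

```python
def score_rounds(query, round_keys):
    """Return one relevance score per round.

    Assumed scoring rule (not from the paper): the mean dot product
    between the query vector and every cached key vector of the round.
    """
    scores = []
    for keys in round_keys:
        dots = [sum(q * k for q, k in zip(query, key)) for key in keys]
        scores.append(sum(dots) / len(dots))
    return scores


def select_top_rounds(query, round_keys, top_k):
    """Return indices of the top_k highest-scoring rounds, in chronological order."""
    scores = score_rounds(query, round_keys)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy example: three past rounds, each with a few 2-dimensional key vectors.
query = [1.0, 0.0]
round_keys = [
    [[0.9, 0.1], [0.8, 0.0]],  # round 0
    [[0.0, 1.0]],              # round 1
    [[0.5, 0.5], [0.7, 0.2]],  # round 2
]
selected = select_top_rounds(query, round_keys, top_k=2)
print(selected)  # rounds whose KV cache would be recalled for attention
```

In this sketch, only the KV cache of the selected rounds would be loaded for the attention computation, while the other rounds stay out of GPU memory.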

Problem

Research questions and friction points this paper is trying to address.

Reduce GPU memory usage

Optimize LLM inference efficiency

Enhance model serving systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Round-Level Attention Mechanism

KV Cache Optimization

Memory Usage Reduction
