LongFlow: Efficient KV Cache Compression for Reasoning Models

📅 2026-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high memory consumption and bandwidth pressure caused by KV caching in long-sequence generation, where existing compression methods suffer from low efficiency and substantial overhead in importance estimation for long outputs. The authors propose LongFlow, which introduces an attention-aware importance metric with negligible computational cost and no additional storage requirements. By integrating FlashAttention, importance evaluation, and token eviction into a single customized kernel, LongFlow enables efficient dynamic KV cache compression. Experimental results demonstrate that LongFlow achieves up to an 11.8× throughput improvement and 80% KV cache compression while preserving model accuracy.

📝 Abstract
Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8× throughput improvement with 80% KV cache compression and minimal impact on model accuracy.
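The abstract describes scoring cached tokens with an importance metric computed from an intermediate of attention using only the current query, then evicting low-importance tokens. A minimal sketch of that idea, assuming the score is simply the current query's softmax attention weight over cached keys (the paper's exact metric, compression schedule, and fused-kernel implementation are not specified here; the function name and parameters are hypothetical):

```python
import numpy as np

def evict_kv_cache(keys, values, query, keep_ratio=0.2):
    """Hypothetical attention-aware KV cache eviction.

    keys, values: (seq_len, d) cached key/value vectors.
    query: (d,) current decoding query.
    Keeps only the `keep_ratio` fraction of tokens the current
    query attends to most strongly.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)        # attention logits, shape (seq_len,)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    importance = weights / weights.sum()      # per-token attention weights
    n_keep = max(1, int(keep_ratio * keys.shape[0]))
    # Indices of the n_keep most-attended tokens, kept in original order
    # so positional structure of the cache is preserved.
    idx = np.sort(np.argsort(importance)[-n_keep:])
    return keys[idx], values[idx]
```

In the paper's system this logic is reportedly fused with FlashAttention into one kernel, so the importance scores reuse intermediates already computed during attention rather than being recomputed as above.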
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
long-output generation
reasoning models
memory efficiency
attention computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache compression
long-output reasoning
efficient attention
importance estimation
custom kernel