🤖 AI Summary
This work addresses the memory bottleneck that the KV cache imposes on long-context code reasoning in agentic coding tasks. Existing compression methods rely solely on attention signals and therefore often discard structurally critical tokens. The paper proposes CodeComp, a training-free, structure-aware KV cache compression framework that, for the first time, integrates static program analysis into cache management, using Code Property Graph priors extracted by Joern. These structural priors guide compression decisions toward preserving semantically essential code elements such as call sites, branch conditions, and assignments, without modifying the underlying model. Integrated into SGLang-based agentic coding pipelines, the approach consistently outperforms attention-only baselines under identical memory budgets. Even under aggressive compression, it recovers most of the full-context accuracy while keeping patch generation quality nearly on par with uncompressed inference.
📝 Abstract
Agentic code tasks such as fault localization and patch generation require processing long codebases under tight memory constraints, where the Key-Value (KV) cache becomes the primary inference bottleneck. Existing compression methods rely exclusively on attention signals to estimate token importance, and so systematically discard structurally critical tokens such as call sites, branch conditions, and assignments that are essential for code understanding. We present CodeComp, a training-free KV cache compression framework that incorporates static program analysis into LLM inference via Code Property Graph priors extracted by Joern. Across bug localization and code generation benchmarks, CodeComp consistently outperforms attention-only compression baselines under equal memory budgets and recovers the majority of full-context accuracy under aggressive KV cache compression. It also matches the patch generation quality of uncompressed full-context inference and integrates into SGLang-based agentic coding pipelines without model modification.
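To make the contrast between attention-only and structure-aware eviction concrete, here is a minimal sketch. It is not CodeComp's actual scoring rule, which the abstract does not specify. It assumes that each cached token has a scalar attention score and that a static analysis pass has already flagged some token positions as structurally critical. The function `select_kv_positions`, the additive `structural_bonus`, and the toy scores are all hypothetical names and values chosen for illustration.

```python
from typing import List, Sequence, Set


def select_kv_positions(
    attention_scores: Sequence[float],
    structural_positions: Set[int],
    budget: int,
    structural_bonus: float = 1.0,
) -> List[int]:
    """Return the token positions to keep in the KV cache, in original order.

    Hypothetical scoring: attention score plus a fixed bonus for positions
    flagged as structurally critical (e.g. call sites, branch conditions).
    This additive rule is an assumption, not the paper's method.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Combine the attention signal with the structural prior.
    combined = [
        score + (structural_bonus if pos in structural_positions else 0.0)
        for pos, score in enumerate(attention_scores)
    ]
    # Rank by combined score, highest first; ties go to the earlier position.
    ranked = sorted(range(len(combined)), key=lambda pos: (-combined[pos], pos))
    # Keep the top `budget` positions and restore their original order.
    return sorted(ranked[:budget])


# Toy example: positions 1 and 3 receive little attention but are flagged
# as structurally critical (say, a call site and a branch condition).
attention = [0.30, 0.05, 0.25, 0.02, 0.20, 0.18]
attention_only = select_kv_positions(attention, set(), budget=3)
structure_aware = select_kv_positions(attention, {1, 3}, budget=3)
print(attention_only)   # [0, 2, 4]
print(structure_aware)  # [0, 1, 3]
```

Under the same budget of three entries, the attention-only selection evicts both flagged positions, while the structure-aware selection keeps them and gives up two tokens that had higher attention scores. A real system would derive the flagged positions from the Code Property Graph and would likely use a more careful weighting than a constant bonus.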