🤖 AI Summary
Large language models (LLMs) applied to repository-level code tasks are often hindered by contextual noise, context-window truncation, and high inference latency. This work presents the first systematic empirical study of context compression for the code domain, evaluating eight methods across three paradigms (discrete token sequences, continuous latent vectors, and visual tokens) on code completion and generation. The experiments show that context compression mitigates noise and reduces inference cost, and that it can even improve task performance: at a 4× compression ratio, continuous latent vector approaches achieve up to a 28.3% relative improvement in BLEU score over the full-context baseline. End-to-end latency falls by up to 50% at high compression ratios for both visual and text-based compression, approaching the cost of inference without any repository context.
📝 Abstract
Repository-level code intelligence tasks require large language models (LLMs) to process long, multi-file contexts. Such inputs introduce three challenges: crucial context can be obscured by noise, inputs can be truncated by limited context windows, and inference latency increases. Context compression mitigates these risks by condensing inputs. Although it has been studied in NLP, its applicability to code tasks remains largely unexplored. We present the first systematic empirical study of context compression for repository-level code intelligence, organizing eight methods into three paradigms: discrete token sequences, continuous latent vectors, and visual tokens. We evaluate them on code completion and code generation, measuring both task performance and efficiency. The results show that context compression is effective: at 4× compression, continuous latent vector methods surpass full-context performance by up to 28.3% in BLEU score, indicating that they filter noise rather than merely truncating the input. On efficiency, all paradigms reduce inference cost. Both visual and text-based compression achieve up to a 50% reduction in end-to-end latency at high compression ratios, approaching the cost of inference without repository context. These findings establish context compression as a viable approach and provide guidance for paradigm selection.
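To make the two headline quantities concrete, the following is a minimal sketch of how a compression ratio and a relative score improvement are commonly computed. It assumes the ratio is original context tokens divided by compressed tokens and that the BLEU gain is measured relative to the full-context baseline; the abstract does not give these formulas. The function names and all numbers are hypothetical and are not taken from the paper's results.

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Ratio of original to compressed context length (assumed definition)."""
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")
    return original_tokens / compressed_tokens


def relative_improvement(compressed_score: float, baseline_score: float) -> float:
    """Percentage gain of the compressed-context score over the baseline score."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return (compressed_score - baseline_score) / baseline_score * 100.0


# Illustrative values only (not results from the paper):
# an 8000-token repository context condensed to 2000 tokens is 4x compression.
ratio = compression_ratio(8000, 2000)

# A hypothetical BLEU of 25.0 with compression versus 20.0 with full context.
gain = relative_improvement(25.0, 20.0)

print(f"compression ratio: {ratio:.1f}x, relative BLEU gain: {gain:.1f}%")
```

Under this reading, a positive gain at a ratio above 1 means the compressed context outperforms the full context, which matches the noise-filtering interpretation given in the abstract.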