🤖 AI Summary
To address the inference latency caused by excessively long contexts in retrieval-augmented code completion, this paper proposes a semantic compression framework. It introduces a learnable, lightweight projection module that compresses raw code context into a single-token semantic vector, trained end-to-end with the code large language model to ensure semantic alignment between the compressed representation and the decoder. While preserving retrieval effectiveness within the RAG paradigm, the method substantially reduces input sequence length. Experiments on online code completion show a 20–38% reduction in first-token latency, alongside simultaneous improvements in Exact Match (EM) and Edit Similarity (ES) metrics. The approach thus achieves a favorable trade-off between inference efficiency and generation quality, effectively supporting interactive programming scenarios.
📝 Abstract
Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference, a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by a code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we significantly increase the Exact Match (EM) and Edit Similarity (ES) metrics of the coding model with negligible latency overhead. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
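To make the compression idea concrete, here is a minimal sketch of the kind of projector described above. All names and dimensions (`D_ENC`, `D_LLM`, mean-pooling, a single linear map) are illustrative assumptions, not the paper's actual architecture: each retrieved code chunk is pooled and projected into one vector in the LLM's embedding space, so it occupies a single input position instead of hundreds of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): retriever-encoder and
# LLM embedding sizes.
D_ENC, D_LLM = 768, 2048

# Projector weights; in the real system these would be trained
# end-to-end with the code LLM so the compressed vector lands in a
# region of embedding space the decoder can interpret.
W = rng.normal(scale=0.02, size=(D_ENC, D_LLM))
b = np.zeros(D_LLM)

def compress_chunk(chunk_token_embs: np.ndarray) -> np.ndarray:
    """Collapse a retrieved code chunk (n_tokens x D_ENC) into a single
    D_LLM-dim vector: pool over tokens, then apply the learned projection."""
    pooled = chunk_token_embs.mean(axis=0)  # (D_ENC,)
    return pooled @ W + b                   # (D_LLM,)

# Example: 3 retrieved chunks of ~200 tokens each become 3 single-token
# vectors prepended to the prompt, instead of ~600 extra input tokens.
chunks = [rng.normal(size=(200, D_ENC)) for _ in range(3)]
compressed = np.stack([compress_chunk(c) for c in chunks])  # (3, D_LLM)

prompt_embs = rng.normal(size=(50, D_LLM))  # current-file prefix embeddings
llm_input = np.concatenate([compressed, prompt_embs], axis=0)
print(llm_input.shape)  # 53 input positions instead of ~650
```

The Time-to-First-Token gain follows directly from this shorter input: prefill cost grows with sequence length, so replacing each retrieved chunk with one vector shrinks the prefill work while still conveying the retrieved context to the decoder.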