LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the inference latency caused by excessively long contexts in retrieval-augmented code completion, this paper proposes a semantic compression framework. It introduces a lightweight, learnable projection module that compresses retrieved code context into single-token semantic vectors, trained end-to-end with the code LLM so that the compressed representations stay aligned with the decoder. While preserving retrieval effectiveness within the RAG paradigm, the method substantially reduces input sequence length. Experiments on line completion tasks show a 20-38% reduction in Time-to-First-Token (TTFT), alongside simultaneous improvements in Exact Match (EM) and Edit Similarity (ES). The approach thus achieves a favorable trade-off between inference efficiency and generation quality, supporting interactive programming scenarios.
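The summary above describes a projector that maps retrieved code context into single-token vectors in the decoder's embedding space. A minimal sketch of that idea, assuming a pooled encoder embedding fed through a two-layer MLP projector (all dimensions, weights, and names here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

d_enc, d_hidden, d_model = 256, 512, 1024  # hypothetical dimensions
# Hypothetical projector: a small two-layer MLP that maps one pooled
# code-encoder embedding to a single vector in the decoder's embedding space.
W1 = rng.normal(scale=0.02, size=(d_enc, d_hidden))
W2 = rng.normal(scale=0.02, size=(d_hidden, d_model))

def project(code_embedding):
    h = np.maximum(code_embedding @ W1, 0.0)  # ReLU hidden layer
    return h @ W2                             # shape (d_model,)

# A retrieved snippet that would have spanned hundreds of raw tokens
# collapses to one "token" vector.
retrieved = rng.normal(size=(d_enc,))          # pooled encoder output
compressed = project(retrieved)

prompt_embeds = rng.normal(size=(40, d_model)) # local-context token embeddings
inputs = np.vstack([compressed[None, :], prompt_embeds])
print(inputs.shape)  # (41, 1024): 1 compressed vector + 40 local tokens
```

In the actual framework this projector is trained jointly with the frozen or fine-tuned code LLM so the compressed vector is interpretable by the decoder; the sketch only shows the data flow.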

📝 Abstract
Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating that context significantly extends sequence length, leading to slower inference, a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by a code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module, we can significantly increase the EM and ES metrics of the coding model with negligible latency increase. Our experiments demonstrate that compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
Problem

Research questions and friction points this paper is trying to address.

Retrieved repository context substantially lengthens input sequences for code completion
Longer sequences slow inference, a critical limitation in interactive settings such as IDEs
Generation quality must be preserved while reducing Time-to-First-Token
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses retrieved code into compact single-token representations
Trains a small projector module end-to-end with the code LLM, improving EM and ES
Reduces Time-to-First-Token by 20-38% on line completion tasks
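The TTFT gain follows from prompt-length arithmetic: prefill work grows with prompt length, so replacing raw retrieved snippets with single-token vectors shrinks what the decoder must process before emitting the first token. A back-of-envelope sketch with hypothetical token counts (note that the paper's reported 20-38% TTFT reduction is smaller than the raw token savings, since local-context processing and decoding costs remain):

```python
# Illustrative, hypothetical token counts, not figures from the paper.
local_tokens = 40            # file prefix / local context
raw_retrieved_tokens = 300   # e.g. three ~100-token retrieved snippets
compressed_tokens = 3        # one single-token vector per snippet

full_prompt = local_tokens + raw_retrieved_tokens    # 340 tokens to prefill
compressed_prompt = local_tokens + compressed_tokens # 43 tokens to prefill
token_reduction = 1 - compressed_prompt / full_prompt
print(f"{token_reduction:.0%} fewer prompt tokens")  # 87% fewer
```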