Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the joint design of context window size, chunking strategy, and similarity scoring in retrieval-augmented generation (RAG) for code tasks, specifically code completion and bug localization, under realistic compute budget constraints. We propose task-aware retrieval design principles and empirically compare sparse (BM25) against dense retrieval (e.g., the Voyager-3 family), word-level against BPE-based splitting, and line-based against syntax-aware chunking, alongside multiple similarity scoring mechanisms. Results show that BM25 with word-level splitting achieves both high efficiency and accuracy in programming-language-to-programming-language (PL→PL) tasks; dense retrieval excels in natural-language-to-programming-language (NL→PL) tasks but incurs significantly higher latency; and the optimal chunk granularity depends on the context window size. To our knowledge, this work provides the first empirically grounded, task-adapted, lightweight configuration guide for code-specific RAG systems.

📝 Abstract
We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical option, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (the Voyager-3 family) consistently beat sparse retrievers, though at roughly 100x higher latency. (3) The optimal chunk size scales with the available context: 32-64 line chunks work best at small budgets, while whole-file retrieval becomes competitive at 16,000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 with word splitting offers the best quality-latency trade-off. We thus provide evidence-based recommendations for building effective code-oriented RAG systems given task requirements, model constraints, and computational budgets.
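The abstract's recommended PL-PL configuration, BM25 over word-split, line-chunked code, is simple enough to sketch end to end. The sketch below is an illustrative reimplementation, not the authors' code: the regex tokenizer, the 32-line default chunk size, and the BM25 parameters (k1=1.2, b=0.75) are common defaults assumed here.

```python
import math
import re
from collections import Counter

def word_split(text):
    # Word-level splitting: lowercase alphanumeric identifiers (assumed tokenizer).
    return re.findall(r"[A-Za-z_]\w*", text.lower())

def line_chunks(source, chunk_lines=32):
    # Line-based chunking; 32-64 line chunks are reported best at small budgets.
    lines = source.splitlines()
    return ["\n".join(lines[i:i + chunk_lines]) for i in range(0, len(lines), chunk_lines)]

class BM25:
    """Minimal Okapi BM25 over word-split documents (common k1/b defaults)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [word_split(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / max(len(docs), 1)
        self.tfs = [Counter(t) for t in self.doc_tokens]
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(docs)
        self.idf = {w: math.log(1 + (n - f + 0.5) / (f + 0.5)) for w, f in df.items()}

    def score(self, query, idx):
        tf, dl = self.tfs[idx], self.doc_len[idx]
        s = 0.0
        for w in word_split(query):
            if w not in tf:
                continue
            f = tf[w]
            s += self.idf.get(w, 0.0) * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top_k(self, query, k=3):
        # Rank all chunks by BM25 score and return the k best indices.
        ranked = sorted(range(len(self.tfs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

In a completion setting the "query" would be the code preceding the cursor, and the retrieved chunks would be prepended to the prompt up to the context budget.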
Problem

Research questions and friction points this paper is trying to address.

Optimizing code retrieval design for generation tasks under compute constraints
Comparing chunking strategies and similarity scoring for code completion
Evaluating latency-quality trade-offs in bug localization retrieval systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

BM25 with word splitting for PL-PL tasks
Proprietary dense encoders for NL-PL tasks
Line-based chunking matches syntax-aware splitting
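The abstract's third finding, that optimal chunk size scales with the available context, suggests a small configuration helper that maps a token budget to a chunk granularity. The cutoffs below are illustrative assumptions; the paper only reports that 32-64 line chunks win at small budgets and whole-file retrieval becomes competitive around 16,000 tokens.

```python
def pick_chunk_granularity(context_tokens):
    """Map a context budget (in tokens) to a line-based chunk size.

    Boundaries are assumed for illustration: small budgets get 32-line
    chunks, mid-range budgets 64-line chunks, and large budgets switch
    to whole-file retrieval, mirroring the paper's qualitative finding.
    """
    if context_tokens >= 16000:
        return "whole-file"
    if context_tokens >= 4000:  # assumed cutoff between 32- and 64-line chunks
        return 64
    return 32
```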
Timur Galimzyanov, JetBrains Research
Olga Kolomyttseva, JetBrains Research
Egor Bogomolov, JetBrains Research
machine learning for software engineering