🤖 AI Summary
Existing line-based code chunking methods in retrieval-augmented generation (RAG) often disrupt semantic structures—such as splitting functions across chunks—thereby degrading retrieval relevance and generation quality. To address this, we propose an AST-driven, structure-aware chunking method that recursively partitions large AST subtrees and merges sibling subtrees to produce cross-lingual, self-contained, semantically coherent, and size-bounded code units. Our approach introduces the first dynamic structural chunking paradigm, overcoming the limitations of heuristic line-based strategies while jointly preserving syntactic integrity and practical deployment constraints. Evaluated on the RepoEval and SWE-bench benchmarks, it achieves a +4.3 percentage point improvement in Recall@5 and a +2.67 percentage point gain in Pass@1, demonstrating substantial gains in both retrieval relevance and code generation accuracy.
📝 Abstract
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (ourwork), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach produces self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.