cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing line-based code chunking methods in retrieval-augmented generation (RAG) often disrupt semantic structures, for example by splitting a function across chunks, which degrades retrieval relevance and generation quality. To address this, we propose an AST-driven, structure-aware chunking method that recursively splits oversized AST nodes and merges sibling subtrees to produce self-contained, semantically coherent, size-bounded code units across programming languages. Our approach introduces the first dynamic structural chunking paradigm, overcoming the limitations of heuristic line-based strategies while preserving syntactic integrity and respecting practical chunk-size constraints. Evaluated on the RepoEval and SWE-bench benchmarks, it improves Recall@5 by 4.3 percentage points and Pass@1 by 2.67 percentage points, indicating gains in both retrieval relevance and code generation accuracy.

📝 Abstract
Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking: the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (cAST), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.
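
The split-then-merge idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it handles Python only, using the standard `ast` module, whereas the paper targets multiple languages. The names `chunk_python`, `nonspace_len`, and `SAMPLE` are hypothetical, and measuring chunk size in non-whitespace characters is an assumption made here for the example. The sketch recurses only into statement children, so text that sits between them (such as `else:` or `except` lines) may be dropped at chunk boundaries, and an oversized statement with no statement children is emitted as-is.

```python
import ast


def nonspace_len(text: str) -> int:
    """Size measure (an assumption of this sketch): non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace())


def chunk_python(source: str, max_size: int) -> list[str]:
    """Split Python source into chunks aligned with AST statement boundaries."""
    lines = source.splitlines()

    def span(node):
        # Start at the first decorator, if any, so decorators stay with their def.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        return start, node.end_lineno

    def text(start, end):
        return "\n".join(lines[start - 1:end])

    def fits(start, end):
        return nonspace_len(text(start, end)) <= max_size

    def split(nodes):
        spans, current = [], None
        for node in nodes:
            start, end = span(node)
            # Merge step: grow the open chunk over this sibling if it still fits.
            if current is not None and fits(current[0], end):
                current = (current[0], end)
                continue
            if current is not None:
                spans.append(current)
                current = None
            if fits(start, end):
                current = (start, end)
                continue
            # Split step: the node alone is too large, so recurse into its
            # statement children (function or class bodies, loop bodies, ...).
            children = [c for c in ast.iter_child_nodes(node)
                        if isinstance(c, ast.stmt)]
            if not children:
                spans.append((start, end))  # oversized leaf, kept whole
                continue
            first_child_start = span(children[0])[0]
            if first_child_start > start:
                # Keep the header (e.g. the "class X:" line) as its own chunk.
                spans.append((start, first_child_start - 1))
            spans.extend(split(children))
        if current is not None:
            spans.append(current)
        return spans

    return [text(s, e) for s, e in split(ast.parse(source).body)]


SAMPLE = """import os

def add(a, b):
    return a + b

def sub(a, b):
    return a - b

class Calculator:
    def mul(self, a, b):
        return a * b

    def div(self, a, b):
        return a / b
"""

chunks = chunk_python(SAMPLE, max_size=45)
```

With a budget of 45 non-whitespace characters, the small top-level statements are merged greedily, while the class is too large to fit and is split into its header and its two methods. No function body is cut in the middle.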
Problem

Research questions and friction points this paper is trying to address.

Improving code retrieval quality via structural chunking
Preserving semantic coherence in code generation tasks
Enhancing performance across diverse programming languages
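
The friction point behind these questions can be made concrete with a small check. The sketch below is illustrative only: `chunk_by_lines`, `functions_cut_by_line_chunks`, and `DEMO` are hypothetical names, and the fixed window of N lines stands in for line-based heuristics in general. It reports which functions end up spread over more than one fixed-size chunk.

```python
import ast


def chunk_by_lines(source: str, lines_per_chunk: int) -> list[str]:
    """Naive line-based chunking: cut every `lines_per_chunk` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def functions_cut_by_line_chunks(source: str, lines_per_chunk: int) -> list[str]:
    """Names of functions whose lines land in more than one fixed-size chunk."""
    cut = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            first_chunk = (node.lineno - 1) // lines_per_chunk
            last_chunk = (node.end_lineno - 1) // lines_per_chunk
            if first_chunk != last_chunk:
                cut.append(node.name)
    return cut


DEMO = "\n".join([
    "def area(width, height):",
    "    if width < 0 or height < 0:",
    "        raise ValueError('negative size')",
    "    return width * height",
    "",
    "def perimeter(width, height):",
    "    return 2 * (width + height)",
])

cut_with_3 = functions_cut_by_line_chunks(DEMO, 3)
cut_with_4 = functions_cut_by_line_chunks(DEMO, 4)
```

In this toy input, a window of three lines cuts both functions, while a window of four happens to align with their boundaries. Whether a line-based chunker preserves a function depends on where the cut points fall, which is the fragility that structure-aware chunking is meant to remove.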
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-aware chunking via Abstract Syntax Trees
Recursive splitting and merging of AST nodes
Gains in code retrieval and generation performance (+4.3 points Recall@5 on RepoEval, +2.67 points Pass@1 on SWE-bench)