SaraCoder: Orchestrating Semantic and Structural Cues for Profit-Oriented Repository-Level Code Completion

📅 2025-08-13
🤖 AI Summary
In repository-level code completion, retrieval based on shallow text similarity leads to semantic misdirection, redundant and homogeneous results, and unresolved external symbol ambiguity. To address these problems, this paper proposes a Hierarchical Feature-Optimized retrieval-augmented generation (RAG) framework. The approach makes three contributions: (1) it combines deep semantic similarity with code-graph structural similarity, using a graph metric that weights edits by their topological importance; (2) it performs cross-file dependency analysis to disambiguate external identifiers; and (3) it re-ranks candidates with a multi-objective mechanism that jointly optimizes relevance and diversity. On the CrossCodeEval and RepoEval-Updated benchmarks, the method outperforms existing baselines across multiple programming languages and LLM backends.

📝 Abstract
Retrieval-augmented generation (RAG) for repository-level code completion commonly relies on superficial text similarity, leading to results plagued by semantic misguidance, redundancy, and homogeneity, while also failing to resolve external symbol ambiguity. To address these challenges, we introduce Saracoder, a Hierarchical Feature-Optimized retrieval framework. Its core Hierarchical Feature Optimization module systematically refines candidates by distilling deep semantic relationships, pruning exact duplicates, assessing structural similarity with a novel graph-based metric that weighs edits by their topological importance, and reranking results to maximize both relevance and diversity. Furthermore, an External-Aware Identifier Disambiguator module accurately resolves cross-file symbol ambiguity via dependency analysis. Extensive experiments on the challenging CrossCodeEval and RepoEval-Updated benchmarks demonstrate that Saracoder significantly outperforms existing baselines across multiple programming languages and models. Our work proves that systematically refining retrieval results across multiple dimensions provides a new paradigm for building more accurate and robust repository-level code completion systems.
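The abstract's step of pruning exact duplicates and then reranking for both relevance and diversity can be illustrated with a small sketch. This is not the paper's mechanism, which the abstract does not specify. It swaps in greedy Maximal Marginal Relevance (MMR) with Jaccard token overlap as a stand-in similarity, and the names `jaccard`, `mmr_rerank`, and `lam` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap; a stand-in for a real semantic similarity model."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mmr_rerank(query: Set[str], candidates: Dict[str, Set[str]],
               k: int, lam: float = 0.7) -> List[str]:
    """Greedy MMR (assumed technique): lam weights relevance, 1 - lam penalizes redundancy."""
    # Prune exact duplicates: identical token sets keep only the first name seen.
    unique: Dict[FrozenSet[str], str] = {}
    for name, tokens in candidates.items():
        unique.setdefault(frozenset(tokens), name)
    tokens_of = {name: tokens for tokens, name in unique.items()}

    remaining = list(tokens_of)
    selected: List[str] = []
    while remaining and len(selected) < k:
        def mmr_score(name: str) -> float:
            relevance = jaccard(query, tokens_of[name])
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (jaccard(tokens_of[name], tokens_of[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = {"open", "file", "read", "lines"}
    snippets = {
        "read_file": {"open", "file", "read"},
        "read_file_copy": {"open", "file", "read"},  # exact duplicate
        "read_and_close": {"open", "file", "read", "close"},
        "split_lines": {"read", "lines", "split"},
    }
    print(mmr_rerank(query, snippets, k=2, lam=0.5))  # -> ['read_file', 'split_lines']
```

With `lam=0.5` the second pick is the less relevant but more novel `split_lines`. With `lam=1.0` the ranking falls back to pure relevance and `read_and_close` is chosen instead.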
Problem

Research questions and friction points this paper is trying to address.

Addresses semantic misguidance and redundancy in code completion
Resolves external symbol ambiguity via dependency analysis
Improves retrieval accuracy and diversity in repository-level coding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Feature Optimization for code retrieval
Graph-based structural similarity assessment
Dependency-aware symbol disambiguation
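As a rough illustration of dependency-aware symbol disambiguation, the sketch below maps each imported name in a Python file to the fully qualified symbol it refers to, using only the standard-library `ast` module. It assumes Python sources and import-level analysis only. The paper's External-Aware Identifier Disambiguator may work quite differently, and `resolve_imported_names` is a hypothetical helper name.

```python
import ast
from typing import Dict


def resolve_imported_names(source: str) -> Dict[str, str]:
    """Map each locally bound imported name to the qualified symbol it points at."""
    bindings: Dict[str, str] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    # `import a.b as c` binds `c` to the module `a.b`.
                    bindings[alias.asname] = alias.name
                else:
                    # `import a.b` binds only the top-level package `a`.
                    top = alias.name.split(".")[0]
                    bindings[top] = top
        elif isinstance(node, ast.ImportFrom):
            # Relative imports keep their leading dots, e.g. `.models`.
            prefix = "." * node.level + (node.module or "")
            sep = "" if prefix.endswith(".") else "."
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports cannot be resolved without the target module
                bindings[alias.asname or alias.name] = f"{prefix}{sep}{alias.name}"
    return bindings


if __name__ == "__main__":
    example = (
        "import numpy as np\n"
        "from utils.io import load as load_cfg\n"
        "from .models import User\n"
    )
    for local, target in resolve_imported_names(example).items():
        print(f"{local} -> {target}")
```

A retriever could use such a mapping to look up the definition of `load_cfg` in `utils/io` instead of matching the bare identifier by text alone.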
Authors

Xiaohan Chen
University of Cyprus
Transfer learning, Fault diagnosis, Time-series analysis

Zhongying Pan
Huaneng Information Technology Co., Ltd.

Quan Feng
Hunan Vanguard Group Corporation Co., Ltd.

Yu Tian
Tsinghua University

Shuqun Yang
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics

Mengru Wang
Zhejiang University

Lina Gong
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics

Yuxia Geng
Zhejiang University, Hangzhou Dianzi University, PowerChina Huadong Engineering Corporation Limited
Knowledge Graph, Large Language Model, Industry Application

Piji Li
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics

Xiang Chen
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics