Zero-Shot Cross-Domain Code Search without Fine-Tuning

📅 2025-04-10
🤖 AI Summary
To address the high fine-tuning cost and poor cross-domain generalization of zero-shot code search, this paper proposes CodeBridge, a fine-tuning-free approach. It supplements direct query-code matching with two simpler matching tasks: query-comment matching and code-code matching. Large language models (LLMs) generate the comments and pseudo-code; pre-trained language model (PLM)-based similarity scores for the three matching schemas are then combined through sampling-based fusion. Across three benchmark datasets, CodeBridge improves MRR by an average of 21.4% over CoCoSoDa and 24.9% over UniXcoder, and its results are better than or comparable to those of RAPID, which requires costly fine-tuning.
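The MRR figures above can be read against a minimal sketch of the metric. Mean reciprocal rank averages 1/rank of the correct snippet over all queries. The function names and toy rankings below are illustrative assumptions, and the sketch treats the reported percentages as relative gains, which the summary does not state explicitly.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct item; a query whose item is missing scores 0."""
    total = 0.0
    for ranking, target in zip(rankings, gold):
        for position, item in enumerate(ranking, start=1):
            if item == target:
                total += 1.0 / position
                break
    return total / len(gold)


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (assumed reading of the percentages)."""
    return (improved - baseline) / baseline


# Toy data: the correct snippet "a" is ranked 1st, 2nd, and 3rd for three queries.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, gold)  # (1 + 1/2 + 1/3) / 3
print(round(mrr, 4))
```

Under the relative-gain reading, a hypothetical baseline MRR of 0.500 would have to rise to about 0.607 to count as a 21.4% improvement.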

📝 Abstract
Code search aims to retrieve semantically relevant code snippets for natural language queries. While pre-trained language models (PLMs) have shown remarkable performance in this task, they struggle in cross-domain scenarios, often requiring costly fine-tuning or facing performance drops in zero-shot settings. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and needs to maintain specialized models for each domain, underscoring the need for a zero-shot, fine-tuning-free approach for cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. Our empirical study reveals the strong complementarity among the three matching schemas in zero-shot cross-domain settings, i.e., query-code, query-comment, and code-code matching. Based on the findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge uses Large Language Models (LLMs) to generate comments and pseudo-code, then combines query-code, query-comment, and code-code matching via PLM-based similarity scoring and sampling-based fusion. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches, i.e., CoCoSoDa and UniXcoder, by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning.
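To make the three matching schemas concrete, here is a minimal sketch of score fusion over toy embeddings. It is not the CodeBridge implementation: the abstract does not give the details of its sampling-based fusion, so the sketch substitutes a plain weighted average. It also assumes that comments are generated for each candidate snippet and that pseudo-code is generated from the query. All names, vectors, and weights are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fused_score(query_vec, pseudo_code_vec, candidate, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of the three schema scores (an assumed stand-in for the paper's fusion)."""
    s_query_code = cosine(query_vec, candidate["code"])        # query-code matching
    s_query_comment = cosine(query_vec, candidate["comment"])  # query-comment matching
    s_code_code = cosine(pseudo_code_vec, candidate["code"])   # code-code matching
    w_qc, w_qm, w_cc = weights
    return w_qc * s_query_code + w_qm * s_query_comment + w_cc * s_code_code


def rank(query_vec, pseudo_code_vec, candidates):
    """Sort candidates by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: fused_score(query_vec, pseudo_code_vec, c),
        reverse=True,
    )


# Toy embeddings standing in for PLM outputs (hypothetical values).
candidates = [
    {"name": "open_file", "code": [1.0, 0.0], "comment": [0.0, 1.0]},
    {"name": "sort_list", "code": [0.0, 1.0], "comment": [1.0, 0.0]},
]
query_vec = [1.0, 0.0]        # embedding of the natural language query
pseudo_code_vec = [0.0, 1.0]  # embedding of the LLM-generated pseudo-code

ranked = rank(query_vec, pseudo_code_vec, candidates)
print([c["name"] for c in ranked])  # ['sort_list', 'open_file']
```

In this toy case query-code matching alone would prefer `open_file`, while the comment and pseudo-code paths both favor `sort_list`, which is the kind of complementarity the abstract reports.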
Problem

Research questions and friction points this paper is trying to address.

PLM-based code search either needs costly fine-tuning or loses accuracy in zero-shot cross-domain settings
RAPID, the only effective existing method, requires fine-tuning and a specialized model per domain
Gaps among domains must be bridged without any fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-generated comments and pseudo-code
PLM-based similarity scoring with sampling-based fusion of query-code, query-comment, and code-code matching
Zero-shot, fine-tuning-free cross-domain code search
Keyu Liang
The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China
Zhongxin Liu
Zhejiang University
Software Engineering, Large Language Models
Chao Liu
School of Big Data and Software Engineering, Chongqing University, Chongqing, China
Zhiyuan Wan
Associate Professor of Computer Science, Zhejiang University
Software Engineering, Software Security, Programming Languages
David Lo
School of Computing and Information Systems, Singapore Management University, Singapore, Singapore
Xiaohu Yang
National University of Defense Technology
Plasma physics, Laser-plasma interaction, Inertial confinement fusion, Charged particle beam