🤖 AI Summary
To address the high fine-tuning cost and poor cross-domain generalization of existing methods for zero-shot cross-domain code search, this paper proposes CodeBridge, a fine-tuning-free approach. It supplements direct query-code matching with two complementary paths: query-comment matching and code-code matching. Large language models (LLMs) generate the comments and pseudo-code that these paths rely on; pre-trained language model (PLM)-based similarity scoring and sampling-based fusion then combine the three matching schemas. Evaluated on three benchmark datasets, the method improves MRR by an average of 21.4% over CoCoSoDa and 24.9% over UniXcoder, and its results are better than or comparable to those of RAPID, which requires fine-tuning. Unlike RAPID, it needs neither task-specific fine-tuning nor a specialized model for each domain.
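For readers unfamiliar with the metric, the snippet below is a minimal sketch of how mean reciprocal rank (MRR) is conventionally computed. The function name and the input format (one 1-based rank per query) are illustrative assumptions, not taken from the paper's evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based position at which the correct code snippet
    appears in the result list for query i (hypothetical input format).
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(1.0 / r for r in ranks) / len(ranks)
```

A perfect retriever places every correct snippet at rank 1 and scores 1.0; the score falls toward 0 as correct snippets move down the result lists.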
📝 Abstract
Code search aims to retrieve semantically relevant code snippets for natural language queries. While pre-trained language models (PLMs) have shown remarkable performance on this task, they struggle in cross-domain scenarios: they either require costly fine-tuning or suffer performance drops in zero-shot settings. RAPID, which generates synthetic data for model fine-tuning, is currently the only effective method for zero-shot cross-domain code search. Despite its effectiveness, RAPID demands substantial computational resources for fine-tuning and requires maintaining a specialized model for each domain, underscoring the need for a zero-shot, fine-tuning-free approach to cross-domain code search. The key to tackling zero-shot cross-domain code search lies in bridging the gaps among domains. In this work, we propose to break the query-code matching process of code search into two simpler tasks: query-comment matching and code-code matching. Our empirical study reveals strong complementarity among three matching schemas (query-code, query-comment, and code-code matching) in zero-shot cross-domain settings. Based on these findings, we propose CodeBridge, a zero-shot, fine-tuning-free approach for cross-domain code search. Specifically, CodeBridge uses Large Language Models (LLMs) to generate comments and pseudo-code, then combines query-code, query-comment, and code-code matching via PLM-based similarity scoring and sampling-based fusion. Experimental results show that our approach outperforms the state-of-the-art PLM-based code search approaches CoCoSoDa and UniXcoder by an average of 21.4% and 24.9% in MRR, respectively, across three datasets. Our approach also yields results that are better than or comparable to those of the zero-shot cross-domain code search approach RAPID, which requires costly fine-tuning.
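The three matching schemas can be illustrated with a small sketch. It assumes that the comment is generated from each candidate snippet and the pseudo-code from the query, which is one natural reading of the abstract rather than a stated detail. The abstract does not describe the sampling-based fusion step, so a plain weighted sum stands in for it here. All names (`Candidate`, `fused_scores`, `rank`) and the toy embedding vectors are hypothetical, and a real system would obtain the vectors from a PLM encoder.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class Candidate:
    """Embeddings for one snippet in the search corpus (hypothetical structure)."""
    name: str
    code_vec: Sequence[float]     # embedding of the code itself
    comment_vec: Sequence[float]  # embedding of the LLM-generated comment


def fused_scores(
    query_vec: Sequence[float],
    pseudo_code_vec: Sequence[float],
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[float]:
    """Score each candidate with three matching schemas and sum them.

    The weighted sum is an assumption standing in for the paper's
    sampling-based fusion, which the abstract does not specify.
    """
    w_query_code, w_query_comment, w_code_code = weights
    scores = []
    for cand in candidates:
        query_code = cosine(query_vec, cand.code_vec)
        query_comment = cosine(query_vec, cand.comment_vec)
        code_code = cosine(pseudo_code_vec, cand.code_vec)
        scores.append(
            w_query_code * query_code
            + w_query_comment * query_comment
            + w_code_code * code_code
        )
    return scores


def rank(
    query_vec: Sequence[float],
    pseudo_code_vec: Sequence[float],
    candidates: Sequence[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[str]:
    """Return candidate names ordered from highest to lowest fused score."""
    scores = fused_scores(query_vec, pseudo_code_vec, candidates, weights)
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i].name for i in order]
```

Setting `weights=(1.0, 0.0, 0.0)` reduces the sketch to plain query-code matching, which makes it easy to compare a single schema against the fused score on the same candidates.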