🤖 AI Summary
Inaccurate variable naming severely degrades code search performance because of semantic inconsistency across code variants. Method: this paper proposes a naming-agnostic, contrastive multi-view code representation learning framework. It (1) introduces a novel AST modeling paradigm that explicitly decouples representation learning from variable identifiers, focusing instead on intrinsic program structure; (2) designs a semantics- and syntax-aware data augmentation strategy to make contrastive learning more robust; and (3) constructs a complementary dual-view architecture that integrates graph-based and path-based representations, combining AST structural modeling, graph neural networks, and contrastive learning. Contribution/Results: the method is plug-and-play: it requires no modification to downstream tasks yet significantly improves existing models' robustness to naming variations. Extensive experiments on multiple benchmark datasets demonstrate substantial gains over state-of-the-art approaches, validating both the effectiveness and generalizability of naming-agnostic code representations.
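The naming-agnostic idea can be illustrated with a minimal sketch (not the paper's implementation): a hypothetical `VariableNormalizer` that rewrites every identifier in a Python AST into a positional placeholder, so two snippets that differ only in variable names yield identical trees.

```python
import ast


class VariableNormalizer(ast.NodeTransformer):
    """Rename every variable to a positional placeholder (VAR_0, VAR_1, ...),
    so implementations that differ only in naming map to the same AST.
    Illustrative only; NACS instead strips name information from the AST."""

    def __init__(self):
        self.mapping = {}

    def _placeholder(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"VAR_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        # Replace each variable reference with its placeholder.
        new = ast.Name(id=self._placeholder(node.id), ctx=node.ctx)
        return ast.copy_location(new, node)

    def visit_arg(self, node):
        # Function parameters get placeholders too.
        node.arg = self._placeholder(node.arg)
        return node


def normalize(code: str) -> str:
    """Parse, normalize variable names, and unparse (Python 3.9+)."""
    tree = VariableNormalizer().visit(ast.parse(code))
    return ast.unparse(tree)
```

Under this normalization, `def add(a, b): return a + b` and `def add(x, y): return x + y` become the same string, which is exactly the invariance that naming-agnostic representations aim for.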
📝 Abstract
Software development is a repetitive task, as developers usually reuse or draw inspiration from existing implementations. Code search, which refers to retrieving relevant code snippets from a codebase according to the developer's intent expressed as a query, has become increasingly important in the software development process. Due to the success of deep learning in various applications, many deep learning based code search approaches have emerged and achieved promising results. However, developers may not follow the same naming conventions, and the same variable may be named differently in different implementations, posing a challenge for deep learning based code search methods that rely on explicit variable correspondences to understand source code. To overcome this challenge, we propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning. NACS strips information bound to variable names from the Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures. We use semantic-level and syntax-level augmentation techniques to generate realistic augmented data and adopt contrastive learning to design a graph-view modeling component in NACS to enhance the understanding of code snippets. We further model ASTs in a path view to strengthen the graph-view modeling component through multi-view learning. Extensive experiments show that NACS provides superior code search performance compared to baselines, and NACS can be adapted to help existing code search methods overcome the impact of different naming conventions. Our implementation is available at https://github.com/KDEGroup/NACS.
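The contrastive multi-view objective described above can be sketched as a symmetric InfoNCE loss, where the graph-view and path-view embeddings of the same snippet form a positive pair and all other snippets in the batch serve as negatives. This is a hedged NumPy illustration of the standard formulation; NACS's exact encoders and loss may differ.

```python
import numpy as np


def info_nce(graph_emb, path_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch: row i of graph_emb and row i of
    path_emb are two views of the same snippet (the positive pair);
    every other row in the batch is a negative."""
    # L2-normalize so dot products are cosine similarities.
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    p = path_emb / np.linalg.norm(path_emb, axis=1, keepdims=True)
    logits = g @ p.T / temperature  # (batch, batch); positives on the diagonal
    n = len(logits)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the graph-to-path and path-to-graph directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

When the two views of each snippet agree, the diagonal dominates and the loss is near zero; mismatched pairings drive the loss up, which is what pushes the two views of the same code together during training.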