🤖 AI Summary
Inaccurate variable naming severely degrades code search performance because of semantic inconsistency across code variants. Method: this paper proposes a naming-agnostic, contrastive multi-view code representation learning framework. It (1) introduces a novel AST modeling paradigm that explicitly decouples representation learning from variable identifiers, focusing instead on intrinsic program structure; (2) designs a semantics- and syntax-aware data augmentation strategy to make contrastive learning more robust; and (3) constructs a complementary dual-view architecture that integrates graph-based and path-based representations, combining AST structural modeling, graph neural networks, and contrastive learning. Contribution/Results: the method is plug-and-play: it requires no modification to downstream tasks yet significantly improves existing models' robustness to naming variations. Extensive experiments on multiple benchmark datasets demonstrate substantial gains over state-of-the-art approaches, validating both the effectiveness and generalizability of naming-agnostic code representations.
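The naming-agnostic idea can be illustrated with a minimal sketch (not the paper's implementation): a hypothetical `VariableNormalizer` that rewrites every identifier in a Python AST into a positional placeholder, so two snippets that differ only in variable names yield identical trees.

```python
import ast


class VariableNormalizer(ast.NodeTransformer):
    """Rename every variable to a positional placeholder (VAR_0, VAR_1, ...),
    so implementations that differ only in naming map to the same AST.
    Illustrative only; NACS instead strips name information from the AST."""

    def __init__(self):
        self.mapping = {}

    def _placeholder(self, name):
        if name not in self.mapping:
            self.mapping[name] = f"VAR_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        # Replace each variable reference with its placeholder.
        new = ast.Name(id=self._placeholder(node.id), ctx=node.ctx)
        return ast.copy_location(new, node)

    def visit_arg(self, node):
        # Function parameters get placeholders too.
        node.arg = self._placeholder(node.arg)
        return node


def normalize(code: str) -> str:
    """Parse, normalize variable names, and unparse (Python 3.9+)."""
    tree = VariableNormalizer().visit(ast.parse(code))
    return ast.unparse(tree)
```

Under this normalization, `def add(a, b): return a + b` and `def add(x, y): return x + y` become the same string, which is exactly the invariance that naming-agnostic representations aim for.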
📝 Abstract
Software development is a repetitive task, as developers usually reuse or draw inspiration from existing implementations. Code search, which refers to retrieving relevant code snippets from a codebase according to the developer's intent expressed as a query, has become increasingly important in the software development process. Due to the success of deep learning in various applications, many deep learning based code search approaches have emerged and achieved promising results. However, developers may not follow the same naming conventions, and the same variable may be named differently in different implementations, posing a challenge for deep learning based code search methods that rely on explicit variable correspondences to understand source code. To overcome this challenge, we propose a naming-agnostic code search method (NACS) based on contrastive multi-view code representation learning. NACS strips information bound to variable names from the Abstract Syntax Tree (AST), the representation of the abstract syntactic structure of source code, and focuses on capturing intrinsic properties solely from AST structures. We use semantic-level and syntax-level augmentation techniques to generate realistic augmented data and adopt contrastive learning to design a graph-view modeling component in NACS to enhance the understanding of code snippets. We further model ASTs in a path view to strengthen the graph-view modeling component through multi-view learning. Extensive experiments show that NACS provides superior code search performance compared to baselines, and NACS can be adapted to help existing code search methods overcome the impact of different naming conventions. Our implementation is available at https://github.com/KDEGroup/NACS.
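The contrastive multi-view objective described above can be sketched as a symmetric InfoNCE loss, where the graph-view and path-view embeddings of the same snippet form a positive pair and all other snippets in the batch serve as negatives. This is a hedged NumPy illustration of the standard formulation; NACS's exact encoders and loss may differ.

```python
import numpy as np


def info_nce(graph_emb, path_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch: row i of graph_emb and row i of
    path_emb are two views of the same snippet (the positive pair);
    every other row in the batch is a negative."""
    # L2-normalize so dot products are cosine similarities.
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    p = path_emb / np.linalg.norm(path_emb, axis=1, keepdims=True)
    logits = g @ p.T / temperature  # (batch, batch); positives on the diagonal
    n = len(logits)

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the graph-to-path and path-to-graph directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

When the two views of each snippet agree, the diagonal dominates and the loss is near zero; mismatched pairings drive the loss up, which is what pushes the two views of the same code together during training.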