A Comprehensive Study on the Use of Word Embedding Models in Software Engineering Domain

📅 2025-05-23
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing research on word embeddings (WE) in software engineering (SE) is fragmented and lacks systematic comparison or consensus on best practices. Method: We conduct the first large-scale, cross-study empirical survey, systematically analyzing 181 papers from top SE conferences and journals via bibliometrics, systematic literature review (SLR), qualitative coding, and cross-study comparison. Contribution/Results: We map WE applications across 12 SE task categories, characterize the applicability and limitations of six prevalent training strategies, and rigorously delineate performance boundaries and contextual suitability of WE relative to traditional methods (e.g., TF-IDF, LDA). The study identifies recurrent challenges—including inadequate modeling of software-specific semantics—and proposes targeted modeling optimizations. Our findings establish a theoretically grounded, empirically validated framework to guide the principled adoption and adaptation of semantic representation techniques in SE.

📝 Abstract
Word embedding (WE) techniques are advanced textual semantic representation models originating from the natural language processing (NLP) area. Inspired by their effectiveness on various NLP tasks, more and more researchers have attempted to adopt WE models for their software engineering (SE) tasks, in which the semantic representation of software artifacts such as bug reports and code snippets is the basis for further model building. However, existing studies are generally isolated from each other, without comprehensive comparison or discussion. This not only leaves the best practices of such cross-discipline technique adoption buried in scattered papers, but also leaves us largely blind to the current progress in the semantic representation of SE artifacts. To this end, we performed a comprehensive study on the use of WE models in the SE domain. We collected 181 primary studies published in mainstream software engineering venues for analysis, and answered several research questions concerning the SE applications, the training strategies of WE models, the comparison with traditional semantic representation methods, and more. The answers give a systematic view of the current practice of using WE in the SE domain, and identify the challenges and actions involved in adopting or developing practical semantic representation approaches for the SE artifacts used in a range of SE tasks.
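To make the comparison the abstract refers to concrete, the sketch below shows the traditional baseline side of it: a minimal TF-IDF representation of bug-report texts, built with only the Python standard library. The toy corpus and whitespace tokenization are invented for illustration and are not drawn from the study.

```python
import math
from collections import Counter

# Toy bug-report corpus (invented for illustration; not from the study).
docs = [
    "null pointer exception when saving file",
    "app crashes with null pointer on startup",
    "ui freezes when saving large file",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents each term appears in.
df = Counter()
for toks in tokenized:
    df.update(set(toks))

def tfidf(tokens):
    """Sparse TF-IDF vector for one document (raw tf x idf = log(N/df))."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

v0 = tfidf(tokenized[0])
```

Note how the weighting favors discriminative terms: "exception" (which occurs in one document) gets a higher weight than "saving" (which occurs in two), even though both appear once in the first report. WE models go further by assigning similar vectors to terms that appear in similar contexts, which TF-IDF cannot do.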
Problem

Research questions and friction points this paper is trying to address.

Evaluating word embedding models in software engineering tasks
Comparing traditional and WE-based semantic representation methods
Identifying challenges in adopting WE for SE artifact analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using word embedding models for software engineering tasks
Comprehensive study on WE models in SE domain
Analyzing 181 studies for SE artifact representation
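As a hedged illustration of the distributional idea behind the WE models surveyed here, the sketch below builds count-based word vectors from a window of co-occurrence counts and compares them with cosine similarity. The toy token stream is invented for illustration; real WE models such as word2vec learn dense vectors from far larger corpora, but rest on the same "similar contexts, similar vectors" principle.

```python
import numpy as np

# Toy token stream (invented): words from hypothetical SE artifacts.
tokens = ("open file read file close file open socket read socket "
          "close socket fix bug report bug close bug").split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-1 token window;
# each word's row of M serves as its embedding vector.
window = 1
M = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            M[idx[w], idx[tokens[j]]] += 1

def cos(a, b):
    """Cosine similarity between the embedding rows of words a and b."""
    va, vb = M[idx[a]], M[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-12))
```

In this stream, "file" and "socket" share contexts ("open", "read", "close"), so their vectors end up closer to each other than "file" is to "bug", which mostly co-occurs with "fix" and "report".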
Xiaohan Chen
University of Cyprus
Transfer learning · Fault diagnosis · Time-series analysis
Weiqin Zou
Nanjing University of Aeronautics and Astronautics (NUAA)
Software engineering
Lianyi Zhi
Nanjing University of Aeronautics and Astronautics
Qianshuang Meng
Nanjing University of Aeronautics and Astronautics
Jingxuan Zhang
Nanjing University of Aeronautics and Astronautics