Where are the Hidden Gems? Applying Transformer Models for Design Discussion Detection

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of identifying design decisions implicitly embedded in software engineering discussions, a task hindered by the scarcity and high cost of annotated data. The work presents the first systematic evaluation of Transformer-based models—including BERT, RoBERTa, XLNet, the lightweight LaMini-Flan-T5-77M, and ChatGPT-4o-mini—for cross-domain design discussion detection, while also rectifying methodological shortcomings of prior approaches. Experimental results show that BERT and RoBERTa exhibit strong recall, XLNet achieves high precision at the expense of recall, and ChatGPT-4o-mini attains the highest recall with competitive overall performance. LaMini-Flan-T5-77M emerges as an efficient, lightweight alternative. Notably, injecting semantically similar words for data augmentation does not yield significant performance gains.

📝 Abstract
Design decisions are at the core of software engineering and appear in Q&A forums, mailing lists, pull requests, issue trackers, and commit messages. Design discussions spanning a project's history provide valuable information for informed decision-making, such as refactoring and software modernization. Machine learning techniques have been used to detect design decisions in natural language discussions; however, their effectiveness is limited by the scarcity of labeled data and the high cost of annotation. Prior work adopted cross-domain strategies with traditional classifiers, training on one domain and testing on another. Despite their success, transformer-based models, which often outperform traditional methods, remain largely unexplored in this setting. The goal of this work is to investigate the performance of transformer-based models (i.e., BERT, RoBERTa, XLNet, LaMini-Flan-T5-77M, and ChatGPT-4o-mini) for detecting design-related discussions. To this end, we conduct a conceptual replication of prior cross-domain studies while extending them with modern transformer architectures and addressing methodological issues in earlier work. The models were fine-tuned on Stack Overflow and evaluated on GitHub artifacts (i.e., pull requests, issues, and commits). BERT and RoBERTa show strong recall across domains, while XLNet achieves higher precision but lower recall. ChatGPT-4o-mini yields the highest recall and competitive overall performance, whereas LaMini-Flan-T5-77M provides a lightweight alternative with stronger precision but less balanced performance. We also evaluated similar-word injection for data augmentation, but unlike prior findings, it did not yield meaningful improvements. Overall, these results highlight both the opportunities and trade-offs of using modern language models for detecting design discussions.
Problem

Research questions and friction points this paper is trying to address.

design discussion detection
software engineering
labeled data scarcity
cross-domain learning
natural language processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer models
design discussion detection
cross-domain learning
BERT
ChatGPT-4o-mini
Lawrence Arkoh
North Carolina State University, Raleigh, USA
Daniel Feitosa
University of Groningen
Software Architecture, Software Design, Software Patterns, Software Quality
Wesley K. G. Assunção
North Carolina State University, Raleigh, USA