Identifying Concurrency Bug Reports via Linguistic Patterns

📅 2026-01-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the high annotation cost and error-proneness of labeling concurrency bug reports by proposing a multi-granularity, linguistic-pattern-based classification framework. The framework constructs 58 domain-specific language patterns across four levels: word, phrase, sentence, and report. The approach combines pattern matching, traditional machine learning, fine-tuning of pre-trained language models (PLMs), and prompting of large language models (LLMs). Notably, domain-specific linguistic knowledge is injected directly into the PLM fine-tuning process. The work also releases a high-quality annotated dataset. Experimental results show that the method achieves precision of 91% and 93% on GitHub and Jira datasets, respectively, and maintains 91% precision on hold-out test data.
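The summary notes that linguistic knowledge is injected into PLM fine-tuning, though the exact mechanism is not detailed here. A common way to realize this kind of injection is to enrich each report with the patterns it matches before tokenization; the sketch below illustrates that idea with a hypothetical keyword set and tag format, not the paper's actual patterns or encoding.

```python
# Hypothetical illustration of pattern-enriched inputs for PLM fine-tuning:
# prepend the linguistic patterns a report matches as marker tags, so the
# fine-tuned model sees both the raw text and the domain knowledge.
# The keyword list and [PATTERN:...] format are illustrative assumptions.

CONCURRENCY_KEYWORDS = {"deadlock", "race", "mutex", "livelock", "atomic"}

def enrich_report(title: str, body: str) -> str:
    """Prepend matched word-level patterns as tags to the raw report text."""
    text = f"{title} {body}".lower()
    matched = sorted(k for k in CONCURRENCY_KEYWORDS if k in text)
    tags = " ".join(f"[PATTERN:{m}]" for m in matched)
    return f"{tags} {title} {body}".strip()

print(enrich_report("Deadlock on shutdown",
                    "Two threads each hold a mutex the other needs."))
```

The enriched string would then be fed to a standard PLM fine-tuning pipeline in place of the raw report text.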


๐Ÿ“ Abstract
With the growing ubiquity of multi-core architectures, concurrent systems have become essential but increasingly prone to complex issues such as data races and deadlocks. While modern issue-tracking systems facilitate the reporting of such problems, labeling concurrency-related bug reports remains a labor-intensive and error-prone task. This paper presents a linguistic-pattern-based framework for automatically identifying concurrency bug reports. We derive 58 distinct linguistic patterns from 730 manually labeled concurrency bug reports, organized across four levels: word-level (keywords), phrase-level (n-grams), sentence-level (semantic), and bug-report-level (contextual). To assess their effectiveness, we evaluate four complementary approaches (matching, learning, prompt-based, and fine-tuning), spanning traditional machine learning, large language models (LLMs), and pre-trained language models (PLMs). Our comprehensive evaluation on 12 large-scale open-source projects (10,920 issue reports from GitHub and Jira) demonstrates that fine-tuning PLMs with linguistic-pattern-enriched inputs achieves the best performance, reaching a precision of 91% on GitHub and 93% on Jira, and maintaining strong precision on post-cutoff data (91%). The contributions of this work include: (1) a comprehensive taxonomy of linguistic patterns for concurrency bugs, (2) a novel fine-tuning strategy that integrates domain-specific linguistic knowledge into PLMs, and (3) a curated, labeled dataset to support reproducible research. Together, these advances provide a foundation for improving the automation, precision, and interpretability of concurrency bug classification.
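The abstract's word-level (keyword) and phrase-level (n-gram) patterns correspond to the simplest of the four approaches evaluated: direct matching. As a minimal sketch of that idea, the matcher below uses invented example patterns, not the 58 patterns actually derived in the paper.

```python
import re

# Hypothetical word-level patterns (keywords). The paper's actual 58
# patterns span word, phrase, sentence, and report levels.
WORD_PATTERNS = {"deadlock", "race", "livelock", "starvation", "mutex", "semaphore"}

# Hypothetical phrase-level patterns (n-grams).
PHRASE_PATTERNS = ["data race", "race condition", "thread safety", "lock contention"]

def tokenize(text: str) -> list[str]:
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def match_patterns(report: str) -> dict:
    """Return the word- and phrase-level patterns a bug report matches."""
    tokens = tokenize(report)
    normalized = " ".join(tokens)
    words = sorted(WORD_PATTERNS & set(tokens))
    phrases = [p for p in PHRASE_PATTERNS if p in normalized]
    return {"words": words, "phrases": phrases,
            "is_concurrency": bool(words or phrases)}

report = "App hangs: a race condition between the writer thread and the GC causes a deadlock."
print(match_patterns(report))
```

Pure matching like this tends to be high-recall on obvious keywords but misses reports that describe concurrency symptoms without naming them, which is where the sentence- and report-level patterns and the learned approaches come in.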
Problem

Research questions and friction points this paper is trying to address.

concurrency bug
bug report identification
linguistic patterns
issue tracking
automated classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

linguistic patterns
concurrency bug detection
pre-trained language models
fine-tuning strategy
automated bug classification