Improving MPI Error Detection and Repair with Large Language Models and Bug References

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the significant challenges in detecting and repairing bugs in MPI programs, which stem from their intricate inter-process communication mechanisms. Existing large language models (LLMs) generally lack sufficient knowledge of MPI-specific defects, leading to suboptimal performance. To overcome this limitation, the paper proposes a synergistic framework that integrates few-shot learning (FSL), chain-of-thought (CoT) reasoning, and retrieval-augmented generation (RAG), leveraging external MPI defect knowledge to guide LLMs toward precise bug localization and repair. The approach substantially improves bug detection accuracy from 44% to 77% and demonstrates consistent effectiveness and strong generalization across multiple state-of-the-art LLMs, offering a novel paradigm for debugging parallel programs.
๐Ÿ“ Abstract
Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging due to the complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying large language models (LLMs) yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage patterns, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique that combines Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) to enhance LLMs' ability to detect and repair errors. Surprisingly, these enhancements lead to a significant improvement in error detection accuracy, from 44% to 77%, compared to a baseline that uses ChatGPT directly. Additionally, our experiments demonstrate that our bug-referencing technique generalizes well to other large language models.
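The abstract describes guiding an LLM with retrieved MPI defect knowledge (RAG), worked examples (FSL), and step-by-step reasoning instructions (CoT). The sketch below illustrates that general idea only; it is not the paper's implementation. The defect entries, the keyword-overlap retrieval, and the prompt wording are all illustrative assumptions, with a classic head-to-head `MPI_Send`/`MPI_Recv` deadlock as the retrieved reference bug.

```python
# Sketch of an FSL + CoT + RAG prompt assembly for MPI bug detection.
# The tiny "defect knowledge base" and naive retrieval are stand-ins for
# a real retriever over a curated MPI bug corpus.
BUG_DB = [
    {
        "keywords": {"MPI_Send", "MPI_Recv", "deadlock"},
        "example": (
            "// Both ranks issue a blocking MPI_Send before any MPI_Recv,\n"
            "// so neither send can complete: a head-to-head deadlock.\n"
            "MPI_Send(buf, n, MPI_INT, peer, 0, MPI_COMM_WORLD);\n"
            "MPI_Recv(buf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &st);"
        ),
        "fix": "Reorder send/recv on one rank, or use MPI_Sendrecv or nonblocking MPI_Isend/MPI_Irecv.",
    },
    {
        "keywords": {"MPI_Reduce", "MPI_Bcast", "collective"},
        "example": "// A collective reached by only some ranks hangs the rest.",
        "fix": "Ensure every rank in the communicator calls the same collectives in the same order.",
    },
]

def retrieve(code: str) -> dict:
    """Naive keyword-overlap retrieval standing in for a real RAG retriever."""
    return max(BUG_DB, key=lambda entry: sum(k in code for k in entry["keywords"]))

def build_prompt(code: str) -> str:
    """Fold the retrieved bug into a few-shot, chain-of-thought prompt."""
    hit = retrieve(code)
    return (
        "You are an MPI debugging assistant.\n"
        # Few-shot: the retrieved reference bug and its known repair.
        f"Reference buggy pattern:\n{hit['example']}\nKnown fix: {hit['fix']}\n"
        # Chain-of-thought: require reasoning before the verdict.
        "Reason step by step about each rank's communication order, "
        "then report the bug location and a repaired version.\n"
        f"Program under analysis:\n{code}\n"
    )
```

A production retriever would use embedding similarity over a real defect corpus rather than keyword overlap, but the prompt shape (retrieved exemplar, repair hint, explicit reasoning instruction) is the part the abstract credits for the accuracy gain.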
Problem

Research questions and friction points this paper is trying to address.

MPI
error detection
bug repair
large language models
high-performance computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Few-Shot Learning
Chain-of-Thought
Retrieval Augmented Generation
MPI error repair
Large Language Models
Scott Piersall
Dept. of Computer Science, University of Central Florida, Orlando, 32816, FL, US
Yang Gao
Dept. of Computer Science, University of Central Florida, Orlando, 32816, FL, US
Shenyang Liu
Dept. of Computer Science, University of Central Florida, Orlando, 32816, FL, US
Liqiang Wang
Professor of Computer Science, University of Central Florida
Big Data, Deep Learning, Blockchain, Program Analysis, Parallel Computing