🤖 AI Summary
Non-systematic concurrency bugs in parallel programs are slow to detect and hard to trace back to a root cause. To address this, we propose two adaptations of Optimal Dynamic Partial Order Reduction (ODPOR) that jointly target bug discovery and root-cause attribution. The first, RFS ODPOR, explores the state space out of order so that the search does not stall in large, uninteresting regions. The second is an efficient ODPOR-based backtracking algorithm that maintains high coverage while enabling precise bug localization. Integrated into the Mc SimGrid simulation framework, the approach supports validation on realistic-scale distributed applications. Experimental results show that, compared to conventional ODPOR, the method finds non-systematic bugs multiple orders of magnitude faster and produces compact, interpretable execution traces for root-cause tracing, which improves debuggability and makes verification results easier to understand.
📝 Abstract
Assessing the correctness of distributed and parallel applications is notoriously difficult because of the complexity of concurrent behaviors and the difficulty of reproducing bugs. In this context, Dynamic Partial Order Reduction (DPOR) techniques have proved successful in exploiting concurrency to verify applications without exploring all their behaviors. However, they may lack efficiency when tracking non-systematic bugs in real-size applications. In this paper, we propose two adaptations of the Optimal Dynamic Partial Order Reduction (ODPOR) algorithm, with a particular focus on bug finding and explanation. The first adaptation is an out-of-order version called RFS ODPOR, which avoids getting stuck in large, uninteresting parts of the state space. Once a bug is found, the second adaptation takes advantage of ODPOR principles to efficiently find the origins of the bug.