Correct Black-Box Monitors for Distributed Deadlock Detection: Formalisation and Implementation (Technical Report)

📅 2025-08-20

🤖 AI Summary

In distributed microservice systems, message passing and remote procedure calls can introduce deadlocks whose effects cascade across services and make diagnosis difficult. The paper proposes distributed black-box monitors, deployed alongside each service, that detect deadlocks non-intrusively by observing only the service's incoming and outgoing messages and by exchanging probes with other monitors. It presents a formal model of RPC-based applications (e.g., gen_servers in Erlang/OTP) and a monitoring algorithm proved sound and complete, meaning no false positives and no false negatives, with the results mechanised in Coq. The approach is implemented in the tool DDMon for Erlang/OTP applications, and its performance is evaluated. The authors state that this is the first work to formalise, prove correct, and implement distributed black-box monitors for deadlock detection.

📝 Abstract

Many software applications rely on concurrent and distributed (micro)services that interact via message passing and various forms of remote procedure calls (RPC). As these systems organically evolve and grow in scale and complexity, the risk of introducing deadlocks increases and their impact may worsen: even if only a few services deadlock, many other services may block while awaiting responses from the deadlocked ones. As a result, the "core" of the deadlock can be obfuscated by its consequences on the rest of the system, and diagnosing and fixing the problem can be challenging. In this work we tackle the challenge by proposing distributed black-box monitors that are deployed alongside each service and detect deadlocks by only observing the incoming and outgoing messages, and exchanging probes with other monitors. We present a formal model that captures popular RPC-based application styles (e.g., gen_servers in Erlang/OTP), and a distributed black-box monitoring algorithm that we prove sound and complete (i.e., it identifies deadlocked services with neither false positives nor false negatives). We implement our results in a tool called DDMon for the monitoring of Erlang/OTP applications, and we evaluate its performance. This is the first work that formalises, proves the correctness of, and implements distributed black-box monitors for deadlock detection. Our results are mechanised in Coq. DDMon is the companion artifact of this paper.
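To make the distinction between the "core" of a deadlock and its consequences concrete, the following minimal sketch classifies blocked services in a global wait-for map. It is an illustration only, not the paper's formal model: the function name `classify`, the service names, and the assumption that each blocked service awaits exactly one reply (as in a synchronous RPC) are all hypothetical. It also relies on a global view that a distributed monitor would not have.

```python
def classify(waits_for):
    """Split blocked services into a deadlock core and blocked dependents.

    waits_for maps each blocked service to the single service whose reply
    it is awaiting (an assumed single-outstanding-call model).
    """
    core = set()
    for start in waits_for:
        seen = []
        node = start
        # Follow the chain of awaited replies until it ends or repeats.
        while node in waits_for and node not in seen:
            seen.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain closed on itself: everything from node onward is a cycle.
            core.update(seen[seen.index(node):])

    dependents = set()
    for start in waits_for:
        if start in core:
            continue
        node = start
        # Follow the chain until it reaches the core or a running service.
        while node in waits_for and node not in core:
            node = waits_for[node]
        if node in core:
            dependents.add(start)
    return core, dependents


# Hypothetical system: a, b, c wait on each other; d and e pile up behind them;
# f waits on g, which is still running.
waits = {"a": "b", "b": "c", "c": "a", "d": "a", "e": "d", "f": "g"}
core, dependents = classify(waits)
print(sorted(core))        # ['a', 'b', 'c']
print(sorted(dependents))  # ['d', 'e']
```

In this example only three services form the cycle, yet five are stuck, which is the kind of obfuscation the abstract describes.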

Problem

Research questions and friction points this paper is trying to address.

Detecting deadlocks in distributed message-passing systems

Monitoring services without internal system knowledge

Identifying deadlocked services with neither false positives nor false negatives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed black-box monitors deployed alongside each service

Observing incoming and outgoing messages only

Exchanging probes between monitors for detection
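The three ideas above can be sketched with a classic edge-chasing probe scheme, in the style of Chandy, Misra and Haas. This is a swapped-in textbook technique used for illustration, not DDMon's algorithm or probe protocol, and it makes no attempt to reproduce the soundness and completeness guarantees proved in the paper. The `Monitor` class, its method names, and the in-process `deque` standing in for the network are all assumptions.

```python
from collections import deque


class Monitor:
    """Hypothetical black-box monitor: it sees only calls, replies, and probes."""

    def __init__(self, name, network):
        self.name = name
        self.network = network    # shared queue of (destination, probe) pairs
        self.waiting_on = None    # callee whose reply the service awaits
        self.epoch = 0            # distinguishes successive waits of this service
        self.forwarded = set()    # probes already passed on, to stop circulation
        self.deadlocked = False

    def on_outgoing_call(self, callee):
        # Observed: the service sent a request and now blocks on the reply.
        self.waiting_on = callee
        self.epoch += 1
        self.network.append((callee, (self.name, self.epoch)))

    def on_incoming_reply(self):
        # Observed: the awaited reply arrived, so the service is running again.
        self.waiting_on = None

    def on_probe(self, probe):
        initiator, epoch = probe
        if self.waiting_on is None:
            return  # a running service breaks the wait chain: drop the probe
        if initiator == self.name:
            # Our own probe came back while we are still in the same wait.
            if epoch == self.epoch:
                self.deadlocked = True
            return
        if probe not in self.forwarded:
            self.forwarded.add(probe)
            self.network.append((self.waiting_on, probe))


def run(monitors, network):
    # Deliver queued probes until none remain.
    while network:
        destination, probe = network.popleft()
        monitors[destination].on_probe(probe)


network = deque()
monitors = {n: Monitor(n, network) for n in "abcd"}
for caller, callee in [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]:
    monitors[caller].on_outgoing_call(callee)
run(monitors, network)
print(sorted(n for n, m in monitors.items() if m.deadlocked))  # ['a', 'b', 'c']
```

In this sketch each monitor keeps only local state and learns about the rest of the system through probes alone. Service `d` is blocked behind the cycle but is not flagged, because its probe never returns to it; how such dependents are reported is a design choice this sketch leaves open.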
