🤖 AI Summary
This work targets deadlock, livelock, and task failure in multi-agent systems caused by flawed coordination protocols, and proposes a verification-first approach to automatic protocol synthesis. A large language model (LLM) generates a protocol topology and a PlusCal specification, which the TLA+ model checker (TLC) then verifies. When verification fails, the counterexample drives an iterative repair until the protocol passes. Verified process bodies are compiled into per-agent prompts and executed under a runtime monitor that rejects coordination operations outside the topology. This study presents the first integration of LLMs with TLA+ to enable verifiable and repairable protocol generation. In the experiments, all 48 evaluated tasks pass formal verification (62.5% on the first attempt), topology-monitored execution reaches an average task completion rate of 89.4%, and TLC-verified protocols reduce deadlock/livelock incidence from 31.1% to 14.1%. When model capability is reduced, runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines.
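The counterexample-driven repair loop summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `synthesize`, `CheckResult`, and the three callables (`generate`, `check`, `repair`) are hypothetical names standing in for the LLM generation step, the TLC run, and the LLM repair step. The default budget of four repairs is an assumption chosen only to mirror the reported maximum.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    """Outcome of one model-checking run (hypothetical shape)."""
    ok: bool
    counterexample: Optional[str] = None  # trace text when ok is False


def synthesize(
    task: str,
    generate: Callable[[str], str],
    check: Callable[[str], CheckResult],
    repair: Callable[[str, str], str],
    max_repairs: int = 4,  # assumed budget, not a documented setting
) -> Tuple[str, int]:
    """Return (verified_spec, repairs_used); raise if the budget runs out."""
    spec = generate(task)  # first attempt from the task description
    for repairs_used in range(max_repairs + 1):
        result = check(spec)  # stand-in for a model checker run
        if result.ok:
            return spec, repairs_used
        if repairs_used == max_repairs:
            break  # budget exhausted, do not repair again
        # Feed the counterexample back into the next revision.
        spec = repair(spec, result.counterexample or "")
    raise RuntimeError(f"spec not verified after {max_repairs} repairs")
```

A task that passes on the first attempt returns with `repairs_used == 0`; each failed check consumes one repair before the next check.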
📝 Abstract
We present TraceFix, a verification-first pipeline for Large Language Model (LLM) multi-agent coordination. An agent synthesizes a protocol topology as a structured intermediate representation (IR) from a task description, generates PlusCal coordination logic, and iteratively repairs the protocol using counterexamples from the TLA+ model checker (TLC) until verification succeeds. Verified process bodies are compiled into per-agent system prompts and executed under a runtime monitor that rejects out-of-topology coordination operations. On 48 tasks spanning 16 scenario families, all tasks reach full TLC verification; 62.5% pass on the first attempt and none requires more than four repair iterations. State spaces span six orders of magnitude yet verification completes in under 60 s for every task. A 3,456-run runtime comparison shows that topology-monitored execution achieves the highest task completion (89.4% average, 81.5% full) and that runtimes using the verified protocol degrade at roughly half the rate of prompt-only and chat-only baselines when model capability is reduced. A paired ablation under a fixed runtime shows that TLC-verified protocols cut deadlock/livelock (DL/LL) from 31.1% to 14.1%, with the largest separation under fault injection.
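The runtime monitor that rejects out-of-topology coordination operations can be pictured as a membership check against the verified topology. The sketch below is an assumption-laden illustration: the abstract does not specify the IR format, so the `(sender, receiver, operation)` edge shape, `TopologyMonitor`, and `OutOfTopologyError` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

# Hypothetical edge shape: (sender, receiver, operation).
Edge = Tuple[str, str, str]


class OutOfTopologyError(Exception):
    """Raised when an agent attempts an operation the topology does not allow."""


@dataclass
class TopologyMonitor:
    allowed: FrozenSet[Edge]  # edges taken from the verified topology
    log: List[Edge] = field(default_factory=list)  # accepted operations, in order

    def submit(self, sender: str, receiver: str, operation: str) -> None:
        """Accept the operation if it is in the topology, otherwise reject it."""
        edge = (sender, receiver, operation)
        if edge not in self.allowed:
            raise OutOfTopologyError(f"rejected: {edge}")
        self.log.append(edge)


def demo() -> List[Edge]:
    """Run one allowed and one disallowed operation; return the accepted log."""
    monitor = TopologyMonitor(frozenset({("planner", "worker", "assign")}))
    monitor.submit("planner", "worker", "assign")
    try:
        monitor.submit("worker", "planner", "assign")  # not in the topology
    except OutOfTopologyError:
        pass
    return monitor.log
```

Rejecting the operation outright, rather than logging and continuing, keeps the executed behavior inside the set of behaviors the model checker actually explored.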