🤖 AI Summary
Distributing a deep learning model across GPUs can introduce bugs that cause its outputs to diverge from those of the original sequential model.
Method: This paper proposes a static verification approach that formalizes the "model refinement" relation: whether the sequential model's outputs can be reconstructed from the distributed implementation's outputs.
Contribution/Results: The core technique is iterative rewriting, implemented in the GraphGuard system, which is used to prove model refinement. The approach scales to today's large models and deployments, with an evaluation on GPT and Llama-3, and it produces actionable output that helps localize the bug responsible for a divergence.
📝 Abstract
Distributed machine learning training and inference are common today because large models require more memory and compute than a single GPU can provide. Distributed models are generally produced by programmers who take a sequential model specification and apply several distribution strategies to spread state and computation across GPUs. Unfortunately, bugs can be introduced in the process, and a distributed model implementation's outputs might differ from the sequential model's outputs. In this paper, we describe an approach to statically identify such bugs by checking model refinement, that is, whether the sequential model's outputs can be reconstructed from the distributed model's outputs. Our approach, implemented in GraphGuard, uses iterative rewriting to prove model refinement. It scales to today's large models and deployments: we evaluate it using GPT and Llama-3. Further, it provides actionable output that aids in bug localization.
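To make the refinement condition concrete, the following is a minimal sketch in plain Python, not GraphGuard code. It assumes a toy one-layer model whose weight matrix is split column-wise across two simulated GPUs, and every function name in it is hypothetical. It also checks refinement dynamically on one input, whereas the approach described above is static; the point is only to show what "reconstructing the sequential outputs" means and how a sharding bug breaks it.

```python
# Toy illustration of model refinement. All names are hypothetical;
# this is a runtime check on one input, not the paper's static analysis.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated GPU."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w]
            for i in range(parts)]

def concat_columns(shards):
    """Reconstruction map: join per-GPU outputs along the column axis."""
    return [sum(rows, []) for rows in zip(*shards)]

def sequential_model(x, w):
    """Reference model: a single matrix multiplication."""
    return matmul(x, w)

def distributed_model(x, w, parts=2):
    """Each simulated GPU multiplies x by its own column shard of w."""
    return [matmul(x, shard) for shard in split_columns(w, parts)]

def buggy_distributed_model(x, w, parts=2):
    """Injected bug: every simulated GPU uses the first shard."""
    shards = split_columns(w, parts)
    return [matmul(x, shards[0]) for _ in shards]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

expected = sequential_model(x, w)
ok = concat_columns(distributed_model(x, w)) == expected
bad = concat_columns(buggy_distributed_model(x, w)) == expected
print(ok, bad)  # True False
```

In this sketch, `concat_columns` plays the role of the reconstruction map: refinement holds for the correct sharding because concatenating the per-GPU outputs yields the sequential output, and it fails for the buggy variant because no output carries the second shard's columns.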