Verifying Computational Graphs in Production-Grade Distributed Machine Learning Frameworks

📅 2025-09-12

🤖 AI Summary

Distributed machine learning frameworks rely on parallelization and optimization techniques that can introduce silent correctness errors, severely degrading model performance; existing verification approaches are either ad hoc or too costly for production use. This paper proposes Scalify, a lightweight framework that verifies the semantic equivalence of computational graphs. Scalify combines equality saturation with Datalog-style relational reasoning and symbolic bijection inference to check graph equivalence. To scale, it reuses rewrite templates, memoizes verification results across layers, and partitions graphs for parallel rewriting. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and localizes discrepancies to precise code sites. It exposed five previously unknown bugs in Amazon production machine learning frameworks.

📝 Abstract

Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity, introducing silent errors that severely degrade model performance. Existing solutions are either ad hoc or too costly for production. We present Scalify, a lightweight framework that exposes silent errors by verifying semantic equivalence of computational graphs using equality saturation and Datalog-style reasoning. To scale, Scalify partitions graphs with parallel rewriting and layer memoization, reuses rewrite templates, and augments equality saturation with relational reasoning and symbolic bijection inference. It further localizes discrepancies to precise code sites, turning verification results into actionable debugging guidance. Scalify verifies models as large as Llama-3.1-405B within minutes on a commodity machine and exposed five unknown bugs in Amazon production machine learning frameworks.
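To make the idea of checking semantic equivalence by rewriting more concrete, here is a minimal Python sketch. It is not Scalify's implementation: it enumerates rewritten terms naively instead of using an e-graph, and the term encoding, the two rewrite rules, and every function name (split_matmul, push_relu, saturate, equivalent) are assumptions made for illustration. It compares a single-device graph with a version whose weight matrix is split across two shards.

```python
from typing import Callable, Iterator, Set, Union

# A term is either a leaf name (str) or a tuple (op, arg1, arg2, ...).
Term = Union[str, tuple]
Rule = Callable[[Term], Iterator[Term]]

def split_matmul(t: Term) -> Iterator[Term]:
    """matmul(x, concat(a, b)) -> concat(matmul(x, a), matmul(x, b))."""
    if isinstance(t, tuple) and t[0] == "matmul":
        x, w = t[1], t[2]
        if isinstance(w, tuple) and w[0] == "concat":
            yield ("concat", ("matmul", x, w[1]), ("matmul", x, w[2]))

def push_relu(t: Term) -> Iterator[Term]:
    """relu(concat(a, b)) -> concat(relu(a), relu(b)); relu is elementwise."""
    if isinstance(t, tuple) and t[0] == "relu":
        inner = t[1]
        if isinstance(inner, tuple) and inner[0] == "concat":
            yield ("concat", ("relu", inner[1]), ("relu", inner[2]))

def rewrite_once(t: Term, rules) -> Iterator[Term]:
    """Yield every term obtained by applying one rule at one position in t."""
    for rule in rules:
        yield from rule(t)
    if isinstance(t, tuple):
        op, *args = t
        for i, arg in enumerate(args):
            for new_arg in rewrite_once(arg, rules):
                yield (op, *args[:i], new_arg, *args[i + 1:])

def saturate(start: Term, rules, max_terms: int = 10_000) -> Set[Term]:
    """Collect all terms reachable from start, up to a size budget."""
    seen = {start}
    frontier = [start]
    while frontier and len(seen) < max_terms:
        next_frontier = []
        for t in frontier:
            for u in rewrite_once(t, rules):
                if u not in seen:
                    seen.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return seen

def equivalent(a: Term, b: Term, rules) -> bool:
    """Two graphs are accepted as equivalent if their rewrite closures meet."""
    return not saturate(a, rules).isdisjoint(saturate(b, rules))

RULES = [split_matmul, push_relu]

# Reference graph: one device holds the whole weight matrix.
single_device = ("relu", ("matmul", "x", ("concat", "w1", "w2")))
# Parallelized graph: each shard applies its own slice of the weights.
sharded = ("concat", ("relu", ("matmul", "x", "w1")),
                     ("relu", ("matmul", "x", "w2")))
# Hypothetical silent error: the second shard reuses w1 instead of w2.
buggy = ("concat", ("relu", ("matmul", "x", "w1")),
                   ("relu", ("matmul", "x", "w1")))

print(equivalent(single_device, sharded, RULES))
print(equivalent(single_device, buggy, RULES))
```

A real equality-saturation engine stores equivalent terms compactly in shared equivalence classes rather than enumerating them one by one, which is what makes the approach practical on large graphs. Note also that a failed check in this sketch only means no proof was found under the given rules and budget.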

Problem

Research questions and friction points this paper is trying to address.

Detecting silent errors in large-scale computational graphs

Verifying semantic equivalence in distributed ML frameworks

Providing actionable debugging guidance for production systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses equality saturation and Datalog reasoning

Employs parallel rewriting with layer memoization

Augments with relational reasoning and symbolic inference
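Layer memoization can be pictured with a short Python sketch. This is an assumption about the general idea, not a description of Scalify's actual mechanism: large models repeat the same layer structure many times, so a verification result computed for one layer can be reused for every structurally identical layer. The names canonicalize, normalize, verify_canonical, and verify_model are hypothetical, and the equivalence check is a stand-in that only knows addition is commutative.

```python
from functools import lru_cache

def canonicalize(term, names=None):
    """Rename leaves to v0, v1, ... in order of first appearance, so that
    layers differing only in tensor names map to the same cache key."""
    names = {} if names is None else names
    if isinstance(term, str):
        return names.setdefault(term, f"v{len(names)}")
    op, *args = term
    return (op, *(canonicalize(a, names) for a in args))

def normalize(term):
    """Stand-in equivalence check: sort the operands of commutative 'add'."""
    if isinstance(term, str):
        return term
    op, *args = term
    args = [normalize(a) for a in args]
    if op == "add":
        args.sort(key=repr)
    return (op, *args)

calls = 0  # counts how often the (potentially expensive) check really runs

@lru_cache(maxsize=None)
def verify_canonical(pair):
    """Verify one canonical (reference, transformed) layer pair."""
    global calls
    calls += 1
    _, ref, out = pair
    return normalize(ref) == normalize(out)

def verify_model(ref_layers, out_layers):
    """Verify layer by layer, reusing cached results for repeated structure."""
    return [
        verify_canonical(canonicalize(("pair", r, o)))
        for r, o in zip(ref_layers, out_layers)
    ]

# Four layers with identical structure but different tensor names.
ref_layers = [("add", ("matmul", f"x{i}", f"w{i}"), f"b{i}") for i in range(4)]
out_layers = [("add", f"b{i}", ("matmul", f"x{i}", f"w{i}")) for i in range(3)]
# Hypothetical silent error: the last layer multiplies by the wrong weight.
out_layers.append(("add", "b3", ("matmul", "x3", "w0")))

results = verify_model(ref_layers, out_layers)
print(results, calls)
```

The reference and transformed graphs are canonicalized together as one pair so that shared tensors receive the same placeholder on both sides. The first three layers collapse to a single cache entry, and the index of the failing entry in the result list points at the layer to inspect.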
