AccelSync: Verifying Synchronization Coverage in Accelerator Pipeline Programs

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses hardware-visible data races in AI accelerator pipeline programs caused by missing or misplaced synchronization primitives. Such races escape both simulation and golden testing because neither models the accelerator's cross-unit visibility semantics. The authors formalize these programs as a restricted concurrent language and reduce correctness to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by a happens-before relation built from program order, synchronization order, and barrier order. They prove that barrier sufficiency is decidable in $O(|E|^2)$ time and implement AccelSync, a static checker that is sound and complete under the modeled semantics and is instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, the tool identifies 3 previously unknown synchronization hazards; on 120 LLM-generated kernels, it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.
📝 Abstract
AI accelerator operators are compiled into multi-stage pipeline programs where DMA, vector, matrix, and scalar units execute concurrently on shared on-chip buffers. A missing or misplaced synchronization primitive introduces hardware-visible data races that escape both simulation and golden testing, because neither models the accelerator's cross-unit visibility semantics. We formalize accelerator pipeline programs as a restricted concurrent language, define a parameterized hardware event semantics with three ordering relations -- program order, synchronization order, and barrier order -- and reduce the correctness question to barrier sufficiency: whether every cross-unit write-read pair on the same buffer is ordered by happens-before. Here "barrier" denotes an abstract ordering primitive in the model, covering vendor pipe barriers, hard-event synchronization, and equivalent frontend-normalized synchronization points. We prove that barrier sufficiency is decidable in $O(|E|^2)$ time and that our checker is both sound and complete under the modeled semantics. We implement AccelSync, a static verification tool instantiated for Ascend 910B2 and Cambricon MLU370 by changing only the hardware model. On 6,292 production kernels from the CANN operator library, AccelSync identifies 3 previously unknown synchronization hazards -- one matching a hazard class for which we observed nondeterministic outputs on Ascend 910B2 under a specific toolkit/driver configuration (CANN 8.0.RC3), though this observation was not reproducible after a subsequent driver upgrade -- and on 120 LLM-generated kernels it flags a 19.2% defect rate (95% CI: [13.0%, 27.4%]). A mutation study on 688 non-equivalent mutants yields 100% detection, and a head-to-head comparison shows AccelSync detects hazards that Huawei's runtime sanitizer msSanitizer misses, at 400x lower cost per kernel.
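The barrier sufficiency question described in the abstract can be illustrated with a minimal Python sketch. This is not AccelSync's implementation: the `Event` record, the edge-list encoding of the three ordering relations, and the function names `happens_before` and `uncovered_pairs` are all hypothetical, and the sketch assumes that a pair counts as "ordered" when happens-before relates it in either direction. It also uses a plain per-event graph search, so it makes no claim about matching the paper's complexity bound.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are illustrative only."""
    id: int
    unit: str                     # e.g. "dma", "vector", "matrix", "scalar"
    kind: str                     # "read", "write", or "sync"
    buffer: Optional[str] = None  # on-chip buffer touched, if any


def happens_before(events, po, so, bo):
    """Map each event id to the set of event ids reachable from it through
    the union of program-order, synchronization-order, and barrier-order
    edges (i.e. the transitive closure of that union)."""
    succ = defaultdict(set)
    for a, b in [*po, *so, *bo]:
        succ[a].add(b)
    reach = {}
    for e in events:
        seen, stack = set(), [e.id]
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        reach[e.id] = seen
    return reach


def uncovered_pairs(events, po, so, bo):
    """Return (write_id, read_id) for every cross-unit write-read pair on
    the same buffer that happens-before leaves unordered (assumption:
    ordering in either direction is accepted)."""
    reach = happens_before(events, po, so, bo)
    writes = [e for e in events if e.kind == "write"]
    reads = [e for e in events if e.kind == "read"]
    return [
        (w.id, r.id)
        for w in writes
        for r in reads
        if w.unit != r.unit
        and w.buffer == r.buffer
        and r.id not in reach[w.id]
        and w.id not in reach[r.id]
    ]


# Toy pipeline: the DMA unit writes buffer "ub0" and signals; the vector
# unit waits and then reads "ub0". Buffer and unit names are made up.
events = [
    Event(0, "dma", "write", "ub0"),
    Event(1, "dma", "sync"),
    Event(2, "vector", "sync"),
    Event(3, "vector", "read", "ub0"),
]
po = [(0, 1), (2, 3)]  # program order within each unit

with_sync = uncovered_pairs(events, po, so=[(1, 2)], bo=[])
without_sync = uncovered_pairs(events, po, so=[], bo=[])
```

In this toy case, `with_sync` is empty because the signal/wait edge links the write to the read, while `without_sync` contains the pair `(0, 3)`, which is the kind of missing-synchronization hazard the abstract describes.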
Problem

Research questions and friction points this paper is trying to address.

synchronization
data races
accelerator pipeline
hardware visibility
concurrent programs
Innovation

Methods, ideas, or system contributions that make the work stand out.

barrier sufficiency
accelerator pipeline verification
happens-before analysis
static concurrency checking
synchronization coverage