🤖 AI Summary
This work addresses the performance limitations of the classic Raft consensus protocol in cross-domain distributed AI systems, where high communication latency severely degrades data synchronization efficiency. To overcome this challenge, the authors propose CD-Raft, a Raft variant optimized for cross-domain environments. CD-Raft positions the leader node carefully and reduces the cross-domain round-trip time (RTT) on the read and write paths, significantly reducing consensus latency while preserving strong consistency. The correctness of the design is verified with a formal TLA+ specification, and a prototype is evaluated with the YCSB benchmark. Experimental results show that CD-Raft reduces average latency by 32.90% and 99th-percentile tail latency by 49.24% compared to classic Raft.
📝 Abstract
Today's massive AI computation workloads drive heavy data synchronization across sites, i.e., nodes in data centers. Any reduction in the resulting consensus latency can significantly improve overall system performance. The challenge is most acute when sites span multiple domains. In this paper, we propose CD-Raft, an optimized Raft protocol that provides strong consistency across cross-domain sites, to address this latency challenge. CD-Raft significantly reduces consensus latency by optimizing the cross-domain round-trip time (RTT) of reads and writes and by carefully positioning the leader node. We verified the correctness of CD-Raft with a formal TLA+ specification, which guarantees strong consistency across sites. We have prototyped CD-Raft and evaluated it using the YCSB benchmark. Empirical results show that, compared to classic Raft, CD-Raft reduces average latency by 32.90% and 99th-percentile tail latency by 49.24% for well-known traces across multiple sites.
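To make the effect of leader placement concrete, the following minimal sketch estimates the commit latency of classic Raft for each candidate leader and picks the best one. It is an illustration only, not CD-Raft's actual algorithm: the site names, the RTT values in `RTT_MS`, and the helper names `quorum_commit_latency` and `best_leader` are all hypothetical, and the model assumes a write commits once the leader has acknowledgements from enough followers to form a majority.

```python
from typing import Dict

# Hypothetical symmetric RTT matrix in milliseconds between five sites.
RTT_MS: Dict[str, Dict[str, float]] = {
    "A": {"A": 0, "B": 10, "C": 80, "D": 150, "E": 200},
    "B": {"A": 10, "B": 0, "C": 70, "D": 140, "E": 190},
    "C": {"A": 80, "B": 70, "C": 0, "D": 60, "E": 120},
    "D": {"A": 150, "B": 140, "C": 60, "D": 0, "E": 50},
    "E": {"A": 200, "B": 190, "C": 120, "D": 50, "E": 0},
}


def quorum_commit_latency(leader: str, rtt_ms: Dict[str, Dict[str, float]]) -> float:
    """Estimate the time until the leader holds acks from a majority.

    Assumption: one round trip per follower, no processing or disk time.
    """
    nodes = list(rtt_ms)
    majority = len(nodes) // 2 + 1
    # The leader counts toward the majority, so it needs majority - 1 follower acks.
    needed = majority - 1
    if needed == 0:
        return 0.0
    follower_rtts = sorted(rtt_ms[leader][n] for n in nodes if n != leader)
    # The commit waits for the slowest of the `needed` fastest followers.
    return float(follower_rtts[needed - 1])


def best_leader(rtt_ms: Dict[str, Dict[str, float]]) -> str:
    """Return the site whose estimated quorum commit latency is lowest."""
    return min(rtt_ms, key=lambda node: quorum_commit_latency(node, rtt_ms))


if __name__ == "__main__":
    for site in RTT_MS:
        print(site, quorum_commit_latency(site, RTT_MS))
    print("best leader:", best_leader(RTT_MS))
```

Under this toy model, a leader at site E waits 120 ms per commit while a leader at site D waits only 60 ms, which shows why leader position alone can change consensus latency substantially when sites are far apart.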