🤖 AI Summary
This work addresses the trade-off between consistency and performance in distributed databases under cross-partition transactions, where traditional two-phase commit (2PC) protocols suffer from high coordination overhead, high latency, and complex recovery during dynamic partition migration. To overcome these limitations, the authors propose a tree-shaped 2PC framework for OceanBase that treats log streams as atomic commit units and organizes participants into a coordinator-rooted commit tree with a DAG topology. The protocol runs the commit process recursively over this tree and introduces two uncertainty states, prepare-unknown and trans-unknown, to prevent consistency violations when a participant has lost its transaction context. The design removes the need for explicit participant list updates and handles partition migration during a transaction as part of the protocol itself. Experimental results show reduced latency and bandwidth consumption, with transaction performance approaching that of single-machine transactions while transactional consistency is preserved.
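To make the recursive, tree-structured prepare phase more concrete, here is a minimal Python sketch. It is not OceanBase's implementation: the `Node`, `Vote`, `prepare`, and `decide` names are hypothetical, and the rule that a prepare-unknown reply leads to a retry rather than an abort is an assumption of this sketch based on the summary's description.

```python
from dataclasses import dataclass, field
from enum import Enum


class Vote(Enum):
    PREPARE_OK = "prepare-ok"
    PREPARE_ABORT = "prepare-abort"
    # Reply from a participant that no longer holds the transaction context.
    PREPARE_UNKNOWN = "prepare-unknown"


@dataclass
class Node:
    """One log stream in the commit tree (hypothetical model)."""
    name: str
    local_vote: Vote = Vote.PREPARE_OK
    # Children are downstream participants, e.g. the log stream a
    # partition migrated to while the transaction was running.
    children: list["Node"] = field(default_factory=list)


def prepare(node: Node) -> Vote:
    """Recursively collect votes for the subtree rooted at `node`."""
    votes = [node.local_vote] + [prepare(child) for child in node.children]
    if Vote.PREPARE_ABORT in votes:
        return Vote.PREPARE_ABORT
    if Vote.PREPARE_UNKNOWN in votes:
        # Assumption: uncertainty is propagated upward, not turned into abort.
        return Vote.PREPARE_UNKNOWN
    return Vote.PREPARE_OK


def decide(root: Node) -> str:
    """Coordinator decision at the root of the tree (sketch only)."""
    vote = prepare(root)
    if vote is Vote.PREPARE_OK:
        return "commit"
    if vote is Vote.PREPARE_ABORT:
        return "abort"
    return "retry"  # assumption: an unknown subtree leaves the outcome undecided
```

In this model the coordinator needs no flat participant list: it only talks to its direct children, and each child is responsible for its own subtree, which is how a migration target can join the transaction as a new leaf.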
📝 Abstract
Modern distributed databases face challenges in achieving transactional consistency across distributed partitions. Traditional two-phase commit (2PC) protocols incur high coordination overhead and latency, and require complex recovery for dynamic partition transfers. This paper introduces a novel tree-shaped 2PC framework for OceanBase that leverages single-machine log streams to address these challenges through three innovations. First, we propose log streams as atomic participants, replacing partition-level coordination. By treating each log stream as the commit unit, a transaction spanning $N$ co-located partitions interacts with one participant, reducing coordination overhead by orders of magnitude (e.g., a 99 percent reduction in participants for $N=100$). Second, we design a tree-shaped 2PC protocol with a coordinator-rooted DAG topology that dynamically handles partition transfers by recursively constructing commit trees. When a partition migrates during a transaction, the protocol embeds migration contexts as leaf nodes, eliminating explicit participant list updates, resolving circular dependencies, and ensuring linearizable commits under topology changes. Third, we introduce prepare-unknown and trans-unknown states to prevent consistency violations when participants lose context. These states signal uncertainty during retries, avoiding erroneous aborts from so-called lying participants while isolating users from ambiguity. Experimental evaluation demonstrates performance approaching that of single-machine transactions, with reduced latency and bandwidth consumption, validating the framework's effectiveness for modern distributed databases.
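The participant arithmetic behind the first innovation can be sketched in a few lines of Python. The helper names and the `placement` mapping are hypothetical and chosen only for illustration; the sketch simply groups partitions by the log stream that hosts them and compares participant counts, reproducing the 99 percent figure for $N=100$ co-located partitions.

```python
from collections import defaultdict


def participants_by_log_stream(partition_to_stream: dict[str, str]) -> dict[str, list[str]]:
    """Group the partitions touched by a transaction by their hosting log stream."""
    groups: dict[str, list[str]] = defaultdict(list)
    for partition, stream in partition_to_stream.items():
        groups[stream].append(partition)
    return dict(groups)


def participant_reduction(num_partitions: int, num_streams: int) -> float:
    """Fraction of commit participants saved by coordinating per log stream."""
    return 1.0 - num_streams / num_partitions


# Hypothetical placement: 100 partitions, all hosted by the same log stream.
placement = {f"p{i}": "ls_1" for i in range(100)}
groups = participants_by_log_stream(placement)
reduction = participant_reduction(len(placement), len(groups))
print(len(groups), f"{reduction:.0%}")  # prints "1 99%"
```

The saving depends entirely on co-location: if the same 100 partitions were spread over 10 log streams, the same calculation would give 10 participants instead of 1.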