Tidehunter: Large-Value Storage With Minimal Data Relocation

๐Ÿ“… 2026-02-02
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the severe write amplification (10โ€“30ร—) in LSM-tree-based storage engines under workloads with large values and uniformly distributed random writes, which stems from frequent compaction. The authors propose a novel storage engine that treats the write-ahead log (WAL) as the permanent storage medium, maintaining only a lightweight index mapping keys to WAL offsets and thereby eliminating value data compaction. Key innovations include a lock-free write path, an optimistic index structure optimized for uniform key distributions, and a non-blocking periodic space reclamation mechanism. The design further leverages atomic allocation and parallel copying to maximize NVMe throughput, complemented by lazy index refresh, single-pass lookups, and epoch-based pruning to accelerate queries. Evaluated on a 1TB dataset, the system achieves 830K writes per secondโ€”8.4ร— faster than RocksDB and 2.9ร— faster than BlobDBโ€”with 1.7ร— and 15.6ร— improvements in point lookup and existence check latency, respectively, and has been stably deployed in the Sui blockchain.
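The core idea, a WAL that doubles as permanent value storage with a small index mapping keys to log offsets, can be sketched as follows. This is a hedged, minimal illustration, not Tidehunter's actual design or API: all names are hypothetical, and a plain mutex stands in where the paper uses lock-free atomic allocation and parallel copying.

```python
import os
import struct
import threading

class WalStore:
    """Append-only log as permanent value storage, plus a key -> offset index.

    Illustrative sketch only: Tidehunter uses lock-free atomic allocation
    where this uses a mutex, and persists its index tables lazily.
    """

    def __init__(self, path):
        self.f = open(path, "a+b")        # the log is the value's final resting place
        self.index = {}                   # key -> (offset, length) into the log
        self.lock = threading.Lock()      # stand-in for lock-free tail allocation

    def put(self, key, value):
        record = struct.pack(">I", len(value)) + value   # length-prefixed record
        with self.lock:
            offset = self.f.seek(0, os.SEEK_END)         # reserve space at the tail
            self.f.write(record)                         # values are never relocated
        self.index[key] = (offset + 4, len(value))       # skip 4-byte length prefix

    def get(self, key):
        loc = self.index.get(key)
        if loc is None:
            return None
        offset, length = loc
        self.f.seek(offset)               # single-roundtrip lookup: one seek, one read
        return self.f.read(length)
```

Because values never move after being appended, there is no value compaction and hence no compaction-driven write amplification; only the small index needs maintenance.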

๐Ÿ“ Abstract
Log-Structured Merge-Trees (LSM-trees) dominate persistent key-value storage but suffer 10x to 30x write amplification under random workloads due to repeated compaction. This overhead becomes prohibitive for large values with uniformly distributed keys, a workload common in content-addressable storage, deduplication systems, and blockchain validators. We present Tidehunter, a storage engine that eliminates value compaction by treating the Write-Ahead Log (WAL) as permanent storage rather than a temporary recovery buffer. Values are never overwritten; small, lazily-flushed index tables map keys to WAL positions. Tidehunter introduces (a) lock-free writes that saturate NVMe drives through atomic allocation and parallel copying, (b) an optimistic index structure that exploits uniform key distributions for single-roundtrip lookups, and (c) epoch-based pruning that reclaims space without blocking writes. On a 1 TB dataset with 1 KB values, Tidehunter achieves 830K writes per second (8.4x higher than RocksDB and 2.9x higher than BlobDB) while improving point queries by 1.7x and existence checks by 15.6x. We validate real-world impact by integrating Tidehunter into Sui, a high-throughput blockchain, where it maintains stable throughput and latency under loads that cause RocksDB-backed validators to collapse. Tidehunter is production-ready and deployed within Sui.
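The epoch-based pruning mechanism from point (c) can be illustrated with a small in-memory sketch. All names here are hypothetical and the dict-of-dicts stands in for on-disk log segments: values are tagged with the epoch in which they were written, and reclaiming the oldest epoch copies forward only still-live values, so stale versions are dropped without a global compaction pass and without blocking writers.

```python
from collections import defaultdict

class EpochStore:
    """Illustrative epoch-based space reclamation (not Tidehunter's interface)."""

    def __init__(self):
        self.epoch = 0
        self.segments = defaultdict(dict)  # epoch -> {key: value}, a stand-in for log segments
        self.index = {}                    # key -> epoch holding the latest version

    def put(self, key, value):
        self.segments[self.epoch][key] = value
        self.index[key] = self.epoch

    def get(self, key):
        e = self.index.get(key)
        return None if e is None else self.segments[e][key]

    def advance_epoch(self):
        self.epoch += 1                    # close the current segment, open a new one

    def prune_oldest(self):
        closed = [e for e in self.segments if e < self.epoch]
        if not closed:
            return
        oldest = min(closed)
        for key, value in self.segments.pop(oldest).items():
            if self.index.get(key) == oldest:  # still live: relocate once to the tail
                self.put(key, value)
        # overwritten (stale) versions are simply discarded with the segment
```

Only live values from the reclaimed epoch are rewritten, which is what keeps relocation, and thus write amplification, minimal for rarely-updated large values.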
Problem

Research questions and friction points this paper is trying to address.

write amplification
LSM-tree
large-value storage
compaction
random workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

LSM-tree
write amplification
WAL-as-storage
lock-free writes
epoch-based pruning
๐Ÿ”Ž Similar Papers
No similar papers found.