Mycelium: A Transformation-Embedded LSM-Tree

📅 2025-06-10
🤖 AI Summary
Compaction in LSM-trees is essential but incurs substantial I/O overhead and write/read amplification. Conventional approaches also decouple data transformations (such as column-group splitting, format conversion, and index construction) from compaction, which leads to redundant I/O and added latency. This paper proposes the Transformation-Embedded LSM-tree (TE-LSM), a design that embeds user-defined transformations into the compaction pipeline so that data is reorganized and transformed in the same I/O-intensive pass. The prototype, Mycelium, is built on top of RocksDB and extends compaction with a cross-column-family merging mechanism and a transformer interface; the paper explores three transformations (splitting column groups, converting data formats, and building indexes) and provides a cost model analysis. On YCSB workloads, Mycelium incurs a 20% write-throughput overhead, lower than the 35% to 60% overhead of naive schemes that transform data outside of compaction, while improving read latency by up to 425% over the RocksDB baseline.

📝 Abstract
Compaction is a necessary but often costly background process in write-optimized data structures like LSM-trees that reorganizes incoming data that is sequentially appended to logs. In this paper, we introduce Transformation-Embedded LSM-trees (TE-LSM), a novel approach that transparently embeds a variety of data transformations into the compaction process. While many others have sought to reduce the high cost of compaction, TE-LSMs leverage the opportunity to embed other useful work to amortize I/O costs and amplification. We illustrate the use of a TE-LSM in Mycelium, our prototype built on top of RocksDB that extends the compaction process through a cross-column-family merging mechanism. Mycelium enables seamless integration of a transformer interface and aims to better prepare data for future accesses based on access patterns. We use Mycelium to explore three types of transformations: splitting column groups, converting data formats, and index building. In addition to providing a cost model analysis, we evaluate Mycelium's write and read performance using YCSB workloads. Our results show that Mycelium incurs a 20% write throughput overhead, significantly lower than the 35% to 60% overhead observed in naive approaches that perform data transformations outside of compaction, while achieving improvements of up to 425% in read latency compared to the RocksDB baseline.
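To make the idea of embedding a transformation in compaction concrete, the following is a minimal Python sketch, not Mycelium's actual implementation or RocksDB's API. It merges sorted runs, keeps the newest version of each key, and passes every surviving record through a transformer that splits its columns into groups destined for different column families. All names here (`merge_runs`, `split_column_groups`, `compact`, and the family names) are hypothetical and chosen only for illustration.

```python
import heapq
from typing import Callable, Dict, Iterator, List, Tuple

# (key, sequence number, value as a column dict); a hypothetical record layout.
Record = Tuple[str, int, dict]
Transformer = Callable[[Record], Dict[str, Record]]


def merge_runs(runs: List[List[Record]]) -> Iterator[Record]:
    """K-way merge of key-sorted runs, keeping the newest version of each key."""
    # Order by key ascending, then sequence number descending (newest first).
    merged = heapq.merge(*runs, key=lambda r: (r[0], -r[1]))
    last_key = None
    for key, seq, value in merged:
        if key != last_key:  # first occurrence of a key is its newest version
            yield key, seq, value
            last_key = key


def split_column_groups(groups: Dict[str, List[str]]) -> Transformer:
    """Build a transformer that routes each column group to its own family."""
    def transform(record: Record) -> Dict[str, Record]:
        key, seq, value = record
        return {
            family: (key, seq, {c: value[c] for c in cols if c in value})
            for family, cols in groups.items()
        }
    return transform


def compact(runs: List[List[Record]], transformer: Transformer) -> Dict[str, List[Record]]:
    """Merge the runs once and apply the transformer in the same pass."""
    outputs: Dict[str, List[Record]] = {}
    for record in merge_runs(runs):
        for family, out in transformer(record).items():
            outputs.setdefault(family, []).append(out)
    return outputs


runs = [
    [("a", 1, {"name": "x", "age": 1}), ("b", 2, {"name": "y", "age": 2})],
    [("a", 3, {"name": "z", "age": 9})],
]
result = compact(runs, split_column_groups({"cf_name": ["name"], "cf_age": ["age"]}))
print(result["cf_name"])
print(result["cf_age"])
```

The point of the sketch is that the transformation reuses the read and write pass compaction already performs, rather than rereading the data in a separate job.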
Problem

Research questions and friction points this paper is trying to address.

Reducing high cost of LSM-tree compaction
Embedding data transformations during compaction
Improving read latency while limiting write-throughput overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embeds data transformations into LSM-tree compaction
Extends compaction with a cross-column-family merging mechanism
Improves read latency by up to 425% over the RocksDB baseline
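Index building, one of the three transformations the paper explores, can be pictured in the same way. The sketch below is an assumption-laden illustration rather than the paper's method: while rows stream out of a compaction pass, it also collects secondary-index entries for one column, which would then be written to a separate column family. The function name `compact_with_index` and the record layout are hypothetical.

```python
from typing import Iterable, List, Tuple

# (primary key, row as a column dict); a hypothetical record layout.
Row = Tuple[str, dict]


def compact_with_index(
    sorted_rows: Iterable[Row], indexed_column: str
) -> Tuple[List[Row], List[Tuple[str, str]]]:
    """Single pass over compaction output that also emits index entries."""
    data_out: List[Row] = []                    # rows for the data family
    index_out: List[Tuple[str, str]] = []       # (indexed value, primary key)
    for key, row in sorted_rows:
        data_out.append((key, row))
        if indexed_column in row:               # rows lacking the column are not indexed
            index_out.append((str(row[indexed_column]), key))
    # The index family is keyed by the indexed value, so sort before writing.
    index_out.sort()
    return data_out, index_out


rows = [
    ("k1", {"city": "Oslo"}),
    ("k2", {"city": "Bern"}),
    ("k3", {"zip": "1"}),
]
data, index = compact_with_index(rows, "city")
print(index)
```

Here the index entries are produced from rows that compaction has already read, so no extra scan of the data is needed to build them.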
Holly Casaletto
UC Santa Cruz
Jeff Lefevre
UC Santa Cruz
Aldrin Montana
UC Santa Cruz
Peter Alvaro
Associate Professor of Computer Science, UC Santa Cruz
Distributed Systems, Data Management Systems, Operating Systems