🤖 AI Summary
Problem: Join operations on LSM-tree–based storage systems (e.g., RocksDB) suffer from poor performance, high resource overhead, and a lack of theoretical foundations.
Method: This paper introduces the first LSM-aware join classification framework and a theoretical cost model that accounts for read-write asymmetry, revealing why conventional relational join strategies fail in LSM contexts. It also designs and implements a RocksDB extension that supports Nested Loop, Sort-Merge, and Index Nested Loop joins, is compatible with both B⁺-tree and LSM-based inverted indexes, and integrates Snapshot Isolation and Read Committed consistency semantics.
Contribution/Results: Under typical OLAP workloads, the optimal configuration reduces end-to-end join latency by up to 5.3×. Our theoretical cost model achieves a mean absolute error of ≤12%, significantly enhancing predictability and optimizability of join performance on LSM storage.
📝 Abstract
LSM-tree-based data stores are widely adopted in industry for their excellent performance. As data volumes grow, disk-based join operations become indispensable yet costly for the database, making the selection of suitable join methods crucial for system optimization. Current LSM-based stores generally follow conventional relational database practice and support only a limited number of join methods. However, the LSM-tree delivers read and write efficiency distinct from that of relational databases, which can in turn affect the performance of the various join methods. It is therefore necessary to reconsider the selection of join methods in this context to fully explore the potential of various join algorithms and index designs. In this work, we present a systematic study and an exhaustive benchmark for joins over LSM-trees. We define a configuration space for join methods, encompassing various join algorithms, secondary index types, and consistency strategies. We also provide a theoretical analysis of the overhead of each join method for an in-depth understanding. Furthermore, we implement all join methods in the configuration space on a unified platform and compare their performance through extensive experiments. Our theoretical and experimental results yield several insights and takeaways tailored to joins in LSM-based stores that help developers choose appropriate join methods for their operating conditions.
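To make the idea of a configuration space concrete, the following is a minimal Python sketch that enumerates combinations of join algorithm, secondary index type, and consistency strategy. The option names are borrowed from the summary on this page; the `NONE` index option, the `JoinConfig` and `enumerate_configs` names, and the validity rule are illustrative assumptions, not the paper's actual definitions.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class JoinAlgorithm(Enum):
    NESTED_LOOP = "nested-loop"
    SORT_MERGE = "sort-merge"
    INDEX_NESTED_LOOP = "index-nested-loop"


class SecondaryIndex(Enum):
    NONE = "none"  # assumption: "no secondary index" is modeled as an option
    BPLUS_TREE = "b+-tree"
    LSM_INVERTED = "lsm-inverted"


class Consistency(Enum):
    SNAPSHOT_ISOLATION = "snapshot-isolation"
    READ_COMMITTED = "read-committed"


@dataclass(frozen=True)
class JoinConfig:
    """One point in the (hypothetical) join configuration space."""

    algorithm: JoinAlgorithm
    index: SecondaryIndex
    consistency: Consistency

    def is_valid(self) -> bool:
        # Assumed rule: an index nested loop join needs an index to probe.
        if self.algorithm is JoinAlgorithm.INDEX_NESTED_LOOP:
            return self.index is not SecondaryIndex.NONE
        return True


def enumerate_configs() -> list[JoinConfig]:
    """Return every valid combination of the three dimensions."""
    candidates = (
        JoinConfig(algorithm, index, consistency)
        for algorithm, index, consistency in product(
            JoinAlgorithm, SecondaryIndex, Consistency
        )
    )
    return [config for config in candidates if config.is_valid()]


if __name__ == "__main__":
    for config in enumerate_configs():
        print(config.algorithm.value, config.index.value, config.consistency.value)
```

A benchmark harness could iterate over `enumerate_configs()` and run each configuration against the same workload, which is roughly the kind of comparison on a unified platform that the abstract describes.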