🤖 AI Summary
To address the data movement bottleneck between CPU and memory in in-memory database analytics, this paper proposes a CPU-PIM collaborative query processing framework that transforms JOIN operations into fine-grained, PIM-friendly filtering tasks directly executable in DRAM. Our key contributions are: (1) the first bank-level DRAM-PIM mapping mechanism enabling fine-grained filtering; (2) a principled CPU-PIM division-of-labor paradigm that preserves system compatibility while balancing parallelism and flexibility; and (3) synergistic optimizations including pre-join denormalization and selective aggregation offloading. End-to-end evaluation on TPC-H and SSB benchmarks shows that our approach achieves 5.92×–6.5× speedup over conventional CPU-only execution, outperforms full denormalization by 3.03×–4.05×, and incurs only 9%–17% additional memory overhead.
📝 Abstract
In-memory database query processing frequently involves substantial data transfers between the CPU and memory, leading to inefficiencies due to the Von Neumann bottleneck. Processing-in-Memory (PIM) architectures offer a viable way to alleviate this bottleneck. In our study, we employ a commonly used software approach that streamlines JOIN operations into simpler selection or filtering tasks via pre-join denormalization, which makes the query processing workload more amenable to PIM acceleration. This research explores the DRAM design landscape to evaluate how efficiently these filtering tasks can be executed across the DRAM hierarchy and their effect on overall application speedup. We also find that operations such as aggregates are more suitably executed on the CPU than in PIM. Thus, we propose a cooperative query processing framework that capitalizes on the strengths of both the CPU and PIM, where (i) the DRAM-based PIM block, with its massive parallelism, supports scan operations, while (ii) the CPU, with its flexible architecture, supports the rest of query execution. This allows us to use PIM and the CPU where each is appropriate and avoids dramatic changes to the overall system architecture. With these minimal modifications, our methodology enables faithful end-to-end performance evaluations using established analytics benchmarks such as TPC-H and the Star Schema Benchmark (SSB). Our findings show that this novel mapping approach improves performance, delivering a 5.92x–6.5x speedup over a traditional schema and a 3.03x–4.05x speedup over a fully denormalized schema with only 9–17% memory overhead, depending on the degree of partial denormalization. Further, we provide insights into query selectivity, memory overheads, and software optimizations in the context of PIM-based filtering, which better explain the behavior and performance of these systems across the benchmarks.
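The core software idea, pre-join denormalization, can be illustrated with a toy example. The sketch below (hypothetical table and column names, SQLite used purely for illustration, not part of the paper's system) materializes the join between a fact table and a dimension table ahead of time, so the query collapses from a JOIN into a pure filter-plus-aggregate over one wide table: the filter is the part amenable to bank-level PIM execution, while the aggregate stays on the CPU.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Toy star schema: a fact table referencing a dimension table.
cur.execute("CREATE TABLE dim_part (p_key INTEGER PRIMARY KEY, p_type TEXT)")
cur.execute("CREATE TABLE fact_sales (s_id INTEGER, p_key INTEGER, revenue INTEGER)")
cur.executemany("INSERT INTO dim_part VALUES (?, ?)",
                [(1, "BRASS"), (2, "STEEL")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(10, 1, 100), (11, 2, 250), (12, 1, 40)])

# Normalized query: needs a JOIN, which maps poorly onto in-DRAM filtering.
joined = cur.execute("""
    SELECT SUM(f.revenue) FROM fact_sales f
    JOIN dim_part d ON f.p_key = d.p_key
    WHERE d.p_type = 'BRASS'
""").fetchone()[0]

# Pre-join denormalization: materialize the join once (paying some memory
# overhead), so the same query becomes a scan/filter plus an aggregate.
cur.execute("""
    CREATE TABLE wide_sales AS
    SELECT f.s_id, f.revenue, d.p_type
    FROM fact_sales f JOIN dim_part d ON f.p_key = d.p_key
""")
denorm = cur.execute(
    "SELECT SUM(revenue) FROM wide_sales WHERE p_type = 'BRASS'"
).fetchone()[0]

assert joined == denorm == 140
```

The memory overhead of the wide table against the speedup of JOIN-free filtering is exactly the trade-off the paper's partial-denormalization results quantify.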