Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of conventional B+ tree search on FPGAs, which suffers from frequent memory accesses, low node reuse, and limited parallelism. The authors propose a batched B+ tree search method tailored to FPGAs that processes multiple search keys together, level by level. This approach improves reuse of loaded tree nodes and reduces costly global memory accesses. A configurable search kernel, developed with high-level synthesis (HLS), supports variable batch sizes, customizable node sizes, and arbitrary tree depths. The design is implemented on an AMD Alveo U250 and compares search keys in parallel on the FPGA. With a batch of 1000 search keys, a B+ tree of one million entries, and a tree order of 16, the single-kernel FPGA design achieves a 4.9× speedup over a single-threaded CPU baseline, while four kernel instances running in parallel outperform a 16-thread CPU implementation by 2.1×.

📝 Abstract

This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently traverses and reuses tree nodes by processing a batch of search keys level by level. This approach reduces costly global memory accesses, improves reuse of loaded B+ tree nodes, and enables parallel search key comparisons directly on the FPGA. Using a high-level synthesis (HLS) approach, we developed a highly flexible and configurable search kernel design supporting variable batch sizes, customizable node sizes, and arbitrary tree depths. The final design was implemented on an AMD Alveo U250 Data Center Accelerator Card, and was evaluated against the B+ tree search algorithm from the TLX library running on an AMD EPYC 7542 processor (2.9 GHz). With a batch size of 1000 search keys, a B+ tree containing one million entries, and a tree order of 16, we measured a 4.9× speedup for the single-kernel FPGA design compared to a single-threaded CPU implementation. Running four kernel instances in parallel on the FPGA resulted in a 2.1× performance improvement over a CPU implementation using 16 threads.
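
The level-by-level batch traversal described in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the authors' HLS kernel: the names `Node`, `build_tree`, and `batch_search` are hypothetical, the tree is bulk-loaded from sorted, non-empty input, and grouping keys by node with a dictionary stands in for whatever batching mechanism the actual design uses. The `node_loads` counter models how often a node would have to be fetched from global memory.

```python
from bisect import bisect_right


class Node:
    """Toy B+ tree node (hypothetical layout, for illustration only)."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # separator keys (inner) or entry keys (leaf)
        self.children = children  # child nodes for inner nodes, else None
        self.values = values      # stored values for leaves, else None

    @property
    def is_leaf(self):
        return self.children is None


def build_tree(items, order=4):
    """Bulk-load a balanced toy tree from sorted, non-empty (key, value) pairs."""
    level = []  # list of (smallest key in subtree, node)
    for i in range(0, len(items), order):
        chunk = items[i:i + order]
        leaf = Node([k for k, _ in chunk], values=[v for _, v in chunk])
        level.append((chunk[0][0], leaf))
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), order):
            group = level[i:i + order]
            # Separator j is the smallest key reachable through child j + 1.
            seps = [smallest for smallest, _ in group[1:]]
            inner = Node(seps, children=[n for _, n in group])
            parents.append((group[0][0], inner))
        level = parents
    return level[0][1]


def batch_search(root, keys):
    """Search all keys level by level; return (results, node_loads)."""
    results = [None] * len(keys)
    # Frontier maps each node of the current level to the batch indices
    # of the search keys that were routed to it.
    frontier = {id(root): (root, list(range(len(keys))))}
    node_loads = 0
    while frontier:
        next_frontier = {}
        for node, indices in frontier.values():
            node_loads += 1  # one fetch serves every key routed to this node
            for i in indices:
                pos = bisect_right(node.keys, keys[i])
                if node.is_leaf:
                    if pos > 0 and node.keys[pos - 1] == keys[i]:
                        results[i] = node.values[pos - 1]
                else:
                    child = node.children[pos]
                    entry = next_frontier.setdefault(id(child), (child, []))
                    entry[1].append(i)
        frontier = next_frontier
    return results, node_loads


if __name__ == "__main__":
    tree = build_tree([(k, k * 10) for k in range(0, 200, 2)], order=4)
    found, loads = batch_search(tree, [0, 2, 3, 198, 100, 101, 4])
    print(found, loads)
```

In this model, a key-by-key search would fetch one node per level for every key, whereas the batched version fetches each distinct node at most once per batch. On an FPGA the inner loop over `indices` is the part that lends itself to parallel comparison; here it runs sequentially for clarity.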

Problem

Research questions and friction points this paper is trying to address.

B+ tree

batch search

FPGA

index structures

memory access

Innovation

Methods, ideas, or system contributions that make the work stand out.

B+ tree

FPGA acceleration

batch search