Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory-bandwidth bottleneck in billion-scale graph-based approximate nearest neighbor search (ANNS). Existing processing-in-memory (PIM) approaches struggle to deliver both high recall and high throughput because each processing unit has little local memory, communication is costly, and the in-memory compute units are weak; as a result, prior PIM designs have fallen back on cluster-based indexes with a lower recall ceiling. Through algorithm-architecture co-design, the authors introduce a compacted index layout that shrinks the PIM-resident memory footprint by 14.5×, an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated, and a multiplication-free distance kernel that loses under 0.08% recall. Together these let a PIM system run graph-based ANNS instead of falling back on cluster-based indexing. Evaluated on three billion-scale benchmarks, the system achieves up to 20× and 17.1× higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129× in the high-recall regime, and scales across multi-node deployments and emerging PIM architectures.
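To make the memory-bound nature of graph traversal concrete, here is a minimal sketch of the generic best-first (beam) search used by graph-based ANNS indexes. It is not the paper's implementation: the names `beam_search`, `vectors`, `neighbors`, and `beam_width` are illustrative assumptions, and the toy graph is a simple chain. Each expansion jumps to whatever neighbor IDs the current node stores, which is the source of the irregular memory accesses mentioned above.

```python
import heapq

def squared_l2(a, b):
    # Plain squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def beam_search(vectors, neighbors, query, entry, k, beam_width):
    """Generic best-first search over a proximity graph (illustrative only).

    Returns (ids of the k closest nodes found, order in which nodes were expanded).
    """
    visited = {entry}
    d0 = squared_l2(vectors[entry], query)
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap (negated distance), at most beam_width entries
    order = []
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= beam_width and dist > -results[0][0]:
            break
        order.append(node)
        for nb in neighbors[node]:   # data-dependent, irregular accesses
            if nb in visited:
                continue
            visited.add(nb)
            d = squared_l2(vectors[nb], query)
            if len(results) < beam_width or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > beam_width:
                    heapq.heappop(results)   # drop the current worst result
    top = sorted((-nd, n) for nd, n in results)[:k]
    return [n for _, n in top], order

# Toy example: six 1-D points connected as a chain 0-1-2-3-4-5.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
ids, order = beam_search(vectors, neighbors, [4.2], entry=0, k=2, beam_width=3)
```

In this sketch every visited neighbor costs one vector fetch and one distance computation, so throughput is governed by how quickly vectors and adjacency lists can be read rather than by arithmetic.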
📝 Abstract
Approximate Nearest Neighbor Search (ANNS) is a core primitive in modern AI systems, and graph-based methods currently offer the best accuracy-efficiency trade-off at scale. The workload is fundamentally memory-bound: graph traversal produces frequent, irregular memory accesses that cap CPU throughput at main-memory bandwidth, while GPUs lack the high-bandwidth memory capacity to host billion-scale indexes. Processing-in-Memory (PIM) is a natural candidate, as placing computation next to data unlocks the abundant internal bandwidth that such bandwidth-starved workloads demand. Porting graph-based ANNS to PIM, however, exposes several architectural mismatches: each processing unit has only a small local memory, inter-unit communication is costly, host coordination adds overhead, and in-memory compute units are relatively weak. These limitations have forced prior PIM-based ANNS designs to fall back on cluster-based indexing, whose recall ceiling is far below that of graph methods. This paper presents an algorithm-architecture co-design that overcomes these obstacles through three components: a compacted index layout that shrinks the PIM-resident memory footprint by 14.5×; an asynchronous pipelined scheduler that keeps the host-to-PIM interconnect saturated; and a multiplication-free distance kernel that loses under 0.08% recall. Across three billion-scale benchmarks, the proposed design achieves up to 20× and 17.1× higher throughput than CPU and GPU baselines, respectively, outperforms prior PIM accelerators by 129× in the high-recall regime, and scales gracefully across multi-node deployments and emerging PIM architectures.
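The abstract does not describe how the multiplication-free distance kernel works. As a loosely related illustration only, the sketch below shows one generic way to avoid multiplications in the inner loop: look up precomputed squares for 8-bit quantized coordinates. The 8-bit quantization, the table `SQUARE_LUT`, and the function `lut_squared_l2` are assumptions made for this example and should not be read as the authors' method.

```python
# Built once ahead of time (for example on the host); building the table uses
# multiplies, but the per-query inner loop below does not.
SQUARE_LUT = [d * d for d in range(256)]

def lut_squared_l2(a, b):
    """Squared L2 distance over uint8 codes using subtract, abs, lookup, and add."""
    total = 0
    for x, y in zip(a, b):
        total += SQUARE_LUT[abs(x - y)]   # table lookup replaces (x - y) * (x - y)
    return total

# Example with two 3-dimensional uint8 vectors.
distance = lut_squared_l2([0, 10, 255], [3, 10, 0])
```

A 256-entry table is small enough to sit in a modest local memory, which is why this kind of trade (memory lookups for arithmetic) can be attractive when compute units are weak. In this sketch any loss in accuracy would come from the quantization step, not from the lookup itself.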
Problem

Research questions and friction points this paper is trying to address.

Approximate Nearest Neighbor Search
Graph-based Indexing
Processing-in-Memory
Billion-Scale
Memory-Bound Workload
Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing-in-Memory
Graph-based ANNS
Algorithm-Architecture Co-Design
Billion-Scale Indexing
Memory-Bound Workload