🤖 AI Summary
This work addresses billion-scale vector similarity search, which on CPUs is constrained by computational overhead and memory bandwidth, while existing 1-bit RaBitQ quantization implementations cannot be ported directly to Neural Processing Units (NPUs). We propose Ascend-RaBitQ, the first IVF-RaBitQ system tailored to heterogeneous NPU-CPU architectures; it decouples coarse ranking on the NPU from fine re-ranking on the host CPU in a three-stage pipeline. RaBitQ is deployed efficiently on NPUs through four NPU-native optimizations: fused AIC-AIV operators, computation flow restructuring that exploits rotation orthogonality, index block-level load balancing, and pipeline parallelism between the AI Core and the AI CPU. Experiments show index construction 3.0× to 62.8× faster than the CPU baseline and up to 4.6× higher throughput than the fastest CPU IVF-RaBitQ implementation, along with encouraging scalability on multi-NPU systems.
📝 Abstract
Vector similarity search is a critical component of modern AI systems, but traditional CPU-based implementations face fundamental scalability bottlenecks on billion-scale corpora because of prohibitive computational overhead and limited memory bandwidth. Neural Processing Units (NPUs) offer orders-of-magnitude higher compute density, yet existing CPU/GPU-optimized implementations of 1-bit RaBitQ quantization cannot be ported directly to NPU architectures because of fundamental hardware mismatches, and homogeneous designs struggle to balance accuracy, memory footprint, and performance at the same time. This paper presents Ascend-RaBitQ, the first IVF-RaBitQ system optimized for heterogeneous NPU-CPU platforms and billion-scale vector search. It is built on the core insight that decoupling coarse ranking (NPU) from fine ranking (CPU) lets each stage run on the hardware best suited to it, breaking the long-standing accuracy-memory-performance trade-off. We propose a three-stage heterogeneous pipeline comprising AI Core-accelerated coarse ranking on 1-bit quantized vectors, on-device Top-k processing on the AI CPU, and fine re-ranking on full-precision vectors on the host CPU. We introduce four NPU architecture-native optimizations: fused AIC-AIV operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing that breaks query boundaries, and intra-NPU pipeline parallelism between the AI Core and the AI CPU to mask Top-k latency. Evaluation on standard datasets shows that Ascend-RaBitQ achieves 3.0× to 62.8× faster index construction than the CPU baseline, a throughput improvement of up to 4.6× over the fastest CPU IVF-RaBitQ implementation and of more than 100× over the mathematically equivalent CPU baseline, while demonstrating encouraging scalability on distributed multi-NPU systems.
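To make the three-stage pipeline concrete, the following is a minimal single-machine sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`sign_code`, `search`) and parameters (`n_candidates`) are hypothetical, Hamming distance on raw sign bits stands in for the RaBitQ distance estimator (no rotation, normalization, or IVF partitioning is performed), and every stage runs on the CPU rather than on the AI Core, AI CPU, and host CPU.

```python
import heapq
import random


def sign_code(vec):
    """Pack the sign bits of a vector into one int (1 bit per dimension)."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def squared_l2(a, b):
    """Exact squared Euclidean distance on full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, codes, k, n_candidates):
    # Stage 1, coarse ranking (AI Core in the paper): score every 1-bit code.
    # ASSUMPTION: Hamming distance is a crude stand-in, not RaBitQ's estimator.
    q_code = sign_code(query)
    coarse = [(bin(q_code ^ c).count("1"), i) for i, c in enumerate(codes)]
    # Stage 2, Top-k selection (AI CPU in the paper): keep the best candidates.
    candidates = heapq.nsmallest(n_candidates, coarse)
    # Stage 3, fine re-ranking (host CPU): exact distances on original vectors.
    fine = [(squared_l2(query, vectors[i]), i) for _, i in candidates]
    return [i for _, i in heapq.nsmallest(k, fine)]


# Toy corpus: zero-centered Gaussian vectors, so sign bits carry information.
random.seed(0)
dim, n = 32, 200
vectors = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
codes = [sign_code(v) for v in vectors]

# Query with a vector that is itself in the corpus; it should rank first.
result = search(vectors[7], vectors, codes, k=3, n_candidates=20)
print(result)
```

The split mirrors the trade-off described in the abstract: the coarse stage touches only one bit per dimension for every vector, while the exact distance is computed only for the `n_candidates` survivors, so raising `n_candidates` trades throughput for recall.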