🤖 AI Summary
This work addresses the computational intensity and memory bandwidth bottlenecks that plague private information retrieval (PIR) under multi-client concurrent settings. The authors propose an efficient GPU-accelerated PIR system featuring a phase-aware hybrid execution model that dynamically schedules both operator-level and phase-level kernels. By innovatively integrating phase-level kernels, adopting a transposed-layout GEMM design, and employing fine-grained pipelining, the system effectively mitigates data layout conflicts between the Number Theoretic Transform (NTT) and General Matrix Multiply (GEMM) operations. Combined with optimized multi-GPU communication, the proposed system achieves up to a 305.7× throughput improvement over the state-of-the-art GPU-based implementation, PironGPU, while maintaining scalability and significantly enhancing query efficiency for large-scale databases.
📝 Abstract
Private information retrieval (PIR) allows private database queries but is hindered by intense server-side computation and memory traffic. Modern lattice-based PIR protocols typically involve three phases: ExpandQuery (expanding a query into encrypted indices), RowSel (encrypted row selection), and ColTor (recursive "column tournament" for final selection). ExpandQuery and ColTor primarily perform number-theoretic transforms (NTTs), whereas RowSel reduces to large-scale independent matrix-matrix multiplications (GEMMs). GPUs are theoretically ideal for these tasks, provided multi-client batching is used to achieve high throughput. However, batching fundamentally reshapes performance bottlenecks; while it amortizes database access costs, it expands working sets beyond the L2 cache capacity, causing divergent memory behaviors and excessive DRAM traffic.
We present GPIR, a GPU-accelerated PIR system that rethinks kernel design, data layout, and execution scheduling. We introduce a stage-aware hybrid execution model that dynamically switches between operation-level kernels, which execute each primitive operation separately, and stage-level kernels, which fuse all operations within a protocol stage into a single kernel to maximize on-chip data reuse. For RowSel, we identify a performance gap caused by a structural mismatch between NTT-driven data layouts and tiled GEMM access patterns, which is exacerbated by multi-client batching. We resolve this through a transposed-layout GEMM design and fine-grained pipelining. Finally, we extend GPIR to multi-GPU systems, scaling both query throughput and database capacity with negligible communication overhead. GPIR achieves up to 305.7x higher throughput than PIRonGPU, the state-of-the-art GPU implementation.