🤖 AI Summary
High-frequency, fine-grained I/O in AI training imposes severe performance bottlenecks on TCP-based host-mediated storage stacks.
Method: This paper proposes a POSIX-compliant, RDMA-first object storage system that fully offloads the DAOS client onto NVIDIA BlueField-3 SmartNICs, enabling host bypass and user-space storage stack offload. It adopts a decoupled architecture: a gRPC-based control plane and a UCX/libfabric-based data plane, supporting multi-tenant isolation and inline services on the DPU.
Contribution/Results: Experiments demonstrate significantly higher RDMA throughput than TCP across both small and large I/O workloads. End-to-end performance matches that of direct host-attached storage while preserving RDMA’s low latency and high bandwidth. This work presents the first validation of RDMA feasibility and scalability under full client-side offload, establishing an efficient, scalable data path foundation for GPU-centric large-model training.
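The decoupled architecture can be sketched in miniature: a small control-plane round trip hands out a capability, after which all bulk I/O flows on the data plane without further host mediation. The toy Python model below is purely illustrative; the class and method names are our own and do not correspond to the DAOS or ROS2 APIs.

```python
# Toy model of the control-plane / data-plane split (illustrative only;
# names are hypothetical, not from the paper or the DAOS client API).
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """Handle issued by the control plane; authorizes data-plane I/O."""
    pool: str
    container: str
    token: int

class ControlPlane:
    """Stands in for the gRPC service: namespace and capability exchange."""
    def __init__(self):
        self._next_token = 0
        self._namespace = {}  # (pool, container) -> object store

    def open_container(self, pool: str, container: str) -> Capability:
        self._namespace.setdefault((pool, container), {})
        self._next_token += 1
        return Capability(pool, container, self._next_token)

    def resolve(self, cap: Capability) -> dict:
        return self._namespace[(cap.pool, cap.container)]

class DataPlane:
    """Stands in for the UCX/libfabric path: bulk I/O, no host mediation."""
    def __init__(self, control: ControlPlane):
        self._control = control

    def write(self, cap: Capability, key: str, payload: bytes) -> None:
        self._control.resolve(cap)[key] = payload

    def read(self, cap: Capability, key: str) -> bytes:
        return self._control.resolve(cap)[key]

# Usage: one control-plane round trip, then bytes move only on the data
# plane -- mirroring the gRPC vs. UCX/libfabric separation described above.
ctrl = ControlPlane()
data = DataPlane(ctrl)
cap = ctrl.open_container("pool0", "cont0")
data.write(cap, "shard-00", b"training batch")
assert data.read(cap, "shard-00") == b"training batch"
```

The point of the split is that the chatty, policy-heavy operations (namespace lookup, tenancy checks, capability issuance) stay on a slow but flexible RPC path, while the hot path carries only data.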
📝 Abstract
AI training and inference impose sustained, fine-grained I/O that stresses host-mediated, TCP-based storage paths. Motivated by kernel-bypass networking and user-space storage stacks, we revisit POSIX-compatible object storage for GPU-centric pipelines. We present ROS2, an RDMA-first object storage system design that offloads the DAOS client to an NVIDIA BlueField-3 SmartNIC while leaving the DAOS I/O engine unchanged on the storage server. ROS2 separates a lightweight control plane (gRPC for namespace and capability exchange) from a high-throughput data plane (UCX/libfabric over RDMA or TCP) and removes host mediation from the data path.
Using FIO/DFS across local and remote configurations, we find that on server-grade CPUs RDMA consistently outperforms TCP for both large sequential and small random I/O. When the RDMA-driven DAOS client is offloaded to BlueField-3, end-to-end performance is comparable to the host, demonstrating that SmartNIC offload preserves RDMA efficiency while enabling DPU-resident features such as multi-tenant isolation and inline services (e.g., encryption/decryption) close to the NIC. In contrast, TCP on the SmartNIC lags host performance, underscoring the importance of RDMA for offloaded deployments.
Overall, our results indicate that an RDMA-first, SmartNIC-offloaded object-storage stack is a practical foundation for scaling data delivery in modern LLM training environments; integrating optional GPU-direct placement for LLM tasks is left for future work.
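For readers who want to reproduce the style of measurement described above, a fio job along these lines exercises both workload classes against DAOS through the DFS engine. This is a hypothetical config fragment, not the paper's actual setup: the pool/container labels, sizes, and job counts are placeholders, and the exact engine options available depend on the fio build.

```ini
; Hypothetical FIO/DFS job sketch (labels and sizes are placeholders).
[global]
ioengine=dfs        ; DAOS DFS engine shipped with recent fio releases
pool=pool0          ; DAOS pool label (placeholder)
cont=cont0          ; DAOS container label (placeholder)
direct=1
thread=1

[small-random]
rw=randread
bs=4k
size=1g
numjobs=4

[large-sequential]
rw=read
bs=1m
size=8g
numjobs=1
```

Running the same job file from the host and from the BlueField-3 Arm cores, over both RDMA and TCP transports, yields the four-way comparison the abstract reports.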