TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

📅 2026-04-10

🤖 AI Summary

This work addresses inefficient, poorly scaling weight transfer in LLM reinforcement learning (RL) training on heterogeneous resources. The authors propose TensorHub, a system built on Reference-Oriented Storage (ROS), a storage abstraction that keeps no physical copy of the model weights. Instead, ROS tracks the workers that already hold the weights on GPUs for inference and serves read requests on demand from those replicas. TensorHub extends ROS with topology-optimized transfer, strong consistency, and fault tolerance, and the evaluation reports that it fully saturates RDMA bandwidth. Reported gains are up to 6.7x lower total GPU stall time for standalone rollouts, 4.8x faster weight updates for elastic rollouts, and 19x lower stall time for cross-datacenter rollouts. The system has been deployed in production.

📝 Abstract

Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance. We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.
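
The ROS idea described in the abstract can be illustrated with a toy registry that stores references to weight holders rather than the weights themselves. This is a minimal sketch, not TensorHub's implementation: the class `ReferenceStore`, its methods `publish`, `retire`, and `resolve`, and the rack-based source preference are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ReferenceStore:
    """Toy reference-oriented store: tracks who holds a version, stores no weights.

    Hypothetical illustration only; not the TensorHub API.
    """

    # version number -> ids of workers that currently hold that version on GPU
    holders: Dict[int, Set[str]] = field(default_factory=dict)

    def publish(self, version: int, worker: str) -> None:
        """Record that `worker` now holds `version` of the weights."""
        self.holders.setdefault(version, set()).add(worker)

    def retire(self, version: int, worker: str) -> None:
        """Record that `worker` no longer holds `version`."""
        workers = self.holders.get(version)
        if workers is None:
            return
        workers.discard(worker)
        if not workers:
            # No replica left, so the version is no longer readable.
            del self.holders[version]

    def resolve(self, version: int, reader_rack: str, rack_of: Dict[str, str]) -> str:
        """Pick a worker to serve a read of `version`.

        Assumption: a same-rack holder is preferred as a crude stand-in for
        topology-aware source selection; ties are broken by worker id.
        """
        workers = self.holders.get(version)
        if not workers:
            raise KeyError(f"no live replica of version {version}")
        return min(workers, key=lambda w: (rack_of[w] != reader_rack, w))
```

In this sketch a read never touches a stored copy: `resolve` only returns the id of a worker that already has the requested version, and the actual transfer (over RDMA in the real system) would happen between that worker and the reader. A version becomes unreadable once its last holder retires it, which is one reason a production system needs the consistency and fault-tolerance mechanisms the abstract mentions.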

Problem

Research questions and friction points this paper is trying to address.

weight transfer

large language models

reinforcement learning

elastic scaling

heterogeneous resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-Oriented Storage

weight transfer

LLM reinforcement learning