Cross-Timeslot Optimization for Distributed GPU Inference Using Reinforcement Learning

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing distributed GPU inference schedulers make decisions solely from the instantaneous system state, ignoring how task demand and resource availability evolve over time; this leads to low GPU utilization, high migration overhead, and sluggish responsiveness under dynamic workloads. This paper proposes TORTA, a two-layer spatiotemporal scheduling framework: a macro-level scheduler combines reinforcement learning with optimal transport to coordinate inter-region task distribution, while a micro-level allocator refines task-to-server assignments within each region to satisfy short-horizon execution constraints and reduce switching costs. TORTA coordinates resources across heterogeneous network topologies, lowering migration cost while improving responsiveness and load balancing. Experiments across multiple topologies show that TORTA reduces average inference response time by up to 15%, improves load balance by roughly 4–5%, and cuts total operational cost by 10–20%, outperforming state-of-the-art baselines.

📝 Abstract
The rapid growth of large language model (LLM) services imposes increasing demands on distributed GPU inference infrastructure. Most existing scheduling systems rely on the current system state to make decisions, without considering how task demand and resource availability evolve over time. This lack of temporal awareness leads to inefficient GPU utilization, high task migration overhead, and poor system responsiveness under dynamic workloads. In this work, we identify the fundamental limitations of these instantaneous-state-only scheduling approaches and propose Temporal Optimal Resource scheduling via Two-layer Architecture (TORTA). TORTA introduces a spatiotemporal scheduling framework that captures both long-term workload patterns and short-term execution constraints. It adopts a two-layer design: a macro-level scheduler leverages reinforcement learning and optimal transport to coordinate inter-region task distribution, while a micro-level allocator refines task-to-server assignments within each region to reduce latency and switching costs. Experimental results across multiple network topologies show that TORTA reduces average inference response time by up to 15%, improves load balance by approximately 4-5%, and cuts total operational cost by 10-20% compared to state-of-the-art baseline methods.
Problem

Research questions and friction points this paper is trying to address.

Optimizing distributed GPU inference for dynamic workloads
Reducing task migration overhead and improving GPU utilization
Enhancing system responsiveness with spatiotemporal scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning optimizes long-term GPU resource allocation
Two-layer architecture balances macro and micro scheduling
Optimal transport reduces inter-region task migration costs
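To make the cross-region allocation idea concrete, here is a toy sketch of casting inter-region task placement as a discrete transportation problem. This is not TORTA's solver: the paper uses optimal transport with learned forecasts, whereas this sketch uses the classic least-cost greedy heuristic, and all region names, capacities, and per-task costs are invented for illustration.

```python
# Toy sketch (assumed interface, not TORTA's implementation): place pending
# tasks from source regions onto regions with free GPU capacity, preferring
# the cheapest (src, dst) migration pairs first.

def allocate(supply, demand, cost):
    """Greedy least-cost transportation heuristic.

    supply: dict region -> pending tasks to place
    demand: dict region -> free GPU capacity
    cost:   dict (src, dst) -> per-task migration/latency cost
    Returns dict (src, dst) -> number of tasks moved.
    """
    supply, demand = dict(supply), dict(demand)  # copy; we mutate below
    plan = {}
    # Visit (src, dst) pairs from cheapest to most expensive.
    for (src, dst) in sorted(cost, key=cost.get):
        if supply.get(src, 0) > 0 and demand.get(dst, 0) > 0:
            moved = min(supply[src], demand[dst])
            plan[(src, dst)] = moved
            supply[src] -= moved
            demand[dst] -= moved
    return plan


# Example with invented regions: local placement is free, cross-region
# moves pay a cost, so the plan keeps tasks local when capacity allows.
plan = allocate(
    supply={"us-east": 8, "eu-west": 4},
    demand={"us-east": 5, "ap-south": 7},
    cost={("us-east", "us-east"): 0, ("us-east", "ap-south"): 3,
          ("eu-west", "us-east"): 2, ("eu-west", "ap-south"): 4},
)
```

An exact optimal-transport solution (e.g. via a min-cost-flow or LP solver) would replace the greedy loop; the heuristic only illustrates how a cost matrix over region pairs drives migration decisions.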
Chengze Du
Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, China
Zhiwei Yu
BAAI
Multimodal Interaction, Embodied AI, Knowledge-Based QA/QG, Computational Humor
Heng Xu
Professor of Information Technology, Analytics, and Operations (ITAO), University of Notre Dame
Information Privacy, Responsible AI, Tech Policy, AI Ethics, Usable Security and Privacy
Haojie Wang
China Mobile Research Institute, Beijing, China
Bo Liu
Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, China
Jialong Li
Waseda University
self-adaptive systems, requirement engineering, human-in-the-loop