Scheduling Parallel Optical Circuit Switches for AI Training

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the traffic scheduling problem on parallel optical circuit switches with non-zero reconfiguration delay, aiming to minimize makespan under the high-bandwidth and low-energy demands of AI training. The authors propose a three-stage scheduling approach: first decomposing the traffic matrix into weighted permutations, then assigning these permutations to parallel optical switches in a load-aware manner, and finally applying a controlled permutation splitting strategy to balance switch loads. This method is the first to integrate permutation decomposition with load balancing while accounting for reconfiguration overhead, achieving performance that closely approaches a newly derived theoretical lower bound. Experimental results demonstrate significant makespan reductions—by factors of 1.4× and 1.9× on representative AI workloads such as GPT and Qwen MoE, respectively—and up to 2.4× on standard benchmarks.

📝 Abstract
The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading strategy. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm Spectra relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
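The abstract's three-step Decompose/Schedule/Equalize pipeline can be sketched in pure Python. This is an illustrative approximation, not the paper's Spectra implementation: a greedy permutation-peeling stands in for the paper's minimal permutation decomposition, a longest-weight-first rule for the load-aware assignment, and a simple gap-halving split for the controlled permutation splitting. All function names and heuristics below are assumptions.

```python
def decompose(D):
    """Greedily peel weighted (partial) permutations off demand matrix D.

    Stand-in for the paper's minimal permutation decomposition: repeatedly
    build a matching from the largest remaining entries and subtract the
    matching's minimum weight, so at least one entry hits zero per round.
    """
    D = [row[:] for row in D]
    n = len(D)
    perms = []
    while any(D[r][c] > 0 for r in range(n) for c in range(n)):
        entries = sorted(((D[r][c], r, c) for r in range(n)
                          for c in range(n) if D[r][c] > 0), reverse=True)
        perm, used_r, used_c = {}, set(), set()
        for v, r, c in entries:
            if r not in used_r and c not in used_c:
                perm[r] = c
                used_r.add(r)
                used_c.add(c)
        w = min(D[r][c] for r, c in perm.items())
        for r, c in perm.items():
            D[r][c] -= w
        perms.append((perm, w))
    return perms

def schedule(perms, s, delta):
    """Load-aware assignment, longest weight first: each permutation costs
    its weight plus one reconfiguration delay delta on its switch."""
    loads = [0.0] * s
    assignment = [[] for _ in range(s)]
    for perm, w in sorted(perms, key=lambda pw: -pw[1]):
        i = min(range(s), key=loads.__getitem__)
        assignment[i].append((perm, w))
        loads[i] += w + delta
    return assignment, loads

def equalize(assignment, loads, delta, max_rounds=10):
    """Rebalance by splitting a heavy permutation: part of its weight moves
    to the least-loaded switch, paying one extra delta there."""
    for _ in range(max_rounds):
        hi = max(range(len(loads)), key=loads.__getitem__)
        lo = min(range(len(loads)), key=loads.__getitem__)
        gap = loads[hi] - loads[lo]
        if gap <= delta or not assignment[hi]:
            break  # the extra reconfiguration would outweigh the gain
        k = max(range(len(assignment[hi])),
                key=lambda t: assignment[hi][t][1])
        perm, w = assignment[hi][k]
        move = min(w / 2, (gap - delta) / 2)
        assignment[hi][k] = (perm, w - move)
        assignment[lo].append((perm, move))
        loads[hi] -= move
        loads[lo] += move + delta
    return assignment, loads
```

On a toy demand matrix, splitting pays off exactly when the load gap between the heaviest and lightest switch exceeds the reconfiguration delay $\delta$, which is why `equalize` stops once the gap drops below it.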
Problem

Research questions and friction points this paper is trying to address.

optical circuit switches
AI training
traffic scheduling
reconfiguration delay
parallel switches
Innovation

Methods, ideas, or system contributions that make the work stand out.

optical circuit switching
AI training traffic scheduling
parallel OCS
reconfiguration delay
makespan minimization
Kevin Liang
UC San Diego
Litao Qiao
UC San Diego
Isaac Keslassy
Technion
Computer Networks
Bill Lin
Professor, UCSD