🤖 AI Summary
To address the high power consumption, cost, and complexity of rail-optimized networks in ML training, which stem from high-radix electrical switching, this paper proposes a time-division-multiplexed rail network architecture based on optical circuit switches (OCS). The approach replaces electrical switching with OCSes while preserving the rail communication abstraction. It introduces a parallelism-driven dynamic rail reconfiguration mechanism that co-evolves the network topology with the model's mixed-parallelism dimensions, together with Opus, a time-multiplexed control plane that jointly optimizes communication scheduling and OCS configuration. Experiments demonstrate that the architecture retains full-connectivity communication while significantly reducing power consumption and hardware overhead, and that it efficiently supports the dynamic, fine-grained communication patterns required for large-scale distributed training.
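The summary does not spell out how a sequence of one-to-one optical matchings can stand in for an electrical switch's all-to-all connectivity. As a generic, hypothetical illustration (not Opus's actual scheduler), the classic round-robin "circle method" builds N−1 perfect matchings that together cover every endpoint pair, so cycling an OCS through them emulates full connectivity over time:

```python
def round_robin_matchings(n):
    """Circle-method schedule: n-1 rounds of perfect matchings on n endpoints.

    Each round is a set of disjoint (a, b) circuits an OCS could realize
    simultaneously; across all rounds every pair of endpoints meets once.
    """
    assert n % 2 == 0, "circle method assumes an even endpoint count"
    nodes = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the outermost remaining endpoints inward: a perfect matching.
        rounds.append([(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)])
        # Hold nodes[0] fixed and rotate the rest by one position.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds
```

For 8 endpoints this yields 7 timeslots whose matchings cover all 28 pairs exactly once, which is the intuition behind time-multiplexed emulation of a rail switch; the paper's scheduler additionally weighs which pairs actually need bandwidth in each parallelism phase.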
📝 Abstract
Rail-optimized networks have become the de facto datacenter scale-out fabric for large-scale ML training. However, the use of high-radix electrical switches to provide all-to-all connectivity in rails imposes massive power, cost, and complexity overheads. We propose a rethinking of the rail abstraction that retains its communication semantics but realizes it using optical circuit switches. The key challenge is that optical switches support only one-to-one connectivity at a time, limiting the fan-out of traffic in ML workloads that use hybrid parallelisms. We introduce parallelism-driven rail reconfiguration as a solution that leverages the sequential ordering between traffic from different parallelisms. We design a control plane, Opus, to enable time-multiplexed emulation of electrical rail switches using optical switches. More broadly, our work discusses a new research agenda: datacenter fabrics that co-evolve with the model parallelism dimensions within each job, as opposed to the prevailing mindset of reconfiguring networks before a job begins.
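The abstract's key observation is that traffic from different parallelisms is sequentially ordered, so the fabric can be reconfigured between phases rather than supporting all patterns at once. A minimal sketch of that idea, with entirely hypothetical phase names, GPU groupings, and ring-circuit topology (the abstract does not specify Opus's actual plan format):

```python
def ring_circuits(groups):
    """Directed ring circuits connecting the members of each communication group."""
    circuits = []
    for g in groups:
        for i in range(len(g)):
            circuits.append((g[i], g[(i + 1) % len(g)]))
    return circuits

def reconfiguration_plan(phases):
    """For each ordered parallelism phase, emit the OCS circuit set to install,
    flagging whether the switch must actually reconfigure (circuits changed)."""
    plan, prev = [], None
    for name, groups in phases:
        circuits = ring_circuits(groups)
        plan.append((name, circuits, circuits != prev))
        prev = circuits
    return plan

# Hypothetical example: 4 GPUs on one rail, tensor-parallel pairs
# followed by data-parallel pairs.
phases = [
    ("tensor_parallel", [[0, 1], [2, 3]]),
    ("data_parallel",   [[0, 2], [1, 3]]),
]
```

Because the tensor-parallel and data-parallel collectives run at different points in each training step, the OCS only needs one of these circuit sets at a time; co-evolving the topology with the phase order is what substitutes for an electrical switch's standing all-to-all connectivity.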