Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks

📅 2025-10-22
🤖 AI Summary
Existing optical interconnects for collective communication rely on static, one-time network reconfiguration, which fails to adapt to the dynamic, fine-grained communication demands of modern distributed training—leading to either resource over-provisioning or performance degradation. This work introduces “intra-collective reconfiguration,” a novel paradigm that enables overlapping optical switch reconfiguration with data transmission for the first time. We design a demand-aware optical network scheduler, a lightweight collective communication middleware layer, and a dynamic parallelization mechanism—all operating transparently without modifying upper-layer communication libraries. These components jointly support fine-grained, topology-adaptive collective operations. Simulation results demonstrate that our approach reduces communication latency by up to 42% and improves bandwidth utilization by 3.1× compared to static reconfiguration, while significantly enhancing support for complex collective algorithms such as AllReduce and Hierarchical AllGather.

📝 Abstract
Collective communication (CC) is widely adopted for large-scale distributed machine learning (DML) training workloads. DML's predictable traffic patterns provide a great opportunity for applying optical network technology. Existing CC schemes based on optical interconnects adopt "one-shot network reconfiguration", which provisions a static high-capacity topology for an entire collective operation, and sometimes for a full training iteration. However, this approach faces significant scalability limitations when supporting the more complex and efficient CC algorithms required by modern workloads: "one-shot" strategies either demand excessive resource overprovisioning or suffer performance degradation due to rigid resource allocation. To address these challenges, we propose SWOT, a demand-aware optical network framework. SWOT employs "intra-collective reconfiguration" to dynamically align network resources with CC traffic patterns. SWOT incorporates a novel scheduling technique that overlaps optical switch reconfigurations with ongoing transmissions, improving communication efficiency. SWOT also introduces a lightweight collective communication shim that enables coordinated optical network configuration and transmission scheduling while supporting seamless integration with existing CC libraries. Our simulation results demonstrate SWOT's significant performance improvements.
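To make the idea of overlapping reconfiguration with transmission concrete, here is a minimal timing sketch in Python. It is not SWOT's scheduler. It is a toy model under stated assumptions: a collective is a sequence of steps with known transmission times, every topology change costs the same fixed delay, and a spare circuit is always free to be reconfigured for the next step while the current step transmits. The function names and parameters are hypothetical and chosen only for illustration.

```python
def sequential_time(step_times, reconfig_delay):
    """Completion time when the links sit idle during every reconfiguration.

    Assumption: each step needs its own topology, so each step pays the
    full reconfiguration delay before it can transmit.
    """
    return sum(reconfig_delay + t for t in step_times)


def overlapped_time(step_times, reconfig_delay):
    """Completion time when the next topology is prepared during transmission.

    Assumption: a spare circuit starts reconfiguring for step i+1 at the
    moment step i starts transmitting, so step i+1 begins once both the
    transmission of step i and that reconfiguration have finished.
    """
    if not step_times:
        return 0
    total = reconfig_delay  # the first topology cannot be hidden
    for t in step_times[:-1]:
        # Reconfiguration is hidden behind the transmission when t >= delay.
        total += max(t, reconfig_delay)
    return total + step_times[-1]


# Hypothetical step durations and switch delay, in arbitrary time units.
steps = [4, 2, 1]
delay = 3
print(sequential_time(steps, delay))
print(overlapped_time(steps, delay))
```

In this model the overlap hides a reconfiguration completely only when the step running in parallel lasts at least as long as the switch delay. Shorter steps still leave part of the delay exposed, which is one reason a scheduler would need to be aware of the traffic demand of each step.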
Problem

Research questions and friction points this paper is trying to address.

Overcoming scalability limitations in optical network collective communication
Enabling dynamic resource alignment with communication traffic patterns
Achieving reconfiguration-transmission overlap to improve communication efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic optical network alignment with traffic patterns
Overlapping switch reconfigurations with data transmissions
Lightweight shim coordinating network configuration and scheduling