🤖 AI Summary
Existing optical interconnects for collective communication rely on static, one-time network reconfiguration, which fails to adapt to the dynamic, fine-grained communication demands of modern distributed training—leading to either resource over-provisioning or performance degradation. This work introduces “intra-collective reconfiguration,” a novel paradigm that enables overlapping optical switch reconfiguration with data transmission for the first time. We design a demand-aware optical network scheduler, a lightweight collective communication middleware layer, and a dynamic parallelization mechanism—all operating transparently without modifying upper-layer communication libraries. These components jointly support fine-grained, topology-adaptive collective operations. Simulation results demonstrate that our approach reduces communication latency by up to 42% and improves bandwidth utilization by 3.1× compared to static reconfiguration, while significantly enhancing support for complex collective algorithms such as AllReduce and Hierarchical AllGather.
📝 Abstract
Collective communication (CC) is widely adopted for large-scale distributed machine learning (DML) training workloads. DML's predictable traffic patterns provide a great opportunity for applying optical network technology. Existing CC schemes based on optical interconnects adopt "one-shot network reconfiguration", which provisions a static high-capacity topology for an entire collective operation, sometimes for a full training iteration. However, this approach faces significant scalability limitations when supporting the more complex and efficient CC algorithms required for modern workloads: "one-shot" strategies either demand excessive resource overprovisioning or suffer performance degradation due to rigid resource allocation.
To address these challenges, we propose SWOT, a demand-aware optical network framework. SWOT employs "intra-collective reconfiguration" to dynamically align network resources with CC traffic patterns. SWOT incorporates a novel scheduling technique that overlaps optical switch reconfigurations with ongoing transmissions, thereby improving communication efficiency. SWOT also introduces a lightweight collective communication shim that enables coordinated optical network configuration and transmission scheduling while supporting seamless integration with existing CC libraries. Our simulation results demonstrate SWOT's significant performance improvements.
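The benefit of overlapping reconfiguration with transmission can be illustrated with a toy timing model. The sketch below is not SWOT's scheduler; it is a minimal illustration under stated assumptions: a collective is a sequence of steps that each need a different topology, the first topology is already in place, every reconfiguration takes the same fixed delay, and a spare set of circuits is available so that the topology for step i+1 can be prepared while step i is still transmitting. The function names and the example numbers are hypothetical.

```python
def sequential_time(step_durations, reconfig_delay):
    """Completion time when transmission pauses for every reconfiguration.

    Assumes the topology for the first step is already configured, so
    only the transitions between consecutive steps pay the delay.
    """
    if not step_durations:
        return 0.0
    transitions = len(step_durations) - 1
    return sum(step_durations) + reconfig_delay * transitions


def overlapped_time(step_durations, reconfig_delay):
    """Completion time when reconfiguration overlaps with transmission.

    Assumes idle circuits start reconfiguring for step i+1 as soon as
    step i begins, so step i+1 can start after max(duration_i, delay).
    """
    if not step_durations:
        return 0.0
    total = 0.0
    for duration in step_durations[:-1]:
        # The next step waits for whichever finishes last:
        # the current transmission or the background reconfiguration.
        total += max(duration, reconfig_delay)
    return total + step_durations[-1]


if __name__ == "__main__":
    steps = [4.0, 2.0, 1.0]   # hypothetical per-step transmission times
    delay = 3.0               # hypothetical switch reconfiguration delay
    print("sequential:", sequential_time(steps, delay))
    print("overlapped:", overlapped_time(steps, delay))
```

In this model the reconfiguration delay is fully hidden whenever a step transmits for at least as long as the switch takes to reconfigure; shorter steps still expose the difference.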