🤖 AI Summary
This work addresses the growing communication bottleneck in large-model training, where inter-device data transfer limits performance and both intra-package and inter-package links need higher bandwidth and efficiency. The paper proposes ChipLight, a cross-layer, multi-objective optimization framework that, for the first time, jointly models chiplet architecture, training parallelization strategy, and optical interconnect network topology, enabling co-design from chip to system scale. By combining chiplet integration, optical interconnects, and parallelism modeling, and by pairing black-box and white-box approaches for efficient design space exploration, the method significantly improves training efficiency. The study demonstrates that cross-layer co-design is feasible and effective in communication-intensive scenarios, and it offers design guidance for future large-model training clusters.
📝 Abstract
In large-scale distributed LLM training, communication between devices becomes the key performance bottleneck. Chiplet technology can integrate multiple dies into a single package, scaling up node performance with higher bandwidth. Meanwhile, optical interconnect (OI) technology offers long-reach, high-bandwidth links, making it well suited to scale-out networks. Combining the two technologies has the potential to overcome communication bottlenecks both within and across packages. In this work, we present ChipLight, a cross-layer, multi-objective design and optimization method for training clusters that leverage chiplets and OI. We first abstract an architecture model for such clusters that co-optimizes chiplet architecture, training parallelism strategy, and OI network topology. Based on this model, we tailor the design space exploration flow by combining black-box and white-box methodologies. Our experimental results show that ChipLight significantly improves training efficiency and provides design insights for the development of future training clusters.
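To make the idea of multi-objective design space exploration concrete, the following is a minimal sketch and not ChipLight's actual flow. Everything in it is an assumption made for illustration: the three design knobs (`dies_per_package`, `tensor_parallel`, `oi_links`), the toy analytical cost model in `evaluate`, and all constants. The analytical model plays the role of a white-box estimator, and exhaustive enumeration stands in for whatever black-box search a real flow would use. The output is the Pareto front over two objectives, step time and hardware cost.

```python
from itertools import product
from typing import NamedTuple


class Design(NamedTuple):
    # All three knobs are hypothetical stand-ins for the three layers
    # named in the abstract (chiplet, parallelism, OI topology).
    dies_per_package: int
    tensor_parallel: int
    oi_links: int


def evaluate(d: Design) -> tuple[float, float]:
    """Toy analytical model: returns (step_time, cost), lower is better.

    The formulas and constants are invented for illustration only.
    """
    compute = 64.0 / (d.dies_per_package * d.tensor_parallel)
    intra_comm = 0.5 * (d.tensor_parallel - 1) / d.dies_per_package
    inter_comm = 8.0 / d.oi_links
    step_time = compute + intra_comm + inter_comm
    cost = 1.0 * d.dies_per_package + 0.5 * d.oi_links
    return step_time, cost


def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs: list[Design]) -> list[tuple[Design, tuple[float, float]]]:
    """Keep only the designs that no other design dominates."""
    scored = [(d, evaluate(d)) for d in designs]
    return [(d, s) for d, s in scored if not any(dominates(t, s) for _, t in scored)]


# Enumerate a small design space and extract its Pareto-optimal points.
space = [Design(*c) for c in product([2, 4, 8], [1, 2, 4], [1, 2, 4])]
front = pareto_front(space)

if __name__ == "__main__":
    for design, (step_time, cost) in sorted(front, key=lambda item: item[1][1]):
        print(design, f"step_time={step_time:.2f}", f"cost={cost:.2f}")
```

Exhaustive enumeration is only workable because this toy space has 27 points. A realistic space spanning chiplet, parallelism, and topology choices would be far larger, which is presumably why the abstract pairs model-based (white-box) reasoning with black-box search.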