🤖 AI Summary
To address workload interference caused by shared links in Dragonfly networks, this paper proposes a lightweight surrogate model integrating Graph Neural Networks (GNNs) and Large Language Models (LLMs) for high-accuracy, low-overhead runtime application performance prediction. The GNN captures topology-aware spatial dependencies at the router-port level, while the LLM encodes temporal dynamics of network traffic; their joint embedding is incorporated into a data-driven hybrid simulation framework. This approach significantly enhances prediction robustness and real-time responsiveness under dynamic network conditions. Evaluated across diverse real-world workloads, it achieves an average 32.7% reduction in prediction error compared to conventional statistical and machine learning baselines. The method enables scalable, low-latency network performance simulation, thereby supporting fine-grained resource scheduling and system optimization for Dragonfly architectures.
📝 Abstract
The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present ourmodel, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port level router data. ourmodel outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.