SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems

📅 2025-11-14

📈 Citations: 0

✨ Influential: 0

career value

236K/year

🤖 AI Summary

To address workload interference caused by shared links in Dragonfly networks, this paper proposes a lightweight surrogate model integrating Graph Neural Networks (GNNs) and Large Language Models (LLMs) for high-accuracy, low-overhead runtime application performance prediction. The GNN captures topology-aware spatial dependencies at the router-port level, while the LLM encodes temporal dynamics of network traffic; their joint embedding is incorporated into a data-driven hybrid simulation framework. This approach significantly enhances prediction robustness and real-time responsiveness under dynamic network conditions. Evaluated across diverse real-world workloads, it achieves an average 32.7% reduction in prediction error compared to conventional statistical and machine learning baselines. The method enables scalable, low-latency network performance simulation, thereby supporting fine-grained resource scheduling and system optimization for Dragonfly architectures.

Technology Category

Application Category

📝 Abstract

The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present ourmodel, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port level router data. ourmodel outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.

Problem

Research questions and friction points this paper is trying to address.

Predicting application runtime in Dragonfly network systems

Addressing workload interference on shared network links

Reducing computational cost of high-fidelity network simulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines GNNs and LLMs for network analysis

Captures spatial and temporal router data patterns

Enables accurate runtime prediction for hybrid simulation

🔎 Similar Papers

Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis