Semantic-Aware Scheduling for GPU Clusters with Large Language Models

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current deep learning schedulers lack awareness of task semantics and rely solely on limited metadata, which results in inaccurate job duration prediction, delayed fault response, and poor observability. To address this, we propose SchedMate, the first framework to integrate large language models (LLMs) into GPU cluster scheduling. SchedMate automatically extracts deep semantic features from three types of unstructured data: source code, runtime logs, and historical job traces. Its three non-intrusive components work with mainstream schedulers without requiring modifications to existing infrastructure. Extensive experiments on a 128-GPU cluster and real-world production traces show that SchedMate reduces average job completion time by up to 1.91×, significantly lowers prediction error, accelerates fault recovery, and enables the first semantic-driven intelligent scheduling system for deep learning workloads.

📝 Abstract
Deep learning (DL) schedulers are pivotal in optimizing resource allocation in GPU clusters, but operate with a critical limitation: they are largely blind to the semantic context of the jobs they manage. This forces them to rely on limited metadata, leading to high profiling overhead, unreliable duration estimation, inadequate failure handling, and poor observability. To this end, we propose SchedMate, a framework that bridges this semantic gap by systematically extracting deep insights from overlooked, unstructured data sources: source code, runtime logs, and historical jobs. SchedMate enhances existing schedulers non-intrusively through three LLM-based components. Our implementation integrates seamlessly with existing deep learning schedulers. Evaluations on a 128-GPU physical cluster and extensive simulations on production traces show SchedMate reduces average job completion times by up to 1.91x, substantially enhancing the scheduling performance, demonstrating the critical role of semantic-awareness in modern DL scheduling.
Problem

Research questions and friction points this paper is trying to address.

Scheduling GPU clusters without understanding job semantics
Relying on limited metadata causes profiling and estimation issues
Extracting insights from unstructured data to improve scheduling efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLMs to extract insights from unstructured data
Integrates semantic-aware components into existing schedulers
Reduces job completion times via semantic context analysis
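To make the pipeline behind these bullets concrete, here is a minimal sketch of semantic-aware duration estimation feeding a shortest-estimated-job-first queue. All names (`Job`, `extract_semantic_features`, `estimate_duration`, `schedule`) are hypothetical, and the regex-based extraction is an illustrative stand-in for the LLM components the paper describes but this page does not detail:

```python
from dataclasses import dataclass
import re

@dataclass
class Job:
    name: str
    script: str          # training script source (unstructured input)
    log_tail: str = ""   # most recent runtime log lines
    est_duration: float = float("inf")  # filled in by the estimator

def extract_semantic_features(job: Job) -> dict:
    """Stand-in for an LLM pass over source code and logs.

    A real system would prompt an LLM; here we pull two signals with
    regexes so the sketch stays self-contained and deterministic.
    """
    epochs = re.search(r"epochs\s*=\s*(\d+)", job.script)
    step_time = re.search(r"step_time=(\d+\.?\d*)s", job.log_tail)
    return {
        "epochs": int(epochs.group(1)) if epochs else None,
        "step_time_s": float(step_time.group(1)) if step_time else None,
    }

def estimate_duration(features: dict, steps_per_epoch: int = 100) -> float:
    """Naive duration model: epochs x steps per epoch x observed step time."""
    if features["epochs"] is None or features["step_time_s"] is None:
        return float("inf")  # jobs with unknown semantics go to the back
    return features["epochs"] * steps_per_epoch * features["step_time_s"]

def schedule(jobs: list[Job]) -> list[Job]:
    """Order jobs shortest-estimated-first using the semantic estimates."""
    for job in jobs:
        job.est_duration = estimate_duration(extract_semantic_features(job))
    return sorted(jobs, key=lambda j: j.est_duration)

# Example: jobs with richer semantic signals get better estimates and,
# here, earlier slots; a job with no recoverable signal falls to the back.
jobs = [
    Job("B", "epochs = 10", "step_time=0.2s"),   # est. 200 s
    Job("C", "while True: train()"),             # no signal -> inf
    Job("A", "epochs = 3", "step_time=0.5s"),    # est. 150 s
]
order = [j.name for j in schedule(jobs)]
```

The non-intrusive aspect claimed by the paper would correspond to running such an estimator alongside an existing scheduler and only reordering its queue, rather than replacing the scheduler itself.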