Semantic Scheduling for LLM Inference

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional OS schedulers ignore request semantics, making it difficult to guarantee timeliness for high-priority tasks (e.g., medical emergencies). This paper introduces the first semantic-aware scheduling paradigm tailored for LLM inference requests, integrating natural language understanding directly into OS scheduling decisions. A lightweight LLM parses user intent and urgency from natural-language requests; priority constraints are then enforced via a low-complexity, real-time scheduling algorithm that dynamically adjusts task priorities based on semantic context. Our approach overcomes the limitations of conventional content-agnostic scheduling, achieving both high system throughput and significantly improved responsiveness for critical tasks: in medical emergency scenarios, average waiting time for high-priority requests decreases by 47%. The implementation is open-sourced.
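The pipeline the summary describes (a lightweight model scores each request's urgency, and a low-complexity scheduler serves pending requests by that score) can be sketched as a priority queue. This is a minimal illustration, not the paper's implementation: the keyword heuristic `classify_urgency` is a stand-in for the lightweight LLM, and all names are hypothetical.

```python
import heapq
import itertools

# Stand-in for the paper's lightweight LLM: map a natural-language
# request to a priority level (lower value = more urgent).
def classify_urgency(text: str) -> int:
    urgent_keywords = ("chest pain", "bleeding", "unconscious", "emergency")
    return 0 if any(k in text.lower() for k in urgent_keywords) else 1

class SemanticScheduler:
    """Serves pending requests by semantic priority, then arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority level

    def submit(self, text: str):
        priority = classify_urgency(text)
        heapq.heappush(self._heap, (priority, next(self._counter), text))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

sched = SemanticScheduler()
sched.submit("Summarize this meeting transcript")
sched.submit("Patient reports severe chest pain and shortness of breath")
sched.submit("Translate this recipe to French")
print(sched.next_request())  # the emergency request jumps the queue
```

Each `submit` and `next_request` costs O(log n) heap time; the semantic classification dominates, which is why the paper emphasizes keeping the classifier lightweight.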

📝 Abstract
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often fail to prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. Recent advances in language models, however, enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests to large language models (LLMs), where the semantics of a process guide its scheduling priority. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
Problem

Research questions and friction points this paper is trying to address.

Improves LLM inference scheduling using semantic analysis
Prioritizes urgent tasks via context-aware scheduling decisions
Minimizes waiting time for critical time-sensitive applications
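The waiting-time claim in the bullets above can be illustrated with a toy comparison between content-agnostic FIFO ordering and semantic priority ordering. All job names and service times below are illustrative, not from the paper:

```python
# Toy comparison: waiting time of an urgent request under FIFO vs.
# priority-first ordering. (name, service_time_s, priority); lower
# priority value = more urgent. Numbers are purely illustrative.
jobs = [
    ("batch_summary", 8.0, 1),
    ("translation", 5.0, 1),
    ("medical_emergency", 2.0, 0),
]

def waiting_times(order):
    t, waits = 0.0, {}
    for name, service, _ in order:
        waits[name] = t  # time spent queued before service starts
        t += service
    return waits

fifo = waiting_times(jobs)
prio = waiting_times(sorted(jobs, key=lambda j: j[2]))
print(fifo["medical_emergency"])  # 13.0 — stuck behind both earlier jobs
print(prio["medical_emergency"])  # 0.0 — served first under semantic priority
```

The urgent request's wait drops from 13 s to 0 s here, at the cost of a modest delay to the low-priority jobs; the paper's algorithm makes this trade-off while keeping overall waiting time minimized.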
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic scheduling for LLM inference
Optimal time complexity algorithm
Medical emergency management application