Semantic Scheduling for LLM Inference

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Traditional OS schedulers ignore request semantics, making it difficult to guarantee timeliness for high-priority tasks (e.g., medical emergencies). This paper introduces the first semantic-aware scheduling paradigm tailored for LLM inference requests, integrating natural language understanding directly into OS scheduling decisions. A lightweight LLM parses user intent and urgency from natural-language requests; priority constraints are then enforced via a low-complexity, real-time scheduling algorithm that dynamically adjusts task priorities based on semantic context. Our approach overcomes the limitations of conventional content-agnostic scheduling, achieving both high system throughput and significantly improved responsiveness for critical tasks: in medical emergency scenarios, average waiting time for high-priority requests decreases by 47%. The implementation is open-sourced.
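The pipeline the summary describes (a lightweight model scores each request's urgency, and a low-complexity scheduler serves pending requests by that score) can be sketched as a priority queue. This is a minimal illustration, not the paper's implementation: the keyword heuristic `classify_urgency` is a stand-in for the lightweight LLM, and all names are hypothetical.

```python
import heapq
import itertools

# Stand-in for the paper's lightweight LLM: map a natural-language
# request to a priority level (lower value = more urgent).
def classify_urgency(text: str) -> int:
    urgent_keywords = ("chest pain", "bleeding", "unconscious", "emergency")
    return 0 if any(k in text.lower() for k in urgent_keywords) else 1

class SemanticScheduler:
    """Serves pending requests by semantic priority, then arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within a priority level

    def submit(self, text: str):
        priority = classify_urgency(text)
        heapq.heappush(self._heap, (priority, next(self._counter), text))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]

sched = SemanticScheduler()
sched.submit("Summarize this meeting transcript")
sched.submit("Patient reports severe chest pain and shortness of breath")
sched.submit("Translate this recipe to French")
print(sched.next_request())  # the emergency request jumps the queue
```

Each `submit` and `next_request` costs O(log n) heap time; the semantic classification dominates, which is why the paper emphasizes keeping the classifier lightweight.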

📝 Abstract
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often fail to prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. Recent advances in language models, however, enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests to large language models (LLMs), where the semantics of a process guide its scheduling priority. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM prompt scheduling. To illustrate its effectiveness, we present a medical emergency management application, underscoring the potential benefits of semantic scheduling for critical, time-sensitive tasks. The code and data are available at https://github.com/Wenyueh/latency_optimization_with_priority_constraints.
Problem

Research questions and friction points this paper is trying to address.

Improves LLM inference scheduling using semantic analysis
Prioritizes urgent tasks via context-aware scheduling decisions
Minimizes waiting time for critical time-sensitive applications
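The waiting-time claim in the bullets above can be illustrated with a toy comparison between content-agnostic FIFO ordering and semantic priority ordering. All job names and service times below are illustrative, not from the paper:

```python
# Toy comparison: waiting time of an urgent request under FIFO vs.
# priority-first ordering. (name, service_time_s, priority); lower
# priority value = more urgent. Numbers are purely illustrative.
jobs = [
    ("batch_summary", 8.0, 1),
    ("translation", 5.0, 1),
    ("medical_emergency", 2.0, 0),
]

def waiting_times(order):
    t, waits = 0.0, {}
    for name, service, _ in order:
        waits[name] = t  # time spent queued before service starts
        t += service
    return waits

fifo = waiting_times(jobs)
prio = waiting_times(sorted(jobs, key=lambda j: j[2]))
print(fifo["medical_emergency"])  # 13.0 — stuck behind both earlier jobs
print(prio["medical_emergency"])  # 0.0 — served first under semantic priority
```

The urgent request's wait drops from 13 s to 0 s here, at the cost of a modest delay to the low-priority jobs; the paper's algorithm makes this trade-off while keeping overall waiting time minimized.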
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic scheduling for LLM inference
Optimal time complexity algorithm
Medical emergency management application