Exploring the Dynamic Scheduling Space of Real-Time Generative AI Applications on Emerging Heterogeneous Systems

📅 2025-07-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time generative AI (RTGen) workloads on heterogeneous SoCs (CPU/GPU/NPU) pose significant scheduling challenges due to stringent latency requirements, strong multi-model concurrency constraints, and workload dynamism. Method: The paper empirically characterizes this scheduling space. The authors construct realistic multi-model RTGen workloads on the AMD Ryzen AI platform, profile model performance across all available backends, and evaluate five scheduling policies against both real-time metrics (deadline violation rate) and LLM metrics (time-to-first-token, tokens-per-second). Contribution/Results: Experiments show that workload dynamics and hardware heterogeneity severely limit static scheduling: the choice of policy produces a 41.7% average difference in deadline violation rates. The study argues that workload-aware, dynamic heterogeneous scheduling is essential for high-performance, on-device RTGen applications, and provides the first systematic characterization of this trade-off on edge heterogeneous SoCs.

📝 Abstract
The integration of generative AI models, particularly large language models (LLMs), into real-time multi-model AI applications such as video conferencing and gaming is giving rise to a new class of workloads: real-time generative AI (RTGen). These workloads combine the compute intensity and dynamic execution patterns of generative models with the stringent latency and concurrency constraints of real-time inference. To meet the diverse demands of RTGen workloads, modern edge platforms increasingly adopt heterogeneous system-on-chip (SoC) architectures that integrate CPUs, GPUs, and NPUs. Despite the potential of heterogeneous SoCs, the scheduling space complexity and performance implications of RTGen workloads on such platforms remain underexplored. In this work, we perform a comprehensive characterization of RTGen workloads on AMD's latest heterogeneous SoC, Ryzen AI. We construct realistic multi-model scenarios inspired by industry use cases and profile model performance across all available backends. Using this data, we evaluate five scheduling policies and their impact on both real-time metrics (e.g., deadline violation rate) and LLM performance (e.g., time-to-first-token and tokens-per-second). Our results show that scheduling decisions significantly affect workload performance (e.g., leading to a 41.7% difference in deadline violation rates on average), and highlight the need for scheduling strategies that are aware of workload dynamics and hardware heterogeneity. Our findings underscore the importance of workload-aware, dynamic heterogeneous scheduling in enabling high-performance, on-device RTGen applications.
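The evaluation setup the abstract describes can be illustrated with a toy model. This is a minimal sketch, not the paper's implementation: the backend names, profiled latencies, and request stream below are all hypothetical. It contrasts a dynamic policy (assign each request to the backend with the earliest predicted finish time, using profiled per-backend latencies) against a static policy that pins every model to one backend, and counts deadline violations for each.

```python
# Hypothetical profiled latencies (seconds) per model per backend.
# In the paper these would come from systematic profiling on Ryzen AI.
PROFILE = {
    "asr":    {"cpu": 0.30, "gpu": 0.12, "npu": 0.08},
    "llm":    {"cpu": 1.50, "gpu": 0.40, "npu": 0.60},
    "vision": {"cpu": 0.50, "gpu": 0.15, "npu": 0.10},
}

def dynamic_schedule(requests):
    """Greedy earliest-finish-time assignment across heterogeneous backends."""
    busy_until = {"cpu": 0.0, "gpu": 0.0, "npu": 0.0}
    violations = 0
    for arrival, model, deadline in requests:
        # Predicted finish on each backend: wait until it frees up, then run.
        finish = {b: max(arrival, busy_until[b]) + PROFILE[model][b]
                  for b in busy_until}
        best = min(finish, key=finish.get)
        busy_until[best] = finish[best]
        if finish[best] > deadline:
            violations += 1
    return violations

def static_schedule(requests, pinning):
    """Each model is pinned to one fixed backend regardless of current load."""
    busy_until = {"cpu": 0.0, "gpu": 0.0, "npu": 0.0}
    violations = 0
    for arrival, model, deadline in requests:
        b = pinning[model]
        start = max(arrival, busy_until[b])
        busy_until[b] = start + PROFILE[model][b]
        if busy_until[b] > deadline:
            violations += 1
    return violations

# A concurrent multi-model burst: (arrival time, model, absolute deadline).
reqs = [(0.0, "llm", 1.0), (0.0, "asr", 0.5), (0.1, "vision", 0.6),
        (0.2, "asr", 0.7), (0.2, "llm", 1.5)]

dyn = dynamic_schedule(reqs)                                  # spreads load
sta = static_schedule(reqs, {m: "gpu" for m in PROFILE})      # GPU-only pinning
print(dyn, sta)  # dynamic avoids the pileup the static policy suffers
```

Even in this toy setting, pinning everything to the fastest single backend serializes concurrent requests and misses deadlines, while the load-aware policy exploits the idle NPU; this is the same qualitative effect behind the paper's 41.7% average gap between policies.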
Problem

Research questions and friction points this paper is trying to address.

Scheduling RTGen workloads on heterogeneous SoCs (CPU/GPU/NPU)
Jointly optimizing real-time metrics (deadline violation rate) and LLM performance (time-to-first-token, tokens-per-second)
Understanding how workload dynamics and hardware heterogeneity limit static scheduling policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic scheduling for real-time generative AI (RTGen) workloads
Systematic characterization across CPU, GPU, and NPU backends on AMD Ryzen AI
Evidence that workload-aware scheduling policies significantly reduce deadline violations