Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scheduling agent-based LLM workloads on heterogeneous SoCs (CPU + integrated GPU + NPU) in personal devices faces conflicting demands between reactive tasks (requiring low latency) and proactive tasks (demanding high throughput), while existing inference engines lack coordinated, cross-accelerator scheduling capabilities. Method: This paper proposes the first fine-grained heterogeneous scheduling framework tailored for agent scenarios. It introduces a preemptible heterogeneous execution graph; elastic accelerator mapping and slack-aware backfilling to resolve NPU/GPU resource contention; and integrates offline analysis, predictive kernel annotation, affinity-guided operator fusion and tiling, runtime preemption, and bandwidth-aware task dispatch. Results: Evaluated on Intel Core Ultra platforms, the framework reduces end-to-end latency of reactive tasks by 4.6× and improves throughput of proactive tasks by 1.6–6.8×, significantly enhancing system-level efficiency and hardware resource utilization.

📝 Abstract
The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6× lower latency for reactive tasks and sustains 1.6×–6.8× higher throughput for proactive tasks compared to state-of-the-art inference engines.
Problem

Research questions and friction points this paper is trying to address.

Efficiently schedule agentic LLM workloads on heterogeneous SoCs
Manage concurrent reactive and proactive tasks with conflicting demands
Optimize resource use across CPU, GPU, and NPU accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous execution graph for affinity-guided mapping
Kernel-level preemption for reactive task responsiveness
Slack-aware backfill and bandwidth-aware dispatch
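The slack-aware backfill idea can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: kernel names, durations, and the greedy fitting policy are assumptions for exposition. The scheduler runs a deadline-bound reactive kernel first, then opportunistically appends proactive kernels whose predicted durations (from offline profiling, per the paper's design) fit within the remaining slack.

```python
# Illustrative sketch of slack-aware backfill (not Agent.xpu's actual code).
# A reactive kernel with a deadline runs first; proactive kernels are
# appended greedily only if their estimated duration fits the leftover slack.

from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    est_ms: float  # predicted duration, e.g. from offline profiling


def backfill(reactive: Kernel, deadline_ms: float,
             proactive: list[Kernel]) -> list[str]:
    """Schedule the reactive kernel, then backfill proactive kernels into slack."""
    schedule = [reactive.name]
    slack = deadline_ms - reactive.est_ms  # time budget left before the deadline
    for k in proactive:
        if k.est_ms <= slack:  # opportunistically append; skip kernels that overrun
            schedule.append(k.name)
            slack -= k.est_ms
    return schedule


plan = backfill(Kernel("decode_step", 8.0), 20.0,
                [Kernel("prefill_chunk", 7.0),
                 Kernel("embed_batch", 6.0),
                 Kernel("prefill_chunk2", 4.0)])
print(plan)  # ['decode_step', 'prefill_chunk', 'prefill_chunk2']
```

Note how `embed_batch` (6 ms) is skipped once only 5 ms of slack remains, while the smaller `prefill_chunk2` still fits; the real system makes this decision per kernel at runtime and pairs it with preemption and bandwidth-aware dispatch.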
Authors

Xinming Wei (Peking University)
Jiahao Zhang (School of Computer Science, Peking University)
Haoran Li (School of Computer Science, Peking University)
Jiayu Chen (School of Computer Science, Peking University)
Rui Qu (School of Computer Science, Peking University)
Maoliang Li (School of Computer Science, Peking University)
Xiang Chen (School of Computer Science, Peking University)
Guojie Luo (Peking University)