🤖 AI Summary
Scheduling agent-based LLM workloads on the heterogeneous SoCs (CPU + integrated GPU + NPU) of personal devices must reconcile conflicting objectives: reactive tasks require low latency, while proactive tasks demand high throughput. Existing inference engines, however, lack coordinated, cross-accelerator scheduling capabilities.
Method: This paper proposes the first fine-grained heterogeneous scheduling framework tailored to agent scenarios. It introduces a preemptible heterogeneous execution graph, together with elastic accelerator mapping and slack-aware backfilling to resolve NPU/iGPU resource contention, and it integrates offline analysis, predictive kernel annotation, affinity-guided operator fusion and tiling, runtime preemption, and bandwidth-aware task dispatch.
Results: Evaluated on Intel Core Ultra platforms, the framework reduces end-to-end latency of reactive tasks by 4.6× and improves throughput of proactive tasks by 1.6–6.8×, significantly enhancing system-level efficiency and hardware resource utilization.
📝 Abstract
The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6× lower latency for reactive tasks and sustains 1.6×–6.8× higher throughput for proactive tasks compared to state-of-the-art inference engines.
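To make the scheduling idea concrete, the core of kernel-level preemption plus slack-aware backfill can be sketched as a small priority-queue policy: reactive kernels always dispatch first, and a proactive kernel is backfilled only when its predicted runtime (from offline profiling) fits inside the slack before the next reactive deadline. This is a conceptual sketch, not the paper's implementation; all class and field names (`Kernel`, `SlackAwareScheduler`, `est_ms`) are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Kernel:
    priority: int                          # lower = more urgent; only field compared
    est_ms: float = field(compare=False)   # predicted runtime (hypothetical, from offline profiling)
    name: str = field(compare=False, default="")

class SlackAwareScheduler:
    """Toy single-accelerator queue illustrating the policy described above:
    reactive work preempts at kernel boundaries; proactive work is backfilled
    only if it fits in the available slack."""

    def __init__(self):
        self.reactive = []   # min-heaps keyed by priority
        self.proactive = []

    def submit(self, kernel, reactive=False):
        heapq.heappush(self.reactive if reactive else self.proactive, kernel)

    def next_kernel(self, slack_ms):
        # Reactive kernels always win: this models kernel-level preemption,
        # since a proactive kernel is never started when reactive work is pending.
        if self.reactive:
            return heapq.heappop(self.reactive)
        # Slack-aware backfill: run a proactive kernel only if its predicted
        # runtime fits within the slack, so it cannot delay the next reactive
        # arrival by more than one kernel boundary.
        if self.proactive and self.proactive[0].est_ms <= slack_ms:
            return heapq.heappop(self.proactive)
        return None  # idle until more slack or new work arrives
```

For example, with a pending 5 ms reactive decode kernel and a 20 ms proactive batch kernel, `next_kernel(10.0)` returns the reactive kernel first, then declines to backfill the batch kernel (20 ms > 10 ms slack) until a larger slack window opens.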