🤖 AI Summary
Scheduling agent-based LLM workloads on the heterogeneous SoCs (CPU + integrated GPU + NPU) of personal devices must reconcile conflicting objectives: reactive tasks require low latency, while proactive tasks demand high throughput. Existing inference engines, however, lack coordinated, cross-accelerator scheduling capabilities.
Method: This paper proposes the first fine-grained heterogeneous scheduling framework tailored to agent scenarios. It introduces a preemptible heterogeneous execution graph, together with elastic accelerator mapping and slack-aware backfilling to resolve NPU/iGPU resource contention, and it integrates offline analysis, predictive kernel annotation, affinity-guided operator fusion and tiling, runtime preemption, and bandwidth-aware task dispatch.
Results: Evaluated on Intel Core Ultra platforms, the framework reduces end-to-end latency of reactive tasks by 4.6× and improves throughput of proactive tasks by 1.6–6.8×, significantly enhancing system-level efficiency and hardware resource utilization.
📝 Abstract
The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6× lower latency for reactive tasks and sustains 1.6×–6.8× higher throughput for proactive tasks compared to state-of-the-art inference engines.
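To make the scheduling idea concrete, the core of kernel-level preemption plus slack-aware backfill can be sketched as a small priority-queue policy: reactive kernels always dispatch first, and a proactive kernel is backfilled only when its predicted runtime (from offline profiling) fits inside the slack before the next reactive deadline. This is a conceptual sketch, not the paper's implementation; all class and field names (`Kernel`, `SlackAwareScheduler`, `est_ms`) are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Kernel:
    priority: int                          # lower = more urgent; only field compared
    est_ms: float = field(compare=False)   # predicted runtime (hypothetical, from offline profiling)
    name: str = field(compare=False, default="")

class SlackAwareScheduler:
    """Toy single-accelerator queue illustrating the policy described above:
    reactive work preempts at kernel boundaries; proactive work is backfilled
    only if it fits in the available slack."""

    def __init__(self):
        self.reactive = []   # min-heaps keyed by priority
        self.proactive = []

    def submit(self, kernel, reactive=False):
        heapq.heappush(self.reactive if reactive else self.proactive, kernel)

    def next_kernel(self, slack_ms):
        # Reactive kernels always win: this models kernel-level preemption,
        # since a proactive kernel is never started when reactive work is pending.
        if self.reactive:
            return heapq.heappop(self.reactive)
        # Slack-aware backfill: run a proactive kernel only if its predicted
        # runtime fits within the slack, so it cannot delay the next reactive
        # arrival by more than one kernel boundary.
        if self.proactive and self.proactive[0].est_ms <= slack_ms:
            return heapq.heappop(self.proactive)
        return None  # idle until more slack or new work arrives
```

For example, with a pending 5 ms reactive decode kernel and a 20 ms proactive batch kernel, `next_kernel(10.0)` returns the reactive kernel first, then declines to backfill the batch kernel (20 ms > 10 ms slack) until a larger slack window opens.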