HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC

📅 2026-03-02
🤖 AI Summary
This work addresses inefficient scheduling of multi-stage, dynamically executed agentic RAG workflows on heterogeneous mobile SoCs. The inefficiency stems from three hardware properties: accelerator affinity, sensitivity to workload shape, and contention for shared memory bandwidth. To tackle these challenges, the authors propose HeRo, a heterogeneity-aware framework built around a lightweight online scheduler for mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-processing-unit configuration, then uses them to coordinate resources through shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to 10.94× over existing deployment strategies, making on-device agentic RAG practical.

📝 Abstract
With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
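
To make the scheduling idea concrete, the following is a minimal sketch of how criticality-based accelerator mapping and bandwidth-aware concurrency control could be combined in one greedy pass. It is not HeRo's actual algorithm: the profile table, the bandwidth figures, the sub-stage names, and the `plan` function are all hypothetical, and shape-aware partitioning is left out for brevity.

```python
# Illustrative sketch only; every name and number below is an assumption,
# not taken from the paper.

# Hypothetical profiled latency in ms for each (sub-stage, processing unit) pair.
PROFILE = {
    ("embed", "NPU"): 4.0, ("embed", "GPU"): 6.0, ("embed", "CPU"): 15.0,
    ("rerank", "NPU"): 9.0, ("rerank", "GPU"): 7.0, ("rerank", "CPU"): 30.0,
    ("decode", "NPU"): 50.0, ("decode", "GPU"): 20.0, ("decode", "CPU"): 45.0,
}

# Hypothetical share of the shared memory bandwidth each sub-stage consumes (percent).
BW_DEMAND = {"embed": 20, "rerank": 40, "decode": 70}


def plan(ready, criticality, units=("NPU", "GPU", "CPU"), bw_budget=100):
    """Greedily map ready sub-stages to processing units.

    Sub-stages are visited from most to least critical. Each one takes the
    free unit with the lowest profiled latency, unless admitting it would
    push total bandwidth demand past the budget, in which case it is deferred.
    """
    free = list(units)
    used_bw = 0
    assignment, deferred = {}, []
    for stage in sorted(ready, key=lambda s: criticality[s], reverse=True):
        if not free or used_bw + BW_DEMAND[stage] > bw_budget:
            deferred.append(stage)  # wait for a later scheduling round
            continue
        best = min(free, key=lambda u: PROFILE[(stage, u)])
        assignment[stage] = best
        free.remove(best)
        used_bw += BW_DEMAND[stage]
    return assignment, deferred


# Example round: "decode" lies on the critical path, so it picks first.
assignment, deferred = plan(
    ["embed", "rerank", "decode"],
    {"decode": 3, "rerank": 2, "embed": 1},
)
print(assignment, deferred)
```

In this toy round the most critical sub-stage claims its fastest unit first, and a bandwidth-heavy sub-stage is deferred rather than run concurrently, which is the kind of trade-off the abstract attributes to contention-aware scheduling. A real scheduler would also weigh the predicted slowdown against the cost of waiting.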
Problem

Research questions and friction points this paper is trying to address.

agentic RAG
heterogeneous mobile SoC
multi-stage workflow
accelerator affinity
memory bandwidth contention
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic RAG
heterogeneous SoC
adaptive orchestration
performance modeling
online scheduling