Twill: Scheduling Compound AI Systems on Heterogeneous Mobile Edge Platforms

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the runtime scheduling challenge of dynamic concurrent inference of DNNs and transformers on heterogeneous mobile edge platforms, which is characterized by unknown task arrivals, diverse computational demands, and stringent power constraints, and where existing approaches fail to jointly optimize latency and energy efficiency. The authors propose Twill, the first runtime cooperative scheduling framework tailored for compound AI (cAI) systems. Key contributions include: (i) task affinity-aware cluster mapping and dynamic migration; (ii) priority-driven freezing/unfreezing of inference tasks; and (iii) a joint optimization algorithm integrating DVFS and load-aware resource allocation. Evaluated on an NVIDIA Jetson Orin NX platform, the framework achieves a 54% average reduction in inference latency over state-of-the-art methods while strictly adhering to prescribed power budgets.

📝 Abstract
Compound AI (cAI) systems chain multiple AI models to solve complex problems. cAI systems are typically composed of deep neural networks (DNNs), transformers, and large language models (LLMs), exhibiting a high degree of computational diversity and dynamic workload variation. Deploying cAI services on mobile edge platforms poses a significant challenge in scheduling concurrent DNN-transformer inference tasks, which arrive dynamically in an unknown sequence. Existing mobile edge AI inference strategies manage multi-DNN or transformer-only workloads, relying on design-time profiling, and cannot handle concurrent inference of DNNs and transformers required by cAI systems. In this work, we address the challenge of scheduling cAI systems on heterogeneous mobile edge platforms. We present Twill, a run-time framework to handle concurrent inference requests of cAI workloads through task affinity-aware cluster mapping and migration, priority-aware task freezing/unfreezing, and DVFS, while minimizing inference latency within power budgets. We implement and deploy our Twill framework on the Nvidia Jetson Orin NX platform. We evaluate Twill against state-of-the-art edge AI inference techniques over contemporary DNNs and LLMs, reducing inference latency by 54% on average, while honoring power budgets.
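The priority-aware task freezing/unfreezing described in the abstract can be approximated at the process level with POSIX job-control signals. The sketch below is illustrative only and is not Twill's actual mechanism; the `freeze`/`unfreeze` helpers and the `sleep` stand-in task are assumptions for demonstration:

```python
import os
import signal
import subprocess
import time

def freeze(pid: int) -> None:
    """Pause a task so a higher-priority inference request can use its cluster."""
    os.kill(pid, signal.SIGSTOP)

def unfreeze(pid: int) -> None:
    """Resume a previously frozen task once the power budget allows."""
    os.kill(pid, signal.SIGCONT)

# Stand-in "inference task": a sleeping worker process.
worker = subprocess.Popen(["sleep", "5"])
freeze(worker.pid)     # low-priority task yields its resources
time.sleep(0.1)        # ... a high-priority request would run here ...
unfreeze(worker.pid)   # frozen task resumes
worker.terminate()
worker.wait()
```

A frozen process keeps its memory and state, so resuming it avoids the cost of restarting inference from scratch, which is what makes freezing attractive over killing and relaunching tasks.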
Problem

Research questions and friction points this paper is trying to address.

Scheduling diverse AI models on mobile edge platforms
Managing concurrent DNN-transformer inference tasks dynamically
Optimizing latency and power for compound AI systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Task affinity-aware cluster mapping and migration
Priority-aware task freezing and unfreezing
DVFS for minimizing latency within power budgets
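Taken together, these ideas can be sketched as a toy scheduler: an affinity rule decides the cluster for each task, and when the platform's power budget is exceeded, the lowest-priority tasks are frozen first. All class names, power figures, and the mapping rule below are illustrative assumptions, not Twill's actual policy; DVFS frequency scaling is omitted for brevity:

```python
from dataclasses import dataclass

# Assumed cluster names for a Jetson-class platform (illustrative only).
CLUSTERS = ("gpu", "cpu_big", "cpu_little")

@dataclass
class Task:
    name: str
    kind: str        # "dnn" or "transformer"
    priority: int    # higher = more urgent
    power: float     # estimated draw on its mapped cluster (W)
    frozen: bool = False

def map_cluster(task: Task) -> str:
    # Affinity-aware mapping sketch: attention-heavy transformers favor the
    # GPU, while lighter DNN stages can run on the big CPU cluster.
    return "gpu" if task.kind == "transformer" else "cpu_big"

def enforce_budget(tasks: list[Task], budget: float) -> float:
    # Priority-aware freezing: pause the lowest-priority running tasks
    # until the estimated total draw fits within the power budget.
    running = sorted((t for t in tasks if not t.frozen),
                     key=lambda t: t.priority)
    total = sum(t.power for t in running)
    for t in running:
        if total <= budget:
            break
        t.frozen = True
        total -= t.power
    return total
```

For example, with an LLM task (priority 3, 9 W) and two DNN tasks (priorities 2 and 1, 4 W and 3 W) under a 12 W budget, the two DNN tasks are frozen and the LLM keeps running. A real scheduler would additionally migrate tasks between clusters and adjust DVFS operating points, as the paper describes.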