A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets the memory bottleneck in autoregressive decoding of large language models on heterogeneous NPUs such as the Ascend 910B. The authors argue that statically deploying a single model size leads to a “model scaling paradox,” and that fine-grained speculative decoding is held back by kernel synchronization overhead and graph compilation constraints. They propose an adaptive inference orchestration mechanism that coordinates multi-scale model selection, computation graph compilation optimization, and speculative decoding scheduling at runtime. The approach is reported to avoid the synchronization overhead, improve memory bandwidth utilization, and deliver higher inference throughput and lower latency on memory-constrained NPUs than existing solutions.
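To make "memory-bound" concrete, the sketch below estimates a lower bound on per-token decode latency by assuming every generated token requires streaming all model weights from device memory once. This is a generic back-of-the-envelope model, not the paper's method; the function name, parameter counts, precision, and bandwidth figure are all hypothetical and do not describe any specific NPU.

```python
def decode_latency_lower_bound_ms(param_count, bytes_per_param, bandwidth_gb_s):
    """Lower bound on per-token decode latency when weight reads dominate.

    Assumes (hypothetically) that each decoded token streams all weights
    from device memory exactly once and that compute is fully hidden.
    """
    weight_bytes = param_count * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_gb_s * 1e9
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# Hypothetical inputs: 16-bit weights and 1000 GB/s of memory bandwidth.
small_ms = decode_latency_lower_bound_ms(7e9, 2, 1000.0)   # about 14 ms per token
large_ms = decode_latency_lower_bound_ms(70e9, 2, 1000.0)  # about 140 ms per token
print(f"7B: {small_ms:.1f} ms/token, 70B: {large_ms:.1f} ms/token")
```

Under this simplified model the bound scales linearly with model size, which is one reason runtime choice among model scales and verification of several drafted tokens per weight pass are attractive on bandwidth-limited hardware.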

📝 Abstract

During the deployment of Large Language Models (LLMs), the autoregressive decoding phase on heterogeneous NPU platforms (e.g., Ascend 910B) is severely memory-bound. This study identifies a “Model Scaling Paradox” caused by statically deploying a single model size. It also points out the kernel synchronization overhead that fine-grained speculative decoding (Leviathan et al., 2023; Chen et al., 2023) incurs under NPU computational graph compilation, and the severe limitations of relying purely on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD)
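For readers unfamiliar with PLD, the sketch below shows the general idea behind prompt lookup drafting: the last few tokens of the sequence are matched against earlier positions, and the tokens that followed the match are proposed as a draft for the target model to verify in one forward pass. This is a minimal illustration of the general technique, not the paper's implementation; the function name `prompt_lookup_candidates` and its parameters `ngram_size` and `num_draft` are hypothetical.

```python
def prompt_lookup_candidates(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Hypothetical helper: returns up to `num_draft` tokens that followed the
    most recent earlier occurrence of the last `ngram_size` tokens, or an
    empty list when there is no earlier match.
    """
    if len(tokens) < ngram_size + 1:
        return []
    suffix = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier match wins; the range
    # excludes the position of the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == suffix:
            end = start + ngram_size
            return tokens[end:end + num_draft]
    return []


# Token ids are made up for illustration.
history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(prompt_lookup_candidates(history, num_draft=3))  # [4, 5, 6]
```

The draft costs no extra model call, but it only helps when the output repeats spans of the context, which is consistent with the abstract's point that such micro-level methods are limited on their own.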

Problem

Research questions and friction points this paper is trying to address.

memory-bound

Large Language Models

NPU

autoregressive decoding

speculative decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Inference Orchestration

Memory-Bound NPUs

Model Scaling Paradox