A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

📅 2026-04-10

🤖 AI Summary
This work addresses the memory bottleneck that constrains autoregressive decoding of large language models on heterogeneous NPUs such as the Ascend 910B, where static deployment leads to the “model scaling paradox” and fine-grained speculative decoding is hindered by kernel synchronization overhead and graph compilation limitations. To overcome these challenges, the authors propose an adaptive inference orchestration mechanism that dynamically coordinates multi-scale model selection, computation graph compilation optimization, and speculative decoding scheduling at runtime. This approach effectively circumvents synchronization overhead and enhances memory bandwidth utilization, thereby transcending the limitations of static deployment and micro-level acceleration. The method achieves significant improvements in inference throughput and latency on memory-constrained NPUs, outperforming existing solutions.
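To make "memory-bound" concrete, here is a minimal back-of-the-envelope sketch. It assumes single-stream decoding in which every generated token requires reading all model weights from device memory once, and it ignores KV-cache traffic and compute time. The function name, the model sizes, and the 1 TB/s bandwidth are illustrative placeholders, not figures from the paper or measured Ascend 910B values.

```python
def decode_tokens_per_second_bound(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on decode rate when reading the weights dominates each step.

    Assumes one full pass over the weights per generated token (batch size 1).
    """
    if num_params <= 0 or bytes_per_param <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("all arguments must be positive")
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical numbers chosen only for illustration.
BANDWIDTH = 1.0e12  # bytes per second (placeholder, not a hardware spec)
small_model_rate = decode_tokens_per_second_bound(7e9, 2, BANDWIDTH)
large_model_rate = decode_tokens_per_second_bound(70e9, 2, BANDWIDTH)

print(f"7B  model bound: {small_model_rate:.1f} tokens/s")
print(f"70B model bound: {large_model_rate:.1f} tokens/s")
```

Under these assumptions the bound scales inversely with model size, which is why the choice of model scale and the number of tokens accepted per weight pass matter more than raw compute in this regime.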

📝 Abstract
During the deployment of Large Language Models (LLMs), the autoregressive decoding phase on heterogeneous NPU platforms (e.g., Ascend 910B) faces severe memory-bound challenges. This study reveals the “Model Scaling Paradox” caused by the static deployment of single-sized models. It also points out the kernel synchronization overhead of fine-grained speculative decoding \cite{leviathan2023fast, chen2023speculative} under NPU computational graph compilation, and the severe limitations of relying purely on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD).
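For readers unfamiliar with Prompt LookUp Decoding, the drafting step can be sketched as follows. This is a minimal illustration of the general idea (reuse tokens that followed an earlier occurrence of the current trailing n-gram), not the authors' implementation. The function name and the parameters `max_ngram` and `num_draft` are hypothetical, and the verification pass in which the target model accepts or rejects the drafted tokens is omitted.

```python
def propose_draft_tokens(token_ids, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence.

    token_ids: prompt plus tokens generated so far, as a list of ints.
    Returns up to num_draft tokens that followed the most recent earlier match,
    or an empty list when no n-gram of length 1..max_ngram matches.
    """
    n_tokens = len(token_ids)
    # Prefer longer n-grams: a longer match is a stronger hint about what follows.
    for ngram_len in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = token_ids[n_tokens - ngram_len:]
        # Scan right to left so the most recent earlier occurrence wins;
        # the start range excludes the suffix's own position.
        for start in range(n_tokens - ngram_len - 1, -1, -1):
            if token_ids[start:start + ngram_len] == suffix:
                end = start + ngram_len
                return token_ids[end:end + num_draft]
    return []


# The trailing [1, 2, 3] also appears at the start, followed by 4, 5.
draft = propose_draft_tokens([1, 2, 3, 4, 5, 1, 2, 3], num_draft=2)
print(draft)
```

Because the draft is copied from existing context, this step needs no draft model, but it only helps when the output repeats spans of the input, which is one reason such micro-level methods are limited on their own.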

Problem

Research questions and friction points this paper is trying to address.

memory-bound
Large Language Models
NPU
autoregressive decoding
speculative decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Inference Orchestration
Memory-Bound NPUs
Model Scaling Paradox
Speculative Decoding
Large Language Models