AI Summary
This work addresses the memory bottleneck that constrains autoregressive decoding of large language models on heterogeneous NPUs such as the Ascend 910B, where static deployment leads to the “model scaling paradox” and fine-grained speculative decoding is hindered by kernel synchronization overhead and graph compilation limitations. To overcome these challenges, the authors propose an adaptive inference orchestration mechanism that dynamically coordinates multi-scale model selection, computation graph compilation optimization, and speculative decoding scheduling at runtime. This approach effectively circumvents synchronization overhead and enhances memory bandwidth utilization, thereby transcending the limitations of static deployment and micro-level acceleration. The method achieves significant improvements in inference throughput and latency on memory-constrained NPUs, outperforming existing solutions.
Abstract
When Large Language Models (LLMs) are deployed on heterogeneous NPU platforms (e.g., Ascend 910B), the autoregressive decoding phase is severely memory-bound. This study reveals the ``Model Scaling Paradox'' caused by the static deployment of single-sized models. It also identifies the kernel synchronization overhead that fine-grained speculative decoding \cite{leviathan2023fast, chen2023speculative} incurs under NPU computational graph compilation, and the severe limitations of relying purely on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD)