Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the deployment of large language models on heterogeneous many-core supercomputers, where high computational cost and limited memory bandwidth make it difficult to migrate existing GPU inference frameworks directly. The authors propose THInfer, a hardware-aware inference framework that maximizes data locality through hardware-software co-design. THInfer combines a hand-optimized FP16 operator library for VLIW SIMD architectures, a density-driven computation graph fusion mechanism, and a Prefill-Buffer-Decode (P-B-D) pipeline with bounded buffer management. Two-level communication over MPI and hthreads supports hybrid parallelism and multi-cluster collaboration. On Llama-family models, THInfer improves 7B throughput by 62% to 73% over DeepSpeed on two V100S GPUs and by 67% to 84% over an A800 GPU, and it runs the 70B model stably in a setting where typical GPU-based frameworks fail.
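To make the bandwidth constraint concrete, the following back-of-the-envelope sketch estimates an upper bound on decode throughput when each decoding step must stream all model weights from main memory once. It is a generic roofline-style estimate, not a calculation from the paper. The function name and the 100 GB/s figure are hypothetical placeholders, not MT-3000 specifications.

```python
def decode_steps_per_second(params_billion: float,
                            bytes_per_param: int,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on decode steps per second for a bandwidth-bound model.

    Assumes every step reads all weights exactly once and ignores
    KV-cache traffic, compute time, and communication.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical example: a 7B-parameter model in FP16 (2 bytes per parameter)
# on a device with an assumed 100 GB/s of main-memory bandwidth.
bound = decode_steps_per_second(params_billion=7, bytes_per_param=2,
                                bandwidth_gb_s=100)
print(f"at most {bound:.2f} decode steps per second")
```

Under this simplified model, batching several requests per step or keeping more data in on-chip memory raises the number of tokens produced per byte read, which is why data locality matters so much when bandwidth is the limiting resource.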

📝 Abstract

Large language model (LLM) inference is limited by high computational cost and memory bandwidth demands, making deployment on heterogeneous many-core processors challenging. Taking the MT-3000 processor used in the Tianhe supercomputer as an example, its limited main-memory bandwidth and distributed memory hierarchy exemplify these bottlenecks, making it difficult to directly migrate existing GPU-based inference frameworks. To address this problem, we propose THInfer, a hardware-aware inference framework that maximizes data locality under bandwidth-constrained conditions through hardware-software co-design and parallel strategy optimization. THInfer incorporates three key techniques: (1) a high-performance operator library for the VLIW SIMD architecture, providing hand-optimized FP16 kernels that achieve up to 70 percent of the peak performance per cluster; (2) a density-driven computation graph fusion and unified kernel scheduling mechanism, combined with a staged pipelined attention fusion method; and (3) a Prefill-Buffer-Decode (P-B-D) pipeline and bounded buffer management strategy, which supports hybrid parallelism and enables efficient multi-cluster collaboration through two-level communication based on MPI and hthreads. Experiments on the Llama model series show that THInfer improves throughput on the 7B model by 62 percent to 73 percent over DeepSpeed on two V100S GPUs and by 67 percent to 84 percent over the A800 GPU. The 13B and 30B models also demonstrate comparable or better performance. Moreover, THInfer maintains stable performance on the 70B model, whereas typical GPU-based frameworks fail to run under the same setting. Overall, THInfer significantly enhances throughput, reduces latency, and improves scalability, providing a feasible system solution for efficient and scalable LLM inference on heterogeneous many-core architectures.
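The abstract does not describe how the P-B-D pipeline is implemented. As a rough illustration of the general idea only, the sketch below connects a prefill stage and a decode stage through a bounded buffer, so that prefill blocks when the buffer is full instead of accumulating unbounded state. It uses Python threads and `queue.Queue` as stand-ins for MPI and hthreads. The function `run_pipeline`, its parameters, and the dummy token generation are all hypothetical.

```python
import queue
import threading


def run_pipeline(prompts, buffer_capacity=2, decode_steps=3):
    """Run a toy prefill -> bounded buffer -> decode pipeline."""
    buffer = queue.Queue(maxsize=buffer_capacity)  # the bounded buffer
    done = object()                                # end-of-stream sentinel
    results = {}

    def prefill_worker():
        for request_id, prompt in enumerate(prompts):
            # Stand-in for building a KV cache from the prompt.
            state = {"id": request_id, "context": prompt.split()}
            buffer.put(state)  # blocks while the buffer is full
        buffer.put(done)

    def decode_worker():
        while True:
            state = buffer.get()  # blocks while the buffer is empty
            if state is done:
                break
            # Stand-in for autoregressive decoding.
            results[state["id"]] = [f"tok{step}" for step in range(decode_steps)]

    workers = [threading.Thread(target=prefill_worker),
               threading.Thread(target=decode_worker)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


outputs = run_pipeline(["hello world", "bandwidth aware inference"],
                       buffer_capacity=1, decode_steps=2)
print(outputs)
```

The buffer capacity caps how many prefilled requests can wait for decoding at once, which bounds the memory held by pending requests at the cost of occasionally stalling the prefill stage.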

Problem

Research questions and friction points this paper is trying to address.

LLM inference

memory bandwidth

heterogeneous many-core

supercomputers

deployment bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

hardware-aware inference

bandwidth-constrained optimization

computation graph fusion