ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited scalability of existing large language model (LLM) inference frameworks on many-core CPUs, which is hindered primarily by the high overhead of cross-NUMA-node memory accesses. To overcome this challenge, the authors propose ArcLight, a lightweight LLM inference framework explicitly optimized for NUMA-organized CPU platforms. By integrating NUMA-aware thread scheduling, efficient memory management, and fine-grained tensor parallelism, ArcLight substantially reduces inter-node communication costs. The design remains compatible with arbitrary CPU devices while achieving up to a 46% improvement in inference throughput, significantly surpassing the performance ceiling of current mainstream frameworks.
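As a rough illustration of what NUMA-aware thread scheduling means in practice (the machine model and helper names below are assumptions for exposition, not ArcLight's actual scheduler), the core idea is to give each worker thread a core set drawn from a single NUMA node, so that the thread's allocations and accesses stay node-local:

```python
# Illustrative sketch (not ArcLight code): keep each worker thread on
# one NUMA node so its memory traffic never crosses the interconnect.
# On Linux, a resulting core set could be applied with os.sched_setaffinity.

def numa_core_map(num_nodes, cores_per_node):
    """Model a machine as a list of node-local core-id sets."""
    return [set(range(n * cores_per_node, (n + 1) * cores_per_node))
            for n in range(num_nodes)]

def assign_workers(num_workers, num_nodes, cores_per_node):
    """Round-robin workers over nodes; each worker is confined to the
    cores of exactly one node, avoiding cross-node memory access."""
    nodes = numa_core_map(num_nodes, cores_per_node)
    return {w: nodes[w % num_nodes] for w in range(num_workers)}
```

For example, on a hypothetical 2-node machine with 8 cores per node, workers 0 and 2 land on node 0's cores and workers 1 and 3 on node 1's, so no worker ever touches both nodes.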

📝 Abstract
Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computational potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and the deployment of intelligent services on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.
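The paper details the actual mechanism; as a hedged sketch of the fine-grained tensor parallelism the abstract describes (the sharding policy and function names here are assumptions, not ArcLight's API), a column-wise weight split lets each NUMA node compute its partial result entirely from node-local memory, with only the small output concatenated across nodes:

```python
# Illustrative sketch (not ArcLight code): column-wise tensor parallelism.
# Each node holds one shard of the weight matrix, so the matrix-vector
# product for that shard touches only node-local memory.

def shard_columns(weight, num_nodes):
    """Split a row-major matrix column-wise into num_nodes shards."""
    cols = len(weight[0])
    base, extra = divmod(cols, num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        width = base + (1 if node < extra else 0)
        shards.append([row[start:start + width] for row in weight])
        start += width
    return shards

def local_matvec(x, shard):
    """Partial result computed entirely against one node's shard."""
    return [sum(xi * shard[i][j] for i, xi in enumerate(x))
            for j in range(len(shard[0]))]

def parallel_matvec(x, weight, num_nodes):
    """Concatenate node-local partial results; in a real system each
    local_matvec would run on a thread pinned to its own NUMA node."""
    out = []
    for shard in shard_columns(weight, num_nodes):
        out.extend(local_matvec(x, shard))
    return out
```

Because columns are disjoint, the per-node partial outputs concatenate into exactly the unsharded result; the only cross-node traffic is the final gather of output slices.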
Problem

Research questions and friction points this paper is trying to address.

LLM inference
many-core CPUs
NUMA
memory access overhead
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

many-core CPUs
NUMA-aware
tensor parallelism
LLM inference
memory management