Compass: Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work primarily targets static CNN/Transformer workloads and struggles with the dynamic mapping challenges posed by mixed request types and variable sequence lengths in LLM inference serving. This paper proposes a fine-grained mapping space exploration framework for multi-chiplet accelerators. It introduces a computation-execution-graph-based mapping encoding scheme that decouples micro-batches from layers, enabling precise execution control across heterogeneous chiplets. Building on this scheme, it pairs an evaluation engine with a genetic-algorithm-based mapping generation engine for efficient search, jointly modeling tensor parallelism, pipeline parallelism, and expert parallelism. Experiments show that the approach reduces the energy-delay product (EDP) by 63.12% on average compared with state-of-the-art methods, improving both resource utilization and inference throughput.

📝 Abstract
Large Language Models (LLMs) impose massive computational demands, driving the need for scalable multi-chiplet accelerators. However, existing mapping space exploration efforts for such accelerators primarily focus on traditional CNN/Transformer workloads and fail to adequately support the dynamic behaviors of mixed request types and variable sequence lengths in real-world LLM inference serving. To bridge this gap, we first propose a computation execution graph-based mapping encoding scheme that decouples micro-batches and layers, enabling fine-grained execution control on heterogeneous chiplets and flexibly representing various parallelism strategies. Second, building upon this scheme, we develop the Compass framework, which integrates an evaluation engine and a genetic algorithm-based mapping generation engine to achieve efficient mapping search. Compared to state-of-the-art works, our solution achieves an average EDP reduction of 63.12%.
Problem

Research questions and friction points this paper is trying to address.

Mapping space exploration for multi-chiplet accelerators targeting LLM inference workloads
Addressing dynamic behaviors like mixed request types and variable sequence lengths
Improving efficiency over existing approaches that focus on CNN/Transformer workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mapping encoding scheme decouples micro-batches and layers
Compass framework integrates evaluation and genetic algorithm engines
Achieves efficient mapping search for multi-chiplet LLM accelerators
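
To make the idea of a genetic-algorithm-based mapping search with an EDP objective concrete, here is a minimal sketch. It is not the Compass encoding or its evaluation engine: the chiplet table, the cost model (which ignores inter-layer dependencies and communication), and every name (`CHIPLETS`, `edp`, `search`) are hypothetical and chosen only for illustration. Each gene assigns one (micro-batch, layer) task to a chiplet, and fitness is energy multiplied by delay.

```python
import random

# Hypothetical toy platform: (time per task, energy per task) for each chiplet.
CHIPLETS = [(1.0, 3.0), (2.0, 1.0), (1.5, 1.5)]
NUM_MICRO_BATCHES, NUM_LAYERS = 4, 3
GENOME_LEN = NUM_MICRO_BATCHES * NUM_LAYERS  # one gene per (micro-batch, layer)


def edp(genome):
    """Energy-delay product of a mapping under the toy cost model."""
    busy = [0.0] * len(CHIPLETS)
    energy = 0.0
    for chiplet in genome:
        task_time, task_energy = CHIPLETS[chiplet]
        busy[chiplet] += task_time
        energy += task_energy
    delay = max(busy)  # assumption: chiplets run in parallel, dependencies ignored
    return energy * delay


def search(generations=50, pop_size=20, mutation_rate=0.1, seed=0):
    """Elitist genetic search for a low-EDP mapping."""
    rng = random.Random(seed)
    pop = [[rng.randrange(len(CHIPLETS)) for _ in range(GENOME_LEN)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=edp)
        parents = pop[: pop_size // 2]  # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, GENOME_LEN)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally reassign a task to a random chiplet.
            child = [rng.randrange(len(CHIPLETS)) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = parents + children
    return min(pop, key=edp)


if __name__ == "__main__":
    best = search()
    print(best, edp(best))
```

Because the better half of each generation is carried over unchanged, the best EDP in the population never gets worse from one generation to the next. A realistic evaluator would also need to account for dependencies between layers, inter-chiplet communication, and variable sequence lengths, none of which this sketch models.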
Boyu Li
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Zongwei Zhu
Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Yi Xiong
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Qianyue Cao
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Jiawei Geng
School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China; Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
Xiaonan Zhang
Assistant Professor of Computer Science, Florida State University
Wireless communication and network, Edge AI, Internet of Things
Xi Li
Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China