Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the energy efficiency and inference performance of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator within the National Research Platform (NRP) high-performance computing (HPC) ecosystem for large language model (LLM) inference, presenting the first empirical comparison of QAic against state-of-the-art GPUs (NVIDIA A100/H200, AMD MI300A) in this context. Using the vLLM framework, the authors run end-to-end inference benchmarks across 15 open-source LLMs (117M–90B parameters), measuring throughput, latency, and performance per watt. Results show that QAic achieves up to 2.3× higher energy efficiency than leading GPUs for mid-scale models (7B–13B) while operating at lower total power consumption, empirically validating its feasibility for HPC deployment. The work addresses a gap in the energy-efficiency assessment of AI inference accelerators at NRP scale, offering insights for low-carbon, cost-effective co-design of HPC infrastructure and LLM workloads.

📝 Abstract
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appear to be energy efficient and perform well on the energy-efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the NRP.
Problem

Research questions and friction points this paper is trying to address.

Compare Qualcomm AI 100 Ultra with NVIDIA and AMD GPUs for LLM inference
Evaluate energy efficiency and performance of 15 open-source LLMs
Assess Qualcomm AI 100 Ultra's potential in HPC applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarks Qualcomm AI 100 Ultra against GPUs
Uses vLLM framework for serving LLMs
Evaluates energy efficiency in HPC clusters
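The paper's core metric is energy efficiency expressed as throughput per watt. A minimal sketch of how such a metric could be computed from benchmark samples is shown below; the function name and all numbers are illustrative placeholders, not the paper's code or results.

```python
# Illustrative sketch (not the paper's code): computing throughput per watt,
# the energy-efficiency metric used in the study, from benchmark samples.

def throughput_per_watt(tokens_generated: int, elapsed_s: float, avg_power_w: float) -> float:
    """Tokens generated per second, per watt of average board power."""
    if elapsed_s <= 0 or avg_power_w <= 0:
        raise ValueError("elapsed time and power must be positive")
    return (tokens_generated / elapsed_s) / avg_power_w

# Hypothetical measurements for a mid-scale model run (placeholder values).
qaic_eff = throughput_per_watt(tokens_generated=120_000, elapsed_s=60.0, avg_power_w=150.0)
gpu_eff = throughput_per_watt(tokens_generated=300_000, elapsed_s=60.0, avg_power_w=700.0)

print(f"QAic: {qaic_eff:.2f} tok/s/W, GPU: {gpu_eff:.2f} tok/s/W")
```

Comparing accelerators on this ratio, rather than raw throughput, is what lets a lower-power card come out ahead even when an absolute-throughput comparison favors the GPU.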
Mohammad Firas Sada
University of California, San Diego
John J. Graham
University of California, San Diego
Elham E Khoda
University of California, San Diego
Mahidhar Tatineni
University of California, San Diego
Dmitry Mishin
University of California, San Diego
Rajesh K. Gupta
Professor of Computer Science and Engineering, Halıcıoğlu Data Science Institute, UC San Diego
Embedded Systems, Cyber-Physical Systems, Computer-Aided Design, Design Automation, EDA
Rick Wagner
San Diego Supercomputer Center
Astrophysics, Turbulence, Cyberinfrastructure
Larry Smarr
University of California, San Diego
Thomas A. DeFanti
University of California, San Diego
Frank Würthwein
University of California, San Diego