QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM evaluation frameworks lack coverage of computer architecture, the interdisciplinary domain that bridges software and hardware. To address this gap, the paper introduces QuArch, the first domain-specific benchmark for evaluating LLMs on computer architecture. It comprises 2,671 expert-curated and validated question-answer pairs spanning core areas including processor design, memory systems, and interconnection networks, and supports fine-grained assessment across four task categories: knowledge comprehension, analysis, design, and implementation. QuArch systematically exposes substantial performance disparities among state-of-the-art LLMs on architecture reasoning tasks, with accuracy varying widely from 34% to 72% and particularly poor performance on higher-order design and implementation questions. By filling this gap in hardware-software evaluation, QuArch provides a reproducible and scalable foundation for diagnosing LLM capabilities and guiding model improvement in computer architecture.
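The per-category scoring the summary describes can be sketched as follows. This is a minimal illustration only: the record fields (`category`, `answer`, `prediction`) and the multiple-choice format are assumptions for the sketch, not QuArch's actual schema.

```python
from collections import defaultdict

# Hypothetical QA records; field names are assumed, not taken from QuArch.
qa_pairs = [
    {"category": "knowledge", "answer": "B", "prediction": "B"},
    {"category": "analysis", "answer": "C", "prediction": "A"},
    {"category": "design", "answer": "D", "prediction": "D"},
    {"category": "implementation", "answer": "A", "prediction": "C"},
]

def per_category_accuracy(pairs):
    """Fraction of correct predictions within each task category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in pairs:
        total[qa["category"]] += 1
        correct[qa["category"]] += qa["prediction"] == qa["answer"]
    return {cat: correct[cat] / total[cat] for cat in total}

print(per_category_accuracy(qa_pairs))
# → {'knowledge': 1.0, 'analysis': 0.0, 'design': 1.0, 'implementation': 0.0}
```

Breaking accuracy out per category, rather than reporting a single aggregate number, is what lets a benchmark like this surface the gap between knowledge recall and higher-order design and implementation skills.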

📝 Abstract
The field of computer architecture, which bridges high-level software abstractions and low-level hardware implementations, remains absent from current large language model (LLM) evaluations. To this end, we present QuArch (pronounced 'quark'), the first benchmark designed to facilitate the development and evaluation of LLM knowledge and reasoning capabilities specifically in computer architecture. QuArch provides a comprehensive collection of 2,671 expert-validated question-answer (QA) pairs covering various aspects of computer architecture, including processor design, memory systems, and interconnection networks. Our evaluation reveals that while frontier models possess domain-specific knowledge, they struggle with skills that require higher-order thinking in computer architecture. Frontier model accuracies vary widely (from 34% to 72%) on these advanced questions, highlighting persistent gaps in architectural reasoning across analysis, design, and implementation QAs. By holistically assessing fundamental skills, QuArch provides a foundation for building and measuring LLM capabilities that can accelerate innovation in computing systems. With over 140 contributors from 40 institutions, this benchmark represents a community effort to set the standard for architectural reasoning in LLM evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM reasoning in computer architecture domain
Assesses knowledge gaps in processor and memory systems
Measures higher-order thinking skills for hardware design
Innovation

Methods, ideas, or system contributions that make the work stand out.

QuArch benchmark for LLM computer architecture reasoning
Comprehensive expert-validated question-answer pairs collection
Holistically assesses fundamental architectural reasoning skills
Authors

Shvetank Prakash (Harvard University): Ultra-Low Power ML Systems, Computer Architecture, ML for Systems
Andrew Cheng (Harvard University)
Arya Tschand (Harvard University)
Mark Mazumder (Harvard University)
Varun Gohil (Massachusetts Institute of Technology)
Jeffrey Ma (Computer Science PhD, Harvard University): Machine Learning + Systems, Deep Learning, Large Language Models, LLM4Code
Jason Yik (Harvard University)
Zishen Wan (Ph.D. Student, Georgia Tech): Computer Architecture, VLSI, Autonomous Agents, Neurosymbolic AI, Reliability
Jessica Quaye (Harvard University)
Elisavet Lydia Alvanaki (Columbia University)
Avinash Kumar (Research Assistant, Soongsil University, Seoul, South Korea): Machine Learning, Deep Learning, Computer Vision, GANs
Chandrashis Mazumdar (UC Santa Cruz)
Tuhin Khare (Master's Student, Georgia Tech; previously Research Associate, Indian Institute of Science): distributed systems, serverless computing, cloud optimization, big data, quantum computing
Alexander Ingare (Harvard University)
Ikechukwu Uchendu (Harvard University): machine learning, artificial intelligence, reinforcement learning, robotics
Radhika Ghosal (Harvard University)
Abhishek Tyagi (University of Rochester)
Chenyu Wang (Harvard University)
Andrea Mattia Garavagno (PhD student): Embedded Systems, Computer Architecture, Machine Learning
Sarah Gu (Harvard University)
Alice Guo (Harvard University)
Grace Hur (Harvard University)
Luca Carloni (Columbia University)
Tushar Krishna (Associate Professor, Georgia Tech): Computer Architecture, Interconnection Networks, Network-on-Chip, Deep Learning Accelerators
Ankita Nayak (Stanford University, Qualcomm AI Research): On-device ML, ML Accelerators, Energy Efficiency