🤖 AI Summary
To address the lack of high-quality, domain-specific evaluation benchmarks for large language models (LLMs) in computer architecture understanding, this paper introduces QuArch—the first fine-grained, human-expert-annotated, and multi-round-validated question-answering dataset tailored to architecture. Comprising 1,500 QA pairs, QuArch covers core topics including processor design, memory systems, and performance optimization. Methodologically, it employs a rigorous human-in-the-loop annotation protocol and adopts standard QA accuracy as the primary evaluation metric, supporting both supervised fine-tuning and zero-/few-shot assessment. Key contributions include: (1) establishing the first human-verified, architecture-specific QA benchmark; (2) revealing a substantial capability gap between the best closed-source model and the top small open-source model (84% vs. 72% accuracy), with particular weaknesses in memory systems, interconnection networks, and benchmarking; and (3) demonstrating that supervised fine-tuning improves small-model accuracy by up to 8 percentage points. The dataset and an online evaluation platform are publicly released.
📝 Abstract
We introduce QuArch, a dataset of 1,500 human-validated question-answer pairs designed to evaluate and enhance language models' understanding of computer architecture. The dataset covers areas including processor design, memory systems, and performance optimization. Our analysis highlights a significant performance gap: the best closed-source model achieves 84% accuracy, while the top small open-source model reaches 72%. Models struggle most with memory systems, interconnection networks, and benchmarking. Fine-tuning with QuArch improves small-model accuracy by up to 8%, establishing a foundation for advancing AI-driven computer architecture research. The dataset and leaderboard are at https://harvard-edge.github.io/QuArch/.