Empowering Vector Architectures for ML: The CAMP Architecture for Matrix Multiplication

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Contemporary vector architectures (VAs/SIMD) suffer from low throughput and poor energy efficiency when executing low-precision quantized matrix multiplication, severely hindering the deployment of quantized neural networks (QNNs) on edge devices. To address this, we propose the Cartesian Accumulation Matrix Pipeline (CAMP), a novel microarchitecture featuring a mixed-precision multiplier-driven tiled accumulation pipeline, tightly integrated with hardware-level quantized dataflow scheduling and software–hardware co-optimization, natively supporting both the ARMv8 SVE and RISC-V SIMD instruction sets. Implemented in 7 nm and 22 nm technologies, CAMP incurs only 1% and 4% area overhead, respectively. On LLM and CNN workloads, it achieves up to 17× and 23× speedups over the ARM A64FX and RISC-V-based edge SoC baselines, respectively, while simultaneously delivering high throughput, low power consumption, and minimal silicon area.

📝 Abstract
This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions are increasingly popular for more efficient operations. Unfortunately, existing VAs and SIMD-support units struggle to efficiently handle these quantized formats. In this work, we propose CAMP, a simple yet effective architecture that leverages a hybrid multiplier. The CAMP architecture significantly advances the performance of vector architectures in handling quantized data, enabling more efficient execution of matrix multiplication across various platforms, specifically targeting the ARMv8 Scalable Vector Extension (SVE) and edge RISC-V SIMD-based architectures. In addition to increasing throughput, CAMP's architectural design also contributes to energy efficiency, making it an effective solution for low-power applications. Evaluations on a range of Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) demonstrate that matrix multiplication operations using the proposed micro-architecture achieve up to 17× and 23× performance improvements compared to their respective baselines, the ARM A64FX core and a RISC-V-based edge System-on-Chip (SoC). Furthermore, synthesis and place-and-route (PnR) of the CAMP micro-architecture using Synopsys tools, targeting ARM TSMC 7 nm for A64FX and GlobalFoundries 22 nm for the RISC-V SoC, add only 1% and 4% area overhead, respectively, compared to the baseline designs.
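The operation the paper accelerates, quantized matrix multiplication, can be sketched in plain Python. The key pattern is that inputs stay in a low-precision integer format (e.g. int8) while partial products are accumulated in a wider type to avoid overflow; hardware would typically use 32-bit accumulators. This is an illustrative sketch of the general QNN GEMM pattern, not CAMP's actual microarchitecture, and the function name is hypothetical.

```python
# Illustrative sketch: low-precision (int8-style) matrix multiply with
# wide accumulation. Shows the general quantized GEMM pattern targeted
# by architectures like CAMP; not the paper's implementation.

def quantized_matmul(A, B):
    """Multiply matrix A (n×k) by matrix B (k×m), given as row-major
    lists of small integers (int8 range). Partial products are summed
    in a full-width accumulator, mirroring the int8×int8 → int32
    accumulate step common in QNN hardware."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0  # wide accumulator (would be 32-bit in hardware)
            for p in range(k):
                acc += A[i][p] * B[p][j]  # int8 × int8 partial product
            C[i][j] = acc
    return C

A = [[1, -2], [3, 4]]
B = [[5, 6], [7, -8]]
print(quantized_matmul(A, B))  # [[-9, 22], [43, -14]]
```

Because each int8×int8 product fits in 16 bits but sums of many such products do not, the width of the accumulator, rather than the multiplier, is what conventional SIMD units handle inefficiently for these formats.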
Problem

Research questions and friction points this paper is trying to address.

Enhancing matrix multiplication in Vector Architectures and SIMD units
Improving efficiency of Quantized Neural Networks (QNNs) processing
Boosting performance and energy efficiency for low-power applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

CAMP architecture enhances matrix multiplication efficiency
Hybrid multiplier improves quantized data handling
Energy-efficient design for low-power applications
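One common way vector and SIMD units tile matrix multiplication is as a sum of outer products (rank-1 updates), which accumulates a full output tile per step. The sketch below illustrates that general dataflow pattern; it is an assumption for illustration only, not a description of CAMP's exact accumulation pipeline, and the function name is hypothetical.

```python
# Sketch of outer-product (rank-1 update) accumulation, a common tiling
# strategy for matrix multiply on vector/SIMD hardware. Illustrative
# only; not CAMP's actual dataflow.

def outer_product_matmul(A, B):
    """Compute A @ B by accumulating one rank-1 update (outer product
    of a column of A with a row of B) per step of the shared dimension."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    for p in range(k):                      # one rank-1 update per step
        col = [A[i][p] for i in range(n)]   # p-th column of A
        row = B[p]                          # p-th row of B
        for i in range(n):
            for j in range(m):
                C[i][j] += col[i] * row[j]  # accumulate outer product
    return C

A = [[1, -2], [3, 4]]
B = [[5, 6], [7, -8]]
print(outer_product_matmul(A, B))  # [[-9, 22], [43, -14]]
```

The appeal of this formulation for vector hardware is data reuse: each loaded column/row pair updates the entire output tile, so accumulators stay resident while operands stream through.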
Mohammadreza Esmali Nojehdeh
Barcelona Supercomputing Center
Hossein Mokhtarnia
Barcelona Supercomputing Center
Julian Pavon Rivera
Barcelona Supercomputing Center
Narcis Rodas Quiroga
Barcelona Supercomputing Center
Roger Figueras Bagué
Barcelona Supercomputing Center
Enrico Reggiani
Barcelona Supercomputing Center
Hardware Architectures
Miquel Moretó
Barcelona Supercomputing Center
Osman S. Unsal
Barcelona Supercomputing Center
Computer architecture, reliability, transactional memory, big data
A. Cristal
Barcelona Supercomputing Center
Eduard Ayguade
Universitat Politecnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC)
High-performance computing, programming models, computer architecture