🤖 AI Summary
This work addresses the inefficiency of Microscaling (MX) formats on vector processors, where block-wise scaling and multi-step mixed-precision operations disrupt pipeline regularity and leave compute resources underutilized. To overcome this, the authors propose VMXDOTP, an extension to the RISC-V Vector 1.0 ISA that natively supports MX-format dot-product operations with MXFP8/MXFP4 inputs, FP32/BF16 accumulation, and software-defined block sizes. Integrated into a shared-L1-memory cluster of vector processing elements (VPEs), the extension restores computational regularity, reaching up to 97% utilization on MX matrix multiplication. Implemented in 12 nm FinFET technology, the design achieves up to 125 MXFP8-GFLOPS and 250 MXFP4-GFLOPS at 843/1632 MXFP8/MXFP4-GFLOPS/W, delivers up to 7.0× speedup and 4.9× higher energy efficiency than software-emulated MXFP8 matrix multiplication, and incurs only a 7.2% area overhead. Compared with prior MX engines, VMXDOTP uniquely supports variable block sizes while being up to 1.4× more area-efficient and up to 2.1× more energy-efficient.
📝 Abstract
Compared to the first generation of deep neural networks, dominated by regular, compute-intensive kernels such as matrix multiplications (MatMuls) and convolutions, modern decoder-based transformers interleave attention, normalization, and data-dependent control flow. This demands flexible accelerators, a requirement met by scalable, highly energy-efficient shared-L1-memory vector processing element (VPE) clusters. Meanwhile, the ever-growing size and bandwidth needs of state-of-the-art models make reduced-precision formats increasingly attractive. Microscaling (MX) data formats, based on block floating-point (BFP) representations, have emerged as a promising solution to reduce data volumes while preserving accuracy. However, MX semantics are poorly aligned with vector execution: block scaling and multi-step mixed-precision operations break the regularity of vector pipelines, leading to underutilized compute resources and performance degradation. To address these challenges, we propose VMXDOTP, a RISC-V Vector (RVV) 1.0 instruction set architecture (ISA) extension for efficient MX dot product execution, supporting MXFP8 and MXFP4 inputs, FP32 and BF16 accumulation, and software-defined block sizes. A VMXDOTP-enhanced VPE cluster achieves up to 97% utilization on MX-MatMul. Implemented in 12 nm FinFET, it achieves up to 125 MXFP8-GFLOPS and 250 MXFP4-GFLOPS, with 843/1632 MXFP8/MXFP4-GFLOPS/W at 1 GHz, 0.8 V, and only 7.2% area overhead. Our design yields up to 7.0x speedup and 4.9x higher energy efficiency with respect to software-emulated MXFP8-MatMul. Compared with prior MX engines, VMXDOTP supports variable block sizes, is up to 1.4x more area-efficient, and delivers up to 2.1x higher energy efficiency.
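To make the MX semantics concrete, the sketch below emulates a block-scaled dot product in plain Python: each block of elements shares one power-of-two scale, elements are rounded to a coarse mantissa grid (a simplified stand-in for MXFP8/MXFP4 element encoding, not the exact OCP bit layout), and per-block partial sums are accumulated at high precision with a single scale multiply per block. The function and parameter names (`quantize_block`, `mx_dot`, `mant_bits`, `block_size`) are illustrative, not taken from the paper; the hardware extension executes this pattern natively rather than via such software emulation.

```python
import math

def quantize_block(block, mant_bits=3):
    """Quantize one block to an MX-style shared power-of-two scale plus
    low-precision elements. The mantissa grid is a simplified stand-in
    for MXFP8/MXFP4 element encoding (illustrative, not the OCP layout)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: power of two chosen from the block's largest magnitude.
    scale = 2.0 ** math.floor(math.log2(amax))
    # Round each normalized element onto a coarse mantissa grid.
    step = 2.0 ** -mant_bits
    elems = [round((v / scale) / step) * step for v in block]
    return scale, elems

def mx_dot(a, b, block_size=32):
    """Block-wise MX dot product with high-precision accumulation:
    per block, the low-precision element products are summed first,
    then multiplied once by the two shared block scales."""
    assert len(a) == len(b)
    acc = 0.0  # accumulator kept at full precision (FP32/BF16 in hardware)
    for i in range(0, len(a), block_size):
        sa, ea = quantize_block(a[i:i + block_size])
        sb, eb = quantize_block(b[i:i + block_size])
        partial = sum(x * y for x, y in zip(ea, eb))
        acc += sa * sb * partial  # one scale multiply per block
    return acc
```

Because the scales are powers of two, the per-block scale multiply is cheap in hardware, and the software-defined `block_size` mirrors the variable block sizes the extension exposes; the low utilization of doing these steps with ordinary vector instructions is exactly what motivates a fused dot-product instruction.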