🤖 AI Summary
Conventional vector architectures for wireless baseband processing suffer from limited vector register capacity, reliance on power-of-two vector length multipliers, and architecture-specific permutation support. To address these limitations, this paper proposes Unlimited Vector Processing (UVP), an instruction set extension based on RISC-V. UVP introduces a programming model that supports non-power-of-two register grouping and hardware strip-mining; divides vector instructions into symmetric and asymmetric classes with specialized load/store strategies; and pairs a robust permutation engine with pipelines optimized for fixed-point multiplication and division. In evaluation, UVP achieves up to 3.0× and 2.1× speedups over lane-based vector architectures for matrix multiplication and FFT, respectively. The synthesized RTL for a 16-lane configuration in SMIC 40 nm technology occupies 0.94 mm² and achieves an area efficiency of 21.2 GOPS/mm².
📝 Abstract
Wireless baseband processing (WBP) serves as an ideal scenario for utilizing vector processing, which excels in managing data-parallel operations due to its parallel structure. However, conventional vector architectures face certain constraints such as limited vector register sizes, reliance on power-of-two vector length multipliers, and vector permutation capabilities tied to specific architectures. To address these challenges, we have introduced an instruction set extension (ISE) based on RISC-V known as unlimited vector processing (UVP). This extension enhances both the flexibility and efficiency of vector computations. UVP employs a novel programming model that supports non-power-of-two register groupings and hardware strip-mining, thus enabling smooth handling of vectors of varying lengths while reducing the software strip-mining burden. Vector instructions are categorized into symmetric and asymmetric classes, complemented by specialized load/store strategies to optimize execution. Moreover, we present a hardware implementation of UVP featuring sophisticated hazard detection mechanisms, optimized pipelines for symmetric tasks such as fixed-point multiplication and division, and a robust permutation engine for effective asymmetric operations. Comprehensive evaluations demonstrate that UVP significantly enhances performance, achieving up to 3.0$\times$ and 2.1$\times$ speedups in matrix multiplication and fast Fourier transform (FFT) tasks, respectively, when measured against lane-based vector architectures. Our synthesized RTL for a 16-lane configuration using SMIC 40 nm technology spans 0.94 mm$^2$ and achieves an area efficiency of 21.2 GOPS/mm$^2$.
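For readers unfamiliar with the term, the software strip-mining burden that the abstract refers to can be illustrated with a minimal C sketch. It is not UVP code and is not taken from the paper: `VLMAX`, `vec_add_chunk`, and `strip_mined_add` are hypothetical names, and the scalar inner loop merely stands in for a single vector instruction. The sketch shows only the conventional pattern in which software splits a long vector into chunks that fit the hardware vector length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical maximum hardware vector length in elements (assumption,
 * not a figure from the paper). */
#define VLMAX 16

/* Stand-in for one vector add instruction operating on vl <= VLMAX elements. */
static void vec_add_chunk(const int16_t *a, const int16_t *b,
                          int16_t *out, size_t vl)
{
    assert(vl <= VLMAX);
    for (size_t i = 0; i < vl; ++i)
        out[i] = (int16_t)(a[i] + b[i]);
}

/* Software strip-mining: the program itself loops over the data in chunks
 * of at most VLMAX elements, including a shorter tail chunk when n is not
 * a multiple of VLMAX. Returns the number of chunks issued. */
size_t strip_mined_add(const int16_t *a, const int16_t *b,
                       int16_t *out, size_t n)
{
    size_t chunks = 0;
    for (size_t done = 0; done < n; ) {
        size_t remaining = n - done;
        size_t vl = remaining < VLMAX ? remaining : VLMAX;
        vec_add_chunk(a + done, b + done, out + done, vl);
        done += vl;
        ++chunks;
    }
    return chunks;
}
```

With `n = 37` and `VLMAX = 16`, this loop issues three chunks of 16, 16, and 5 elements. According to the abstract, hardware strip-mining in UVP is intended to reduce this kind of loop bookkeeping in software; how the hardware does so is not described here.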