🤖 AI Summary
Existing FPGA floating-point units incur high area overhead, share arithmetic resources poorly, and offer limited throughput when supporting trans-precision dot-product accumulation (e.g., FP8→FP32). This work proposes TransDot, a reconfigurable floating-point unit that integrates multi-precision SIMD FMA and trans-precision dot-product accumulation (FP16/FP8/FP4→FP32) into a unified architecture. By leveraging shared datapaths and reconfigurable submodules, TransDot reuses arithmetic resources to perform dot-product accumulation of 2×FP16, 4×FP8, or 8×FP4 terms into FP32. Compared to the FPnew baseline, TransDot achieves 1.46× and 2.92× higher area efficiency for FP16 and FP8 dot-product accumulation, respectively, and delivers 2× (FP16), 4× (FP8), and 8× (FP4) throughput, at the cost of a 37.3% larger area on average and one additional pipeline stage in dot-product mode.
📝 Abstract
Commercial FPGAs, such as AMD Versal devices, increasingly incorporate AI engines that exploit low-precision packed-SIMD fused multiply-accumulate (FMA) to achieve proportional throughput gains. However, trans-precision FMA (e.g., multiplying two FP16 numbers and adding their result to an FP32 accumulator), which preserves numerical stability by accumulating in higher precision, remains bottlenecked by the highest-precision, lowest-throughput operation. Dot-product accumulation (DPA) (e.g., performing a dot-product on two 4-element FP8 vectors and adding its result to an FP32 accumulator) can fully utilize the input/output bandwidth and computational resources. Existing flexible open-source FPUs, such as FPnew, do not support DPA and implement SIMD FMA on low-precision formats by replicating independent FMA lanes, which increases area, underutilizes shared arithmetic resources, and complicates the integration of DPA operations. This paper presents TransDot, a reconfigurable FPU that unifies multi-precision SIMD FMA and trans-precision DPA within a shared, reconfigurable datapath. TransDot extends the baseline design with 2-term FP16, 4-term FP8, and 8-term FP4 dot-product accumulation into FP32 using reconfigurable subcomponents. Evaluation shows that TransDot delivers 2$\times$ FP16, 4$\times$ FP8, and 8$\times$ FP4 throughput via DPA with FP32 accumulation, and 1.46$\times$ area efficiency in FP16 DPA and 2.92$\times$ area efficiency in FP8 DPA, at the cost of 37.3% larger area on average and an additional pipeline stage in dot-product mode compared to the FPnew baseline. These results demonstrate that TransDot's area-efficient design enables scalable deployment in next-generation AMD Versal AI engines.
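As a rough illustration of what trans-precision DPA computes, the following Python sketch emulates a 2-term FP16 dot product accumulated into FP32 using only the standard library. It is a behavioural model under stated assumptions, not TransDot's datapath: the helper names (`to_fp16`, `to_fp32`, `dpa_fp16_to_fp32`) are hypothetical, and the single rounding to FP32 at the end is an assumption, since the abstract does not specify the rounding behaviour.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dpa_fp16_to_fp32(a, b, acc):
    """Hypothetical model of a 2-term FP16 dot product added to an FP32 accumulator.

    Assumption: the products and the sum are formed in Python's double
    precision and rounded to FP32 once at the end. Real hardware may
    round differently.
    """
    if len(a) != 2 or len(b) != 2:
        raise ValueError("this sketch models the 2-term FP16 case only")
    a16 = [to_fp16(v) for v in a]  # inputs are quantized to FP16
    b16 = [to_fp16(v) for v in b]
    dot = sum(x * y for x, y in zip(a16, b16))
    return to_fp32(to_fp32(acc) + dot)


# Accumulating in FP32 keeps a small update that an FP16 accumulator drops:
# FP16 values in [2048, 4096) are spaced 2 apart, so 2049 is not representable.
print(dpa_fp16_to_fp32([1.0, 1.0], [0.5, 0.5], 2048.0))  # 2049.0
print(to_fp16(2048.0 + 1.0))                             # 2048.0
```

The two printed values show why the accumulator is kept in FP32: the same dot-product result of 1.0 is preserved in the FP32 accumulator but rounded away when the running sum is stored in FP16. One such operation also consumes two FP16 input pairs at once, which matches the 2× FP16 throughput figure quoted for DPA.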