🤖 AI Summary
Existing processors for robotic edge continual learning (e.g., Dacapo) support only a single data type (MXINT) and rely on inefficient vector grouping during backpropagation, hindering simultaneous optimization of accuracy, energy efficiency, and throughput.
Method: This work introduces the first scalable hardware architecture to support all six Microscaling (MX) formats. It features a precision-scalable arithmetic unit and a novel square shared-exponent grouping mechanism, enabling unified integer–floating-point hybrid computation while eliminating memory redundancy and quantization overhead in backpropagation.
Contribution/Results: Implemented in TSMC 16nm FinFET at 500 MHz, the design reduces chip area by 25.6% and on-chip memory footprint by 51% versus Dacapo, and achieves 4× higher effective training throughput at comparable energy efficiency.
📝 Abstract
Autonomous robots require efficient on-device learning to adapt to new environments without cloud dependency. For this edge training, Microscaling (MX) data types offer a promising solution by combining integer and floating-point representations with shared exponents, reducing energy consumption while maintaining accuracy. However, Dacapo, the state-of-the-art continual-learning processor, is limited by its MXINT-only support and inefficient vector-based grouping during backpropagation. In this paper, we present, to the best of our knowledge, the first work that addresses these limitations with two key innovations: (1) a precision-scalable arithmetic unit that supports all six MX data types by exploiting sub-word parallelism and unified integer and floating-point processing; and (2) support for square shared-exponent groups to enable efficient weight handling during backpropagation, removing storage redundancy and quantization overhead. We evaluate our design against Dacapo under iso-peak-throughput on four robotics workloads in TSMC 16nm FinFET technology at 500 MHz, reaching a 25.6% area reduction, a 51% lower memory footprint, and 4× higher effective training throughput at comparable energy efficiency, enabling efficient robotics continual learning at the edge.
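To make the shared-exponent idea concrete, below is a minimal NumPy sketch of MXINT8-style block quantization with a single power-of-two scale per group, and of square (g×g) weight groups. It is an illustration under stated assumptions, not the paper's implementation: the function names, the 4×4 group size, and the exponent-selection rule are our own choices. The point of the square groups is visible in the last property: quantizing the transposed matrix yields exactly the transposed quantized blocks and shared exponents, so the backward pass can reuse the forward-pass weights without re-grouping or requantizing.

```python
import numpy as np

def mx_quantize_block(block, elem_bits=8):
    """Quantize one group to INT8 elements with a single shared
    power-of-two scale (E8M0-style). Illustrative, not the paper's exact rule."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros(block.shape, dtype=np.int8), 0
    qmax = 2 ** (elem_bits - 1) - 1                      # 127 for INT8
    # Pick the exponent so amax / 2**shared_exp stays below 2**(elem_bits-1).
    shared_exp = int(np.floor(np.log2(amax))) - (elem_bits - 2)
    q = np.clip(np.round(block / 2.0 ** shared_exp), -qmax, qmax)
    return q.astype(np.int8), shared_exp

def mx_dequantize_block(q, shared_exp):
    """Recover approximate float values: INT8 elements times the shared scale."""
    return q.astype(np.float32) * np.float32(2.0 ** shared_exp)

def quantize_square_groups(W, g=4):
    """Tile a weight matrix into square g x g groups, each with its own
    shared exponent. Square tiles are symmetric under transposition, so the
    same quantized weights serve both the forward and backward pass."""
    H, Wd = W.shape
    assert H % g == 0 and Wd % g == 0, "matrix must tile evenly"
    Q = np.zeros((H, Wd), dtype=np.int8)
    exps = np.zeros((H // g, Wd // g), dtype=np.int32)
    for i in range(0, H, g):
        for j in range(0, Wd, g):
            Q[i:i+g, j:j+g], exps[i//g, j//g] = mx_quantize_block(W[i:i+g, j:j+g])
    return Q, exps
```

With vector (row-wise) groups, transposing the matrix mixes elements from different groups, forcing a second quantized copy of the weights for backpropagation; with square groups, `quantize_square_groups(W.T)` is just the transpose of `quantize_square_groups(W)`, which is the storage redundancy the abstract refers to removing.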