🤖 AI Summary
This work addresses the longstanding challenge in analog in-memory computing (AIMC) of simultaneously enabling on-chip training, long-term weight retention, and high-efficiency inference. The authors present a fully functional analog AI accelerator based on back-end-of-line (BEOL)-integrated CMO/HfOₓ 1T1R ReRAM devices, realizing for the first time a hardware platform that unifies training, inference, and nonvolatile storage. The device achieves more than 32 stable multilevel states (5-bit precision), record-low programming noise down to 10 nS (nanosiemens), and weight-mapping fidelity more than an order of magnitude better than state-of-the-art memristive technologies. Circuit-level simulations of a 64×64 crossbar show the matrix-vector multiplication root-mean-square error (RMSE) held between 0.06 and 0.2, relative to the ideal floating-point result, over retention times from 1 second to 10 years. On-chip training reaches software-equivalent accuracy across multiple benchmarks, demonstrating the feasibility of hardware-in-the-loop learning. Together, these results establish a scalable, low-power foundation for autonomous, continuously adaptive AI hardware.
📝 Abstract
Analog in-memory computing is an emerging paradigm designed to efficiently accelerate deep neural network workloads. Recent advancements have demonstrated significant improvements in throughput and efficiency, focusing independently on either inference or training acceleration. However, a unified analog in-memory technology platform, capable of performing on-chip training, retaining the weights, and sustaining long-term inference acceleration, has yet to be reported. In this work, an all-in-one analog AI accelerator is presented and benchmarked, combining these capabilities to enable autonomous, energy-efficient, and continuously adaptable AI systems. The platform leverages an array of filamentary conductive-metal-oxide (CMO)/HfOₓ redox-based resistive switching memory cells (ReRAM) in a one-transistor one-ReRAM (1T1R) configuration, integrated into the back-end-of-line (BEOL) of a 130 nm technology node. The array characterization demonstrates reliable and optimized resistive switching at voltage amplitudes below 1.5 V, enabling compatibility with advanced technology nodes. A multi-bit capability of over 32 stable states (5 bits) and record-low programming noise down to 10 nS (nanosiemens) enable a nearly ideal weight-transfer process, more than an order of magnitude better than other memristive technologies. The array's inference performance is validated through realistic matrix-vector multiplication simulations on a 64×64 array, achieving a record-low root-mean-square error ranging from 0.06 at 1 second to 0.2 at 10 years after programming, compared to the ideal floating-point case. The array is then measured under the conditions used for on-chip training: training accuracy closely matching the software equivalent is achieved across different datasets, with high-fidelity modelling of the device response based on experimental-only data.
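To make the RMSE benchmark concrete, the sketch below shows how a matrix-vector multiplication error against the floating-point ideal is typically computed for a 64×64 analog crossbar: weights are quantized to 32 conductance levels (5 bits) and perturbed by additive Gaussian programming noise. The noise magnitude `sigma` and the uniform weight/input ranges are illustrative assumptions, not the paper's calibrated device model.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64
W = rng.uniform(-1, 1, size=(N, N))  # target weights, normalized to [-1, 1]
x = rng.uniform(-1, 1, size=N)       # input vector

# Quantize weights to 32 stable states (5-bit multilevel capability)
levels = 32
W_q = np.round((W + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

# Additive Gaussian programming noise (sigma is an assumed value,
# standing in for the device's measured conductance noise)
sigma = 0.01
W_analog = W_q + rng.normal(0.0, sigma, size=W.shape)

# Compare analog MVM against the ideal floating-point result
y_ideal = W @ x
y_analog = W_analog @ x
rmse = np.sqrt(np.mean((y_analog - y_ideal) ** 2))
print(f"MVM RMSE vs floating point: {rmse:.3f}")
```

Retention loss can be modeled in the same framework by letting `sigma` (or a conductance drift term) grow with time after programming, which is how an RMSE that degrades from 0.06 at 1 second to 0.2 at 10 years would arise.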