The Quest for Reliable AI Accelerators: Cross-Layer Evaluation and Design Optimization

📅 2025-10-21
🏛️ International Conference on ASIC
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reliability degradation in AI accelerators fabricated in nanoscale CMOS technologies, where aging effects and process variations compromise system dependability. Conventional guardbanding approaches rely on pessimistic timing margins and incur significant performance overheads, rendering them inadequate for high-performance AI computing. To overcome this limitation, the paper presents cross-layer reliability modeling and co-optimization spanning the device, circuit, architecture, and application levels, including a focus on large language models. It combines an aging- and variation-aware dynamic timing analyzer, dataflow optimization based on critical input pattern reduction, and resilience characterization with architecture design for LLMs to address the traditional trade-off between reliability and computational efficiency. The resulting designs aim to deliver AI acceleration that is both reliable and efficient.

📝 Abstract
As CMOS technology scales into the nanoscale regime, aging effects and process variations have become increasingly pronounced, posing significant reliability challenges for AI accelerators. Traditional guardband-based design approaches, which rely on pessimistic timing margins, sacrifice significant performance and computational efficiency, rendering them inadequate for the demands of high-performance AI computing. Current reliability-aware AI accelerator design faces two core challenges: (1) the lack of systematic cross-layer analysis tools that capture coupled reliability effects across the device, circuit, architecture, and application layers; and (2) the fundamental trade-off between conventional reliability optimization and computational efficiency. To address these challenges, this paper presents a series of reliability-aware accelerator designs, encompassing (1) an aging- and variation-aware dynamic timing analyzer, (2) accelerator dataflow optimization using critical input pattern reduction, and (3) resilience characterization and novel architecture design for large language models (LLMs). By tightly integrating cross-layer reliability modeling with AI workload characteristics, these co-optimization approaches achieve reliable and efficient AI acceleration.
Problem

Research questions and friction points this paper is trying to address.

AI accelerators
reliability
process variations
aging effects
cross-layer analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-layer reliability
AI accelerator
aging-aware design
dataflow optimization
LLM resilience