🤖 AI Summary
The rapid scaling of large-model accelerators has made conventional fault assessment computationally expensive and poor at covering critical faults. To address this, we propose RIFT, a reinforcement learning-based intelligent fault-targeting framework that formulates fault search as a sequential decision-making problem. RIFT combines hybrid sensitivity analysis, which prunes the search space, with the Proximal Policy Optimization (PPO) algorithm to generate minimal, high-impact test suites. It produces UVM-compliant test cases and supplies the data needed to guide selective error-correcting code (ECC) protection. RIFT achieves a 2.2× speedup in fault assessment over evolutionary methods, reduces test vector volume by over 99% compared to random fault injection while improving fault coverage, and makes selective ECC 12.8× more cost-effective (coverage per unit area) than uniform triple modular redundancy.
📝 Abstract
The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
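
The cost-effectiveness metric mentioned above (coverage per unit area) can be illustrated with a minimal Python sketch. The class name, field names, and all numeric values below are hypothetical placeholders chosen for illustration; they are not taken from the paper and do not reproduce its 12.8× result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionScheme:
    """Hypothetical description of a hardware protection scheme."""
    name: str
    fault_coverage: float  # fraction of critical faults mitigated, in [0, 1]
    area_overhead: float   # added area relative to the unprotected design


def cost_effectiveness(scheme: ProtectionScheme) -> float:
    """Coverage per unit area: higher means more protection per area spent."""
    if scheme.area_overhead <= 0:
        raise ValueError("area_overhead must be positive")
    return scheme.fault_coverage / scheme.area_overhead


def improvement(candidate: ProtectionScheme, baseline: ProtectionScheme) -> float:
    """Ratio of the candidate's cost-effectiveness to the baseline's."""
    return cost_effectiveness(candidate) / cost_effectiveness(baseline)


# Placeholder values only (assumptions, not measured results).
selective_ecc = ProtectionScheme("selective_ecc", fault_coverage=0.9, area_overhead=0.25)
uniform_tmr = ProtectionScheme("uniform_tmr", fault_coverage=1.0, area_overhead=2.0)

ratio = improvement(selective_ecc, uniform_tmr)
print(f"{selective_ecc.name} vs {uniform_tmr.name}: {ratio:.1f}x")
```

Under this definition, a selective scheme can score well even with slightly lower coverage, provided it protects only the most fault-sensitive components and therefore spends far less area.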