RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning

📅 2025-12-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The rapid scaling of large-model accelerators has made conventional fault assessment computationally expensive and poor at covering critical faults. To address this, we propose RIFT, a reinforcement learning-based intelligent fault-targeting framework that formulates fault search as a sequential decision-making problem. RIFT combines hybrid sensitivity analysis, which prunes the search space, with the Proximal Policy Optimization (PPO) algorithm to generate minimal, high-impact test suites. It produces UVM-compliant verification artifacts and supplies the data needed to guide selective error-correcting code protection. RIFT achieves a 2.2× fault assessment speedup over evolutionary methods, reduces test vector volume by over 99% compared to random fault injection while achieving superior fault coverage, and improves the cost-effectiveness of hardware protection (coverage per unit area) by 12.8× compared to uniform triple modular redundancy.

📝 Abstract

The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.
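
The abstract defines cost-effectiveness as coverage per unit area. A minimal Python sketch of that ratio follows. The function names and the example figures in the usage note are hypothetical illustrations, not values or code from the paper.

```python
def cost_effectiveness(coverage, area_overhead):
    """Coverage per unit area, as the metric is described in the abstract.

    coverage      -- fraction of faults covered, in [0, 1]
    area_overhead -- added area in any consistent unit (hypothetical)
    """
    if area_overhead <= 0:
        raise ValueError("area_overhead must be positive")
    return coverage / area_overhead


def relative_improvement(candidate, baseline):
    """Ratio of two cost-effectiveness values.

    Each argument is a (coverage, area_overhead) pair.
    """
    return cost_effectiveness(*candidate) / cost_effectiveness(*baseline)
```

For example, `relative_improvement((0.9, 0.25), (1.0, 2.0))` compares a made-up selective scheme that covers 90% of faults at 0.25 area units against a made-up uniform scheme that covers everything at 2.0 area units. These inputs are invented purely to show how the ratio is formed.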

Problem

Research questions and friction points this paper is trying to address.

Develops scalable fault assessment for large AI accelerators

Automates discovery of critical failure modes efficiently

Reduces test volume while maintaining superior fault coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning automates minimal high-impact fault discovery

Hybrid sensitivity analysis prunes search space for efficiency

Generates UVM-compliant verification artifacts for commercial workflows
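
The first two points can be illustrated with a toy Python sketch: sensitivity scores prune the candidate fault sites, and the remaining search is framed as an episode in which each step injects one more fault. This is an assumption-laden illustration and not the paper's implementation. The environment, the site names, the per-site reward, and the random policy are all hypothetical, and a PPO agent would take the place of `random_policy`.

```python
import random
from dataclasses import dataclass, field


def prune_by_sensitivity(sensitivity, keep):
    """Keep the `keep` fault sites with the highest sensitivity score."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    return ranked[:keep]


@dataclass
class FaultSearchEnv:
    """Toy episodic environment: each step injects one more fault site."""
    candidates: list   # pruned fault sites, i.e. the action space
    impact: dict       # hypothetical per-site impact used as reward
    budget: int        # maximum number of faults in one scenario
    injected: list = field(default_factory=list)

    def reset(self):
        self.injected = []
        return tuple(self.injected)

    def step(self, site):
        if site not in self.candidates or site in self.injected:
            raise ValueError(f"invalid fault site: {site}")
        self.injected.append(site)
        reward = self.impact[site]
        done = len(self.injected) >= self.budget
        return tuple(self.injected), reward, done


def random_policy(state, candidates):
    """Placeholder policy: pick any site not yet injected."""
    remaining = [s for s in candidates if s not in state]
    return random.choice(remaining)


def run_episode(env, policy):
    """Roll out one fault scenario and return it with its total reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state, env.candidates))
        total += reward
    return state, total
```

The budget caps how many faults a scenario may contain, which is one simple way to bias the search toward small scenarios. It must not exceed the number of pruned candidates, otherwise the placeholder policy runs out of sites to choose from.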
