LM-Fix: Lightweight Bit-Flip Detection and Rapid Recovery Framework for Language Models

2025-11-03
AI Summary
Existing integrity-checking methods for detecting inference errors caused by memory bit flips in large language models (LLMs) suffer from high computational overhead and significant latency. This paper proposes a lightweight framework for online detection and localized recovery. It runs short test vectors through the model to generate compact hash signatures, localizes faults quickly by comparing hashes, and then triggers block-level weight repair instead of reloading the full model. The core innovation is the co-design of hash-guided verification with localization-driven recovery. Across multiple mainstream LLMs, the framework detects over 94% of single-bit flips (at TVL=200) and nearly 100% of multi-bit flips, with only 1% to 7.7% inference overhead and recovery more than 100× faster than conventional model reloading.

๐Ÿ“ Abstract
This paper presents LM-Fix, a lightweight detection and rapid recovery framework for faults in large language models (LLMs). Existing integrity approaches are often heavy or slow for modern LLMs. LM-Fix runs a short test-vector pass and uses hash-guided checks to detect bit-flip faults, then repairs them locally without a full reload. Across multiple models, it detects over 94% of single-bit flips at TVL=200 and nearly 100% of multi-bit flips with approximately 1% to 7.7% runtime overhead; recovery is more than 100x faster than reloading. These results show a practical, low-overhead solution to keep LLMs reliable in production
Problem

Research questions and friction points this paper is trying to address.

Detects bit-flip faults in large language models using lightweight hash-guided checks
Repairs detected faults locally without requiring full model reloading
Achieves high detection rates with minimal runtime overhead for production reliability
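
To make the hash-guided check in the list above concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the LM-Fix implementation: the toy linear layers, the fixed test vector, the use of SHA-256 over per-layer activations, and all function names (`flip_bit`, `make_toy_layers`, `layer_signatures`, `first_mismatch`) are hypothetical, since this summary does not specify the hashing scheme or the granularity the authors use. The idea shown is only the general one: record signatures of a short test pass while the weights are known-good, rerun the pass later, and treat the first mismatching signature as the fault location.

```python
import hashlib
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 float32 encoding flipped."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def make_toy_layers(num_layers: int = 3, width: int = 4):
    """Hypothetical stand-in for model weights: small positive square matrices."""
    return [
        [[0.5 if i == j else 0.125 for j in range(width)] for i in range(width)]
        for _ in range(num_layers)
    ]


def layer_signatures(layers, test_vector):
    """Run the test vector forward and hash the activations after every layer."""
    signatures = []
    x = list(test_vector)
    for weights in layers:
        # Plain matrix-vector product; a real model would apply its own layer math.
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        digest = hashlib.sha256(struct.pack(f"<{len(x)}d", *x)).hexdigest()
        signatures.append(digest)
    return signatures


def first_mismatch(golden, observed):
    """Index of the first layer whose signature differs, or None if all match."""
    for index, (g, o) in enumerate(zip(golden, observed)):
        if g != o:
            return index
    return None


TEST_VECTOR = [1.0, 0.5, 0.25, 2.0]
layers = make_toy_layers()
golden = layer_signatures(layers, TEST_VECTOR)   # recorded while weights are known-good

layers[1][2][3] = flip_bit(layers[1][2][3], 30)  # simulate a memory bit flip
faulty_layer = first_mismatch(golden, layer_signatures(layers, TEST_VECTOR))
print("first layer with a signature mismatch:", faulty_layer)
```

Because a corrupted layer also perturbs everything downstream of it, the sketch reports the first mismatching layer rather than every mismatch. A flip only shows up if it changes the activations produced by the test vector, which is one plausible reason a short test pass can miss some single-bit flips.
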
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight bit-flip detection using hash-guided checks
Local repair of faults without full model reload
Rapid recovery with over 100x speed improvement
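
The local-repair idea in the list above can be sketched in a few lines of Python. This is a toy illustration under assumptions, not the paper's mechanism: the `REFERENCE` dictionary stands in for whatever trusted copy of the weights is available, the block names and sizes are invented, and `load_all` and `repair_block` are hypothetical helpers. It assumes the faulty block has already been identified and shows only why restoring one block touches far less data than a full reload.

```python
import struct


def pack_block(values):
    """Serialize one weight block as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_block(blob):
    """Deserialize float32 bytes back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Hypothetical trusted copy of the weights, keyed by block name.
REFERENCE = {
    f"block_{i}": pack_block([0.5, -0.25, 0.125, 1.5] * 64) for i in range(8)
}


def load_all(reference):
    """Full reload: rebuild every block and report how many bytes were read."""
    blocks = {name: unpack_block(blob) for name, blob in reference.items()}
    return blocks, sum(len(blob) for blob in reference.values())


def repair_block(blocks, reference, name):
    """Local repair: overwrite only the named block from the trusted copy."""
    blob = reference[name]
    blocks[name] = unpack_block(blob)
    return len(blob)


blocks, full_bytes = load_all(REFERENCE)
blocks["block_3"][10] = 4.2e37          # simulate a corrupted weight in one block
repaired_bytes = repair_block(blocks, REFERENCE, "block_3")

print("block restored:", blocks["block_3"] == unpack_block(REFERENCE["block_3"]))
print("bytes read, local vs full:", repaired_bytes, "vs", full_bytes)
```

In this toy setup the repair reads one block out of eight. The ratio here is an artifact of the invented block count and says nothing about the speedup reported in the paper, which depends on the real model size and storage path.
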
Authors

Ahmad Tahmasivand
Electrical and Computer Engineering Department, George Mason University, Fairfax, USA
Noureldin Zahran
Electrical and Computer Engineering Department, George Mason University, Fairfax, USA
Saba A. Al-Sayouri
The National Institutes of Health, Maryland, USA
Mohammed Fouda
Compumacy for Artificial Intelligence Solutions, Cairo, Egypt
Khaled N. Khasawneh
George Mason University
Security, Machine Learning, Computer Architecture