AI Summary
Existing integrity-checking methods for detecting inference errors in large language models (LLMs) caused by memory bit flips suffer from high computational overhead and significant latency. This paper proposes a lightweight, online detection and localized recovery framework: it generates compact hash signatures via forward execution of short test vectors, enabling rapid fault localization through hash comparison; it then triggers block-level weight repair, bypassing full-model reloading. The core innovation is the co-design of a hash-guided verification mechanism with localization-driven, block-level recovery. Experimental evaluation across multiple mainstream LLMs demonstrates detection rates of over 94% for single-bit flips and nearly 100% for multi-bit flips, with only 1% to 7.7% inference overhead and recovery more than 100× faster than conventional model reloading.
Abstract
This paper presents LM-Fix, a lightweight detection and rapid recovery framework for faults in large language models (LLMs). Existing integrity approaches are often too heavy or too slow for modern LLMs. LM-Fix runs a short test-vector pass and uses hash-guided checks to detect bit-flip faults, then repairs the affected weights locally without a full reload. Across multiple models, it detects over 94% of single-bit flips at a test-vector length (TVL) of 200 and nearly 100% of multi-bit flips, with approximately 1% to 7.7% runtime overhead; recovery is more than 100× faster than reloading the model. These results show a practical, low-overhead solution for keeping LLMs reliable in production.
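The detect-then-repair loop described above can be sketched conceptually: run a short test vector through the model, hash each block's activations, compare against golden signatures computed on known-good weights, and reload only the first block whose hash diverges. This is a minimal illustration under assumed simplifications (a stack of dense layers standing in for transformer blocks, SHA-256 over rounded activations as the signature), not the paper's actual implementation:

```python
import hashlib
import numpy as np

def forward_with_hashes(layers, x):
    """Run a short test vector through the model, hashing each block's output.

    Rounding before hashing is a simplifying assumption to keep the
    signature insensitive to benign floating-point noise.
    """
    hashes = []
    for w in layers:
        x = np.tanh(x @ w)  # stand-in for a transformer block
        hashes.append(hashlib.sha256(np.round(x, 6).tobytes()).hexdigest())
    return hashes

def detect_and_repair(layers, golden_layers, golden_hashes, test_vec):
    """Locate the first block whose hash diverges and repair only that block.

    A bit flip in block i corrupts the hashes of block i and everything
    downstream, so the first mismatch localizes the fault. Repair copies
    just that block's weights back, avoiding a full-model reload.
    """
    current = forward_with_hashes(layers, test_vec)
    for i, (cur, ref) in enumerate(zip(current, golden_hashes)):
        if cur != ref:
            layers[i] = golden_layers[i].copy()  # block-level repair
            return i
    return None  # all signatures match: no fault detected

# Demo: flip one bit in block 1's weights, then detect and repair it.
np.random.seed(0)
layers = [np.random.randn(8, 8) for _ in range(4)]
golden = [w.copy() for w in layers]
test_vec = np.random.randn(1, 8)
golden_hashes = forward_with_hashes(golden, test_vec)

layers[1].view(np.uint8).flat[7] ^= 0x40  # single-bit flip in block 1
faulty_block = detect_and_repair(layers, golden, golden_hashes, test_vec)
print(faulty_block)  # block index where the fault was localized
```

The golden hashes are tiny compared to the weights themselves, which is what keeps the online check cheap; only the repair step needs access to a pristine copy of the faulty block.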