DaiFu: In-Situ Crash Recovery for Deep Learning Systems

📅 2025-07-02
🤖 AI Summary
Deep learning systems crash frequently because of their complex software stacks; even minor programming errors or transient faults can bring down a run, wasting computational resources and slowing development. Existing checkpoint-and-restart mechanisms incur high recovery overhead and latency, so they cannot meet rapid-recovery requirements. This paper presents DaiFu, an in-situ recovery framework tailored for deep learning systems. It applies a lightweight source-code transformation that intercepts crashes and enables dynamic, runtime hot-updates of model code, hyperparameters, and other execution context, without relying on external storage. The core idea is to shift fault recovery from the conventional "restart-and-load" paradigm to "in-situ repair-and-resume." The evaluation reports a 1372× speedup in recovery time over the best baseline, with runtime overhead under 0.40%. The framework's effectiveness is validated across seven representative crash scenarios.

📝 Abstract
Deep learning (DL) systems have been widely adopted in many areas, and are becoming even more popular with the emergence of large language models. However, due to the complex software stacks involved in their development and execution, crashes are unavoidable and common. Crashes severely waste computing resources and hinder development productivity, so efficient crash recovery is crucial. Existing solutions, such as checkpoint-retry, are too heavyweight for fast recovery from crashes caused by minor programming errors or transient runtime errors. Therefore, we present DaiFu, an in-situ recovery framework for DL systems. Through a lightweight code transformation to a given DL system, DaiFu augments it to intercept crashes in situ and enables dynamic and instant updates to its program running context (e.g., code, configurations, and other data) for agile crash recovery. Our evaluation shows that DaiFu helps reduce the restore time for crash recovery, achieving a 1372x speedup compared with state-of-the-art solutions. Meanwhile, the overhead of DaiFu is negligible (under 0.40%). We also construct a benchmark spanning 7 distinct crash scenarios in DL systems, and show the effectiveness of DaiFu in diverse situations.
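
The in-situ idea described in the abstract can be illustrated with a minimal Python sketch. This is not DaiFu's implementation: the decorator `in_situ`, the `repair` callback, and the toy `train_step` are all hypothetical names introduced here, and the automatic `fix_config` handler stands in for whatever fix a developer would apply. The sketch only shows the general pattern of intercepting a crash inside the running process, updating the program context, and resuming without a restart.

```python
import functools

def in_situ(repair, max_attempts=3):
    """Hypothetical wrapper: on a crash, call `repair` and retry in the same process."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # give up and let the crash propagate
                    # The crash is intercepted here; in-memory state is still alive.
                    repair(exc, args, kwargs)
        return wrapper
    return decorate

# Toy "program context": a faulty hyperparameter and some accumulated training state.
config = {"batch_size": 0, "lr": 0.1}
state = {"step": 0, "loss_sum": 0.0}

def fix_config(exc, args, kwargs):
    # Stand-in for a developer's fix: correct the bad hyperparameter in place.
    if isinstance(exc, ZeroDivisionError):
        config["batch_size"] = 4

@in_situ(fix_config)
def train_step(samples):
    mean = sum(samples) / config["batch_size"]  # crashes while batch_size is 0
    state["loss_sum"] += mean * config["lr"]
    state["step"] += 1
    return mean

def run():
    for _ in range(3):
        train_step([1.0, 2.0, 3.0, 2.0])
    return state["step"], config["batch_size"]
```

Calling `run()` hits the crash on the first step, applies the fix, and completes all three steps with `state` intact. A checkpoint-retry approach would instead restart the process and reload saved state before retrying.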

Problem

Research questions and friction points this paper is trying to address.

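DL systems crash frequently, wasting computing resources and hindering development productivity
Existing recovery solutions such as checkpoint-retry are too heavyweight for crashes caused by minor programming errors or transient runtime errors
A lightweight, in-situ crash recovery mechanism for DL systems is needed, which DaiFu aims to provide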

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight code transformation that augments a DL system to intercept crashes in situ
Dynamic, instant updates to the running program context (code, configurations, and other data)
Negligible runtime overhead (under 0.40%)
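
As a rough illustration of what a dynamic code update can look like in Python, the sketch below swaps the compiled body of a live function so that every existing reference picks up the fix. The helper `hot_update` and the buggy `scale` function are hypothetical and are not taken from DaiFu; the sketch assumes a top-level function with no closure variables.

```python
def hot_update(func, new_source):
    """Hypothetical helper: replace the body of `func` with code from `new_source`.

    `new_source` must define a function with the same name as `func`.
    """
    namespace = {}
    # Compile the new definition against the original function's globals.
    exec(new_source, func.__globals__, namespace)
    new_func = namespace[func.__name__]
    # Swap the code object in place; existing references to `func` see the new body.
    func.__code__ = new_func.__code__
    func.__defaults__ = new_func.__defaults__

def scale(x):
    return x / 0  # deliberate bug: always raises ZeroDivisionError
```

For example, after `hot_update(scale, "def scale(x):\n    return x / 2\n")`, any object that already holds a reference to `scale` calls the corrected body, and the process never restarts.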