FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations

📅 2022-09-28
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
In edge computing, user mobility triggers frequent service migrations, yet existing reinforcement learning (RL) approaches, trained solely on simulated data, fail to robustly handle rare but catastrophic server failures, severely compromising reliability for latency-critical applications like autonomous driving. Method: We propose FIRE, a framework that trains an RL policy in a digital twin environment which explicitly models failure rarity. Its core algorithm, ImRE, is an importance-sampling-based Q-learning method and the first to quantify rare-event impact directly in value-function optimization. We further develop scalable deep variants, ImDQL and ImACRE, that jointly schedule services for users with heterogeneous risk preferences. Contribution/Results: We provide theoretical guarantees on convergence and optimality bounds. Experiments driven by real-world mobility traces demonstrate that our approach significantly reduces total scheduling cost, including latency, migration, failure-response, and backup overhead, under failure scenarios, outperforming standard RL and greedy baselines.
📝 Abstract
In edge computing, users' service profiles are migrated in response to user mobility. Reinforcement learning (RL) frameworks have been proposed to manage these migrations, often trained on simulated data. However, existing RL frameworks overlook occasional server failures, which, although rare, impact latency-sensitive applications like autonomous driving and real-time obstacle detection. Moreover, because these failures (rare events) are not adequately represented in historical training data, they pose a challenge for data-driven RL algorithms. As it is impractical to adjust failure frequency in real-world applications for training, we introduce FIRE, a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance-sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. FIRE considers delay, migration, failure, and backup placement costs across individual and shared service profiles. We prove ImRE's boundedness and convergence to optimality. Next, we introduce novel deep Q-learning (ImDQL) and actor-critic (ImACRE) versions of our algorithm to enhance scalability. We extend our framework to accommodate users with varying risk tolerances. Through trace-driven experiments, we show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
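The abstract does not give ImRE's exact update rule, so the following is only a minimal sketch of the general idea it describes: oversample rare server failures in a simulated (digital twin) environment, then correct the Q-learning update with an importance weight (the ratio of the true failure probability to the boosted one) so the learned values reflect the real, rare-event distribution. All names, probabilities, and costs here (`P_FAIL_TRUE`, `P_FAIL_SIM`, the toy migration costs, the two-state environment) are illustrative assumptions, not the paper's model.

```python
import random

# Hedged sketch, NOT the paper's ImRE algorithm: tabular Q-learning where
# failures occur far more often in the simulator than in reality, and each
# update is reweighted by the true-to-boosted likelihood ratio.

P_FAIL_TRUE = 0.001   # assumed real-world per-step failure probability
P_FAIL_SIM = 0.2      # boosted failure probability in the digital twin
ALPHA, GAMMA = 0.1, 0.95

Q = {}  # Q[(state, action)] -> value

def q(s, a):
    return Q.get((s, a), 0.0)

def step(action):
    """Toy simulator: returns (next_state, reward, failed) with a boosted
    failure rate. Staying on a server that fails incurs a large penalty."""
    failed = random.random() < P_FAIL_SIM
    cost = 1.0 if action == "migrate" else 0.0   # illustrative migration cost
    if failed and action == "stay":
        cost += 50.0                             # illustrative failure penalty
    return ("down" if failed else "up"), -cost, failed

def update(state, action):
    next_state, reward, failed = step(action)
    # Importance weight: down-weight oversampled failure transitions so the
    # value function matches the true (rare) failure distribution.
    if failed:
        w = P_FAIL_TRUE / P_FAIL_SIM
    else:
        w = (1 - P_FAIL_TRUE) / (1 - P_FAIL_SIM)
    best_next = max(q(next_state, a) for a in ("stay", "migrate"))
    td_error = reward + GAMMA * best_next - q(state, action)
    Q[(state, action)] = q(state, action) + ALPHA * w * td_error
    return next_state

random.seed(0)
state = "up"
for _ in range(5000):
    if state == "down":
        state = "up"  # service restored before the next decision
    action = random.choice(("stay", "migrate"))
    state = update(state, action)
```

Under the true failure rate assumed above, the expected failure penalty of staying (0.001 × 50) is far below the migration cost, so the reweighted learner should prefer "stay"; an unweighted learner trained on the boosted simulator would overpay for migrations.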
Problem

Research questions and friction points this paper is trying to address.

Addressing server failures in edge computing migrations
Adapting reinforcement learning to rare failure events
Optimizing service migration under latency constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital twin environment trains RL for edge migration
Importance sampling algorithm focuses on rare failures
Deep Q-learning variants enhance scalability and risk adaptation
Marie Siew
Faculty Early Career Award Fellow, Singapore University of Technology and Design
network optimization, edge computing, federated learning, human-in-the-loop, network economics
Shikhar Sharma
Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Kun Guo
Shanghai Key Laboratory of Multidimensional Information Processing, School of Communications and Electronics Engineering, East China Normal University, Shanghai 200241, China
Chao Xu
School of Information Engineering, Northwest A&F University, Yangling 712100, China
Tony Q. S. Quek
Information Systems and Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372
Carlee Joe-Wong
Robert E. Doherty Associate Professor, Carnegie Mellon University
Network economics, distributed learning, cloud and mobile computing