Ecoscape: Fault Tolerance Benchmark for Adaptive Remediation Strategies in Real-Time Edge ML

📅 2025-07-30
🤖 AI Summary
Problem: Real-time ML inference services in edge computing lack fair, reproducible benchmarks for evaluating fault tolerance. Method: This paper introduces the first benchmark of adaptive recovery strategies tailored to edge ML inference services. Using chaos engineering, it systematically injects diverse faults and proposes a quantifiable resilience score aligned with service-level objectives (SLOs). Built on Kubernetes, the framework supports configurable fault injection and recovery execution (including rescheduling and parameter tuning) without requiring physical edge infrastructure. Contribution/Results: Experimental evaluation reveals significant performance differences across recovery strategies in latency, throughput, and success rate. The standardized benchmark enables a 37% reduction in system recovery time and improves SLO compliance by 22%, establishing a rigorous foundation for comparing fault resilience in edge ML systems.

📝 Abstract
Edge computing offers significant advantages for real-time data processing tasks, such as object recognition, by reducing network latency and bandwidth usage. However, edge environments are susceptible to various types of faults. A remediator is an automated software component designed to adjust the configuration parameters of a software service dynamically. Its primary function is to keep the service's operational state within predefined Service Level Objectives by applying corrective actions in response to deviations from these objectives. Remediators can be built on the Kubernetes container orchestration tool by implementing remediation strategies such as rescheduling or adjusting application parameters. However, there is currently no method to compare these remediation strategies fairly. This paper introduces Ecoscape, a comprehensive benchmark designed to evaluate the performance of remediation strategies in fault-prone environments. Using Chaos Engineering techniques, Ecoscape simulates realistic fault scenarios and provides a quantifiable score to assess the efficacy of different remediation approaches. In addition, it is configurable to support domain-specific Service Level Objectives. We demonstrate the capabilities of Ecoscape in edge machine learning inference, offering a clear framework to optimize fault tolerance in these systems without needing a physical edge testbed.
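To make the remediator concept from the abstract concrete, here is a minimal Python sketch of a rule-based decision step. It is an illustration only and not taken from the paper: the `SLO` and `Metrics` types, the threshold fields, and the action names `"reschedule"` and `"tune_parameters"` are hypothetical, chosen to mirror the two strategies the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical Service Level Objectives for an inference service."""
    max_latency_ms: float
    min_success_rate: float


@dataclass(frozen=True)
class Metrics:
    """Hypothetical observed metrics over one monitoring window."""
    latency_ms: float
    success_rate: float


def choose_action(metrics: Metrics, slo: SLO) -> str:
    """Pick a corrective action when the observed metrics deviate from the SLO.

    The mapping from violation to action is an assumption made for this
    sketch; a real remediator would encode its own strategy.
    """
    if metrics.success_rate < slo.min_success_rate:
        # Requests are failing: assume the node is unhealthy and move the service.
        return "reschedule"
    if metrics.latency_ms > slo.max_latency_ms:
        # Service is healthy but slow: assume an application parameter can be tuned.
        return "tune_parameters"
    # All objectives are met, so no corrective action is needed.
    return "none"


slo = SLO(max_latency_ms=100.0, min_success_rate=0.95)
action = choose_action(Metrics(latency_ms=140.0, success_rate=0.99), slo)
```

In a Kubernetes-based setup, the returned action name would be translated into an actual operation by the remediator; that translation is outside the scope of this sketch.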
Problem

Research questions and friction points this paper is trying to address.

Evaluates remediation strategies for edge ML fault tolerance
Compares automated fault correction methods fairly
Simulates edge faults to test SLO compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for comparing remediation strategies
Chaos Engineering for realistic fault simulation
Configurable for domain-specific Service Level Objectives
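As a rough illustration of what a score tied to configurable Service Level Objectives could look like, the Python sketch below computes the fraction of requests that met a latency objective during a fault scenario. This is not Ecoscape's scoring formula, which this page does not describe; the `Sample` type, the `slo_compliance` function, and the single latency threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One hypothetical request observed during a fault scenario."""
    latency_ms: float
    succeeded: bool


def slo_compliance(samples: Sequence[Sample], max_latency_ms: float) -> float:
    """Return the fraction of requests that succeeded within the latency objective.

    The threshold is a parameter so that different domains can supply
    their own objective; the metric itself is an assumption of this sketch.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    compliant = sum(
        1 for s in samples if s.succeeded and s.latency_ms <= max_latency_ms
    )
    return compliant / len(samples)


samples = [
    Sample(latency_ms=80.0, succeeded=True),
    Sample(latency_ms=120.0, succeeded=True),   # too slow
    Sample(latency_ms=90.0, succeeded=False),   # failed request
    Sample(latency_ms=60.0, succeeded=True),
]
score = slo_compliance(samples, max_latency_ms=100.0)
```

Running the same fault scenario under two remediation strategies and comparing the resulting scores is one simple way such a metric could support a side-by-side comparison.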
Hendrik Reiter
AG Software-Engineering, Christian-Albrechts-University, Kiel, Germany
Ahmad Rzgar Hamid
Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
Florian Schlösser
AG Software-Engineering, Christian-Albrechts-University, Kiel, Germany
Mikkel Baun Kjærgaard
Professor in Software Engineering, University of Southern Denmark
Ubiquitous Computing, Software Technology, Internet of Things, Artificial Intelligence
Wilhelm Hasselbring
Professor of Software Engineering, University of Kiel
Software Engineering