🤖 AI Summary
This work addresses unreliable diagnosis and disruptive interventions in the computing continuum caused by grey failures, whose ambiguous, overlapping symptoms are often handled without causal awareness or under high epistemic uncertainty. To this end, we propose AURORA, a lightweight framework that deploys parallel micro-agents at the edge tier. By integrating the free-energy principle, do-calculus, and localized causal state-graphs, AURORA supports causal observability and counterfactual root-cause analysis within each fault's Markov blanket, and restricting inference to causally relevant variables reduces computational overhead. It further introduces a dual-gated execution mechanism that triggers repair actions only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, the diagnostic payload is escalated to the fog tier. Experimental results show that AURORA outperforms the evaluated baselines, achieving a 0% destructive action rate, 62.0% repair accuracy, and a 3 ms mean time to repair.
📝 Abstract
Grey failures in the computing continuum produce ambiguous, overlapping symptoms that existing approaches fail to diagnose reliably, either because they lack causal awareness or because they act under high epistemic uncertainty, risking destructive interventions. This paper presents an uncertainty-aware resilience micro-agent for causal observability (AURORA), a lightweight framework for diagnosing and mitigating grey failures in edge-tier environments. The framework employs parallel micro-agents that integrate the free-energy principle, causal do-calculus, and localized causal state-graphs to support counterfactual root-cause analysis within each fault's Markov blanket. Restricting inference to causally relevant variables reduces computational overhead while preserving diagnostic fidelity. AURORA further introduces a dual-gated execution mechanism that authorizes remediation only when causal confidence is high and predicted epistemic uncertainty is bounded; otherwise, it abstains from local intervention and escalates the diagnostic payload to the fog tier. Our experiments demonstrate that AURORA outperforms baselines, achieving a 0% destructive action rate while maintaining 62.0% repair accuracy and a 3 ms mean time to repair.
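The dual-gated execution rule described above can be illustrated with a minimal sketch. This is not AURORA's implementation: the `Diagnosis` fields, the `dual_gate` function, and the threshold values `min_confidence` and `max_uncertainty` are hypothetical names and numbers chosen for illustration, since the abstract does not specify how confidence and uncertainty are computed or bounded.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REMEDIATE_LOCALLY = "remediate_locally"
    ESCALATE_TO_FOG = "escalate_to_fog"


@dataclass(frozen=True)
class Diagnosis:
    # Hypothetical diagnostic payload produced by a micro-agent.
    root_cause: str
    causal_confidence: float      # assumed to lie in [0, 1]
    epistemic_uncertainty: float  # assumed non-negative, lower is better


def dual_gate(
    diagnosis: Diagnosis,
    min_confidence: float = 0.9,   # illustrative threshold, not from the paper
    max_uncertainty: float = 0.1,  # illustrative bound, not from the paper
) -> Decision:
    """Authorize local repair only if BOTH gates pass; otherwise escalate."""
    confident = diagnosis.causal_confidence >= min_confidence
    bounded = diagnosis.epistemic_uncertainty <= max_uncertainty
    if confident and bounded:
        return Decision.REMEDIATE_LOCALLY
    # Abstain from local intervention and hand the payload to the fog tier.
    return Decision.ESCALATE_TO_FOG


# Usage: a confident diagnosis with high uncertainty is still escalated.
clear_case = Diagnosis("disk_io_stall", causal_confidence=0.95, epistemic_uncertainty=0.05)
murky_case = Diagnosis("disk_io_stall", causal_confidence=0.95, epistemic_uncertainty=0.40)
print(dual_gate(clear_case).value)
print(dual_gate(murky_case).value)
```

Requiring both conditions, rather than either one, means a high-confidence diagnosis is never acted on locally while its uncertainty is unbounded, which is the behaviour the abstract associates with avoiding destructive actions.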