AI Summary
Heterogeneous device failures in the Distributed Computing Continuum (DCC) severely undermine system reliability and global consistency. Method: This paper proposes PAIR-Agent, an active inference-based framework integrating device log analysis, causal fault graph modeling, and the free energy principle; it quantifies uncertainty via Markov blankets to enable closed-loop, end-to-end fault awareness, diagnosis, and autonomous recovery. Unlike passive fault-tolerance mechanisms, PAIR-Agent supports real-time, adaptive resilience coordination across the edge-to-HPC continuum. Contribution/Results: Theoretical analysis and multi-scenario experiments demonstrate that PAIR-Agent significantly improves service stability and reduces recovery latency for AI workloads under dynamic heterogeneity. It establishes a novel paradigm for autonomous resilience in DCC-based intelligent systems, advancing self-healing capabilities through principled, inference-driven control.
Abstract
Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free-energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.
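The three operations listed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the log format, event names, probabilities, and every function name below are hypothetical. Temporal co-occurrence stands in as a crude proxy for causal discovery, the action rule uses only the risk term of expected free energy, and Markov blankets are not modeled. For a discrete fault hypothesis s and observation o, the variational free energy of a belief q is F = sum_s q(s) [ln q(s) - ln p(o, s)], which reaches its minimum, -ln p(o), when q equals the exact posterior.

```python
import math
from collections import defaultdict

# Hypothetical log records: (timestamp_seconds, device, event).
LOGS = [
    (0, "sensor-1", "power_drop"),
    (2, "edge-gw", "link_timeout"),
    (3, "hpc-node", "job_stall"),
    (10, "sensor-1", "power_drop"),
    (11, "edge-gw", "link_timeout"),
]

def build_fault_graph(logs, window=5):
    """(i) Count an edge a -> b whenever event b follows a within `window` seconds.
    Assumption: temporal precedence is used as a rough stand-in for causality."""
    graph = defaultdict(lambda: defaultdict(int))
    ordered = sorted(logs)
    for i, (t_a, _, ev_a) in enumerate(ordered):
        for t_b, _, ev_b in ordered[i + 1:]:
            if t_b - t_a > window:
                break
            if ev_a != ev_b:
                graph[ev_a][ev_b] += 1
    return {a: dict(bs) for a, bs in graph.items()}

def posterior(prior, likelihood):
    """Exact Bayesian posterior over fault hypotheses for one observation."""
    joint = {s: likelihood[s] * prior[s] for s in prior}
    evidence = sum(joint.values())
    return {s: j / evidence for s, j in joint.items()}

def free_energy(q, prior, likelihood):
    """(ii) Variational free energy F = sum_s q(s) * (ln q(s) - ln p(o, s))."""
    return sum(
        q[s] * (math.log(q[s]) - math.log(likelihood[s] * prior[s]))
        for s in q if q[s] > 0
    )

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def choose_action(belief, repair_success, preferred):
    """(iii) Pick the action whose predicted outcome is closest to the preferred one.
    Simplification: only the risk term of expected free energy is scored."""
    risks = {}
    for action, success in repair_success.items():
        p_ok = sum(belief[s] * success[s] for s in belief)
        predicted = {"healthy": p_ok, "faulty": 1.0 - p_ok}
        risks[action] = kl(predicted, preferred)
    return min(risks, key=risks.get)

# Toy numbers, chosen for illustration only.
graph = build_fault_graph(LOGS)
prior = {"power": 0.5, "network": 0.5}
likelihood = {"power": 0.2, "network": 0.8}   # p(observing link_timeout | fault)
belief = posterior(prior, likelihood)
repair_success = {
    "restart_device": {"power": 0.9, "network": 0.1},
    "reroute_traffic": {"power": 0.1, "network": 0.9},
}
preferred = {"healthy": 0.99, "faulty": 0.01}
action = choose_action(belief, repair_success, preferred)
```

With these toy values the belief shifts toward the network fault, and the free energy of the posterior belief is lower than that of the unchanged prior, which is the sense in which inference "minimizes" free energy here. A realistic agent would replace the co-occurrence counts with a proper causal discovery step and learn the likelihoods from logs rather than hard-coding them.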