Resilient by Design - Active Inference for Distributed Continuum Intelligence

Date: 2025-11-10
Citations: 0
Influential: 0
AI Summary
Problem: Heterogeneous device failures in the Distributed Computing Continuum (DCC) severely undermine system reliability and global consistency. Method: This paper proposes PAIR-Agent, an active inference-based framework integrating device log analysis, causal fault graph modeling, and the free-energy principle; it quantifies uncertainty via Markov blankets to enable closed-loop, end-to-end fault awareness, diagnosis, and autonomous recovery. Unlike passive fault-tolerance mechanisms, PAIR-Agent supports real-time, adaptive resilience coordination across the edge-to-HPC continuum. Contribution/Results: Theoretical analysis and multi-scenario experiments demonstrate that PAIR-Agent significantly improves service stability and reduces recovery latency for AI workloads under dynamic heterogeneity. It establishes a novel paradigm for autonomous resilience in DCC-based intelligent systems, advancing self-healing capabilities through principled, inference-driven control.

๐Ÿ“ Abstract
Failures are the norm in highly complex and heterogeneous devices spanning the distributed computing continuum (DCC), from resource-constrained IoT and edge nodes to high-performance computing systems. Ensuring reliability and global consistency across these layers remains a major challenge, especially for AI-driven workloads requiring real-time, adaptive coordination. This paper introduces a Probabilistic Active Inference Resilience Agent (PAIR-Agent) to achieve resilience in DCC systems. PAIR-Agent performs three core operations: (i) constructing a causal fault graph from device logs, (ii) identifying faults while managing certainties and uncertainties using Markov blankets and the free-energy principle, and (iii) autonomously healing issues through active inference. Through continuous monitoring and adaptive reconfiguration, the agent maintains service continuity and stability under diverse failure conditions. Theoretical validations confirm the reliability and effectiveness of the proposed framework.
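For orientation, the quantity that the free-energy principle asks an agent to minimize is usually written as the variational free energy. The form below is the standard textbook definition, not a formula taken from this paper; the symbols are assumptions introduced here, with o the observations (for example, device log events), s the hidden states (for example, fault conditions), p the agent's generative model, and q its approximate posterior over hidden states.

```latex
F[q] \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
     \;=\; D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s \mid o)\,\right] \;-\; \ln p(o)
```

Because the KL term is non-negative, F is an upper bound on the surprise, -ln p(o). Lowering F therefore either brings q closer to the true posterior over faults (diagnosis) or, through action, makes the observations less surprising (healing).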
Problem

Research questions and friction points this paper is trying to address.

Achieving resilience in distributed computing continuum systems
Managing reliability across IoT, edge, and high-performance computing layers
Autonomous fault identification and healing through active inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs causal fault graphs from device logs
Manages uncertainty using Markov blankets and the free-energy principle
Autonomously heals issues through active inference
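
The first two ideas above can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a hypothetical log format of (timestamp, event name) pairs and uses temporal precedence within a time window as a crude stand-in for causal discovery. The names build_fault_graph, markov_blanket, events, and window are all introduced here for illustration. The Markov blanket itself follows the standard definition for a directed graph: a node's parents, its children, and the other parents of its children.

```python
from collections import defaultdict


def build_fault_graph(events, window):
    """Add an edge a -> b when event b follows a different event a
    within `window` seconds. Temporal precedence is only a proxy for
    causality (an assumption of this sketch)."""
    edges = defaultdict(set)
    ordered = sorted(events)  # sort by timestamp, then by name
    for i, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[i + 1:]:
            if t_b - t_a > window:
                break  # later events are even further away in time
            if a != b:
                edges[a].add(b)
    return edges


def markov_blanket(edges, node):
    """Return the parents, children, and co-parents of `node`."""
    parents = {p for p, kids in edges.items() if node in kids}
    children = set(edges.get(node, ()))
    co_parents = {p for p, kids in edges.items() if kids & children}
    return (parents | children | co_parents) - {node}


# Hypothetical device-log events: (seconds since start, event name).
events = [
    (0.0, "disk_full"),
    (0.5, "net_timeout"),
    (1.0, "write_error"),
    (2.0, "service_crash"),
    (20.0, "reboot"),
]

graph = build_fault_graph(events, window=1.2)
blanket = markov_blanket(graph, "write_error")
print(sorted(blanket))
```

In this toy graph, the blanket of write_error contains disk_full, net_timeout, and service_crash. Conditioned on those three nodes, write_error is independent of everything else in the graph, which is what lets an agent reason about a fault locally rather than over the whole continuum.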