Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) policies for unmanned aerial vehicles (UAVs) often fail under adversarial sensor attacks, compromising navigation robustness and safety. Method: This paper proposes a fragility-resistant RL framework that models robust policy selection as a non-stationary multi-armed bandit problem. We introduce discounted Thompson sampling to enable dynamic, adaptive switching among multiple pre-trained policies. Crucially, we theoretically optimize regret under out-of-distribution (OOD) reward shifts—achieving, to the best of our knowledge, the first such guarantee—thereby endowing the system with antifragility. The approach integrates robust RL, multi-policy ensembling, and explicit modeling of PGD- and deception-based sensor attacks, supporting adaptive decision-making under non-stationary rewards. Results: Experiments in 3D dynamic obstacle environments demonstrate that our method significantly shortens navigation paths and increases collision-free trajectory rates. It consistently outperforms existing robust RL baselines in both robustness and adaptability under adversarial sensor perturbations.

📝 Abstract
The increasing automation of navigation for unmanned aerial vehicles (UAVs) has exposed them to adversarial attacks that exploit vulnerabilities in reinforcement learning (RL) through sensor manipulation. Although existing robust RL methods aim to mitigate such threats, their effectiveness generalizes poorly to out-of-distribution shifts from the optimal value distribution, as they are primarily designed to handle fixed perturbations. To address this limitation, this paper introduces an antifragile RL framework that enhances adaptability to broader distributional shifts by incorporating a switching mechanism based on discounted Thompson sampling (DTS). This mechanism dynamically selects among multiple robust policies to minimize adversarially induced state-action-value distribution shifts. The proposed approach first derives a diverse ensemble of action-robust policies by accounting for a range of perturbations in the policy space. Policy selection is then modeled as a multi-armed bandit (MAB) problem, in which DTS optimally selects policies in response to non-stationary Bernoulli rewards, effectively adapting to evolving adversarial strategies. A theoretical framework is also provided, showing that optimizing the DTS to minimize the overall regret induced by distributional shift yields effective adaptation to unseen adversarial attacks, thereby inducing antifragility. Extensive numerical simulations validate the effectiveness of the proposed framework in complex navigation environments with multiple dynamic three-dimensional obstacles and with stronger projected gradient descent (PGD) and spoofing attacks. Compared to conventional robust, non-adaptive RL methods, the antifragile approach achieves superior performance, demonstrating shorter navigation path lengths and a higher rate of conflict-free navigation trajectories.
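The DTS switching mechanism described above can be sketched as a small bandit loop: each pre-trained robust policy is an arm, episode outcomes are Bernoulli rewards (e.g. 1 if the trajectory was collision-free), and discounting the Beta posteriors lets the sampler forget stale evidence under non-stationary adversaries. This is a minimal illustration, not the authors' implementation; the reward interface, discount factor, and Beta-prior handling are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def discounted_thompson_sampling(reward_fn, n_policies=3, horizon=500, gamma=0.95):
    """Select among pre-trained robust policies via discounted Thompson sampling.

    reward_fn(arm, t) returns a Bernoulli outcome (hypothetical interface) for
    running policy `arm` at step `t`. Discounting the posterior counts by
    `gamma` each step keeps an effective window of ~1/(1-gamma) observations,
    so the sampler can track a non-stationary best arm.
    """
    alpha = np.zeros(n_policies)  # discounted success counts
    beta = np.zeros(n_policies)   # discounted failure counts
    choices = []
    for t in range(horizon):
        # Sample a plausible success rate per arm under a Beta(1, 1) prior.
        theta = rng.beta(alpha + 1.0, beta + 1.0)
        arm = int(np.argmax(theta))
        r = reward_fn(arm, t)
        # Discount all posteriors, then update the chosen arm.
        alpha *= gamma
        beta *= gamma
        alpha[arm] += r
        beta[arm] += 1 - r
        choices.append(arm)
    return choices
```

With a reward distribution whose best arm switches mid-run (mimicking an evolving attack), the sampler shifts its selections toward the newly best policy within roughly the discount window.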
Problem

Research questions and friction points this paper is trying to address.

Enhances UAV adaptability to adversarial sensor attacks
Dynamically switches robust policies for distribution shifts
Optimizes policy selection against nonstationary adversarial strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Antifragile RL framework with policy switching
Discounted Thompson sampling for dynamic policy selection
Multiarmed bandit modeling for robust policy adaptation
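The PGD sensor attacks the framework is evaluated against can be illustrated with a toy projected-gradient attack on an observation: the attacker takes signed gradient steps that lower the victim's value estimate, projecting back onto an L-infinity ball after each step. This is a generic PGD sketch, not the paper's attack model; `grad_fn`, the toy value function, and all parameters are illustrative assumptions.

```python
import numpy as np

def pgd_sensor_attack(obs, grad_fn, eps=0.1, step=0.02, n_steps=10):
    """Perturb a sensor reading within an L-infinity ball of radius eps.

    grad_fn(x) returns the gradient of the victim's value estimate at
    observation x (hypothetical interface). Each iteration descends the
    value function by a signed step, then clips back onto the eps-ball
    around the clean observation.
    """
    x = obs.copy()
    for _ in range(n_steps):
        g = grad_fn(x)
        x = x - step * np.sign(g)             # push the perceived value down
        x = np.clip(x, obs - eps, obs + eps)  # project onto the eps-ball
    return x
```

For a toy value function v(s) = -||s||^2, the attack drives the observation away from the origin while staying within the perturbation budget, which is the kind of bounded distribution shift the policy-switching mechanism is meant to absorb.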
Deepak Kumar Panda
Faculty of Engineering and Applied Sciences, Cranfield University, MK43 0AL Cranfield, U.K
Weisi Guo
Professor & Head of Centre - Cranfield University; Visiting Fellow - Alan Turing Inst.
Graph Signal Processing · Networks · Adversarial AI · Autonomy · Social Physics