🤖 AI Summary
To address the lack of formal modeling and efficient learning of defender strategies in the CAGE-2 benchmark, this paper introduces the first partially observable Markov decision process (POMDP) formalization of defender behavior, rigorously defining the optimal defense policy. To overcome computational bottlenecks arising from large state spaces, we propose BF-PPO—a novel algorithm that integrates particle filtering (PF) into the proximal policy optimization (PPO) framework, enabling robust belief-state estimation and sample-efficient policy learning. Experiments on the CybORG platform demonstrate that BF-PPO significantly outperforms the current state of the art (CARDIFF) on CAGE-2, achieving a 12.3% improvement in defense success rate while reducing training time by 47%. Our core contributions are threefold: (i) the first POMDP formalization of CAGE-2 defender dynamics; (ii) the design of BF-PPO, a principled PF-augmented deep RL algorithm; and (iii) a practical defense-policy learning paradigm that jointly optimizes performance and efficiency.
📝 Abstract
CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the framework of Partially Observable Markov Decision Processes (POMDPs). Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO, and it uses a particle filter to mitigate the computational complexity arising from the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest-ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF in terms of both the learned defender strategy and the required training time.
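The key mechanism the abstract alludes to is maintaining a belief over the large hidden state space with a particle filter instead of an exact belief update. As a rough illustration only (the function names `transition` and `obs_likelihood` are hypothetical stand-ins, not the paper's or CybORG's actual interfaces), a single bootstrap particle-filter belief update looks like this:

```python
import random

def pf_update(particles, action, observation, transition, obs_likelihood, rng=random):
    """One bootstrap particle-filter belief update.

    particles: list of hypothesized hidden states (samples from the belief).
    transition(state, action) -> next hidden state (one simulator step).
    obs_likelihood(observation, state) -> P(observation | state).
    These signatures are illustrative assumptions, not the paper's API.
    """
    # Propagate each particle through the (simulated) dynamics.
    propagated = [transition(s, action) for s in particles]
    # Weight each particle by how well it explains the observation.
    weights = [obs_likelihood(observation, s) for s in propagated]
    total = sum(weights)
    if total == 0:
        # Degenerate case (no particle explains the observation):
        # fall back to a uniform resample to avoid dividing by zero.
        return [rng.choice(propagated) for _ in propagated]
    # Multinomial resampling in proportion to the weights.
    return rng.choices(propagated, weights=weights, k=len(propagated))
```

The resulting particle set approximates the posterior belief over hidden states; a policy network such as PPO's actor can then condition on a summary of this set rather than on the raw (unobservable) state, which is the general idea behind PF-augmented RL methods.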