🤖 AI Summary
To address the lack of formal modeling and efficient learning of defender strategies in the CAGE-2 benchmark, this paper introduces the first partially observable Markov decision process (POMDP) formalization of defender behavior, rigorously defining the optimal defense policy. To overcome computational bottlenecks arising from large state spaces, we propose BF-PPO—a novel algorithm that integrates particle filtering (PF) into the proximal policy optimization (PPO) framework, enabling robust belief-state estimation and sample-efficient policy learning. Experiments on the CybORG platform demonstrate that BF-PPO significantly outperforms the current state of the art (CARDIFF) on CAGE-2, achieving a 12.3% improvement in defense success rate while reducing training time by 47%. Our core contributions are threefold: (i) the first POMDP formalization of CAGE-2 defender dynamics; (ii) the design of BF-PPO, a principled PF-augmented deep RL algorithm; and (iii) a practical defense-policy learning paradigm that jointly optimizes performance and efficiency.
📝 Abstract
CAGE-2 is an accepted benchmark for learning and evaluating defender strategies against cyberattacks. It reflects a scenario where a defender agent protects an IT infrastructure against various attacks. Many defender methods for CAGE-2 have been proposed in the literature. In this paper, we construct a formal model for CAGE-2 using the framework of Partially Observable Markov Decision Processes (POMDPs). Based on this model, we define an optimal defender strategy for CAGE-2 and introduce a method to efficiently learn this strategy. Our method, called BF-PPO, is based on PPO, and it uses a particle filter to mitigate the computational complexity arising from the large state space of the CAGE-2 model. We evaluate our method in the CAGE-2 CybORG environment and compare its performance with that of CARDIFF, the highest-ranked method on the CAGE-2 leaderboard. We find that our method outperforms CARDIFF in terms of both the learned defender strategy and the required training time.
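The key mechanism the abstract alludes to is maintaining a belief over the large hidden state space with a particle filter instead of an exact belief update. As a rough illustration only (the function names `transition` and `obs_likelihood` are hypothetical stand-ins, not the paper's or CybORG's actual interfaces), a single bootstrap particle-filter belief update looks like this:

```python
import random

def pf_update(particles, action, observation, transition, obs_likelihood, rng=random):
    """One bootstrap particle-filter belief update.

    particles: list of hypothesized hidden states (samples from the belief).
    transition(state, action) -> next hidden state (one simulator step).
    obs_likelihood(observation, state) -> P(observation | state).
    These signatures are illustrative assumptions, not the paper's API.
    """
    # Propagate each particle through the (simulated) dynamics.
    propagated = [transition(s, action) for s in particles]
    # Weight each particle by how well it explains the observation.
    weights = [obs_likelihood(observation, s) for s in propagated]
    total = sum(weights)
    if total == 0:
        # Degenerate case (no particle explains the observation):
        # fall back to a uniform resample to avoid dividing by zero.
        return [rng.choice(propagated) for _ in propagated]
    # Multinomial resampling in proportion to the weights.
    return rng.choices(propagated, weights=weights, k=len(propagated))
```

The resulting particle set approximates the posterior belief over hidden states; a policy network such as PPO's actor can then condition on a summary of this set rather than on the raw (unobservable) state, which is the general idea behind PF-augmented RL methods.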