Multi-Agent LLM Governance for Safe Two-Timescale Reinforcement Learning in SDN-IoT Defense

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical limitations in existing SDN-IoT defense mechanisms, which often neglect controller coupling effects and lack security-constrained, auditable policy evolution, leading to control-plane instability and degraded QoS. To overcome these challenges, the authors propose an introspective dual-timescale defense framework. At the fast timescale, switches deploy Proximal Policy Optimization (PPO) agents equipped with security constraints and action masking to mitigate attacks in real time. At the slow timescale, a novel multi-agent large language model (LLM)-based governance engine generates and verifies interpretable, auditable “policy constitutions” for secure strategy evolution without retraining. Experimental results demonstrate significant performance gains under heterogeneous IoT environments and adversarial attacks: the approach improves Macro-F1 by 9.1% and 15.4% over PPO and static baselines, respectively; reduces worst-case degradation by 36.8%; lowers peak controller backlog by 42.7%; limits high-load RTT p95 increase to under 5.8%; achieves policy convergence within five rounds; and decreases catastrophic overload incidents from 11.6% to 2.3%.
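The fast-timescale mitigation described above pairs a PPO policy with action masking, so that actions violating the current safety constraints receive zero probability before sampling. A minimal sketch of the masking step (the four-action space, function names, and the backlog-based mask are illustrative assumptions, not details from the paper):

```python
import numpy as np

def masked_action_distribution(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out probability mass on disallowed actions, then renormalize.

    logits: raw policy scores per action.
    mask:   1 = admissible, 0 = forbidden (e.g. an action that would push
            the controller queue past its safety threshold).
    """
    allowed = mask.astype(bool)
    logits = np.where(allowed, logits, -np.inf)       # forbid masked actions
    exp = np.exp(logits - logits[allowed].max())      # numerically stable softmax
    exp = np.where(allowed, exp, 0.0)
    return exp / exp.sum()

# Example: 4 hypothetical mitigation actions (monitor, rate-limit,
# redirect, drop); "drop" is masked because the controller is near overload.
probs = masked_action_distribution(np.array([1.0, 2.0, 0.5, 3.0]),
                                   np.array([1, 1, 1, 0]))
```

Masking at the distribution level (rather than penalizing bad actions in the reward) is what lets the slow-timescale constitution change the admissible action set without retraining the PPO agents.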
📝 Abstract
Software-Defined Networking (SDN) is increasingly adopted to secure Internet-of-Things (IoT) networks due to its centralized control and programmable forwarding. However, SDN-IoT defense is inherently a closed-loop control problem in which mitigation actions impact controller workload, queue dynamics, rule-installation delay, and future traffic observations. Aggressive mitigation may destabilize the control plane, degrade Quality of Service (QoS), and amplify systemic risk. Existing learning-based approaches prioritize detection accuracy while neglecting controller coupling, and perform only short-horizon Reinforcement Learning (RL) optimization without structured, auditable policy evolution. This paper introduces a self-reflective two-timescale SDN-IoT defense solution separating fast mitigation from slow policy governance. At the fast timescale, per-switch Proximal Policy Optimization (PPO) agents perform controller-aware mitigation under safety constraints and action masking. At the slow timescale, a multi-agent Large Language Model (LLM) governance engine generates machine-parsable updates to the global policy constitution Π, which encodes admissible actions, safety thresholds, and reward priorities. Updates (ΔΠ) are validated through stress testing and deployed only with non-regression and safety guarantees, ensuring auditable evolution without retraining the RL agents. Evaluation under heterogeneous IoT traffic and adversarial stress shows improvements of 9.1% Macro-F1 over PPO and 15.4% over static baselines. Worst-case degradation drops by 36.8%, peak controller backlog by 42.7%, and RTT p95 inflation remains below 5.8% under high-intensity attacks. Policy evolution converges within five cycles, reducing catastrophic overload from 11.6% to 2.3%.
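The abstract's slow-timescale gate — deploy a constitution update ΔΠ only with non-regression and safety guarantees — can be sketched as an acceptance test over stress-test metrics. All field names, metrics, and thresholds below are hypothetical assumptions for illustration, not the paper's actual validation criteria:

```python
from dataclasses import dataclass

@dataclass
class StressReport:
    """Metrics from replaying a candidate constitution under adversarial load.
    Field names are illustrative, not taken from the paper."""
    macro_f1: float        # detection quality under stress traffic
    peak_backlog: float    # normalized controller queue peak
    safety_violations: int # actions taken outside the admissible set

def accept_update(baseline: StressReport, candidate: StressReport,
                  f1_tolerance: float = 0.0, backlog_cap: float = 1.0) -> bool:
    """Deploy Delta-Pi only under non-regression and safety guarantees."""
    non_regression = candidate.macro_f1 >= baseline.macro_f1 - f1_tolerance
    safe = (candidate.safety_violations == 0
            and candidate.peak_backlog <= backlog_cap)
    return non_regression and safe

# A candidate that improves F1 without safety violations is deployable.
baseline = StressReport(macro_f1=0.82, peak_backlog=0.7, safety_violations=0)
candidate = StressReport(macro_f1=0.85, peak_backlog=0.6, safety_violations=0)
deployable = accept_update(baseline, candidate)
```

Gating deployment on an explicit, machine-checkable report is also what makes the policy evolution auditable: each accepted ΔΠ carries the evidence under which it passed.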
Problem

Research questions and friction points this paper is trying to address.

SDN-IoT defense
two-timescale reinforcement learning
controller coupling
policy governance
systemic risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

two-timescale reinforcement learning
multi-agent LLM governance
SDN-IoT defense
policy constitution
safety-constrained mitigation