Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work asks when a reinforcement learning agent needs to act, rather than only what it should do, so that control updates can be communicated less often without sacrificing stability or safety. The authors propose a safe reinforcement learning framework in which a single policy jointly learns control inputs and communication timing, treating “when to act” as a learnable decision. Safety is enforced pointwise by a run-time assurance (RTA) layer that combines a one-step-ahead Lyapunov shield with a CARE-based LQR backup controller, a stronger guarantee than constrained-MDP methods that enforce safety only in expectation. The approach uses a CARE-derived Lyapunov reward and a preference-conditioned policy extension, and soft actor-critic (SAC) experiments indicate that it carries over to both discrete and continuous action spaces. On an inverted pendulum, a cart-pole, and a planar quadrotor, the learned policy achieves 1.45–3.51× higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline, whereas a fixed LQR controller at the same average rate is unstable on all three plants, indicating that adaptive timing, not a lower average rate, is what makes sparse communication safe. A 12-state 3D quadrotor case study extends the framework to higher dimensions, and performance degrades gracefully under ±30% mass variation and disturbances.
📝 Abstract
Safe reinforcement learning (RL) typically asks $\textit{what}$ an agent should do. We ask $\textit{when}$ it needs to act, and show that a single policy can jointly learn control inputs and communication-efficient timing decisions under a pointwise Lyapunov safety shield. We focus on stabilization around a known equilibrium, where CARE-based LQR backups, Lyapunov certificates, and classical Lyapunov-STC are well defined, enabling clean comparison against analytical baselines. A run-time assurance (RTA) layer overrides the policy via a one-step-ahead Lyapunov prediction and a precomputed LQR backup, providing a strictly stronger guarantee than constrained MDP methods that enforce safety only in expectation. On an inverted pendulum, cart--pole, and planar quadrotor, the learned policy achieves $1.91\times$, $1.45\times$, and $3.51\times$ higher mean inter-sample interval (MSI) than a Lyapunov-triggered baseline; a fixed LQR controller at the same average rate is unstable on all three plants, showing that adaptive timing, not a lower average rate, makes sparsity safe. A CARE-derived Lyapunov reward transfers across environments without redesign, with a single weight $w_c$ controlling the stability--communication tradeoff; ablations confirm the RTA shield is essential, with its removal reducing MSI by $1.27$--$1.84\times$ and degrading state norms. A preference-conditioned extension recovers the full tradeoff frontier from one model at $\tfrac{2}{11}$ of training compute, and SAC experiments show the results are algorithm-agnostic across discrete and continuous domains. A 12-state 3D quadrotor case study extends the framework to higher-dimensional systems where classical STC is intractable, and robustness to $\pm30\%$ mass variation and disturbances shows graceful degradation, with the RTA absorbing what the learned policy cannot.
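The RTA idea described in the abstract (one-step-ahead Lyapunov prediction with a precomputed LQR backup) can be illustrated with a minimal sketch. This is not the authors' implementation, and everything specific in it is an assumption made for illustration: a hypothetical scalar plant $\dot{x} = ax + bu$ whose CARE has a closed-form solution, an Euler one-step prediction, a non-increase test $V(x^+) \le V(x)$ with $V(x) = p x^2$ as the shield condition, a stand-in "policy" that always proposes holding the last transmitted input, and MSI computed as total time divided by the number of transmitted updates. The helper names `scalar_care`, `rta_filter`, and `simulate` are likewise hypothetical.

```python
import math


def scalar_care(a, b, q, r):
    """Stabilizing root of the scalar CARE 2*a*p - (b*p)**2 / r + q = 0."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r  # LQR backup gain, u = -k * x
    return p, k


def rta_filter(x, u_prop, a, b, p, k, dt):
    """Accept u_prop only if the predicted Lyapunov value does not increase."""
    x_next = x + dt * (a * x + b * u_prop)  # one-step-ahead Euler prediction
    if p * x_next ** 2 <= p * x ** 2:
        return u_prop, False
    return -k * x, True  # assumed shield behavior: override with the LQR backup


def simulate(x0=1.0, a=1.0, b=1.0, q=1.0, r=1.0, dt=0.01, steps=500):
    """Roll out a 'never communicate' stand-in policy behind the shield."""
    p, k = scalar_care(a, b, q, r)
    x, u_held, updates = x0, 0.0, 0
    v_trace = [p * x0 ** 2]
    for _ in range(steps):
        # Stand-in policy: always propose holding the last transmitted input.
        u, overridden = rta_filter(x, u_held, a, b, p, k, dt)
        if overridden:
            u_held = u  # a new input is transmitted to the actuator
            updates += 1
        x = x + dt * (a * x + b * u)  # same Euler model as the prediction
        v_trace.append(p * x ** 2)
    msi = steps * dt / max(updates, 1)  # assumed MSI definition, in seconds
    return x, updates, msi, v_trace
```

Calling `simulate()` returns the final state, the number of transmitted updates, the resulting MSI, and the Lyapunov trace. Under these assumptions the shield transmits a new input only when holding the old one would raise $V$, so the trace never increases while most steps reuse the held input. A learned policy would replace the stand-in proposal and could also choose when to update voluntarily.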
Problem

Research questions and friction points this paper is trying to address.

safe reinforcement learning
communication-efficient control
run-time assurance
event-triggered control
Lyapunov stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Run-Time Assurance
Communication-Efficient RL
Lyapunov Safety
Adaptive Timing
Safe Reinforcement Learning