Bellman Value Decomposition for Task Logic in Safe Optimal Control

📅 2026-02-23
🤖 AI Summary
This work addresses the challenge of efficiently integrating task objectives with safety specifications in high-dimensional, complex environments, a setting where conventional approaches often struggle because formal automata become cumbersome and combined sparse rewards require laborious manual tuning. The authors prove that the Bellman Value of a task specified in temporal logic decomposes into a graph of Bellman Values for canonical subproblems: Reach-Avoid, Avoid, and a newly introduced Reach-Avoid-Loop formulation, each governed by its own Bellman equation. To solve for the Value and the optimal policy, they propose VDPPO, which embeds the decomposed Value graph in a two-layer neural network and balances safety and liveness automatically, without handcrafted heuristics. Simulated and hardware experiments on high-dimensional tasks with nonlinear dynamics and heterogeneous teams show that the approach greatly improves performance over existing baselines.

📝 Abstract
Real-world tasks involve nuanced combinations of goal and safety specifications. In high dimensions, the challenge is exacerbated: formal automata become cumbersome, and the combination of sparse rewards tends to require laborious tuning. In this work, we consider the innate structure of the Bellman Value as a means to naturally organize the problem for improved automatic performance. Namely, we prove that the Bellman Value for a complex task defined in temporal logic can be decomposed into a graph of Bellman Values, connected by a set of well-known Bellman equations (BEs): the Reach-Avoid BE, the Avoid BE, and a novel type, the Reach-Avoid-Loop BE. To solve for the Value and the optimal policy, we propose VDPPO, which embeds the decomposed Value graph into a two-layer neural net, bootstrapping the implicit dependencies. We conduct a variety of simulated and hardware experiments to test our method on complex, high-dimensional tasks involving heterogeneous teams and nonlinear dynamics. Ultimately, we find that this approach greatly improves performance over existing baselines, balancing safety and liveness automatically.
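
To make the two well-known building blocks concrete, the sketch below runs tabular value iteration for an undiscounted Reach-Avoid BE and an Avoid BE on a toy line world. It is an illustration only and not the paper's VDPPO method. The sign convention (a positive margin means "in the target" or "safe"), the 7-cell world, and every name in the code are assumptions made for this example. The backups used are V_RA(x) = min{ g(x), max{ l(x), max_u V_RA(f(x,u)) } } and V_A(x) = min{ g(x), max_u V_A(f(x,u)) }, where g is the safety margin, l is the reach margin, and f is the dynamics. The Reach-Avoid-Loop BE is not sketched because the abstract does not state its form.

```python
# Toy tabular value iteration for Reach-Avoid and Avoid Bellman equations.
# Everything here (world, margins, names) is a hypothetical illustration.

N_STATES = 7          # cells 0..6 on a line
ACTIONS = (-1, 0, 1)  # step left, stay, step right
OBSTACLE = 3          # the single unsafe cell
TARGET = 6            # the goal cell


def step(x, u):
    """Deterministic dynamics f(x, u): move by u, clipped to the grid."""
    return min(max(x + u, 0), N_STATES - 1)


def safe_margin(x):
    """g(x): positive iff x is safe."""
    return -1.0 if x == OBSTACLE else 1.0


def reach_margin(x):
    """l(x): positive iff x is in the target."""
    return 1.0 if x == TARGET else -1.0


def best_next(x, v):
    """max_u V(f(x, u)): value of the best successor state."""
    return max(v[step(x, u)] for u in ACTIONS)


def reach_avoid_backup(x, v):
    """Stay safe now, and either be in the target or continue optimally."""
    return min(safe_margin(x), max(reach_margin(x), best_next(x, v)))


def avoid_backup(x, v):
    """Stay safe now and continue optimally forever."""
    return min(safe_margin(x), best_next(x, v))


def fixed_point(backup, v0, max_iters=1000):
    """Apply the backup to every state until the table stops changing."""
    v = list(v0)
    for _ in range(max_iters):
        new_v = [backup(x, v) for x in range(N_STATES)]
        if new_v == v:
            return v
        v = new_v
    raise RuntimeError("value iteration did not converge")


states = range(N_STATES)
# Initialise with the zero-step values so the undiscounted iteration
# converges to the intended fixed point.
v_reach_avoid = fixed_point(
    reach_avoid_backup, [min(safe_margin(x), reach_margin(x)) for x in states]
)
v_avoid = fixed_point(avoid_backup, [safe_margin(x) for x in states])

print("Reach-Avoid:", v_reach_avoid)
print("Avoid:      ", v_avoid)
```

In this toy world the Reach-Avoid Value is positive only for cells 4 to 6, since cells 0 to 2 cannot reach the target without crossing the obstacle, while the Avoid Value is positive everywhere except the obstacle itself. The sign of each Value therefore marks the set of states from which the corresponding subtask can be satisfied.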

Problem

Research questions and friction points this paper is trying to address.

Safe Optimal Control
Temporal Logic
Bellman Value
High-dimensional Tasks
Sparse Rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bellman Value Decomposition
Temporal Logic
Safe Reinforcement Learning
Reach-Avoid-Loop Bellman Equation
VDPPO