Multi-CALF: A Policy Combination Approach with Statistical Guarantees

📅 2025-05-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing policy combination methods in reinforcement learning lack theoretical guarantees on stability and convergence. Method: This paper proposes Multi-CALF, the first framework unifying statistically grounded weighted policy composition with formally verified robust fallback policies. It leverages value-function-based relative improvement analysis and stochastic stability theory to rigorously derive probabilistic lower bounds on convergence, upper bounds on maximum trajectory deviation, and upper bounds on convergence time. Results: Multi-CALF significantly improves performance over single-policy baselines across multiple control tasks, while ensuring—under a user-specified high probability—convergence to a target set and bounded state deviation. Its core contribution is the joint design of policy fusion and formal stability guarantees, thereby filling a critical theoretical gap in RL policy composition.

Technology Category

Application Category

📝 Abstract
We introduce Multi-CALF, an algorithm that intelligently combines reinforcement learning policies based on their relative value improvements. Our approach integrates a standard RL policy with a theoretically-backed alternative policy, inheriting formal stability guarantees while often achieving better performance than either policy individually. We prove that our combined policy converges to a specified goal set with known probability and provide precise bounds on maximum deviation and convergence time. Empirical validation on control tasks demonstrates enhanced performance while maintaining stability guarantees.
Problem

Research questions and friction points this paper is trying to address.

Combining RL policies for better performance and stability
Proving convergence with specified probability and bounds
Empirical validation on control tasks shows improved results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines RL policies using value improvements
Integrates standard and theoretical policies
Ensures convergence with precise bounds
G
Georgiy Malaniya
Skolkovo Institute of Science and Technology
A
Anton Bolychev
Skolkovo Institute of Science and Technology
G
Grigory Yaremenko
Skolkovo Institute of Science and Technology
A
Anastasia Krasnaya
Skolkovo Institute of Science and Technology
Pavel Osinenko
Pavel Osinenko
Professor (Associate), Skolkovo Institute of Science and Technology
AIReinforcement LearningDynamical SystemsComputation