🤖 AI Summary
This work addresses the rationality gap that arises in human-robot collaboration from human behavioral diversity and heterogeneous robot strategies, proposing a decentralized multi-agent policy optimization method grounded in Lyapunov stability theory. The approach constructs Lyapunov stability conditions directly in the policy-parameter space and uses quadratic programming to project and correct heterogeneous policy gradients, ensuring monotonic contraction of the rationality gap during learning. The study is the first to integrate Lyapunov stability into heterogeneous multi-agent reinforcement learning: rather than relying solely on state constraints as in conventional safe reinforcement learning, it certifies the stability of the learning dynamics through a parameter-space disagreement metric. Experiments in simulation and on real humanoid robots show that the method substantially improves the generalization and robustness of collaborative policies in edge-case scenarios while mitigating training oscillations and divergence.
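The summary does not spell out the disagreement metric or the exact decrease condition, but a common choice for such a parameter-space Lyapunov candidate is the squared deviation of each agent's parameters from their mean, with a per-step decrease condition imposed on the corrected update. The following is a hypothetical sketch of that form, not the paper's stated definitions:

```latex
% One plausible (assumed) parameter-space disagreement metric over
% N agents' policy parameters \theta_1, \dots, \theta_N:
V(\theta) = \sum_{i=1}^{N} \left\lVert \theta_i - \bar{\theta} \right\rVert^2,
\qquad
\bar{\theta} = \frac{1}{N} \sum_{i=1}^{N} \theta_i .
% Per-step Lyapunov decrease condition imposed on agent i's corrected update g_i:
\nabla_{\theta_i} V(\theta)^{\top} g_i \le -\epsilon, \qquad \epsilon > 0 .
```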
📝 Abstract
To improve generalization and resilience in human-robot collaboration (HRC), robots must handle the combinatorial diversity of human behaviors and contexts, motivating multi-agent reinforcement learning (MARL). However, inherent heterogeneity between robots and humans creates a rationality gap (RG) in the learning process: a variational mismatch between decentralized best-response dynamics and centralized cooperative ascent. The resulting learning problem is a general-sum differentiable game, so independent policy-gradient updates can oscillate or diverge without added structure. We propose heterogeneous-agent Lyapunov policy optimization (HALyPO), which establishes formal stability directly in the policy-parameter space by enforcing a per-step Lyapunov decrease condition on a parameter-space disagreement metric. Unlike Lyapunov-based safe RL, which targets state/trajectory constraints in constrained Markov decision processes, HALyPO uses Lyapunov certification to stabilize decentralized policy learning. HALyPO rectifies decentralized gradients via optimal quadratic projections, ensuring monotonic contraction of RG and enabling effective exploration of open-ended interaction spaces. Extensive simulations and real-world humanoid-robot experiments show that this certified stability improves generalization and robustness in collaborative corner cases.
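The "optimal quadratic projections" described above suggest, for each agent, the feasible update closest to its raw policy gradient under the Lyapunov decrease constraint. Below is a minimal sketch of that single-constraint QP, which admits a closed-form solution; the function name `lyapunov_project`, the mean-deviation metric, and the step size are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def lyapunov_project(g, grad_V, eps=1e-3):
    """Solve the single-constraint QP
        min_{g'} ||g' - g||^2   s.t.   grad_V . g' <= -eps,
    i.e. find the update closest to the raw gradient g that still
    strictly decreases the Lyapunov function V. Closed form: shift g
    along grad_V only when the decrease condition is violated.
    """
    nrm2 = float(grad_V @ grad_V)
    if nrm2 < 1e-12:
        # Degenerate constraint (grad_V ~ 0): no meaningful correction exists.
        return g
    violation = float(grad_V @ g) + eps
    if violation <= 0.0:
        return g  # raw gradient already satisfies the decrease condition
    # Minimal correction along grad_V that restores feasibility.
    return g - (violation / nrm2) * grad_V

# Toy usage with the assumed metric V(theta) = sum_i ||theta_i - mean||^2,
# whose per-agent gradient is grad_{theta_i} V = 2 * (theta_i - mean).
rng = np.random.default_rng(0)
thetas = rng.normal(size=(3, 5))      # 3 agents, 5-dim policy parameters
raw_grads = rng.normal(size=(3, 5))   # decentralized policy-gradient directions
mean_theta = thetas.mean(axis=0)
corrected = np.stack([
    lyapunov_project(raw_grads[i], 2.0 * (thetas[i] - mean_theta))
    for i in range(len(thetas))
])
thetas += 0.01 * corrected            # stabilized decentralized update
```

Under these assumptions, every corrected step is a descent step on V, which mirrors the monotonic-contraction guarantee the abstract attributes to HALyPO.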