Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline-to-online reinforcement learning (O2O-RL) faces two key challenges: insufficient coverage of multimodal behaviors and distributional shift during online adaptation. To address these, we propose Unified Expressive Policy Optimization (UEPO), a framework that brings the pretraining and fine-tuning paradigm of large models into RL. UEPO is the first to jointly incorporate three components within a single unified architecture: a multi-seed dynamics-aware diffusion policy, dynamic divergence regularization, and a diffusion-based data augmentation module. Together they enable physically plausible, highly diverse policy generation and safe policy transfer. By enhancing policy expressivity and dynamics generalization, UEPO achieves state-of-the-art performance on the D4RL benchmark: a +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation tasks. These results demonstrate UEPO's generalization capability, scalability, and robustness to distributional shift in O2O-RL settings.

📝 Abstract
Offline-to-online reinforcement learning (O2O-RL) has emerged as a promising paradigm for safe and efficient robotic policy deployment but suffers from two fundamental challenges: limited coverage of multimodal behaviors and distributional shifts during online adaptation. We propose UEPO, a unified generative framework inspired by large language model pretraining and fine-tuning strategies. Our contributions are threefold: (1) a multi-seed dynamics-aware diffusion policy that efficiently captures diverse modalities without training multiple models; (2) a dynamic divergence regularization mechanism that enforces physically meaningful policy diversity; and (3) a diffusion-based data augmentation module that enhances dynamics model generalization. On the D4RL benchmark, UEPO achieves +5.9% absolute improvement over Uni-O4 on locomotion tasks and +12.4% on dexterous manipulation, demonstrating strong generalization and scalability.
Problem

Research questions and friction points this paper is trying to address.

Addresses limited multimodal behavior coverage in robot learning
Mitigates distributional shifts during online policy adaptation
Enhances dynamics generalization through diffusion-based data augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-seed dynamics-aware diffusion policy captures diverse modalities
Dynamic divergence regularization enforces physically meaningful policy diversity
Diffusion-based data augmentation enhances dynamics model generalization
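
The multi-seed idea above can be illustrated with a toy sketch: one generative sampler, run with different random seeds, lands in different behavior modes, so no separate model per mode is needed. Everything below is an assumption made for illustration and is not taken from the paper: the 1-D action space, the two modes at -1 and +1, the analytic mixture score that stands in for a trained denoising network, and the names `score`, `sample_action`, and `sample_many`. It uses plain Langevin updates rather than UEPO's actual sampler, which the abstract does not specify.

```python
import numpy as np

MODES = np.array([-1.0, 1.0])  # hypothetical action modes (assumption)
SIGMA = 0.2                    # hypothetical spread around each mode (assumption)


def score(a: float) -> float:
    """Gradient of the log-density of an equal-weight Gaussian mixture at a.

    Stands in for a learned denoising network (assumption, not UEPO's model).
    """
    logits = -((a - MODES) ** 2) / (2.0 * SIGMA**2)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return float(np.sum(weights * (MODES - a)) / SIGMA**2)


def sample_action(seed: int, n_steps: int = 200, step: float = 0.01) -> float:
    """Draw one action with Langevin updates; the seed fixes all the noise."""
    rng = np.random.default_rng(seed)
    a = float(rng.normal())  # start from pure Gaussian noise
    for _ in range(n_steps):
        a += step * score(a) + np.sqrt(2.0 * step) * float(rng.normal())
    return a


def sample_many(seeds) -> np.ndarray:
    """Run the same sampler once per seed and collect the resulting actions."""
    return np.array([sample_action(s) for s in seeds])


actions = sample_many(range(32))
print("actions near -1:", int(np.sum(actions < 0)))
print("actions near +1:", int(np.sum(actions > 0)))
```

Because each seed fixes the whole noise sequence, a given seed always reproduces the same action, while different seeds spread across both modes. In a real system the analytic `score` would be replaced by a trained, state-conditioned network.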
Haidong Huang
Eastern Institute of Technology, Ningbo
Haiyue Zhu
SIMTech, Agency for Science, Technology and Research (A*STAR)
Jiayu Song
Queen Mary University of London, Rawmantic
NLP, CV
Xixin Zhao
Eastern Institute of Technology, Ningbo
Yaohua Zhou
Eastern Institute of Technology, Ningbo
Jiayi Zhang
University of Nottingham
Yuze Zhai
Southern University of Science and Technology
Xiaocong Li
Eastern Institute of Technology, Ningbo