🤖 AI Summary
This work addresses the challenges of supervision conflicts and inefficient learning in offline heterogeneous robotic datasets, which arise from discrepancies in embodiment, camera configurations, and demonstration quality. To this end, the authors propose a reward-free, conservative offline policy learning method that introduces a novel posterior-to-uniform ratio as a per-sample weight. By leveraging latent goal encoding, candidate-goal-pool matching, and a transition scorer to estimate the identification posterior, the approach enables adaptive credit assignment without requiring policy likelihoods and is compatible with both diffusion and flow-matching action heads. Combined with clipped mixture weighting and self-normalized weighted regression, the method significantly enhances the policy's conservative adaptability to heterogeneous data, effectively mitigating performance degradation caused by low-quality or conflicting demonstrations.
📝 Abstract
Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.
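The weighting scheme described in the abstract — softmax identification posterior over a candidate pool, posterior-to-uniform ratio, clipping and mixing, and self-normalization — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `ptr_weights` and the parameters `clip_max` and `mix_alpha` are hypothetical stand-ins, and the transition scorer is assumed to have already produced a logit for each (sample, candidate target) pair.

```python
import numpy as np

def ptr_weights(scores, target_idx, clip_max=5.0, mix_alpha=0.5):
    """Hypothetical sketch of PTR-style sample weighting.

    scores: (B, K) transition-scorer logits per sample over K candidate
            latent targets (the true post-action consequence plus
            K-1 mismatched targets drawn from the pool).
    target_idx: (B,) index of each sample's true target in its pool.
    Returns (B,) nonnegative weights that average to 1 over the batch.
    """
    B, K = scores.shape
    # Softmax identification posterior over the K candidate targets.
    logits = scores - scores.max(axis=1, keepdims=True)
    posterior = np.exp(logits)
    posterior /= posterior.sum(axis=1, keepdims=True)
    # PTR score: posterior probability of the true target divided by
    # the uniform probability 1/K.
    p_true = posterior[np.arange(B), target_idx]
    ptr = p_true * K
    # Clip the ratio and mix with a uniform weight of 1 for conservatism.
    w = mix_alpha * np.clip(ptr, 0.0, clip_max) + (1.0 - mix_alpha)
    # Self-normalize so the weighted regression loss keeps its scale.
    return w / w.mean()
```

A sample whose true target the scorer identifies confidently (posterior well above uniform) receives an up-weighted gradient; a sample whose consequence is indistinguishable from the mismatched candidates is down-weighted toward the uniform floor, without ever being discarded outright.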