🤖 AI Summary
This work addresses truncation bias in offline evaluation of robotic manipulation policies. The bias arises because evaluation rollouts are necessarily finite, and it is compounded by sparse rewards and non-monotonic task progress. The authors frame policy evaluation as a task-completion problem and introduce a liveness-based Bellman operator. By incorporating a discounted liveness formalism, the approach encodes task progress while preserving the contraction property of the operator, yielding a conservative fixed-point value function that is robust to finite-horizon truncation. Evaluated with a vision-language-action model and a diffusion policy on two simulated manipulation tasks, and with human demonstration data on a real-world cloth-folding task, the method outperforms classical baselines such as TD(0) and Monte Carlo policy evaluation, reflecting task progress more accurately and substantially reducing truncation bias.
📝 Abstract
Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, task progression in evaluation rollouts is often non-monotonic because policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods that rely on the Bellman equations and the principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and on a cloth-folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.
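To make the truncation-bias problem concrete, the following minimal sketch shows how a Monte Carlo return under a sparse success reward collapses to zero when a rollout is cut off before the task completes. It illustrates only the baseline failure mode described in the abstract, not the proposed liveness-based operator. The function names, the single terminal reward of 1, the discount factor, and the step counts are all assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Monte Carlo return: sum over t of gamma**t * rewards[t]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def sparse_rollout(steps_to_goal, horizon):
    """Rewards observed in a rollout of at most `horizon` steps.

    Hypothetical setup: the policy would succeed at step `steps_to_goal`
    (0-indexed) and receive reward 1 there; every other step gives 0.
    If the horizon ends first, the success reward is never observed.
    """
    rewards = [0.0] * min(horizon, steps_to_goal + 1)
    if steps_to_goal < horizon:
        rewards[steps_to_goal] = 1.0
    return rewards


gamma = 0.9  # assumed discount factor

# Long enough rollout: the success at step 5 is observed.
full = discounted_return(sparse_rollout(steps_to_goal=5, horizon=10), gamma)

# Same policy, but the rollout is truncated after 4 steps.
truncated = discounted_return(sparse_rollout(steps_to_goal=5, horizon=4), gamma)

print(f"full rollout return:      {full:.5f}")
print(f"truncated rollout return: {truncated:.5f}")
```

In this toy setting the truncated rollout is indistinguishable from a failed one, even though the policy was on track to succeed. That gap between the two returns is the kind of bias a truncation-robust estimator would need to account for.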