AI Summary
This work addresses the performance degradation of existing policy-steering methods when task semantics are ambiguous or the underlying policy lacks sufficient capability, a problem often exacerbated by overconfident vision-language models (VLMs). To mitigate this, the authors propose UPS (Uncertainty-aware Policy Steering), a framework that jointly models uncertainty in task semantics and action feasibility, enabling dynamic selection among execution, clarification, and learning strategies. UPS introduces a natural language-based clarification mechanism and a minimal-intervention continual learning paradigm. It leverages conformal prediction to calibrate the composition of the VLM and the pre-trained policy, and employs residual learning for efficient policy updates. Experiments in both simulated and real-world robotic settings demonstrate that UPS effectively distinguishes confident, ambiguous, and incapable states, significantly reducing human intervention and outperforming uncalibrated baselines and prior continual learning approaches.
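The conformal calibration mentioned above can be illustrated with the standard split-conformal recipe: compute nonconformity scores on a held-out calibration set, take a finite-sample-corrected quantile as a threshold, and keep only candidates below it. This is a minimal sketch under assumed inputs (the score definition, coverage level, and data here are illustrative, not the paper's actual calibration pipeline):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: from nonconformity scores on a held-out
    calibration set, return a threshold giving (1 - alpha) marginal coverage
    on exchangeable test points."""
    n = len(cal_scores)
    # Finite-sample corrected quantile level: ceil((n+1)(1-alpha)) / n.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, q_level, method="higher")

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score falls below the
    calibrated threshold; an empty or large set signals uncertainty."""
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]

# Illustrative calibration data, e.g. scores = 1 - confidence on true label.
rng = np.random.default_rng(0)
cal_scores = rng.uniform(0.0, 1.0, size=500)
tau = conformal_threshold(cal_scores, alpha=0.1)
kept = prediction_set([0.05, 0.5, 0.95], tau)
```

In a steering context, a singleton set suggests a confident choice, while an empty or multi-element set flags the ambiguity or incapability cases that trigger clarification or intervention.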
Abstract
Policy steering is an emerging way to adapt robot behaviors at deployment time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., a diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well calibrated. In practice, overconfident VLM judgments can degrade steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty-resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or request action interventions to correct the low-level policy when it is deemed incapable of the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/
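The residual-learning step described above can be sketched as wrapping a frozen base policy with a learned additive correction fit only on collected interventions, so pre-trained behavior is preserved where it already succeeds. The linear residual head, toy policy, and update rule below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

class ResidualPolicy:
    """Minimal sketch: frozen base policy plus a learned linear residual,
    regressed toward (expert action - base action) on intervention data."""
    def __init__(self, base_policy, obs_dim, act_dim, lr=1e-2):
        self.base_policy = base_policy         # frozen pre-trained policy
        self.W = np.zeros((act_dim, obs_dim))  # additive residual head
        self.lr = lr

    def act(self, obs):
        # Final action = frozen base action + learned correction.
        return self.base_policy(obs) + self.W @ obs

    def update(self, obs, expert_action):
        # Gradient step on squared error between the residual and the
        # gap left by the base policy at this intervention.
        target = expert_action - self.base_policy(obs)
        error = target - self.W @ obs
        self.W += self.lr * np.outer(error, obs)

# Toy frozen policy and a single repeated intervention (illustrative).
base = lambda obs: 0.5 * obs[:2]
pol = ResidualPolicy(base, obs_dim=3, act_dim=2)
obs = np.array([1.0, -1.0, 0.5])
for _ in range(200):
    pol.update(obs, expert_action=np.array([1.0, 0.0]))
```

Because only the small residual head is updated, each deployment-time intervention is cheap to incorporate, and the base policy's competence on already-solved states is untouched.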