🤖 AI Summary
Generative robot policies exhibit high failure rates in real-world deployment, and current vision-language models (VLMs) cannot reliably verify them because they struggle to reason about the physical consequences of low-level actions.
Method: We propose a VLM-in-the-loop framework for real-time policy steering that decouples *foresight*—predicting action outcomes via a latent-space world model—from *forethought*—natural-language evaluation of the predicted states by an aligned VLM.
Contributions/Results: First, we introduce the novel use of VLMs as open-vocabulary verifiers for low-level action filtering. Second, we design a latent-space alignment mechanism to bridge semantic representations from VLMs and robot control spaces. Third, we integrate multimodal policy distillation with semantic-driven action re-ranking. Experiments demonstrate substantial failure-rate reduction, strong generalization to unseen objects and environments, and robust policy guidance without requiring additional real-world interaction.
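The steering loop described above can be sketched in pseudocode-like Python. All components here (`sample_action_plans`, `rollout_latent`, `vlm_score`) are hypothetical stand-in stubs for the paper's generative policy, latent world model, and aligned VLM verifier; this is a minimal illustration of the foresight-then-forethought structure, not the actual implementation.

```python
# Hypothetical sketch of a FOREWARN-style steering loop.
# The three components below are stubs, not the paper's actual models.

def sample_action_plans(num_plans):
    """Stand-in generative policy: propose candidate low-level action plans."""
    return [[("move", i), ("grasp", i)] for i in range(num_plans)]

def rollout_latent(plan):
    """Stand-in latent world model (foresight): 'imagine' the outcome of a
    plan as a predicted latent state (here, a simple summary dict)."""
    plan_id = plan[0][1]
    return {"plan_id": plan_id, "predicted_success": plan_id % 2 == 0}

def vlm_score(latent_state, task_description):
    """Stand-in aligned VLM verifier (forethought): map the predicted latent
    state to a natural-language rationale and a scalar score."""
    ok = latent_state["predicted_success"]
    rationale = "outcome satisfies task" if ok else "outcome violates task"
    return (1.0 if ok else 0.0), rationale

def steer(task_description, num_plans=4):
    """Policy steering: execute the plan whose imagined outcome the
    verifier rates highest."""
    scored = []
    for plan in sample_action_plans(num_plans):
        latent = rollout_latent(plan)          # foresight
        score, rationale = vlm_score(latent, task_description)  # forethought
        scored.append((score, plan, rationale))
    return max(scored, key=lambda t: t[0])     # re-rank and select

score, plan, rationale = steer("place the mug on the shelf")
print(score, rationale)
```

In a real system the stubs would be replaced by a diffusion or transformer policy head, a learned latent dynamics model, and a VLM prompted with decoded descriptions of the imagined states; the selection step could also keep the top-k plans rather than a single argmax.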
📝 Abstract
While generative robot policies have demonstrated significant potential in learning complex, multimodal behaviors from demonstrations, they still exhibit diverse failures at deployment time. Policy steering offers an elegant solution to reducing the chance of failure by using an external verifier to select from low-level actions proposed by an imperfect generative policy. Here, one might hope to use a Vision Language Model (VLM) as a verifier, leveraging its open-world reasoning capabilities. However, off-the-shelf VLMs struggle to understand the consequences of low-level robot actions, as these actions are represented fundamentally differently from the text and images the VLM was trained on. In response, we propose FOREWARN, a novel framework to unlock the potential of VLMs as open-vocabulary verifiers for runtime policy steering. Our key idea is to decouple the VLM's burden of predicting action outcomes (foresight) from evaluation (forethought). For foresight, we leverage a latent world model to imagine future latent states given diverse low-level action plans. For forethought, we align the VLM with these predicted latent states so it can reason about the consequences of actions in its native representation, natural language, and effectively filter proposed plans. We validate our framework across diverse robotic manipulation tasks, demonstrating its ability to bridge representational gaps and provide robust, generalizable policy steering.