AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making

📅 2025-06-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing approaches that apply vision-language models (VLMs) to robotic manipulation often project the VLM’s knowledge into compressed intermediate representations, losing fine-grained spatial and semantic information that precise control depends on. Method: We propose *anti-grounding*, a paradigm that reverses conventional instruction grounding: instead of mapping instructions down to actions, it lifts candidate actions directly into the VLM’s high-dimensional representation space. Candidate trajectories are rendered from multiple views and evaluated with structured visual question answering (S-VQA), which enables instruction-driven, zero-shot, closed-loop decision making. An offline policy refinement module uses past experience to improve long-horizon performance. Contribution/Results: Decision making takes place within the VLM representation space, which enables zero-shot generalization to new tasks without task-specific fine-tuning. In both simulation and real-world experiments, the method outperforms baselines across diverse manipulation tasks.

📝 Abstract

Vision-Language Models (VLMs) encode knowledge and reasoning capabilities for robotic manipulation within high-dimensional representation spaces. However, current approaches often project them into compressed intermediate representations, discarding important task-specific information such as fine-grained spatial or semantic details. To address this, we propose AntiGrounding, a new framework that reverses the instruction grounding process. It lifts candidate actions directly into the VLM representation space, renders trajectories from multiple views, and uses structured visual question answering for instruction-based decision making. This enables zero-shot synthesis of optimal closed-loop robot trajectories for new tasks. We also propose an offline policy refinement module that leverages past experience to enhance long-term performance. Experiments in both simulation and real-world environments show that our method outperforms baselines across diverse robotic manipulation tasks.
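The decision step described in the abstract (propose candidate actions, score each one from several views, keep the best) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `sample_candidates`, `select_best`, and `toy_scorer` are hypothetical names, the candidates are random straight-line segments, and `toy_scorer` is a stand-in for the VLM's structured VQA judgment that simply prefers trajectories ending near a goal point.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
Trajectory = List[Point]
# A scorer takes a trajectory and a view name and returns a number (higher is better).
Scorer = Callable[[Trajectory, str], float]


def sample_candidates(start: Point, n: int, steps: int,
                      rng: random.Random) -> List[Trajectory]:
    """Sample n straight-line trajectories from `start` (hypothetical sampler)."""
    candidates = []
    for _ in range(n):
        direction = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        traj = [
            tuple(s + d * k / steps for s, d in zip(start, direction))
            for k in range(steps + 1)
        ]
        candidates.append(traj)
    return candidates


def select_best(candidates: Sequence[Trajectory], views: Sequence[str],
                score_view: Scorer) -> Tuple[int, float]:
    """Return (index, score) of the candidate with the highest mean per-view score."""
    best_idx, best_score = -1, float("-inf")
    for i, traj in enumerate(candidates):
        score = sum(score_view(traj, v) for v in views) / len(views)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


def toy_scorer(goal: Point) -> Scorer:
    """Assumption: replaces the VLM query with negative end-point distance to `goal`."""
    def score(traj: Trajectory, view: str) -> float:
        return -math.dist(traj[-1], goal)  # the view is ignored in this toy version
    return score


if __name__ == "__main__":
    rng = random.Random(0)
    cands = sample_candidates((0.0, 0.0, 0.0), n=16, steps=5, rng=rng)
    idx, score = select_best(cands, ["front", "top"], toy_scorer((0.5, 0.5, 0.0)))
    print(f"chosen candidate {idx} with mean score {score:.3f}")
```

In a closed-loop setting, this selection would be repeated after each executed segment with fresh observations; swapping `toy_scorer` for a real VLM query is the part this sketch leaves open.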

Problem

Research questions and friction points this paper is trying to address.

Lifting robotic actions into VLM space for decision-making

Preserving fine-grained spatial and semantic task details

Enabling zero-shot synthesis of optimal robot trajectories

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifts actions into VLM representation space

Uses multi-view trajectory rendering

Applies structured visual question answering
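As a rough illustration of the multi-view rendering idea listed above, the sketch below projects a 3D trajectory into pixel coordinates for several cameras using a standard pinhole model. It rests on simplifying assumptions that are not taken from the paper: each camera looks along the +z axis and differs only by a translation, and the function name `project_points` and all parameter values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(points: Sequence[Point3], focal: float,
                   center: Pixel, cam_offset: Point3) -> List[Pixel]:
    """Pinhole projection for a camera at `cam_offset` looking along +z (no rotation)."""
    pixels = []
    for x, y, z in points:
        xc, yc, zc = x - cam_offset[0], y - cam_offset[1], z - cam_offset[2]
        if zc <= 0:
            continue  # the point is behind the camera and cannot be drawn
        u = focal * xc / zc + center[0]
        v = focal * yc / zc + center[1]
        pixels.append((u, v))
    return pixels


def render_views(points: Sequence[Point3],
                 cameras: Dict[str, Point3],
                 focal: float = 100.0,
                 center: Pixel = (320.0, 240.0)) -> Dict[str, List[Pixel]]:
    """Project the same trajectory once per named camera."""
    return {name: project_points(points, focal, center, offset)
            for name, offset in cameras.items()}


if __name__ == "__main__":
    trajectory = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (1.0, 0.5, 2.0)]
    cameras = {"left": (-0.5, 0.0, 0.0), "right": (0.5, 0.0, 0.0)}
    for name, pixels in render_views(trajectory, cameras).items():
        print(name, pixels)
```

The resulting pixel polylines could then be drawn over each camera image before the images are passed to the VLM; a real setup would also need camera rotations and calibrated intrinsics.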

🔎 Similar Papers

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

2024-05-22 | arXiv.org | Citations: 1
