🤖 AI Summary
This work addresses the challenge of generalizing force and motion control in zero-shot robotic manipulation. We propose a pretraining-free, vision-language model (VLM)-driven approach that explicitly overlays unified coordinate-system annotations onto monocular robot images and integrates wrench-space prompting to directly elicit six-dimensional force outputs, bypassing trajectory prediction. A physics-grounded interaction feedback loop enables autonomous failure recovery. To our knowledge, this is the first method to jointly leverage coordinate-system visual annotation and wrench-space reasoning, enabling zero-shot generalization across tasks, platforms, and multimodal motions (translation/rotation). Evaluated on four manipulation tasks (opening a lid, closing a lid, pushing a cup, and pushing a chair) across 220 trials, our method achieves a 51% success rate without fine-tuning or human intervention. Furthermore, we identify a novel safety concern: coordinate-system annotations may inadvertently circumvent VLM safety mechanisms.
📝 Abstract
Vision-language models (VLMs) exhibit vast knowledge of the physical world, including intuition about physical and spatial properties, affordances, and motion. With fine-tuning, VLMs can also natively produce robot trajectories. We demonstrate that eliciting wrenches, not trajectories, allows VLMs to explicitly reason about forces and leads to zero-shot generalization in a series of manipulation tasks without pretraining. We achieve this by overlaying a consistent visual representation of relevant coordinate frames on robot-attached camera images to augment our query. First, we show how this addition enables a versatile motion control framework evaluated across four tasks (opening and closing a lid, pushing a cup or chair) spanning prismatic and rotational motion, an order of magnitude in force and position, different camera perspectives, annotation schemes, and two robot platforms over 220 experiments, resulting in 51% success across the four tasks. Then, we demonstrate that the proposed framework enables VLMs to continually reason about interaction feedback to recover from task failure or incompletion, with and without human supervision. Finally, we observe that prompting schemes with visual annotation and embodied reasoning can bypass VLM safeguards. We characterize each prompt component's contribution to harmful behavior elicitation and discuss the implications for developing embodied reasoning. Our code, videos, and data are available at: https://scalingforce.github.io/.
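To make the wrench-space prompting idea concrete, the following is a minimal, hypothetical sketch of how a query and response might be structured: a text prompt references the coordinate-frame annotation drawn on the camera image and asks the VLM for a 6-D wrench, and a parser converts the model's JSON reply into force/torque values. All names, prompt wording, and the JSON schema here are illustrative assumptions, not the paper's released implementation.

```python
import json

# Hypothetical prompt template: the real system's wording and annotation
# scheme may differ. The image sent alongside this text is assumed to have
# the relevant coordinate frame overlaid on it.
WRENCH_PROMPT = (
    "The image is annotated with the end-effector coordinate frame "
    "(x: red, y: green, z: blue). Respond with a JSON object "
    '{"force": [Fx, Fy, Fz], "torque": [Tx, Ty, Tz]} giving the wrench, '
    "in newtons and newton-meters, that accomplishes this task: <task>"
)


def build_wrench_query(task: str) -> str:
    """Compose the text portion of the VLM query for a given task."""
    return WRENCH_PROMPT.replace("<task>", task)


def parse_wrench(reply: str) -> list[float]:
    """Parse a JSON reply into a 6-D wrench [Fx, Fy, Fz, Tx, Ty, Tz]."""
    obj = json.loads(reply)
    wrench = [float(v) for v in obj["force"]] + [float(v) for v in obj["torque"]]
    if len(wrench) != 6:
        raise ValueError("expected a 6-D wrench")
    return wrench
```

The key design point from the abstract is that the output space is a wrench rather than a trajectory, so the parsed values could be handed directly to a force controller instead of a motion planner.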