🤖 AI Summary
Current vision-language models (VLMs) can ground language in visual input but struggle with non-visual object attributes such as weight. To address this limitation, we propose a perception-action API that combines VLMs and large language models (LLMs) with a set of robot control functions. Prompted with this API and a natural-language query, an LLM generates a program that actively identifies the queried attribute from an input image, replacing the static, passive understanding of conventional VLMs with active perception. In offline tests on the Odd-One-Out dataset, the framework outperforms vanilla VLMs at detecting relative object location, size, and weight. Online tests in household scenes in the AI2-THOR simulator and a demonstration on a real DJI RoboMaster EP robot illustrate the approach in practice. The core contribution is the perception-action API itself, which enables the transition from passive visual recognition to active, embodied reasoning.
📝 Abstract
There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
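To make the idea of an LLM-generated program over a perception-action API more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's interface: the class `MockPerceptionActionAPI`, its methods `detect_objects` and `measure_weight`, the `Detection` fields, and the scene data are all hypothetical, and the robot and VLM are replaced by simple lookups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    name: str    # label a VLM-style detector might return
    x: float     # horizontal center in the image, 0 (left) to 1 (right)
    area: float  # bounding-box area, a visual proxy for size


class MockPerceptionActionAPI:
    """Hypothetical stand-in for the API; not the paper's actual interface."""

    def __init__(self, scene, hidden_weights):
        self._scene = scene
        self._hidden_weights = hidden_weights  # kg, not observable from the image
        self.actions_taken = []

    def detect_objects(self):
        # Passive perception: a real system would query a VLM on the image.
        return list(self._scene)

    def measure_weight(self, det):
        # Active perception: a real robot would physically interact with the
        # object; here we only log the action and return a simulated value.
        self.actions_taken.append(("measure_weight", det.name))
        return self._hidden_weights[det.name]


def find_heaviest(api):
    """Kind of program an LLM might emit for 'which object is heaviest?'."""
    return max(api.detect_objects(), key=api.measure_weight).name


def find_leftmost(api):
    """A purely visual query needs no robot action."""
    return min(api.detect_objects(), key=lambda d: d.x).name


scene = [
    Detection("mug", 0.2, 0.05),
    Detection("box", 0.5, 0.20),
    Detection("dumbbell", 0.8, 0.04),
]
api = MockPerceptionActionAPI(scene, {"mug": 0.3, "box": 0.1, "dumbbell": 2.0})
leftmost = find_leftmost(api)
heaviest = find_heaviest(api)
```

In this toy scene the box is the largest object in the image but the lightest, so a size-based visual guess would pick the wrong answer for weight. The generated program instead triggers one measurement action per detected object, which is the distinction between passive and active perception that the abstract draws.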