AI Summary
While modern vision-language models (VLMs) possess extensive world knowledge, they lack systematic evaluation on embodied reasoning tasks that require precise visual grounding. Method: We introduce Point-It-Out (PIO), a benchmark designed to assess embodied reasoning through precise visual grounding in indoor, kitchen, driving, and robotic manipulation scenarios, using a three-stage hierarchical evaluation protocol: S1 (referred-object localization), S2 (task-driven pointing), and S3 (visual trace prediction). The benchmark integrates real-world images, human-annotated bounding boxes, and task-oriented pointing instructions. Contribution/Results: Evaluating more than ten state-of-the-art VLMs reveals counterintuitive findings: general-purpose models such as GPT-4o underperform several open-source VLMs on precise visual grounding (S1/S2), while models such as MoLMO, which perform well on S1 and S2, degrade significantly on higher-order visual trace prediction (S3). These results highlight a critical bottleneck in current VLMs: the inability to jointly reason about fine-grained visual grounding and spatiotemporal trajectory planning.
Abstract
Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce Point-It-Out (PIO), a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from domains critical to embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.
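To make the three-stage protocol concrete, the following is a minimal Python sketch of how a PIO-style example and its scoring could be represented. The record fields, the point-in-box rule for S1/S2, and the ordered-waypoint rule for S3 are illustrative assumptions for exposition only, not the benchmark's actual data schema or evaluation metrics.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical illustration of the three PIO stages (S1/S2/S3).
# Field names and scoring rules below are assumptions, not the paper's spec.

Point = Tuple[float, float]                 # (x, y) in image pixel coordinates
Box = Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max)

@dataclass
class PIOExample:
    stage: str            # "S1" (referred-object localization), "S2" (task-driven pointing), "S3" (visual trace)
    image_path: str
    instruction: str      # e.g. "Point at the mug you would grasp to pour water"
    gt_boxes: List[Box]   # human-annotated bounding boxes of valid targets

def point_in_box(p: Point, b: Box) -> bool:
    x, y = p
    x0, y0, x1, y1 = b
    return x0 <= x <= x1 and y0 <= y <= y1

def score_pointing(pred: Point, example: PIOExample) -> float:
    """Toy S1/S2 metric: 1.0 if the predicted point lands inside any valid target box."""
    return float(any(point_in_box(pred, b) for b in example.gt_boxes))

def score_trace(pred_trace: List[Point], waypoint_boxes: List[Box]) -> float:
    """Toy S3 metric: fraction of waypoint regions visited, in order, by the predicted trace."""
    hits, idx = 0, 0
    for p in pred_trace:
        if idx < len(waypoint_boxes) and point_in_box(p, waypoint_boxes[idx]):
            hits += 1
            idx += 1
    return hits / max(len(waypoint_boxes), 1)

if __name__ == "__main__":
    ex = PIOExample("S2", "kitchen_001.jpg",
                    "Point at the knife safest to hand over",
                    [(100, 50, 180, 120)])
    print(score_pointing((140.0, 90.0), ex))                                   # 1.0
    print(score_trace([(10, 10), (140, 90)],
                      [(0, 0, 20, 20), (100, 50, 180, 120)]))                  # 1.0
```

The sketch is only meant to show why S3 is strictly harder than S1/S2 under this kind of protocol: a model must both ground each region correctly and order its predicted points into a coherent trace, which matches the reported drop of strong pointing models on S3.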