🤖 AI Summary
Existing benchmarks for language-guided pointing focus narrowly on referential object localization and inadequately characterize multimodal models' pointing capabilities across diverse reasoning scenarios. Method: We introduce PointArena, a comprehensive language-guided pointing evaluation platform for multimodal grounding. It comprises Point-Bench, a curated dataset of roughly 1,000 pointing tasks spanning five reasoning categories (e.g., referring localization, spatial reasoning, causal inference); Point-Battle, an interactive blind-test arena for pairwise model comparison; and Point-Act, a ROS-driven real-robot execution system. Together these form a three-stage closed-loop evaluation paradigm: "Recognition → Interactive Voting → Physical Execution." Contribution/Results: Experiments show that supervised pointing fine-tuning improves accuracy by 23.6% on average; strong cross-stage correlations (ρ > 0.89) confirm pointing as a critical bridge from abstract reasoning to embodied action; Molmo-72B achieves top performance, with proprietary models closing the gap. PointArena enables fair, reproducible comparison of both open- and closed-source models.
📝 Abstract
Pointing serves as a fundamental and intuitive mechanism for grounding language within visual contexts, with applications spanning robotics, assistive technologies, and interactive AI systems. While recent multimodal models have started to support pointing capabilities, existing benchmarks typically focus only on referential object localization tasks. We introduce PointArena, a comprehensive platform for evaluating multimodal pointing across diverse reasoning scenarios. PointArena comprises three components: (1) Point-Bench, a curated dataset containing approximately 1,000 pointing tasks across five reasoning categories; (2) Point-Battle, an interactive, web-based arena facilitating blind, pairwise model comparisons, which has already gathered over 4,500 anonymized votes; and (3) Point-Act, a real-world robotic manipulation system allowing users to directly evaluate multimodal model pointing capabilities in practical settings. We conducted extensive evaluations of both state-of-the-art open-source and proprietary multimodal models. Results indicate that Molmo-72B consistently outperforms other models, though proprietary models increasingly demonstrate comparable performance. Additionally, we find that supervised training specifically targeting pointing tasks significantly enhances model performance. Across our multi-stage evaluation pipeline, we also observe strong correlations, underscoring the critical role of precise pointing capabilities in enabling multimodal models to effectively bridge abstract reasoning with concrete, real-world actions. Project page: https://pointarena.github.io/