Spot The Ball: A Benchmark for Visual Social Inference

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a critical bottleneck in vision-language models (VLMs): their inability to perform visual social reasoning, i.e., inferring hidden scene elements (e.g., occluded or removed balls) from subtle behavioral cues such as gaze direction and body posture. Method: The authors introduce Spot The Ball, the first systematic benchmark for evaluating VLMs' capacity to localize target objects in sports scenes using social cues. The evaluation combines a scalable test-generation pipeline with leading VLMs (Gemini, GPT-4V, LLaVA, Qwen-VL) and three prompting strategies, with human performance serving as a strong baseline. Results: Humans achieve 20–34% accuracy, significantly outperforming all VLMs (≤17%). This gap reveals the models' overreliance on spatial heuristics and fundamental deficits in modeling structured social behavior. The work quantifies, for the first time, the human–machine disparity in visual social reasoning, establishing a novel benchmark and concrete directions for next-generation embodied social AI.

📝 Abstract
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20–34%) than models (≤ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
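The abstract reports localization accuracy for the ball-localization task. As a minimal sketch of how such a metric could be computed, the snippet below scores a prediction as correct when it falls within a pixel tolerance of the ground-truth location; the tolerance-radius criterion and the numbers are illustrative assumptions, not the paper's stated scoring protocol.

```python
import math

def is_hit(pred, truth, tol):
    """True if the predicted (x, y) ball location lies within `tol`
    pixels of the ground truth, by Euclidean distance. The radius
    criterion is an assumption for illustration only."""
    return math.dist(pred, truth) <= tol

def localization_accuracy(preds, truths, tol=50):
    """Fraction of images whose prediction counts as a hit."""
    hits = sum(is_hit(p, t, tol) for p, t in zip(preds, truths))
    return hits / len(preds)

# Toy example with made-up coordinates: two of three predictions
# land within 50 px of the ground truth.
preds = [(120, 340), (600, 200), (50, 50)]
truths = [(130, 330), (400, 220), (55, 48)]
print(localization_accuracy(preds, truths))  # 2/3 ≈ 0.667
```

Under a scheme like this, the reported human range (20–34%) versus model ceiling (≤ 17%) would simply be this fraction computed over each group's predictions.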
Problem

Research questions and friction points this paper is trying to address.

Benchmark evaluates visual social inference in AI models
Models localize removed sports ball using behavioral cues
Reveals human-model gap in leveraging gaze and pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates visual social inference in VLMs
Pipeline generates scalable test items with human baselines
Models use spatial heuristics instead of social cues
Neha Balamurugan
Department of Computer Science, Stanford University
Sarah Wu
Department of Psychology, Stanford University
Adam Chun
Department of Computer Science, Stanford University
Gabe Gaw
Department of Computer Science, Stanford University
Cristobal Eyzaguirre
Ph.D. Student, Stanford University
Tobias Gerstenberg
Stanford University
Cognitive Science · Causal Cognition · Moral Psychology · Mental Simulation · Counterfactual Reasoning