Spot The Ball: A Benchmark for Visual Social Inference

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a critical bottleneck in vision-language models (VLMs): their inability to perform visual social reasoning, i.e., inferring hidden scene elements (e.g., occluded or removed balls) from subtle behavioral cues such as gaze direction and body posture. Method: The authors introduce Spot The Ball, the first systematic benchmark for evaluating VLMs' capacity to localize target objects in sports scenes using social cues. The evaluation combines a scalable test-generation pipeline with leading VLMs (Gemini, GPT-4V, LLaVA, Qwen-VL) and three prompting strategies, with human performance serving as a strong baseline. Results: Humans achieve 20–34% accuracy, significantly outperforming all VLMs (≤17%). This gap reveals the models' overreliance on spatial heuristics and fundamental deficits in modeling structured social behavior. The work quantifies, for the first time, the human–machine disparity in visual social reasoning, establishing a novel benchmark and concrete directions for next-generation embodied social AI.

📝 Abstract
Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20–34%) than models (≤ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics, such as guessing near the image center or nearby players, while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.
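The abstract reports localization accuracy for the ball-localization task. As a minimal sketch of how such a metric could be computed, the snippet below scores a prediction as correct when it falls within a pixel tolerance of the ground-truth location; the tolerance-radius criterion and the numbers are illustrative assumptions, not the paper's stated scoring protocol.

```python
import math

def is_hit(pred, truth, tol):
    """True if the predicted (x, y) ball location lies within `tol`
    pixels of the ground truth, by Euclidean distance. The radius
    criterion is an assumption for illustration only."""
    return math.dist(pred, truth) <= tol

def localization_accuracy(preds, truths, tol=50):
    """Fraction of images whose prediction counts as a hit."""
    hits = sum(is_hit(p, t, tol) for p, t in zip(preds, truths))
    return hits / len(preds)

# Toy example with made-up coordinates: two of three predictions
# land within 50 px of the ground truth.
preds = [(120, 340), (600, 200), (50, 50)]
truths = [(130, 330), (400, 220), (55, 48)]
print(localization_accuracy(preds, truths))  # 2/3 ≈ 0.667
```

Under a scheme like this, the reported human range (20–34%) versus model ceiling (≤ 17%) would simply be this fraction computed over each group's predictions.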
Problem

Research questions and friction points this paper is trying to address.

Benchmark evaluates visual social inference in AI models
Models localize removed sports ball using behavioral cues
Reveals human-model gap in leveraging gaze and pose
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates visual social inference in VLMs
Pipeline generates scalable test items with human baselines
Models use spatial heuristics instead of social cues
Neha Balamurugan
Department of Computer Science, Stanford University
Sarah Wu
Department of Psychology, Stanford University
Adam Chun
Department of Computer Science, Stanford University
Gabe Gaw
Department of Computer Science, Stanford University
Cristobal Eyzaguirre
Ph.D. Student, Stanford University
Tobias Gerstenberg
Stanford University
Cognitive Science · Causal Cognition · Moral Psychology · Mental Simulation · Counterfactual Reasoning