EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks for social group detection are limited by their reliance on single-scenario datasets and third-person perspectives, which hinders the evaluation of model generalization in real-world, culturally diverse environments. To address this gap, this work proposes EgoGroups, the first egocentric, globally sourced benchmark for social group detection, encompassing data from 65 countries across varying crowd densities, weather conditions, and times of day. It provides dense annotations of individuals and groups alongside rich geographic and contextual metadata. EgoGroups enables systematic evaluation of vision-language models (VLMs), large language models (LLMs), and supervised approaches under both zero-shot and supervised settings. Experiments reveal that VLMs and LLMs can outperform supervised baselines in zero-shot scenarios, with model performance significantly influenced by crowd density and cultural region, offering new insights and a robust foundation for developing socially aware agents.

๐Ÿ“ Abstract
Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
Problem

Research questions and friction points this paper is trying to address.

social group detection
benchmark
first-person view
scene diversity
cultural context
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoGroups
first-person vision
social group detection
visual-language models
cross-cultural benchmark