EgoGroups: A Benchmark For Detecting Social Groups of People in the Wild

📅 2026-03-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks for social group detection are limited by their reliance on single-scenario datasets and third-person perspectives, which hinders the evaluation of model generalization in real-world, culturally diverse environments. To address this gap, this work proposes EgoGroups, the first egocentric, globally sourced benchmark for social group detection, encompassing data from 65 countries across varying crowd densities, weather conditions, and times of day. It provides dense annotations of individuals and groups alongside rich geographic and contextual metadata. EgoGroups enables systematic evaluation of vision-language models (VLMs), large language models (LLMs), and supervised approaches under both zero-shot and supervised settings. Experiments reveal that VLMs and LLMs can outperform supervised baselines in zero-shot scenarios, with model performance significantly influenced by crowd density and cultural region, offering new insights and a robust foundation for developing socially aware agents.

๐Ÿ“ Abstract
Social group detection, or the identification of humans involved in reciprocal interpersonal interactions (e.g., family members, friends, and customers and merchants), is a crucial component of social intelligence needed for agents transacting in the world. The few existing benchmarks for social group detection are limited by low scene diversity and reliance on third-person camera sources (e.g., surveillance footage). Consequently, these benchmarks generally lack real-world evaluation on how groups form and evolve in diverse cultural contexts and unconstrained settings. To address this gap, we introduce EgoGroups, a first-person view dataset that captures social dynamics in cities around the world. EgoGroups spans 65 countries covering low, medium, and high-crowd settings under four weather/time-of-day conditions. We include dense human annotations for person and social groups, along with rich geographic and scene metadata. Using this dataset, we performed an extensive evaluation of state-of-the-art VLM/LLMs and supervised models on their group detection capabilities. We found several interesting findings, including VLMs and LLMs can outperform supervised baselines in a zero-shot setting, while crowd density and cultural regions clearly influence model performance.
Problem

Research questions and friction points this paper is trying to address.

social group detection
benchmark
first-person view
scene diversity
cultural context
Innovation

Methods, ideas, or system contributions that make the work stand out.

EgoGroups
first-person vision
social group detection
visual-language models
cross-cultural benchmark