MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

📅 2025-09-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of understanding group-level social interactions in urban public spaces by introducing the task of *Social Group Region Detection*, which aims to localize image regions defined by abstract interpersonal relationships rather than by object categories, using cues such as proximity and co-movement. The authors propose MINGLE, a modular three-stage framework: (1) off-the-shelf human detection and depth estimation to recover the 3D spatial layout of the scene; (2) pairwise social affiliation classification using a vision-language model (VLM); and (3) lightweight spatial aggregation that merges affiliated individuals into group regions. The contributions are the task definition, the MINGLE framework, and a dataset of 100K urban street-view images with bounding boxes and labels for both individuals and socially interacting groups, whose annotations combine human-created labels with MINGLE pipeline outputs.
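As a rough illustration of how a detection box and a depth estimate can be combined into a 3D layout, the sketch below back-projects each box centre with a standard pinhole camera model. It is not taken from the paper: the function names, the use of the box centre, the camera intrinsics (`fx`, `fy`, `cx`, `cy`), and the assumption that depth is metric are all hypothetical choices made for this example. Many monocular depth estimators output relative depth, in which case the resulting distances are only meaningful up to scale.

```python
import math

def backproject(box, depth, fx, fy, cx, cy):
    """Map a 2D box centre plus a depth value to a 3D point in the camera frame.

    box is (x1, y1, x2, y2) in pixels; depth is the estimated depth at the
    box centre. Pinhole model; all parameters are assumptions of this sketch.
    """
    x1, y1, x2, y2 = box
    u = (x1 + x2) / 2.0  # horizontal centre of the box, in pixels
    v = (y1 + y2) / 2.0  # vertical centre of the box, in pixels
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def distance_3d(p, q):
    """Euclidean distance between two 3D points."""
    return math.dist(p, q)

# Two people at the same estimated depth, 100 px apart horizontally.
person_a = backproject((90, 40, 110, 60), 5.0, fx=500, fy=500, cx=100, cy=50)
person_b = backproject((190, 40, 210, 60), 5.0, fx=500, fy=500, cx=100, cy=50)
separation = distance_3d(person_a, person_b)
```

A distance like `separation` could serve as one input to a later stage, for example to decide which pairs of people are worth passing to the VLM, though the source does not say whether MINGLE does this.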

📝 Abstract

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement: semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
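The abstract does not specify the aggregation algorithm in stage (3). One plausible reading, sketched here purely as an assumption, treats each pair classified as affiliated as an edge between two people, takes the connected components of that graph as groups, and reports the box enclosing each group's members as its region. The function name `group_regions`, the union-find approach, and the rule that a group needs at least two members are hypothetical choices for this example.

```python
def group_regions(boxes, affiliated_pairs):
    """Merge pairwise-affiliated people into groups and return group boxes.

    boxes: list of (x1, y1, x2, y2) person boxes.
    affiliated_pairs: list of (i, j) index pairs judged socially affiliated
    (for example by a VLM). Connected components via union-find; this is an
    illustrative stand-in, not the paper's algorithm.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in affiliated_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)

    regions = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a lone person is not a social group in this sketch
        xs1, ys1, xs2, ys2 = zip(*(boxes[i] for i in members))
        regions.append({
            "members": members,
            "box": (min(xs1), min(ys1), max(xs2), max(ys2)),
        })
    return regions

# Persons 0, 1 and 3 are linked transitively; person 2 stands alone.
boxes = [(0, 0, 10, 20), (8, 0, 18, 20), (100, 0, 110, 20), (15, 2, 25, 22)]
regions = group_regions(boxes, [(0, 1), (1, 3)])
```

Because affiliation is applied transitively here, one mistaken pairwise label can merge two unrelated groups; a real system might add a spatial or depth check before merging.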

Problem

Research questions and friction points this paper is trying to address.

Detecting social group interactions from urban images

Interpreting subtle visual cues beyond object detection

Spatially grounding regions defined by interpersonal relations

Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM-based reasoning for social affiliation

Lightweight spatial aggregation algorithm

Modular three-stage pipeline integration

🔎 Similar Papers

UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction