COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication

📅 2025-06-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the dynamic contextual dependency mechanism of vision-language models (VLMs) in referring expression comprehension. To this end, we introduce COOCO, the first benchmark dataset designed for scene-object semantic consistency analysis, featuring controllable scene-object matching gradients and diverse image perturbations. Through attention visualization, cross-layer contextual sensitivity analysis, and multimodal consistency modeling, we find that VLMs do not rely on context statically; instead, they adaptively balance local target features and global scene cues, enhancing contextual utilization significantly under high semantic alignment or target degradation. Notably, mid-level visual encoders exhibit peak sensitivity to scene-guided target localization under moderate noise. This work provides the first empirical characterization of context-adaptive mechanisms in VLM-based referring understanding, establishing a new paradigm for interpretable multimodal reasoning.

๐Ÿ“ Abstract
Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the *Common Objects Out-of-Context (COOCO)* dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.
Problem

Research questions and friction points this paper is trying to address.

Investigates whether VLMs use scene context when referring to objects
Tests VLMs under varying degrees of scene-object congruency and noise
Analyzes VLM attention patterns for dynamic context usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the COOCO dataset with controlled scene-object congruency
Tests VLMs' adaptive use of scene context under perturbations
Analyzes attention dynamics during object categorisation