🤖 AI Summary
This study addresses the tendency of visual grounding models to produce near-miss predictions under semantically mismatched referring expressions, which undermines their reliability and interpretability. From the perspective of mechanistic interpretability, the work investigates whether embedding space anisotropy contributes to counterfactual failures and introduces a similarity-controlled counterfactual caption generation protocol. This protocol systematically perturbs either object or contextual components to analyze model behavior. Leveraging Transformer-based architectures such as TransVG and SwimVG, and combining geometric analysis of embeddings with controlled textual perturbations, experiments reveal no significant correlation between cosine similarity and near-miss behavior. These findings suggest that anisotropy is not the primary cause, pointing instead to the need for deeper exploration of finer-grained geometric properties of the embedding space.
📝 Abstract
Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability.
Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.
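The similarity-controlled perturbation step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the interval boundaries are all assumptions chosen for illustration. In practice the vectors would come from the text encoder of the model under study (BERT for TransVG, CLIP for SwimVG). The sketch scores each candidate replacement phrase by its cosine similarity to the original phrase and assigns it to a predefined half-open similarity interval.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Interval = Tuple[float, float]  # half-open interval [low, high)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def bin_candidates(
    original: Vector,
    candidates: Dict[str, Vector],
    intervals: List[Interval],
) -> Dict[Interval, List[str]]:
    """Group candidate phrases by similarity to the original phrase.

    Candidates whose similarity falls outside every interval are dropped.
    """
    bins: Dict[Interval, List[str]] = {iv: [] for iv in intervals}
    for phrase, embedding in candidates.items():
        sim = cosine_similarity(original, embedding)
        for low, high in intervals:
            if low <= sim < high:
                bins[(low, high)].append(phrase)
                break
    return bins


# Toy embeddings (hypothetical values, not real encoder outputs).
original_embedding = [1.0, 0.0]
candidate_embeddings = {
    "near": [1.0, 0.1],
    "mid": [1.0, 1.0],
    "far": [0.0, 1.0],
}
# Hypothetical interval boundaries; the last upper bound exceeds 1.0
# so that a similarity of exactly 1.0 is still captured.
similarity_intervals = [(0.0, 0.5), (0.5, 0.9), (0.9, 1.01)]

binned = bin_candidates(original_embedding, candidate_embeddings, similarity_intervals)
for interval, phrases in binned.items():
    print(interval, phrases)
```

Sampling one replacement per interval would then yield counterfactual captions whose distance from the original expression is controlled, so that grounding behavior can be compared across intervals.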