🤖 AI Summary
This study addresses the difficulty multimodal models have in reliably binding words in a prompt (e.g., “image”) to the corresponding input modality, which undermines their ability to track where a piece of information came from. The work defines and empirically investigates the “source-modality monitoring” problem, using an evaluation paradigm based on target-modality information retrieval. By combining syntactic manipulation with semantic perturbation, the authors assess binding mechanisms across eleven vision-language models. They find that both syntactic and semantic signals matter, but that semantic signals tend to outweigh syntactic ones when the modalities are highly distinct distributionally. The authors discuss what these findings imply for model robustness and for increasingly multimodal agentic systems.
📝 Abstract
We define and investigate source-modality monitoring: the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like “image” in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that semantic signals tend to outweigh syntactic ones when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness and for increasingly multimodal agentic systems.