🤖 AI Summary
Vision-language models (VLMs) lack systematic evaluation of cross-modal entity linking, i.e., the precise matching of objects and their attributes between images and text. Method: the authors formally define this capability and introduce MATE, the first fine-grained, object-level diagnostic benchmark (5.5K image-text pairs), which evaluates linking via question-answering-based cross-modal attribute retrieval. They incorporate chain-of-thought (CoT) prompting, design scenarios of controllable complexity, and collect a human baseline. Results: state-of-the-art VLMs substantially underperform humans, especially in multi-object settings, where accuracy drops sharply. CoT prompting yields only marginal gains, suggesting that current VLM architectures do not effectively model cross-modal entity linking. This work establishes a new task and a critical benchmark for assessing fine-grained multimodal understanding in VLMs.
📝 Abstract
Cross-modal entity linking refers to the ability to align entities and their attributes across different modalities. While cross-modal entity linking is a fundamental skill for real-world applications such as multimodal code generation, fake news detection, and scene understanding, it has not been thoroughly studied in the literature. In this paper, we introduce a new task and benchmark to address this gap. Our benchmark, MATE, consists of 5.5K evaluation instances featuring visual scenes aligned with their textual representations. To evaluate cross-modal entity linking performance, we design a question-answering task that requires retrieving one attribute of an object in one modality based on a unique attribute of that object in the other modality. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans, particularly as the number of objects in the scene increases. Our analysis also shows that, while chain-of-thought prompting can improve VLM performance, models remain far from human-level proficiency. These findings highlight the need for further research on cross-modal entity linking and show that MATE is a strong benchmark to support that progress.
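To make the task format concrete, the retrieval step described above can be sketched as follows. This is an illustrative toy example, not the benchmark's actual schema: all field names (`scene_text`, `question`, `answer`) and the helper `link_by_attribute` are our assumptions for exposition.

```python
# Hypothetical sketch of a MATE-style evaluation instance. The textual
# modality lists objects with attributes; the question anchors on an
# attribute that uniquely identifies one object (here, its color) and
# asks for another attribute of that same object (here, its name).
instance = {
    "scene_text": [
        {"id": "obj_1", "shape": "cube", "color": "red", "name": "Alpha"},
        {"id": "obj_2", "shape": "sphere", "color": "blue", "name": "Beta"},
    ],
    "question": "What is the name of the blue object?",
    "answer": "Beta",
}


def link_by_attribute(objects, attribute, value, target_attribute):
    """Toy cross-modal linking: locate the object whose `attribute`
    equals `value`, then return its `target_attribute`."""
    for obj in objects:
        if obj.get(attribute) == value:
            return obj.get(target_attribute)
    return None  # no object matches the anchoring attribute
```

A model solving the benchmark must perform this linking implicitly: ground the anchoring attribute in one modality, identify the corresponding entity in the other, and read off the queried attribute.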