🤖 AI Summary
This study investigates capability bottlenecks of vision-language models (VLMs) in cross-modal entity state tracking: the joint updating and long-term maintenance of entity states across text and image modalities. To probe these bottlenecks, we introduce MET-Bench, the first dedicated multimodal entity tracking benchmark, grounded in two structured domains, chess and shell games. It features cross-modal state annotations, controllable-difficulty trajectory generation, and a behavioral attribution analysis framework. Key findings show that VLMs underperform significantly on image-modality tracking relative to text-modality tracking (an average gap above 40%), primarily due to deficits in visual reasoning rather than low-level perception. Chain-of-thought prompting yields modest improvements, yet long-horizon, cross-modal tracking remains a fundamental limitation. This work establishes a reproducible diagnostic benchmark and an analytical paradigm for advancing multimodal representation learning and reasoning research.
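To make "controllable-difficulty trajectory generation" concrete, here is a minimal Python sketch of what it might look like in the shell-game domain. The names (`ShellGameTrajectory`, `generate_trajectory`) and the difficulty parameterization (number of cups and number of swaps) are illustrative assumptions, not the benchmark's actual API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ShellGameTrajectory:
    """One shell-game episode: an initial ball position plus a swap sequence.

    Hypothetical structure for illustration; MET-Bench's actual data
    format may differ.
    """
    num_cups: int
    start: int                                  # cup initially hiding the ball
    swaps: list = field(default_factory=list)   # list of (i, j) cup swaps

    def final_position(self) -> int:
        """Replay the swaps to compute the ground-truth final ball position."""
        pos = self.start
        for i, j in self.swaps:
            if pos == i:
                pos = j
            elif pos == j:
                pos = i
        return pos

def generate_trajectory(num_cups: int, num_swaps: int,
                        seed: int = 0) -> ShellGameTrajectory:
    """Difficulty is controlled by horizon length (num_swaps) and state-space
    size (num_cups); longer horizons stress long-term state maintenance."""
    rng = random.Random(seed)
    traj = ShellGameTrajectory(num_cups=num_cups,
                               start=rng.randrange(num_cups))
    for _ in range(num_swaps):
        i, j = rng.sample(range(num_cups), 2)  # two distinct cup positions
        traj.swaps.append((i, j))
    return traj

if __name__ == "__main__":
    traj = generate_trajectory(num_cups=3, num_swaps=10, seed=42)
    print("swaps:", traj.swaps)
    print("ball ends under cup", traj.final_position())
```

Each swap in such a trajectory could then be rendered either as text ("swap cups 0 and 2") or as an image of the board state, which is how a cross-modal benchmark would present the same state update in both modalities.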
📝 Abstract
Entity tracking is a fundamental challenge in natural language understanding, requiring models to maintain coherent representations of entities as their states evolve. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using two structured domains, Chess and the Shell Game, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based tracking, and show that this gap stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet substantial limitations remain, especially in long-horizon multimodal scenarios. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.