🤖 AI Summary
This study addresses the over-reliance on the textual modality in vision language models (VLMs) for multiple-choice video question answering (VQA). We propose a Shapley-value-based framework that jointly quantifies multimodal feature attribution and modality-wise contribution. Our method treats video frames and whole textual elements as equal, comparable features, defines a contribution metric over three interacting modalities (video, question, and answer), and allows both features and modalities to be defined arbitrarily. Experiments across six VLMs of varying context lengths and four representative datasets reveal that performance depends primarily on identifying and ignoring distractor options rather than on genuine cross-modal reasoning, and that the video modality contributes less than text, indicating that the task implicitly degenerates into “textual distractor identification.” This work provides a method and an empirical comparison for interpretable evaluation of multimodal models.
📝 Abstract
As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, a significant amount of research is devoted to increasing both the difficulty of video question answering (VQA) datasets and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. Given the complexity that multi-modal models introduce, how do we measure whether we are heading in the right direction? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare $6$ VLMs of varying context lengths on $4$ representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model's ability to ignore distractors. Code available at https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
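To make the idea of joint feature attributions and modality scores concrete, the following is a minimal sketch of exact Shapley values over a handful of features, each assigned to one of the three modalities. It is not the authors' implementation: the feature names, the toy value function `toy_value` (a stand-in for a model's score on the correct answer), and the choice to define a modality score as that modality's share of total absolute attribution are all assumptions made for illustration. Exact enumeration is also only feasible for small feature counts; real settings would likely need sampling.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley value of each feature for a set-valued function value_fn."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            # Weight of a coalition of size k that excludes f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of adding f to the coalition s.
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi


def modality_shares(phi, modality_of):
    """Assumed modality score: share of total absolute attribution per modality."""
    totals = {}
    for f, v in phi.items():
        m = modality_of[f]
        totals[m] = totals.get(m, 0.0) + abs(v)
    norm = sum(totals.values())
    return {m: t / norm for m, t in totals.items()} if norm else totals


# Hypothetical features: two video frames, the question, and two answer options.
MODALITY_OF = {
    "frame_0": "video",
    "frame_1": "video",
    "question": "question",
    "answer_a": "answer",  # assumed correct option
    "answer_b": "answer",  # assumed distractor
}

# Hypothetical per-feature effects on the score of the correct answer.
WEIGHTS = {
    "frame_0": 0.05,
    "frame_1": 0.05,
    "question": 0.1,
    "answer_a": 0.3,
    "answer_b": -0.1,
}


def toy_value(present):
    """Toy stand-in for a model score given the subset of unmasked features."""
    score = sum(w for f, w in WEIGHTS.items() if f in present)
    # Interaction term: the question only helps fully when the correct option is visible.
    if "question" in present and "answer_a" in present:
        score += 0.2
    return score


phi = shapley_values(list(MODALITY_OF), toy_value)
shares = modality_shares(phi, MODALITY_OF)
print(phi)
print(shares)
```

In this toy setup the interaction term is split equally between the question and the correct answer, and the attributions sum to the difference between the score with all features present and the score with none, which is the efficiency property of Shapley values. A distractor with a negative attribution still counts toward its modality's share because the share uses absolute values; that is one possible design choice, not necessarily the one used in the paper.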