KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video question-answering methods struggle to model how visual dynamics causally influence musical structure in music videos. To address this gap, the authors introduce KARMA-MV, a causal question-answering benchmark designed specifically for music videos, which requires models to integrate temporal audio-visual cues across reasoning, prediction, and counterfactual questions. The questions are generated and validated automatically with large language models rather than annotated by hand. The authors also propose a causal knowledge graph (CKG) that augments vision-language models with structured retrieval of cross-modal dependencies. Experiments show that CKG grounding consistently improves the causal reasoning of state-of-the-art vision-language and large language models, with the largest gains for smaller models, moving music-video understanding from correlation-based toward causality-aware reasoning.

📝 Abstract

While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for smaller models -- establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.
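
The abstract does not say how the CKG is stored or queried, so the following is only a minimal sketch of what "structured retrieval of cross-modal dependencies" could look like for a multiple-choice question. Everything in it is an assumption made for illustration: the `CausalEdge` type, the keyword-overlap scoring, the example edges, and the prompt layout are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalEdge:
    """Hypothetical CKG edge: a visual cause linked to a musical effect."""
    cause: str     # visual event, e.g. "rapid scene cuts"
    effect: str    # musical event, e.g. "tempo increase"
    weight: float  # assumed confidence score for the edge


def tokenize(text):
    """Lowercase the text, drop basic punctuation, return the set of words."""
    for ch in "?,.":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve_edges(question, edges, k=2):
    """Return up to k edges sharing words with the question.

    Edges are ranked by word overlap, then by weight. This keyword
    matching is a stand-in for whatever retrieval the authors use.
    """
    q_tokens = tokenize(question)
    scored = []
    for edge in edges:
        overlap = len(q_tokens & (tokenize(edge.cause) | tokenize(edge.effect)))
        if overlap > 0:
            scored.append((overlap, edge.weight, edge))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [edge for _, _, edge in scored[:k]]


def build_prompt(question, options, edges, k=2):
    """Prepend the retrieved causal edges to a multiple-choice question."""
    lines = ["Causal context:"]
    lines += [f"- {e.cause} -> {e.effect}" for e in retrieve_edges(question, edges, k)]
    lines.append(f"Question: {question}")
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


# Illustrative data only; not drawn from the KARMA-MV dataset.
example_edges = [
    CausalEdge("rapid scene cuts", "tempo increase", 0.9),
    CausalEdge("slow camera pan", "sustained chord", 0.6),
    CausalEdge("lighting fades out", "volume drop", 0.8),
]
example_question = "Why does the tempo increase after the rapid scene cuts?"
example_prompt = build_prompt(
    example_question,
    ["The cuts accelerate", "The singer leaves", "The lights dim"],
    example_edges,
)
```

The resulting `example_prompt` string would be passed to a VLM or LLM in place of the bare question. The sketch only illustrates the general idea of grounding a prompt in explicit causal structure; the paper's actual graph construction and retrieval may differ substantially.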

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

music videos

video question answering

audio-visual understanding

cross-modal dependencies

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

music videos

vision-language models