🤖 AI Summary
Existing multimodal benchmarks struggle to disentangle a model's genuine modality dependence from information asymmetry, and they do not evaluate a model's ability to calibrate abstention under cross-modal conflicts. This work proposes OMD-Bench, which introduces systematic perturbations in video, audio, and text to deliberately break modality consistency, enabling the first decoupled diagnosis of modality dependence, robustness, and uncertainty calibration. Constructed from 27 anchor points under eight perturbation conditions, the benchmark comprises 4,080 evaluation instances. Using multimodal perturbations with zero-shot and chain-of-thought prompting, the study reveals that models tend to over-abstain when two modalities are corrupted yet under-abstain when all three are degraded, while maintaining high confidence despite perturbations. Chain-of-thought prompting improves alignment in abstention behavior but exacerbates overconfidence.
📝 Abstract
Existing omni-modal benchmarks attempt to measure modality-specific contributions, but their measurements are confounded: naturally co-occurring modalities carry correlated yet unequal information, making it unclear whether results reflect true modality reliance or information asymmetry. We introduce OMD-Bench, in which all modalities are initially congruent: each presents the same anchor, an object or event independently perceivable through video, audio, and text, which we then systematically corrupt to isolate each modality's contribution. We also evaluate calibrated abstention: whether models appropriately refrain from answering when evidence is conflicting. The benchmark comprises 4,080 instances spanning 27 anchors across eight corruption conditions. Evaluating ten omni-modal models under zero-shot and chain-of-thought prompting, we find that models over-abstain when two modalities are corrupted yet under-abstain severely when all three are, while maintaining high confidence (~60-100%) even under full corruption. Chain-of-thought prompting improves abstention alignment with human judgment but amplifies overconfidence rather than mitigating it. OMD-Bench thus provides a diagnostic for modality reliance, robustness to cross-modal inconsistency, and uncertainty calibration in omni-modal systems.