🤖 AI Summary
Existing multi-modal benchmarks focus primarily on the visual modality and do not test whether audio-visual large language models (AVLLMs) can calibrate their responses when presented with perturbed inputs. To address this, we introduce AVTrustBench, a trustworthiness assessment benchmark comprising 600K samples spanning 9 carefully designed tasks that evaluate AVLLMs along three dimensions: adversarial attack, compositional reasoning, and modality-specific dependency. Extensive evaluation of 13 state-of-the-art AVLLMs on this benchmark reveals that most models fall significantly short of human-like comprehension, exposing a substantial gap and offering insights for future research. To mitigate these limitations, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, which yields gains of up to 30.19% across all 9 tasks. Both AVTrustBench and the CAVPref implementation will be publicly released.
📝 Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks primarily assess the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, no existing benchmark investigates the ability of audio-visual large language models (AVLLMs) to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks that evaluate AVLLMs along three distinct dimensions: adversarial attack, compositional reasoning, and modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of human-like comprehension, offering valuable insights for future research directions. To alleviate these limitations, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, obtaining gains of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.
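The abstract does not spell out the CAVPref objective, so the following is only a minimal sketch of the general preference-optimization family it belongs to: a DPO-style loss over preferred vs. dispreferred answers conditioned on audio-visual inputs. The function name, arguments, and the `beta` temperature are illustrative assumptions, not the paper's actual formulation, and the calibration terms that distinguish CAVPref are not reproduced here.

```python
import torch
import torch.nn.functional as F

def av_preference_loss(
    policy_chosen_logps,    # log p_theta(preferred answer | audio, video, question)
    policy_rejected_logps,  # log p_theta(dispreferred answer | audio, video, question)
    ref_chosen_logps,       # same log-probabilities under a frozen reference model
    ref_rejected_logps,
    beta: float = 0.1,      # temperature controlling deviation from the reference
):
    """Generic DPO-style preference loss over audio-visual QA pairs.

    Illustrative stand-in only; this is NOT the paper's CAVPref objective.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between preferred and dispreferred responses.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

if __name__ == "__main__":
    # Dummy per-sequence log-probabilities for a batch of 4 samples.
    pol_c, pol_r = torch.randn(4), torch.randn(4)
    ref_c, ref_r = torch.randn(4), torch.randn(4)
    print(av_preference_loss(pol_c, pol_r, ref_c, ref_r))
```

In practice, the preferred and dispreferred responses would be scored under both the trained AVLLM and a frozen reference copy, with the loss averaged over a batch of audio-visual question-answer pairs; how CAVPref calibrates this signal across modalities is described in the paper itself.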