🤖 AI Summary
This work addresses the lack of a unified framework for fine-grained probabilistic reasoning under uncertainty across text, audio, and video. It introduces Unified Multimodal Uncertain Inference (UMUI), a task in which models must produce calibrated probability estimates for a hypothesis given a premise in any single modality or combination of modalities. The authors propose CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration with distribution-based confidence probing, and they curate a human-annotated evaluation set covering audio, visual, and audiovisual settings. The authors report that their 3B-parameter model matches or exceeds baselines of up to 32B parameters across all modalities.
📝 Abstract
We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination of modalities. While uncertain inference has been explored in text, its extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning within or across those modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model matches or outperforms baselines of up to 32B parameters across all modalities.
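The abstract does not spell out how CLUE computes its predictions, so the following is only an illustrative sketch of the general ideas it names, not the authors' implementation. It assumes (hypothetically) that a model exposes logits over a small set of discrete confidence levels, that a teacher's soft label is the mean of several sampled judgments, and that predictions are scored against human scalar annotations with mean squared error. All function names and the choice of levels are assumptions introduced for illustration.

```python
import math


def expected_probability(logits, levels):
    """Softmax the logits over candidate confidence levels and return
    the expectation, giving one scalar probability estimate.

    Assumption: `levels` are probabilities in [0, 1], one per logit.
    """
    if len(logits) != len(levels):
        raise ValueError("logits and levels must have the same length")
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return sum(w / total * p for w, p in zip(weights, levels))


def self_consistent_target(samples):
    """Average several sampled teacher judgments into one soft label.

    Assumption: self-consistency is approximated by a plain mean.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return sum(samples) / len(samples)


def mean_squared_error(predictions, references):
    """Mean squared error between predicted and annotated probabilities."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predictions, references)) / len(predictions)


if __name__ == "__main__":
    levels = [0.0, 0.25, 0.5, 0.75, 1.0]      # hypothetical confidence bins
    logits = [-2.0, -1.0, 0.5, 1.5, 0.0]      # hypothetical model outputs
    prediction = expected_probability(logits, levels)
    target = self_consistent_target([0.6, 0.7, 0.65])
    print(f"prediction={prediction:.3f} target={target:.3f}")
    print(f"mse={mean_squared_error([prediction], [target]):.4f}")
```

Taking an expectation over the full distribution, rather than the single most likely level, preserves graded uncertainty in the output, which is the property a scalar probability judgment task rewards.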