🤖 AI Summary
Blind and low-vision (BLV) users face safety and social risks because they cannot visually detect errors in the outputs of multimodal large language models (MLLMs).
Method: We propose the first non-visual method for assessing the credibility of MLLM outputs, built on a systematically designed space of multi-model response variations. It employs three accessible presentation styles—semantic disagreement visualization, critical variation highlighting, and structured comparison—enabling reliable error detection without visual inspection of the image.
Contribution/Results: In a user study, the method improved detection of unreliable information by 4.9×; 14 of 15 participants preferred it over a single description, and all expressed willingness to adopt it for real-world tasks (e.g., medication identification, clothing selection). This work pioneers the use of inter-model consistency as a credibility cue perceivable through audio or touch, establishing a new paradigm for accessible AI interaction.
📝 Abstract
Multimodal large language models (MLLMs) provide new opportunities for blind and low-vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users employ creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users in detecting unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users' ability to identify unreliable claims (4.9× higher with our approach than with single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks ranging from understanding a tornado's path to posting an image on social media.
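The core idea of surfacing cross-model variations can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how claims are compared, so the sentence splitting, the word-overlap (Jaccard) similarity, the `flag_divergent_claims` helper, and the `threshold` value are all illustrative assumptions. Claims that no other model's description corroborates are flagged as candidates for the "critical variation" presentation.

```python
# Hypothetical sketch of cross-model variation detection (assumed logic,
# not the authors' method). Each description comes from a different MLLM
# asked about the same image; sentences with no close counterpart in the
# other descriptions are flagged as potentially unreliable claims.

def _words(sentence):
    # Crude normalization: lowercase words, trailing punctuation stripped.
    return {w.strip(".,;!?").lower() for w in sentence.split()}

def _similarity(a, b):
    # Jaccard overlap between the word sets of two sentences.
    wa, wb = _words(a), _words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def flag_divergent_claims(descriptions, threshold=0.8):
    """Return (model_index, sentence) pairs that some other model's
    description fails to corroborate above the similarity threshold.
    The threshold is an illustrative assumption, not a tuned value."""
    flagged = []
    split = [d.split(". ") for d in descriptions]
    for i, sentences in enumerate(split):
        for s in sentences:
            # Best match for this sentence within each other description.
            best_per_model = [
                max((_similarity(s, t) for t in split[j]), default=0.0)
                for j in range(len(split)) if j != i
            ]
            if best_per_model and min(best_per_model) < threshold:
                flagged.append((i, s))
    return flagged
```

For example, two descriptions that agree on "The cap is red" but disagree on whether the bottle is labeled "aspirin" or "ibuprofen" would have only the label claims flagged; a real system would then render those divergences in one of the three accessible presentation styles.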