🤖 AI Summary
Prior work lacks empirical evaluation of multimodal large language models (MLLMs) for visual interpretation in the real-world, daily-life contexts of blind and low-vision (BLV) users, particularly regarding trustworthiness, usability, and high-stakes scenarios such as medication dosage identification.
Method: We conducted a two-week diary study in which 20 BLV participants used an MLLM-enabled visual interpretation application, yielding 553 authentic diary entries; a preliminary mixed-methods analysis of 60 entries from 6 participants examined explanation trustworthiness, satisfaction, and performance across visual tasks.
Contribution/Results: In the preliminary analysis, participants rated the application's visual interpretations as trustworthy (mean 3.75/5) and satisfying (mean 4.15/5), and they trusted it even in high-stakes scenarios such as receiving medical dosage advice. The study addresses a gap in understanding how MLLM-enabled visual interpretation has changed BLV users' application use in authentic daily settings; the completed analysis is intended to inform the design of future MLLM-enabled visual interpretation systems.
📝 Abstract
Blind and Low Vision (BLV) people have adopted AI-powered visual interpretation applications to address their daily needs. While these applications have been helpful, prior work has found that users remain dissatisfied with the applications' frequent errors. Recently, multimodal large language models (MLLMs) have been integrated into visual interpretation applications, and they show promise for more descriptive visual interpretations. However, it is still unknown how this advancement has changed people's use of these applications. To address this gap, we conducted a two-week diary study in which 20 BLV people used an MLLM-enabled visual interpretation application we developed, and we collected 553 diary entries. In this paper, we report a preliminary analysis of 60 diary entries from 6 participants. We found that participants considered the application's visual interpretations trustworthy (mean 3.75 out of 5) and satisfying (mean 4.15 out of 5). Moreover, participants trusted our application in high-stakes scenarios, such as receiving medical dosage advice. We discuss our plan to complete our analysis to inform the design of future MLLM-enabled visual interpretation systems.