🤖 AI Summary
This study investigates whether large vision-language models (LVLMs) and their standard automatic evaluation metrics accurately reflect the authentic preferences of blind and low-vision (BLV) users in navigation assistance. Method: We introduce Eye4B, the first BLV-oriented benchmark dataset comprising 1,100 real-world scenes, each paired with 5–10 navigation queries, and conduct multidimensional preference assessments—covering fear, operability, clarity, conciseness, and non-actionability—with eight BLV participants across six LVLMs (e.g., LLaVA, Qwen-VL). Contribution/Results: We find that conventional automatic metrics (e.g., CLIPScore, BLEU) exhibit significant misalignment with BLV preferences (confirmed via Spearman and Kendall correlation analyses). Crucially, conciseness and non-actionability emerge as dominant preference dimensions. This work delivers the first quantitative, multidimensional characterization of BLV preferences for navigation responses, bridging a critical gap in BLV-aligned evaluation and providing empirical foundations and design principles for accessible AI.
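As a rough illustration of the alignment analysis described above, the sketch below computes Spearman and Kendall rank correlations between automatic metric scores and averaged BLV preference ratings. It is a minimal sketch under assumptions, not the paper's code: the variable names (`metric_scores`, `blv_ratings`) and the example values are hypothetical stand-ins for per-response metric scores and human ratings.

```python
# Minimal sketch (not the paper's code): rank-correlation check between an
# automatic metric (e.g., CLIPScore) and averaged BLV preference ratings.
from scipy.stats import spearmanr, kendalltau

# Hypothetical per-response scores; in practice one automatic-metric score and
# one averaged BLV rating would be collected for each navigation response.
metric_scores = [0.31, 0.27, 0.44, 0.39, 0.22]  # e.g., CLIPScore per response
blv_ratings = [4.2, 2.8, 3.1, 4.5, 2.0]         # e.g., mean BLV rating per response

rho, rho_p = spearmanr(metric_scores, blv_ratings)
tau, tau_p = kendalltau(metric_scores, blv_ratings)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {tau_p:.3f})")
```

A weak or negative correlation in this kind of check is what the study reports as misalignment between the automatic metrics and BLV preferences.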
📝 Abstract
Vision is a primary means by which humans perceive their environment, but Blind and Low-Vision (BLV) people need assistance to understand their surroundings, especially in unfamiliar environments. The emergence of semantic-based systems as assistive tools for BLV users has motivated many researchers to explore responses from Large Vision-Language Models (LVLMs). However, the preferences of BLV users for diverse types and styles of LVLM responses, specifically for navigational aid, have yet to be studied. To fill this gap, we first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene. Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences for six LVLMs from five perspectives, including Afraidness, Nonactionability, Sufficiency, and Conciseness. Finally, we introduce the Eye4B benchmark for evaluating the alignment between widely used model-based image-text metrics and our collected BLV preferences. Our work can serve as a guideline for developing BLV-aware LVLMs toward a Barrier-Free AI system.
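For the benchmark side of this abstract, the sketch below shows how one widely used model-based image-text metric, CLIPScore, might be computed for a candidate LVLM response before being correlated with collected BLV ratings. It is a hedged example, not the paper's pipeline: the image tensor, the response string, and the choice of `torchmetrics` with the `openai/clip-vit-base-patch16` backbone are assumptions for illustration only.

```python
# Minimal sketch (assumed setup, not the paper's pipeline): scoring an LVLM
# navigation response against its scene image with CLIPScore.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Hypothetical inputs: a random stand-in image tensor and one candidate response.
image = torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8)
response = "Move two steps forward; a low bench is directly ahead on your right."

score = metric(image, response)
print(f"CLIPScore: {score.item():.2f}")  # this value would then be compared with BLV ratings
```

Scores like this, gathered over the benchmark's scene-request pairs, are what the rank-correlation check above compares against the BLV preference ratings.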