🤖 AI Summary
Large vision-language models (VLMs) face prohibitive computational and memory overhead, hindering their deployment on mobile devices for blind and low-vision (BLV) users. Method: This work systematically evaluates lightweight VLMs, specifically the SmolVLM2 series, for fine-grained, context-aware video captioning in both indoor and outdoor settings. We introduce two novel accessibility-oriented evaluation frameworks, a multi-context BLV framework and a navigation-assistance framework, and investigate the impact of prompt engineering on caption quality. Experiments run on smartphones with FP32 and INT8 quantized inference and are validated on the AVCaps and Charades datasets. Contribution/Results: Lightweight VLMs achieve efficient, high-quality, task-adapted caption generation on resource-constrained mobile hardware, and the proposed frameworks enable rigorous, scenario-specific assessment aligned with real-world BLV needs. Together, these results improve the practicality and deployability of vision-assistance technologies, advancing accessible AI for mobile platforms.
📝 Abstract
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters on two diverse datasets: AVCaps (outdoor) and Charades (indoor). We introduce two novel evaluation frameworks designed specifically for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we systematically evaluate four prompt design strategies and deploy both models on a smartphone, comparing FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
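The FP32 vs INT8 comparison above rests on post-training weight quantization. The abstract does not specify the authors' quantization pipeline, so the following is only a minimal illustrative sketch of per-tensor symmetric INT8 quantization, the general technique behind such deployments; all function names and values here are hypothetical, not taken from the paper.

```python
# Illustrative per-tensor symmetric INT8 quantization (not the authors' code).
# Each float weight is mapped to an 8-bit integer via a single shared scale,
# cutting memory roughly 4x versus FP32 at the cost of rounding error.

def quantize_int8(weights):
    """Map float weights to int8 codes with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

# Toy weight tensor (hypothetical values, for illustration only)
weights = [0.42, -1.27, 0.0, 0.89]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

In practice, frameworks apply such schemes per-layer (and often per-channel, with calibrated activation ranges); the evaluation question the paper poses is whether the resulting caption quality degrades measurably under the BLV-oriented frameworks while latency and memory improve on the phone.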