🤖 AI Summary
Large vision-language models (VLMs) face prohibitive computational and memory overhead, hindering their deployment on mobile devices for blind and low-vision (BLV) users. Method: This work systematically evaluates lightweight VLMs, specifically the SmolVLM2 series, for fine-grained, context-aware video captioning in both indoor and outdoor settings. We introduce two novel accessibility-oriented evaluation frameworks, a multi-context BLV framework and a navigation-assistance framework, and investigate the impact of prompt engineering on caption quality. Experiments run on smartphones with FP32 and INT8 quantized inference and are validated on the AVCaps and Charades datasets. Contribution/Results: Lightweight VLMs achieve efficient, high-quality, task-adapted caption generation on resource-constrained mobile hardware, and the proposed frameworks enable rigorous, scenario-specific assessment aligned with real-world BLV needs. Together, these results improve the practicality and deployability of vision-assistance technologies, advancing accessible AI for mobile platforms.
📝 Abstract
Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters on two diverse datasets: AVCaps (outdoor) and Charades (indoor). We introduce two novel evaluation frameworks designed specifically for BLV accessibility assessment: the Multi-Context BLV Framework, which evaluates spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, which focuses on mobility-critical information. Additionally, we systematically evaluate four prompt design strategies and deploy both models on a smartphone, comparing FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
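The FP32 vs INT8 comparison above rests on post-training weight quantization. The abstract does not specify the authors' quantization pipeline, so the following is only a minimal illustrative sketch of per-tensor symmetric INT8 quantization, the general technique behind such deployments; all function names and values here are hypothetical, not taken from the paper.

```python
# Illustrative per-tensor symmetric INT8 quantization (not the authors' code).
# Each float weight is mapped to an 8-bit integer via a single shared scale,
# cutting memory roughly 4x versus FP32 at the cost of rounding error.

def quantize_int8(weights):
    """Map float weights to int8 codes with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

# Toy weight tensor (hypothetical values, for illustration only)
weights = [0.42, -1.27, 0.0, 0.89]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

In practice, frameworks apply such schemes per-layer (and often per-channel, with calibrated activation ranges); the evaluation question the paper poses is whether the resulting caption quality degrades measurably under the BLV-oriented frameworks while latency and memory improve on the phone.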