AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the cognitive overload that visually impaired users experience when navigation aids deliver continuous, undifferentiated feedback, which makes it hard to communicate dynamic environmental information. To mitigate this, the authors propose a motion-aware adaptive video-to-audio framework that switches between spoken scene descriptions in static scenes and safety-focused sound cues (spoken hazard alerts and environmental sound effects) in high-movement scenes. The system uses prompt-based caching and category-specific throttling to reduce auditory clutter and keep latency low. It combines a lightweight AI classifier, a decoder-only Transformer-based vision-language model with mixture-of-experts and cross-modal attention, neural text-to-speech synthesis, and natural sound synthesis. In a real-time navigation study comparing a white cane alone with a white cane plus the system, the authors report a significant increase in user confidence and perceived safety.
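To make the caching and throttling ideas concrete, here is a minimal sketch of how category-specific throttling and prompt-based caching could work. It is not the authors' implementation: the class names `CategoryThrottle` and `PromptCache`, the cooldown values, and the `synthesize` callback are all hypothetical, introduced only for illustration.

```python
import time
from typing import Callable, Dict


class CategoryThrottle:
    """Allow at most one audio cue per category within a cooldown window."""

    def __init__(self, cooldowns: Dict[str, float], default: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldowns = cooldowns      # seconds per category (assumed values)
        self.default = default          # cooldown for unlisted categories
        self.clock = clock              # injectable clock, eases testing
        self._last: Dict[str, float] = {}

    def allow(self, category: str) -> bool:
        now = self.clock()
        last = self._last.get(category)
        cooldown = self.cooldowns.get(category, self.default)
        if last is not None and now - last < cooldown:
            return False                # same category fired too recently
        self._last[category] = now
        return True


class PromptCache:
    """Reuse synthesized audio for a text prompt that was already seen."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self.synthesize = synthesize    # hypothetical text-to-audio backend
        self._store: Dict[str, bytes] = {}
        self.misses = 0

    def get(self, prompt: str) -> bytes:
        # Normalize case and whitespace so near-identical prompts share a key.
        key = " ".join(prompt.lower().split())
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.synthesize(key)
        return self._store[key]
```

In this sketch, a cue would be played only if `allow(category)` returns `True`, and its audio would come from `PromptCache.get`, so repeated prompts skip synthesis entirely.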

📝 Abstract

Navigational aids for blind and low vision individuals struggle to convey dynamic real-world environments, leading to cognitive overload from continuous, undifferentiated feedback. We present AMAVA, a novel real-time video-to-audio framework that converts mobile device video into contextually relevant sound effects or text-to-speech descriptions. We propose a motion-aware pipeline that uses a lightweight AI classification model to distinguish between low- and high-movement scenes, followed by a real-time text-to-audio synthesis pipeline, to enhance environmental perception more efficiently. In static environments, AMAVA generates spoken scene descriptions for situational awareness. In high-movement situations, it prioritizes safety by delivering sound cues, such as spoken hazard alerts and environmental sound effects. These audio outputs are produced by a decoder-only transformer-based vision-language model with mixture-of-experts and cross-modal attention for visual understanding, in conjunction with neural text-to-speech and natural sound synthesis networks. The proposed framework uses prompt-based caching and category-specific throttling to avoid auditory clutter and minimize latency. We present a comprehensive evaluation of the system, including a real-time navigation study comparing a white cane alone with a white cane plus AMAVA, which shows a significant increase in user confidence and perceived safety.
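The routing step of the motion-aware pipeline can be illustrated with a small sketch. The abstract describes a lightweight AI classification model; the code below substitutes a plain frame-difference heuristic as a stand-in, so it shows only the low-movement versus high-movement decision, not the authors' classifier. The frame representation, the function names, and the `threshold` value are assumptions.

```python
from typing import Sequence

# Assumed representation: a frame is a flat sequence of grayscale pixels, 0-255.
Frame = Sequence[int]


def motion_score(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference between two frames, scaled to [0, 1]."""
    if len(prev) != len(curr) or not prev:
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(prev, curr))
    return total / (255.0 * len(prev))


def choose_mode(prev: Frame, curr: Frame, threshold: float = 0.1) -> str:
    """Return 'alert' for high-movement scenes and 'describe' otherwise.

    'describe' stands for spoken scene descriptions in static environments;
    'alert' stands for safety cues such as hazard alerts and sound effects.
    """
    return "alert" if motion_score(prev, curr) >= threshold else "describe"
```

A learned classifier would replace `motion_score` in a real system; the two-way branch in `choose_mode` is the part that mirrors the behavior described in the abstract.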

Problem

Research questions and friction points this paper is trying to address.

visually-impaired assistance

dynamic environments

cognitive overload

navigation aids

environmental perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

motion-aware video-to-audio

adaptive assistive technology

vision-language model