Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory footprint, poor real-time performance, and weak long-term scene understanding of vision-language models (VLMs) in visual impairment assistance, this paper proposes a scene-aware vectorized memory multi-agent framework. The method introduces: (1) a cross-modal differentiated quantization mechanism that applies asymmetric precision compression to the visual and language modules; and (2) a lightweight vectorized memory architecture for scene storage and retrieval, enabling continual scene-based reasoning and cross-modal collaboration. Experiments demonstrate that the approach reduces model memory from 38 GB to 16 GB (a 57.9% reduction) with only a 2.05% degradation in VLM performance. It achieves 63.7% accuracy on OCR-VQA and end-to-end response latency of 2.83–3.52 seconds, significantly outperforming small baseline models under the same resource constraints.
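The asymmetric-precision idea can be illustrated with off-the-shelf tooling. The sketch below is not the paper's implementation; it assumes a Hugging Face-style VLM checkpoint and uses bitsandbytes 4-bit quantization while exempting the vision-side modules, whose names ("vision_tower", "multi_modal_projector") are placeholders that vary by model.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Compress the language stack to 4-bit NF4 while the listed modules are
# left unconverted (they remain in the fp16 compute dtype), mimicking
# asymmetric precision across modalities. Module names are placeholders:
# inspect model.named_modules() for the real ones in a given checkpoint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["vision_tower", "multi_modal_projector"],
)

model = AutoModelForCausalLM.from_pretrained(
    "org/19b-vlm-checkpoint",  # hypothetical model id
    quantization_config=bnb_config,
    device_map="auto",
)
```

Keeping the visual encoder at higher precision is the key design choice here: visual features are consumed once per frame and are sensitive to quantization noise, while the much larger language decoder dominates the memory budget and tolerates harder compression.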

📝 Abstract
This study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular quantization framework applies differentiated processing strategies to the visual and language modules, reducing memory requirements from 38 GB to 16 GB while maintaining model performance. The multi-agent architecture combines scene classification, vectorized memory, and multimodal interaction, enabling persistent storage and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the system draws on historical memories to provide environmental information beyond the current view. Experiments show that the quantized 19B-parameter model suffers only a 2.05% performance drop on MMBench and maintains 63.7% accuracy on OCR-VQA (down from 64.9%), outperforming smaller models with equivalent memory requirements such as the Molmo-7B series. The system maintains response latency between 2.83 and 3.52 seconds from scene analysis to initial speech output, substantially faster than non-streaming approaches. This research advances both computational efficiency and assistive technology, offering visually impaired users comprehensive real-time assistance in scene perception, text recognition, and navigation.
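The paper does not publish its memory schema; as a rough illustration, a vectorized scene memory reduces to an embedding store queried by cosine similarity. Everything below (the class name, its fields, and the external embedding model that produces the vectors) is assumed for the sketch.

```python
import numpy as np

class SceneMemory:
    """Minimal vectorized scene-memory store: unit-normalized embeddings
    plus parallel scene descriptions, retrieved by cosine similarity."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.records: list[str] = []  # scene descriptions, parallel to vectors

    def add(self, embedding: np.ndarray, description: str) -> None:
        v = embedding.astype(np.float32)
        v /= np.linalg.norm(v)  # normalize so dot product == cosine similarity
        self.vectors = np.vstack([self.vectors, v[None, :]])
        self.records.append(description)

    def retrieve(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        q = query.astype(np.float32)
        q /= np.linalg.norm(q)
        sims = self.vectors @ q                # cosine similarity to every memory
        top = np.argsort(-sims)[:k]            # indices of the k closest scenes
        return [(self.records[i], float(sims[i])) for i in top]
```

A deployed system would additionally persist this store across sessions and attach timestamps or locations to each record, which is what lets the reasoning stage answer questions about scenes outside the current camera view.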
Problem

Research questions and friction points this paper is trying to address.

Reducing memory requirements for vision-language models efficiently
Providing real-time environmental assistance for visually impaired users
Enabling persistent storage and retrieval of scene memories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal differentiated quantization for VLMs
Scene-aware vectorized memory multi-agent system
Efficient perception-memory-reasoning workflow integration
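To make the perception-memory-reasoning integration concrete, here is a minimal sketch of one assistance cycle. The collaborator objects (`vlm`, `embed`, `tts`) are hypothetical stand-ins for the paper's scene-classification, memory, and multimodal-interaction agents, and `memory` is a vector store like the SceneMemory sketch above.

```python
def assist_step(frame, user_query, memory, vlm, embed, tts):
    """One illustrative perception-memory-reasoning cycle.

    All collaborators are placeholders: `vlm` wraps the quantized VLM,
    `embed` maps text to a vector, `tts` streams speech output, and
    `memory` supports add()/retrieve() as in the SceneMemory sketch.
    """
    scene = vlm.describe(frame)                    # perception: caption the current scene
    recalled = memory.retrieve(embed(scene), k=3)  # memory: fetch similar past scenes
    context = "\n".join(desc for desc, _ in recalled)
    answer = vlm.answer(user_query, image=frame,   # reasoning: answer grounded in
                        context=context)           # current view + recalled context
    memory.add(embed(scene), scene)                # persist the new scene memory
    tts.stream(answer)                             # speak as tokens arrive (low latency)
    return answer
```

Streaming the speech output as the VLM generates tokens, rather than waiting for the full answer, is what keeps the reported time-to-first-speech in the 2.83–3.52 second range.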