🤖 AI Summary
To address the high memory footprint, poor real-time performance, and weak long-term scene understanding of vision-language models (VLMs) in visual impairment assistance, this paper proposes a scene-aware vectorized memory multi-agent framework. Our method introduces: (1) a cross-modal differential quantization mechanism that applies asymmetric precision compression across visual and language modules; and (2) a lightweight vectorized memory architecture for storage and retrieval, enabling continual scene-based reasoning and cross-modal collaboration. Experiments demonstrate that our approach reduces model memory from 38 GB to 16 GB (a 57.9% reduction), with only a 2.05% degradation in VLM performance. It achieves 63.7% accuracy on OCR-VQA and end-to-end response latency of 2.83–3.52 seconds—significantly outperforming resource-constrained baseline small models.
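The summary does not spell out how the asymmetric precision compression works; as a minimal sketch of the general idea, the toy code below assigns a higher bit width to a vision module than to a language module and applies uniform symmetric quantization to each. The module names, bit widths, and weights are illustrative assumptions, not the paper's actual scheme.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization of a list of floats to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]

# Asymmetric precision across modalities (hypothetical 8-bit / 4-bit split):
# the vision module keeps more precision, the language module is compressed harder.
module_bits = {"vision_encoder": 8, "language_model": 4}

weights = {
    "vision_encoder": [0.12, -0.5, 0.33],
    "language_model": [0.7, -0.2, 0.05],
}
compressed = {name: quantize(w, module_bits[name]) for name, w in weights.items()}
restored = {name: dequantize(q, s) for name, (q, s) in compressed.items()}
```

Under this split, the 8-bit vision weights reconstruct almost exactly, while the 4-bit language weights show larger rounding error; choosing per-module bit widths is what trades memory against accuracy loss.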
📝 Abstract
This study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized
memory multi-agent system for visually impaired assistance. The modular framework implements
differentiated processing strategies, effectively reducing memory requirements from
38 GB to 16 GB while maintaining model performance. The multi-agent architecture combines
scene classification, vectorized memory, and multimodal interaction, enabling persistent storage
and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the
system provides environmental information beyond the current view using historical memories.
Experiments show the quantized 19B-parameter model experiences only a 2.05% performance drop
on MMBench and maintains 63.7% accuracy on OCR-VQA (original: 64.9%), outperforming smaller
models with equivalent memory requirements such as the Molmo-7B series. The system maintains
response latency between 2.83 and 3.52 seconds from scene analysis to initial speech output, substantially
faster than non-streaming methods. This research advances computational efficiency and assistive
technology, offering visually impaired users comprehensive real-time assistance in scene perception,
text recognition, and navigation.
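As a rough illustration of the persistent storage and efficient retrieval of scene memories described above, the sketch below stores toy scene embeddings and retrieves the most similar one by cosine similarity. The `SceneMemory` class, its API, and the hand-made vectors are assumptions for illustration; a real system would embed scenes with a learned encoder.

```python
import math

class SceneMemory:
    """Toy vectorized memory: stores (embedding, description) pairs and
    retrieves past scenes by cosine similarity to a query embedding."""

    def __init__(self):
        self.entries = []  # list of (embedding, description)

    def store(self, embedding, description):
        self.entries.append((embedding, description))

    def retrieve(self, query, k=1):
        """Return the k stored descriptions most similar to the query vector."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [desc for _, desc in ranked[:k]]

memory = SceneMemory()
memory.store([1.0, 0.0, 0.2], "kitchen: kettle on the left counter")
memory.store([0.0, 1.0, 0.1], "hallway: two steps down near the door")
best = memory.retrieve([0.9, 0.1, 0.2])  # query embedding resembling the kitchen scene
```

This kind of retrieval is what lets the system answer about surroundings beyond the current camera view: the query comes from the present scene, while the answers draw on previously stored memories.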