🤖 AI Summary
To address the high memory footprint, poor real-time performance, and weak long-term scene understanding of vision-language models (VLMs) in visual impairment assistance, this paper proposes a scene-aware vectorized memory multi-agent framework. Our method introduces: (1) a cross-modal differential quantization mechanism that applies asymmetric precision compression across visual and language modules; and (2) a lightweight vectorized memory architecture for storage and retrieval, enabling continual scene-based reasoning and cross-modal collaboration. Experiments demonstrate that our approach reduces model memory from 38 GB to 16 GB (a 57.9% reduction), with only a 2.05% degradation in VLM performance. It achieves 63.7% accuracy on OCR-VQA and end-to-end response latency of 2.83–3.52 seconds—significantly outperforming resource-constrained baseline small models.
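The summary does not spell out how the asymmetric precision compression works; as a minimal sketch of the general idea, the toy code below assigns a higher bit width to a vision module than to a language module and applies uniform symmetric quantization to each. The module names, bit widths, and weights are illustrative assumptions, not the paper's actual scheme.

```python
def quantize(weights, bits):
    """Uniform symmetric quantization of a list of floats to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from quantized integers."""
    return [v * scale for v in q]

# Asymmetric precision across modalities (hypothetical 8-bit / 4-bit split):
# the vision module keeps more precision, the language module is compressed harder.
module_bits = {"vision_encoder": 8, "language_model": 4}

weights = {
    "vision_encoder": [0.12, -0.5, 0.33],
    "language_model": [0.7, -0.2, 0.05],
}
compressed = {name: quantize(w, module_bits[name]) for name, w in weights.items()}
restored = {name: dequantize(q, s) for name, (q, s) in compressed.items()}
```

Under this split, the 8-bit vision weights reconstruct almost exactly, while the 4-bit language weights show larger rounding error; choosing per-module bit widths is what trades memory against accuracy loss.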
📝 Abstract
This study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized
memory multi-agent system for visually impaired assistance. The modular framework implements
differentiated processing strategies, effectively reducing memory requirements from
38 GB to 16 GB while maintaining model performance. The multi-agent architecture combines
scene classification, vectorized memory, and multimodal interaction, enabling persistent storage
and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the
system provides environmental information beyond the current view using historical memories.
Experiments show the quantized 19B-parameter model experiences only a 2.05% performance drop
on MMBench and maintains 63.7% accuracy on OCR-VQA (original: 64.9%), outperforming smaller
models with equivalent memory requirements such as the Molmo-7B series. The system maintains
response latency between 2.83 and 3.52 seconds from scene analysis to initial speech output, substantially
faster than non-streaming methods. This research advances computational efficiency and assistive
technology, offering visually impaired users comprehensive real-time assistance in scene perception,
text recognition, and navigation.
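As a rough illustration of the persistent storage and efficient retrieval of scene memories described above, the sketch below stores toy scene embeddings and retrieves the most similar one by cosine similarity. The `SceneMemory` class, its API, and the hand-made vectors are assumptions for illustration; a real system would embed scenes with a learned encoder.

```python
import math

class SceneMemory:
    """Toy vectorized memory: stores (embedding, description) pairs and
    retrieves past scenes by cosine similarity to a query embedding."""

    def __init__(self):
        self.entries = []  # list of (embedding, description)

    def store(self, embedding, description):
        self.entries.append((embedding, description))

    def retrieve(self, query, k=1):
        """Return the k stored descriptions most similar to the query vector."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [desc for _, desc in ranked[:k]]

memory = SceneMemory()
memory.store([1.0, 0.0, 0.2], "kitchen: kettle on the left counter")
memory.store([0.0, 1.0, 0.1], "hallway: two steps down near the door")
best = memory.retrieve([0.9, 0.1, 0.2])  # query embedding resembling the kitchen scene
```

This kind of retrieval is what lets the system answer about surroundings beyond the current camera view: the query comes from the present scene, while the answers draw on previously stored memories.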