🤖 AI Summary
To address the visual privacy leakage and poor on-device real-time performance of cloud-dependent multimodal interaction, this paper introduces the task of "visual instruction rewriting": automatically converting multimodal visual instructions into pure-text commands, so that a lightweight on-device vision-language model (VLM, 250M parameters) can collaborate with existing conversational AI systems without uploading raw images, thereby preserving privacy. The contributions include: (1) a formal definition of this task; (2) a high-quality dataset of over 39,000 samples spanning 14 domains; (3) an end-to-end pipeline integrating pretraining, supervised fine-tuning, and quantization (storage footprint under 500 MB); and (4) an evaluation framework combining BLEU, METEOR, and ROUGE with semantic parsing metrics. Experiments show that even the quantized model achieves practical performance in both generation quality and semantic accuracy, validating the feasibility of privacy-first, on-device multimodal understanding.
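To make the task concrete, here is a small invented input/output pair in the spirit of the description above (this sample is illustrative, not drawn from the paper's dataset): a deictic instruction that only makes sense together with an image is rewritten into a self-contained text command that an existing text-only assistant can execute.

```python
# Illustrative (invented) example of visual instruction rewriting: the image
# context and all strings below are made up for demonstration purposes.
example = {
    "image_context": "concert poster: 'Jazz Night, March 3, 7 pm, City Hall'",
    "user_instruction": "add this to my calendar",
    "rewritten_command": "add Jazz Night at City Hall on March 3 at 7 pm to my calendar",
}

# Only the rewritten text command needs to leave the device; the raw image
# (and the OCR-visible details on it) stays local.
print(example["rewritten_command"])
```

The key property is that the rewritten command carries all the visual context (event name, place, date, time) needed for a downstream text-only conversational system to act on it.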
📝 Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) that enable multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy, since sensitive vision data must be transmitted to servers, and (2) limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of a lightweight on-device instruction-rewriter VLM (250M parameters) with existing conversational AI systems while keeping vision data private. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated with NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500 MB storage footprint) achieves effective instruction rewriting, enabling privacy-focused, multimodal AI applications.
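The NLG metrics mentioned above can be sketched with stdlib-only simplifications. The paper presumably uses standard BLEU/METEOR/ROUGE toolkits; the token-level BLEU-1 and ROUGE-1 implementations below, and the example sentences, are simplified assumptions for illustration only.

```python
# Simplified, stdlib-only sketches of unigram BLEU and ROUGE F1 for scoring
# a rewritten instruction against a reference rewrite.
import math
from collections import Counter

def bleu1(reference: str, hypothesis: str) -> float:
    """Unigram BLEU: clipped precision times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # clipped counts
    precision = overlap / len(hyp)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

def rouge1_f1(reference: str, hypothesis: str) -> float:
    """Unigram ROUGE: harmonic mean of unigram precision and recall."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)

# Hypothetical reference rewrite vs. model output (both invented):
reference  = "add the jazz concert on March 3 at 7 pm to my calendar"
hypothesis = "add the jazz concert on March 3 to my calendar"
print(f"BLEU-1:  {bleu1(reference, hypothesis):.3f}")
print(f"ROUGE-1: {rouge1_f1(reference, hypothesis):.3f}")
```

Note that n-gram metrics alone can reward fluent rewrites that drop key slots (the hypothesis above loses the time "7 pm" yet still scores well), which is why the paper pairs them with semantic parsing analysis.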