ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the visual privacy leakage and poor on-device real-time performance of cloud-dependent multimodal interaction, this paper introduces the task of "Visual Instruction Rewriting": automatically converting multimodal visual instructions into text-only commands, so that a lightweight on-device vision-language model (VLM, 250M parameters) can collaborate with existing conversational AI systems without uploading raw images, preserving privacy. Contributions include: (1) the first formal definition of this task; (2) a high-quality dataset of over 39,000 samples spanning 14 domains; (3) an end-to-end pipeline integrating pretraining, supervised fine-tuning, and quantization (model size under 500 MB); and (4) a joint evaluation framework combining BLEU, METEOR, ROUGE, and semantic parsing metrics. Experiments show that the quantized model achieves practical generation quality and semantic accuracy, validating the feasibility of privacy-first, on-device multimodal understanding.

📝 Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
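The abstract's NLG metrics can be illustrated with a minimal pure-Python sketch: a unigram-precision score (BLEU-1-style, without brevity penalty) and ROUGE-L (F1 over the longest common subsequence). This is not the paper's evaluation code, and the instruction pair below is an invented example of a rewritten command versus a reference.

```python
# Hypothetical sketch: scoring a rewritten instruction against a reference
# with unigram precision (BLEU-1-style) and ROUGE-L (LCS-based F1).

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference (clipped counts)."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    hits = 0
    for tok in cand:
        if ref_counts.get(tok, 0) > 0:
            hits += 1
            ref_counts[tok] -= 1
    return hits / len(cand) if cand else 0.0

def rouge_l_f1(candidate, reference):
    """ROUGE-L: F1 over the longest common subsequence of tokens."""
    a, b = candidate.split(), reference.split()
    # dynamic-programming LCS length table
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(a)][len(b)]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

reference = "set a reminder for the concert on march 3 at 7 pm"
candidate = "set a reminder for the concert at 7 pm on march 3"
print(round(unigram_precision(candidate, reference), 2))  # prints 1.0
print(round(rouge_l_f1(candidate, reference), 2))         # prints 0.75
```

Here the candidate uses exactly the reference's words (unigram precision 1.0) but reorders them, which ROUGE-L penalizes through the shorter common subsequence. Production evaluations would use standard implementations (e.g. NLTK or rouge-score) rather than this sketch.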
Problem

Research questions and friction points this paper is trying to address.

Privacy-preserving multimodal interaction
On-device visual instruction rewriting
Compact VLM for privacy-focused AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight on-device VLM integration
Text-only command transformation
Privacy-focused multimodal applications
Abhijit Mishra
Assistant Professor of Practice, iSchool, University of Texas at Austin
Machine Learning · Natural Language Processing · Cognitive Science · Eye-Tracking
Richard Noh
School of Information, University of Texas at Austin
Hsiang Fu
School of Information, University of Texas at Austin
Mingda Li
Department of Statistics and Data Science, Yale University
Minji Kim
School of Information, University of Texas at Austin