Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference

πŸ“… 2025-11-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the tension between privacy leakage and computational overhead in ear-worn multimodal wearables, this paper proposes a privacy-first hardware–software co-design framework: the smartphone serves as a trusted edge node, enabling fully localized AI inference over audio and visual data. The approach features wake-word-triggered, synchronized audio-video acquisition on a resource-constrained embedded device; fully offline execution of quantized vision-language models (VLMs) and large language models (LLMs); and a low-power multimodal fusion algorithm. By combining model quantization, on-device wake-word detection, optimized edge communication, and an energy-efficient architecture, the system achieves end-to-end multimodal inference at interactive latency (<300 ms) on commodity smartphone hardware. Experimental evaluation demonstrates feasibility in terms of power consumption, connection robustness, and social acceptability, offering a deployable technical path toward continuous, privacy-sensitive perception and natural human–device interaction.
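
As a concrete illustration of the pipeline this summary describes, below is a minimal sketch of the trigger-then-capture-then-infer loop on the phone side. Every function and type name here is a hypothetical stub, not an API from the paper or any specific library.

```python
import time
from dataclasses import dataclass

@dataclass
class AVClip:
    audio: bytes  # microphone samples captured after the trigger
    frame: bytes  # JPEG frame from the ear-mounted camera

def detect_wake_word(audio_chunk: bytes) -> bool:
    """Stub: keyword spotter that would run continuously on-device."""
    return False

def capture_av_clip(duration_s: float) -> AVClip:
    """Stub: request a synchronized audio-video burst from the wearable."""
    return AVClip(audio=b"", frame=b"")

def run_local_vlm(clip: AVClip) -> str:
    """Stub: quantized VLM/LLM inference executed entirely on the phone."""
    return "a coffee mug on the desk"

def main_loop(mic_stream):
    for chunk in mic_stream:
        if detect_wake_word(chunk):       # trigger detection stays on-device
            clip = capture_av_clip(3.0)   # capture only after the wake word
            t0 = time.monotonic()
            answer = run_local_vlm(clip)  # no audio or video leaves the phone
            latency_ms = (time.monotonic() - t0) * 1000
            print(f"{answer} ({latency_ms:.0f} ms)")  # target: interactive, <300 ms
```

The point of the structure is that the camera and radio are dark by default: nothing is captured, transferred, or inferred until the on-device wake word fires.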

πŸ“ Abstract
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware–software co-design of this privacy-preserving system, including the challenges of integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake-word-triggered capture, and running quantized vision-language and large language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
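
The abstract leaves the wearable-to-phone link unspecified. Assuming, purely for illustration, that the earpiece exposes its wake-word-gated clip through a BLE GATT characteristic, the phone-side fetch could look like this sketch using the bleak library; the device address and characteristic UUID are placeholders.

```python
import asyncio
from bleak import BleakClient

WEARABLE_ADDR = "AA:BB:CC:DD:EE:FF"  # placeholder MAC address of the earpiece
CLIP_CHAR_UUID = "0000abcd-0000-1000-8000-00805f9b34fb"  # hypothetical GATT characteristic

async def fetch_clip() -> bytes:
    """Pull the latest captured clip from the wearable over BLE."""
    async with BleakClient(WEARABLE_ADDR) as client:
        return bytes(await client.read_gatt_char(CLIP_CHAR_UUID))

if __name__ == "__main__":
    clip = asyncio.run(fetch_clip())
    print(f"received {len(clip)} bytes for local inference")
```

Keeping the transfer on a short-range personal link rather than a cloud upload is what makes the phone a "trusted personal edge" in the authors' sense.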
Problem

Research questions and friction points this paper is trying to address.

Developing privacy-preserving multimodal wearables for continuous-sensing applications
Integrating a camera, microphone, and speaker into a lightweight wearable with local AI
Balancing power budgeting, connectivity, and latency for offline multimodal inference (a back-of-the-envelope budget is sketched after this list)
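
To make the last point concrete, a back-of-the-envelope power budget shows why wake-word gating matters on a roughly 30-gram device. Every figure below is an illustrative assumption, not a measurement from the paper.

```python
# All numbers are assumed for illustration only.
BATTERY_MWH = 100 * 3.7  # hypothetical 100 mAh cell at 3.7 V -> 370 mWh

draw_mw = {  # assumed active-power draws per subsystem
    "always-on keyword spotting": 5,
    "camera capture burst": 150,
    "radio transfer to phone": 80,
}
duty = {  # assumed duty cycles: capture and radio run only after a wake word
    "always-on keyword spotting": 1.00,
    "camera capture burst": 0.01,
    "radio transfer to phone": 0.01,
}

avg_mw = sum(draw_mw[k] * duty[k] for k in draw_mw)
print(f"average draw: {avg_mw:.1f} mW")                    # ~7.3 mW
print(f"estimated runtime: {BATTERY_MWH / avg_mw:.0f} h")  # ~51 h
```

Under these assumptions, the expensive camera and radio terms nearly vanish on average, which is exactly the effect wake-word-gated capture is meant to achieve.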
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ear-mounted wearable with camera and microphone
Local AI inference using the smartphone as a trusted personal edge
Offline quantized vision-language and large language models (one possible runtime is sketched after this list)
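
The paper does not name its inference runtime. One plausible way to realize the last item is llama-cpp-python running a 4-bit-quantized LLaVA-style vision-language model fully offline; the model paths below are placeholders for weights stored on the phone's filesystem.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths to locally stored quantized weights.
handler = Llava15ChatHandler(clip_model_path="models/mmproj-q4.gguf")
vlm = Llama(
    model_path="models/llava-1.5-7b-q4_k_m.gguf",  # 4-bit quantized VLM backbone
    chat_handler=handler,
    n_ctx=2048,  # modest context to fit mobile memory budgets
)

resp = vlm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/frame.jpg"}},
            {"type": "text", "text": "What object is in front of me?"},
        ],
    }],
    max_tokens=48,  # short answers help keep latency interactive
)
print(resp["choices"][0]["message"]["content"])
```

Quantized GGUF weights plus a small context window keep both the memory footprint and per-token latency within reach of commodity smartphone hardware, which is the feasibility claim the evaluation tests.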
Yonatan Tussa
University of Maryland, College Park
Andy Heredia
University of Maryland Global Campus
Nirupam Roy
Assistant Professor, University of Maryland, College Park
Ambient Computing · Wireless Networking · Usable Security and Privacy · Cyber Physical Systems · IoT