Lessons Learned from Developing a Privacy-Preserving Multimodal Wearable for Local Voice-and-Vision Inference

πŸ“… 2025-11-14
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the tension between privacy leakage and computational overhead in ear-worn multimodal wearables, this paper proposes a privacy-first hardware–software co-design framework: the smartphone serves as a trusted edge node, enabling fully localized AI inference over audio and visual data. The approach features wake-word-triggered, synchronized audio-video acquisition on a resource-constrained embedded device; fully offline execution of quantized vision-language models (VLMs) and large language models (LLMs); and a low-power multimodal fusion algorithm. By combining model quantization, on-device wake-word detection, optimized edge communication, and an energy-efficient architecture, the system achieves end-to-end multimodal inference at interactive latency (<300 ms) on commodity smartphone hardware. Experimental evaluation demonstrates feasibility in terms of power consumption, connection robustness, and social acceptability, offering a deployable technical path toward continuous, privacy-sensitive perception and natural human–device interaction.
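
As a concrete illustration of the pipeline this summary describes, below is a minimal sketch of the trigger-then-capture-then-infer loop on the phone side. Every function and type name here is a hypothetical stub, not an API from the paper or any specific library.

```python
import time
from dataclasses import dataclass

@dataclass
class AVClip:
    audio: bytes  # microphone samples captured after the trigger
    frame: bytes  # JPEG frame from the ear-mounted camera

def detect_wake_word(audio_chunk: bytes) -> bool:
    """Stub: keyword spotter that would run continuously on-device."""
    return False

def capture_av_clip(duration_s: float) -> AVClip:
    """Stub: request a synchronized audio-video burst from the wearable."""
    return AVClip(audio=b"", frame=b"")

def run_local_vlm(clip: AVClip) -> str:
    """Stub: quantized VLM/LLM inference executed entirely on the phone."""
    return "a coffee mug on the desk"

def main_loop(mic_stream):
    for chunk in mic_stream:
        if detect_wake_word(chunk):       # trigger detection stays on-device
            clip = capture_av_clip(3.0)   # capture only after the wake word
            t0 = time.monotonic()
            answer = run_local_vlm(clip)  # no audio or video leaves the phone
            latency_ms = (time.monotonic() - t0) * 1000
            print(f"{answer} ({latency_ms:.0f} ms)")  # target: interactive, <300 ms
```

The point of the structure is that the camera and radio are dark by default: nothing is captured, transferred, or inferred until the on-device wake word fires.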

πŸ“ Abstract
Many promising applications of multimodal wearables require continuous sensing and heavy computation, yet users reject such devices due to privacy concerns. This paper shares our experiences building an ear-mounted voice-and-vision wearable that performs local AI inference using a paired smartphone as a trusted personal edge. We describe the hardware–software co-design of this privacy-preserving system, including the challenges of integrating a camera, microphone, and speaker within a 30-gram form factor, enabling wake-word-triggered capture, and running quantized vision-language and large language models entirely offline. Through iterative prototyping, we identify key design hurdles in power budgeting, connectivity, latency, and social acceptability. Our initial evaluation shows that fully local multimodal inference is feasible on commodity mobile hardware with interactive latency. We conclude with design lessons for researchers developing embedded AI systems that balance privacy, responsiveness, and usability in everyday settings.
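
The abstract leaves the wearable-to-phone link unspecified. Assuming, purely for illustration, that the earpiece exposes its wake-word-gated clip through a BLE GATT characteristic, the phone-side fetch could look like this sketch using the bleak library; the device address and characteristic UUID are placeholders.

```python
import asyncio
from bleak import BleakClient

WEARABLE_ADDR = "AA:BB:CC:DD:EE:FF"  # placeholder MAC address of the earpiece
CLIP_CHAR_UUID = "0000abcd-0000-1000-8000-00805f9b34fb"  # hypothetical GATT characteristic

async def fetch_clip() -> bytes:
    """Pull the latest captured clip from the wearable over BLE."""
    async with BleakClient(WEARABLE_ADDR) as client:
        return bytes(await client.read_gatt_char(CLIP_CHAR_UUID))

if __name__ == "__main__":
    clip = asyncio.run(fetch_clip())
    print(f"received {len(clip)} bytes for local inference")
```

Keeping the transfer on a short-range personal link rather than a cloud upload is what makes the phone a "trusted personal edge" in the authors' sense.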
Problem

Research questions and friction points this paper is trying to address.

Developing privacy-preserving multimodal wearables for continuous-sensing applications
Integrating a camera, microphone, and speaker into a lightweight wearable with local AI
Balancing power budgeting, connectivity, and latency for offline multimodal inference (a back-of-the-envelope budget is sketched after this list)
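
To make the last point concrete, a back-of-the-envelope power budget shows why wake-word gating matters on a roughly 30-gram device. Every figure below is an illustrative assumption, not a measurement from the paper.

```python
# All numbers are assumed for illustration only.
BATTERY_MWH = 100 * 3.7  # hypothetical 100 mAh cell at 3.7 V -> 370 mWh

draw_mw = {  # assumed active-power draws per subsystem
    "always-on keyword spotting": 5,
    "camera capture burst": 150,
    "radio transfer to phone": 80,
}
duty = {  # assumed duty cycles: capture and radio run only after a wake word
    "always-on keyword spotting": 1.00,
    "camera capture burst": 0.01,
    "radio transfer to phone": 0.01,
}

avg_mw = sum(draw_mw[k] * duty[k] for k in draw_mw)
print(f"average draw: {avg_mw:.1f} mW")                    # ~7.3 mW
print(f"estimated runtime: {BATTERY_MWH / avg_mw:.0f} h")  # ~51 h
```

Under these assumptions, the expensive camera and radio terms nearly vanish on average, which is exactly the effect wake-word-gated capture is meant to achieve.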
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ear-mounted wearable with camera and microphone
Local AI inference using the smartphone as a trusted personal edge
Offline quantized vision-language and large language models (one possible runtime is sketched after this list)
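
The paper does not name its inference runtime. One plausible way to realize the last item is llama-cpp-python running a 4-bit-quantized LLaVA-style vision-language model fully offline; the model paths below are placeholders for weights stored on the phone's filesystem.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder paths to locally stored quantized weights.
handler = Llava15ChatHandler(clip_model_path="models/mmproj-q4.gguf")
vlm = Llama(
    model_path="models/llava-1.5-7b-q4_k_m.gguf",  # 4-bit quantized VLM backbone
    chat_handler=handler,
    n_ctx=2048,  # modest context to fit mobile memory budgets
)

resp = vlm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "file:///tmp/frame.jpg"}},
            {"type": "text", "text": "What object is in front of me?"},
        ],
    }],
    max_tokens=48,  # short answers help keep latency interactive
)
print(resp["choices"][0]["message"]["content"])
```

Quantized GGUF weights plus a small context window keep both the memory footprint and per-token latency within reach of commodity smartphone hardware, which is the feasibility claim the evaluation tests.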
Yonatan Tussa
University of Maryland, College Park
Andy Heredia
University of Maryland Global Campus
Nirupam Roy
Assistant Professor, University of Maryland, College Park
Ambient Computing · Wireless Networking · Usable Security and Privacy · Cyber Physical Systems · IoT