EgoTrigger: Toward Audio-Driven Image Capture for Human Memory Enhancement in All-Day Energy-Efficient Smart Glasses

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high energy consumption of multimodal AI agents in all-day smart glasses, particularly the tension between memory augmentation and battery longevity, this paper proposes EgoTrigger, a context-aware, audio-triggered framework. EgoTrigger leverages hand-object interaction (HOI) sounds (e.g., a drawer opening or a pill bottle being uncapped) as lightweight, event-specific triggers that selectively activate the camera, avoiding always-on visual sensing. Methodologically, it pairs a lightweight audio model (YAMNet) with a custom classification head for low-latency acoustic event detection. To support evaluation, the authors introduce HME-QA, a benchmark dataset for first-person, memory-augmented question answering. Experiments show that EgoTrigger captures 54% fewer frames on average than continuous video capture while maintaining comparable performance on episodic memory tasks, yielding substantial energy savings and pointing toward audio-driven, energy-efficient smart glasses for all-day (24/7) human memory augmentation.
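
As a rough illustration of the triggering pipeline the summary describes, the sketch below pairs the public YAMNet backbone from TF Hub with a small binary head that flags HOI sounds. The head architecture, the decision threshold, and the absence of training code are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the audio-trigger pipeline, assuming the public YAMNet
# model. The binary head below (layer sizes, threshold) is a hypothetical
# stand-in for the paper's custom classification head.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")  # frozen audio backbone

# Hypothetical classification head: HOI sound vs. background.
hoi_head = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),    # on 1024-d YAMNet embeddings
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(hand-object interaction)
])

def should_trigger_camera(waveform_16k: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a clip if any ~1 s audio patch sounds like an HOI event.

    `waveform_16k` is mono float32 audio in [-1, 1] sampled at 16 kHz,
    the format YAMNet expects.
    """
    _, embeddings, _ = yamnet(waveform_16k)        # (patches, 1024)
    probs = hoi_head(embeddings, training=False)   # (patches, 1)
    return bool(tf.reduce_max(probs) > threshold)
```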

📝 Abstract
All-day smart glasses are likely to emerge as platforms capable of continuous contextual sensing, uniquely positioning them for unprecedented assistance in our daily lives. Integrating the multi-modal AI agents required for human memory enhancement while performing continuous sensing, however, presents a major energy efficiency challenge for all-day usage. Achieving this balance requires intelligent, context-aware sensor management. Our approach, EgoTrigger, leverages audio cues from the microphone to selectively activate power-intensive cameras, enabling efficient sensing while preserving substantial utility for human memory enhancement. EgoTrigger uses a lightweight audio model (YAMNet) and a custom classification head to trigger image capture from hand-object interaction (HOI) audio cues, such as the sound of a drawer opening or a medication bottle being opened. In addition to evaluating on the QA-Ego4D dataset, we introduce and evaluate on the Human Memory Enhancement Question-Answer (HME-QA) dataset. Our dataset contains 340 human-annotated first-person QA pairs from full-length Ego4D videos that were curated to ensure that they contained audio, focusing on HOI moments critical for contextual understanding and memory. Our results show EgoTrigger can use 54% fewer frames on average, significantly saving energy in both power-hungry sensing components (e.g., cameras) and downstream operations (e.g., wireless transmission), while achieving comparable performance on datasets for an episodic memory task. We believe this context-aware triggering strategy represents a promising direction for enabling energy-efficient, functional smart glasses capable of all-day use -- supporting applications like helping users recall where they placed their keys or information about their routine activities (e.g., taking medications).
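
The abstract's core mechanism, an always-on microphone gating a power-hungry camera, can be sketched as a simple capture loop. The device handles (`mic`, `camera`), the window length, and the cooldown value below are hypothetical placeholders, not APIs or parameters from the paper.

```python
import time

AUDIO_WINDOW_S = 0.96  # YAMNet's native patch length; the window choice is assumed
COOLDOWN_S = 2.0       # illustrative refractory period between captures

def run_capture_loop(mic, camera, should_trigger_camera):
    """Gate the camera on audio triggers instead of recording continuously."""
    last_capture = float("-inf")
    while True:
        waveform = mic.read_audio_window(AUDIO_WINDOW_S)  # cheap, always on
        now = time.monotonic()
        if now - last_capture >= COOLDOWN_S and should_trigger_camera(waveform):
            camera.capture_image()                        # expensive, gated
            last_capture = now
```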
Problem

Research questions and friction points this paper is trying to address.

Balancing energy efficiency with continuous sensing in smart glasses
Selectively activating cameras using audio cues for memory enhancement
Reducing power consumption while maintaining performance in memory tasks
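
As a back-of-envelope illustration of the last point: with the paper's reported 54% average frame reduction, the energy saved in the camera and radio scales linearly with per-frame cost. The per-frame costs and capture rate below are assumptions for illustration, not measurements from the paper.

```python
# Only the 0.54 reduction factor comes from the paper; all other numbers
# are assumed for illustration.
CAMERA_MJ_PER_FRAME = 20.0  # assumed sensor + ISP energy per capture (mJ)
TX_MJ_PER_FRAME = 15.0      # assumed wireless transmission energy per frame (mJ)
BASELINE_FPS = 1.0          # assumed continuous-capture rate
HOURS = 12                  # one "all-day" usage window

baseline_frames = BASELINE_FPS * 3600 * HOURS
triggered_frames = baseline_frames * (1 - 0.54)  # 54% fewer frames on average

saved_j = (baseline_frames - triggered_frames) \
    * (CAMERA_MJ_PER_FRAME + TX_MJ_PER_FRAME) / 1000.0
print(f"Estimated sensing + transmission energy saved over {HOURS} h: {saved_j:.0f} J")
```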
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-driven camera activation for energy efficiency
Lightweight audio model (YAMNet) with a custom classification head
Context-aware triggering for smart glasses