Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

📅 2026-02-25

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

This work proposes the first edge-based streaming system leveraging a multimodal large language model (MLLM) to enable real-time, privacy-preserving episodic question answering on wearable devices, circumventing the privacy risks and latency associated with cloud offloading. The system employs an asynchronous dual-thread architecture: a Descriptor Thread continuously encodes incoming video streams into lightweight textual memory, while a QA Thread performs low-latency inference based on this memory. Through edge deployment optimizations, the system achieves 51.76% accuracy with a first-token latency of 0.41 seconds on a consumer-grade GPU with 8 GB memory, and 54.40% accuracy on a local server—approaching the performance of cloud-based counterparts (56.00%)—demonstrating an effective balance among privacy, latency, and accuracy for edge MLLM applications.

Technology Category

Application Category

📝 Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, hence we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of Multimodal Large Language Models (MLLMs) within strict resource boundaries, showing promising results also when compared to clound-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.

Problem

Research questions and friction points this paper is trying to address.

episodic memory

question answering

edge computing

multimodal LLMs

privacy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models

Edge Computing

Episodic Memory