🤖 AI Summary
This work addresses the limitations of multimodal large language models in embodied intelligence, where restricted context windows hinder effective processing of long-horizon observations and conventional text summarization often discards critical visual and spatial details. The authors propose a non-parametric memory framework that explicitly decouples episodic and semantic memory for the first time in embodied settings. Their approach follows a “retrieve-then-reason” paradigm: experiences are retrieved via semantic similarity, verified through visual reasoning, and distilled into structured procedural rules via a rule-extraction mechanism to enable cross-environment generalization. Notably, this method achieves robust experience reuse without requiring geometric alignment, yielding significant improvements—7.3% higher LLM-Match and 7.7% higher success rate—on A-EQA and GOAT-Bench, respectively, while substantially enhancing exploration efficiency and complex reasoning capabilities.
📝 Abstract
Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory-assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match×SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens the complex reasoning of embodied agents.
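The retrieval-first, reasoning-assisted loop described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class and function names (`Episode`, `NonParametricMemory`, `verifier`) are hypothetical, the embeddings are toy vectors, and the `verifier` callable stands in for the paper's visual-reasoning check by an MLLM.

```python
# Illustrative sketch of a retrieve-then-verify-then-distill memory loop.
# All names are hypothetical; embeddings and rules are toy stand-ins.
from dataclasses import dataclass, field

@dataclass
class Episode:
    observation: str   # reference to a past (visual) observation
    embedding: list    # precomputed semantic embedding of the observation
    outcome: str       # action taken or fact learned in that episode

def cosine(a, b):
    # Plain cosine similarity; no geometric (pose/map) alignment is needed.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class NonParametricMemory:
    episodic: list = field(default_factory=list)   # raw past experiences
    semantic: list = field(default_factory=list)   # distilled, reusable rules

    def store(self, ep: Episode):
        self.episodic.append(ep)

    def retrieve(self, query_emb, k=3):
        # Step 1: recall episodes by semantic similarity to the query.
        ranked = sorted(self.episodic,
                        key=lambda e: cosine(query_emb, e.embedding),
                        reverse=True)
        return ranked[:k]

    def verify(self, query_emb, candidates, verifier):
        # Step 2: filter spurious matches; `verifier` stands in for a
        # visual-reasoning check performed by an MLLM.
        return [c for c in candidates if verifier(query_emb, c)]

    def distill(self, episodes):
        # Step 3: convert verified experiences into program-style rules
        # intended to transfer across environments.
        for ep in episodes:
            rule = f"IF observation like '{ep.observation}' THEN {ep.outcome}"
            if rule not in self.semantic:
                self.semantic.append(rule)
        return self.semantic
```

In this toy form, episodic memory holds raw experiences for efficient recall, while the distilled rule strings play the role of semantic memory that can be reused in a new environment.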