Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address a central challenge of long-video understanding, that key events are temporally dispersed yet spatially concentrated and collectively exceed large language models' (LLMs) context windows, this paper proposes VideoMindPalace, a framework inspired by the "Mind Palace" memory technique. It constructs environment-anchored, graph-structured semantic representations by integrating hand-object interaction tracking, activity-zone clustering, and 3D scene layout modeling, thereby unifying spatio-temporal and 3D contextual reasoning. The authors further introduce the Video MindPalace Benchmark (VMB), a layout-aware benchmark for long-video reasoning. By pairing graph-structured semantic encoding with LLM-driven natural language parsing, the method achieves notable gains in spatio-temporal coherence and human-aligned reasoning on VMB, EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, advancing long-form video understanding in vision-language models.
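As a rough illustration of what such an environment-anchored semantic graph might look like in code, here is a minimal Python sketch. Everything in it is hypothetical: the `Zone`, `Interaction`, and `VideoSceneGraph` names, fields, and helper methods are illustrative stand-ins rather than the paper's actual schema. The point is only that activity zones carry 3D layout coordinates while hand-object interactions stay grounded to frame intervals, so moments scattered across hours of video can be retrieved by place rather than by time alone.

```python
from dataclasses import dataclass, field

@dataclass
class Zone:
    """An activity zone: a spatial cluster of recurring activity in the scene layout."""
    name: str
    center_xyz: tuple          # approximate 3D position of the zone
    objects: list = field(default_factory=list)

@dataclass
class Interaction:
    """One hand-object interaction, grounded to a frame interval."""
    obj: str
    action: str
    start_frame: int
    end_frame: int

class VideoSceneGraph:
    """Hypothetical container linking zones, objects, and interactions."""
    def __init__(self):
        self.zones = {}          # zone name -> Zone
        self.interactions = []   # (zone name, Interaction) pairs

    def add_zone(self, name, center_xyz):
        self.zones[name] = Zone(name, center_xyz)

    def add_interaction(self, zone, obj, action, start, end):
        # Anchor the object to the zone where the interaction happened.
        if obj not in self.zones[zone].objects:
            self.zones[zone].objects.append(obj)
        self.interactions.append((zone, Interaction(obj, action, start, end)))

# Two far-apart moments involving the same object end up anchored to one place.
graph = VideoSceneGraph()
graph.add_zone("kitchen counter", (1.2, 0.0, 3.4))
graph.add_interaction("kitchen counter", "kettle", "pick up", 120, 180)
graph.add_interaction("kitchen counter", "kettle", "put down", 2400, 2430)
```

The payoff of this layout is that a question like "where did the kettle end up?" becomes a lookup over a handful of zones rather than a scan over thousands of frames.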

📝 Abstract
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
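The abstract's central move, compressing hours of video into a structure small enough for an LLM to parse as text, can be sketched by serializing the hypothetical graph above into a prompt. Again, this is only an assumption-laden illustration: the serialization format and prompt wording are invented, and `VideoSceneGraph` is the toy class from the previous sketch, not the paper's implementation.

```python
def serialize_for_llm(graph):
    """Flatten the toy VideoSceneGraph into compact text an LLM can reason
    over, instead of raw frames that would overflow its context window."""
    lines = []
    for name, zone in graph.zones.items():
        lines.append(f"Zone '{name}' at {zone.center_xyz}: objects {zone.objects}")
    for zone_name, ev in graph.interactions:
        lines.append(
            f"[frames {ev.start_frame}-{ev.end_frame}] {ev.action} '{ev.obj}' in '{zone_name}'"
        )
    return "\n".join(lines)

# Hypothetical layout-aware question, in the spirit of the VMB tasks.
prompt = (
    "Using this structured description of a long video, answer the question.\n"
    + serialize_for_llm(graph)
    + "\nQuestion: Where was the kettle last put down, and roughly when?"
)
```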
Problem

Research questions and friction points this paper is trying to address.

Vision Language Models
Long Video Understanding
Spatio-Temporal Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

VideoMindPalace
Large Vision Language Models
Environment-based Semantic Graphs
👥 Authors
Zeyi Huang (University of Wisconsin-Madison)
Yuyang Ji (Drexel) · Computer Vision, Vision Large Language Models
Xiaofang Wang (Meta GenAI) · Computer Vision, Deep Learning
Nikhil Mehta (Google DeepMind) · Deep Learning, Continual Online Learning, Bayesian Neural Networks
Tong Xiao (Meta)
Donghyun Lee (University of Wisconsin-Madison)
S. Zha (Meta)
Bolin Lai (Georgia Institute of Technology) · Multimodal Learning, LLMs, Image Generation, Video Generation
Licheng Yu (Meta)
Ning Zhang (Meta)
Yong Jae Lee (University of Wisconsin-Madison)
Miao Liu (Meta)