CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Embodied visual reasoning (EVR) faces two core challenges: understanding complex, open-ended instructions and modeling spatiotemporal dynamics in long-horizon first-person videos. Existing approaches either rely on LLMs processing static captions—losing fine-grained visual details—or adopt end-to-end vision-language models (VLMs), which lack support for stepwise compositional reasoning. This paper proposes a training-free, dynamic cognitive mapping framework that synergistically integrates the high-level planning capability of LLMs with the open-world perceptual capacity of VLMs via cross-modal iterative inference, enabling semantic-aware and structured scene representation. The framework supports online updating of cognitive maps and contextual evolution, substantially improving performance on long-horizon, visually dependent tasks. Evaluated across multiple benchmarks, it demonstrates strong effectiveness and generalization. Its key innovation lies in decoupling high-level reasoning from low-level perception—achieving, for the first time, training-free, evolvable multi-step embodied visual reasoning.

📝 Abstract
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
Problem

Research questions and friction points this paper is trying to address.

Handles complex instructions in embodied visual reasoning
Integrates LLMs and VLMs for perception and reasoning
Manages long-term spatiotemporal dynamics in egocentric videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free LLM-VLM synergy framework
Dynamic Cognitive Map for scene representation
Iterative visual context updating
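
The three contributions above can be pictured as one loop: an LLM planner queries a VLM perceiver, and each observation is folded into an evolving cognitive map. The sketch below is an illustrative mock-up only, not the paper's implementation; `llm_plan`, `vlm_perceive`, the map schema, and the example entities are all hypothetical stand-ins for real model calls and prompts.

```python
# Hypothetical sketch of a training-free LLM-VLM loop with an evolving
# cognitive map, loosely following the CLiViS description. Real systems
# would replace the two stub functions with actual LLM/VLM calls.

def llm_plan(instruction, cognitive_map):
    """Stand-in for the LLM planner: choose the next perception query,
    or decide the map already supports answering the instruction."""
    if "kitchen" not in cognitive_map["entities"]:
        return {"action": "perceive", "query": "What room is visible?"}
    return {"action": "answer"}

def vlm_perceive(query, frame):
    """Stand-in for the VLM perceiver: ground the query in one frame."""
    return {"entity": "kitchen", "relation": ("agent", "in", "kitchen")}

def run_clivis_style_loop(instruction, frames, max_steps=5):
    # Structured scene representation, updated online at each step.
    cognitive_map = {"entities": set(), "relations": []}
    for _, frame in zip(range(max_steps), frames):
        plan = llm_plan(instruction, cognitive_map)
        if plan["action"] == "answer":
            break  # high-level reasoning deems the map sufficient
        obs = vlm_perceive(plan["query"], frame)
        cognitive_map["entities"].add(obs["entity"])
        cognitive_map["relations"].append(obs["relation"])
    return cognitive_map

result = run_clivis_style_loop("Where did I leave my mug?",
                               ["frame0", "frame1"])
print(sorted(result["entities"]))
```

The key point the sketch captures is the decoupling: the planner never sees pixels and the perceiver never sees the full instruction history; the cognitive map is the only shared state, which is what makes the loop training-free and evolvable.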
Authors
Kailing Li — School of Computer Science and Technology, East China Normal University
Qi'ao Xu — East China Normal University
Tianwen Qian — East China Normal University
Yuqian Fu — INSAIT, Sofia University "St. Kliment Ohridski"
Yang Jiao — Fudan University
Xiaoling Wang — School of Computer Science and Technology, East China Normal University

Topics: Multimedia, Vision and Language, Embodied AI