COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high parameter count and inference latency of Transformer-based models in Vision-and-Language Navigation (VLN), particularly when augmented with external knowledge or map representations, this paper proposes COSMO, a lightweight and efficient hybrid architecture that integrates State Space Models (SSMs) with Transformers. Its key contributions are: (1) the Round Selective Scan (RSS), a novel scanning mechanism enabling deep cross-modal interaction within a single scan; and (2) the Cross-modal Selective State Space (CS3) module, a dual-stream adaptation of the selective state space for modality-aware state evolution. Evaluated on three major VLN benchmarks (REVERIE, R2R, and R2R-CE), COSMO achieves competitive navigation performance while reducing model parameters by 23%–41% and inference latency by 35%–52%, significantly improving computational and energy efficiency.

📝 Abstract
Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions boost performance but also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.
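As a rough illustration of the single-scan idea (the interleaving order, names, and modality labels below are guesses for illustration only, not the paper's actual RSS ordering), one way for a single linear scan to touch both modalities is to interleave visual and language token indices before scanning:

```python
def round_scan_order(n_vis, n_txt):
    """Interleave visual and language token indices so that one linear
    scan alternates between modalities. This is a hypothetical sketch
    of the spirit of a single-scan cross-modal pass, not the RSS
    algorithm from the paper."""
    order = []
    for i in range(max(n_vis, n_txt)):
        if i < n_vis:
            order.append(("vis", i))   # next visual token, if any remain
        if i < n_txt:
            order.append(("txt", i))   # next language token, if any remain
    return order

print(round_scan_order(3, 2))
# [('vis', 0), ('txt', 0), ('vis', 1), ('txt', 1), ('vis', 2)]
```

Because every token of one modality is scanned adjacent to tokens of the other, a recurrent state carried along this order mixes information from both streams in one pass.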
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in Vision-and-Language Navigation tasks
Enhancing cross-modal interactions without external knowledge bases
Balancing performance and efficiency in transformer-based VLN models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines state-space and transformer modules
Uses Round Selective Scan for inter-modal interactions
Implements Cross-modal Selective State Space Module
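The selective state space mechanism these modules build on can be sketched in plain NumPy. The following is a minimal Mamba-style selective scan with assumed shapes and parameter names, shown only to clarify what "selective" means (the B, C, and step-size parameters depend on the input at each step); it is not COSMO's implementation:

```python
import numpy as np

def selective_scan(x, A, B_proj, C_proj, dt_proj):
    """Minimal input-dependent (selective) SSM scan.

    x: (L, D) token sequence; A: (D, N) per-channel state decay.
    B_proj, C_proj: (D, N) and dt_proj: (D, D) make the input matrix B,
    output matrix C, and step size dt functions of the current input,
    which is what makes the scan 'selective'.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                          # hidden state per channel
    ys = np.empty((L, D))
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ dt_proj))     # softplus step size, (D,)
        B = x[t] @ B_proj                         # input-dependent, (N,)
        C = x[t] @ C_proj                         # input-dependent, (N,)
        Abar = np.exp(dt[:, None] * A)            # discretized decay, (D, N)
        h = Abar * h + dt[:, None] * x[t][:, None] * B[None, :]
        ys[t] = h @ C                             # readout, (D,)
    return ys

rng = np.random.default_rng(0)
L, D, N = 6, 4, 3
x = rng.standard_normal((L, D))
A = -np.abs(rng.standard_normal((D, N)))          # negative for stability
out = selective_scan(x, A,
                     rng.standard_normal((D, N)) * 0.1,
                     rng.standard_normal((D, N)) * 0.1,
                     rng.standard_normal((D, D)) * 0.1)
print(out.shape)  # (6, 4)
```

A dual-stream variant in the spirit of CS3 would run two such scans, one per modality, and let each stream's input-dependent parameters be conditioned on the other stream.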
Authors

Siqi Zhang
School of Computer Science and Technology, Tongji University

Yanyuan Qiao
Postdoctoral Research Fellow, EPFL
Embodied AI, Vision and Language, Multi-modal Learning

Qunbo Wang
Institute of Automation, Chinese Academy of Sciences

Zike Yan
PostDoc, Tsinghua University; PhD, Peking University
3D Vision, Robotics, Continual Learning

Qi Wu
Australian Institute for Machine Learning, The University of Adelaide

Zhihua Wei
School of Computer Science and Technology, Tongji University

Jing Liu
Institute of Automation, Chinese Academy of Sciences