🤖 AI Summary
To address the challenges of heterogeneous multimodal data fusion and high computational overhead in resource-constrained embodied robotics scenarios, this paper proposes LCMF, a lightweight cascaded attention framework. LCMF integrates cross-attention with selective state space models (SSMs) and introduces a multi-level cross-modal parameter-sharing mechanism to achieve complementary semantic alignment between images/videos and text. Compared to mainstream approaches, LCMF reduces model parameters to 166.51M (image–text) and 219M (video–text) and cuts FLOPs by 4.35× relative to the average of comparable baselines. It achieves 74.29% accuracy on visual question answering (VQA), and on embodied question answering (EQA) with video inputs it delivers competitive mid-tier performance among large language model–based agents. These results demonstrate substantial improvements in cross-modal understanding efficiency and deployment feasibility for edge-constrained robotic systems.
📝 Abstract
Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as the effective fusion of heterogeneous data and computational efficiency in resource-constrained environments. To address these challenges, this study proposes LCMF, a lightweight cascaded attention framework that introduces a multi-level cross-modal parameter-sharing mechanism into the Mamba module. By combining the advantages of cross-attention and selective state space models (SSMs) with shared parameters, the framework achieves efficient fusion of heterogeneous modalities and complementary semantic alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% on VQA tasks and achieves competitive mid-tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) on EQA video tasks. Its lightweight design reduces FLOPs by 4.35× relative to the average of comparable baselines while using only 166.51M parameters (image–text) and 219M parameters (video–text), providing an efficient solution with strong multimodal decision generalization for Human-Robot Interaction (HRI) applications in resource-constrained scenarios.
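To make the cascaded design concrete, the sketch below illustrates the two ingredients the abstract names: cross-attention between modalities followed by a Mamba-style selective SSM scan, with one set of attention projections shared across both attention directions as a stand-in for the paper's cross-modal parameter sharing. This is a minimal NumPy illustration under assumed toy dimensions, not the authors' implementation; all names (`cross_attention`, `selective_ssm`, the weight matrices) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 16, 4        # toy sizes, not the paper's
n_img, n_txt = 6, 5             # visual patches, text tokens

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def softplus(x):
    return np.log1p(np.exp(x))

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    # One attention direction: `queries` attend to `keys_values`.
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def selective_ssm(x, A, Wb, Wc, Wdt):
    # Minimal Mamba-style scan: B, C, and the step size are input-dependent.
    h = np.zeros((d_model, d_state))
    ys = []
    for x_t in x:
        dt = softplus(x_t @ Wdt)              # (d_model,) per-channel step size
        B_t, C_t = x_t @ Wb, x_t @ Wc         # (d_state,) selective parameters
        h = np.exp(dt[:, None] * A) * h + dt[:, None] * B_t[None, :] * x_t[:, None]
        ys.append(h @ C_t)                    # readout: (d_model,)
    return np.stack(ys)

# Shared projections: the SAME Wq/Wk/Wv serve both attention directions,
# a stand-in for the multi-level cross-modal parameter-sharing idea.
Wq, Wk, Wv = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(3))
A = -np.abs(rng.normal(0, 1, (d_model, d_state)))   # negative => stable decay
Wb = rng.normal(0, 0.1, (d_model, d_state))
Wc = rng.normal(0, 0.1, (d_model, d_state))
Wdt = rng.normal(0, 0.1, (d_model, d_model))

img = rng.normal(0, 1, (n_img, d_model))    # stand-in visual features
txt = rng.normal(0, 1, (n_txt, d_model))    # stand-in text embeddings

# Cascade: bidirectional cross-attention with shared weights, then one
# selective-SSM pass over the fused token sequence.
txt_attends_img = cross_attention(txt, img, Wq, Wk, Wv)
img_attends_txt = cross_attention(img, txt, Wq, Wk, Wv)
fused = np.concatenate([txt + txt_attends_img, img + img_attends_txt], axis=0)
out = selective_ssm(fused, A, Wb, Wc, Wdt)
print(out.shape)    # (n_txt + n_img, d_model) = (11, 16)
```

Sharing the projection matrices across both attention directions is what keeps the parameter count low in this toy setup; the actual framework applies sharing at multiple levels and to video as well as image inputs.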