MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address insufficient long-range dependency modeling and weak cross-task interaction in multi-task dense scene understanding, this paper proposes the first Mamba-based multi-task decoder architecture. Methodologically, it introduces: (1) two core block types, the self-task Mamba (STM) block and the cross-task Mamba (CTM) block; (2) feature-level (F-CTM) and semantic-level (S-CTM) cross-task interaction mechanisms for fine-grained task coupling; and (3) the first integration of the Mamba state-space model into an end-to-end multi-task dense prediction framework. Extensive experiments on NYUDv2, PASCAL-Context, and Cityscapes demonstrate consistent superiority over CNN- and Transformer-based baselines across multiple metrics, achieving new state-of-the-art performance. The source code is publicly available.

📝 Abstract
Multi-task dense scene understanding, which trains a single model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM block, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Experiments on NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior performance of MTMamba++ over CNN-based and Transformer-based methods. The code is available at https://github.com/EnVision-Research/MTMamba.
Problem

Research questions and friction points this paper is trying to address.

Capturing long-range dependencies in multi-task dense prediction
Strengthening cross-task interactions so tasks can exchange complementary information
Innovation

Methods, ideas, or system contributions that make the work stand out.

First Mamba-based decoder for multi-task dense scene understanding
STM blocks capture long-range dependencies via state-space models
F-CTM and S-CTM blocks enhance cross-task interaction at the feature and semantic levels
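The two block types rest on a simple primitive: a linear state-space recurrence over the token sequence (the core of Mamba), plus a mechanism for mixing features across task branches. The sketch below is illustrative only, not the authors' implementation: `ssm_scan` is a plain (non-selective) state-space scan, and `cross_task_mix` stands in for the learned F-CTM/S-CTM gating with a fixed blend weight `alpha`; all names and shapes are assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) token sequence; A: (d_state, d_state); B: (d_state, d_in);
    C: (d_out, d_state). Returns y: (T, d_out). The recurrent state h lets
    every output depend on the whole prefix, which is how SSMs capture
    long-range dependencies in linear time.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]   # recurrent state update
        ys.append(C @ h)       # per-step readout
    return np.stack(ys)

def cross_task_mix(feat_a, feat_b, alpha=0.5):
    """Toy cross-task exchange: each task keeps a fraction alpha of its own
    features and receives the rest from the other task. The paper's CTM
    blocks learn this coupling instead of using a fixed scalar."""
    mixed_a = alpha * feat_a + (1 - alpha) * feat_b
    mixed_b = alpha * feat_b + (1 - alpha) * feat_a
    return mixed_a, mixed_b
```

With `A = 0`, `B = I`, `C = I` the scan reduces to the identity map, which is a handy sanity check; a trained Mamba block would instead parameterize (and input-condition) these matrices.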
🔎 Similar Papers