🤖 AI Summary
To address temporal inconsistency in predictions across consecutive frames within multimodal test-time adaptation (MM-TTA), this paper proposes a spatiotemporal voxel (ST-voxel)-based test-time adaptation method for cross-modal 3D segmentation. By constructing ST-voxels over sliding windows of frames, the method explicitly enforces prediction consistency among geometrically adjacent points across time. It further introduces ST-voxel entropy filtering and spatial-temporal attention-guided cross-modal feature alignment to improve inter-modal robustness. Additionally, a multi-sliding-window evaluation strategy, Latte++, is designed to assess intra-modal prediction consistency more thoroughly before cross-modal fusion. This work is the first to incorporate ST-voxel mechanisms into MM-TTA, jointly optimizing intra-modal temporal consistency and cross-modal semantic alignment. Evaluated on five MM-TTA benchmarks, the method achieves state-of-the-art performance, significantly outperforming existing test-time and multimodal adaptation approaches.
📝 Abstract
Multi-modal test-time adaptation (MM-TTA) adapts models to an unlabeled target domain in an online manner by leveraging complementary multi-modal inputs. Previous MM-TTA methods for 3D segmentation rely on cross-modal predictions within each input frame, ignoring the fact that predictions of geometric neighborhoods across consecutive frames are highly correlated, which leads to unstable predictions over time. To fill this gap, we propose ReLiable Spatial-temporal Voxels (Latte), an MM-TTA method that leverages reliable cross-modal spatial-temporal correspondences for multi-modal 3D segmentation. Motivated by the observation that reliable predictions should be consistent with their spatial-temporal correspondences, Latte aggregates consecutive frames in a sliding-window manner and constructs Spatial-Temporal (ST) voxels to capture temporally local prediction consistency for each modality. After filtering out ST voxels with high ST entropy, Latte conducts cross-modal learning for each point and pixel by attending to those with reliable and consistent predictions among both spatial and temporal neighborhoods. Since prediction consistency may vary under different sliding windows, we further propose Latte++, which leverages ST voxels generated under various sliding windows to more thoroughly evaluate intra-modal prediction consistency before cross-modal fusion. Experimental results show that both Latte and Latte++ achieve state-of-the-art performance on five MM-TTA benchmarks compared to previous MM-TTA or TTA methods. Code will be available at https://github.com/AronCao49/Latte-plusplus.
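To make the core idea concrete, below is a minimal NumPy sketch of ST-voxel construction and ST-entropy filtering as the abstract describes it: point-wise predictions from a sliding window of frames (assumed already aligned to a common coordinate frame) are grouped into spatial voxels, a per-voxel prediction entropy is computed, and points in high-entropy (inconsistent) voxels are discarded. The function name, shapes, voxel size, and entropy threshold are all illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual method.

```python
import numpy as np

def st_voxel_filter(points, preds, num_classes, voxel_size=0.5, max_entropy=0.5):
    """Hypothetical ST-voxel reliability filter.

    points: (N, 3) xyz coordinates stacked from consecutive frames,
            already transformed into a shared coordinate frame.
    preds:  (N,) hard class labels predicted per point.
    Returns a boolean mask of points lying in low-entropy ST voxels.
    """
    # Assign each point to a spatial-temporal voxel by quantizing coordinates.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)  # voxel id per point
    num_voxels = inv.max() + 1

    # Histogram of class predictions inside each voxel.
    hist = np.zeros((num_voxels, num_classes))
    np.add.at(hist, (inv, preds), 1.0)
    probs = hist / hist.sum(axis=1, keepdims=True)

    # "ST entropy": low when neighboring predictions across frames agree.
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.sum(np.where(probs > 0, probs * np.log(probs), 0.0), axis=1)
    ent /= np.log(num_classes)  # normalize to [0, 1]

    # Keep points whose voxel shows consistent (low-entropy) predictions.
    return ent[inv] <= max_entropy
```

In this sketch, two points of the same class falling into one voxel yield zero entropy (kept), while a voxel containing disagreeing predictions approaches entropy 1 (filtered). Latte++ would repeat this evaluation under several window lengths before fusing modalities, but that multi-window aggregation is omitted here.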