🤖 AI Summary
To address temporal inconsistency in predictions across consecutive frames within multimodal test-time adaptation (MM-TTA), this paper proposes a spatiotemporal voxel (ST-voxel)-based test-time adaptation method for cross-modal 3D segmentation. By constructing ST-voxels over sliding windows of frames, the method explicitly enforces prediction consistency among geometrically adjacent points across time. It further introduces ST-voxel entropy filtering and spatial-temporal attention-guided cross-modal feature alignment to improve inter-modal robustness. Additionally, a multi-sliding-window evaluation strategy, Latte++, is designed to assess intra-modal prediction consistency more thoroughly before cross-modal fusion. This work is the first to incorporate ST-voxel mechanisms into MM-TTA, jointly optimizing intra-modal temporal consistency and cross-modal semantic alignment. Evaluated on five MM-TTA benchmarks, the method achieves state-of-the-art performance, significantly outperforming existing test-time and multimodal adaptation approaches.
📝 Abstract
Multi-modal test-time adaptation (MM-TTA) adapts models to an unlabeled target domain in an online manner by leveraging complementary multi-modal inputs. Previous MM-TTA methods for 3D segmentation rely on cross-modal predictions within each input frame, ignoring the fact that predictions of geometric neighborhoods across consecutive frames are highly correlated, which leads to unstable predictions over time. To fill this gap, we propose ReLiable Spatial-temporal Voxels (Latte), an MM-TTA method that leverages reliable cross-modal spatial-temporal correspondences for multi-modal 3D segmentation. Motivated by the observation that reliable predictions should be consistent with their spatial-temporal correspondences, Latte aggregates consecutive frames in a sliding-window manner and constructs Spatial-Temporal (ST) voxels to capture temporally local prediction consistency for each modality. After filtering out ST voxels with high ST entropy, Latte conducts cross-modal learning for each point and pixel by attending to those with reliable and consistent predictions among both spatial and temporal neighborhoods. Since prediction consistency may vary under different sliding windows, we further propose Latte++, which leverages ST voxels generated under various sliding windows to more thoroughly evaluate intra-modal prediction consistency before cross-modal fusion. Experimental results show that both Latte and Latte++ achieve state-of-the-art performance on five MM-TTA benchmarks compared to previous MM-TTA or TTA methods. Code will be available at https://github.com/AronCao49/Latte-plusplus.
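To make the core idea concrete, below is a minimal NumPy sketch of ST-voxel construction and ST-entropy filtering as the abstract describes it: point-wise predictions from a sliding window of frames (assumed already aligned to a common coordinate frame) are grouped into spatial voxels, a per-voxel prediction entropy is computed, and points in high-entropy (inconsistent) voxels are discarded. The function name, shapes, voxel size, and entropy threshold are all illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual method.

```python
import numpy as np

def st_voxel_filter(points, preds, num_classes, voxel_size=0.5, max_entropy=0.5):
    """Hypothetical ST-voxel reliability filter.

    points: (N, 3) xyz coordinates stacked from consecutive frames,
            already transformed into a shared coordinate frame.
    preds:  (N,) hard class labels predicted per point.
    Returns a boolean mask of points lying in low-entropy ST voxels.
    """
    # Assign each point to a spatial-temporal voxel by quantizing coordinates.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)  # voxel id per point
    num_voxels = inv.max() + 1

    # Histogram of class predictions inside each voxel.
    hist = np.zeros((num_voxels, num_classes))
    np.add.at(hist, (inv, preds), 1.0)
    probs = hist / hist.sum(axis=1, keepdims=True)

    # "ST entropy": low when neighboring predictions across frames agree.
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -np.sum(np.where(probs > 0, probs * np.log(probs), 0.0), axis=1)
    ent /= np.log(num_classes)  # normalize to [0, 1]

    # Keep points whose voxel shows consistent (low-entropy) predictions.
    return ent[inv] <= max_entropy
```

In this sketch, two points of the same class falling into one voxel yield zero entropy (kept), while a voxel containing disagreeing predictions approaches entropy 1 (filtered). Latte++ would repeat this evaluation under several window lengths before fusing modalities, but that multi-window aggregation is omitted here.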