π€ AI Summary
To address the challenge of future semantic scene forecasting in dynamic autonomous driving environments, this paper proposes FUTURISTβa framework for high-resolution short- to mid-term future semantic segmentation prediction. Methodologically, it introduces (i) a novel multimodal masked visual modeling objective with a dedicated masking mechanism; (ii) a VAE-free hierarchical tokenization pipeline enabling end-to-end multimodal training; and (iii) a multimodal visual sequence Transformer integrating masked self-supervised learning with joint optimization. Its key contributions are: (i) the first application of masked modeling to future semantic prediction, substantially improving modeling efficiency and representation capability; and (ii) state-of-the-art performance on the Cityscapes benchmark, achieving simultaneous gains in prediction accuracy and computational efficiency.
π Abstract
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST .