ForecastOcc: Vision-based Semantic Occupancy Forecasting

📅 2026-02-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two limitations of existing vision-based methods: they struggle to model semantic information when predicting future scenes, and they rely on externally generated historical occupancy maps, which leads to error accumulation. We propose the first end-to-end framework that directly predicts multi-step 3D semantic occupancy from historical multi-view or monocular images, without requiring pre-built maps. Our approach integrates a temporal cross-attention mechanism, a 2D-to-3D view transformation, and a voxel-level semantic occupancy prediction head to map input images directly to future semantic occupancy. We establish new benchmarks on Occ3D-nuScenes and SemanticKITTI, demonstrating that our method significantly outperforms existing baselines and generates future scene predictions that are both semantically rich and dynamically accurate.

๐Ÿ“ Abstract
Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.
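The abstract describes a pipeline whose first stage fuses past camera features into the present via temporal cross-attention. As a rough illustration of that fusion step only, here is a minimal pure-Python sketch of scaled dot-product cross-attention, where a current-frame query attends over a list of past-frame feature vectors (keys and values are shared). All names and shapes here are illustrative assumptions, not the paper's actual implementation, which operates on full 3D feature volumes.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_cross_attention(query, past_feats):
    """Toy scaled dot-product cross-attention over past frames.

    query:      current-frame feature vector (list of floats)
    past_feats: list of past-frame feature vectors (keys == values here)
    Returns the fused feature and the attention weights over history.
    """
    d = len(query)
    # Similarity of the current query to each past frame, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
              for feat in past_feats]
    weights = softmax(scores)
    # Weighted sum of past features: history fused into the present query.
    fused = [sum(w * feat[i] for w, feat in zip(weights, past_feats))
             for i in range(d)]
    return fused, weights

# Example: a current query that resembles the first of two past frames
# receives a higher attention weight on that frame.
fused, weights = temporal_cross_attention([1.0, 0.0],
                                          [[1.0, 0.0], [0.0, 1.0]])
```

In the full architecture this fused feature would then pass through the 2D-to-3D view transformer, the 3D encoder, and the semantic occupancy head to produce voxel-level forecasts for each horizon.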
Problem

Research questions and friction points this paper is trying to address.

semantic occupancy forecasting
vision-based prediction
autonomous driving
spatio-temporal features
error accumulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic occupancy forecasting
vision-based prediction
temporal cross-attention
2D-to-3D view transformer
autonomous driving