Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video semantic segmentation methods typically process frames independently, neglecting temporal consistency and leading to unstable performance in dynamic scenes. To address this limitation, this work proposes a spatiotemporal attention (STA) mechanism that integrates multi-frame contextual information into Transformer architectures, enhancing temporal feature modeling while maintaining computational efficiency. The STA mechanism is implemented as an extension of standard self-attention and can be seamlessly adapted to various Transformer models—ranging from lightweight to large-scale variants—with only minor fine-tuning. Experimental results on the Cityscapes and BDD100k datasets demonstrate significant improvements over state-of-the-art approaches, with gains of up to 9.20 percentage points in temporal consistency and 1.76 percentage points in mean Intersection over Union (mIoU).

📝 Abstract
Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
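The abstract describes STA as a modification of standard self-attention that lets tokens of the current frame attend over a spatio-temporal sequence built from several frames. A minimal sketch of that idea, assuming a simple token layout where the T frames' patch tokens are concatenated into one key/value sequence (the exact projections and layout here are illustrative assumptions, not the paper's design):

```python
# Hypothetical sketch of spatio-temporal attention: scaled dot-product
# attention where queries come from the current frame's tokens and
# keys/values come from all T frames flattened into one sequence.
# Token layout and single-head projections are assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(frames, Wq, Wk, Wv):
    """frames: (T, N, D) patch tokens for T frames, N tokens each, dim D.
    Returns attended tokens for the current (last) frame, shape (N, D)."""
    T, N, D = frames.shape
    tokens = frames.reshape(T * N, D)      # flatten time into the sequence
    q = frames[-1] @ Wq                    # queries: current frame only
    k = tokens @ Wk                        # keys/values: all frames
    v = tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))   # (N, T*N) spatio-temporal weights
    return attn @ v

rng = np.random.default_rng(0)
T, N, D = 3, 4, 8
frames = rng.standard_normal((T, N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
out = spatio_temporal_attention(frames, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because only the key/value sequence grows with T, this kind of extension keeps the per-frame query cost unchanged, which is consistent with the paper's claim of maintaining computational efficiency with minimal architectural changes.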
Problem

Research questions and friction points this paper is trying to address.

video semantic segmentation
temporal consistency
spatio-temporal attention
autonomous driving
transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatio-Temporal Attention
Video Semantic Segmentation
Temporal Consistency
Transformer Architecture
Multi-frame Context
Serin Varghese
Senior Data Scientist, CARIAD SE
machine learning, computer vision, network compression, data augmentation, temporal consistency
Kevin Ross
Heinrich-Heine-University Düsseldorf, Department of Computer Science, Düsseldorf, Germany
Fabian Hueger
CARIAD SE, Wolfsburg, Germany
Kira Maag
Heinrich-Heine-University Düsseldorf
Computer Vision, Deep Learning