CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

📅 2026-04-17
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of time-to-collision (TTC) prediction in videos, which requires capturing multi-scale spatiotemporal patterns at both local and global levels. To this end, the authors propose CollideNet, a hierarchical multi-scale video representation learning architecture that introduces, for the first time in TTC estimation, a spatiotemporal hierarchical Transformer. The method explicitly decomposes temporal dynamics into trend, seasonal, and non-stationary components and integrates multi-resolution spatial features with this disentangled temporal representation to enable accurate TTC prediction. Extensive experiments demonstrate that CollideNet significantly outperforms existing approaches on three public benchmarks, achieving new state-of-the-art performance and exhibiting strong cross-dataset generalization capabilities.
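As a rough illustration of what separating a trend from a seasonal component can mean, the sketch below applies a classical moving-average decomposition to a one-dimensional signal. This is an assumption made for illustration only: the summary does not say how CollideNet performs its disentanglement, the function name `decompose`, the `window` parameter, and the toy `signal` are hypothetical, and the non-stationarity component is not modeled here.

```python
from typing import List, Tuple


def decompose(series: List[float], window: int = 5) -> Tuple[List[float], List[float]]:
    """Split a 1-D signal into a moving-average trend and a residual.

    Illustrative only: a classical decomposition, not the CollideNet method.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not series:
        return [], []
    half = window // 2
    # Replicate the edge values so the trend has the same length as the input.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    # Trend: centered moving average over `window` samples.
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]
    # Seasonal (residual) part: whatever the smooth trend does not explain.
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


# Hypothetical per-frame feature: a linear ramp plus a period-2 oscillation.
signal = [0.1 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]
trend, seasonal = decompose(signal, window=5)
```

By construction, `trend` and `seasonal` sum back to `signal`, so no information is lost by the split. A learned model would typically operate on feature sequences rather than raw scalars.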

📝 Abstract
Time-to-Collision (TTC) forecasting is a critical task in collision prevention. It requires precise temporal prediction and an understanding of both local and global patterns in a video, spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture tailored to TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame at multiple resolutions simultaneously. In the temporal stream, alongside multi-scale feature encoding, CollideNet disentangles the non-stationarity, trend, and seasonality components. Our method outperforms prior works on three commonly used public datasets, setting a new state of the art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
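To make the idea of viewing one frame at multiple resolutions concrete, here is a minimal sketch that builds an average-pooling pyramid from a single 2-D frame. It is only an assumption-laden illustration: the abstract does not describe how the spatial stream aggregates resolutions, and the names `avg_pool`, `pyramid`, and the `factors` tuple are hypothetical rather than taken from the released code.

```python
from typing import List, Sequence

Frame = List[List[float]]


def avg_pool(frame: Frame, factor: int) -> Frame:
    """Downsample a 2-D frame by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(frame), len(frame[0])
    if factor < 1 or rows % factor or cols % factor:
        raise ValueError("frame size must be divisible by a positive factor")
    return [
        [
            sum(frame[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            / factor ** 2
            for c in range(0, cols, factor)
        ]
        for r in range(0, rows, factor)
    ]


def pyramid(frame: Frame, factors: Sequence[int] = (1, 2, 4)) -> List[Frame]:
    """Return the same frame at several resolutions (illustrative, not CollideNet's stream)."""
    return [avg_pool(frame, f) for f in factors]


# Hypothetical 4x4 single-channel frame with values 0..15.
frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
levels = pyramid(frame)
```

Here `levels[0]` keeps full detail, while `levels[2]` collapses the frame to a single global average. A multi-scale model would combine such fine and coarse views instead of choosing one.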
Problem

Research questions and friction points this paper is trying to address.

Time-to-Collision
video representation learning
multi-scale
spatiotemporal prediction
collision forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical transformer
multi-scale representation
temporal disentanglement
time-to-collision forecasting
spatiotemporal modeling