AI Summary
This work addresses the challenge of time-to-collision (TTC) prediction in videos, which requires capturing multi-scale spatiotemporal patterns at both local and global levels. To this end, the authors propose CollideNet, a hierarchical multi-scale video representation learning architecture that introduces, for the first time in TTC estimation, a spatiotemporal hierarchical Transformer. The method explicitly decomposes temporal dynamics into trend, seasonal, and non-stationary components and integrates multi-resolution spatial features with this disentangled temporal representation to enable accurate TTC prediction. Extensive experiments demonstrate that CollideNet significantly outperforms existing approaches on three public benchmarks, achieving new state-of-the-art performance and exhibiting strong cross-dataset generalization capabilities.
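As a rough illustration of what "multi-resolution spatial features" can mean, the sketch below builds a small average-pooling pyramid from a single frame. It is only a toy under stated assumptions: the names `avg_pool` and `build_pyramid` and the pooling factors are hypothetical, and CollideNet itself uses a hierarchical Transformer rather than plain pooling.

```python
from typing import List, Sequence

Frame = List[List[float]]  # a single-channel frame as rows of pixel values


def avg_pool(frame: Frame, k: int) -> Frame:
    """Downsample by averaging non-overlapping k x k blocks.

    Rows and columns that do not fill a complete block are dropped.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    rows, cols = len(frame) // k, len(frame[0]) // k
    return [
        [
            sum(frame[r * k + i][c * k + j] for i in range(k) for j in range(k)) / (k * k)
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(frame: Frame, factors: Sequence[int] = (1, 2, 4)) -> List[Frame]:
    """Return one pooled copy of the frame per factor (hypothetical choice of scales)."""
    return [avg_pool(frame, k) for k in factors]


# A 4 x 4 toy frame with values 0..15.
frame = [[float(4 * r + c) for c in range(4)] for r in range(4)]
fine, mid, coarse = build_pyramid(frame)
```

Here `fine` keeps local detail, while `coarse` collapses the frame to a single global average. A multi-scale model would consume features from all levels rather than from one resolution alone.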
Abstract
Time-to-Collision (TTC) forecasting is a critical task in collision prevention. It requires precise temporal prediction and an understanding of both local and global patterns in a video, spatially and temporally. To address the multi-scale nature of video, we introduce CollideNet, a novel spatiotemporal hierarchical transformer-based architecture tailored to TTC forecasting. In the spatial stream, CollideNet aggregates information from each video frame at multiple resolutions simultaneously. In the temporal stream, alongside multi-scale feature encoding, CollideNet disentangles the non-stationarity, trend, and seasonality components. Our method sets a new state of the art on three commonly used public datasets, outperforming prior works by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
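To make the trend/seasonality split concrete, the following minimal sketch separates a 1-D feature sequence into a moving-average trend and a residual. It assumes a classical moving-average decomposition, which is a common choice in time-series Transformers but is not confirmed here as the authors' method. The function name `decompose` and the `window` parameter are hypothetical, and the non-stationarity handling is not shown.

```python
from typing import List, Tuple


def decompose(series: List[float], window: int = 3) -> Tuple[List[float], List[float]]:
    """Split a series into a moving-average trend and a residual (seasonal) part.

    Assumption: `window` is odd, so each average is centered on its time step.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    if not series:
        return [], []
    half = window // 2
    # Replicate the edge values so the trend has the same length as the input.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]
    # Whatever the smooth trend does not explain is treated as the seasonal part.
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


# Toy per-frame feature: a steady rise with an alternating fluctuation on top.
signal = [0.0, 2.0, 2.0, 4.0, 4.0, 6.0, 6.0, 8.0]
trend, seasonal = decompose(signal, window=3)
```

By construction, `trend[i] + seasonal[i]` reproduces `signal[i]` at every step, so the two components can be modeled separately and recombined without losing information.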