S3MOT: Monocular 3D Object Tracking with Selective State Space Model

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular 3D multi-object tracking (3D MOT) must cope with geometric ambiguity, spatiotemporal association across frames, and accurate 6-DoF pose estimation. This paper proposes an end-to-end framework that integrates geometric, motion, and appearance cues through three contributions: (1) the Hungarian State Space Model (HSSM), a data association method with linear complexity and a global receptive field; (2) Fully Convolutional One-stage Embedding (FCOE), which improves object re-identification by applying contrastive learning directly to dense feature maps instead of ROI-pooled features; and (3) VeloSSM, an encoder-decoder module that models temporal dependencies in velocity to improve 6-DoF pose estimation beyond frame-based 3D inference. On the KITTI public test benchmark, the approach reaches 76.86 HOTA at 31 FPS, improving on the previous best by +2.63 HOTA and +3.62 AssA.

📝 Abstract
Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) We introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
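For context, the "traditional linear assignment algorithms that rely on hand-crafted association costs" mentioned in the abstract can be illustrated with a small sketch. This is the classical baseline that HSSM is contrasted with, not the paper's method. The function name `associate`, the Euclidean 3D-center cost, and the `max_cost` gate are all assumptions chosen for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm, which finds the same optimum far more efficiently.

```python
from itertools import permutations
from math import dist


def associate(tracks, detections, max_cost=2.0):
    """Match tracks to detections by minimizing the summed 3D center distance.

    Hypothetical baseline: the cost is a hand-crafted Euclidean distance, and
    the optimum is found by brute force, so this only suits a handful of objects.
    Returns a list of (track_index, detection_index) pairs.
    """
    swapped = len(tracks) > len(detections)
    # Always enumerate assignments of the shorter list into the longer one.
    small, large = (detections, tracks) if swapped else (tracks, detections)

    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(dist(small[i], large[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    matches = []
    for i, j in enumerate(best_perm):
        # Gate out pairs that are too far apart to be the same object.
        if dist(small[i], large[j]) <= max_cost:
            matches.append((j, i) if swapped else (i, j))
    return sorted(matches)


tracks = [(0.0, 0.0, 10.0), (5.0, 0.0, 20.0)]
detections = [(5.2, 0.0, 20.5), (0.1, 0.0, 10.3), (30.0, 0.0, 40.0)]
print(associate(tracks, detections))
```

In this toy case track 0 is matched to detection 1 and track 1 to detection 0, while the distant third detection stays unmatched. The cost function and threshold here are fixed by hand, which is the limitation the abstract says HSSM addresses with learned, dynamic weights.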
Problem

Research questions and friction points this paper is trying to address.

Enhancing monocular 3D object tracking accuracy
Improving spatiotemporal association in 2D video streams
Overcoming challenges in 6-DoF pose estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

HSSM: Hungarian State Space Model for efficient data association
FCOE: Fully Convolutional One-stage Embedding for re-identification
VeloSSM: Encoder-decoder for velocity-based 6-DoF pose estimation
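
Both HSSM and VeloSSM build on state space models, which process a sequence in a single pass and therefore scale linearly with its length. The sketch below shows only the generic discrete recurrence h_t = a*h_(t-1) + b*x_t, y_t = c*h_t applied to a sequence of per-frame velocity estimates. It is not the paper's architecture: the function name `ssm_scan` and the scalar parameters `a`, `b`, `c` are assumptions, and a selective SSM would make these parameters depend on the input rather than keeping them fixed.

```python
def ssm_scan(inputs, a=0.5, b=0.5, c=1.0):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    Hypothetical fixed parameters; one pass over the data, so O(T) time.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# Noisy per-frame speed estimates (m/s) for one tracked object.
frame_velocities = [10.0, 12.0, 8.0]
print(ssm_scan(frame_velocities))  # [5.0, 8.5, 8.25]
```

Each output blends the current measurement with the accumulated state, so a single-frame outlier influences the estimate less than it would under purely frame-based inference.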
Authors
Zhuohao Yan
School of Geodesy and Geomatics, Wuhan University, China
Shaoquan Feng
School of Geodesy and Geomatics, Wuhan University, China
Xingxing Li
GFZ
GPS, GNSS precise positioning and orbit determination, GNSS data processing, GNSS seismology, GNSS meteorology
Yuxuan Zhou
School of Geodesy and Geomatics, Wuhan University, China
Chunxi Xia
School of Geodesy and Geomatics, Wuhan University, China
Shengyu Li
School of Geodesy and Geomatics, Wuhan University, China