🤖 AI Summary
This work proposes the first open-vocabulary autoregressive framework for 3D multi-object tracking, addressing the limitations of existing methods that rely on closed-set assumptions and heuristic strategies lacking semantic understanding, which hinder generalization to unseen categories. By modeling 3D trajectories as structured spatiotemporal semantic sequences, the approach leverages the autoregressive capabilities and linguistic priors of a 0.5B-parameter large language model to jointly capture motion continuity and semantic consistency, enabling end-to-end identity association and trajectory prediction. The method exploits the hierarchical structure and commonsense reasoning in language space to resolve semantic ambiguities, achieving an AMOTA of 22.41% on novel classes in nuScenes—an improvement of 20.21 percentage points over the baseline—and demonstrates strong generalization across V2X-Seq-SPD and KITTI datasets.
📝 Abstract
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), a paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and to maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, an absolute improvement of 20.21 percentage points over the baseline. These gains are realized with a compact 0.5B-parameter autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
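To make the idea of a "structured spatio-temporal semantic sequence" concrete, the following is a minimal illustrative sketch of how a 3D track might be flattened into text that an autoregressive model could complete. It is not taken from the NOVA code: the `Box3D` fields, the `serialize_track` helper, and the `[category] <t0> ...` token layout are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box3D:
    """One observed 3D box at a single frame (hypothetical layout)."""
    x: float
    y: float
    z: float
    yaw: float


def serialize_track(category: str, boxes: List[Box3D], precision: int = 1) -> str:
    """Flatten a track into one text sequence ending with an open next-step slot.

    The category name is kept as free text so that an open-vocabulary label
    (for example "stroller") needs no fixed class index. The token format is
    an assumption for illustration, not the format used by NOVA.
    """
    steps = [
        f"<t{i}> {b.x:.{precision}f} {b.y:.{precision}f} "
        f"{b.z:.{precision}f} {b.yaw:.{precision}f}"
        for i, b in enumerate(boxes)
    ]
    # The trailing step marker is the position a language model would complete.
    return f"[{category}] " + " ".join(steps) + f" <t{len(boxes)}>"


track = [Box3D(1.0, 2.0, 0.5, 0.1), Box3D(1.5, 2.0, 0.5, 0.1)]
prompt = serialize_track("stroller", track)
print(prompt)
```

In a setup like this, the semantic label and the motion history share one sequence, so a single next-step completion could in principle carry both the predicted position and the identity context. How NOVA actually tokenizes boxes and performs association is not specified in the abstract.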