🤖 AI Summary
This work proposes a novel autoregressive paradigm for multi-object tracking (MOT) grounded in large language models (LLMs), reframing tracking as a sequence generation task to overcome the rigidity and task-specific constraints of existing approaches. By directly generating structured tracking outputs in an autoregressive manner, the method eliminates the need for custom-designed output heads. Key innovations include an Object Tokenizer, a region-aware alignment module, and a temporal memory fusion mechanism, which collectively enable effective integration of visual and linguistic representations while modeling cross-frame associations. The approach achieves performance on par with state-of-the-art methods on MOT17 and DanceTrack, demonstrating its generality, flexibility, and feasibility for instruction-driven and diverse tracking scenarios.
📝 Abstract
As multi-object tracking (MOT) tasks continue to evolve toward more general and multi-modal scenarios, the rigid and task-specific architectures of existing MOT methods increasingly hinder their applicability across diverse tasks and limit flexibility in adapting to new tracking formulations. Most approaches rely on fixed output heads and bespoke tracking pipelines, making them difficult to extend to more complex or instruction-driven tasks. To address these limitations, we propose AR-MOT, a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework. This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads. To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector. To mitigate the misalignment between global and regional features, we propose a Region-Aware Alignment (RAA) module. To support long-term tracking, we design a Temporal Memory Fusion (TMF) module that caches historical object tokens. AR-MOT offers strong potential for extensibility, as new modalities or instructions can be integrated by simply modifying the output sequence format without altering the model architecture. Extensive experiments on MOT17 and DanceTrack validate the feasibility of our approach, achieving performance comparable to state-of-the-art methods while laying the foundation for more general and flexible MOT systems.
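To make the "tracking as sequence generation" idea concrete, the sketch below shows one plausible way per-frame tracking results could be serialized into a flat text sequence (the form an autoregressive LLM would emit) and parsed back into identity–box pairs. The `<obj>` tag format, field names, and integer coordinates are illustrative assumptions, not the paper's actual output vocabulary.

```python
import re

def serialize_tracks(tracks):
    """Serialize per-frame tracks into a flat text sequence.

    `tracks` is a list of (track_id, (x1, y1, x2, y2)) tuples.
    The <obj> tag format is a hypothetical example, not AR-MOT's
    actual token scheme.
    """
    parts = []
    for tid, (x1, y1, x2, y2) in tracks:
        parts.append(f"<obj> id={tid} box={x1},{y1},{x2},{y2} </obj>")
    return " ".join(parts)

def parse_tracks(seq):
    """Inverse of serialize_tracks: recover (id, box) pairs from text."""
    pattern = r"<obj> id=(\d+) box=(\d+),(\d+),(\d+),(\d+) </obj>"
    return [(int(m[0]), tuple(int(v) for v in m[1:]))
            for m in re.findall(pattern, seq)]
```

Because the output is just a sequence, extending the task (e.g. adding a class label or an instruction-conditioned attribute per object) only requires changing this serialization format, not the model's output head, which is the flexibility argument the abstract makes.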