🤖 AI Summary
This paper addresses the challenging data association problem in identity-unlabeled multi-object tracking (MOT). We propose the first end-to-end differentiable self-supervised learning framework for MOT. Methodologically, we introduce a neural Kalman filter to model motion dynamics via Markovian assumptions, integrate Sinkhorn normalization for soft assignment, and jointly optimize filtering and association through a differentiable expectation-maximization (EM) algorithm—requiring neither ID labels nor prior trajectory knowledge. Our key contributions are: (i) the first joint training paradigm combining neural Kalman filtering with self-supervised EM; and (ii) fully unsupervised, end-to-end learnable data association. Evaluated on MOT17 and MOT20, our approach achieves state-of-the-art performance among self-supervised MOT methods. Remarkably, using only publicly available object detectors, it surpasses existing unsupervised approaches and demonstrates strong cross-dataset generalization capability.
📝 Abstract
This paper introduces a novel framework for learning data association for multi-object tracking in a self-supervised manner. Fully supervised learning methods are known to achieve excellent tracking performance, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that in real-world scenarios object motion can usually be represented by a Markov process, we present a novel expectation-maximization (EM) algorithm that trains a neural network to associate detections for tracking, without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter whose observation model is conditioned on associations of detections parameterized by a neural network. Given a batch of frames as input, data associations between detections in adjacent frames are predicted by a neural network followed by a Sinkhorn normalization that determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of observations given the inferred states, and maximizing this marginal probability by gradient descent serves as the training objective. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end-to-end. We evaluate our approach on the challenging MOT17 and MOT20 datasets and achieve state-of-the-art results compared to self-supervised trackers using public detections. We further demonstrate that the learned model generalizes across datasets.
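The marginal-probability objective described above can be illustrated with the simplest possible case: a 1-D linear-Gaussian state-space model, where the Kalman filter computes the log marginal likelihood of the observations via the prediction-error decomposition. This is only a sketch of the kind of quantity being maximized; the parameters `A`, `C`, `Q`, `R` below are illustrative and not from the paper, and the paper's model additionally conditions the observation model on learned associations.

```python
import numpy as np

def kalman_loglik(ys, A=1.0, C=1.0, Q=0.1, R=0.1, mu0=0.0, P0=1.0):
    """Log marginal likelihood log p(y_1..T) of a 1-D linear-Gaussian
    state-space model, accumulated via the Kalman filter's
    prediction-error decomposition (toy parameters, not the paper's)."""
    mu, P, ll = mu0, P0, 0.0
    for y in ys:
        # Predicted observation distribution: N(C*mu, C*P*C + R)
        s = C * P * C + R
        ll += -0.5 * (np.log(2 * np.pi * s) + (y - C * mu) ** 2 / s)
        # Measurement update (Kalman gain and posterior)
        K = P * C / s
        mu = mu + K * (y - C * mu)
        P = (1.0 - K * C) * P
        # Time update (prediction for the next step)
        mu, P = A * mu, A * P * A + Q
    return ll

# Observations consistent with the model score higher than implausible ones.
ll_near = kalman_loglik([0.0, 0.1, -0.1])
ll_far = kalman_loglik([10.0, 10.0, 10.0])
```

Since every step is a smooth function of the model parameters, this likelihood can be maximized by gradient descent, which is the mechanism that lets the association network be trained end-to-end without identity labels.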