Learning a Neural Association Network for Self-supervised Multi-Object Tracking

📅 2024-11-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenging data association problem in identity-unlabeled multi-object tracking (MOT). We propose the first end-to-end differentiable self-supervised learning framework for MOT. Methodologically, we introduce a neural Kalman filter to model motion dynamics via Markovian assumptions, integrate Sinkhorn normalization for soft assignment, and jointly optimize filtering and association through a differentiable expectation-maximization (EM) algorithm—requiring neither ID labels nor prior trajectory knowledge. Our key contributions are: (i) the first joint training paradigm combining neural Kalman filtering with self-supervised EM; and (ii) fully unsupervised, end-to-end learnable data association. Evaluated on MOT17 and MOT20, our approach achieves state-of-the-art performance among self-supervised MOT methods. Remarkably, using only publicly available object detectors, it surpasses existing unsupervised approaches and demonstrates strong cross-dataset generalization capability.

📝 Abstract
This paper introduces a novel framework to learn data association for multi-object tracking in a self-supervised manner. Fully-supervised learning methods are known to achieve excellent tracking performance, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that in real-world scenarios object motion can usually be represented by a Markov process, we present a novel expectation maximization (EM) algorithm that trains a neural network to associate detections for tracking, without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter, with an observation model conditioned on associations of detections parameterized by a neural network. Given a batch of frames as input, data associations between detections from adjacent frames are predicted by a neural network followed by a Sinkhorn normalization that determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of observations given the inferred states, producing a training objective that maximizes this marginal probability using gradient descent. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end-to-end. We evaluate our approach on the challenging MOT17 and MOT20 datasets and achieve state-of-the-art results in comparison to self-supervised trackers using public detections. We furthermore demonstrate the capability of the learned model to generalize across datasets.
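The training signal described in the abstract is the marginal probability of the observations under a linear-Gaussian state-space model. A minimal sketch of how such a marginal log-likelihood is accumulated from the innovations of a Kalman filter pass (prediction error decomposition) in a scalar toy model; the dynamics and noise parameters here are illustrative assumptions, not the paper's learned model:

```python
import numpy as np

def kalman_log_likelihood(ys, A=1.0, C=1.0, Q=0.1, R=0.1, mu0=0.0, P0=1.0):
    """Accumulate log p(y_1..T) for a 1-D linear-Gaussian state-space
    model from the filter's innovations: each observation contributes
    a Gaussian term with the predicted mean and innovation variance."""
    mu, P = mu0, P0
    ll = 0.0
    for y in ys:
        # Predict the next state from the Markovian dynamics.
        mu_pred = A * mu
        P_pred = A * P * A + Q
        # Innovation (prediction error) and its variance.
        S = C * P_pred * C + R
        v = y - C * mu_pred
        ll += -0.5 * (np.log(2 * np.pi * S) + v * v / S)
        # Standard Kalman update.
        K = P_pred * C / S
        mu = mu_pred + K * v
        P = (1 - K * C) * P_pred
    return ll
```

A temporally consistent trajectory yields a higher marginal likelihood than a scrambled one, which is exactly why maximizing this quantity, with associations made soft via Sinkhorn, provides a self-supervised objective: correct associations make the observed detections more probable under the motion model.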
Problem

Research questions and friction points this paper is trying to address.

Self-supervised learning for multi-object tracking
Eliminating need for identity-level annotations
Neural network association with Kalman filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised neural network for data association
Neural Kalman filter with EM algorithm training
Differentiable end-to-end framework with Sinkhorn normalization