🤖 AI Summary
This work addresses identity inconsistency in multi-camera multi-object tracking (MCMOT), which often arises from viewpoint discrepancies across cameras. Existing approaches typically depend on precise camera calibration and extensive manual annotation. The paper proposes CalibFree, a fully self-supervised representation learning framework that requires neither calibration nor annotations. Using single-view distillation and cross-view reconstruction, the method explicitly separates view-agnostic from view-specific features to keep identities consistent across cameras. On the MMP-MvMHAT dataset, it improves overall accuracy by 3% and the average F1 score by 7.5% over state-of-the-art approaches. On the more diverse MvMHAT dataset, it also shows superior over-time tracking and strong cross-view performance.
📝 Abstract
Multi-camera multi-object tracking (MCMOT) faces significant challenges in maintaining consistent object identities across varying camera perspectives, particularly when precise calibration and extensive annotations are required. In this paper, we present CalibFree, a self-supervised representation learning framework that does not need any calibration or manual labeling for the MCMOT task. By promoting feature separation between view-agnostic and view-specific representations through single-view distillation and cross-view reconstruction, our method adapts to complex, dynamic scenarios with minimal overhead. Experiments on the MMP-MvMHAT dataset show a 3% improvement in overall accuracy and a 7.5% increase in the average F1 score over state-of-the-art approaches, confirming the effectiveness of our calibration-free design. Moreover, on the more diverse MvMHAT dataset, our approach demonstrates superior over-time tracking and strong cross-view performance, highlighting its adaptability to a wide range of camera configurations. Code will be publicly available upon acceptance.
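The abstract does not specify how the cross-view reconstruction objective is computed, so the following is only a toy sketch of the general idea, not CalibFree's actual loss. It assumes each embedding is split at a fixed index into a view-agnostic part and a view-specific part, and it uses plain concatenation in place of a learned decoder. The function names and the `n_shared` parameter are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def split_embedding(z: Vector, n_shared: int) -> Tuple[Vector, Vector]:
    """Split an embedding into (view-agnostic, view-specific) parts.

    Assumption: the first n_shared dimensions are the view-agnostic part.
    """
    return z[:n_shared], z[n_shared:]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cross_view_reconstruction_loss(z_a: Vector, z_b: Vector, n_shared: int) -> float:
    """Toy swap-based loss for the same object seen from two cameras.

    Each view is rebuilt from the OTHER view's agnostic part plus its own
    specific part. Concatenation stands in for a learned decoder.
    """
    shared_a, specific_a = split_embedding(z_a, n_shared)
    shared_b, specific_b = split_embedding(z_b, n_shared)
    rebuilt_a = shared_b + specific_a  # agnostic part borrowed from view B
    rebuilt_b = shared_a + specific_b  # agnostic part borrowed from view A
    return 0.5 * (mse(rebuilt_a, z_a) + mse(rebuilt_b, z_b))


# Same identity in two views: agnostic parts agree, specific parts differ.
view_a = [1.0, 2.0, 0.5, -0.5]
view_b = [1.0, 2.0, -3.0, 4.0]
# A different identity: the agnostic part no longer matches view_a.
other = [0.0, 2.0, 0.5, -0.5]

loss_same = cross_view_reconstruction_loss(view_a, view_b, n_shared=2)
loss_diff = cross_view_reconstruction_loss(view_a, other, n_shared=2)
```

In this toy setup the loss is zero whenever the two views agree on the view-agnostic part, regardless of how much their view-specific parts differ. That is the behavior a disentangling objective of this kind is meant to encourage.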