🤖 AI Summary
To address the unsupervised cross-view object matching problem in multi-camera setups with partial field-of-view overlap, this paper proposes a self-supervised learning framework grounded in extended cycle consistency. The method operates without any manual annotations. Its core contributions are threefold: (1) the first generalization of cycle-consistency theory to partially overlapping camera configurations; (2) the design of complementary multi-cycle variants to model asymmetric inter-camera correspondences; and (3) the integration of a pseudo-mask-guided loss and time-divergent scene sampling to enable robust, temporally aware feature learning. Evaluated on the DIVOTrack benchmark, the approach achieves a 4.3-percentage-point improvement in F1 score over prior unsupervised methods. It remains stable under challenging conditions, specifically low overlap ratios and high-density pedestrian scenes, significantly advancing cross-camera object matching accuracy.
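To make the two central ideas concrete, here is a minimal numpy sketch of a cycle-consistency loss with a pseudo-mask. The function name, the soft-assignment formulation, and the mask semantics are illustrative assumptions, not the paper's actual implementation: features from view A are soft-matched to view B and back, the round trip is compared against the identity, and the pseudo-mask down-weights objects assumed to lie outside the shared field of view.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(feat_a, feat_b, overlap_mask):
    """Illustrative sketch (not the paper's implementation):
    soft-assign A -> B -> A and penalize round trips that fail
    to return to the starting object. `overlap_mask` plays the
    role of a pseudo-mask, zeroing out objects in view A that
    are assumed to have no counterpart in view B."""
    sim = feat_a @ feat_b.T                 # pairwise feature similarities
    p_ab = softmax(sim, axis=1)             # soft assignment A -> B
    p_ba = softmax(sim.T, axis=1)           # soft assignment B -> A
    cycle = p_ab @ p_ba                     # round trip A -> B -> A
    target = np.eye(len(feat_a))            # an ideal cycle is the identity
    per_obj = ((cycle - target) ** 2).sum(axis=1)
    return float((overlap_mask * per_obj).sum() / max(overlap_mask.sum(), 1e-8))
```

With well-separated matching features the loss is near zero; masking out an object that has no counterpart in the other view prevents it from inflating the loss, which is the intuition behind directing the loss with a pseudo-mask under partial overlap.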
📝 Abstract
Matching objects across partially overlapping camera views is crucial in multi-camera systems and requires a view-invariant feature extraction network. Training such a network with cycle-consistency circumvents the need for labor-intensive labeling. In this paper, we extend the mathematical formulation of cycle-consistency to handle partial overlap. We then introduce a pseudo-mask that directs the training loss to take partial overlap into account. We additionally present several new cycle variants that complement each other, as well as a time-divergent scene sampling scheme that improves the data input for this self-supervised setting. Cross-camera matching experiments on the challenging DIVOTrack dataset show the merits of our approach. Compared to the self-supervised state-of-the-art, we achieve a 4.3 percentage point higher F1 score with our combined contributions. Our improvements are robust to reduced overlap in the training data, with substantial gains in challenging scenes where only a few matches must be made among many people. Self-supervised feature networks trained with our method are effective at matching objects in a range of multi-camera settings, providing opportunities for complex tasks like large-scale multi-camera scene understanding.
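The time-divergent sampling idea can be sketched in a few lines. This is a hypothetical greedy variant, not the paper's scheme: frames are drawn so that any two samples are at least `min_gap` frames apart, so the sampled scenes differ substantially over time rather than being near-duplicates.

```python
import random

def time_divergent_sample(num_frames, k, min_gap, rng=None):
    """Hypothetical sketch of time-divergent scene sampling:
    greedily pick up to k frame indices from [0, num_frames)
    such that every pair is at least `min_gap` frames apart."""
    rng = rng or random.Random(0)
    candidates = list(range(num_frames))
    rng.shuffle(candidates)
    chosen = []
    for frame in candidates:
        # accept a frame only if it is far from everything chosen so far
        if all(abs(frame - c) >= min_gap for c in chosen):
            chosen.append(frame)
        if len(chosen) == k:
            break
    return sorted(chosen)
```

The design intent is simply to avoid feeding the self-supervised objective many nearly identical scenes from consecutive frames, which would make the matching task trivially easy during training.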