🤖 AI Summary
To address key challenges in cross-view video person re-identification, including drastic viewpoint variations, scale discrepancies, and temporal misalignment, this paper proposes MTF-CVReID, a lightweight multi-temporal cross-view framework. Built on a ViT-B/16 backbone, it integrates seven novel components: cross-stream feature normalization, multi-resolution feature harmonization, identity-aware memory, temporal dynamics modeling, inter-view feature alignment, hierarchical temporal pattern learning, and multi-view identity consistency learning. With only ~2M additional parameters, MTF-CVReID enables viewpoint-invariant representation learning, multi-scale temporal modeling, and cross-view identity consistency. It achieves state-of-the-art performance on AG-VPReID across all altitude levels, demonstrates strong cross-domain generalization on G2A-VReID and MARS, and attains real-time inference at 189 FPS, effectively balancing accuracy, robustness, and efficiency.
📝 Abstract
Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
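The abstract does not include implementation details, but its two central ideas, parameter-efficient adapter modules layered over a frozen ViT-B/16 feature stream and a contrastive loss enforcing cross-view identity coherence (as in MVICL), can be sketched as follows. Everything here is an illustrative assumption rather than the authors' code: `BottleneckAdapter`, `cross_view_consistency_loss`, the bottleneck width, and the temperature are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter (hypothetical stand-in for the
    paper's lightweight modules): down-project, non-linearity, up-project,
    then add back to the backbone feature. The up-projection is zero-
    initialized so the adapter starts as an identity mapping."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ~0.1M parameters at dim=768, bottleneck=64, so a handful of such
        # adapters stays within the ~2M-parameter budget the paper reports.
        return x + self.up(F.gelu(self.down(x)))


def cross_view_consistency_loss(
    feat_a: torch.Tensor,  # features from view A (e.g., aerial), (B, D)
    feat_b: torch.Tensor,  # features from view B (e.g., ground), (B, D)
    labels: torch.Tensor,  # identity labels shared across views, (B,)
    temperature: float = 0.1,
) -> torch.Tensor:
    """InfoNCE-style loss pulling same-identity features together across
    views and pushing different identities apart (in the spirit of MVICL)."""
    za = F.normalize(feat_a, dim=1)
    zb = F.normalize(feat_b, dim=1)
    logits = za @ zb.t() / temperature            # (B, B) cosine similarities
    pos = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()
    log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
    # Average log-likelihood over all cross-view positives per anchor.
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```

Zero-initializing the up-projection is a common adapter trick: training begins from the unmodified backbone representation, so the added modules can only improve on it gradually, which matches the paper's goal of enhancing robustness without disturbing the pretrained ViT features.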