Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification

📅 2025-11-04
🤖 AI Summary
To address key challenges in cross-view video person re-identification, including drastic viewpoint variations, scale discrepancies, and temporal misalignment, this paper proposes MTF-CVReID, a lightweight multi-temporal cross-view framework. Built on a ViT-B/16 backbone, it adds seven components: cross-stream feature normalization, multi-resolution feature harmonization, identity-aware memory, temporal dynamics modeling, inter-view feature alignment, hierarchical temporal pattern learning, and contrastive multi-view identity consistency learning. With only about 2M additional parameters, MTF-CVReID targets viewpoint-invariant representation learning, multi-scale temporal modeling, and cross-view identity consistency. It reports state-of-the-art performance on AG-VPReID across all altitude levels, strong cross-dataset generalization to G2A-VReID and MARS, and real-time inference at 189 FPS, balancing accuracy, robustness, and efficiency.

📝 Abstract
Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency. The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
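
The abstract says MVICL enforces cross-view identity coherence with a contrastive learning paradigm but does not give the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE loss between paired aerial and ground tracklet embeddings. The function name `cross_view_info_nce`, the pairing scheme, and the temperature of 0.07 are all assumptions made for this example, not details taken from the paper or its repository.

```python
import numpy as np


def cross_view_info_nce(aerial, ground, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings (illustrative, not the paper's loss).

    aerial, ground: arrays of shape (N, D); row i of each is assumed to be
    the same identity seen from the aerial and the ground view.
    """
    # L2-normalise so the dot product is cosine similarity.
    a = aerial / np.linalg.norm(aerial, axis=1, keepdims=True)
    g = ground / np.linalg.norm(ground, axis=1, keepdims=True)
    logits = a @ g.T / temperature  # (N, N) similarity matrix

    def diagonal_cross_entropy(scores):
        # Row-wise log-softmax, with the matching pair on the diagonal as target.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the aerial-to-ground and ground-to-aerial directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Under these assumptions, embeddings of the same identity from both views are pulled together while other identities in the batch act as negatives, so correctly paired inputs should give a lower loss than mismatched ones.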
Problem

Research questions and friction points this paper is trying to address.

Addresses extreme viewpoint shifts in cross-view video person re-identification
Reduces scale disparities between aerial and ground surveillance views
Mitigates temporal inconsistencies in multi-view video identification systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Stream Feature Normalization corrects camera and view biases
Multi-Resolution Feature Harmonization stabilizes scale across altitudes
Identity-Aware Memory Module reinforces persistent identity traits
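
The summary does not describe how Cross-Stream Feature Normalization is computed. To make the general idea of correcting per-view bias concrete, the sketch below standardizes features separately for each view using that view's own statistics. This is a simple stand-in chosen for illustration; the helper name `normalize_per_view` and the per-dimension standardization are assumptions and may differ from the CSFN module in the paper.

```python
import numpy as np


def normalize_per_view(features, view_ids, eps=1e-6):
    """Standardize features within each view (illustrative stand-in, not CSFN itself).

    features: array of shape (N, D).
    view_ids: integer array of shape (N,), e.g. 0 = ground, 1 = aerial.
    """
    out = np.empty_like(features, dtype=np.float64)
    for view in np.unique(view_ids):
        mask = view_ids == view
        mean = features[mask].mean(axis=0)  # per-dimension offset of this view
        std = features[mask].std(axis=0)    # per-dimension spread of this view
        out[mask] = (features[mask] - mean) / (std + eps)
    return out
```

After this step each view has zero-mean features per dimension, so a constant offset introduced by one camera stream no longer separates it from the other.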
MD. Rashidunnabi
DeepNeuronic, Lda., Covilhã, Portugal and University of Beira Interior, Covilhã, Portugal
Kailash A. Hambarde
Instituto de Telecomunicações, Covilhã, Portugal
Vasco Lopes
DeepNeuronic, Universidade da Beira Interior, Portugal
Artificial Intelligence, Computer Vision, Neural Architecture Search
J. Neves
NOVA LINCS, NOVA University Lisbon, Portugal and University of Beira Interior, Covilhã, Portugal
Hugo Proença
Instituto de Telecomunicações, Covilhã, Portugal and University of Beira Interior, Covilhã, Portugal