Look Around and Pay Attention: Multi-camera Point Tracking Reimagined with Transformers

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional multi-camera point tracking methods decouple detection, association, and tracking, leading to error propagation and spatiotemporal inconsistency. This paper proposes LAPA—the first end-to-end multi-camera point tracking framework based on Transformers—that jointly models cross-view geometric constraints and inter-frame temporal dependencies. Its core innovations are: (i) replacing rigid triangulation with cross-view attention that explicitly incorporates camera geometry priors; and (ii) employing a Transformer decoder for joint appearance-geometry matching and soft correspondence optimization, enhancing robustness to occlusions and ensuring long-term identity consistency. Evaluated on TAPVid-3D-MC and PointOdyssey-MC, LAPA achieves APD scores of 37.5% and 90.3%, respectively—substantially outperforming prior state-of-the-art methods, especially under challenging conditions involving complex motion and heavy occlusion.
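The cross-view attention with geometric priors described above can be sketched in miniature as follows. This is an illustrative toy, not the paper's code: the function names are invented here, and the additive `geo_bias` term (e.g. a negative epipolar distance) is one plausible way to inject camera-geometry priors into attention logits.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(q, k, v, geo_bias):
    """Single-head attention from one view's point queries q (Nq, d) to
    another view's keys/values k, v (Nk, d). geo_bias (Nq, Nk) is an
    additive geometric prior, so geometrically implausible cross-view
    matches are softly down-weighted rather than hard-rejected."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + geo_bias   # (Nq, Nk)
    weights = softmax(logits, axis=-1)         # soft cross-view correspondences
    return weights @ v, weights                # aggregated features + matches
```

Adding the prior as a bias on the logits (rather than masking) keeps correspondences soft: when appearance is ambiguous, geometry dominates, and when geometry is uncertain, appearance can still recover the match.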

📝 Abstract
This paper presents LAPA (Look Around and Pay Attention), a novel end-to-end transformer-based architecture for multi-camera point tracking that integrates appearance-based matching with geometric constraints. Traditional pipelines decouple detection, association, and tracking, leading to error propagation and temporal inconsistency in challenging scenarios. LAPA addresses these limitations by leveraging attention mechanisms to jointly reason across views and time, establishing soft correspondences through a cross-view attention mechanism enhanced with geometric priors. Instead of relying on classical triangulation, we construct 3D point representations via attention-weighted aggregation, inherently accommodating uncertainty and partial observations. Temporal consistency is further maintained through a transformer decoder that models long-range dependencies, preserving identities through extended occlusions. Extensive experiments on challenging datasets, including our newly created multi-camera (MC) versions of TAPVid-3D panoptic and PointOdyssey, demonstrate that our unified approach significantly outperforms existing methods, achieving 37.5% APD on TAPVid-3D-MC and 90.3% APD on PointOdyssey-MC, particularly excelling in scenarios with complex motions and occlusions. Code is available at https://github.com/ostadabbas/Look-Around-and-Pay-Attention-LAPA-
Problem

Research questions and friction points this paper is trying to address.

Multi-camera point tracking with transformers
Joint reasoning across views and time
Handling occlusions and complex motions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based architecture for multi-camera point tracking
Cross-view attention with geometric priors for soft correspondences
Attention-weighted aggregation for 3D point representations accommodating uncertainty
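The last point, attention-weighted aggregation in place of hard triangulation, can be illustrated with a toy sketch. This is an assumption-laden stand-in (the paper aggregates learned representations; here raw per-view 3D estimates are fused with softmax confidences), but it shows why soft aggregation accommodates uncertainty: an occluded or unreliable view simply receives near-zero weight instead of corrupting a rigid triangulation.

```python
import numpy as np

def aggregate_3d(candidates, scores):
    """Soft alternative to hard triangulation (illustrative, not LAPA's code).
    candidates: (V, 3) per-view 3D point estimates.
    scores:     (V,) per-view confidence logits.
    Returns a confidence-weighted mean, so low-confidence (e.g. occluded)
    views contribute little to the fused 3D point."""
    w = np.exp(scores - scores.max())  # stable softmax over views
    w = w / w.sum()
    return (w[:, None] * candidates).sum(axis=0)
```

With one view far more confident than the rest, the result collapses to that view's estimate; with equal confidences, it reduces to a plain average.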