TAPTRv2: Attention-based Position Update Improves Tracking Any Point

๐Ÿ“… 2024-07-23
๐Ÿ›๏ธ Neural Information Processing Systems
๐Ÿ“ˆ Citations: 6
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the problem that TAPTR relies on cost volumes for Tracking Any Point (TAP), leading to point-query feature contamination and degraded visibility prediction and matching accuracy. To resolve this, we propose Attention-driven Position Updating (APU), which replaces explicit cost volume computation with key-aware deformable attentionโ€”thereby fundamentally decoupling local geometric matching from high-level feature modeling. Built upon a Transformer architecture and trained end-to-end in a DETR-style fashion, APU enables robust point localization and joint visibility prediction. Evaluated on multiple benchmarks, it achieves state-of-the-art performance: significant improvements in tracking accuracy and visibility classification accuracy, alongside reduced computational overhead. Our core contribution is the first integration of deformable attention into TAP, completely eliminating cost volumes and establishing a more concise, efficient, and interpretable paradigm for arbitrary-point tracking.

๐Ÿ“ Abstract
In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for solving the Tracking Any Point (TAP) task. TAPTR borrows designs from DEtection TRansformer (DETR) and formulates each tracking point as a point query, making it possible to leverage well-studied operations in DETR-like algorithms. TAPTRv2 improves TAPTR by addressing a critical issue regarding its reliance on cost volume, which contaminates the point query's content feature and negatively impacts both visibility prediction and cost-volume computation. In TAPTRv2, we propose a novel attention-based position update (APU) operation and use key-aware deformable attention to realize it. For each query, this operation uses key-aware attention weights to combine its corresponding deformable sampling positions to predict a new query position. This design is based on the observation that local attention is essentially the same as cost volume, both of which are computed by the dot product between a query and its surrounding features. By introducing this new operation, TAPTRv2 not only removes the extra burden of cost-volume computation, but also leads to a substantial performance improvement. TAPTRv2 surpasses TAPTR and achieves state-of-the-art performance on many challenging datasets, demonstrating the superiority of our approach.
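The core mechanism described above, using key-aware attention weights over deformable sampling positions to predict a new query position, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, shapes, and the single-head, single-query setup are assumptions for clarity.

```python
import numpy as np

def attention_position_update(query_feat, key_feats, sample_positions):
    """Sketch of an attention-based position update (APU) for one point query.

    query_feat:       (d,)   content feature of the point query
    key_feats:        (n, d) image features sampled at n deformable positions
    sample_positions: (n, 2) the (x, y) deformable sampling positions

    Returns the updated (x, y) position as the attention-weighted
    combination of the sampling positions.
    """
    # Key-aware attention: dot product between the query and each sampled
    # key, scaled and softmax-normalized. This plays the role of the local
    # cost volume, which is likewise a query-vs-neighborhood dot product.
    logits = key_feats @ query_feat / np.sqrt(query_feat.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # The new query position is the attention-weighted sum of the
    # deformable sampling positions.
    return weights @ sample_positions

# Toy usage with random features and positions.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)             # query content feature
keys = rng.standard_normal((4, 8))     # features at 4 sampled positions
pos = rng.uniform(0.0, 1.0, size=(4, 2))
new_pos = attention_position_update(q, keys, pos)
```

Because the weights are a softmax, the predicted position is a convex combination of the sampling positions, so the update always lands inside their bounding region.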
Problem

Research questions and friction points this paper is trying to address.

Improves point tracking by replacing cost-volume with attention-based updates
Enhances visibility prediction and position accuracy in Tracking Any Point
Achieves state-of-the-art performance without costly cost-volume computations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based approach for Tracking Any Point
Attention-based position update replaces cost-volume
Key-aware deformable attention improves query position
Hongyang Li
South China University of Technology, International Digital Economy Academy (IDEA)
Hao Zhang
International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology
Shilong Liu
RS@ByteDance, PhD@THU
Computer Vision, Object Detection, Visual Grounding, Multi-Modality, Multimodal Agent
Zhaoyang Zeng
International Digital Economy Academy
Computer Vision, Multimedia Understanding
Feng Li
International Digital Economy Academy (IDEA), The Hong Kong University of Science and Technology
Tianhe Ren
PhD student of Electrical and Electronic Engineering, The University of Hong Kong
Computer Vision, Machine Learning, Multi-Modality
Bo Li
Shanghai Jiao Tong University
Lei Zhang
South China University of Technology, International Digital Economy Academy (IDEA)