KV-Tracker: Real-Time Pose Tracking with Transformers

📅 2025-12-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-view geometry–based approaches for real-time 6-DoF pose tracking and online 3D reconstruction from monocular RGB video incur high computational overhead, hindering real-time performance. This paper proposes a lightweight, depth-free, object-agnostic, and scene-general framework. We introduce a novel KV caching mechanism that models global self-attention key-value pairs as compact, persistent scene representations—enabling model-agnostic, retraining-free long-term consistency modeling. Our method integrates π³ multi-view geometric priors with dynamic keyframe selection and bidirectional attention. Evaluated on benchmarks including TUM RGB-D, the framework achieves 27 FPS inference speed, up to 15× acceleration over baselines, and significantly mitigates pose drift and catastrophic forgetting.

📝 Abstract
Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
Problem

Research questions and friction points this paper is trying to address.

Enable real-time 6-DoF pose tracking from monocular RGB videos
Achieve online object and scene reconstruction without depth or priors
Speed up multi-view 3D networks for real-time use without drift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Caches key-value pairs for real-time tracking
Uses keyframe selection for online scene mapping
Applies model-agnostic caching to speed inference
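The caching idea above can be illustrated with a minimal sketch: keyframes are encoded once, their self-attention key/value pairs are stored as a persistent scene representation, and each incoming frame's queries cross-attend against the cache instead of re-encoding past frames. All names here (`KVCache`, `update`, `attend`) are hypothetical illustrations, not the paper's actual API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Hypothetical sketch of the paper's idea: persist key/value pairs
    produced by keyframe self-attention and reuse them as the sole
    scene representation during online tracking."""

    def __init__(self):
        self.K = None  # cached keys, shape (n_cached, d)
        self.V = None  # cached values, shape (n_cached, d)

    def update(self, K_new, V_new):
        # Append a new keyframe's tokens to the persistent cache.
        self.K = K_new if self.K is None else np.concatenate([self.K, K_new])
        self.V = V_new if self.V is None else np.concatenate([self.V, V_new])

    def attend(self, Q):
        # New-frame queries attend to cached scene tokens only;
        # past frames are never re-encoded, which is where the
        # claimed inference speedup comes from.
        d = Q.shape[-1]
        scores = Q @ self.K.T / np.sqrt(d)
        return softmax(scores, axis=-1) @ self.V

rng = np.random.default_rng(0)
cache = KVCache()
cache.update(rng.standard_normal((64, 32)), rng.standard_normal((64, 32)))
out = cache.attend(rng.standard_normal((8, 32)))   # 8 query tokens, dim 32
print(out.shape)  # → (8, 32)
```

Because the cache is just stored tensors from a standard attention block, the same wrapper could in principle sit on top of any off-the-shelf multi-view transformer, which is the model-agnostic, retraining-free property the summary highlights.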