LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

📅 2026-05-06
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses 3D human motion tracking from egocentric multi-camera headsets, where severe ego-motion, partial visibility or occlusions, and a lack of training data hinder performance. The authors propose LAMP, a framework built on a "lift-then-fit" paradigm: it first uses the device's known 6-DoF pose and calibration to lift asynchronous, partial 2D keypoint detections from all cameras into a 3D ray cloud in a unified world coordinate frame, then fits 3D human motion to that ray cloud with an end-to-end-trained spatio-temporal transformer. Disentangling observer and target motion early lets the model learn a human motion prior in world space and flexibly integrate multiple moving, partially observing cameras. LAMP reports state-of-the-art results on monocular benchmarks and significantly outperforms baselines in the targeted egocentric setting.
📝 Abstract
Tracking 3D human motion from an egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions, and a lack of training data. Existing methods designed for monocular video often require static or slowly moving cameras and cannot efficiently leverage multi-view, calibrated, and localized input. This makes them brittle and prone to failure on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework that solves this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6-DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" approach allows LAMP to learn and leverage a natural human motion prior in world space, and it provides an elegant framework for flexibly incorporating information from multiple temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines for our targeted egocentric setting.
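
The "lift" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an undistorted pinhole camera (headset cameras are often fisheye, which would need a different unprojection), and the function name `lift_keypoints_to_world_rays` and its parameters `K`, `R_world_cam`, and `t_world_cam` are hypothetical names chosen for illustration. Each 2D keypoint becomes a ray whose origin is the camera center and whose direction is the back-projected pixel, both expressed in the world frame.

```python
import numpy as np

def lift_keypoints_to_world_rays(keypoints_px, K, R_world_cam, t_world_cam):
    """Back-project 2D pixel keypoints into rays in the world frame.

    Assumptions (not from the paper): pinhole intrinsics K without distortion,
    and a camera-to-world pose given as rotation R_world_cam (3x3) and
    camera center t_world_cam (3,) in world coordinates.

    Returns (origins, directions), each of shape (N, 3); directions are unit length.
    """
    pts = np.asarray(keypoints_px, dtype=float)            # (N, 2) pixels
    homog = np.hstack([pts, np.ones((len(pts), 1))])       # (N, 3) homogeneous pixels
    dirs_cam = homog @ np.linalg.inv(K).T                  # rows are K^-1 @ [u, v, 1]
    dirs_world = dirs_cam @ np.asarray(R_world_cam).T      # rotate camera -> world
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    # Every ray from this camera at this timestamp starts at the camera center.
    origins = np.broadcast_to(np.asarray(t_world_cam, dtype=float),
                              dirs_world.shape).copy()
    return origins, dirs_world

# Example with made-up values: one camera at (1, 2, 3) looking along world +Z.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
origins, directions = lift_keypoints_to_world_rays(
    [[320.0, 240.0], [820.0, 240.0]], K, np.eye(3), [1.0, 2.0, 3.0])
```

Repeating this for every camera and every frame in the temporal window, each with its own pose, would yield a set of world-frame rays; under the assumptions above, that set plays the role of the "3D ray cloud" that the second step fits motion to.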
Problem

Research questions and friction points this paper is trying to address.

egocentric vision
3D human motion tracking
multi-camera tracking
occlusion
egomotion
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-camera tracking
egocentric vision
3D human motion
spatio-temporal transformer
motion disentanglement