LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World

📅 2026-05-06

🤖 AI Summary

This work addresses 3D human motion tracking from first-person multi-camera headsets, where severe ego-motion, occlusion and partial visibility, and scarce training data hinder performance. The authors propose LAMP, a framework built on a “lift-then-fit” paradigm: it first uses the device’s 6-DoF pose and calibration to lift asynchronous, partial 2D keypoints from all cameras into a unified world coordinate frame as a 3D ray cloud, then fits 3D human motion to that ray cloud with an end-to-end-trained spatio-temporal Transformer that learns a world-space motion prior. Separating observer motion from target motion at this early stage lets the model combine observations from multiple moving, partially observing cameras. According to the abstract, LAMP achieves state-of-the-art results on monocular benchmarks and significantly outperforms baselines in the targeted egocentric setting.

📝 Abstract

Tracking 3D human motion from an egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusion, and a lack of training data. Existing methods designed for monocular video often require static or slowly moving cameras and cannot efficiently leverage multi-view, calibrated, and localized input. This makes them brittle and prone to failure on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework that addresses this through early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known device 6-DoF motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This "lift-then-fit" approach allows LAMP to learn and leverage a natural human motion prior in world space, and provides an elegant framework for flexibly incorporating information from multiple temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines in our targeted egocentric setting.
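The "lift" step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an ideal pinhole camera with intrinsics `K` (headset cameras are often fisheye and would need a different unprojection), a camera-to-world pose `(R_wc, t_wc)` taken from the device's 6-DoF localization, and a hypothetical ray-cloud row layout of origin, direction, timestamp, and joint id. All function names, dictionary keys, and the row layout are illustrative assumptions.

```python
import numpy as np


def lift_keypoints_to_world_rays(keypoints_px, K, R_wc, t_wc):
    """Back-project 2D keypoints (N, 2) into world-frame rays.

    Assumes a pinhole model: K is the (3, 3) intrinsic matrix, and
    (R_wc, t_wc) is the camera-to-world rotation and translation.
    Returns ray origins (N, 3) and unit ray directions (N, 3).
    """
    keypoints_px = np.asarray(keypoints_px, dtype=float)
    ones = np.ones((keypoints_px.shape[0], 1))
    pixels_h = np.hstack([keypoints_px, ones])        # homogeneous pixels
    dirs_cam = pixels_h @ np.linalg.inv(K).T          # directions in camera frame
    dirs_world = dirs_cam @ np.asarray(R_wc).T        # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    origins = np.tile(np.asarray(t_wc, dtype=float), (len(keypoints_px), 1))
    return origins, dirs_world


def build_ray_cloud(observations):
    """Stack rays from many cameras and timestamps into one array.

    Each observation is a dict with hypothetical keys: 'keypoints',
    'joint_ids', 'K', 'R_wc', 't_wc', 'timestamp'. Observations may
    come from different cameras at different times.
    Output rows: [ox, oy, oz, dx, dy, dz, timestamp, joint_id].
    """
    rows = []
    for obs in observations:
        origins, dirs = lift_keypoints_to_world_rays(
            obs["keypoints"], obs["K"], obs["R_wc"], obs["t_wc"]
        )
        stamp = np.full((len(origins), 1), float(obs["timestamp"]))
        joints = np.asarray(obs["joint_ids"], dtype=float).reshape(-1, 1)
        rows.append(np.hstack([origins, dirs, stamp, joints]))
    return np.vstack(rows)


# Example: two observations of different joints at different times.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
observations = [
    {"keypoints": [[320.0, 240.0], [820.0, 240.0]], "joint_ids": [0, 1],
     "K": K, "R_wc": np.eye(3), "t_wc": [0.0, 0.0, 1.6], "timestamp": 0.00},
    {"keypoints": [[320.0, 240.0]], "joint_ids": [0],
     "K": K, "R_wc": np.eye(3), "t_wc": [0.1, 0.0, 1.6], "timestamp": 0.03},
]
ray_cloud = build_ray_cloud(observations)
```

Because every ray is already expressed in the world frame, the camera's own motion no longer appears in the input. A downstream model, such as the spatio-temporal transformer the abstract describes, would only need to explain the person's motion, and rays from cameras that observed different joints at different times can share one array.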

Problem

Research questions and friction points this paper is trying to address.

egocentric vision

3D human motion tracking

multi-camera tracking

occlusion

egomotion

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-camera tracking

egocentric vision

3D human motion

spatio-temporal transformer

motion disentanglement

