From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing multi-view 3D human pose estimation methods, which rely on synchronized image inputs, neglect temporal information, and are constrained by the frame rate of individual views. To overcome these issues, the authors propose a sparse interleaved multi-view input scheme that fuses asynchronous multi-view images, theoretically enabling an N-fold increase in output frame rate (where N is the number of cameras) while reducing data redundancy. They introduce DenseWarper, a model that leverages epipolar geometry to enable efficient spatio-temporal heatmap exchange, capturing both cross-view and cross-frame dependencies. Experiments on Human3.6M and MPI-INF-3DHP show that the approach achieves higher pose estimation accuracy and finer temporal resolution using only sparse inputs, outperforming current dense multi-view methods.
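The summary does not spell out how DenseWarper applies epipolar geometry to heatmaps, so the following is only a minimal sketch of the standard two-view epipolar constraint that such an exchange would build on. The camera intrinsics, rotation, translation, and 3D point are made-up illustration values, and the helper names (`skew`, `fundamental_matrix`, `epipolar_line`, `project`) are hypothetical, not taken from the paper's code.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def fundamental_matrix(K1, K2, R, t):
    """F for cameras P1 = K1[I|0] and P2 = K2[R|t], with x2^T F x1 = 0."""
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)


def epipolar_line(F, x1):
    """Line (a, b, c) in view 2, a*u + b*v + c = 0, for pixel x1 in view 1."""
    return F @ np.array([x1[0], x1[1], 1.0])


def project(K, R, t, X):
    """Pinhole projection of a 3D point X to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]


# Assumed toy setup: identical intrinsics, view 2 rotated about the y axis.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
angle = 0.3
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([-1.0, 0.0, 0.2])

X = np.array([0.3, -0.2, 4.0])            # a joint position in view-1 coordinates
x1 = project(K, np.eye(3), np.zeros(3), X)
x2 = project(K, R, t, X)

F = fundamental_matrix(K, K, R, t)
a, b, c = epipolar_line(F, x1)

# Pixel distance from the true projection in view 2 to the epipolar line.
distance = abs(a * x2[0] + b * x2[1] + c) / np.hypot(a, b)
```

For a static point the distance is zero up to floating-point error, which is why a joint detected in one view restricts the search in another view to a line. When the two views are captured at different times, as in the interleaved scheme, the subject may have moved, so the constraint holds only approximately.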

📝 Abstract

In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict the pose at a single moment. While this provides accurate spatial information, it often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel input scheme for 3D human pose estimation: sparse interleaved input. This scheme uses images captured from different camera views at different time points (e.g., View 1 at time $t$ and View 2 at time $t+\delta$), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages. First, with N cameras it can theoretically increase the output pose frame rate by a factor of N, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, by using only a sparse subset of the available frames, our method reduces data redundancy while achieving better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, using only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: https://github.com/lingli1724/DenseWarper-ICLR2026
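To make the N-fold frame rate claim concrete, here is a minimal sketch of one possible interleaved capture schedule. It assumes the cameras are staggered evenly, with camera k offset by k/N of the single-camera frame period; the abstract only says views are captured at different time points, so the even spacing, the function name `interleaved_schedule`, and the numbers (4 cameras, 50 frames per second each) are illustrative assumptions.

```python
from fractions import Fraction


def interleaved_schedule(num_cameras, frame_period, num_cycles):
    """Return (timestamp, camera_index) pairs for evenly staggered cameras.

    Assumption: camera k fires at cycle * frame_period + k * frame_period / N.
    """
    period = Fraction(frame_period)
    offset = period / num_cameras
    events = []
    for cycle in range(num_cycles):
        for cam in range(num_cameras):
            events.append((cycle * period + cam * offset, cam))
    return events


# Hypothetical setup: 4 cameras, each running at 50 frames per second.
schedule = interleaved_schedule(num_cameras=4,
                                frame_period=Fraction(1, 50),
                                num_cycles=2)

# Time between consecutive frames across all cameras.
gaps = [later[0] - earlier[0] for earlier, later in zip(schedule, schedule[1:])]

# If one pose is predicted per incoming frame, this is the output rate in Hz.
effective_rate = 1 / gaps[0]
```

Under these assumptions a new image arrives every 1/200 s, so predicting one pose per incoming image yields 200 poses per second from 50 Hz cameras, while each time step uses only one view instead of all four.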

Problem

Research questions and friction points this paper is trying to address.

multi-view 3D human pose estimation

spatio-temporal fusion

temporal dependencies

frame rate limitation

data redundancy

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse interleaved input

spatio-temporal fusion

multi-view 3D pose estimation