From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses temporally inconsistent predictions, such as flickering, in human-centric dense video tasks under motion, occlusion, and illumination changes, a problem compounded by the scarcity of multi-task paired video supervision. To this end, we propose a scalable, photorealistic synthetic human video generation method that, for the first time, provides both frame-level and sequence-level pixel-wise annotations, including depth, surface normals, and masks. Leveraging this data, we develop a unified Vision Transformer (ViT)-based dense prediction architecture that integrates Continuous Surface Embedding (CSE) human geometric priors with a lightweight channel reweighting module. Our approach employs a two-stage training strategy, static pretraining followed by dynamic sequence fine-tuning, to jointly optimize spatial and temporal consistency. The method achieves state-of-the-art performance on the THuman2.1 and Hi4D benchmarks and demonstrates strong generalization to in-the-wild real-world videos.
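The paper itself provides no code here; as a rough illustration of the architectural idea, below is a minimal PyTorch sketch of an SE-style channel reweighting module placed after fusing ViT features with a CSE geometric prior. The module and class names, the concatenation-based fusion, and all dimensions (including the 16-channel CSE embedding size) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """SE-style gating: rescales fused channels by learned importance scores.
    A guess at what a 'lightweight channel reweighting module' could look like."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),                                   # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)

class FusedHead(nn.Module):
    """Hypothetical fusion of ViT features with a spatially aligned CSE prior,
    followed by channel reweighting to downweight unreliable geometry channels."""
    def __init__(self, vit_dim: int = 768, cse_dim: int = 16, out_dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(vit_dim + cse_dim, out_dim, 1)  # concat -> project
        self.reweight = ChannelReweight(out_dim)

    def forward(self, vit_feat: torch.Tensor, cse_embed: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([vit_feat, cse_embed], dim=1))
        return self.reweight(fused)
```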

📝 Abstract
In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is scarce. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, ours provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves the reliability of geometry features with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, lets the model first acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
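As a companion sketch, the two-stage schedule described above could be organized as static pretraining on per-frame labels followed by fine-tuning on sequences with an added temporal term. The loss choices below (plain L1 for spatial supervision, a frame-difference penalty without flow warping as the temporal term, and the 0.1 weighting) are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def spatial_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-frame supervision (placeholder: L1 over depth / normals / masks)."""
    return F.l1_loss(pred, target)

def temporal_loss(preds: list) -> torch.Tensor:
    """Penalize frame-to-frame prediction drift; a crude stand-in for
    supervision on motion-aligned sequences (no flow warping shown)."""
    diffs = [F.l1_loss(preds[t], preds[t - 1]) for t in range(1, len(preds))]
    return torch.stack(diffs).mean()

def train_two_stage(model, static_loader, sequence_loader,
                    lr: float = 1e-4, temporal_weight: float = 0.1):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # Stage 1: static pretraining on single labeled frames
    for frame, label in static_loader:
        opt.zero_grad()
        spatial_loss(model(frame), label).backward()
        opt.step()

    # Stage 2: fine-tuning on motion-aligned sequences with a temporal term
    for frames, labels in sequence_loader:
        opt.zero_grad()
        preds = [model(f) for f in frames]
        loss = torch.stack([spatial_loss(p, y)
                            for p, y in zip(preds, labels)]).mean()
        loss = loss + temporal_weight * temporal_loss(preds)
        loss.backward()
        opt.step()
```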
Problem

Research questions and friction points this paper is trying to address.

temporal consistency
human-centric dense prediction
video sequences
flickering
paired supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal consistency
synthetic data pipeline
human-centric dense prediction
ViT-based predictor
geometric prior