From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses temporally inconsistent predictions, such as flickering, in human-centric dense video tasks under motion, occlusion, and illumination changes, a problem compounded by the scarcity of multi-task paired video supervision. To this end, we propose a scalable, photorealistic synthetic human video generation method that, for the first time, provides both frame-level and sequence-level pixel-wise annotations, including depth, surface normals, and masks. Leveraging this data, we develop a unified Vision Transformer (ViT)-based dense prediction architecture that integrates Continuous Surface Embedding (CSE) human geometric priors with a lightweight channel reweighting module. Our approach employs a two-stage training strategy, static pretraining followed by dynamic sequence fine-tuning, to jointly optimize spatial and temporal consistency. The method achieves state-of-the-art performance on the THuman2.1 and Hi4D benchmarks and demonstrates strong generalization to in-the-wild real-world videos.
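The paper itself provides no code here; as a rough illustration of the architectural idea, below is a minimal PyTorch sketch of an SE-style channel reweighting module placed after fusing ViT features with a CSE geometric prior. The module and class names, the concatenation-based fusion, and all dimensions (including the 16-channel CSE embedding size) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """SE-style gating: rescales fused channels by learned importance scores.
    A guess at what a 'lightweight channel reweighting module' could look like."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),                                   # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)

class FusedHead(nn.Module):
    """Hypothetical fusion of ViT features with a spatially aligned CSE prior,
    followed by channel reweighting to downweight unreliable geometry channels."""
    def __init__(self, vit_dim: int = 768, cse_dim: int = 16, out_dim: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(vit_dim + cse_dim, out_dim, 1)  # concat -> project
        self.reweight = ChannelReweight(out_dim)

    def forward(self, vit_feat: torch.Tensor, cse_embed: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([vit_feat, cse_embed], dim=1))
        return self.reweight(fused)
```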

📝 Abstract
In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and paired human video supervision for multiple dense tasks is scarce. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static synthetic data pipelines, ours provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves the reliability of geometry features with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, lets the model first acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.
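As a companion sketch, the two-stage schedule described above could be organized as static pretraining on per-frame labels followed by fine-tuning on sequences with an added temporal term. The loss choices below (plain L1 for spatial supervision, a frame-difference penalty without flow warping as the temporal term, and the 0.1 weighting) are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def spatial_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Per-frame supervision (placeholder: L1 over depth / normals / masks)."""
    return F.l1_loss(pred, target)

def temporal_loss(preds: list) -> torch.Tensor:
    """Penalize frame-to-frame prediction drift; a crude stand-in for
    supervision on motion-aligned sequences (no flow warping shown)."""
    diffs = [F.l1_loss(preds[t], preds[t - 1]) for t in range(1, len(preds))]
    return torch.stack(diffs).mean()

def train_two_stage(model, static_loader, sequence_loader,
                    lr: float = 1e-4, temporal_weight: float = 0.1):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # Stage 1: static pretraining on single labeled frames
    for frame, label in static_loader:
        opt.zero_grad()
        spatial_loss(model(frame), label).backward()
        opt.step()

    # Stage 2: fine-tuning on motion-aligned sequences with a temporal term
    for frames, labels in sequence_loader:
        opt.zero_grad()
        preds = [model(f) for f in frames]
        loss = torch.stack([spatial_loss(p, y)
                            for p, y in zip(preds, labels)]).mean()
        loss = loss + temporal_weight * temporal_loss(preds)
        loss.backward()
        opt.step()
```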
Problem

Research questions and friction points this paper is trying to address.

temporal consistency
human-centric dense prediction
video sequences
flickering
paired supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal consistency
synthetic data pipeline
human-centric dense prediction
ViT-based predictor
geometric prior