STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address human pose estimation in low-resolution videos, this paper proposes an end-to-end spatial-temporal adaptive super-resolution framework that balances computational efficiency and keypoint localization accuracy. The method integrates three core innovations: (1) a novel linear attention-based spatiotemporal Transformer modulated by LeakyReLU, drastically reducing computational complexity; (2) a pose-aware composite loss function that prioritizes structural localizability over pixel-level fidelity during super-resolution; and (3) a parallel CNN-based local texture enhancement module. Evaluated on extremely low-resolution video sequences (64×48), the framework achieves a 5.2% improvement in mean Average Precision (mAP) over baseline methods. Moreover, it attains 2.8–4.4× faster inference speed compared to cascaded approaches, significantly enhancing feasibility for edge-device deployment.
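The summary's first innovation is a linear attention mechanism modulated by LeakyReLU, which avoids the quadratic cost of softmax attention by factoring the attention product through a kernel feature map. A minimal sketch of kernelized linear attention, assuming the feature map is φ(x) = LeakyReLU(x) + 1 (the paper's exact formulation is not given here, so `leaky_relu`, the `+ 1` shift, and the slope are illustrative assumptions):

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # Elementwise LeakyReLU (assumed slope; the paper's value is not stated here).
    return np.where(x > 0, x, slope * x)

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention: softmax(QK^T)V is replaced by
    phi(Q) (phi(K)^T V), reducing cost from O(n^2 d) to O(n d^2)."""
    # Assumed feature map: LeakyReLU shifted toward positive values.
    phi_q = leaky_relu(Q) + 1.0          # (n, d)
    phi_k = leaky_relu(K) + 1.0          # (n, d)
    # Associativity trick: aggregate keys/values first, independent of n^2.
    kv = phi_k.T @ V                     # (d, d_v)
    z = phi_q @ phi_k.sum(axis=0)        # (n,) row-wise normalizer
    return (phi_q @ kv) / (z[:, None] + eps)
```

The key design choice is computing `phi(K)^T V` before multiplying by `phi(Q)`: for a video sequence of n frame tokens this keeps memory and compute linear in n, which is what makes the approach attractive for edge deployment.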

📝 Abstract
Human pose estimation in low-resolution videos presents a fundamental challenge in computer vision. Conventional methods either assume high-quality inputs or employ computationally expensive cascaded processing, which limits their deployment in resource-constrained environments. We propose STAR-Pose, a spatial-temporal adaptive super-resolution framework specifically designed for video-based human pose estimation. Our method features a novel spatial-temporal Transformer with LeakyReLU-modified linear attention, which efficiently captures long-range temporal dependencies. It is complemented by an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement. We also design a pose-aware compound loss to achieve task-oriented super-resolution: it guides the network to reconstruct the structural features that are most beneficial for keypoint localization, rather than optimizing purely for visual quality. Extensive experiments on several mainstream video HPE datasets demonstrate that STAR-Pose outperforms existing approaches. It achieves up to 5.2% mAP improvement under extremely low-resolution (64×48) conditions while delivering 2.8× to 4.4× faster inference than cascaded approaches.
Problem

Research questions and friction points this paper is trying to address.

Estimating human poses in low-resolution videos efficiently
Overcoming computational limits of traditional cascaded methods
Enhancing keypoint localization with task-oriented super-resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial-temporal Transformer with modified attention
Adaptive fusion module for texture enhancement
Pose-aware loss for task-oriented super-resolution
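The third innovation above, the pose-aware loss, trades pixel fidelity for keypoint localizability. A minimal sketch of one plausible compound form, a pixel L1 term plus a keypoint-heatmap-weighted term (the function name, the weighting scheme, and the mixing coefficient `lam` are all assumptions, since the paper's exact loss is not reproduced on this page):

```python
import numpy as np

def pose_aware_loss(sr, hr, heatmap, lam=0.5, eps=1e-8):
    """Hypothetical compound loss for task-oriented super-resolution:
    global pixel L1 plus a term that re-weights reconstruction error
    by a keypoint heatmap, emphasizing regions near body joints."""
    err = np.abs(sr - hr)
    pixel = err.mean()                       # plain pixel-fidelity term
    w = heatmap / (heatmap.sum() + eps)      # normalized keypoint weights
    structural = (w * err).sum()             # error concentrated at joints
    return pixel + lam * structural
```

If the super-resolved frame `sr` matches the high-resolution target `hr` exactly, both terms vanish; errors located at heatmap peaks are penalized more than the same errors in the background, which is the intended task-oriented bias.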
Yucheng Jin
Assistant Professor, Duke Kunshan University
Human-Centered AI · Human-Computer Interaction · Recommender Systems · Digital Wellbeing · Music
Jinyan Chen
Tianjin University
Ziyue He
Tianjin University
Baojun Han
Tianjin University
Furan An
Tianjin University