Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current human-centric vision models (HVMs) suffer from a trade-off between large parameter counts and limited pretraining data, hindering both generalization and deployment efficiency. To address this, we propose Dynamic Pattern Alignment Learning (DPAL), a lightweight knowledge distillation framework built upon Vision Transformers (ViTs). DPAL introduces a Dynamic Pattern Decoder (D-PaDe) as a mixture-of-experts module that adaptively extracts three complementary visual patterns: global identity, local shape, and multi-person interaction. Furthermore, it enforces fine-grained knowledge transfer via a three-level feature alignment objective—spanning global, local, and instance-level relational representations. Evaluated across 15 benchmarks, DPAL-ViT/Ti (5M parameters) matches the performance of PATH-B (84M) and Sapiens-L (307M), significantly outperforming Proteus and TinyMiM. To our knowledge, DPAL is the first method to achieve strong multi-task generalization under an extremely lightweight regime.

📝 Abstract
Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception depends heavily on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To obtain generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe) that acts as a dynamic Mixture-of-Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting these typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight and large HVMs at the global image level, the local pixel level, and the instance relation level. With these two designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, enabling generalization to various human-centric vision tasks. Extensive experiments on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalizability similar to that of existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods, including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M), by a large margin.
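The three alignment objectives described in the abstract can be sketched as a distillation loss. This is a minimal, hypothetical NumPy illustration under assumed shapes and names (the paper does not publish this code): global alignment on mean-pooled features, local alignment token by token, and instance-relation alignment on pairwise cosine-similarity matrices across the batch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def three_level_alignment(student_tokens, teacher_tokens):
    """Hypothetical sketch of DPAL's three alignment objectives.

    student_tokens, teacher_tokens: (N, T, D) arrays of per-instance
    token features, assumed already projected to a shared dimension D.
    """
    # Global image level: match mean-pooled image features.
    s_g = student_tokens.mean(axis=1)
    t_g = teacher_tokens.mean(axis=1)
    loss_global = np.mean((s_g - t_g) ** 2)

    # Local pixel (token) level: match features token by token.
    loss_local = np.mean((student_tokens - teacher_tokens) ** 2)

    # Instance relation level: match pairwise cosine-similarity
    # matrices between the N instances in the batch.
    s_rel = l2_normalize(s_g) @ l2_normalize(s_g).T
    t_rel = l2_normalize(t_g) @ l2_normalize(t_g).T
    loss_relation = np.mean((s_rel - t_rel) ** 2)

    # Equal weighting is an assumption; the paper may weight terms.
    return loss_global + loss_local + loss_relation

# Identical student and teacher features give zero loss.
feats = np.random.default_rng(0).standard_normal((4, 16, 8))
assert np.isclose(three_level_alignment(feats, feats), 0.0)
```

Note that the relation term only constrains how instances relate to one another, so it can transfer structure even when absolute feature scales differ between student and teacher.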
Problem

Research questions and friction points this paper is trying to address.

Large HVMs generalize well but depend on heavy architectures, hindering real-world deployment.
Limited accessibility of pretraining data restricts the practical application of HVMs.
How can lightweight HVMs inherit the generalization of large HVMs through distillation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Pattern Alignment Learning for lightweight HVMs
Dynamic pattern decoder with three specialized experts
Three-level alignment objectives for generalization
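The dynamic pattern decoder named above can be illustrated with a small gating sketch. This is an assumed, simplified NumPy rendering (all names and shapes are hypothetical, not the authors' implementation): each of the three pattern queries produces gating weights conditioned on both the image feature and the query, and those weights mix the outputs of three linear experts.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for gating weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_pattern_decode(image_feat, pattern_queries, expert_weights, gate_weight):
    """Sketch of a dynamic MoE decoder in the spirit of D-PaDe.

    image_feat:      (D,) pooled image feature
    pattern_queries: (3, D) queries (identity / shape / interaction)
    expert_weights:  (3, D, D) one linear expert per pattern
    gate_weight:     (2*D, 3) gating network weights
    """
    outputs = []
    for q in pattern_queries:
        # Gate conditioned on both the input image and the pattern query.
        gate_in = np.concatenate([image_feat, q])
        gate = softmax(gate_in @ gate_weight)                            # (3,)
        expert_out = np.stack([image_feat @ W for W in expert_weights])  # (3, D)
        outputs.append(gate @ expert_out)  # gated mixture of experts
    return np.stack(outputs)               # (3, D) pattern features

D = 8
rng = np.random.default_rng(0)
patterns = dynamic_pattern_decode(
    rng.standard_normal(D),
    rng.standard_normal((3, D)),
    rng.standard_normal((3, D, D)),
    rng.standard_normal((2 * D, 3)),
)
assert patterns.shape == (3, D)
```

Because the gate sees the query as well as the image, each pattern can draw on a different blend of experts for the same input, which is what makes the extraction "dynamic" rather than a fixed one-expert-per-pattern routing.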
👥 Authors
Xuanhan Wang (UESTC, Human Centered Visual Understanding)
Huimin Deng (University of Electronic Science and Technology of China)
Ke Liu (University of Electronic Science and Technology of China)
Jun Wang (University of Electronic Science and Technology of China)
Lianli Gao (UESTC, Vision and Language)
Jingkuan Song (Tongji University)