Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current human-centric vision models (HVMs) suffer from a trade-off between large parameter counts and limited pretraining data, hindering both generalization and deployment efficiency. To address this, we propose Dynamic Pattern Alignment Learning (DPAL), a lightweight knowledge distillation framework built upon Vision Transformers (ViTs). DPAL introduces a Dynamic Pattern Decoder (D-PaDe) as a mixture-of-experts module that adaptively extracts three complementary visual patterns: global identity, local shape, and multi-person interaction. Furthermore, it enforces fine-grained knowledge transfer via a three-level feature alignment objective—spanning global, local, and instance-level relational representations. Evaluated across 15 benchmarks, DPAL-ViT/Ti (5M parameters) matches the performance of PATH-B (84M) and Sapiens-L (307M), significantly outperforming Proteus and TinyMiM. To our knowledge, DPAL is the first method to achieve strong multi-task generalization under an extremely lightweight regime.

📝 Abstract
Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limit their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception depends heavily on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To obtain generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe) that acts as a dynamic Mixture-of-Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting these typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight and large HVMs at the global image level, the local pixel level, and the instance relation level. With these two designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, enabling generalization to various human-centric vision tasks. Extensive experiments on 15 challenging datasets demonstrate the effectiveness of DPAL. Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves generalizability similar to that of existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods, including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M), by a large margin.
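The three alignment objectives described in the abstract can be sketched as a distillation loss. This is a minimal, hypothetical NumPy illustration under assumed shapes and names (the paper does not publish this code): global alignment on mean-pooled features, local alignment token by token, and instance-relation alignment on pairwise cosine-similarity matrices across the batch.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def three_level_alignment(student_tokens, teacher_tokens):
    """Hypothetical sketch of DPAL's three alignment objectives.

    student_tokens, teacher_tokens: (N, T, D) arrays of per-instance
    token features, assumed already projected to a shared dimension D.
    """
    # Global image level: match mean-pooled image features.
    s_g = student_tokens.mean(axis=1)
    t_g = teacher_tokens.mean(axis=1)
    loss_global = np.mean((s_g - t_g) ** 2)

    # Local pixel (token) level: match features token by token.
    loss_local = np.mean((student_tokens - teacher_tokens) ** 2)

    # Instance relation level: match pairwise cosine-similarity
    # matrices between the N instances in the batch.
    s_rel = l2_normalize(s_g) @ l2_normalize(s_g).T
    t_rel = l2_normalize(t_g) @ l2_normalize(t_g).T
    loss_relation = np.mean((s_rel - t_rel) ** 2)

    # Equal weighting is an assumption; the paper may weight terms.
    return loss_global + loss_local + loss_relation

# Identical student and teacher features give zero loss.
feats = np.random.default_rng(0).standard_normal((4, 16, 8))
assert np.isclose(three_level_alignment(feats, feats), 0.0)
```

Note that the relation term only constrains how instances relate to one another, so it can transfer structure even when absolute feature scales differ between student and teacher.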
Problem

Research questions and friction points this paper is trying to address.

Large HVMs generalize well but depend on heavy architectures, hindering real-world deployment.
Limited accessibility of pretraining data restricts the practical application of HVMs.
How can lightweight HVMs inherit the generalization of large HVMs through distillation?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Pattern Alignment Learning for lightweight HVMs
Dynamic pattern decoder with three specialized experts
Three-level alignment objectives for generalization
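The dynamic pattern decoder named above can be illustrated with a small gating sketch. This is an assumed, simplified NumPy rendering (all names and shapes are hypothetical, not the authors' implementation): each of the three pattern queries produces gating weights conditioned on both the image feature and the query, and those weights mix the outputs of three linear experts.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for gating weights.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_pattern_decode(image_feat, pattern_queries, expert_weights, gate_weight):
    """Sketch of a dynamic MoE decoder in the spirit of D-PaDe.

    image_feat:      (D,) pooled image feature
    pattern_queries: (3, D) queries (identity / shape / interaction)
    expert_weights:  (3, D, D) one linear expert per pattern
    gate_weight:     (2*D, 3) gating network weights
    """
    outputs = []
    for q in pattern_queries:
        # Gate conditioned on both the input image and the pattern query.
        gate_in = np.concatenate([image_feat, q])
        gate = softmax(gate_in @ gate_weight)                            # (3,)
        expert_out = np.stack([image_feat @ W for W in expert_weights])  # (3, D)
        outputs.append(gate @ expert_out)  # gated mixture of experts
    return np.stack(outputs)               # (3, D) pattern features

D = 8
rng = np.random.default_rng(0)
patterns = dynamic_pattern_decode(
    rng.standard_normal(D),
    rng.standard_normal((3, D)),
    rng.standard_normal((3, D, D)),
    rng.standard_normal((2 * D, 3)),
)
assert patterns.shape == (3, D)
```

Because the gate sees the query as well as the image, each pattern can draw on a different blend of experts for the same input, which is what makes the extraction "dynamic" rather than a fixed one-expert-per-pattern routing.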
👥 Authors
Xuanhan Wang (UESTC, Human Centered Visual Understanding)
Huimin Deng (University of Electronic Science and Technology of China)
Ke Liu (University of Electronic Science and Technology of China)
Jun Wang (University of Electronic Science and Technology of China)
Lianli Gao (UESTC, Vision and Language)
Jingkuan Song (Tongji University)