Sapiens2

📅 2026-04-23

🤖 AI Summary
This work targets the limited generalization and output fidelity of human-centric vision models at high resolution by introducing the Sapiens2 model family, with sizes from 0.4 to 5 billion parameters. The models are pretrained on a curated dataset of one billion high-quality human images with a unified objective that combines masked image reconstruction and self-distilled contrastive learning. They operate at native 1K resolution, and hierarchical variants support 4K; the 4K models use windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. The combined objective is meant to capture both low-level detail for dense prediction and high-level semantics for zero-shot or few-label settings. Sapiens2 reports state-of-the-art results and improves over the first generation on pose estimation (+4 mAP), body-part segmentation (+24.3 mIoU), and surface normal estimation (45.6% lower angular error), and it extends to new tasks such as pointmap and albedo estimation.

📝 Abstract
We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
Problem

Research questions and friction points this paper is trying to address.

human-centric vision
generalization
high-fidelity outputs
dense prediction
zero-shot learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked image reconstruction
self-distilled contrastive learning
high-resolution vision transformers
windowed attention
human-centric vision
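To make the windowed-attention item concrete, here is a minimal sketch of why restricting attention to local windows helps at high resolution. It assumes non-overlapping square windows over a grid of patch tokens; the window size, grid size, and function names are hypothetical, since the abstract does not state the configuration Sapiens2 uses.

```python
def partition_windows(height, width, window):
    """Group token indices of a height x width grid into non-overlapping windows."""
    if height % window or width % window:
        raise ValueError("grid dimensions must be divisible by the window size")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                row * width + col
                for row in range(top, top + window)
                for col in range(left, left + window)
            ])
    return windows


def attention_pairs(height, width, window=None):
    """Count query-key pairs: global attention if window is None, else per-window."""
    tokens = height * width
    if window is None:
        return tokens * tokens
    # Each token attends only to the window * window tokens in its own block.
    return tokens * window * window


# A 4 x 4 token grid split into 2 x 2 windows.
print(partition_windows(4, 4, 2))
print(attention_pairs(4, 4), attention_pairs(4, 4, 2))
```

Global attention grows quadratically with the number of tokens, while the windowed count grows linearly for a fixed window size, which is the usual motivation for windowing at 4K-scale inputs.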