Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image models struggle to generate anatomically correct human figures because high-quality human image datasets are limited and supervised fine-tuning on them yields ambiguous optimization signals. To address this, the work proposes the ASAP framework, which constructs synthetic preference pairs by introducing controlled, localized anatomical errors into high-fidelity human images, and aligns models with a localized, margin-bounded variant of Direct Preference Optimization (DPO). The key contributions are the localized degradation mechanism, the HAP dataset of over 10K curated preference pairs, the localized margin-bounded DPO objective, and the HAF-Bench evaluation benchmark. Experiments show that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.
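To make the idea of a synthetic preference pair concrete, the sketch below builds one pair from a single image: the untouched image is the preferred sample, and a copy with one corrupted region is the rejected sample. This is an illustration only. The function name `make_preference_pair`, the `(top, left, bottom, right)` box format, and the vertical flip used as the corruption are all assumptions; the summary does not say how the anatomical errors are actually produced, and a flip is merely a stand-in for that step.

```python
import numpy as np


def make_preference_pair(image, box):
    """Return (preferred, rejected, mask) for one image.

    image: array of shape (H, W, C), assumed to be a high-fidelity human image.
    box:   (top, left, bottom, right) of the region to degrade
           (hypothetical format, chosen for this sketch).
    """
    top, left, bottom, right = box

    # Boolean mask marking where the two images are allowed to differ.
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[top:bottom, left:right] = True

    rejected = image.copy()
    # Placeholder corruption (assumption): flip the region upside down.
    # The real mechanism introduces explicit anatomical errors instead.
    rejected[top:bottom, left:right] = image[top:bottom, left:right][::-1]

    return image, rejected, mask


# Toy usage with an 8x8 "image" whose pixel values are all distinct.
toy_image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
preferred, rejected, mask = make_preference_pair(toy_image, (2, 2, 6, 6))
```

Because everything outside the mask is identical in both images, any preference signal derived from the pair can only come from the degraded region, which is the property the controlled construction is meant to provide.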
📝 Abstract
Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose Alignment via Synthetic Anatomical Preference (ASAP), a framework that constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism acts as a controlled experiment on each image: it introduces explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset, which contains over 10K curated pairs for anatomical alignment of text-to-image models that generate human images. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.
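The abstract does not give the exact objective, so the following is only one plausible reading of a "localized and margin-bounded" DPO loss, written as a minimal NumPy sketch. It starts from the usual diffusion-style DPO form, where the implicit reward margin compares how much the policy improves its denoising error over a frozen reference on the preferred versus the rejected image. Two assumptions are layered on top: pixels inside the degraded region receive a larger weight (`region_weight`), and the margin is clipped at `max_margin` so the loss stops decreasing once the preference is satisfied by that amount. The function name, parameter names, and default values are hypothetical.

```python
import numpy as np


def localized_bounded_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, mask,
                               beta=1.0, region_weight=5.0, max_margin=2.0):
    """Sketch of a localized, margin-bounded DPO loss for one preference pair.

    err_w, err_l:         per-pixel squared denoising errors of the policy on
                          the preferred (w) and rejected (l) image, shape (H, W).
    err_w_ref, err_l_ref: the same errors for the frozen reference model.
    mask:                 bool array (H, W), True inside the degraded region.
    Returns (loss, unclipped_margin).
    """
    # Localization (assumption): up-weight pixels in the targeted region.
    weights = np.where(mask, region_weight, 1.0)
    weights = weights / weights.sum()

    def weighted(err):
        return float((weights * err).sum())

    # Implicit reward margin: positive when the policy beats the reference
    # by more on the preferred image than on the rejected one.
    margin = beta * ((weighted(err_l) - weighted(err_l_ref))
                     - (weighted(err_w) - weighted(err_w_ref)))

    # Margin bound (assumption): clip so the loss has a finite floor.
    bounded = min(margin, max_margin)

    # -log(sigmoid(x)) computed stably as log(1 + exp(-x)).
    loss = float(np.logaddexp(0.0, -bounded))
    return loss, margin


# Toy usage: the policy halves its error on the preferred image,
# but only inside the masked region.
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[1:3, 1:3] = True
ref_err = np.ones((4, 4))
policy_err_w = np.where(toy_mask, 0.5, 1.0)
toy_loss, toy_margin = localized_bounded_dpo_loss(
    policy_err_w, ref_err, ref_err, ref_err, toy_mask)
```

In this sketch the clipping means that, once the margin exceeds `max_margin`, further separation of the pair no longer lowers the loss, which is one simple way to express "a finite preference margin to prevent over-optimization". Other bounding schemes would fit the abstract's description equally well.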
Problem

Research questions and friction points this paper is trying to address.

anatomical plausibility
human image generation
preference-based alignment
localized degradation
text-to-image models
Innovation

Methods, ideas, or system contributions that make the work stand out.

anatomically plausible generation
synthetic preference pairs
localized degradation
preference-based alignment
human image generation