Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image models struggle to produce anatomically correct human figures because high-quality anatomical annotations are scarce and the available optimization signals are ambiguous. To address this, the work proposes the ASAP framework, which constructs synthetic preference pairs by introducing controlled, localized anatomical errors and aligns models with a localized, margin-bounded variant of Direct Preference Optimization (DPO). The key contributions are a localized anatomical preference mechanism, the HAP dataset of over 10K curated pairs, a controllable anatomical degradation strategy, and the HAF-Bench evaluation benchmark. Experiments show that ASAP consistently reduces anatomical errors across multiple base models while preserving overall image quality.
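The pair construction described above can be illustrated with a minimal sketch. It assumes an image is a nested list of pixel values and that the degradation is supplied by the caller; the names `composite_degraded`, `make_preference_pair`, and `degrade_fn` are hypothetical and not taken from the paper, which does not specify its degradation procedure here.

```python
def composite_degraded(original, degraded, mask):
    """Take degraded pixels where mask is 1 and original pixels elsewhere."""
    return [
        [d if m else o for o, d, m in zip(row_o, row_d, row_m)]
        for row_o, row_d, row_m in zip(original, degraded, mask)
    ]


def make_preference_pair(original, degrade_fn, mask):
    """Build one (preferred, rejected) pair that differs only inside the mask.

    degrade_fn is a placeholder for whatever process introduces the
    anatomical error (assumption: it returns an image of the same shape).
    """
    degraded = degrade_fn(original)
    rejected = composite_degraded(original, degraded, mask)
    return {"preferred": original, "rejected": rejected, "mask": mask}


# Toy usage: a 2x2 "image" whose top-right pixel is the targeted region.
image = [[1, 2], [3, 4]]
region = [[0, 1], [0, 0]]
pair = make_preference_pair(
    image,
    lambda img: [[-v for v in row] for row in img],  # stand-in degradation
    region,
)
```

Because the rejected image matches the preferred one everywhere outside the mask, any preference signal learned from the pair can only come from the targeted region, which is the point of treating the degradation as a controlled experiment.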

📝 Abstract

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose Alignment via Synthetic Anatomical Preference (ASAP), a framework that constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism acts as a controlled experiment on each image: it introduces explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset, which contains over 10K curated pairs for anatomical alignment of text-to-image models that generate human images. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.
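The abstract does not state the loss formula, so the following is only one plausible reading of a "localized and margin-bounded" DPO objective, not the authors' implementation. It assumes a diffusion-style DPO setup in which each image contributes a per-pixel denoising error under the trained model and under a frozen reference model. The function names, the `background_weight` down-weighting, and the hard cap `max_margin` are all illustrative assumptions.

```python
import math


def masked_error(pred, target, mask, background_weight=0.1):
    """Weighted mean squared error over flat pixel lists.

    Pixels inside the mask get weight 1; the rest get background_weight
    (assumption: one simple way to prioritize the targeted region).
    """
    num = den = 0.0
    for p, t, m in zip(pred, target, mask):
        w = m + (1.0 - m) * background_weight
        num += w * (p - t) ** 2
        den += w
    return num / den


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))


def localized_bounded_dpo_loss(err_w, err_w_ref, err_l, err_l_ref,
                               beta=1.0, max_margin=2.0):
    """DPO-style loss on masked errors with a capped preference margin.

    err_w / err_l: model errors on the preferred / rejected image.
    err_*_ref: the same errors under the frozen reference model.
    """
    # Positive when the model improves on the reference more for the
    # preferred image than for the rejected one.
    margin = -beta * ((err_w - err_w_ref) - (err_l - err_l_ref))
    # Cap the margin so the loss stops decreasing once the pair is
    # separated "enough" (assumption: a hard cap; other bounds are possible).
    margin = min(margin, max_margin)
    return softplus(-margin)  # equals -log(sigmoid(margin))
```

Under this sketch, a pair whose margin already exceeds `max_margin` contributes a constant loss and therefore no gradient, which is one way a finite margin could limit over-optimization while leaving the rest of the image largely untouched.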

Problem

Research questions and friction points this paper is trying to address.

anatomical plausibility

human image generation

preference-based alignment

localized degradation

text-to-image models

Innovation

Methods, ideas, or system contributions that make the work stand out.

anatomically plausible generation

synthetic preference pairs

localized degradation