🤖 AI Summary
Existing generative models often suffer from structural distortions, such as limb deformations and unnatural poses, when synthesizing human figures into scenes. To address this, the authors propose an explicit skeletal reasoning mechanism that integrates human structural priors into the generation process by jointly training the pose-inference and image-rendering stages. A key component is the PoseInverter module, which decodes the model's internal latent pose into an explicit representation, enabling fine-grained, editable control over pose parameters. The method substantially improves the anatomical plausibility and naturalness of generated human poses while preserving high-fidelity appearance and contextual coherence. Extensive evaluations demonstrate that the approach outperforms both specialized and general-purpose state-of-the-art models in structural accuracy and pose realism.
📝 Abstract
Synthesizing realistic and structurally plausible human figures into existing scenes remains a significant challenge for current generative models, which often produce artifacts such as distorted limbs and unnatural poses. We attribute this systemic failure to an inability to perform explicit reasoning over human skeletal structure. To address this, we introduce SkeleGuide, a novel framework built upon explicit skeletal reasoning. Through joint training of its reasoning and rendering stages, SkeleGuide learns to produce an internal pose that acts as a strong structural prior, guiding synthesis towards high structural integrity. For fine-grained user control, we introduce PoseInverter, a module that decodes this internal latent pose into an explicit and editable format. Extensive experiments demonstrate that SkeleGuide significantly outperforms both specialized and general-purpose models in generating high-fidelity, contextually aware human images. Our work provides compelling evidence that explicitly modeling skeletal structure is a fundamental step towards robust and plausible human image synthesis.
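To make the described data flow concrete, the pipeline in the abstract can be sketched as: scene context → skeletal reasoning (internal latent pose) → PoseInverter (explicit, editable keypoints) and, in parallel, a renderer conditioned on both context and latent pose. The sketch below is a minimal, hypothetical illustration of that flow only; the module names, feature dimensions, skeleton size, and linear maps are all assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of the SkeleGuide data flow described in the abstract.
# All shapes and module internals are illustrative assumptions.

rng = np.random.default_rng(0)

N_JOINTS = 17    # e.g. a COCO-style skeleton (assumption)
LATENT_DIM = 64  # size of the internal latent pose (assumption)
CTX_DIM = 128    # scene-context feature size (assumption)

class SkeletalReasoner:
    """Infers an internal latent pose from scene-context features."""
    def __init__(self):
        self.W = rng.normal(size=(CTX_DIM, LATENT_DIM)) * 0.1
    def __call__(self, ctx):
        return np.tanh(ctx @ self.W)  # latent pose z

class PoseInverter:
    """Decodes the latent pose into explicit, editable 2D keypoints."""
    def __init__(self):
        self.W = rng.normal(size=(LATENT_DIM, N_JOINTS * 2)) * 0.1
    def __call__(self, z):
        return (z @ self.W).reshape(-1, N_JOINTS, 2)

class Renderer:
    """Synthesizes an image conditioned on context and the latent pose."""
    def __init__(self):
        self.W = rng.normal(size=(CTX_DIM + LATENT_DIM, 3 * 8 * 8)) * 0.1
    def __call__(self, ctx, z):
        h = np.concatenate([ctx, z], axis=-1)
        return (h @ self.W).reshape(-1, 3, 8, 8)  # toy 8x8 RGB output

# Forward pass: context -> latent pose -> (editable keypoints, rendered image)
ctx = rng.normal(size=(1, CTX_DIM))
z = SkeletalReasoner()(ctx)
keypoints = PoseInverter()(z)  # the user-editable pose representation
image = Renderer()(ctx, z)
print(keypoints.shape, image.shape)  # (1, 17, 2) (1, 3, 8, 8)
```

In this reading, "joint training" would mean backpropagating the rendering loss through both the renderer and the skeletal reasoner, so the latent pose is shaped by both pose supervision and image quality; the sketch shows only the forward pass.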