🤖 AI Summary
Personalized image generation faces three key challenges: pose distortion in complex human-object interactions (e.g., “person pushing a motorcycle”), loss of reference subject identity, and misalignment between gaze direction and scene semantics. To address these, we propose a social-context feedback mechanism that, for the first time, integrates multimodal perceptual signals—pose, identity, human-object interaction, and gaze point—into the diffusion model generation process. Our novel timestep-adaptive feedback module hierarchically fuses low-level geometric constraints with high-level semantic signals, enabling dynamic, stepwise correction during sampling. Adopting a feedback-driven fine-tuning paradigm, our method achieves substantial improvements across three benchmark datasets: +12.7% in interaction plausibility, +18.3% in identity fidelity, and a 24.1% reduction in FID, indicating enhanced visual quality. This work establishes an interpretable and scalable framework for controllable portrait generation.
📝 Abstract
Personalized image generation, where reference images of one or more subjects are used to generate their image according to a scene description, has gathered significant interest in the community. However, such generated images suffer from three major limitations -- complex activities, such as $<$man, pushing, motorcycle$>$ are not generated properly with incorrect human poses, reference human identities are not preserved, and generated human gaze patterns are unnatural/inconsistent with the scene description. In this work, we propose to overcome these shortcomings through feedback-based fine-tuning of existing personalized generation methods, wherein, state-of-art detectors of pose, human-object-interaction, human facial recognition and human gaze-point estimation are used to refine the diffusion model. We also propose timestep-based inculcation of different feedback modules, depending upon whether the signal is low-level (such as human pose), or high-level (such as gaze point). The images generated in this manner show an improvement in the generated interactions, facial identities and image quality over three benchmark datasets.