🤖 AI Summary
In text-to-image generation, identity consistency suffers from the “copy-paste” problem: models trained with reconstruction-based objectives mechanically reuse the reference facial features instead of preserving identity naturally across diverse poses, expressions, and lighting conditions. To balance identity fidelity and controllable diversity, we propose a new paradigm: (1) constructing MultiID-2M, a large-scale paired dataset of 2 million multi-identity samples; (2) introducing the first quantitative benchmark for evaluating copy-paste artifacts; and (3) designing a contrastive identity loss that enables end-to-end training within diffusion models. Our method significantly suppresses feature copying and achieves state-of-the-art performance (+8.2% in ID similarity, a 12.4-point reduction in FID) while supporting fine-grained pose and expression editing. A user study confirms strong controllability with high identity accuracy (96.3%).
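The summary describes the copy-paste benchmark only at a high level. As a rough illustration of what such a metric could measure, the sketch below defines a hypothetical `copy_paste_score` (not the paper's actual benchmark): it compares how similar a generated face embedding is to the exact reference image versus to *other* photos of the same identity. A large positive score would suggest the model replicated the reference pixels rather than the identity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def copy_paste_score(gen_emb: np.ndarray,
                     ref_emb: np.ndarray,
                     same_id_embs: list[np.ndarray]) -> float:
    """Hypothetical copy-paste indicator (illustrative, not the paper's metric).

    gen_emb:      face embedding of the generated image.
    ref_emb:      embedding of the exact reference image used as input.
    same_id_embs: embeddings of other photos of the same person.

    Returns sim(gen, reference) minus mean sim(gen, other photos of
    the identity); values well above 0 mean the generation tracks the
    reference image more closely than the identity itself.
    """
    sim_to_ref = cosine(gen_emb, ref_emb)
    sim_to_identity = np.mean([cosine(gen_emb, e) for e in same_id_embs])
    return sim_to_ref - sim_to_identity
```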
📝 Abstract
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term “copy-paste”, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct MultiID-2M, a large-scale paired dataset tailored for multi-person scenarios that provides diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further confirm that our method achieves high identity fidelity while enabling expressive, controllable generation.
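The abstract does not spell out the form of the contrastive identity loss. A minimal sketch, assuming an InfoNCE-style formulation over face embeddings, might look like the following: paired images of the same identity act as positives (so the target is the identity, not the reference pixels) and other identities in the batch act as negatives. All names and the temperature value are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_identity_loss(gen_embs: torch.Tensor,
                              pos_embs: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style identity loss (a sketch, not the paper's exact form).

    gen_embs: (B, D) face embeddings of generated images.
    pos_embs: (B, D) embeddings of *different* photos of the same
              identities (the paired data), so the positive target is
              the identity rather than the specific reference image.
    Other rows in the batch serve as in-batch negatives.
    """
    gen = F.normalize(gen_embs, dim=-1)
    pos = F.normalize(pos_embs, dim=-1)
    logits = gen @ pos.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(gen.size(0), device=gen.device)
    return F.cross_entropy(logits, targets)       # diagonal entries are positives
```

Under this reading, using a different photo of the same person as the positive is what discourages copy-paste: exactly reproducing the reference image no longer maximizes the training objective.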