WithAnyone: Towards Controllable and ID Consistent Image Generation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
In text-to-image generation, identity consistency suffers from the "copy-paste" problem: models trained with reconstruction-based objectives mechanically reuse reference facial features rather than preserving identity naturally across diverse poses, expressions, and lighting conditions. To balance identity fidelity with controllable diversity, the paper makes three contributions: (1) MultiID-2M, a large-scale paired dataset of 2 million multi-identity samples; (2) the first quantitative benchmark for evaluating copy-paste artifacts; and (3) a contrastive identity loss enabling end-to-end training within diffusion models. The resulting model, WithAnyone, significantly suppresses feature copying, achieving state-of-the-art performance (+8.2% in ID similarity, −12.4 in FID) while supporting fine-grained pose and expression editing. A user study confirms strong controllability with high identity accuracy (96.3%).

📝 Abstract
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
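The abstract distinguishes "replicating the reference face" from "preserving the identity", but this page does not detail how the benchmark quantifies that gap. One plausible way to operationalize it, sketched below with hypothetical function names (this is an illustration, not the paper's actual metric), is to compare a generated face's embedding similarity to the exact reference image against its similarity to held-out images of the same person: a large gap suggests the pixels of the reference were copied rather than the identity learned.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def copy_paste_score(gen_emb, ref_emb, held_out_embs):
    """Hypothetical copy-paste indicator (not the paper's metric).

    Gap between similarity to the exact reference image and mean
    similarity to held-out images of the same identity. A large
    positive gap suggests the model replicated the reference rather
    than generalizing the identity; a gap near zero suggests the
    identity was preserved without pixel-level copying.
    """
    sim_ref = cosine(gen_emb, ref_emb)
    sim_id = sum(cosine(gen_emb, e) for e in held_out_embs) / len(held_out_embs)
    return sim_ref - sim_id
```

In practice the embeddings would come from a pretrained face-recognition network; here they are plain vectors so the arithmetic is easy to check.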
Problem

Research questions and friction points this paper is trying to address.

Addresses copy-paste artifacts in identity-consistent image generation
Improves controllability over pose and expression variations
Balances identity fidelity with natural diversity in generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed a large-scale paired dataset for multi-person scenarios
Introduced a benchmark to quantify copy-paste artifacts
Proposed a training paradigm with contrastive identity loss
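The exact form of the contrastive identity loss is not given on this page. A common formulation matching the description (pull the generated embedding toward paired images of the same identity, push it away from other identities) is an InfoNCE-style objective; the sketch below is a hypothetical illustration under that assumption, not the paper's loss.

```python
import math

def contrastive_id_loss(anchor, positives, negatives, tau=0.1):
    """InfoNCE-style identity loss (illustrative, assumed formulation).

    anchor:    embedding of the generated face (L2-normalized).
    positives: embeddings of other images of the SAME identity.
    negatives: embeddings of DIFFERENT identities.
    tau:       temperature controlling how sharply pairs are contrasted.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    pos = [math.exp(dot(anchor, p) / tau) for p in positives]
    neg = [math.exp(dot(anchor, n) / tau) for n in negatives]
    denom = sum(pos) + sum(neg)
    # Average -log p(positive) over all positive pairs: low when the
    # anchor is close to its identity's other images and far from others.
    return -sum(math.log(p / denom) for p in pos) / len(pos)
```

Because the positives are *other* photographs of the same person (enabled by the paired MultiID-2M data), minimizing such a loss rewards matching the identity rather than the single reference image, which is the stated mechanism for suppressing copy-paste behavior.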