OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address low identity fidelity and poor background consistency in subject-driven image generation for multi-subject complex scenes, this paper proposes a video-derived data construction paradigm grounded in cross-frame identity priors. We design a four-stage pipeline integrating diversity-aware pairing with vision-language model (VLM)-driven category consensus, local localization, and verification. To enhance controllability and fidelity, we introduce segmentation-map-guided outpainting, bounding-box-guided inpainting, geometry-aware enhancement, and irregular-boundary erosion. Furthermore, we establish the first large-scale benchmark dedicated to multi-subject generation and editing. Experiments demonstrate that our method significantly improves identity consistency and scene complexity modeling, achieving state-of-the-art performance on both multi-subject generation and editing tasks.

Technology Category

Application Category

📝 Abstract
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
Problem

Research questions and friction points this paper is trying to address.

Improves subject identity preservation in image generation
Enables multi-subject handling in complex scene generation
Enhances subject-driven image manipulation consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video-derived large-scale corpus with identity priors
Four-stage pipeline for cross-frame subject mining
Segmentation and box-guided synthesis with geometry augmentations