SubZero: Composing Subject, Style, and Action via Zero-Shot Personalization

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in zero-shot personalized image generation—namely, joint control of subject, style, and action; severe cross-modal leakage; and constraints on edge-device deployment—by proposing a fine-tuning-free diffusion model framework. Methodologically: (1) a similarity constraint and an orthogonalized temporal aggregation mechanism are introduced to suppress subject-style leakage; (2) a dual-path content/style projector enhances IP-Adapter-style cross-attention for disentangled, controllable synthesis; (3) the architecture is streamlined for efficient edge inference. Experiments demonstrate state-of-the-art performance in subject fidelity, style consistency, and action controllability. To the authors' knowledge, this is the first method enabling zero-shot joint generation of arbitrary subjects, styles, and actions while supporting real-time, resource-efficient inference on edge devices.
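The leakage-suppression idea in the summary can be illustrated with a minimal sketch: remove from each style token the component aligned with the subject embedding, so that style conditioning cannot carry subject content into cross-attention. This is a hypothetical illustration of orthogonalization in general, not the paper's actual implementation; the function name and shapes are assumptions.

```python
import numpy as np

def orthogonalize_style_against_subject(style_emb, subject_emb, eps=1e-8):
    """Subtract from each style token its component along the mean
    subject direction, suppressing subject->style leakage.
    Hypothetical sketch, not SubZero's actual code."""
    # Mean-pool subject tokens into a single unit direction vector.
    s = subject_emb.mean(axis=0)
    s = s / (np.linalg.norm(s) + eps)
    # Projection of each style token onto the subject direction.
    proj = style_emb @ s  # shape: (num_style_tokens,)
    # Remove that component; the result is orthogonal to s.
    return style_emb - np.outer(proj, s)

# Toy usage: 4 style tokens and 3 subject tokens of dimension 8.
rng = np.random.default_rng(0)
style = rng.normal(size=(4, 8))
subject = rng.normal(size=(3, 8))
style_orth = orthogonalize_style_against_subject(style, subject)
```

After this step, the style tokens carry no component along the pooled subject direction, which is one simple way to reduce the cross-modal leakage the paper targets.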

📝 Abstract
Diffusion models are increasingly popular for generative tasks, including personalized composition of subjects and styles. While diffusion models can generate user-specified subjects performing text-guided actions in custom styles, they require fine-tuning and are not feasible for personalization on mobile devices. Hence, tuning-free personalization methods such as IP-Adapters have progressively gained traction. However, for the composition of subjects and styles, these works are less flexible due to their reliance on ControlNet, or show content and style leakage artifacts. To tackle these, we present SubZero, a novel framework to generate any subject in any style, performing any action, without the need for fine-tuning. We propose a novel set of constraints to enhance subject and style similarity while reducing leakage. Additionally, we propose an orthogonalized temporal aggregation scheme in the cross-attention blocks of the denoising model, effectively conditioning on a text prompt along with single subject and style images. We also propose a novel method to train customized content and style projectors to reduce content and style leakage. Through extensive experiments, we show that our proposed approach, while suitable for running on-edge, achieves significant improvements over state-of-the-art works performing subject, style and action composition.
Problem

Research questions and friction points this paper is trying to address.

Fine-tuning-based personalization is infeasible on mobile and edge devices
Tuning-free methods exhibit content and style leakage artifacts
Joint control of subject, style, and action remains an open problem
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot personalization without fine-tuning
Orthogonalized temporal aggregation in the denoising model's cross-attention
Customized content and style projectors