🤖 AI Summary
Existing methods for personalizing stickers from a single reference image often suffer from identity distortion and limited contextual controllability, caused by visual entanglement and structural rigidity. This work proposes SEAL, a plug-and-play, semantics-aware adaptation module that integrates into existing personalization pipelines without modifying the diffusion model backbone. SEAL imposes explicit spatial and structural constraints during embedding adaptation through three components: a semantics-guided spatial attention loss, a split-merge token strategy, and structure-aware layer restriction. The work also presents StickerBench, a large-scale sticker image dataset with structured annotations across six attributes: appearance, emotion, action, camera composition, style, and background. Experiments show that SEAL consistently improves identity preservation while maintaining contextual controllability, supporting the value of explicit spatial and structural constraints in test-time adaptation.
📝 Abstract
Synthesizing a target concept from a single reference image is challenging in diffusion-based personalized text-to-image generation, particularly for sticker personalization where prompts often require explicit attribute edits. With only one reference, test-time fine-tuning (TTF) methods tend to overfit, producing visual entanglement, where background artifacts are absorbed into the learned concept, and structural rigidity, where the model memorizes reference-specific spatial configurations and loses contextual controllability. To address these issues, we introduce SEmantic-aware single-image sticker personALization (SEAL), a plug-and-play, architecture-agnostic adaptation module that integrates into existing personalization pipelines without modifying their U-Net-based diffusion backbones. SEAL applies three components during embedding adaptation: (1) a Semantic-guided Spatial Attention Loss, (2) a Split-merge Token Strategy, and (3) Structure-aware Layer Restriction. To support sticker-domain personalization with attribute-level control, we present StickerBench, a large-scale sticker image dataset with structured tags under a six-attribute schema (Appearance, Emotion, Action, Camera Composition, Style, Background). These annotations provide a consistent interface for varying context while keeping target identity fixed, enabling systematic evaluation of identity disentanglement and contextual controllability. Experiments show that SEAL consistently improves identity preservation while maintaining contextual controllability, highlighting the importance of explicit spatial and structural constraints during test-time adaptation. The code, StickerBench, and project page will be publicly released.
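
To make the six-attribute schema concrete, the following is a minimal Python sketch of how one structured tag record might be represented and used to vary context while the identity stays fixed. It is an illustration only: the class `StickerTags`, the helpers `build_prompt` and `vary_context`, the prompt template, the placeholder identity token `<sks>`, and all tag values are assumptions made here, not the StickerBench format or the authors' code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StickerTags:
    """One annotation under the six-attribute schema (field names are hypothetical)."""
    appearance: str
    emotion: str
    action: str
    camera_composition: str
    style: str
    background: str


def build_prompt(identity_token: str, tags: StickerTags) -> str:
    """Compose a text prompt from the tags; this template is an assumption."""
    return (
        f"a sticker of {identity_token}, {tags.appearance}, "
        f"{tags.emotion} expression, {tags.action}, "
        f"{tags.camera_composition}, {tags.style} style, "
        f"{tags.background} background"
    )


def vary_context(base: StickerTags, **changes: str) -> StickerTags:
    """Return a copy of the tags with only the named attributes changed."""
    return replace(base, **changes)


# Hypothetical example values, chosen only for illustration.
base = StickerTags(
    appearance="wearing a red scarf",
    emotion="happy",
    action="waving",
    camera_composition="full body shot",
    style="flat cartoon",
    background="plain white",
)

# Same identity token, different context: only emotion and action change.
edited = vary_context(base, emotion="surprised", action="jumping")

print(build_prompt("<sks>", base))
print(build_prompt("<sks>", edited))
```

Holding the identity token constant and editing one or two attributes at a time is one way to read the abstract's "consistent interface": each prompt pair isolates a contextual change, so identity preservation and contextual controllability can be scored separately.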