Unified Personalized Understanding, Generating and Editing

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses the limitations of existing unified multimodal models in modeling user-specific concepts: poor consistency, weak controllability, inefficient personalization strategies, and undesirable task coupling. To overcome these challenges, we propose OmniPersona, a unified end-to-end framework that jointly supports personalized understanding, generation, and image editing within a single architecture. By introducing structurally decoupled concept tokens and an explicit cross-task knowledge replay mechanism, OmniPersona effectively mitigates task interference and enhances behavioral consistency. Our contributions include the first unified modeling of these three core personalization tasks, the design of novel decoupling and replay mechanisms, and the construction of OmniPBench, the first comprehensive benchmark encompassing understanding, generation, and editing. Experiments demonstrate that OmniPersona achieves robust and competitive performance across diverse personalization tasks.
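
The paper's code is not reproduced here, but the "structurally decoupled concept tokens" idea admits a straightforward reading: one learnable token bank per task, so that gradient updates in one subspace do not overwrite knowledge used by another. Below is a minimal, hypothetical PyTorch sketch of that reading; the class name, dimensions, and task labels are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecoupledConceptTokens(nn.Module):
    """Hypothetical sketch: learnable concept tokens with one dedicated
    subspace per task. Instead of a single shared soft prompt, each
    personalized concept owns separate token banks for understanding,
    generation, and editing, so gradients from one task cannot overwrite
    knowledge needed by another."""

    TASKS = ("understanding", "generation", "editing")

    def __init__(self, embed_dim: int = 4096, tokens_per_task: int = 4):
        super().__init__()
        self.tokens = nn.ParameterDict({
            task: nn.Parameter(torch.randn(tokens_per_task, embed_dim) * 0.02)
            for task in self.TASKS
        })

    def forward(self, task: str, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the task-specific concept tokens to the prompt embeddings,
        # analogous to soft-prompt personalization, but drawing from a
        # dedicated per-task subspace rather than one shared prompt.
        batch = prompt_embeds.size(0)
        concept = self.tokens[task].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([concept, prompt_embeds], dim=1)

# Example: the same concept injected into two different task streams.
concept = DecoupledConceptTokens(embed_dim=64, tokens_per_task=2)
prompt = torch.randn(1, 10, 64)              # dummy prompt embeddings
gen_input = concept("generation", prompt)    # shape: (1, 12, 64)
und_input = concept("understanding", prompt) # shape: (1, 12, 64)
```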

📝 Abstract
Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a "one-size-fits-all" paradigm and struggle to model user-specific concepts (e.g., generating a photo of a particular user-provided subject) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present OmniPersona, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces to different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose OmniPBench, which extends the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.
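
The abstract describes the explicit knowledge replay mechanism only at a high level. One plausible reading is a distillation-style objective in which attribute predictions made with the (frozen) understanding-task tokens serve as soft targets for the generation-task tokens, letting personalized attribute knowledge flow across subspaces while the parameters stay decoupled. The sketch below follows that reading; `replay_loss` and its arguments are assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def replay_loss(understanding_logits: torch.Tensor,
                generation_logits: torch.Tensor,
                temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical cross-task knowledge replay loss.

    Attribute logits produced with the understanding-task tokens are
    detached and used as soft targets, nudging the generation-task tokens
    to agree with them without sharing any parameters across tasks.
    """
    teacher = F.softmax(understanding_logits.detach() / temperature, dim=-1)
    student = F.log_softmax(generation_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

# Example: align generation-side attribute predictions with the
# understanding-side ones for a batch of 8 concepts and 32 attributes.
u_logits = torch.randn(8, 32)                        # frozen teacher
g_logits = torch.randn(8, 32, requires_grad=True)    # trainable student
loss = replay_loss(u_logits, g_logits)
loss.backward()
```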
Problem

Research questions and friction points this paper is trying to address.

personalization
multimodal models
concept consistency
task interference
unified understanding and generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized multimodal models
structurally decoupled concept tokens
knowledge replay
unified understanding-generation-editing
OmniPersona