LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized video generation methods struggle to ensure precise alignment between identities and attributes while maintaining intra-group consistency in multi-subject scenarios. To address this challenge, this work proposes a framework that explicitly models identity-attribute dependencies. The approach begins by constructing a structured video-text benchmark dataset, aided by a multimodal large language model, which encodes prior knowledge of identity-attribute relationships. Building upon this foundation, the authors introduce relation-aware self-attention and relation-aware cross-attention mechanisms, integrated into a diffusion model to enhance inter-subject discriminability and intra-group coherence. Experimental results demonstrate that the proposed method achieves state-of-the-art performance on the newly curated benchmark, significantly improving identity consistency, attribute alignment accuracy, and semantic controllability.

📝 Abstract
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline aggregates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that strengthens the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to encode explicit subject-attribute dependencies, enforcing intra-group cohesion while sharpening the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.
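The abstract does not spell out the exact formulation of Relational Self-Attention. As a rough illustration only, one way to read "enforcing intra-group cohesion while sharpening separation between subject clusters" is an attention bias that favors tokens belonging to the same subject group (an identity token and its attribute tokens). The function name, the additive-bias scheme, and the `group_ids` input below are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def relation_aware_self_attention(x, group_ids, bias=4.0):
    """Toy sketch of group-biased self-attention (illustrative, not LumosX's formulation).

    x         : (n, d) token features
    group_ids : (n,) subject-group id per token; an identity token shares
                the id of its attribute tokens
    bias      : additive logit bonus for same-group token pairs
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                       # standard dot-product logits
    same_group = group_ids[:, None] == group_ids[None, :]
    scores = scores + np.where(same_group, bias, 0.0)   # favor intra-group attention
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights
```

With three tokens where the first two share a group, the attention weight between same-group tokens exceeds the cross-group weight, which is the intended "cohesion vs. separation" effect in miniature.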
Problem

Research questions and friction points this paper is trying to address.

personalized video generation
face-attribute alignment
intra-group consistency
multi-subject generation
identity-consistent generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relational Self-Attention
Relational Cross-Attention
subject-attribute alignment
personalized video generation
multimodal large language models