🤖 AI Summary
Existing approaches lack an open, fine-grained, and controllable framework for joint audio-visual generation across multiple identities, making it difficult to coherently manipulate facial appearance and vocal timbre. This work proposes a unified and extensible cross-modal generative framework that automatically extracts identity information from both audio and video modalities and incorporates a flexible identity embedding injection mechanism, enabling high-fidelity personalized synthesis in both single- and multi-speaker scenarios. The framework introduces a cross-modal identity representation strategy for joint control and integrates automated data cleaning with multi-stage training to significantly enhance identity consistency, generation quality, and cross-modal alignment. Experimental results demonstrate that the proposed method substantially outperforms current state-of-the-art techniques across multiple evaluation metrics.
📝 Abstract
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: \href{https://chen-yingjie.github.io/projects/Identity-as-Presence}{Identity-as-Presence}.