Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches lack an open, fine-grained, and controllable framework for joint audio-visual generation across multiple identities, making it difficult to coherently manipulate facial appearance and vocal timbre. This work proposes a unified and extensible cross-modal generative framework that automatically extracts identity information from both audio and video modalities and incorporates a flexible identity embedding injection mechanism, enabling high-fidelity personalized synthesis in both single- and multi-speaker scenarios. The framework introduces a cross-modal identity representation strategy for joint control and integrates automated data cleaning with multi-stage training to significantly enhance identity consistency, generation quality, and cross-modal alignment. Experimental results demonstrate that the proposed method substantially outperforms current state-of-the-art techniques across multiple evaluation metrics.
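The summary mentions a "flexible identity embedding injection mechanism" without specifying its form here. One plausible realization — purely an illustrative assumption, not the paper's published method — is cross-attention from the generator's latent tokens to a set of per-subject identity tokens (one face embedding and one voice embedding each), sketched minimally below; the function name `inject_identity` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_identity(latent_tokens, face_emb, voice_emb):
    """Cross-attention from generation latents (queries) to identity
    tokens (keys/values). Face and voice embeddings for all subjects
    are concatenated along the token axis, so the same code handles
    single- and multi-subject conditioning.

    latent_tokens: (N, d) generator latents
    face_emb:      (S, d) one face embedding per subject
    voice_emb:     (S, d) one voice embedding per subject
    """
    id_tokens = np.concatenate([face_emb, voice_emb], axis=0)   # (2S, d)
    d = latent_tokens.shape[-1]
    scores = latent_tokens @ id_tokens.T / np.sqrt(d)           # (N, 2S)
    attn = softmax(scores, axis=-1)
    return latent_tokens + attn @ id_tokens                     # residual update
```

Because the identity tokens are just concatenated per subject, adding a speaker only lengthens the key/value sequence, which is one way such a mechanism could scale from single- to multi-subject scenarios.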

📝 Abstract
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: https://chen-yingjie.github.io/projects/Identity-as-Presence
Problem

Research questions and friction points this paper is trying to address.

identity-aware generation
audio-video synthesis
personalized generation
facial appearance
voice timbre
Innovation

Methods, ideas, or system contributions that make the work stand out.

identity-aware generation
joint audio-video synthesis
facial appearance control
voice timbre personalization
cross-modal coherence