🤖 AI Summary
This work proposes ID-LoRA, a novel approach for personalized audiovisual video generation that overcomes the limitations of existing methods, which typically process audio and video separately, leading to poor synchronization and limited text control over speaking style and acoustic environment. Built upon the LTX-2 joint audiovisual diffusion model, ID-LoRA enables synchronized customization of speaker appearance and voice in a single generation pass using a reference image, a short audio clip, and a text prompt. Key innovations include the first demonstration of joint audiovisual personalization within a single model and inference step, a negative temporal positional encoding to distinguish reference tokens from generated tokens, and an identity-guidance mechanism to preserve speaker characteristics. The method leverages parameter-efficient In-Context LoRA fine-tuning, RoPE extension, and classifier-free guidance. Experiments show that 73% of users rate its voice similarity as superior to Kling 2.6 Pro, with a 24% improvement in cross-environment speaker similarity, achieved using only ~3K samples and a single GPU.
📝 Abstract
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions, and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject's appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions made with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. In cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
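The two mechanisms described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the exact position offsets, guidance formulation, and weight values (`w_text`, `w_id`) are assumptions chosen for clarity. The first function places reference tokens at negative temporal indices so they occupy a RoPE region disjoint from the generated tokens while keeping their internal order; the second combines denoiser predictions in the style of classifier-free guidance, adding a term that contrasts predictions with and without the reference (identity) signal.

```python
import numpy as np

def temporal_positions(n_ref: int, n_gen: int) -> np.ndarray:
    """Negative temporal positions (sketch).

    Reference frames get indices -n_ref .. -1 and generated frames get
    0 .. n_gen-1, so the two segments never overlap in RoPE space but
    each keeps its internal temporal structure.
    """
    ref = np.arange(-n_ref, 0)   # reference tokens: disjoint, negative region
    gen = np.arange(0, n_gen)    # generated tokens: standard non-negative region
    return np.concatenate([ref, gen])

def identity_guidance(eps_uncond, eps_text, eps_text_ref,
                      w_text=7.5, w_id=2.0):
    """CFG-style combination with an identity-guidance term (sketch).

    eps_uncond   : prediction with no conditioning
    eps_text     : prediction with text only
    eps_text_ref : prediction with text + reference (image/audio) signal
    The weights are hypothetical; the paper's exact rule may differ.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)       # standard text guidance
            + w_id * (eps_text_ref - eps_text))      # amplify identity features

# Example: 3 reference frames followed by 5 generated frames.
print(temporal_positions(3, 5).tolist())  # → [-3, -2, -1, 0, 1, 2, 3, 4]
```

With `w_id = 0` the rule reduces to ordinary classifier-free guidance, which is why diluted speaker characteristics would be expected without the extra term.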