EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing generative video models struggle to enable fine-grained, script-driven editing of recorded talking-head videos while preserving speaker identity, temporal coherence, and precise lip-sync. This work proposes a diffusion transformer (DiT)-based video-to-video editing framework that supports transcript-level manipulations, such as inserting, deleting, or re-timing spoken content, guided by audio conditioning and enhanced through region-aware training. By integrating spatiotemporal inpainting, the method synthesizes natural facial dynamics and lip movements that align accurately with the edited speech. The approach achieves high-fidelity identity preservation and long-range temporal consistency alongside accurate audio-visual synchronization, offering a powerful and controllable tool for professional post-production editing of talking-head videos.

📝 Abstract
Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.
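The abstract does not spell out how the region-aware spatiotemporal inpainting is formulated. A common way such edits are realized in diffusion-based inpainting is a masked denoising update: the model's prediction is used only inside the edited spatiotemporal region, while latents outside the mask are kept from the source video. The sketch below illustrates that idea only; the function and tensor layout (`masked_denoise_step`, frame-major latents, per-frame audio features) are assumptions, not the paper's actual implementation.

```python
import numpy as np

def masked_denoise_step(video_latents, mask, audio_feats, denoise_fn, t):
    """One illustrative denoising step for region-aware spatiotemporal inpainting.

    video_latents: (T, H, W, C) latent frames of the source video
    mask:          (T, H, W, 1) binary mask, 1 where content is being edited
    audio_feats:   (T, D) per-frame audio features driving the lip motion
    denoise_fn:    stand-in for the audio-conditioned DiT denoiser
    """
    predicted = denoise_fn(video_latents, audio_feats, t)
    # Regenerate content inside the edit region; preserve the source elsewhere.
    return mask * predicted + (1.0 - mask) * video_latents

# Toy demo: a "model" that returns zeros stands in for the real denoiser.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 4, 4, 3))
mask = np.zeros((8, 4, 4, 1))
mask[3:6] = 1.0                      # edit only frames 3-5
audio = rng.normal(size=(8, 16))
out = masked_denoise_step(latents, mask, audio,
                          lambda z, a, t: np.zeros_like(z), t=0)
```

In this toy run, frames outside the mask are returned unchanged, which mirrors the identity-preservation property the abstract emphasizes: only the edited segment is resynthesized.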
Problem

Research questions and friction points this paper is trying to address.

talking head video editing
audio-driven video generation
lip synchronization
temporal coherence
video-to-video editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers
audio-driven video editing
talking head generation
spatiotemporal inpainting
lip synchronization
John Flynn
Google Inc.
Computer Vision · Machine Learning

Wolfgang Paier
Pipio AI

Dimitar Dinev
Pipio
Computer Graphics · Computer Vision · Digital Humans

Sam Nhut Nguyen
Pipio AI

Hayk Poghosyan
Pipio AI

Manuel Toribio
Pipio AI

Sandipan Banerjee
Amazon

Guy Gafni
Pipio AI