🤖 AI Summary
Current speech-driven video generation is hindered by the scarcity of large-scale, multi-view datasets; prevailing benchmarks offer only single-shot, static viewpoints, limiting controllable, long-sequence, multi-shot synthesis. To address this, we introduce TalkCuts, the first large-scale multi-shot human speech video dataset, comprising 164k high-quality clips (over 500 hours) spanning close-up, half-body, and full-body views, with multimodal annotations including 2D keypoints, 3D SMPL-X poses, and detailed textual descriptions. Leveraging TalkCuts, we propose Orator, a large language model (LLM)-driven generative framework that, for the first time, enables fine-grained joint control over shot transitions, speech prosody, and speaker gestures. Experiments demonstrate substantial improvements in visual fidelity and cross-shot temporal coherence, enabling high-fidelity, long-duration, multi-view speech video generation with precise controllability.
📝 Abstract
In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets, which focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech video with diverse camera shots, including close-up, half-body, and full-body views. The dataset covers over 10k identities and includes detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, a simple baseline built on an LLM-guided multi-modal generation framework, in which the language model acts as a multi-faceted director that orchestrates detailed specifications for camera transitions, speaker gesticulation, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work on controllable, multi-shot speech video generation and broader multimodal learning.
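The division of labor the abstract describes, where an LLM director emits per-segment shot, gesture, and prosody directives that then condition a downstream video generation module, can be sketched as follows. This is a minimal illustrative sketch only: every name (`ShotDirective`, `plan_shots`, `render_video`) and the simple cycling heuristic are assumptions for exposition, not Orator's actual interface.

```python
from dataclasses import dataclass

# Camera views covered by TalkCuts (per the dataset description).
SHOT_TYPES = ["close-up", "half-body", "full-body"]

@dataclass
class ShotDirective:
    """One segment of the multi-shot plan: camera view plus motion/voice cues."""
    text: str     # speech segment the shot covers
    camera: str   # one of SHOT_TYPES
    gesture: str  # cue that a real system would map to SMPL-X pose sequences
    prosody: str  # vocal-modulation hint for the speech track

def plan_shots(transcript_sentences):
    """Stand-in for the LLM director.

    A real system would prompt an LLM to choose shot transitions and cues;
    here we just cycle through shot types to illustrate the data flow.
    """
    plan = []
    for i, sentence in enumerate(transcript_sentences):
        emphatic = "!" in sentence
        plan.append(ShotDirective(
            text=sentence,
            camera=SHOT_TYPES[i % len(SHOT_TYPES)],
            gesture="emphatic" if emphatic else "neutral",
            prosody="excited" if emphatic else "calm",
        ))
    return plan

def render_video(plan):
    """Stub for the multi-modal generation module: consumes the directives."""
    return [f"[{d.camera}] {d.text}" for d in plan]

plan = plan_shots(["Welcome everyone.", "This result is remarkable!", "Let us begin."])
for segment in render_video(plan):
    print(segment)
```

The point of the sketch is the interface, not the heuristic: the director produces a structured, per-segment plan, so the video module never has to infer camera or gesture decisions from raw audio alone.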