AI Summary
Existing audiovisual datasets predominantly focus on single-speaker monologues, limiting their utility for modeling interactions in multi-person dialogues. To address this gap, this work introduces F2F-JF, a dual-speaker talk-show dataset comprising 70 hours and 14,000 temporally aligned video segments, offering the first large-scale resource with synchronized dyadic interaction videos and rich metadata. The dataset is efficiently constructed via a semi-automatic pipeline integrating multi-object tracking, speech segmentation, and lightweight manual verification. Building upon a MultiTalk-style diffusion model, the authors propose a novel approach that leverages a guest's preceding video as visual context to enable cross-character reactive generation. Experiments demonstrate consistent improvements in both Emotion-FID and FVD metrics while preserving lip-sync accuracy, establishing a new end-to-end paradigm for multi-character responsive video synthesis.
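The pairing step at the heart of such a pipeline can be illustrated with a minimal sketch: given diarized speech segments labeled by speaker, each guest turn is matched to the host turn that immediately follows it. The `Segment` structure and the `max_gap` threshold are hypothetical simplifications, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # "guest" or "host"
    start: float   # seconds
    end: float     # seconds

def pair_turns(segments, max_gap=1.0):
    """Pair each guest turn with the host turn that immediately follows it.

    `max_gap` is an illustrative threshold: pairs separated by a long
    silence are dropped, mimicking the lightweight filtering a
    semi-automatic pipeline might apply before manual verification.
    """
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if (prev.speaker == "guest" and nxt.speaker == "host"
                and nxt.start - prev.end <= max_gap):
            pairs.append((prev, nxt))
    return pairs
```

Candidate pairs produced this way would then go to the human-verification stage rather than straight into the dataset.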
Abstract
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce \textbf{Face-to-Face with Jimmy Fallon (F2F-JF)}, a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. Together, the dataset, preprocessing recipe, and baseline provide an end-to-end blueprint for studying dyadic, sequential behavior. Dataset and code will be made publicly available.
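The conditioning windows described above can be sketched as a simple sample constructor: the guest's frames over $[t_0,t_1]$ become visual context, and the host's frames over $[t_1,t_2]$ are the generation target. Fixed-fps frame indexing and the returned field names are illustrative assumptions, not the dataset's actual format.

```python
def make_reactive_sample(t0, t1, t2, fps=25):
    """Build frame-index windows for one reactive-generation sample.

    Context: guest video over [t0, t1).  Target: host video over [t1, t2),
    generated from the host's audio.  A constant frame rate `fps` is
    assumed purely for illustration.
    """
    assert t0 < t1 < t2, "turn boundaries must be strictly increasing"
    guest_context = range(int(t0 * fps), int(t1 * fps))
    host_target = range(int(t1 * fps), int(t2 * fps))
    return {"guest_context": guest_context, "host_target": host_target}
```

Because the context window ends exactly where the target window begins, consecutive samples preserve the guest-then-host ordering that the dataset is built around.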