Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

πŸ“… 2026-03-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing audiovisual datasets predominantly focus on single-speaker monologues, limiting their utility for modeling interactions in multi-person dialogues. To address this gap, this work introduces F2F-JF, a dual-speaker talk-show dataset comprising 70 hours of footage across 14,000 temporally aligned video segments, offering a large-scale resource of synchronized dyadic interaction videos with rich metadata. The dataset is constructed efficiently via a semi-automatic pipeline that integrates multi-person tracking, speech diarization, and lightweight manual verification. Building on a MultiTalk-style diffusion model, the authors propose an approach that conditions generation on the guest's preceding video as visual context, enabling cross-character reactive generation. Experiments show consistent, albeit modest, improvements in Emotion-FID and FVD while preserving lip-sync accuracy, providing an end-to-end blueprint for multi-character responsive video synthesis.
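
For illustration, the sketch below shows one plausible way the turn-alignment stage of such a pipeline could pair each diarized guest turn with the immediately following host turn. It assumes diarized speaker turns are already available; the Turn/DyadicSegment names, thresholds, and pairing heuristic are illustrative assumptions, not the authors' actual tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str   # "host" or "guest", from speaker diarization
    start: float   # seconds
    end: float


@dataclass
class DyadicSegment:
    """A guest turn [t0, t1] paired with the host's response [t1, t2]."""
    guest_window: Tuple[float, float]
    host_window: Tuple[float, float]


def pair_guest_host_turns(turns: List[Turn],
                          max_gap: float = 1.0,
                          min_len: float = 1.0) -> List[DyadicSegment]:
    """Pair guest turns with the host turn that immediately follows them.

    `max_gap` and `min_len` are illustrative thresholds, not values from the paper.
    """
    turns = sorted(turns, key=lambda t: t.start)
    segments = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev.speaker != "guest" or nxt.speaker != "host":
            continue
        if nxt.start - prev.end > max_gap:  # turns must be roughly contiguous
            continue
        if min(prev.end - prev.start, nxt.end - nxt.start) < min_len:
            continue
        segments.append(DyadicSegment(guest_window=(prev.start, prev.end),
                                      host_window=(nxt.start, nxt.end)))
    return segments


# Example: two diarized turns yield one aligned guest->host segment.
turns = [Turn("guest", 10.0, 14.5), Turn("host", 14.7, 19.2)]
print(pair_guest_host_turns(turns))
```

In the described pipeline, these aligned windows would then drive tight face crops from the multi-person tracks before lightweight manual verification.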

πŸ“ Abstract
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during [t1, t2] is generated from the host's audio plus the guest's preceding video during [t0, t1]. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. Together, the dataset, preprocessing recipe, and baseline provide an end-to-end blueprint for studying dyadic, sequential behavior. Dataset and code will be made publicly available.
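
To make the conditioning setup concrete, here is a minimal sketch of how host audio features for [t1, t2] could be fused with guest visual features from [t0, t1] via cross-attention before being fed to a video diffusion backbone. It assumes PyTorch; the dimensions, module choices, and fusion scheme are illustrative assumptions, not the paper's MultiTalk-style architecture.

```python
import torch
import torch.nn as nn


class CrossPersonConditioner(nn.Module):
    """Sketch: fuse host speech features with the guest's preceding visual context.

    All dimensions and the attention-based fusion are assumptions for illustration.
    """

    def __init__(self, audio_dim: int = 768, visual_dim: int = 512, cond_dim: int = 1024):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, cond_dim)
        self.visual_proj = nn.Linear(visual_dim, cond_dim)
        self.fuse = nn.MultiheadAttention(cond_dim, num_heads=8, batch_first=True)

    def forward(self, host_audio_feats: torch.Tensor,
                guest_video_feats: torch.Tensor) -> torch.Tensor:
        # host_audio_feats:  (B, T_audio, audio_dim)  -- host speech for [t1, t2]
        # guest_video_feats: (B, T_video, visual_dim) -- guest frames for [t0, t1]
        q = self.audio_proj(host_audio_feats)
        kv = self.visual_proj(guest_video_feats)
        # Audio-frame queries attend to the guest's preceding visual context.
        cond, _ = self.fuse(q, kv, kv)
        return cond  # per-frame conditioning for the video diffusion backbone


# Example with dummy features: 50 host audio frames, 40 guest video frames.
cond = CrossPersonConditioner()(torch.randn(2, 50, 768), torch.randn(2, 40, 512))
print(cond.shape)  # torch.Size([2, 50, 1024])
```

An audio-only baseline would simply drop the guest branch, which is the comparison against which the abstract reports Emotion-FID and FVD gains.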
Problem

Research questions and friction points this paper is trying to address.

multi-person interaction
conversational tempo
sequential dependency
audio-visual dataset
dyadic behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-person interaction
temporal alignment
speech-driven avatar
diffusion model
video dataset
πŸ”Ž Similar Papers
No similar papers found.