VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing full-duplex dialogue benchmarks: they evaluate only speech and neglect audiovisual nonverbal behaviors such as nodding, smiling, and gestures. To bridge this gap, the authors propose VideoFDB, the first full-duplex audiovisual dialogue benchmark, built from 237 dyadic clips of real-world video calls that span 11 nonverbal conversational dynamics, with a taxonomy that separates perception from generation behaviors. They further introduce an interpretable, rubric-based LM-as-judge scoring framework. VideoFDB enables, for the first time, AV2AV (audio-visual-to-audio-visual) dialogue assessment, revealing that current agents commonly suffer from "captioning collapse" and ignore the visual stream: they use vision for explicit visual question answering but not for streaming joint audiovisual grounding. The benchmark also shows that the architecture of cascaded speech-to-avatar systems precludes producing full-duplex nonverbal cues, offering both an evaluation tool and clear directions for advancing multimodal dialogue systems.

📝 Abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.
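To make the rubric-based evaluation concrete, here is a minimal sketch of how per-clip judge scores could be aggregated along the perception/generation split. It is an illustration under stated assumptions, not the authors' implementation: the `JudgedClip` record, the `aggregate` helper, the axis names (`timing`, `relevance`), and the dynamic labels (`nod`, `smile`) are all hypothetical, since the abstract does not name the rubric axes or the 11 dynamics.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedClip:
    """One clip after an LM judge has scored it (hypothetical schema)."""
    clip_id: str
    dynamic: str              # nonverbal dynamic label; names here are assumed
    behavior: str             # "perception" or "generation"
    scores: dict[str, float]  # rubric axis -> judge score; axis names are assumed


def aggregate(clips: list[JudgedClip]) -> dict[tuple[str, str], float]:
    """Return the mean judge score for every (behavior, rubric axis) pair."""
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for clip in clips:
        for axis, score in clip.scores.items():
            buckets[(clip.behavior, axis)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Toy data only; these are not results from the paper.
clips = [
    JudgedClip("a", "nod", "perception", {"timing": 4.0, "relevance": 2.0}),
    JudgedClip("b", "smile", "perception", {"timing": 2.0}),
    JudgedClip("c", "nod", "generation", {"timing": 5.0}),
]
result = aggregate(clips)
```

Keeping scores keyed by behavior and axis, rather than collapsing them into a single number, is one way to preserve the interpretability the abstract attributes to the rubric.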

Problem

Research questions and friction points this paper is trying to address.

full-duplex

audiovisual conversation

nonverbal cues

conversational agents

multimodal interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex

audiovisual conversation

nonverbal cues