Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of modeling implicit causal relationships among dialogue acts and ensuring real-time, interpretable inference in full-duplex spoken dialogue, this paper formalizes dialogue act reasoning as hierarchical causal graph inference—the first such formulation—and proposes a Graph-of-Thoughts (GoT)-driven joint intent–action modeling framework. Methodologically, it constructs a dynamically evolving GoT structure, designs a multi-granularity causal annotation scheme, synthesizes a hybrid training corpus integrating simulated events, human attributions, and real-world dialogues, and combines multimodal Transformers with streaming graph inference for low-latency prediction. Contributions include: (1) the first causal reasoning model for dialogue acts specifically designed for full-duplex speech; (2) real-time, traceable, and interpretable behavior prediction; and (3) significantly improved robustness on both synthetic and real-world data, alongside the release of a dedicated evaluation benchmark.

📝 Abstract
Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
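The hierarchical labeling scheme described above pairs high-level communicative intents with low-level, timed speech acts. A minimal sketch of what such paired labels might look like as data structures (the label sets, class names, and example values here are hypothetical, not taken from the paper):

```python
from dataclasses import dataclass
from enum import Enum, auto

class Intent(Enum):
    # High-level communicative intents (hypothetical label set)
    HOLD_FLOOR = auto()
    YIELD_FLOOR = auto()
    REQUEST_CLARIFICATION = auto()

class SpeechAct(Enum):
    # Low-level timed speech acts (hypothetical label set)
    CONTINUE_SPEAKING = auto()
    BACKCHANNEL = auto()
    INTERRUPT = auto()
    PAUSE = auto()

@dataclass
class BehaviorLabel:
    """One point on the intent-to-action pathway: a low-level act
    grounded in a high-level intent, timestamped so that causal and
    temporal dependencies between labels can be learned."""
    intent: Intent
    act: SpeechAct
    time_s: float

# Example: a listener backchannels while signaling the speaker may keep the floor.
label = BehaviorLabel(Intent.HOLD_FLOOR, SpeechAct.BACKCHANNEL, time_s=3.2)
```

The two-level structure is the point: a model predicting only `SpeechAct` values cannot distinguish, say, a backchannel meant to yield the floor from one meant to encourage the speaker to continue.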
Problem

Research questions and friction points this paper is trying to address.

Model causal reasoning in full-duplex speech conversations
Predict communicative intents and speech acts hierarchically
Enable interpretable behavior detection in real-time dialogue systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-of-Thoughts models conversational behavior as causal inference
Hierarchical labeling predicts intents and speech acts with dependencies
Multimodal transformer forecasts acts and generates justifications dynamically
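The three innovations above center on a Graph-of-Thoughts that evolves as the dialogue streams in: each predicted behavior becomes a node carrying a justification, causal links form the edges, and earlier nodes can be revised as new audio arrives. A minimal, stdlib-only sketch of such an evolving graph (the class, method names, and example events are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ThoughtGraph:
    """An evolving Graph-of-Thoughts: nodes are predicted behaviors
    with short justifications; directed edges record causal links."""
    nodes: Dict[int, dict] = field(default_factory=dict)
    edges: List[Tuple[int, int]] = field(default_factory=list)
    next_id: int = 0

    def add_thought(self, act: str, rationale: str, causes=()) -> int:
        # Append a new prediction, linked to the nodes that caused it.
        nid = self.next_id
        self.next_id += 1
        self.nodes[nid] = {"act": act, "rationale": rationale}
        for cause in causes:
            self.edges.append((cause, nid))
        return nid

    def revise(self, nid: int, act: str, rationale: str) -> None:
        # Dynamic refinement: overwrite an earlier prediction in place
        # when newly streamed audio contradicts it.
        self.nodes[nid] = {"act": act, "rationale": rationale}

    def chain(self, nid: int) -> List[str]:
        # Trace one interpretable reasoning chain back to its root cause.
        parent = {dst: src for src, dst in self.edges}
        acts = [self.nodes[nid]["act"]]
        while nid in parent:
            nid = parent[nid]
            acts.append(self.nodes[nid]["act"])
        return acts[::-1]

g = ThoughtGraph()
a = g.add_thought("user_pauses", "silence exceeds turn-taking threshold")
b = g.add_thought("backchannel", "pause suggests a floor offer", causes=[a])
print(g.chain(b))  # ['user_pauses', 'backchannel']
```

In a full system a multimodal transformer would supply the `act` and `rationale` for each node from streaming audio features; the graph itself is what makes each prediction traceable, since any behavior can be walked back along its causal edges.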
Authors

Shuchang Pan (Zhejiang University)
Siddharth Banerjee
Dhruv Hebbar (University of California, Berkeley)
Siddhant Patel (University of California, Berkeley)
Akshaj Gupta (University of California, Berkeley)
Kan Jen Cheng (University of California, Berkeley)
Hanjo Kim (University of California, Berkeley)
Zeyi Austin Li (University of California, Berkeley)
Martin Q. Ma (Carnegie Mellon University)
Tingle Li (UC Berkeley)
Gopala Anumanchipalli (University of California, Berkeley)
Jiachen Lian (UC Berkeley)