Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the "Conversations Gone Awry" (CGA) prediction task by introducing the first standardized evaluation benchmark and a novel dynamic metric that quantifies a model's ability to continuously refine its predictions as a dialogue evolves. Methodologically, it establishes a unified evaluation framework integrating sequence classification and timestep-wise prediction, using dialogue history modeling and attention mechanisms to systematically assess mainstream architectures, including LLM-based models. The contributions are threefold: (1) releasing the first dynamic CGA prediction benchmark; (2) proposing time-sensitive metrics that capture predictive stability and calibration over the course of a dialogue; and (3) empirically revealing significant limitations of state-of-the-art models in long-horizon dynamic calibration. The results establish a reproducible, comparable evaluation paradigm for dialogue prediction research, advancing proactive dialogue management in human–AI collaboration.
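The timestep-wise setting described above can be pictured as re-querying the forecaster after every utterance, so that it only ever sees the conversation prefix observed so far. The sketch below illustrates this evaluation loop under stated assumptions: the `forecast` callable and the 0.5 decision threshold are hypothetical placeholders, not the paper's actual API.

```python
from typing import Callable, List

def timestep_forecasts(
    utterances: List[str],
    forecast: Callable[[List[str]], float],  # hypothetical model interface
) -> List[float]:
    """Query the forecaster after every utterance: at step t the model
    sees only the first t utterances, never the conversation's future."""
    return [forecast(utterances[:t]) for t in range(1, len(utterances) + 1)]

def conversation_flagged(probs: List[float], threshold: float = 0.5) -> bool:
    """One common CGA convention: the conversation is forecast to derail
    if the per-step probability ever crosses the threshold before it ends."""
    return any(p >= threshold for p in probs)
```

Collecting the full per-step probability trajectory, rather than a single conversation-level label, is what makes dynamic metrics over dialogue progression possible.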

📝 Abstract
We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model's ability to revise its forecast as the conversation progresses.
Problem

Research questions and friction points this paper is trying to address.

Evaluating models for predicting conversation derailment.
Creating a uniform benchmark for comparing forecasting architectures.
Introducing a metric for dynamic forecast revision in dialogues.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uniform evaluation framework for CGA models
Benchmark for direct architecture comparisons
Novel metric for dynamic forecast revision (illustrated in the sketch below)
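This page does not spell out the metric's exact definition, so the following is a purely illustrative stand-in: two simple quantities one could compute over a per-step forecast trajectory, namely revision volatility and an average per-timestep Brier score against the eventual outcome. Both function names and formulas are assumptions for illustration, not the paper's metric.

```python
from typing import List

def forecast_volatility(probs: List[float]) -> float:
    """Mean absolute change between consecutive forecasts; lower values
    mean the model revises more smoothly as the conversation unfolds."""
    if len(probs) < 2:
        return 0.0
    return sum(abs(b - a) for a, b in zip(probs, probs[1:])) / (len(probs) - 1)

def mean_timestep_brier(probs: List[float], derailed: bool) -> float:
    """Average per-timestep Brier score against the eventual outcome,
    a rough proxy for calibration over the dialogue's progression."""
    y = 1.0 if derailed else 0.0
    return sum((p - y) ** 2 for p in probs) / len(probs)
```

For a conversation that eventually derails, a well-behaved forecaster would show a trajectory that rises toward 1 early and stays there, yielding a low Brier average without excessive volatility.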
Son Quoc Tran
Cornell University
Tushaar Gangavarapu
The University of Texas at Austin
Nicholas Chernogor
Harvey Mudd College
Jonathan P. Chang
Harvey Mudd College
Cristian Danescu-Niculescu-Mizil
Associate Professor, Cornell University
computational social science, social computing, computational linguistics, natural language processing