🤖 AI Summary
This paper addresses the “Conversations Gone Awry” (CGA) prediction task by introducing the first standardized evaluation benchmark and a novel dynamic metric that quantifies a model’s ability to continuously refine its predictions as a dialogue evolves. Methodologically, it establishes a unified evaluation framework that integrates sequence classification with timestep-wise prediction, enabling a systematic assessment of mainstream architectures, including LLM-based models, that encode dialogue history with attention mechanisms. The contributions are threefold: (1) releasing the first dynamic CGA prediction benchmark; (2) proposing time-sensitive metrics that capture predictive stability and calibration over the course of a dialogue; and (3) empirically revealing significant limitations of state-of-the-art models in long-horizon dynamic calibration. The results establish a reproducible evaluation paradigm that enables direct comparison across models in dialogue prediction research, advancing proactive dialogue management in human–AI collaboration.
📝 Abstract
We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model's ability to revise its forecast as the conversation progresses.
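The dynamic, forecast-revision metric described in the summary and abstract can be pictured as scoring the model's running forecast after every utterance rather than once per conversation. The sketch below is a minimal illustration of that idea under assumed conventions; the function names (`timestep_accuracy`, `recovery_rate`), the 0.5 threshold, and the simple turn-counting scheme are hypothetical and are not the paper's actual metric.

```python
from typing import List

def timestep_accuracy(per_turn_probs: List[float], derailed: bool,
                      threshold: float = 0.5) -> List[bool]:
    """For each conversation prefix, check whether the model's running
    forecast (probability of derailment) agrees with the final outcome.
    Illustrative only; the benchmark's real metric may differ."""
    return [(p >= threshold) == derailed for p in per_turn_probs]

def recovery_rate(per_turn_probs: List[float], derailed: bool,
                  threshold: float = 0.5) -> float:
    """Fraction of turns from which the model has 'locked in' the correct
    forecast and keeps it for the rest of the conversation -- one way to
    quantify how well a model revises its prediction as the dialogue unfolds."""
    correct = timestep_accuracy(per_turn_probs, derailed, threshold)
    # Find the earliest turn after which every subsequent forecast is correct.
    locked_from = len(correct)
    for i in range(len(correct) - 1, -1, -1):
        if not correct[i]:
            break
        locked_from = i
    return (len(correct) - locked_from) / len(correct)

# Example: a model that grows more confident a conversation will derail.
probs = [0.2, 0.35, 0.55, 0.7, 0.9]          # forecast after each utterance
print(recovery_rate(probs, derailed=True))   # -> 0.6 (correct from turn 3 of 5 onward)
```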