🤖 AI Summary
Current spoken language models (SLMs) have not been systematically evaluated on temporal dynamics, such as prosodic control, synchronized response generation, and full-duplex interaction, even though these capabilities are critical to natural spoken dialogue.
Method: We introduce Game-Time, a benchmark explicitly designed to assess SLMs’ temporal competence. Inspired by how humans learn a language through language activities, Game-Time combines basic instruction-following tasks with advanced tasks that impose explicit time constraints, such as tempo adherence and synchronized speech responses.
Contribution/Results: Experiments across diverse SLM architectures show that state-of-the-art models handle basic tasks well, while many contemporary systems still struggle with fundamental instruction following. Nearly all models degrade substantially on temporally constrained tasks, exposing persistent weaknesses in time awareness and full-duplex interaction. Game-Time provides a structured, quantifiable evaluation framework for temporal dynamics in SLMs and a foundation for guiding future model improvement.
📝 Abstract
Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical yet unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.