🤖 AI Summary
Existing question answering research focuses primarily on vision–language modalities, leaving time series question answering (TSQA), particularly over human skeleton trajectories, largely unexplored. This work introduces QuAnTS, the first large-scale, high-quality TSQA benchmark, comprising diverse natural language questions grounded in semantic action understanding; it is constructed via precise skeletal trajectory annotation and controllable neural natural language generation. We establish human performance as an upper-bound reference and conduct systematic evaluations across multiple baseline models. Experiments reveal that state-of-the-art time series models significantly underperform humans on TSQA, exposing critical limitations in interpretability and interactive reasoning. This work addresses two gaps in the TSQA field, data resources and standardized evaluation, and provides essential infrastructure and a reliable benchmark for advancing text-based interaction and human–AI collaborative decision-making with time series models.
📝 Abstract
Text offers intuitive access to information. This can, in particular, complement the density of numerical time series, thereby allowing improved interactions with time series models to enhance accessibility and decision-making. While the creation of question-answering datasets and models has recently seen remarkable growth, most research focuses on question answering (QA) on vision and text, with time series receiving little attention. To bridge this gap, we propose a novel and challenging time series QA (TSQA) dataset, QuAnTS, for Question Answering on Time Series data. Specifically, we pose a wide variety of questions and answers about human motion in the form of tracked skeleton trajectories. We verify that the large-scale QuAnTS dataset is well-formed and comprehensive through extensive experiments. Thoroughly evaluating existing and newly proposed baselines then lays the groundwork for a deeper exploration of TSQA using QuAnTS. Additionally, we provide human performance as a key reference for gauging the practical usability of such models. We hope to encourage future research on interacting with time series models through text, enabling better decision-making and more transparent systems.