🤖 AI Summary
Existing data analysis benchmarks fail to capture core challenges in real-world scenarios: ambiguous analytical goals, noisy and heterogeneous data, and the necessity of iterative user interaction. To address this gap, we propose ConDABench, a benchmark framework for interactive, conversational data analysis (ConDA). ConDABench employs a multi-agent collaborative generation pipeline to automatically construct 1,420 dialogue tasks from real-world data sources, each requiring progressive intent clarification. It also introduces an interactive evaluation engine that systematically assesses models' capabilities in multi-step tool invocation, state tracking, and adaptation to evolving goals across extended dialogues. Experimental results reveal that while newer large language models solve more instances, they remain significantly limited in long-horizon interaction and dynamic intent understanding. ConDABench establishes a reproducible, scalable evaluation paradigm and benchmark infrastructure to advance the development of truly collaborative, human-in-the-loop analytical models.
📝 Abstract
Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and is hence essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models solves more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench offers model builders a way to measure progress towards truly collaborative models that can complete complex interactive tasks.
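To make the interactive setting concrete, below is a minimal sketch of what an evaluation loop for conversational data analysis might look like. This is a hypothetical illustration, not the actual ConDABench API: the names `ConDATask`, `SimulatedUser`, `evaluate`, and the agent's `step` method are all assumptions introduced here.

```python
# Illustrative sketch only: ConDATask, SimulatedUser, and evaluate are
# hypothetical names, not the real ConDABench API.
from dataclasses import dataclass


@dataclass
class ConDATask:
    """One conversational data analysis problem: an under-specified request
    plus the hidden clarifications a simulated user can reveal."""
    initial_request: str
    clarifications: dict[str, str]  # question keyword -> user's answer
    expected_answer: str


class SimulatedUser:
    """Stands in for the human analyst, answering clarifying questions."""

    def __init__(self, task: ConDATask):
        self.task = task

    def reply(self, question: str) -> str:
        for keyword, answer in self.task.clarifications.items():
            if keyword in question.lower():
                return answer
        return "Not sure; use your best judgment."


def evaluate(agent, task: ConDATask, max_turns: int = 10) -> bool:
    """Run a multi-turn dialogue. On each turn the agent either asks a
    clarifying question ("ask") or commits to a final result ("answer");
    success means producing the expected answer within the turn budget."""
    user = SimulatedUser(task)
    message = task.initial_request
    for _ in range(max_turns):
        action, content = agent.step(message)  # agent interface is assumed
        if action == "answer":
            return content.strip() == task.expected_answer
        message = user.reply(content)  # feed the clarification back in
    return False  # turn budget exhausted without a final answer
```

A full harness would additionally score intermediate tool calls and track dialogue state, but the ask/answer loop above captures the core difference between conversational evaluation and single-shot benchmarks: success depends on eliciting missing intent, not just on the final computation.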