ConDABench: Interactive Evaluation of Language Models for Data Analysis

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing data analysis benchmarks fail to capture core challenges in real-world scenarios—namely, ambiguous analytical goals, noisy and heterogeneous data, and the necessity of iterative user interaction. To address this gap, the authors propose ConDABench, the first benchmark framework for interactive, conversational data analysis (ConDA). ConDABench employs a multi-agent collaborative generation pipeline to automatically construct 1,420 dialogue tasks from real-world data sources, each requiring progressive intent clarification. It introduces an interactive evaluation engine that systematically assesses models' capabilities in multi-step tool invocation, state tracking, and adaptive goal evolution across extended dialogues. Experimental results reveal that while large language models exhibit improved performance on single-step analysis, they remain significantly limited in long-horizon interaction and dynamic intent understanding. ConDABench establishes a reproducible, scalable evaluation paradigm and benchmark infrastructure to advance the development of truly collaborative, human-in-the-loop analytical models.
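The interactive evaluation engine described above can be pictured as a loop between the model under test and a simulated user who reveals the analysis goal one clarification at a time. The sketch below is purely illustrative; the class names, the `DONE` signal, and the dialogue protocol are assumptions, not ConDABench's actual API.

```python
# Illustrative sketch of an interactive evaluation loop in the spirit of
# an interactive harness for conversational data analysis. All names and
# the dialogue protocol are assumptions, not the benchmark's real API.
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    """Reveals a hidden analysis goal one clarification at a time."""
    clarifications: list  # ordered hints toward the true intent
    turn: int = 0

    def respond(self, model_question: str) -> str:
        # Return the next clarification, or signal that the intent
        # has been fully specified.
        if self.turn < len(self.clarifications):
            hint = self.clarifications[self.turn]
            self.turn += 1
            return hint
        return "DONE"

def run_episode(model, user: SimulatedUser, max_turns: int = 10) -> list:
    """Drive a multi-turn dialogue and record the transcript for scoring."""
    transcript = []
    for _ in range(max_turns):
        question = model(transcript)   # model asks a question or acts
        answer = user.respond(question)
        transcript.append((question, answer))
        if answer == "DONE":
            break
    return transcript

# Toy model that always asks for more detail.
def toy_model(transcript):
    return "Could you clarify the analysis goal?"

user = SimulatedUser(clarifications=["Focus on 2023 sales", "Group by region"])
episode = run_episode(toy_model, user)
print(len(episode))  # 3 turns: two clarifications, then DONE
```

A real harness would additionally score the transcript (e.g., did the model's final analysis match the hidden goal?) and track tool calls and intermediate state across turns; this stub only shows the turn-taking skeleton.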

📝 Abstract
Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluates language models on conversational data analysis tasks
Addresses under-specified goals and unclean real-world data
Measures progress toward collaborative models for interactive tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent workflow generates realistic data analysis benchmarks
Framework evaluates conversational tools on generated benchmark problems
Evaluation harness enables systematic testing of interactive data analysis
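The multi-agent generation workflow in the bullets above can be sketched as a pipeline of specialized agents: one authors a fully specified task from an article's insight, one decomposes it into under-specified conversational turns, and one validates the result. Every agent role, function name, and data shape below is a hypothetical stand-in for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-agent benchmark-generation pipeline in
# the spirit described above; agent roles and data shapes are assumptions.

def author_agent(article: str) -> dict:
    """Turn an article's insight into a fully specified analysis task (stub)."""
    return {"goal": f"Reproduce insight from: {article}",
            "dataset": "public_data.csv"}

def decomposer_agent(task: dict) -> list:
    """Split the task into under-specified turns needing clarification (stub)."""
    return [{"turn": i, "prompt": f"step {i} of {task['goal']}"}
            for i in range(3)]

def validator_agent(dialogue: list) -> bool:
    """Check that the generated dialogue is non-empty and well-formed (stub)."""
    return len(dialogue) > 0 and all("prompt" in t for t in dialogue)

def generate_benchmark(articles: list) -> list:
    """Chain the agents; keep only dialogues that pass validation."""
    problems = []
    for article in articles:
        task = author_agent(article)
        dialogue = decomposer_agent(task)
        if validator_agent(dialogue):
            problems.append({"task": task, "dialogue": dialogue})
    return problems

bench = generate_benchmark(["Insight A", "Insight B"])
print(len(bench))  # 2 generated ConDA-style problems
```

In a real pipeline each agent would be an LLM call with its own prompt, and the validator would likely execute candidate analyses against the dataset rather than check structure alone.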