🤖 AI Summary
Existing data analysis benchmarks fail to capture core challenges in real-world scenarios: ambiguous analytical goals, noisy and heterogeneous data, and the necessity of iterative user interaction. To address this gap, we propose ConDABench, a benchmark framework for interactive, conversational data analysis (ConDA). ConDABench employs a multi-agent collaborative generation pipeline to automatically construct 1,420 dialogue tasks from real-world data sources, each requiring progressive intent clarification. It also introduces an interactive evaluation engine that systematically assesses models' capabilities in multi-step tool invocation, state tracking, and adaptation to evolving goals across extended dialogues. Experimental results reveal that while newer large language models solve more instances, they remain significantly limited in long-horizon interaction and dynamic intent understanding. ConDABench establishes a reproducible, scalable evaluation paradigm and benchmark infrastructure to advance the development of truly collaborative, human-in-the-loop analytical models.
📝 Abstract
Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and is hence essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models solves more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench offers model builders a way to measure progress towards truly collaborative models that can complete complex interactive tasks.
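To make the interactive setting concrete, below is a minimal sketch of what an evaluation loop for conversational data analysis might look like. This is a hypothetical illustration, not the actual ConDABench API: the names `ConDATask`, `SimulatedUser`, `evaluate`, and the agent's `step` method are all assumptions introduced here.

```python
# Illustrative sketch only: ConDATask, SimulatedUser, and evaluate are
# hypothetical names, not the real ConDABench API.
from dataclasses import dataclass


@dataclass
class ConDATask:
    """One conversational data analysis problem: an under-specified request
    plus the hidden clarifications a simulated user can reveal."""
    initial_request: str
    clarifications: dict[str, str]  # question keyword -> user's answer
    expected_answer: str


class SimulatedUser:
    """Stands in for the human analyst, answering clarifying questions."""

    def __init__(self, task: ConDATask):
        self.task = task

    def reply(self, question: str) -> str:
        for keyword, answer in self.task.clarifications.items():
            if keyword in question.lower():
                return answer
        return "Not sure; use your best judgment."


def evaluate(agent, task: ConDATask, max_turns: int = 10) -> bool:
    """Run a multi-turn dialogue. On each turn the agent either asks a
    clarifying question ("ask") or commits to a final result ("answer");
    success means producing the expected answer within the turn budget."""
    user = SimulatedUser(task)
    message = task.initial_request
    for _ in range(max_turns):
        action, content = agent.step(message)  # agent interface is assumed
        if action == "answer":
            return content.strip() == task.expected_answer
        message = user.reply(content)  # feed the clarification back in
    return False  # turn budget exhausted without a final answer
```

A full harness would additionally score intermediate tool calls and track dialogue state, but the ask/answer loop above captures the core difference between conversational evaluation and single-shot benchmarks: success depends on eliciting missing intent, not just on the final computation.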