🤖 AI Summary
Existing evaluation benchmarks for data insight discovery (e.g., InsightBench) suffer from inconsistent formatting, misaligned objective design, and redundant insights, severely undermining reliable assessment of LLMs and multi-agent systems in insight generation tasks.
Method: We propose three quality criteria—accuracy, novelty, and actionability—and construct InsightEval, a high-quality benchmark dataset rigorously curated and iteratively validated by domain experts. We design a fine-grained metric suite quantifying exploration breadth, depth, and efficiency, enabling structured, quantitative evaluation of insight generation capabilities. A standardized pipeline for data cleaning, annotation, and evaluation is also established.
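To make the metric suite concrete, below is a minimal, hypothetical sketch of how breadth, depth, and efficiency scores could be computed over a set of discovered insights. The `Insight` structure, the reference-topic set, and the scoring formulas here are illustrative assumptions, not the paper's actual definitions.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    topic: str        # analysis dimension the insight covers (assumed annotation)
    depth_level: int  # 1 = surface observation; higher = deeper follow-up

def exploration_metrics(insights: list[Insight],
                        reference_topics: set[str],
                        n_queries: int) -> dict[str, float]:
    """Toy breadth/depth/efficiency scores for a set of discovered insights."""
    # Breadth: fraction of reference analysis topics the agent touched.
    covered = {i.topic for i in insights} & reference_topics
    breadth = len(covered) / len(reference_topics) if reference_topics else 0.0
    # Depth: deepest follow-up level reached across all insights.
    depth = float(max((i.depth_level for i in insights), default=0))
    # Efficiency: insights yielded per analysis query spent.
    efficiency = len(insights) / n_queries if n_queries else 0.0
    return {"breadth": breadth, "depth": depth, "efficiency": efficiency}

# Example: an agent surfaces three insights covering two of four
# reference topics within a budget of 10 analysis queries.
ref = {"seasonality", "outliers", "segment_gap", "trend_break"}
found = [Insight("seasonality", 1), Insight("seasonality", 3), Insight("outliers", 2)]
print(exploration_metrics(found, ref, n_queries=10))
# -> {'breadth': 0.5, 'depth': 3.0, 'efficiency': 0.3}
```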
Contribution/Results: Extensive experiments demonstrate InsightEval’s strong discriminative power across state-of-the-art models and multi-agent systems. The benchmark exposes fundamental bottlenecks in automated insight discovery and provides the first evaluation infrastructure that balances methodological rigor with practical utility for future research.
📝 Abstract
Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis that realizes their full value. With the advent of large language models (LLMs) and multi-agent systems, a growing number of researchers are applying these technologies to insight discovery. However, few benchmarks exist for evaluating insight discovery capabilities, and even InsightBench, one of the most comprehensive existing frameworks, suffers from critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues can significantly degrade data quality and distort the evaluation of agents. To address them, we thoroughly investigate the shortcomings of InsightBench and propose essential criteria for a high-quality insight benchmark. Guided by these criteria, we develop a data-curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and distill key findings to guide future research in this promising direction.
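As a rough illustration of what such a data-curation pipeline might involve, here is a hypothetical sketch; the stage names, record schema, and filtering thresholds are assumptions made for illustration, not the authors' implementation. Each automated stage targets one of the flaws identified in InsightBench, with expert review assumed to follow.

```python
def normalize_format(record: dict) -> dict:
    """Stage 1 (format inconsistencies): enforce one schema with unified field names."""
    return {
        "objective": str(record.get("objective", "")).strip(),
        "insight": str(record.get("insight", "")).strip(),
    }

def is_well_posed(record: dict) -> bool:
    """Stage 2 (poorly conceived objectives): drop empty or trivially short objectives."""
    return len(record["objective"]) > 10 and len(record["insight"]) > 0

def deduplicate(records: list[dict]) -> list[dict]:
    """Stage 3 (redundant insights): keep the first occurrence of each normalized insight."""
    seen: set[str] = set()
    unique = []
    for r in records:
        key = r["insight"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def curate(raw: list[dict]) -> list[dict]:
    """Full pass: normalize -> filter ill-posed objectives -> deduplicate.
    Iterative validation by domain experts would follow this automated step."""
    cleaned = [normalize_format(r) for r in raw]
    return deduplicate([r for r in cleaned if is_well_posed(r)])
```

In practice, the deduplication stage would more plausibly rely on semantic similarity (e.g., embedding distance) than on exact string matching; the exact-match version above is kept only so the sketch stays self-contained.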