Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-source large language models (LLMs) underperform on data analysis tasks that demand intensive reasoning. This work systematically identifies their bottlenecks through evaluation across three dimensions: data understanding, code generation, and strategic planning, revealing strategic planning quality as the primary performance limiter. Crucially, the authors find that *data quality outweighs diversity* in training data design. Building on this, they construct a seed dataset grounded in real-world scenarios, explicitly balancing interactivity and analytical complexity, and propose a data synthesis methodology tailored to strengthening analytical reasoning. Experimental results show substantial gains in open-source LLMs' performance on complex, multi-step reasoning tasks, empirically validating the decisive role of high-quality, structurally rich training data in advancing data analysis proficiency.

📝 Abstract
Large Language Models (LLMs) hold promise for automating data analysis tasks, yet open-source models face significant limitations in such reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) strategic planning quality is the primary determinant of model performance; (2) interaction design and task complexity significantly influence reasoning capabilities; (3) data quality has a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.
Problem

Research questions and friction points this paper is trying to address.

Enhancing open-source LLMs' data analysis capabilities
Diagnosing bottlenecks across data understanding, code generation, and strategic planning
Improving analytical reasoning through training data design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curating a diverse, realistic seed dataset
Evaluating models along three key dimensions
Developing a data synthesis methodology that improves analytical reasoning