🤖 AI Summary
Open-source large language models (LLMs) underperform on data analysis tasks that require intensive reasoning. This work systematically identifies their bottlenecks by evaluating models across three dimensions: data understanding, code generation, and strategic planning, revealing strategic planning quality as the primary performance limiter. Crucially, it presents the insight that *data quality outweighs diversity* in training data design. To address the planning bottleneck, the authors construct a seed dataset grounded in real-world scenarios that explicitly balances interactivity and analytical complexity, and they propose a data synthesis methodology tailored to strengthening analytical reasoning. Experimental results show substantial gains for open-source LLMs on complex, multi-step reasoning tasks, empirically validating the decisive role of high-quality, structurally rich training data in advancing data analysis proficiency.
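The summary names three evaluation dimensions but no concrete scoring procedure. Below is a minimal, hypothetical Python sketch of how per-dimension scores might be aggregated to locate the weakest dimension; the `Transcript` structure, the `scores` field, and the example values are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from statistics import mean

# The three evaluation dimensions named in the summary.
DIMENSIONS = ("data_understanding", "code_generation", "strategic_planning")

@dataclass
class Transcript:
    """One model run on a data-analysis task (hypothetical structure)."""
    task_id: str
    scores: dict  # dimension -> score in [0, 1], e.g. from an LLM judge

def aggregate(transcripts: list[Transcript]) -> dict:
    """Average each dimension over all tasks to expose the bottleneck."""
    return {dim: mean(t.scores[dim] for t in transcripts) for dim in DIMENSIONS}

if __name__ == "__main__":
    runs = [
        Transcript("t1", {"data_understanding": 0.8,
                          "code_generation": 0.7,
                          "strategic_planning": 0.4}),
        Transcript("t2", {"data_understanding": 0.7,
                          "code_generation": 0.8,
                          "strategic_planning": 0.5}),
    ]
    report = aggregate(runs)
    # The dimension with the lowest mean is the bottleneck; the paper
    # reports this is strategic planning for open-source LLMs.
    print(min(report, key=report.get), report)
```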
📝 Abstract
Large Language Models (LLMs) hold promise for automating data analysis tasks, yet open-source models face significant limitations in such reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis yields three key findings: (1) strategic planning quality is the primary determinant of model performance; (2) interaction design and task complexity significantly influence reasoning capabilities; (3) data quality has a greater impact than diversity on final performance. We leverage these insights to develop a data synthesis methodology that delivers significant improvements in the analytical reasoning capabilities of open-source LLMs.
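To make the "quality over diversity" finding concrete, here is a hedged sketch of a quality-first synthesis loop: oversample candidates from real-world seed scenarios, then keep only those clearing a quality threshold rather than maximizing topic coverage. `generate_candidate` and `quality_score` are hypothetical stand-ins for the paper's synthesis and filtering steps, not its published method.

```python
import random

def generate_candidate(seed_scenario: str) -> dict:
    """Hypothetical stand-in: synthesize one multi-step analysis task
    from a real-world seed scenario (the paper's seed dataset)."""
    return {"scenario": seed_scenario,
            "steps": random.randint(1, 6)}  # proxy for analytical depth

def quality_score(sample: dict) -> float:
    """Hypothetical stand-in: rate planning depth in [0, 1]. In practice
    this could be an LLM judge or an execution-based check."""
    return min(sample["steps"] / 5, 1.0)

def synthesize(seeds: list[str], threshold: float = 0.8,
               budget: int = 100) -> list[dict]:
    """Prefer fewer, higher-quality samples over broad coverage:
    oversample candidates, keep only those above the threshold."""
    kept = []
    for _ in range(budget):
        cand = generate_candidate(random.choice(seeds))
        if quality_score(cand) >= threshold:
            kept.append(cand)
    return kept

if __name__ == "__main__":
    seeds = ["sales forecasting", "A/B test analysis", "log anomaly triage"]
    data = synthesize(seeds)
    print(f"kept {len(data)} of 100 candidates")
```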