Large Language Models in the Data Science Lifecycle: A Systematic Mapping Study

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Despite growing adoption of large language models (LLMs) in data science, there remains a lack of systematic understanding of their application across the full data science lifecycle—spanning data acquisition, cleaning, modeling, interpretation, and deployment—as well as standardized evaluation practices. Method: We conduct a systematic mapping study, analyzing 2018–2024 publications from Scopus and IEEE Xplore to construct the first comprehensive knowledge graph of LLM applications covering all lifecycle stages. Contribution/Results: We identify task-specific suitability of prominent models (e.g., CodeLlama, StarCoder, fine-tuned LLaMA) and propose a three-dimensional evaluation framework grounded in functional metrics (e.g., code correctness, SQL accuracy), reliability metrics (e.g., robustness, reproducibility), and human-centric metrics (e.g., user adoption). The study uncovers critical gaps—including insufficient cross-stage integration, limited real-world validation, and fragmented evaluation criteria—thereby establishing a structured benchmark and actionable roadmap for advancing both theoretical foundations and engineering deployment of LLMs in data science.
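The proposed three-dimensional evaluation framework (functional, reliability, and human-centric metrics) can be sketched as a simple data structure. This is a hypothetical illustration, not code from the paper: the dimension names follow the summary, while the specific metric names, scores, and the mean-based aggregation are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationDimension:
    """One axis of the framework; maps metric names to scores in [0, 1].
    Metric names and the [0, 1] scale are assumptions for illustration."""
    name: str
    metrics: dict[str, float] = field(default_factory=dict)

    def mean_score(self) -> float:
        # Unweighted mean as a placeholder aggregation choice.
        return sum(self.metrics.values()) / len(self.metrics) if self.metrics else 0.0

@dataclass
class LLMEvaluation:
    """Bundles the three dimensions for a single model under evaluation."""
    model: str
    functional: EvaluationDimension
    reliability: EvaluationDimension
    human_centric: EvaluationDimension

    def report(self) -> dict[str, float]:
        return {
            d.name: round(d.mean_score(), 3)
            for d in (self.functional, self.reliability, self.human_centric)
        }

# Example: scoring a model on a few metrics drawn from the summary's examples.
evaluation = LLMEvaluation(
    model="CodeLlama",
    functional=EvaluationDimension("functional", {"code_correctness": 0.82, "sql_accuracy": 0.74}),
    reliability=EvaluationDimension("reliability", {"robustness": 0.61, "reproducibility": 0.70}),
    human_centric=EvaluationDimension("human-centric", {"user_adoption": 0.55}),
)
print(evaluation.report())
# → {'functional': 0.78, 'reliability': 0.655, 'human-centric': 0.55}
```

Keeping each dimension as its own object makes it easy to report per-dimension scores separately, which matches the study's point that functional, reliability, and human-centric criteria should not be collapsed into a single number.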

📝 Abstract
In recent years, Large Language Models (LLMs) have emerged as transformative tools across numerous domains, impacting how professionals approach complex analytical tasks. This systematic mapping study comprehensively examines the application of LLMs throughout the Data Science lifecycle. By analyzing relevant papers from Scopus and IEEE databases, we identify and categorize the types of LLMs being applied, the specific stages and tasks of the data science process they address, and the methodological approaches used for their evaluation. Our analysis includes a detailed examination of evaluation metrics employed across studies and systematically documents both positive contributions and limitations of LLMs when applied to data science workflows. This mapping provides researchers and practitioners with a structured understanding of the current landscape, highlighting trends, gaps, and opportunities for future research in this rapidly evolving intersection of LLMs and data science.
Problem

Research questions and friction points this paper is trying to address.

Systematically mapping LLM applications across data science lifecycle stages
Identifying LLM types, tasks addressed, and evaluation methodologies used
Analyzing LLM contributions, limitations, and research gaps in data science
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applying a systematic mapping study methodology to LLM research
Analyzing LLM applications across the full data science lifecycle
Systematically documenting evaluation metrics and limitations