🤖 AI Summary
This work addresses the low accuracy of large language models (LLMs) in predicting code edits during the maintenance of machine learning (ML) Jupyter notebooks. We introduce the first large-scale, real-world, fine-grained edit dataset, comprising 48,398 cell- and line-level edits automatically extracted and manually annotated from GitHub revision histories. We establish the first benchmark for ML notebook maintenance and observe two key empirical patterns: edits are highly localized (averaging 166 lines) and strongly context-sensitive. To model such edits, we propose a joint cell- and line-level edit prediction framework and systematically evaluate LLMs across multiple scales, covering both zero-shot prediction and supervised fine-tuning. Larger models consistently outperform smaller ones, yet overall accuracy remains low, highlighting the intrinsic complexity of real-world ML maintenance tasks. Crucially, explicit contextual modeling emerges as the most effective route to better performance.
📝 Abstract
Machine learning developers frequently use interactive computational notebooks, such as Jupyter notebooks, to host code for data processing and model training. Jupyter notebooks are a convenient tool for writing machine learning pipelines and interactively observing outputs; however, maintaining Jupyter notebooks, e.g., to add new features or fix bugs, can be challenging due to their length and complexity. Moreover, there is no existing benchmark related to developer edits on Jupyter notebooks. To address this, we present the first dataset of 48,398 Jupyter notebook edits derived from 20,095 revisions of 792 machine learning repositories on GitHub, and perform the first study of using LLMs to predict code edits in Jupyter notebooks. Our dataset captures granular details of cell-level and line-level modifications, offering a foundation for understanding real-world maintenance patterns in machine learning workflows. We observe that edits on Jupyter notebooks are highly localized, with changes averaging only 166 lines of code per repository. While larger models outperform smaller counterparts in code editing, all models achieve low accuracy on our dataset even after fine-tuning, demonstrating the complexity of real-world machine learning maintenance tasks. Our findings emphasize the critical role of contextual information in improving model performance and point toward promising avenues for advancing large language models' capabilities in engineering machine learning code.
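To make the notion of cell-level and line-level edits concrete, the sketch below shows one possible way to extract such edits from two revisions of a notebook. This is an illustrative assumption, not the paper's actual extraction pipeline: the helpers `cell_sources` and `notebook_edits` are hypothetical names, and the sketch uses only the standard library (`json` to parse the notebook's JSON format and `difflib` to align cells and lines).

```python
import difflib
import json

def cell_sources(nb_json):
    """Return the source text of each code cell in a notebook JSON string."""
    nb = json.loads(nb_json)
    return ["".join(c["source"]) for c in nb["cells"] if c["cell_type"] == "code"]

def notebook_edits(old_nb_json, new_nb_json):
    """Align cells across two notebook revisions and collect edits.

    Cell-level edits are classified as insert/delete/modify; for modified
    cells, a line-level unified diff is recorded as well.
    """
    old_cells = cell_sources(old_nb_json)
    new_cells = cell_sources(new_nb_json)
    edits = []
    matcher = difflib.SequenceMatcher(a=old_cells, b=new_cells)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Same number of cells on both sides: treat as modified cells
            # and drill down to line-level changes within each pair.
            for old, new in zip(old_cells[i1:i2], new_cells[j1:j2]):
                line_diff = list(difflib.unified_diff(
                    old.splitlines(), new.splitlines(), lineterm=""))
                edits.append(("modify", line_diff))
        else:
            # Unbalanced replace, pure insertion, or pure deletion.
            for _ in range(i1, i2):
                edits.append(("delete", None))
            for _ in range(j1, j2):
                edits.append(("insert", None))
    return edits
```

For example, changing `print(x)` to `print(x + 1)` in one cell while leaving a second cell untouched yields a single `modify` edit whose line diff contains `-print(x)` and `+print(x + 1)`; real extraction from Git histories would additionally need to handle markdown cells, cell reordering, and notebook metadata.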