Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of traceability in existing domain-specific fine-tuning approaches, which often leads to blind and inefficient data augmentation. The authors propose a “programming with data” paradigm that treats structured knowledge representations as a unified foundation for both training and evaluation, drawing an analogy to software development: training data serve as source code, model training as compilation, evaluation as unit testing, and data refinement as debugging. This framework enables precise, concept- and reasoning-chain–oriented model repair through structured knowledge extraction, test-driven data engineering, concept-level gap analysis, and diagnosis of broken reasoning chains. Validated across 16 disciplines, the approach significantly enhances model performance without compromising general capabilities, and the authors release an open-source knowledge base, evaluation suite, and training corpora to support reproducibility and further research.
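The concept-level gap analysis mentioned above can be pictured with a minimal sketch. Everything here is an assumption made for illustration, not the authors' released code: the `TestItem` schema, the `concept_gaps` helper, the 0.5 threshold, and the thermodynamics example concepts are all hypothetical. The idea is only that if each benchmark question is tagged with the concepts it exercises, failures can be aggregated per concept rather than per question.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TestItem:
    """One benchmark question tagged with the concepts it exercises (hypothetical schema)."""
    question_id: str
    concepts: tuple[str, ...]


def concept_gaps(items, passed, threshold=0.5):
    """Return (concept, pass_rate) pairs with pass_rate below `threshold`, worst first."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for item in items:
        for concept in item.concepts:
            totals[concept] += 1
            if passed[item.question_id]:
                passes[concept] += 1
    rates = {concept: passes[concept] / totals[concept] for concept in totals}
    weak = [(concept, rate) for concept, rate in rates.items() if rate < threshold]
    return sorted(weak, key=lambda pair: pair[1])


# Illustrative data only: four questions and the model's pass/fail result on each.
items = [
    TestItem("q1", ("entropy", "second_law")),
    TestItem("q2", ("entropy",)),
    TestItem("q3", ("second_law", "heat_engine")),
    TestItem("q4", ("heat_engine",)),
]
passed = {"q1": False, "q2": False, "q3": True, "q4": True}

gaps = concept_gaps(items, passed)
print(gaps)
```

In this toy run only `entropy` falls below the threshold, so a targeted data patch would add training material for that concept instead of enlarging the whole corpus.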

📝 Abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.
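As a rough illustration of how a reasoning-chain break might be traced back to a data deficiency, the sketch below walks an ordered chain of inference steps and reports the first one the model fails. The `Step` type, the `first_break` and `patch_request` helpers, and the `step_ok` probe are hypothetical names introduced here; the abstract does not specify how the authors represent chains or probe individual steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One inference step in a reasoning chain (hypothetical representation)."""
    premise: str
    conclusion: str


def first_break(chain: Sequence[Step], step_ok: Callable[[Step], bool]) -> Optional[int]:
    """Return the index of the first step the model fails, or None if the chain holds."""
    for index, step in enumerate(chain):
        if not step_ok(step):
            return index
    return None


def patch_request(chain: Sequence[Step], index: int) -> dict:
    """Describe the targeted data patch needed for the broken step."""
    step = chain[index]
    return {"kind": "reasoning_step", "premise": step.premise, "conclusion": step.conclusion}


# Illustrative data only: the probe is a stand-in for querying a trained model.
chain = [Step("A", "B"), Step("B", "C"), Step("C", "D")]
known_steps = {("A", "B")}


def step_ok(step: Step) -> bool:
    return (step.premise, step.conclusion) in known_steps


broken_at = first_break(chain, step_ok)
patch = patch_request(chain, broken_at) if broken_at is not None else None
print(broken_at, patch)
```

Under these assumptions the loop mirrors debugging: the failing test points to one step, and the patch names exactly the premise and conclusion that the training data should cover before the next training run.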

Problem

Research questions and friction points this paper is trying to address.

data engineering

large language models

knowledge transfer

training data diagnosis

domain specialization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Programming with Data

Test-Driven Data Engineering

Structured Knowledge Representation