daVinci-Dev: Agent-native Mid-training for Software Engineering

📅 2026-01-26
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work addresses the distributional mismatch between static training data and the dynamic nature of software development environments by proposing a mid-training methodology tailored to software engineering agents. The core innovation is the concept of "agent-native data," which combines contextually-native trajectories, capturing the complete information flow an agent experiences, with environmentally-native trajectories collected from executable repositories, where observations come from actual tool invocations and test feedback. This approach improves interaction authenticity while preserving diversity, substantially reducing reliance on reinforcement learning. Evaluated on SWE-Bench Verified, models trained with only 73.1B mid-training tokens achieve resolution rates of 56.1% (32B model) and 58.5% (72B model), outperforming Kimi-Dev in both performance and training efficiency.
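To make the two trajectory types concrete, here is a minimal sketch of what an agent-native trajectory record might look like. The paper does not publish its data schema, so every field and class name below is a hypothetical illustration, not the authors' format:

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical schema for an "agent-native" trajectory record.
# Field names (Step, Trajectory, kind, repo) are illustrative only.

@dataclass
class Step:
    action: str        # e.g. a tool invocation such as a file edit or test run
    observation: str   # what the agent sees back (tool output, test feedback)

@dataclass
class Trajectory:
    # "contextual" = contextually-native (preserves full information flow);
    # "environmental" = environmentally-native (from executable repositories)
    kind: Literal["contextual", "environmental"]
    repo: str                                     # source repository identifier
    steps: List[Step] = field(default_factory=list)

# A toy environmentally-native trajectory: observations come from executions.
traj = Trajectory(kind="environmental", repo="example/project")
traj.steps.append(Step(action="run_tests()", observation="1 failed: test_parse"))
traj.steps.append(Step(action="edit parser.py", observation="file updated"))
traj.steps.append(Step(action="run_tests()", observation="all tests passed"))
```

The key distinction the summary draws is that in the environmentally-native case each `observation` is the literal output of a real tool or test run, not a synthesized string.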

📝 Abstract
Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm where models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, **agentic mid-training**, i.e., mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored due to substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and training methodology for effective agent development at scale. Central to our approach is **agent-native data**: supervision comprising two complementary types of trajectories: **contextually-native trajectories** that preserve the complete information flow an agent experiences, offering broad coverage and diversity; and **environmentally-native trajectories** collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model's agentic capabilities on `SWE-Bench Verified`. We demonstrate our superiority over the previous open software engineering mid-training recipe `Kimi-Dev` under two post-training settings with an aligned base model and agentic scaffold, while using less than half the mid-training tokens (73.1B). Besides relative advantage, our best performing 32B and 72B models achieve **56.1%** and **58.5%** resolution rates, respectively, which are ...
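The abstract's claim that environmentally-native observations "stem from actual tool invocations and test executions" can be sketched as a rollout loop against an executable repository. The stub environment and policy below are assumptions for illustration; a real pipeline would shell out to a sandboxed checkout and a genuine agent:

```python
# Minimal sketch of collecting an environmentally-native trajectory:
# the agent's observations are produced by actually running tools and
# tests, not synthesized. StubRepo stands in for a sandboxed repository.

def collect_trajectory(env, policy, max_steps=10):
    """Roll out a policy against an executable environment, logging
    (action, observation) pairs until the test suite passes."""
    trajectory = []
    for _ in range(max_steps):
        action = policy(trajectory)
        observation = env(action)          # real tool/test execution
        trajectory.append((action, observation))
        if observation == "tests passed":
            break
    return trajectory

class StubRepo:
    """Toy environment: tests fail until a patch has been applied."""
    def __init__(self):
        self.patched = False
    def __call__(self, action):
        if action == "apply_patch":
            self.patched = True
            return "patch applied"
        if action == "run_tests":
            return "tests passed" if self.patched else "1 test failed"
        return "unknown action"

def simple_policy(trajectory):
    # Run tests first; on failure, patch, then re-run to confirm.
    if not trajectory:
        return "run_tests"
    if trajectory[-1][1] == "1 test failed":
        return "apply_patch"
    return "run_tests"

traj = collect_trajectory(StubRepo(), simple_policy)
# traj: [("run_tests", "1 test failed"), ("apply_patch", "patch applied"),
#        ("run_tests", "tests passed")]
```

The design point is that the observation at each step is whatever the environment actually returned, which is what gives these trajectories their "interaction authenticity" relative to purely synthetic data.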
Problem

Research questions and friction points this paper is trying to address.

agentic mid-training
distribution mismatch
agent-native data
software engineering
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic mid-training
agent-native data
contextually-native trajectories
environmentally-native trajectories
software engineering agents