The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current software engineering agents perform well on short-horizon tasks but fall short on real-world senior engineering work, which is long-horizon, involves multiple engineers and roles, and starts from ambiguous specifications. This work proposes a “triadic data” training paradigm that captures, in synchronized form, three things: the human–human conversations where engineering context is formed, the human–agent sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. The paper proposes two complementary data products, long-horizon expert trajectories captured under stimulated-recall protocols and simulated cross-functional companies staffed by senior engineers, product managers, designers, and data scientists, and introduces a four-tier evidence framework for data quality: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. The authors argue that such data can be captured within 12–18 months using methods already mature in adjacent fields, and that it is the empirical key to four open questions in agent training.

📝 Abstract

Frontier software engineering agents have saturated short-horizon benchmarks while regressing on the work that constitutes senior engineering: long-horizon, multi-engineer, ambiguous-specification deliverables. This paper takes a position on what training data is needed to close the gap. The substrate for the next generation of SWE agents is not larger GitHub scrapes, nor more solo-agent trajectories, nor (on their own) open human-AI dialogue logs. It is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both. We argue that the canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure. We further specify a four-tier evidence framework through which any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation. We argue that this data is capturable in 12-18 months with methods already mature in adjacent fields, that it is the empirical key to four open questions in agent training, and that the field's near-term research agenda should include it explicitly.
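
To make the idea of synchronized capture concrete, the following is a minimal illustrative sketch, not the paper's schema. It assumes that each captured event carries a timestamp on a shared clock and a label for one of the three streams named in the abstract. The names `Event`, `TriadicRecord`, `STREAMS`, and `mechanical_issues` are hypothetical, and the checks shown are only a guess at what the first evidence tier (mechanical verification) might include.

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical labels for the three components named in the abstract.
STREAMS = ("human_human", "human_ai", "team_work")


@dataclass(frozen=True)
class Event:
    stream: str          # expected to be one of STREAMS
    timestamp: datetime  # capture time on a shared clock (assumption)
    actor: str           # e.g. "engineer_1", "pm_1", "agent"
    content: str


@dataclass
class TriadicRecord:
    project_id: str
    events: list[Event] = field(default_factory=list)

    def timeline(self) -> list[Event]:
        """Merge all streams into one chronologically ordered view."""
        return sorted(self.events, key=lambda e: e.timestamp)

    def mechanical_issues(self) -> list[str]:
        """Cheap structural checks: unknown labels and missing streams."""
        issues = []
        for e in self.events:
            if e.stream not in STREAMS:
                issues.append(f"unknown stream: {e.stream}")
        present = {e.stream for e in self.events}
        for s in STREAMS:
            if s not in present:
                issues.append(f"missing stream: {s}")
        return issues


# Usage: a record with only two of the three streams fails the check.
record = TriadicRecord("demo", [
    Event("human_ai", datetime(2026, 1, 5, 10, 30), "agent", "Proposed schema change"),
    Event("human_human", datetime(2026, 1, 5, 9, 0), "engineer_1", "Discussed migration risk"),
])
print([e.stream for e in record.timeline()])  # ['human_human', 'human_ai']
print(record.mechanical_issues())             # ['missing stream: team_work']
```

Keeping all three streams in one time-ordered list is one possible design choice. It makes it easy to see which human-human discussion preceded a given human-AI session, which is the kind of context linkage the abstract describes.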

Problem

Research questions and friction points this paper is trying to address.

long-horizon software engineering

triadic data

multi-engineer collaboration

ambiguous specifications

software engineering agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

triadic data

long-horizon software engineering

expert trajectories