REVERE: Reflective Evolving Research Engineer for Scientific Workflows

📅 2026-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing prompt optimization methods rely on local signals and struggle to generalize, particularly in heterogeneous and feedback-sparse environments such as scientific coding. This work proposes a reflective prompt optimization framework that identifies global failure patterns by analyzing cross-task execution trajectories, distills them into reusable heuristic rules, and performs targeted multi-field edits to system prompts, task templates, and cheatsheets. By continuously learning from global context, the framework integrates memory effectively and mitigates catastrophic forgetting, substantially enhancing agent generalization on scientific coding tasks. Empirical evaluations demonstrate consistent improvements over state-of-the-art expert-crafted instructions, achieving gains of 4.50%, 3.51%, and 4.89% on SUPER, ResearchCodeBench, and ScienceAgentBench, respectively.

📝 Abstract
Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured merges, resulting in knowledge loss. These limitations are magnified in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, where reproducing results from public codebases is an established evaluation regime. We introduce Reflective Evolving Research Engineer (REVERE), a framework that continuously learns from Global Training Context, recognizes recurring failure modes in cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits across three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. REVERE, via this reflective optimization framework, improves performance over prior state-of-the-art expert-crafted instructions on research coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench across their respective metrics. These results demonstrate that agents equipped with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.
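The abstract describes a three-field prompt configuration (system prompt, task-prompt template, cumulative cheatsheet) updated via targeted edits rather than full rewrites. The paper's actual algorithm is not shown on this page, so the following is only a minimal sketch of that idea under assumed data structures: the trajectory dicts, failure tags, and all function names here are hypothetical illustrations, not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PromptConfig:
    """The three configurable fields named in the abstract."""
    system_prompt: str
    task_template: str                      # e.g. "Task: {task}"
    cheatsheet: list[str] = field(default_factory=list)

    def render(self, **task_fields) -> str:
        # Assemble the full prompt; the cheatsheet accumulates heuristics.
        parts = [self.system_prompt, self.task_template.format(**task_fields)]
        if self.cheatsheet:
            parts.append("Heuristics:\n" + "\n".join(f"- {h}" for h in self.cheatsheet))
        return "\n\n".join(parts)


def distill_heuristics(trajectories: list[dict]) -> list[str]:
    """Hypothetical reflection step: count failure tags across execution
    trajectories and promote only the recurring (cross-task) ones to rules."""
    tags = Counter(
        tag
        for t in trajectories
        if not t["success"]
        for tag in t["failure_tags"]
    )
    # A tag seen in more than one trajectory is treated as a "global" pattern.
    return [
        f"On '{tag}' failures, verify the environment setup before editing code."
        for tag, count in tags.items()
        if count > 1
    ]


def reflective_update(config: PromptConfig, trajectories: list[dict]) -> PromptConfig:
    # Targeted edit: append distilled rules to the cheatsheet field only,
    # leaving the other two fields intact (no full-prompt rewrite).
    for rule in distill_heuristics(trajectories):
        if rule not in config.cheatsheet:
            config.cheatsheet.append(rule)
    return config
```

Editing only one field per update is one plausible way to realize the abstract's claim of avoiding knowledge loss from full-prompt rewrites: earlier heuristics persist verbatim instead of being merged or regenerated.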
Problem

Research questions and friction points this paper is trying to address.

prompt optimization
research coding workflows
generalization
knowledge loss
cross-repository execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

reflective optimization
global training context
reusable heuristics
continual learning
scientific workflows