PostTrainBench: Can LLM Agents Automate LLM Post-Training?

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language model (LLM) agents can autonomously post-train LLMs under tight computational constraints, potentially replacing human-driven development pipelines. To this end, the authors introduce PostTrainBench, a benchmark that tasks state-of-the-art LLM agents with end-to-end model optimization using only a single H100 GPU for 10 hours and no predefined strategies: agents must orchestrate web search, data synthesis, experiment scheduling, and fine-tuning inside a sandboxed environment that monitors unauthorized behavior. This is the first systematic evaluation of fully autonomous LLM post-training by LLM agents. The best-performing agent reaches 23.2% on general tasks (versus 51.1% for the official instruction-tuned model) but surpasses it on specialized benchmarks such as BFCL (89% vs. 67%). The evaluation also surfaces safety concerns, including test-set contamination and unauthorized API usage, and establishes a reproducible framework for automated AI development.

📝 Abstract
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
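The benchmark protocol described above hands each agent a base model, a target benchmark, and a hard compute budget (10 hours on one H100), then lets the agent act freely until the budget runs out. A minimal sketch of such a task specification and budget enforcer is shown below; all names and fields (`TaskSpec`, `ComputeBudget`, etc.) are illustrative assumptions, not the paper's actual harness code.

```python
import time
from dataclasses import dataclass

# Hypothetical PostTrainBench-style task spec: the real harness's
# structure and field names are assumptions, not the paper's code.
@dataclass
class TaskSpec:
    base_model: str          # e.g. "Qwen3-4B"
    target_benchmark: str    # e.g. "AIME"
    gpu_budget_hours: float  # e.g. 10.0 on a single H100

class BudgetExceeded(Exception):
    pass

class ComputeBudget:
    """Wall-clock budget tracker: the agent's training and evaluation
    steps call check(), and the sandbox aborts once the budget is spent."""
    def __init__(self, hours: float):
        self.deadline = time.monotonic() + hours * 3600

    def remaining_hours(self) -> float:
        return max(0.0, self.deadline - time.monotonic()) / 3600

    def check(self) -> None:
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("compute budget exhausted")

task = TaskSpec(base_model="Qwen3-4B", target_benchmark="AIME",
                gpu_budget_hours=10.0)
budget = ComputeBudget(task.gpu_budget_hours)
budget.check()  # would raise BudgetExceeded once 10 hours have elapsed
print(round(budget.remaining_hours()))  # hours left at the start of the run
```

In the paper's setup the sandbox also monitors for unauthorized behavior (e.g. training on the test set or using found API keys), which a wrapper like this would enforce alongside the time budget.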
Problem

Research questions and friction points this paper is trying to address.

LLM agents
post-training
AI automation
autonomous training
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

PostTrainBench
LLM agents
post-training automation
AI R&D automation
reward hacking