🤖 AI Summary
Existing pretraining benchmarks emphasize static capabilities (e.g., commonsense reasoning, mathematical proficiency) and thus inadequately assess a model's agent potential; conversely, agent-specific benchmarks typically require post-training adaptation and multi-turn interaction, making them unsuitable for evaluating foundation models during pretraining. Method: We propose APTBench, the first pretraining-stage benchmark for assessing agent potential. It uses task trajectory translation to convert successful execution paths from software engineering and deep research tasks into static multiple-choice and text-completion questions, enabling efficient, non-interactive evaluation of core agent capabilities such as planning and action selection. Contribution/Results: APTBench is lightweight and cost-effective, yet empirically shows substantially stronger predictive power for downstream agent task performance than general-purpose benchmarks, at far lower cost than end-to-end agent evaluation.
📝 Abstract
With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model's agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models and require multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potential during pre-training and guide model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text-completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios: software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
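To make the trajectory-translation idea concrete, here is a minimal sketch of how a successful trajectory could be turned into a static multiple-choice question: given the task and the actions taken so far, the model must pick the action the agent actually took next. The function name `trajectory_to_mcq`, the trajectory format, and the distractor-sampling heuristic are all hypothetical illustrations, not APTBench's actual pipeline.

```python
import random

def trajectory_to_mcq(trajectory, step_idx, num_distractors=3, seed=0):
    """Turn one step of a successful agent trajectory into a
    multiple-choice question: given the task and the actions so far,
    which action did the agent take next?

    `trajectory` is a dict with a task description and an ordered list
    of action strings (a simplified stand-in for real agent logs).
    """
    actions = trajectory["actions"]
    correct = actions[step_idx]
    # Distractors: other actions from the same trajectory that were NOT
    # taken at this step (a simple heuristic; a real benchmark would
    # likely sample distractors more carefully).
    pool = [a for i, a in enumerate(actions) if i != step_idx]
    rng = random.Random(seed)
    distractors = rng.sample(pool, min(num_distractors, len(pool)))
    options = distractors + [correct]
    rng.shuffle(options)
    prompt = (
        f"Task: {trajectory['task']}\n"
        f"Actions so far: {actions[:step_idx]}\n"
        "Which action should be taken next?\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    )
    answer = chr(65 + options.index(correct))
    return prompt, answer

# Hypothetical usage on a toy software-engineering trajectory:
traj = {
    "task": "Fix the failing unit test in the repository",
    "actions": ["run tests", "read traceback", "edit source file", "re-run tests"],
}
question, gold = trajectory_to_mcq(traj, step_idx=2)
```

Because the question and gold answer are fixed strings, a base model can be scored by log-likelihood or single-token choice, with no multi-turn interaction required.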