Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

πŸ“… 2026-01-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Small language models struggle to develop strong agentic capabilities because open training tasks are limited and real-world API environments are unstable. To address this, the paper proposes SYNTHAGENT, a framework that, for the first time, jointly synthesizes diverse tool-use tasks, simulates user-interaction environments, and constructs evaluation rubrics, establishing a scalable and stable reinforcement-learning training loop. A teacher model generates tasks and rewrites them into deliberately ambiguous instructions, prompting the agent to actively seek clarification. An LLM-based user simulator and a virtual tool system further provide consistent, reliable feedback. Evaluated across 14 datasets spanning mathematical reasoning, search, and tool invocation, the approach substantially improves small-model performance, in some cases surpassing larger baseline models.

πŸ“ Abstract
Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved, and real-world APIs lack diversity and are too unstable for large-scale reinforcement-learning rollouts. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions, compelling agents to actively query users for missing details. During synthetic-task rollouts, an LLM-based user simulator supplies user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed from required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
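The abstract's rubric-based reward can be pictured as a simple checklist score: a task-level rubric lists required subgoals, expected user-agent interactions (e.g. clarification questions), and forbidden behaviors, and the reward aggregates hits and violations over a rolled-out trajectory. The sketch below is a hypothetical illustration of that idea; all names, the string-matching check, and the weighting are assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    """Hypothetical task-level rubric, per the abstract's three components."""
    required_subgoals: list       # milestones the trajectory must contain
    required_interactions: list   # e.g. clarification questions to the user
    forbidden_behaviors: list     # behaviors that incur a penalty
    penalty: float = 1.0          # weight per forbidden-behavior violation

def rubric_reward(trajectory: str, rubric: Rubric) -> float:
    """Score an agent trajectory: fraction of rubric items satisfied,
    minus a fixed penalty for each forbidden behavior observed."""
    hits = sum(g in trajectory for g in rubric.required_subgoals)
    asks = sum(q in trajectory for q in rubric.required_interactions)
    total = len(rubric.required_subgoals) + len(rubric.required_interactions)
    score = (hits + asks) / total if total else 0.0
    violations = sum(b in trajectory for b in rubric.forbidden_behaviors)
    return score - rubric.penalty * violations
```

A trajectory that asks the required clarification question and reaches the subgoal would score 1.0, while one that skips the question and exhibits a forbidden behavior would go negative, giving the RL loop a dense, stable signal without live APIs.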
Problem

Research questions and friction points this paper is trying to address.

agentic language models
synthetic tasks
simulated environments
reinforcement learning
tool-use
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic tasks
simulated environments
rubric-based rewards
agentic language models
tool-use learning
πŸ”Ž Similar Papers
No similar papers found.
Yuan-Jay Lu
University of Science and Technology of China
Chengyu Wang
Alibaba Group
Natural Language Processing, Large Language Model, Multi-modal Learning
Lei Shen
Xi’an Jiaotong University
Jun Huang
Researcher
Tong Xu
Professor, University of Science and Technology of China
Data Mining