ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

📅 2026-01-29
🤖 AI Summary
Training robust tool-augmented language agents is hindered by reliance on human intervention, simulated environments that cannot be verified, training paradigms limited to either supervised fine-tuning or reinforcement learning alone, and instability in long-horizon, multi-turn learning. This work proposes ASTRA, a unified training framework that integrates structured tool-call trajectory synthesis with automatic generation of code-executable, rule-verifiable environments. ASTRA combines supervised fine-tuning with online reinforcement learning, using a trajectory-level reward that balances task completion against interaction efficiency. Key technical components include tool-call graph topology analysis, question-answer trace decomposition, code-based environment generation, and deterministic multi-turn reinforcement learning. Experiments show that ASTRA-trained models achieve state-of-the-art performance among models of comparable scale across multiple agentic tool-use benchmarks, approaching closed-source systems while preserving core reasoning ability.

📝 Abstract
Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.
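The abstract mentions trajectory-level rewards that balance task completion and interaction efficiency but does not give a formula. The sketch below is one illustrative way such a reward could be shaped, not the authors' definition: the `Trajectory` fields, the `trajectory_reward` function, and the `efficiency_weight` parameter are all hypothetical names introduced here. It assumes a rule-based verifier has already decided whether the episode succeeded.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Summary of one multi-turn episode (hypothetical structure)."""
    solved: bool    # outcome of a rule-based check on the final state
    turns: int      # number of agent turns actually used
    max_turns: int  # turn budget allowed by the environment


def trajectory_reward(traj: Trajectory, efficiency_weight: float = 0.2) -> float:
    """Single scalar reward for a whole trajectory.

    Assumption (not from the paper): a failed episode earns 0, and a
    successful one earns a base reward plus a bonus proportional to the
    fraction of the turn budget left unused.
    """
    if traj.max_turns <= 0:
        raise ValueError("max_turns must be positive")
    if not traj.solved:
        return 0.0
    unused_fraction = max(traj.max_turns - traj.turns, 0) / traj.max_turns
    return (1.0 - efficiency_weight) + efficiency_weight * unused_fraction


if __name__ == "__main__":
    quick = Trajectory(solved=True, turns=2, max_turns=10)
    slow = Trajectory(solved=True, turns=9, max_turns=10)
    failed = Trajectory(solved=False, turns=3, max_turns=10)
    for name, t in [("quick", quick), ("slow", slow), ("failed", failed)]:
        print(name, round(trajectory_reward(t), 3))
```

Under these assumptions, two successful trajectories are ranked by how few turns they used, while any failed trajectory scores below every successful one, so efficiency never outweighs completion.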
Problem

Research questions and friction points this paper is trying to address.

tool-augmented agents
reinforcement learning
supervised fine-tuning
long-horizon learning
multi-turn decision making
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented agents
automated trajectory synthesis
verifiable reinforcement learning
tool-call graph topology
unified SFT and RL training
Authors

Xiaoyu Tian
Chinese University of Hong Kong
Haotian Wang
Beike Language and Intelligence
Shuaiting Chen
Beike Language and Intelligence
Hao Zhou
Beike Language and Intelligence
Kaichi Yu
Beike Language and Intelligence
Yudian Zhang
Beike Language and Intelligence
Jade Ouyang
Beike Language and Intelligence
Junxi Yin
Beike Language and Intelligence
Jiong Chen
Beike Language and Intelligence
Baoyan Guo
Beike Language and Intelligence
Lei Zhang
Beike Language and Intelligence
Junjie Tao
Beike Language and Intelligence
Yuansheng Song
Beike Language and Intelligence
Ming Cui
Beike Language and Intelligence
Chengwei Liu
Research Assistant Professor, Nanyang Technological University
Open Source Security, Software Supply Chain Security, Program Analysis, Software Maintenance