🤖 AI Summary
Large language model (LLM) agents struggle with multi-step tool-use tasks due to the scarcity of high-quality expert demonstrations, overfitting in supervised fine-tuning (SFT), and cold-start and training-instability issues in reinforcement learning (RL).
Method: We propose *Environment Tuning*, a novel paradigm in which agents autonomously acquire complex behaviors directly from raw problem instances, without expert trajectories, via structured curricula, actionable environment augmentation, fine-grained progress rewards, and synthetic environment interaction.
Contribution/Results: This approach circumvents fundamental limitations of SFT and RL, effectively mitigating cold-start and overfitting challenges while substantially improving out-of-distribution generalization. Using only 400 BFCL task instances, our method matches or exceeds state-of-the-art baselines and demonstrates robust performance on unseen tasks.
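
To make one of these components concrete, here is a minimal sketch of what a fine-grained progress reward could look like for a multi-turn tool-use episode: partial credit is granted as subgoals are completed, rather than a single sparse end-of-episode reward. The `ToolUseTask` class, its subgoal predicates, and the equal-weight credit scheme are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the paper's code): a fine-grained progress
# reward that grants partial credit as the agent completes subgoals of a
# multi-turn tool-use task, instead of one sparse end-of-episode reward.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolUseTask:
    # Hypothetical subgoal predicates over the environment state, e.g.
    # "correct tool selected", "arguments valid", "final answer returned".
    subgoals: list[Callable[[Any], bool]]
    completed: set[int] = field(default_factory=set)

    def progress_reward(self, state: Any) -> float:
        """Return dense reward for subgoals newly satisfied this turn."""
        reward = 0.0
        for i, check in enumerate(self.subgoals):
            if i not in self.completed and check(state):
                self.completed.add(i)
                reward += 1.0 / len(self.subgoals)  # equal-weight credit
        return reward


# Toy usage: two subgoals, one satisfied -> reward of 0.5 this turn.
task = ToolUseTask(subgoals=[lambda s: "tool_called" in s,
                             lambda s: "answer" in s])
print(task.progress_reward({"tool_called": True}))  # 0.5
```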
📝 Abstract
Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce **Environment Tuning**, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. **Environment Tuning** orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
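
To ground the idea of actionable environment augmentation, the sketch below shows one way an environment could turn opaque tool-call failures into corrective feedback the agent can act on in its next turn. The wrapper function, the error cases it handles, and the hint wording are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (assumed): an environment wrapper that replaces raw
# tool-call failures with actionable, corrective feedback messages.
import json
from typing import Callable


def call_tool_with_feedback(tool: Callable, name: str, args_json: str) -> str:
    """Run `tool`; on failure, return a hint the agent can act on next turn."""
    try:
        args = json.loads(args_json)
    except json.JSONDecodeError as e:
        return (f"Error: arguments for '{name}' are not valid JSON ({e}). "
                "Re-emit the call as a well-formed JSON object.")
    try:
        result = tool(**args)
    except TypeError as e:
        return (f"Error: '{name}' received wrong or missing parameters ({e}). "
                "Check the tool's schema and retry with the required arguments.")
    return json.dumps(result, default=str)
```

The design intuition is that a bare stack trace gives a policy little to learn from during cold-start exploration, whereas feedback phrased as a next action keeps trajectories recoverable and the reward signal informative.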