Language-based Trial and Error Falls Behind in the Era of Experience

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in unseen non-linguistic environments—such as symbolic or spatial reasoning tasks—where high exploration costs hinder performance. The authors propose SCOUT, a novel framework that decouples exploration from exploitation: a lightweight MLP “scout” efficiently gathers environmental trajectories, which are then used to guide the LLM via supervised fine-tuning and iterative reinforcement learning, thereby activating its latent world knowledge. This approach substantially reduces computational overhead while achieving superior performance; on benchmark tasks, the Qwen2.5-3B-Instruct model attains an average score of 0.86, outperforming Gemini-2.5-Pro (0.60) and reducing GPU training time by approximately 60%.

📝 Abstract
While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) that activates its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.
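The exploration phase described in the abstract — a tiny MLP policy cheaply probing an environment, with successful trajectories kept as demonstration data for SFT — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the 1D environment, the `ScoutMLP` class, and the success filter are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

class ScoutMLP:
    """A tiny two-layer MLP policy (the 'scout'); cheap enough to run
    many rollouts, unlike a parameter-heavy LLM."""
    def __init__(self, state_dim=1, hidden=8, n_actions=2):
        self.w1 = rng.normal(0.0, 1.0, (state_dim, hidden))
        self.w2 = rng.normal(0.0, 1.0, (hidden, n_actions))

    def act(self, state):
        h = np.tanh(np.asarray(state, dtype=float) @ self.w1)
        logits = h @ self.w2
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

def rollout(policy, target=3, max_steps=10):
    """One episode in a toy 1D environment: step +1/-1 from 0 to `target`.
    Returns the (state, action) trajectory and a success flag."""
    pos, traj = 0, []
    for _ in range(max_steps):
        action = policy.act([pos / target])
        traj.append((pos, action))
        pos += 1 if action == 1 else -1
        if pos == target:
            return traj, True
    return traj, False

# Exploration phase: the scout probes the environment at scale;
# only successful trajectories are kept as SFT demonstrations.
scout = ScoutMLP()
sft_data = [traj for traj, ok in (rollout(scout) for _ in range(500)) if ok]
```

In the paper's setting, `sft_data` would be serialized into (state, action) demonstrations to fine-tune the LLM, with multi-turn RL applied afterward; here it is just a filtered list of trajectories.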
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
trial-and-error
nonlinguistic environments
exploration cost
symbolic tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

SCOUT
exploration-exploitation decoupling
lightweight scouts
supervised fine-tuning
reinforcement learning